Some time ago, I had the requirement to sample data from a table to do validation work. Basically, we wanted to validate syncronized data with a base table, but could not afford to compare every row. We had to come up with a way to ‘sample’ the data. Each table we had to sample had a different number of rows and in addition to that were growing. The mechanizm had to be simple, yet take into account table size. I did not want to do a custom query for every table. I wanted something somewhat generic – something that I could say check ‘x’ rows from this table, not check every ‘nth’ row which would mean that the result would grow over time.
The obvious first consideration was to number the rows and do it from that, but considering we were doing this in a program, I needed something a little more elegant. Here is the sql I came up with.
SELECT E.* FROM
(SELECT ROWNUMBER() OVER (ORDER BY EMPNO) AS ROWNBR, EMPNO FROM EMPLOYEE ) TEMP,
(SELECT INTEGER(ROUND((COUNT(*)/23.0),0)) AS HOWMANY FROM EMPLOYEE) TEMP2,
EMPLOYEE E
WHERE MOD(ROWNBR,HOWMANY)=0 AND E.EMPNO=TEMP.EMPNO
So here is what we do,
TEMP is an ordered list of the table key – we number each row and order by the empno (or whatever key)
TEMP2 is a cartesian join of the count of the table and how many rows you want. In this case it is 23 rows
E is the table we want and join to the above
The where clause is the key – we only take rows that have a modulus of 0 when you devide the rownumber by how many rows you want
Notes:
- This was written for the sample database employee table, however I did have to modify it to have more than 42 rows
- Small tables will have inaccurate results just because of the math so this works best on larger tables
- The number of rows returned is not exactly what was requested. Once again because of the math, but it is usally close. I found the larger the table, the more accurate the result. You can add a ‘FETCH FIRST n ROWS ONLY’ (where ‘n’ is your value) so you don’t get more than you want.
- How often do you find a legitimate use for a cartesian join?
- Don’t ask for more rows than are in the table – this can cause errors with the mod function
Leave a Reply