Friday, October 8, 2010

Top vs Sample

There was a question raised in my team's discussion forum as to which one of these is better to use to get a sample of the data? There were different kinds of suggestions from various online forums, but none had the reason behind why one is better than the other.
I read through TD documentation and this is what I put forth to my group:

1. The system evaluates the top n operator after all other clauses in the SELECT request have been evaluated. (So, I think it doesn’t help in data processing time. Must only affect the I/O time)
2. It gives the ranked result only when you specify order by clause. So, If you don’t specify order by, results can be any n rows.

1. SAMPLE clause comes with a keyword ‘RANDOMIZED ALLOCATION’ that specifies whether retrieved rows are sampled randomly across the amps or proportionate to the number of qualified rows per AMP. Proportional allocation is default.
2. SAMPLE also operates on the evaluated output. Hence, it does not improve query processing time.

My inference:
Comparing TOP and SAMPLE, as both execute the query completely, it is better to use TOP(without order by) because we will not provide the system an overhead of getting proportional output from each amp.

Just couple more interesting finds about TOP vs RANK:
1. The QUALIFY clause with the RANK or ROW_NUMBER ordered analytical functions returns the same results as the TOP n operator.
2. For best performance, use the TOP option instead of the QUALIFY clause with RANK or ROW_NUMBER. In best-case scenarios, the TOP option provides better performance; in worst-case scenarios, the TOP option provides equivalent performance

Hope this helps!