Tuesday, December 20, 2011

Simple things in Pig

A trivial task in Pig that comes up a lot when sanity-checking MapReduce program output is counting the number of rows in the outputs and inputs. For those who are used to SQL's COUNT, the way it is done in Pig is not intuitive. The reason is that COUNT in Pig does not count the rows of a relation directly; it counts the tuples in a bag. So an extra grouping step is needed first: GROUP ... ALL collects every row into a single bag, and COUNT is then applied to that bag.

DATASET = LOAD .....
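-- GROUP ... ALL puts every row of DATASET into one bag in a single record
G = GROUP DATASET ALL;
-- COUNT then counts the tuples in that bag
C = FOREACH G GENERATE COUNT(DATASET);
DUMP C;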
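
One caveat for a sanity check like this: Pig's COUNT skips tuples whose first field is null, so it can report fewer rows than SQL's COUNT(*) would. The built-in COUNT_STAR counts every tuple. A minimal sketch of the difference, assuming a hypothetical tab-separated file 'data.tsv' with a made-up two-column schema (neither is from the snippet above):

```pig
-- 'data.tsv' and the (id, value) schema are hypothetical placeholders
DATASET = LOAD 'data.tsv' AS (id:chararray, value:int);
G = GROUP DATASET ALL;
-- COUNT ignores tuples whose first field is null; COUNT_STAR counts them all
C = FOREACH G GENERATE COUNT(DATASET) AS non_null_rows, COUNT_STAR(DATASET) AS all_rows;
DUMP C;
```

If the two numbers differ, some rows have a null first field, which is worth knowing before comparing input and output counts.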