In the command: twt_samp = SAMPLE twitter 0.1;
What records will be sampled?
Records from the whole dataset, each picked at random with a 10% chance
The top 10% of records
Records picked at random until exactly 10% of the total records have been selected
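SAMPLE is probabilistic: each record is kept independently with the given probability, so the result holds only approximately 10% of the records. A minimal sketch, where the path and schema of the twitter relation are assumed for illustration:

```pig
-- Load the (assumed) twitter dataset; path and schema are illustrative.
twitter = LOAD 'twitter_data' AS (user:chararray, created_date:chararray, text:chararray);

-- SAMPLE keeps each record independently with probability 0.1,
-- so the output contains roughly (not exactly) 10% of the records.
twt_samp = SAMPLE twitter 0.1;
```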
When you type this command in grunt:
grunt> twt_d1 = FILTER twt_samp BY created_date MATCHES 'Fri Oct 05 2012';
What will Pig do?
Pig will read the data and produce a new dataset before the next command
Pig will check that it can process the request, then read the data later, when it is forced to execute a map/reduce step by either a STORE or a DUMP
Pig will build an index on the created_date field and use that to select records
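Pig's lazy evaluation can be seen directly in the grunt shell: a FILTER statement only extends the logical plan, and nothing runs until a DUMP or STORE forces execution. A sketch, assuming the twt_samp relation defined earlier:

```pig
-- No data is read yet: Pig only validates the statement
-- and adds it to the logical plan.
twt_d1 = FILTER twt_samp BY created_date MATCHES 'Fri Oct 05 2012';

-- Only now does Pig compile the plan and launch the map/reduce job(s).
DUMP twt_d1;
-- STORE twt_d1 INTO 'friday_tweets';  -- would also trigger execution
```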
When joining datasets, you have the option to tell Pig that the keys might be skewed:
… JOIN data1 BY my_join_key USING 'skewed' …
Pig will sample the my_join_key values to estimate whether some values occur with much higher frequency than others. There is some overhead for doing this (10% or so, though it depends on many factors).
How is this information used in map/reduce jobs?
If there is skew, Pig will try to partition the keys so they are balanced more evenly across reducers
Pig will replicate the smaller dataset across the mapper tasks
Pig can use more reducers
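A skewed join can be sketched as follows; the relations, paths, and the my_join_key field are illustrative, and note that the second relation's key must also be named in the JOIN:

```pig
-- Two (assumed) relations sharing a join key.
data1 = LOAD 'data1' AS (my_join_key:chararray, val1:int);
data2 = LOAD 'data2' AS (my_join_key:chararray, val2:int);

-- Pig first samples the keys of data1; keys with very high frequency
-- are then split across multiple reducers instead of all landing on one.
joined = JOIN data1 BY my_join_key, data2 BY my_join_key USING 'skewed';
```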