Analyzing Yelp data with Pig

Analyzing Yelp data with Pig

5 questions


The reading for this module includes a Pig script for yelp data, which has the commands and example from the video ‘Other Pig Commands’.

Run the script for yelp data in your environment.

See the comments at the beginning to remind you how to run it. I recommend running it interactively through grunt so that you can answer these questions below one by one.

Use the relation with final ranking, and the DUMP command, to find the top ranking avg_wtd_star (i.e. average of weighted stars) rating for a business

This is in the result from the DUMP command. You should use the ‘DESC’ (descending) option in the RANK command so that the top rank appears at the beginning






The overall task in this question is the following:

For businesses that have more than 1 rating,find the average of weighted stars across businesses.

Here is one way to solve it that gets the average of the average:

A. start with Y_m from the Yelp script in the readings.

B. try >DESCRIBE Y_m to see whats in there

C. Enter commands, here is some pseudo code:

1. Filter Y_m to choose those business with num_ratings > 1, call the relation Y_m2

2. Use a GROUP ALL to create a bag of avg_wtdstars

3. Use the AVG function in a FOREACH statment to find AVG(Y_m2.avg_wtdstars),

4. DUMP the relation and answer the question

NOTE, the answers below are only showing the first 4 significant digits, pick the closest one to your answer.






Here is another way to get the average of weighted stars across businesses.

Get the average of all wtd_stars from all businsss with number of ratings > 1, grouped together.

Strategy: First, we have to group the businesses to get a count of the number of ratings. Then we filter that set – we actually already have that in Y_m2 from Question 2 above.

Then we have to use that result to select only businesses that are in that set.

Here is Pseudo-code for one way to solve it:

Start with Y_rate2, which has wtd_stars, and Y_m2 which has the businesses with number of ratings > 1

1. Join Y_rate2 with Y_m2 using business id as the key. Do you want an inner or outer join?

2. Make a GROUP ALL so the wtd_stars is in a bag

3. Use FOREACH to generate the average wtd_stars,

DUMP the relation and answer the question






Notice that the average of the average-weighted-stars is not the same as the average of all weighted-stars. This is sometimes called simpson’s paradox, or the pooling problem, because the averages can be misleading, in some cases.

This question here is more of a commentary about analysis.

Under what condition would you prefer to report the average of the average, or the average of all wtd_stars grouped together?

If you know that the businesses are similar, with similar customers and in similar neighborhoods, then grouping them all together will give you a better estimate of the overall average.

If you know that businesses are not the same, perhaps because of different demographics of their location or customers, then you should not group it all together and take averages of averages.

The other possibility is to report both, and depending on your context, you might need to report number of ratings as well, because that could be skewing a grouped average.


The above question 3 suggested that you could use a JOIN between the businesses with number of ratings > 1 and all businesses, to derives a relation that has businesses-with-number-of-ratings>1 , and wtd_stars. (Join Y_rate2 with Y_m2). In other words, the strategy in Question 3 was to get a grouping so we could get the num_rating and use the JOIN to only select the desired businesses.

However, JOINs might be expensive and there is another way to get the result in Q3 without doing a JOIN. In fact, we might think of this as a PIG-like thing to do.

Read the statements below and select the answer that fills in the blank in Step2 that helps avoid the JOIN.

First, recall the following description for Y_g relation, notice it has the wtd_stars values and it’s grouped by business_id.

Y_g: {group: chararray,Y_rate2: {(business_id: chararray,stars: int,useful_clipped: int,wtd_stars: double)}


1. get the number rated and filter by that number

Ygf = FOREACH Y_g GENERATE COUNT(Y_rate2.stars) as num_rated, Y_rate2.wtd_stars as wtd_stars2use;

Ygf_gt1 = FILTER Ygf BY num_rated>1;

2. What is the next step going to be, select the answer below that fills in the blank here:

Y_next = FOREACH Ygf_gt1 GENERATE __________ AS your_filled_in_field;

3. Group all records together

Yg2 = GROUP Y_next ALL;

A2 = FOREACH Yg2 GENERATE AVG(your_filled_in_field);





4 questions unanswered