Yelp Dataset Challenge

YELP DATASET CHALLENGE
CAMPUS ARC II, 16 APRIL 2015
Mehdy Davary, Computer science department (IIUN)
ABOUT THE CHALLENGE DATASET
• 1.2M reviews
• 400K tips by 250K users for 42K businesses
• 400K business attributes, e.g., hours, parking availability, ambience
• Social network of 250K users for a total of 1.9M social edges.
• Aggregated check-ins over time for each of the 42K businesses
CITIES
• U.K.: Edinburgh
• U.S.: Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, Madison
PLATFORM
• The Hortonworks Sandbox is a single-node implementation of the Hortonworks Data Platform (HDP). It is a personal, portable Hadoop environment.
• H2O on Hortonworks Data Platform is a fully open-source predictive analytics platform.
• Neo4j is a graph database that stores data as a graph of nodes. Neo4j uses Cypher queries to work with graph data.
[Diagram labels: The Hortonworks Sandbox, Sentiment Analysis, H2O, Neo4j]
THE HORTONWORKS SANDBOX
So far we have loaded all five Yelp JSON data files into Hadoop as tables that are sortable and searchable. We mainly use HCatalog, Pig, Python and Hive to load and process the data.
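As a rough illustration of a Python pre-processing pass, the sketch below reads records from a stream, assuming each Yelp file holds one JSON object per line. The function name `read_records` and the two sample records are illustrative assumptions; this is not the actual Pig/Hive loading script.

```python
import io
import json


def read_records(lines):
    """Yield one dict per non-empty line of a JSON-lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Two made-up records standing in for yelp_academic_dataset_review.json.
sample = io.StringIO(
    '{"review_id": "r1", "business_id": "b1", "stars": 4}\n'
    '\n'
    '{"review_id": "r2", "business_id": "b2", "stars": 2}\n'
)
records = list(read_records(sample))
```

In practice the same generator could be fed an open file handle instead of the in-memory sample.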
H2O
H2O is a statistical analysis engine that uses the Hadoop Distributed File System (HDFS) as its storage platform and provides a user-friendly interface for easy querying.
NEO4J
The real power of Neo4j is in connected data. To associate any two nodes, we add a
Relationship which describes how the records are related.
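A minimal sketch of how the user social edges might be turned into relationships. The `User` label, the `FRIEND` relationship type and the helper `friend_edges` are hypothetical names chosen for illustration, the `friends` field is assumed to be a list of user ids, and the `$a`/`$b` parameter syntax is the one used by current Neo4j releases.

```python
# Hypothetical Cypher statement: User nodes keyed by user_id, joined by FRIEND.
FRIEND_QUERY = (
    "MERGE (a:User {user_id: $a}) "
    "MERGE (b:User {user_id: $b}) "
    "MERGE (a)-[:FRIEND]->(b)"
)


def friend_edges(users):
    """Turn user records into sorted, de-duplicated (a, b) friend pairs."""
    edges = set()
    for user in users:
        for friend in user["friends"]:
            # Sort each pair so that u1-u2 and u2-u1 count as one edge.
            edges.add(tuple(sorted((user["user_id"], friend))))
    return sorted(edges)


# Made-up sample users.
users = [
    {"user_id": "u1", "friends": ["u2", "u3"]},
    {"user_id": "u2", "friends": ["u1"]},
]
edges = friend_edges(users)
```

Each pair in `edges` would then be passed as the parameters `a` and `b` of `FRIEND_QUERY` through whichever Neo4j client is in use.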
TO ANALYZE HORTONWORKS SANDBOX DATA WITH EXCEL 2013
• Hortonworks ODBC driver (64-bit) installed and configured.
• Microsoft Excel 2013 Professional Plus 64-bit.
• Use the Microsoft Query feature to access Hortonworks Sandbox data.
• Use the Excel Power View feature to analyze the data.
ABOUT REVIEWS ON “RESTAURANTS”
5 IMPORTANT DIMENSIONS
• Food
• Service
• Ambience
• Deals/Discounts
• Quality-Price Ratio
RAW DATA
• yelp_academic_dataset_review.json
• yelp_academic_dataset_business.json
A review can be associated with multiple dimensions (categories) at the same time.
DATA PREPARATION FOR DATA MINING
• All reviews (all businesses): 1’127’525
• Total reviews on “Restaurants”: 706’290
• Reduced number of reviews on “Restaurants”: 22’584, using (review.useful > 3 AND review.cool > 2 AND review.stars > 3 AND business.review_count > 5) as filtering factors
Records per data file:
• Review: 1’127’525
• Business: 42’153
• User: 252’898
• Tip: 403’210
• Checkin: 31’617
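The filtering factors above can be sketched in Python as follows. This is an illustrative sketch, not the Hive/Pig query actually used: the function names are hypothetical, `categories` is assumed to be a list of strings, and the sample records are made up.

```python
def keep_review(review, business):
    """Apply the slide's filtering factors to one review and its business."""
    return (
        "Restaurants" in business["categories"]
        and review["useful"] > 3
        and review["cool"] > 2
        and review["stars"] > 3
        and business["review_count"] > 5
    )


def filter_reviews(reviews, businesses):
    """Join reviews to businesses on business_id and keep the matching ones."""
    by_id = {b["business_id"]: b for b in businesses}
    return [
        r for r in reviews
        if r["business_id"] in by_id and keep_review(r, by_id[r["business_id"]])
    ]


# Made-up sample data.
businesses = [
    {"business_id": "b1", "categories": ["Restaurants", "Pizza"], "review_count": 40},
    {"business_id": "b2", "categories": ["Doctors"], "review_count": 12},
]
reviews = [
    {"review_id": "r1", "business_id": "b1", "useful": 5, "cool": 3, "stars": 4},
    {"review_id": "r2", "business_id": "b1", "useful": 1, "cool": 0, "stars": 5},
    {"review_id": "r3", "business_id": "b2", "useful": 9, "cool": 9, "stars": 5},
]
selected = filter_reviews(reviews, businesses)
```

Only `r1` survives: `r2` has too few votes and `r3` belongs to a business that is not a restaurant.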
review
--------------------------------
funny: int
useful: int
cool: int
user_id: string
review_id: string
stars: int
text: string
date: string
type: string
business_id: string
business
--------------------------------
attributes: string
business_id: string
full_address: string
open: boolean
hours: string
categories: string
city: string
review_count: int
name: string
neighborhoods: string
longitude: float
state: string
stars: float
latitude: float
type: string
user
--------------------------------
yelping_since: string
votes: string, e.g. {funny: 1, useful: 5, cool: 0}
name: string
review_count: int
user_id: string
friends: string
fans: int
average_stars: float
type: string
compliments: string
elite: string
Retrieving the parts of speech (verbs, nouns, adjectives, etc.) from the sentence using the Stanford NLP parser.
JAVA IMPLEMENTATION OF THE NLTK IN HADOOP
THE STANFORD NLP GROUP
Original review:
Unfortunately, the frustration of being Dr. Goldberg's patient is a repeat of the experience I've had with so many other doctors in NYC -- good doctor, terrible staff. It seems that his staff simply never answers the phone. It usually takes 2 hours of repeated calling to get an answer. Who has time for that or wants to deal with it? I have run into this problem with many other doctors and I just don't get it. You have office workers, you have patients with medical needs, why isn't anyone answering the phone? It's incomprehensible and not work the aggravation. It's with regret that I feel that I have to give Dr. Goldberg 2 stars.
After stop-word removal:
Unfortunately, frustration Dr. Goldberg's patient repeat experience I've doctors NYC -- good doctor, terrible staff. It staff simply answers phone. It takes 2 hours repeated calling answer. Who time deal it? run problem doctors it. You office workers, patients medical needs, answering phone? It's incomprehensible work aggravation. It's regret feel give Dr. Goldberg 2 stars.
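The shortened version of the review, with common function words dropped, can be approximated by a simple stop-word filter. The sketch below is an assumption about how such a step might look: the stop-word list is a tiny illustrative one, and matching is case-sensitive on whitespace-separated tokens with surrounding punctuation ignored.

```python
import string

# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"seems", "that", "his", "never", "the"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop tokens whose punctuation-stripped form is a stop word."""
    kept = []
    for token in text.split():
        core = token.strip(string.punctuation)
        if core not in stop_words:
            kept.append(token)  # keep the original token, punctuation included
    return " ".join(kept)


result = remove_stop_words("It seems that his staff simply never answers the phone.")
print(result)  # It staff simply answers phone.
```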
Part-of-speech tags:
((Unfortunately,RB),(frustration,NN),(being,VB),(Goldberg,NNP),(patient,NN),(repeat,NN),(experience,NN),('ve,VB),(had,VB),(so,RB),(many,JJ),(other,JJ),(doctors,NN),(NYC,NNP),(good,JJ),(doctor,NN),(terrible,JJ),(staff,NN),(seems,VB),(staff,NN),(simply,RB),(never,RB),(answers,VB),(phone,NN),(usually,RB),(takes,VB),(hours,NN),(repeated,VB),(calling,VB),(get,VB),(answer,NN),(time,NN),(wants,VB),(deal,VB),(have,VB),(run,VB),(problem,NN),(many,JJ),(other,JJ),(doctors,NN),(just,RB),(do,VB),(n't,RB),(get,VB),(have,VB),(office,NN),(workers,NN),(have,VB),(patients,NN),(medical,JJ),(needs,NN),(n't,RB),(anyone,NN),(answering,VB),(phone,NN),('s,VB),(incomprehensible,NN),(not,RB),(work,VB),(aggravation,NN),('s,VB),(regret,NN),(feel,VB),(have,VB),(give,VB),(Goldberg,NNP),(stars,NN))
Sentence split:
{(Unfortunately, the frustration of being Dr. Goldberg's patient is a repeat of the experience I've had with so many other doctors in NYC -- good doctor, terrible staff.),(It seems that his staff simply never answers the phone.),(It usually takes 2 hours of repeated calling to get an answer.),(Who has time for that or wants to deal with it?),(I have run into this problem with many other doctors and I just don't get it.),(You have office workers, you have patients with medical needs, why isn't anyone answering the phone?),(It's incomprehensible and not work the aggravation.),(It's with regret that I feel that I have to give Dr. Goldberg 2 stars.)}
• Retrieving the parts of speech (verbs, nouns, adjectives, etc.) from the sentence using the Stanford NLP parser.
• Using SentiWordNet to find the positive and negative values related to each part of speech.
• Summing up the positive and negative values obtained to calculate a net positive and a net negative value for the sentence.
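The summing step above can be sketched as follows. This is a minimal illustration, not the Java implementation used in the project: `TOY_LEXICON` is a made-up stand-in for SentiWordNet (its scores are invented for the example), and `sentence_score` is a hypothetical helper that takes (word, tag) pairs like those produced by the tagger.

```python
# Made-up stand-in for SentiWordNet: (word, POS tag) -> (positive, negative).
TOY_LEXICON = {
    ("good", "JJ"): (0.75, 0.0),
    ("terrible", "JJ"): (0.0, 0.625),
    ("regret", "NN"): (0.0, 0.5),
}


def sentence_score(tagged_words, lexicon=TOY_LEXICON):
    """Sum positive and negative values over the tagged words of a sentence."""
    net_pos = 0.0
    net_neg = 0.0
    for word, tag in tagged_words:
        # Words missing from the lexicon contribute nothing to either total.
        pos, neg = lexicon.get((word.lower(), tag), (0.0, 0.0))
        net_pos += pos
        net_neg += neg
    return net_pos, net_neg


tagged = [("good", "JJ"), ("doctor", "NN"), ("terrible", "JJ"), ("staff", "NN")]
print(sentence_score(tagged))  # (0.75, 0.625)
```

Keeping the two totals separate, rather than subtracting one from the other, mirrors the net positive and net negative values described in the list.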
SENTIWORDNET
A lexical resource for opinion mining