Text Mining Project:
Using Textual Content from Twitter
for Next-Place Prediction
Mingjun Wang
Apr 30th, 2015
Content
• Introduction
• Previous Work
• Methodology and Preliminary Work
– Hypothesis
– Models and Experiments
• Future Work
• Conclusion
Introduction
• Motivation
– Crimes are correlated with people’s daily
movement [13]
– People’s movements are difficult to model and
predict
• Objective
– Apply next-place prediction to model individuals’
daily movement for predicting crimes
Introduction
• In this project, we focus on using textual
content to model and predict individuals’
movement patterns
• Research Question
– Do online activities on social media correlate with
individuals’ movement patterns?
[Figure: two example topic distributions]
– Distribution 1: Topic 1 (flight, delay, …) 0.05; Topic 2 (beer, party, rib, …) 0.85; Topic 3 (church, film, …) 0.1
– Distribution 2: Topic 1 (flight, delay, …) 0.75; Topic 2 (beer, party, rib, …) 0.2; Topic 3 (church, film, …) 0.05
Example 1
• Intuitively,
– Predict the next place a user will visit from features
extracted from social media
Venue      Time     Coordinates      Tweet
College    5:20 PM  (-87.57, 42.01)  Hard to remember when to take school shuttle
Transport  5:22 PM  (-87.55, 41.95)  I was stuck in loyola on the way to buy gifts
Shop       5:26 PM  (-87.69, 41.97)  @Bmfayy I admit I am hungry after travelling
Food       5:43 PM  (-87.70, 41.76)  I always like the food here
Example 2
• Intuitively,
– Retrieve possible types of venues based on textual
content
User: @omgitskelcey
Documents: the historical content of each venue
– Doc 1: historical tweets matched with Shop 1
– Doc 2: historical tweets matched with Event 1
– Doc 3: historical tweets matched with Food 1
– Doc 4: historical tweets matched with Shop 2
– ….
Each tweet is used as a query to retrieve the document for the right place:
– Shop, 5:26 PM: @Bmfayy I admit I am hungry after travelling.
– Food, 5:43 PM: I always like the food here
Previous Work in Next Place Prediction
• Location prediction is a traditional task in mobile
computing
– Home/Work area Prediction [1–3, 10]
– Prediction of an individual’s location at any time [6, 7, 12,
18]
• There are a variety of variables used in previous works
– Trajectories of geographical coordinates
• GPS [4, 5, 12, 14]
• Wi-Fi [20]
– Types of venues
• Check-ins from Location Based Social Network (LBSN) [11, 16, 19]
Previous Work in Next Place Prediction
• Our work is different from previous studies
– Incorporate textual content in next-place
prediction
– Match geographical coordinates with type of
venues to describe the physical environment
Hypothesis
• To incorporate textual content into next-place
prediction, we propose:
– A user’s historical textual content correlates with
his/her future venue trajectory.
Data
• Twitter
– Geotagged tweets with textual content from Twitter’s
public API [15].
– Example: user "63011649"; 2014-01-05 00:25:15; "@LauraRoppo
eat clean train mean"; (-87.79786403, 41.93277408)
• Foursquare
– Provides check-ins and real-time location sharing [17].
– Users’ historical check-ins, which record venue types, show
the physical environment around them.
• There is no overt connection between venue types
and textual content.
Data Preparation
• Apply part-of-speech (POS) tagging and
remove uninformative words
• Calculate the distance between geotagged
tweets and venues
Data Preparation
• Remove uninformative words
– Use the Twitter POS model with the coarse 25-tag
tag set from TweetNLP [9].
Tweet                                           Words
Hard to remember when to take school shuttle    hard, remember, take, school, shuttle
I was stuck in loyola on the way to buy gifts   stuck, loyola, way, buy, gifts
@Bmfayy I admit I am hungry after travelling    admit, hungry, travelling
I always like the food here                     like, food, here
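The filtering step can be sketched as follows. This is a minimal sketch, assuming the tweet has already been tagged with the coarse Twitter tag set of [9]. The tags below were assigned by hand for illustration (they are not TweetNLP output), and the set of tags to keep (KEEP_TAGS) is an assumption, since the slides do not state the exact filtering rule.

```python
# Tag meanings follow the coarse Twitter POS tag set of Gimpel et al. [9]:
# N = common noun, ^ = proper noun, V = verb, A = adjective,
# P = pre/postposition (including "to"), R = adverb.
# KEEP_TAGS is an assumption for illustration, not the project's actual rule.
KEEP_TAGS = {"N", "^", "V", "A"}


def content_words(tagged_tokens, keep_tags=KEEP_TAGS):
    """Return the lowercased tokens whose POS tag is in keep_tags."""
    return [token.lower() for token, tag in tagged_tokens if tag in keep_tags]


# Hand-tagged version of the first example tweet (hypothetical tagger output).
tagged = [
    ("Hard", "A"), ("to", "P"), ("remember", "V"), ("when", "R"),
    ("to", "P"), ("take", "V"), ("school", "N"), ("shuttle", "N"),
]
words = content_words(tagged)
print(words)
```

With these assumed tags, the sketch keeps the same five words as the first row of the table above. Other rows may need a different tag set or an extra stop-word list.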
Data Preparation
• Calculate the distance between geotagged
tweets and venues
– Match each tweet with venue types that represent its
physical environment
[Figure: map of venues around the tweet "I always like the food here"; labels: Strip Club, Street, Pizza Place, Food, Medical Center, Office]
Data Preparation
• There are two ways to describe the physical
environment
– Nearest venue type
– Distance to the nearest venue of each type
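Both descriptions can be derived from one pass over the venues. The following is a minimal sketch, assuming great-circle (haversine) distance and coordinates stored as (longitude, latitude), as in the tweet example on the Data slide. The venue list is hypothetical, and the slides do not say which distance measure was actually used.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (lon, lat) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_by_type(tweet_lonlat, venues):
    """Map each venue type to the distance (km) of its nearest venue."""
    lon, lat = tweet_lonlat
    best = {}
    for venue_type, v_lon, v_lat in venues:
        d = haversine_km(lon, lat, v_lon, v_lat)
        if venue_type not in best or d < best[venue_type]:
            best[venue_type] = d
    return best


# Hypothetical venues as (type, lon, lat).
venues = [
    ("Food", -87.70, 41.76),
    ("Food", -87.60, 41.80),
    ("Shop", -87.69, 41.97),
]
tweet = (-87.70, 41.76)
distances = nearest_by_type(tweet, venues)        # distance to the nearest venue of each type
nearest_type = min(distances, key=distances.get)  # nearest venue type
print(nearest_type, distances)
```

The dictionary gives the regression targets (one distance per venue type), and its minimum gives the classification label.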
Models and Experiments
• Classification model to identify the nearest
venue type
• Regression model for the distance to the
nearest venue of each type
• Text retrieval model to identify the location
from textual content
Classification Model (General)
• First step: classify whether the individual will visit a
new place.
• Second step: for the tweets classified as moving to a new
place in the first step, classify which new place the
individual will go to.
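The two-step cascade can be sketched as below. This is only an illustration: move_clf and venue_clf are hypothetical stand-ins for whatever trained classifiers are used, and returning the current venue when the first step predicts no move is an assumption that the slides do not state.

```python
def two_stage_predict(features, current_venue, move_clf, venue_clf):
    """Step 1 decides whether the user moves; step 2 picks the new venue type."""
    if not move_clf(features):
        # Assumption: a user predicted to stay keeps the current venue type.
        return current_venue
    return venue_clf(features)


# Toy rule-based stand-ins (hypothetical; real models would be trained on data).
def toy_move_clf(words):
    return "travelling" in words


def toy_venue_clf(words):
    return "Food" if "hungry" in words else "Shop"


moving = two_stage_predict({"admit", "hungry", "travelling"}, "Shop",
                           toy_move_clf, toy_venue_clf)
staying = two_stage_predict({"like", "food", "here"}, "Food",
                            toy_move_clf, toy_venue_clf)
print(moving, staying)
```

Keeping the two classifiers as separate callables means the second one only ever sees examples that passed the first step, which mirrors the subset restriction described above.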
Text Enriched Model
• Hypothesis: Textual content in a user’s current tweet
correlates with his/her future venue trajectory.
– Assumption: term frequency-inverse document frequency
(TF-IDF) features extracted from the text can represent
the textual content of the current tweet.
Text Enriched Model
• Hypothesis: TF-IDF features from the textual
content of a user’s current tweet correlate
with his/her future venue trajectory.
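For reference, TF-IDF features can be computed as in the sketch below. It uses one common variant (term frequency normalized by tweet length, IDF as the natural log of N over document frequency). The slides do not specify the weighting scheme, so this choice is an assumption, and the fourth tweet is hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


docs = [
    ["hard", "remember", "take", "school", "shuttle"],
    ["stuck", "loyola", "way", "buy", "gifts"],
    ["like", "food", "here"],
    ["hungry", "food"],  # hypothetical extra tweet so that "food" occurs twice
]
vectors = tfidf(docs)
print(vectors[2])
```

In the third tweet, "food" gets a lower weight than "like" because it also appears in another tweet.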
Text Enriched with @-link Model
• We hypothesize that the venue type and textual content of the
tweet that most recently mentioned the current user correlate
with the user’s own venue trajectory.
Text Enriched with @-link Model
• Thus, the Text-Enriched with @-link Model is
an extension of the Text-Enriched Model
Baseline Models
• Most Frequent Check-in Model
• Order-k Markov Model [4]
• Historical Model [6]
• Classification Model with historical visiting
information
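The first two baselines can be sketched as follows. This is a generic illustration over a sequence of venue types, not necessarily the exact formulation used in [4]; the history below is hypothetical, and k is assumed to be at least 1.

```python
from collections import Counter, defaultdict


def most_frequent_venue(history):
    """Most Frequent Check-in baseline: always predict the modal venue type."""
    return Counter(history).most_common(1)[0][0]


def train_markov(history, k=1):
    """Count which venue type follows each length-k context (k >= 1)."""
    transitions = defaultdict(Counter)
    for i in range(len(history) - k):
        context = tuple(history[i:i + k])
        transitions[context][history[i + k]] += 1
    return transitions


def predict_markov(transitions, recent, k=1, fallback=None):
    """Predict the most frequent successor of the last k venue types."""
    context = tuple(recent[-k:])
    if context in transitions:
        return transitions[context].most_common(1)[0][0]
    return fallback  # unseen context


# Hypothetical venue-type history for one user.
history = ["College", "Transport", "Shop", "Food", "College", "Transport",
           "Shop", "College", "Transport", "Food", "College"]

frequent = most_frequent_venue(history)
order1 = predict_markov(train_markov(history, k=1), ["Transport"], k=1)
order2 = predict_markov(train_markov(history, k=2), ["College", "Transport"], k=2)
print(frequent, order1, order2)
```

A natural fallback for an unseen context is the most frequent venue, which ties the two baselines together.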
Results 1
Regression Model
• Regression model for the distance to the
nearest venue of each type
– Using the same features as described in the
classification model
• Baseline
– Average distance to each venue type
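The baseline and its error measure can be sketched as below, assuming the baseline predicts, for every test tweet, the mean training distance for that venue type, and that error is reported as mean squared error (MSE). All numbers here are hypothetical toy values, not the project's data.

```python
def fit_mean_baseline(train):
    """train maps venue type -> list of observed distances; returns the mean per type."""
    return {venue_type: sum(ds) / len(ds) for venue_type, ds in train.items()}


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy distances per venue type.
train = {"Food": [1.0, 2.0, 3.0], "Shop": [4.0, 6.0]}
test = {"Food": [2.0, 4.0], "Shop": [5.0]}

baseline = fit_mean_baseline(train)
baseline_mse = {
    venue_type: mse(ys, [baseline[venue_type]] * len(ys))
    for venue_type, ys in test.items()
}
print(baseline, baseline_mse)
```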
Results 2
Units: km

Venue type                Mean distance of test set  MSE (raw model)  MSE (two-stage model)
Travel&Transport          271                        0.015252829      0.018597382
Food                      125                        0.014529229      0.013495641
Residence                 301                        0.012723374      0.019364779
Outdoors&Recreation       255                        0.01434006       0.01628372
Professional&OtherPlaces  62                         0.011052592      0.009840732
Arts&Entertainment        283                        0.026257121      0.026432174
NightlifeSpot             172                        0.018325964      0.018896978
College&University        421                        0.035374125      0.060547641
Shop&Service              126                        0.013573609      0.011224759
Event                     6748                       0.309573899      0.332126214
Text Retrieval Model
• Query: a geotagged tweet
• Document: the collection of historical tweets
matched with each venue type
• Rank the documents based on the query
terms
Text Retrieval Model
• BM25
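A minimal BM25 ranking sketch is given below. It assumes the common Okapi form with k1 = 1.2 and b = 0.75 and a non-negative IDF variant. The slides do not give the parameters or the exact formula used, and the venue documents here are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the tokenized query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical venue-type documents built from historical tweets.
docs = {
    "Food": ["like", "food", "here", "hungry", "pizza", "food"],
    "Shop": ["buy", "gifts", "stuck", "way", "shop"],
    "College": ["school", "shuttle", "remember", "class"],
}
query = ["like", "food", "here"]  # the tweet, after preprocessing
scores = bm25_scores(query, list(docs.values()))
ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
print(ranked)
```

The top-ranked document gives the predicted venue type for the query tweet.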
Result 3
• In this model, we consider only the textual
relation between each tweet and the document
(the collection of historical tweets in one venue)
• Therefore, we use the textual content to
predict both the current venue and the next venue
Prediction accuracy
– Current venue: 0.181
– Next venue: 0.2016
Future Work
• Finish the Text Retrieval Model
• Improve next-place prediction by further
investigating the social relations between
users
• Apply the results of the above models to
understand individuals’ movement patterns
and to predict crime
Summary
• Incorporate textual content in next-place
prediction.
• Understand how online social relationships
correlate with individuals’ movement patterns.
References
[1] Lars Backstrom, Eric Sun, and Cameron Marlow. Find me if you can: improving
geographical prediction with social and spatial proximity. In Proceedings of the
19th international conference on World wide web, pages 61–70. ACM, 2010.
[2] Zhiyuan Cheng, James Caverlee, and Kyumin Lee. You are where you tweet: a
content-based approach to geo-locating twitter users. In Proceedings of the 19th
ACM international conference on Information and knowledge management, pages
759–768. ACM, 2010.
[3] Manoranjan Dash, Hai Long Nguyen, Cao Hong, Ghim Eng Yap, Minh Nhut
Nguyen, Xiaoli Li, Shonali Priyadarsini Krishnaswamy, James Decraene, Spiros
Antonatos, Yue Wang, et al. Home and work place prediction for urban planning
using mobile network data. In Mobile Data Management (MDM), 2014 IEEE 15th
International Conference on, volume 2, pages 37–42. IEEE, 2014.
[4] Trinh Minh Tri Do and Daniel Gatica-Perez. Contextual conditional models for
smartphone-based human mobility prediction. In Proceedings of the 2012 ACM
Conference on Ubiquitous Computing, pages 163–172. ACM, 2012.
[5] Trinh Minh Tri Do and Daniel Gatica-Perez. Where and what: Using
smartphones to predict next locations and applications in daily life. Pervasive and
Mobile Computing, 12:79–91, 2014.
[6] Huiji Gao, Jiliang Tang, and Huan Liu. Exploring social-historical ties on
location-based social networks. In ICWSM, 2012.
[7] Huiji Gao, Jiliang Tang, and Huan Liu. Mobile location prediction in spatiotemporal context. In Nokia mobile data challenge workshop. Citeseer, 2012.
[8] Matthew S Gerber. Predicting crime using twitter and kernel density estimation.
Decision Support Systems, 61:115–125, 2014.
[9] Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das, Daniel Mills,
Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A
Smith. Part-of-speech tagging for twitter: Annotation, features, and experiments.
In Proceedings of the 49th Annual Meeting of the Association for Computational
Linguistics: Human Language Technologies: short papers-Volume 2, pages 42–47.
Association for Computational Linguistics, 2011.
[10] Brent Hecht, Lichan Hong, Bongwon Suh, and Ed H Chi. Tweets from justin
bieber’s heart: the dynamics of the location field in user profiles. In Proceedings of
the SIGCHI Conference on Human Factors in Computing Systems, pages 237–246.
ACM, 2011.
[11] Defu Lian, Vincent W Zheng, and Xing Xie. Collaborative filtering
meets next check-in location prediction. In Proceedings of the 22nd
international conference on World Wide Web companion, pages 231–232.
International World Wide Web Conferences Steering Committee, 2013.
[12] Zhongqi Lu, Yin Zhu, Vincent W Zheng, and Qiang Yang. Next place
prediction by learning with multiple models.
[13] Fernando Miró. Routine activity theory. The Encyclopedia of
Theoretical Criminology, 2014.
[14] Anna Monreale, Fabio Pinelli, Roberto Trasarti, and Fosca Giannotti.
Wherenext: a location predictor on trajectory pattern mining. In
Proceedings of the 15th ACM SIGKDD international conference on
Knowledge discovery and data mining, pages 637–646. ACM, 2009.
[15] Fred Morstatter, Jürgen Pfeffer, Huan Liu, and Kathleen M Carley. Is
the sample good enough? comparing data from twitter’s streaming api
with twitter’s firehose. arXiv preprint arXiv:1306.5204, 2013.
[16] Anastasios Noulas, Salvatore Scellato, Neal Lathia, and Cecilia
Mascolo. Mining user mobility features for next place prediction in
location-based services. In ICDM, volume 12, pages 1038–1043. Citeseer,
2012.
[17] Anastasios Noulas, Salvatore Scellato, Cecilia Mascolo, and
Massimiliano Pontil. An empirical study of geographic user activity
patterns in foursquare. ICWSM, 11:70–573, 2011.
[18] Salvatore Scellato, Mirco Musolesi, Cecilia Mascolo, Vito Latora, and
Andrew T Campbell. Nextplace: a spatio-temporal prediction framework
for pervasive systems. In Pervasive Computing, pages 152–169. Springer,
2011.
[19] Takuya Shinmura, Dandan Zhu, Jun Ota, and Yusuke Fukazawa.
Destination prediction considering both tweet contents and location
transition history. In Mobile Computing and Ubiquitous Networking
(ICMU), 2014 Seventh International Conference on, pages 95–96. IEEE,
2014.
[20] Libo Song, David Kotz, Ravi Jain, and
Xiaoning He. Evaluating next-cell predictors with
extensive wi-fi mobility data. Mobile Computing,
IEEE Transactions on, 5(12):1633–1649, 2006.
[21] Xiaofeng Wang, Matthew S Gerber, and
Donald E Brown. Automatic crime prediction
using events extracted from twitter posts. In
Social Computing, Behavioral-Cultural Modeling
and Prediction, pages 231–238. Springer, 2012.