Geosocial big data analysis using Python and
FOSS4G with a case study of Korean data
Ilyoung Hong
Namseoul University
Department of GIS Engineering
Geosocial data
• Social media (Twitter, Facebook) is the killer app for
smartphones
• Smartphones with GPS generate large volumes of geotagged social
data
• Geotagged social data is called geosocial data
• Examples: GeoTweets (geotagged tweets), Foursquare (4sq) venues
Geosocial Data Research
• Fujita, Hideyuki. "Geo-tagged Twitter collection and visualization system." Cartography and
Geographic Information Science 40.3 (2013): 183-191.
=> Computational method, data collection
• Jung, Jin-Kyu. "Code clouds: Qualitative geovisualization of geotweets." The Canadian
Geographer/Le Géographe canadien 59.1 (2015): 52-68.
=> Qualitative approach with content analysis
• Li, Linna, Michael F. Goodchild, and Bo Xu. "Spatial, temporal, and socioeconomic patterns in
the use of Twitter and Flickr." Cartography and Geographic Information Science 40.2 (2013): 61-77.
=> Spatial statistical analysis with geodemographic data
• Mitchell, Lewis, et al. "The geography of happiness: Connecting twitter sentiment and
expression, demographics, and objective characteristics of place." PLoS ONE 8.5 (2013): e64417.
=> Sentiment analysis, computational linguistics approach
Multidisciplinary Aspects of Geosocial Data Analysis
[Diagram: geosocial data at the center, linking processing stages to related disciplines]
• Stages: data collection, data management, data analysis (qualitative and quantitative), data visualization
• Related fields: web programming; database management; statistics, linguistics, text mining; sociology, journalism, media; geography, cartography, GIS
Challenges of geosocial research
• Different data sources and formats
• Twitter, Foursquare, Facebook
• Different analysis environments and software
• Java, PHP, Python, C, R, ArcGIS, web programming, database programming, statistics,
geovisualization
• Different domain knowledge and multidisciplinary research methods
• Computation, geography, sociology, psychology, statistics, linguistics, media, journalism
Interdisciplinary cooperation is needed. Is there a way to integrate these
methods?
Why Python/FOSS4G for Geosocial Big Data?
• Integrated analysis environment of software and libraries
• Python is free and open.
• Object-oriented programming (OOP) in Python
• WinPython, Anaconda (SciPy, IPython), Enthought Canopy for Python 2.7
• Large number of libraries supporting different domain knowledge
• PyPI, the Python Package Index, currently 66086 packages
• Simple coding environment
• Quick to learn and to code
• Readability: the syntax of Python is readable and clear.
Research Purpose
• Introduce an integrated platform for analyzing geosocial data using Python & FOSS4G
• Data collection and management
• Data analysis with qualitative and quantitative methods
• Sentiment analysis
• Geovisualization
• Present a case study with Korean geosocial data
• GeoTweet distribution
• Spatial patterns of Foursquare venues
• Sentiment analysis of Korean GeoTweets
Architecture, at the beginning
[Diagram: social media => Twitter/Foursquare API => JSON => Excel/CSV => Shape => ArcGIS]
Data Collection
• Twitter Streaming API from Python, using tweepy
• Rate limits per user
• The Twitter API restricts data collection: method calls are limited to
350 calls per hour for one authorized developer account
• Switch to another user ID when the limit is reached
• Unnecessary data must be filtered out
• Geotweets are just 1% of all tweets
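The filtering step above can be sketched as follows. This is a minimal illustration, not the collection script used in the study: it assumes each collected tweet is stored as one JSON object per line and relies on the `coordinates` field of the Twitter API v1.1 tweet object (a GeoJSON point in longitude, latitude order). The function name `iter_geotweets` and the bounding box values are assumptions chosen for illustration.

```python
import json

# Rough bounding box around South Korea (illustrative values only).
KOREA_BBOX = (124.0, 33.0, 132.0, 39.0)  # min_lon, min_lat, max_lon, max_lat


def iter_geotweets(lines, bbox=KOREA_BBOX):
    """Yield tweets (as dicts) that carry point coordinates inside bbox."""
    min_lon, min_lat, max_lon, max_lat = bbox
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank keep-alive lines
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed records
        if not isinstance(tweet, dict):
            continue
        geo = tweet.get("coordinates")
        if not geo or geo.get("type") != "Point":
            continue  # most tweets carry no coordinates
        lon, lat = geo["coordinates"]  # GeoJSON order: longitude first
        if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat:
            yield tweet


sample = [
    '{"id": 1, "text": "no location", "coordinates": null}',
    '{"id": 2, "text": "Seoul", "coordinates": {"type": "Point", "coordinates": [126.98, 37.57]}}',
    '{"id": 3, "text": "Tokyo", "coordinates": {"type": "Point", "coordinates": [139.69, 35.69]}}',
    'not json',
]
geotweets = list(iter_geotweets(sample))
```

Filtering while reading, rather than after loading everything, keeps memory use flat when only a small share of the stream is geotagged.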
Columns from Tweet
● Tweet text => qualitative approach, text mining, keyword filter, sentiment analysis
● Tweet ID; User ID; Destination user ID (only for tweets with “@user ID”);
User profile (including location name input by user)
=> behavioral features, heavy-user features, social network
● Location coordinates (only for tweets tagged with the location coordinates)
=> geovisualization, spatial analysis using GIS
● Date and time => temporal analysis
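As a rough sketch of how these columns could be pulled out of a raw tweet, the function below flattens one tweet dict into a row. It assumes the Twitter API v1.1 field names (`text`, `id`, `user`, `in_reply_to_user_id`, `coordinates`, `created_at`); the function name and the output column names are hypothetical.

```python
from datetime import datetime

# v1.1 timestamps look like "Sat Nov 01 05:10:00 +0000 2014"
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def tweet_to_row(tweet):
    """Flatten one tweet dict into the columns used for analysis."""
    user = tweet.get("user") or {}
    geo = tweet.get("coordinates") or {}
    lon, lat = geo.get("coordinates") or (None, None)
    created = tweet.get("created_at")
    return {
        "tweet_id": tweet.get("id"),
        "text": tweet.get("text", ""),
        "user_id": user.get("id"),
        "reply_to_user_id": tweet.get("in_reply_to_user_id"),
        "user_location": user.get("location"),
        "lon": lon,
        "lat": lat,
        "created_at": (
            datetime.strptime(created, TWITTER_TIME_FORMAT) if created else None
        ),
    }


row = tweet_to_row({
    "id": 2,
    "text": "Happy Sunday :)",
    "user": {"id": 10, "location": "Seoul"},
    "in_reply_to_user_id": None,
    "coordinates": {"type": "Point", "coordinates": [126.98, 37.57]},
    "created_at": "Sat Nov 01 05:10:00 +0000 2014",
})
```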
Two studies published so far
• Spatial Analysis of Location-Based Social Networks in Seoul,
Korea. Journal of Geographic Information System, 2015, 7,
259-265
• Spatial Distribution of Korean Geotweets. Journal of the
Korean Cartographic Association, 2015, 15(2), 93-101
Spatial Analysis of Location-Based Social Networks in Seoul, Korea
• The purpose of this study is to analyze the spatial patterns of location-based social network (LBSN)
data in Seoul using the spatial analysis techniques of geographic information system (GIS). The
study explores the applications of LBSN data by analyzing the association between Seoul’s Foursquare venues
data created based on user participation and the city’s characteristics. The data regarding Foursquare venues
were compiled with a program we created based on Foursquare’s Python API. The compiled information was
converted into GIS data, which in turn was depicted as a heat map. Cluster analysis was then performed
based on hotspots and the correlation with census variables was analyzed for each administrative unit using
geographically weighted regression (GWR). Based on analytical results, we were able to identify venue clusters
around city centers, as well as differences in hotspots for various venue categories and correlations with
census variables.
• About 230,000 venue records were collected for analysis
between March 15 and 21, 2015
Spatial Distribution of Korean Geotweets
In this study, we analyzed the distribution of Korean geotweets collected in November 2014
through the Twitter Streaming API. Python programs were used to analyze the collected data
and convert it to GIS data. Twitter use is concentrated in Seoul and the metropolitan areas, and a
few heavy users created a large number of tweets. Time series analysis showed that tweet
volume peaked on weekends and, within a day, at 14:00. Text analysis also identified a high
percentage of retweets and regional differences in content.
Key Words: Twitter API, Geotweet, Spatial distribution
• Over 2 million tweets were collected in November 2014.
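The weekday and hour-of-day profile described above can be approximated with a simple count. This is a minimal sketch, assuming timestamps are timezone-aware `datetime` objects and that local time means Korea Standard Time (UTC+9); the function name is hypothetical and the sample timestamps are made up.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

KST = timezone(timedelta(hours=9))  # Korea Standard Time
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def temporal_profile(timestamps):
    """Count tweets per local weekday and per local hour of day."""
    by_weekday = Counter()
    by_hour = Counter()
    for ts in timestamps:
        local = ts.astimezone(KST)
        by_weekday[WEEKDAYS[local.weekday()]] += 1
        by_hour[local.hour] += 1
    return by_weekday, by_hour


stamps = [
    datetime(2014, 11, 1, 5, 0, tzinfo=timezone.utc),   # Sat 14:00 KST
    datetime(2014, 11, 1, 5, 30, tzinfo=timezone.utc),  # Sat 14:30 KST
    datetime(2014, 11, 3, 1, 0, tzinfo=timezone.utc),   # Mon 10:00 KST
]
by_weekday, by_hour = temporal_profile(stamps)
```

Converting to local time before counting matters here: the Twitter API reports timestamps in UTC, so a 14:00 peak in Korea would otherwise show up at 05:00.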
Distribution of geotweet, Nov 2014
Spatial Distribution of geotweet
Daily Distribution of geotweet, Nov 2014
Text analysis
• High percentage of retweets
• Some keywords represent regional features
• Tools: pytagcloud, word_cloud
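Before a word cloud is drawn, the text has to be reduced to keyword counts. The sketch below shows one way to measure the retweet share and count keywords using only the standard library. It assumes retweets follow the old "RT @user" text convention and splits on word characters only, so Korean text would need a proper morphological analyzer; the stopword list and function names are hypothetical.

```python
import re
from collections import Counter

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
STOPWORDS = frozenset({"rt", "the", "a", "in", "at"})  # illustrative only


def retweet_share(texts):
    """Fraction of tweets whose text starts with the 'RT @' convention."""
    texts = list(texts)
    if not texts:
        return 0.0
    return sum(t.startswith("RT @") for t in texts) / len(texts)


def keyword_counts(texts, stopwords=STOPWORDS):
    """Count words and hashtags after removing URLs and mentions."""
    counts = Counter()
    for text in texts:
        cleaned = MENTION.sub(" ", URL.sub(" ", text.lower()))
        for token in re.findall(r"#?\w+", cleaned):
            if token not in stopwords and len(token) > 1:
                counts[token] += 1
    return counts


tweets = [
    "RT @kim: Coffee in Hongdae",
    "Coffee at Hongdae #seoul https://t.co/x",
    "Walking in Jongro #seoul",
    "RT @lee: Jongro night",
]
share = retweet_share(tweets)
counts = keyword_counts(tweets)
```

Running `keyword_counts` separately on the tweets of each region gives the per-region frequencies that a word cloud library can then render.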
Problems
• Exploratory statistical analysis involves repeated work, but the
process is not automated
• It takes time and introduces data errors
• As time goes by, the data becomes too big to handle
• Data needs to be managed in a database, not as text files
• Data and software should be compatible within the same environment
for automated analysis
Python & FOSS4G
• Integrated analysis environment
• Large number of libraries supporting different domain
knowledge
• Automated scripts can be created for analysis
[Architecture diagram]
• Social media server: Twitter API via Tweepy
• Data collection and data parsing: pyspatialite
• GIS data server: data conversion, Spatialite
• Visualization client: geovisualization (Quantum GIS), word cloud (pytagcloud), pyspatialite
• Analysis client: Shape/text data, sentiment analysis (Python NLTK), statistical analysis (PySAL), pandas for data analysis
Analysis Process
[Flowchart: social media data => geotagged? => GIS database => data type? => analysis method? => visualizing method?]
• Data type branches: text mining, quantitative
• Analysis methods: sentiment analysis, spatial analysis (hotspot, GWR), statistical analysis
• Visualization methods: word clouds, heat map, thematic mapping
Why a Spatialite Database?
- Standalone, file-based database: easy to handle
- Compatibility and interoperability:
Python, QGIS, ArcGIS, export/import to any format
- Easy to use, with a GUI
- Python access through pyspatialite
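The file-based storage idea can be illustrated with Python's built-in `sqlite3` module, since SpatiaLite is an extension of SQLite. This is a simplified stand-in, not the pyspatialite setup from the slides: it stores plain longitude and latitude columns instead of a SpatiaLite geometry column, and the table name, column names, and helper functions are hypothetical.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a single-file tweet database."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS geotweet (
               tweet_id   INTEGER PRIMARY KEY,
               user_id    INTEGER,
               created_at TEXT,
               text       TEXT,
               lon        REAL,
               lat        REAL)"""
    )
    return conn


def insert_rows(conn, rows):
    """Insert rows in one transaction; repeated tweet IDs are ignored."""
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO geotweet VALUES (?, ?, ?, ?, ?, ?)", rows
        )


def ids_in_bbox(conn, min_lon, min_lat, max_lon, max_lat):
    """Return tweet IDs inside a lon/lat bounding box."""
    cur = conn.execute(
        "SELECT tweet_id FROM geotweet "
        "WHERE lon BETWEEN ? AND ? AND lat BETWEEN ? AND ? "
        "ORDER BY tweet_id",
        (min_lon, max_lon, min_lat, max_lat),
    )
    return [r[0] for r in cur]


conn = open_store()
insert_rows(conn, [
    (1, 10, "2015-07-04T14:00:00+09:00", "Hongdae", 126.92, 37.55),
    (2, 11, "2015-07-04T15:00:00+09:00", "Busan", 129.07, 35.18),
    (1, 10, "2015-07-04T14:00:00+09:00", "Hongdae", 126.92, 37.55),  # duplicate
])
seoul_ids = ids_in_bbox(conn, 126.7, 37.4, 127.2, 37.7)
total = conn.execute("SELECT COUNT(*) FROM geotweet").fetchone()[0]
```

With the SpatiaLite extension loaded, the two coordinate columns would typically be replaced by a geometry column so that QGIS can read the table directly as a layer.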
Sentiment Analysis with Python NLTK
Text Classification
• Sentiment analysis using NLTK
• Tweet text => POS, NEU, NEG values
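To make the POS/NEU/NEG labeling concrete, here is a deliberately simple lexicon-based classifier in plain Python. It is a stand-in for illustration only: the slides use NLTK but do not say which classifier or training data, and the tiny word lists below are assumptions, not a real sentiment lexicon.

```python
import re

# Toy word lists (hypothetical); a real lexicon would be far larger.
POSITIVE = {"happy", "good", "best", "beautiful", "love"}
NEGATIVE = {"sad", "bad", "worst", "hate", "angry"}


def classify(text):
    """Label a tweet POS, NEG, or NEU by counting lexicon hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "POS"
    if score < 0:
        return "NEG"
    return "NEU"


labels = [
    classify("Happy Sunday :)"),
    classify("Worst traffic, I hate Mondays"),
    classify("Quick tour of a Korean apartment"),
]
```

A trained classifier would replace the word lists, but the output contract stays the same: one of three labels per tweet, which can then be joined to the tweet's coordinates for mapping.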
Heatmap using Quantum GIS
Geotweets, July 2015
Hotspots with the most positive tweets
Jongro
HongDae
Youngsan
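The idea behind a hotspot map can be sketched by counting points per grid cell. This is a simpler technique than the kernel density surface a GIS heatmap tool produces, and it is shown only to illustrate the principle; the cell size, function name, and sample points are assumptions.

```python
import math
from collections import Counter


def grid_counts(points, cell_deg=0.01):
    """Count (lon, lat) points per square cell of cell_deg degrees."""
    counts = Counter()
    for lon, lat in points:
        cell = (math.floor(lon / cell_deg), math.floor(lat / cell_deg))
        counts[cell] += 1
    return counts


points = [
    (126.923, 37.556),  # two points falling in the same cell
    (126.925, 37.558),
    (126.983, 37.571),  # one point in a different cell
]
counts = grid_counts(points)
hottest_cell, hottest_count = counts.most_common(1)[0]
```

Restricting `points` to tweets labeled positive would give the cells where positive tweets concentrate.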
Word Cloud
HongDae
Jongro
Youngsan
Most Positive Tweets
Happy Pride from Kat! #seoul #gaypride #kqcf2015 #korea #hugagaytoday @ Seoul City Hall Korea
https://t.co/81TiNdqCMH
#seoulgayprideparade HAPPY PRIDE DAY KOREA!!!! #rainbow #lgbt #love #happy #seoul #korea @ Seoul
Plaza https://t.co/FUCkHxmIsc
Good times and more Korean BBQ with the Samsung team #MobLabs #GangnamStyle @ Gangnam, Seoul,
Korea https://t.co/NyIa440NZ3
Happy Sunday :) @ Myeongdong Cathedral https://t.co/TezVZTVtDH
We go by the zoo via the "Elephant Train" to the museum @ Seoul Grand Park Zoo
https://t.co/imXCgPrcBG
Korean food is the best food #korea #food #nofiilter @ Seoul ,Korea https://t.co/MqVDHqqoEy
Have a beautiful and fruitful week IG fam! #MondayLook #mamichoux @ Hongdae Seoul
https://t.co/lVM5NdLJyp
Happy the 4th of July to all my American friends! (@ Thursday Party in Seoul) https://t.co/CG27beaCQl
And with Elizaveta from Russia :) @ Trickeye Museum https://t.co/7NCrGUYOF1
Quick tour of a Korean apartment @ Hongdae Seoul South Korea https://t.co/yTy8mAVCZk
..
Conclusion and Future Work
• Analysis of geosocial data is a complex, multidisciplinary
process
• This research presents an integrated architecture using Python
& FOSS4G
• Future work
• Automated processing with Python scripts
• More work on QGIS and PySAL is needed for more advanced analysis and
visualization