NSF CDI meeting


Mapping Ideas from Cyberspace to Realspace. Funded by the NSF Cyber-Enabled Discovery and Innovation (CDI) program. Award #1028177.
(2010-2014) http://mappingideas.sdsu.edu/
Overview of the NSF-CDI project
(Year-3) and Research Progress
• Ming-Hsiang (Ming) Tsou [email protected],
Professor (Geography), San Diego State University, PI of “Mapping Ideas” project.
Co-PIs:
Dr. Dipak K Gupta (Political Science), Dr. Jean Marc Gawron (Linguistics),
Dr. Brian Spitzberg (Communication), Dr. Li An (Geography)
Starting Date: October 1, 2010
(Four Years, $1.38M total)
Goal 1: Establish a new multidisciplinary
research framework to represent the
spatiotemporal diffusion of ideas and the semantic
web on the Internet.
Goal 2: Create effective visualization and
analysis methods for the dynamic geospatial
information landscape with three selected topics
(e.g. natural disasters, continuous threats for human
beings, and radical social movements).
Goal 3: Build domain-specific ontology, citation,
and (provocative) event knowledge bases with
thesaurus and citation networks for the three
selected topics and their Semantic Webs.
Goal 4: Develop theoretical model(s) capable of
integrating the individual (semantic usage, online
motivations) and societal (diffusion) motives and
practices associated with the spatiotemporal
diffusion of ideas.
Goal 1: Establish a new multidisciplinary research framework
Knowledge Discovery in Cyberspace (KDC)
Like the multidisciplinary research field of “knowledge
discovery in databases (KDD)” (Fayyad et al. 1996), the emerging
research field of knowledge discovery in cyberspace (KDC)
focuses on how to handle and analyze the very large volumes of information and
human messages collected from cyberspace and social media.
The purpose of KDC is to scale up our research capability of
handling millions of records and information items available in social
media (such as Twitter) or web pages (searched by Google, Yahoo, or
Bing search engines).
(Cited from: Ming-Hsiang Tsou & Michael Leitner (2013): Visualization of social media: seeing a
mirage or a message?, Cartography and Geographic Information Science, 40:2, 55-60 )
The Uniqueness of KDC
Triangular Knowledge Base (human centered). The three components are interdependent:
• Place (scale, space, context): San Diego, New York, 92119, SDSU, Bus Stops, Sea World…
• Time (dynamic): August 23, 2012 (snapshot), one week, two months, before / after, etc.
• Messages (content / function, who, what, how, media): Tweets, web pages, emails, short messages
KDC: Knowledge Discovery in Cyberspace (7 steps)
Cyberspace:
• Social Media (Twitter, Facebook, Flickr, YouTube)
• Web Pages, Weblogs, News, RSS, Emails, etc.
Selection (research focus)
Target Data:
• Tweets (keywords, regions, API types)
• Web Pages (keywords, web search engines)
Collection (tools, APIs)
Collected Data:
• SQL databases (Tweet contents)
• Excel files (Web Search Results)
Preprocessing (reduce noise, data cleaning, select regions, time scale/series ???)
Preprocessed Data:
• SQL-output-Tweets: remove errors and duplicates
• Geocoded Excel files: add lat/long
• Improving geolocation results
Transformation (mapping + graphs)
Transformed Data:
• Graphics, Bar charts, WordCloud, etc.
• Original Point Maps (each point represents one web page or one tweet).
All seven steps are systematic, algorithm-based procedures.
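As an illustration of the Preprocessing step (removing errors and duplicates from collected tweets), here is a minimal sketch. The `Tweet` record, its field names, and the rule that an empty ID or empty text counts as an "error" are assumptions made for this example; the project's actual SQL schema and cleaning rules are not given in the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tweet:
    # Hypothetical record layout; the real database schema is not shown in the slides.
    tweet_id: str
    text: str
    created_at: str  # timestamp kept as plain text


def preprocess(tweets):
    """Drop error records and duplicated tweet IDs, keeping the first occurrence."""
    seen_ids = set()
    cleaned = []
    for tweet in tweets:
        # Assumption: a record with no ID or no text is treated as an error.
        if not tweet.tweet_id or not tweet.text.strip():
            continue
        if tweet.tweet_id in seen_ids:
            continue  # duplicate of a tweet we already kept
        seen_ids.add(tweet.tweet_id)
        cleaned.append(tweet)
    return cleaned


# Invented sample data for illustration only.
raw = [
    Tweet("1", "Going to see TED tonight", "2012-06-29T18:00:00"),
    Tweet("1", "Going to see TED tonight", "2012-06-29T18:00:00"),
    Tweet("2", "   ", "2012-06-29T18:05:00"),
    Tweet("3", "Flu season is here", "2012-11-01T09:30:00"),
]
cleaned = preprocess(raw)
print([t.tweet_id for t in cleaned])
```

Keeping the first occurrence preserves the original collection order, which matters if later steps build time series from the same records.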
Explore/Compare Methods (select algorithms)
Visualized Data:
• Kernel Density maps, Differential KD maps, Point Density Maps.
• Excel files (Web Search Results)
Information Mining (analyze space-time information relationships)
Pattern Recognition:
• Decision Trees and Rules
• Nonlinear Regression and Classification Methods
• Example-based Methods (nearest-neighbor classification)
• Probabilistic Graphical Dependency Models
• Relational Learning Models
Interpretation / Evaluation
Knowledge Formalization:
• Verification (City Mayor Maps, Movie tweets)
• Discovery
• Prediction (Election)
• Description (Outbreaks, election)
Real World
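The Kernel Density and Differential KD maps listed under Visualized Data can be sketched roughly as follows. This is a simplified illustration, not the project's implementation: it assumes a Gaussian kernel, planar coordinates, and that a "differential" map is the cell-by-cell difference between two normalized density surfaces (for example, two keywords). The function names and sample points are invented.

```python
import math


def kernel_density(points, grid, bandwidth):
    """Gaussian kernel density at each grid location, normalized to sum to 1."""
    densities = []
    for gx, gy in grid:
        total = 0.0
        for px, py in points:
            squared_distance = (gx - px) ** 2 + (gy - py) ** 2
            total += math.exp(-squared_distance / (2.0 * bandwidth ** 2))
        densities.append(total)
    overall = sum(densities)
    if overall == 0:
        return densities
    return [d / overall for d in densities]


def differential_density(points_a, points_b, grid, bandwidth):
    """Assumed definition: normalized density of A minus normalized density of B."""
    density_a = kernel_density(points_a, grid, bandwidth)
    density_b = kernel_density(points_b, grid, bandwidth)
    return [a - b for a, b in zip(density_a, density_b)]


# A 3 x 3 grid of cell centers and two invented point sets.
grid = [(float(x), float(y)) for x in range(3) for y in range(3)]
keyword_a_points = [(0.0, 0.0), (0.2, 0.1)]
keyword_b_points = [(2.0, 2.0), (1.9, 2.1)]
diff = differential_density(keyword_a_points, keyword_b_points, grid, bandwidth=0.5)
print(diff[0] > 0, diff[-1] < 0)
```

Normalizing each surface before subtracting means the result shows where one topic is relatively over-represented, regardless of which topic has more points overall.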
Cyber Information Space (BIG DATA)
(Web Pages, Social Media, Weblogs, Forums, News)
Information Mining Tools
CyberDiscovery Tools
“V”isualizing “I”nformation “S”pace “I”n “O”ntological “N”etworks (VISION)
Twitter_GeoSearch_Tool
• Search API
• Streaming API
• Yahoo API
• Bing API
• Google API
VISION Ontological Analysis Platform
WHO, Where, When, What - WHY? (Networks)
Spatial Visualization Tools / Methods
Computational Linguistics Analysis Tools / Methods
Place – Time – Messages (content/functions)
New Theories (explanation), New Models (simulation), New Knowledge
Information Communication Channels in Cyberspace
• Web Pages ( Semi-Public Information Communication )
• Social Media (Twitter: Semi-Private Information Communication)
Web Pages: Use web search engines (Google, Yahoo, and Bing) to
retrieve up to 1,000 web pages per keyword. Then analyze their contents
together with their ranks and geolocations.
Social Media (Tweets): Use Twitter APIs to retrieve tweets based on
keywords or #hashtags and geolocations (self-defined home towns
or GPS locations).
Collect Web Page Contents, Ranks, and Locations:
We developed a Cyber-Discovery Search Engine
(retrieves up to 1,000 results from Yahoo or Bing)
Twitter – Spatial search API
Twitter APIs:
• REST API
• Stream API
• Search API
Center: 41.961295, -93.281859
Radius: 180 miles
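A center-and-radius spatial search like the one above can be reproduced offline on already-collected records with a great-circle distance filter. The sketch below uses the haversine formula with the center and radius from the slide; the city coordinates are approximate values added for illustration, and the function names are invented. It does not call any Twitter API.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    a = (math.sin(delta_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_radius(records, center, radius_miles):
    """Keep (name, lat, lon) records no farther than radius_miles from center."""
    center_lat, center_lon = center
    return [
        record for record in records
        if haversine_miles(center_lat, center_lon, record[1], record[2]) <= radius_miles
    ]


# Center and radius from the slide; city coordinates are approximate.
center = (41.961295, -93.281859)
records = [
    ("Des Moines", 41.5868, -93.6250),
    ("Omaha", 41.2565, -95.9345),
    ("Chicago", 41.8781, -87.6298),
    ("Minneapolis", 44.9778, -93.2650),
]
inside = within_radius(records, center, 180)
print([name for name, _, _ in inside])
```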
Search API Limitations:
1. Spatial search can only trace back up to seven days. (Regular search can trace
back up to 14 days.)
2. Each set of search results cannot exceed 1,500 tweets.
Web Pages Search Results vs. Tweets
Web page visualization maps (using Google or Yahoo search engine results,
converting web page IP addresses into lat/lon with MaxMind lookup tables).
IP geolocation gives the “registration location of the web
server” (not the physical location of the machines).
(What is the veracity of geolocation?)
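Converting an IP address to a location with a lookup table amounts to finding the address range that contains it. The sketch below shows that idea with a binary search over sorted ranges. The table rows, coordinates, and function names are hypothetical (the IP blocks are reserved documentation ranges), and the layout is an assumption for illustration, not the actual MaxMind file format.

```python
import bisect
import ipaddress

# Hypothetical lookup rows: (range_start, range_end, latitude, longitude).
RANGES = [
    ("192.0.2.0", "192.0.2.255", 32.7157, -117.1611),
    ("198.51.100.0", "198.51.100.255", 40.7128, -74.0060),
]


def build_index(rows):
    """Convert dotted addresses to integers and sort ranges by start address."""
    index = sorted(
        (int(ipaddress.ip_address(start)), int(ipaddress.ip_address(end)), lat, lon)
        for start, end, lat, lon in rows
    )
    starts = [row[0] for row in index]
    return starts, index


def locate(ip, starts, index):
    """Return (lat, lon) for the range containing ip, or None if no range matches."""
    value = int(ipaddress.ip_address(ip))
    position = bisect.bisect_right(starts, value) - 1
    if position < 0:
        return None
    _, end, lat, lon = index[position]
    return (lat, lon) if value <= end else None


starts, index = build_index(RANGES)
print(locate("192.0.2.10", starts, index))
print(locate("203.0.113.5", starts, index))
```

Whatever the table returns is the registered location of the address block, which is why the accuracy question raised above still applies to every lookup.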
Classifying different types of web pages and
social media for content and linguistic analysis;
COMPARE between the Bing engine and the Yahoo engine (“Jerry Sanders”
keyword; % in 12 different web page categories defined by our team
members):
Bing search: more commercial and informational (wiki) pages, and social media.
Yahoo search: more blogs and news, and educational pages.
(But in general, there are some similarities between the two engines.)
Spatial Accuracy of Web Page Categories
based on the IP address geo-conversion.

Highest:
Educational
Social Media
68.97%
Government
73.86%
60.98%
Lowest:
Blog
Special Interest Group
10.81%
12.81%
NGO
20.93%
Geolocation Accuracy in Different Keywords
• GREEN (correct)
• Blue (incorrect)
• Gray (unknown)
• Highest Spatial Accuracy
• McGinn – 33.57%
• Lowest Spatial Accuracy
• Santorum – 21.29%
• Highest N/A
• Flu – 35.52%
Web Page Information
Landscape (2012
Presidential Election)
Ming-Hsiang Tsou, Jiue-An Yang, Daniel Lusher, Su Han, Brian Spitzberg, Jean Mark Gawron,
Dipak Gupta & Li An (2013): Mapping social activities and concepts with social media (Twitter) and web
search engines (Yahoo and Bing): a case study in 2012 US Presidential Election, Cartography and Geographic
Information Science, DOI:10.1080/15230406.2013.799738
http://mappingideas.sdsu.edu/mapshowcase/election/webpage/election3.html
Twitter Case Study #1: 2012 Summer
Comparing FIVE Movie Tweets & Box Office
1) Select 30 major U.S. cities and, within a 17-mile radius of each, collect tweets with
movie keywords (TED, Spider Man, etc.).
2) Compare the daily movie box office results with the number of
tweets containing each movie keyword.
Five Movies Correlation Test
Movie          Daily     Weekly
TED            0.8826    0.9989
Spider-Man     0.9409    0.9725
Ice Age        0.8895    0.9528
Dark Knight    0.9523    0.9375
Step Up        0.8931    0.8123
Daily: Daily_Tweets vs. Daily_Box_Revenue
Weekly: 8_to_13_days_before, one_week_before, release_day,
one_week_after, two_weeks_after, three_weeks_after,
four_weeks_after
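The daily comparison (Daily_Tweets vs. Daily_Box_Revenue) can be illustrated with a correlation coefficient computed over paired daily values. The slides do not name the statistic, so the use of Pearson's r here is an assumption, and the sample numbers are invented for illustration; they are not the project's data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2 values)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    if variance_x == 0 or variance_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(variance_x * variance_y)


# Invented daily values for one movie (not real data).
daily_tweets = [1200, 4800, 3900, 2100, 1500]
daily_box_revenue = [2.0e6, 9.5e6, 7.0e6, 3.5e6, 2.4e6]
r = pearson(daily_tweets, daily_box_revenue)
print(round(r, 4))
```

The weekly comparison would use the same function on values aggregated into the seven periods listed above.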
Tweet_Daily and Box_Daily (TED)
[Figure: day on the x-axis (-13 to 28); axes: Tweets (0 to 60,000) and Box Office Revenue (0 to 25,000,000); series: Tweet_Daily, Box_Daily; annotations: Release, One Week, Two Week.]
Tweet_Weekly and Box_Weekly (TED)
[Figure: Week (releasing day as 0) on the x-axis (-2 to 4); axes: Tweets (0 to 180,000) and Box Office Revenue (0 to 90,000,000); series: Tweets_weekly, Box_weekly.]
Case Study #2: 2012 Presidential Election (Tweets)
[Maps: Before Hurricane Sandy; After Hurricane Sandy]
Sentiment Analysis (case study: 2012 Presidential Election), Before / After Hurricane Sandy
User Behavior Analysis:
Tweeting Pattern – by Hours
[Figure: Total Tweets (Z), 0 to 1.2, by Hour in Day (0 to 23); series: President Election, Public Health, Environmental, Movie.]
Location Consistency
Geo-tagged Location vs. Self-report Location
Domain           GPS Tweet %
Elections        1.018
Health           4.360
Environmental    1.315
Movie            2.875
Consistency (locations are aggregated by states): Consistent 60%, nonConsistent 20%, N/A 20%
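The consistency check between geo-tagged and self-reported locations, aggregated by state, can be sketched as a simple comparison per tweet. The record layout, the case-insensitive matching rule, and the sample records below are assumptions for illustration; the slides do not describe how self-reported locations were resolved to states.

```python
from collections import Counter


def consistency_shares(records):
    """Percent of records that are consistent, non-consistent, or N/A.

    Each record is (geotagged_state, self_reported_state); a missing or empty
    value on either side is counted as N/A.
    """
    counts = Counter()
    for geo_state, profile_state in records:
        if not geo_state or not profile_state:
            counts["n/a"] += 1
        elif geo_state.strip().upper() == profile_state.strip().upper():
            counts["consistent"] += 1
        else:
            counts["non_consistent"] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        key: 100.0 * counts[key] / total
        for key in ("consistent", "non_consistent", "n/a")
    }


# Invented sample records (state abbreviations), not project data.
sample = [
    ("CA", "CA"),
    ("NY", "ny"),
    ("TX", "TX"),
    ("CA", "WA"),
    ("FL", None),
]
shares = consistency_shares(sample)
print(shares)
```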
Developing new communication theories for cyberspace
social networks: DIFFUSION OF INNOVATIONS
THE INNOVATION DIFFUSION MODEL (E. Rogers, 1962):
Diffusion: “a special type of communication concerned with the spread of messages
that are perceived as new ideas” (E. Rogers, p. 35).
Torsten Hägerstrand, 1953 – Doctoral Dissertation - Culture Diffusion. 1967
Innovation Diffusion as a Spatial Process (book).
Public (Mass Media, TV, News)
vs.
Private (Personal Communications)
Publication (2013)
One special issue in
CaGIS journal
Five refereed journal
papers
SDSU: Human Dynamics in the Mobile Age
(HDMA)
A Proposal for an Area of Research Excellence at San Diego State University, 2013.
Human Dynamics, Spatial Science, Mobile Technology
Five Core Faculty:
• Ming-Hsiang Tsou (Coordinator of HDMA, Professor of Geography), 2012 SDSU President's
Leadership Fund (PLF) Awards for Excellence.
• Sheldon Zhang (Chair of Sociology, Professor), expert in Human Trafficking, Transnational Organized
Crime, Juvenile Delinquency, Probation and Parole, etc.
• John Elder (Distinguished Professor of Public Health, the Director of IBACH), 2012 SDSU President's
Leadership Fund (PLF) Awards for Excellence, expert in randomized trials for chronic disease prevention
in Latino communities.
• Piotr Jankowski (Incoming Chair of Geography), expert in Spatial Decision Support Systems.
• Brian Spitzberg (Senate Distinguished Professor, School of Communication), expert in communication
theories, communicative (in)competence, obsessive relational intrusion.
Along with 29 other funding faculty members from 14 different departments in 5 colleges.
Four new faculty lines will be hired in
2014 and 2015.
(Assistant or Associate Professor levels)
• One faculty line for Graduate School of Public Health is
suggested for the Institute for Behavioral and Community
Health [IBACH] to focus on Behavior-Environmental
interactions in Community Health. (2014)
• One faculty line is suggested for the Department of Geography
to focus on spatio-temporal data analytics. (2014)
• One faculty line is suggested for the Department of Sociology
to focus on Urban Sociology with demography and spatial
analysis techniques. (2015)
• One faculty line is suggested for the School of Communication
to focus on strategic communication with Big Data and Social
Media. (2015).