NSF CDI meeting - San Diego State University
Mapping Ideas from Cyberspace to Realspace. Funded by the NSF Cyber-Enabled Discovery and Innovation (CDI) program. Award # 1028177.
(2010-2014) http://mappingideas.sdsu.edu/
Cyberspace and realspace:
Distilling useful information
Li An
San Diego State University
[email protected]
Presented at 2013 NSF CDI Project Meeting
San Diego, California, August 8, 2013
Principal Investigator: Dr. Ming-Hsiang Tsou [email protected], (Geography), Co-PIs: Dr. Dipak K Gupta
(Political Science), Dr. Jean Marc Gawron (Linguistics), Dr. Brian Spitzberg (Communications), Dr. Li An
(Geography). San Diego State University, USA.
Cyberspace data—climate change and
global warming
•Search on Yahoo
•Record
• website location (IP registered address)
• rank (r) of the website (measure of popularity)
•Map all websites: (x, y, r)
•Geoprocess: kriging/interpolation (county)
•Correct background noise
Kriging maps of climate change
[Map: Time 4]
What do such data represent?
[Diagram: cyberspace data within the total population (county)]
Why? Some people do not have or use a website.
Some do not express their opinions on the web, …
Classification of cyber search data
• Blog
• Commercial websites
• Educational
• Entertainment
• Forum
• Governmental
• Informational
• News
• NGO
• Social media website
• Special Interest Groups
• Offline
An example of classified data (Climate change)
[Maps: Edu, dated 03-04-12, 11-11-11, 03-05-13, 07-01-12, 11-04-12]
An example of classified data (Global warming)
[Maps: Edu, dated 11-10-11, 03-03-12, 03-03-13, 07-01-12, 11-04-12]
What do such data represent?
[Diagram: Edu, Gov, and News classes of cyberspace data within the total population (county)]
Why? Some educators do not have a website.
Some government agencies do not have a website.
Some do not express their opinions on the web, …
What happens on the ground?
•Socioeconomic and demographic data
•Climate data
•Political membership data at county level
(partially available)
The link between real and cyber space
Y = f (independent variables) ----- if we use all realspace data
Y = f (independent variables) ----- if we use cyberspace data for Y
[Diagram: Edu, Gov, and News classes of cyberspace data within the total population (county)]
Challenge
Data type         Reliability   Large coverage   Return time
Realspace data    High          Difficult        Slow
Cyberspace data   Variable      Easy             Quick
Regression “backward”: explore data usefulness
• Assume realspace data (often unavailable) = f (independent variables)
gives reasonable results (with good fit)
• If cyberspace data = f (independent variables)
also gives reasonable results (with good fit)
Shall we increase our confidence in the reliability of
cyberspace data (in place of realspace data)?
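The comparison of fits can be sketched with a one-predictor least-squares regression. This is only an illustration: the function name `ols_r_squared` and all numbers are hypothetical toy values, not project data, and the slides use many predictors rather than one.

```python
def ols_r_squared(x, y):
    """Fit y = intercept + slope * x by least squares; return (slope, intercept, R^2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot

# Hypothetical toy values: one independent variable and two candidate Y's.
predictor = [1.0, 2.0, 3.0, 4.0, 5.0]
y_realspace = [2.1, 3.9, 6.2, 7.8, 10.1]   # assumed "ground" measurements
y_cyberspace = [1.0, 4.5, 5.0, 9.0, 9.5]   # assumed web-derived measurements

_, _, r2_real = ols_r_squared(predictor, y_realspace)
_, _, r2_cyber = ols_r_squared(predictor, y_cyberspace)
print(f"R^2 with realspace Y:  {r2_real:.3f}")
print(f"R^2 with cyberspace Y: {r2_cyber:.3f}")
```

If the cyberspace fit is comparably good, that is one piece of evidence (not proof) that the cyberspace measure carries similar information.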
Discrete-time stepwise regression: hypothesis
• For some topics or keywords (such as climate change and global
change), people’s attitudes and perceptions are largely constant
over a reasonable time scale. Therefore:
• The selected predictor variables should be largely the same over time
• If the selected predictor variables change over time, then
• either attitudes and perceptions have changed
• or the measure of attitudes and perceptions is questionable
[Note: values of predictor variables are largely constant]
Measure: consistency index (CI)
[Table: candidate variables X1 to X10 (rows) by Time 1 to Time 5 (columns). An x marks a variable selected at that time; 19 cells are marked.]
* Variable X1 is selected as a significant predictor at the 0.05 level.
$$\mathrm{CI} = \frac{\text{counts of check marks}}{\text{total possible counts}} = \frac{19}{10 \times 5} = 0.38$$
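As a rough illustration of the index, the sketch below counts selections across time steps and divides by the total possible count. The function name `consistency_index` and the selection pattern are hypothetical: the pattern is made up so that it holds 19 marks over 10 variables and 5 times, matching the totals on the slide but not its actual layout.

```python
def consistency_index(selected_by_time, candidates):
    """CI = (number of check marks) / (number of candidates * number of time steps).

    selected_by_time: one set of selected variable names per time step.
    candidates: all candidate variable names.
    """
    marks = sum(1 for chosen in selected_by_time for name in candidates if name in chosen)
    return marks / (len(candidates) * len(selected_by_time))

candidates = [f"X{i}" for i in range(1, 11)]  # X1 .. X10

# Hypothetical selections at Time 1 .. Time 5 (19 marks in total).
selected_by_time = [
    {"X1", "X3", "X4", "X5"},
    {"X1", "X3", "X4", "X5"},
    {"X2", "X4", "X5", "X8"},
    {"X4", "X5", "X8", "X9"},
    {"X4", "X6", "X10"},
]

ci = consistency_index(selected_by_time, candidates)
print(f"CI = {ci:.2f}")
```

A CI of 1 would mean every candidate is selected at every time step; values near 0 mean few variables are selected at all, so the index rewards repeated selection rather than agreement alone.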
Consistency index example (Pro-GW)
[Table: predictors selected (x) at time steps T1 to T5. Row labels: COMMLONE, DemocraticPercEvan, EMPRATIO, FOREIGN, HIGHPLUS, Intercept, MaxDrySpel, N10_OWNVAC, N10ASIAN, N10AVGHHLD, N10FAMHHLD, N10HU_OCC, N10M65OVER, N10NONREL, N10ONERACE, N10W65PLUS, PrecipIndx, STATEBORN, SummerDays, VETERANS.]
Counts: 57
CI: 0.57
Measure: consistency index (CI)
CI value         Edu    Gov    Locationally accurate   Pro    Unclassified data
Climate change   0.59   0.45   0.57                    0.45   0.41
Global warming   0.85   0.58   0.50                    0.57   0.36
Tentative conclusion:
1) Search results from educational websites for “global
warming” are most consistent;
2) Data classification helps in extracting useful information.
Why this regression-based approach?
• Incorporate mechanism (independent variables)
• Capture big trend and pattern
• Some degree of fuzziness allowed
• Allow for data from multiple locations (all US counties) and
multiple times (5 here)
• Allow for calculation of one overall index (CI)
Acknowledgement
• NSF CDI Program
• SDSU Department of Geography
• The whole SDSU CDI team. In particular:
• Evan Casey, Elias Issa, Ninghua Wang, and Sarah Wandersee
Thank you
Tentative decision
Work on search results from educational
websites on “global warming” (“climate
change” is also okay)
[Map: GW030313, Edu]
Point data interpolation/kriging
• A good way to create spatially contiguous data (surface)?
• What other methods?
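One widely used alternative to kriging is inverse distance weighting (IDW), sketched below. It is named here only as an example of another method, not as something the project used. The function name `idw`, the `power` parameter, and the sample points are hypothetical; samples follow the (x, y, r) form of the mapped websites.

```python
import math

def idw(samples, x, y, power=2.0):
    """Inverse distance weighted estimate at (x, y).

    samples: iterable of (sx, sy, value) tuples, e.g. (x, y, r) for each website.
    power: exponent on distance; larger values make nearby samples dominate.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for sx, sy, value in samples:
        distance = math.hypot(x - sx, y - sy)
        if distance == 0.0:
            return value  # exactly on a sample point: return its value
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total

# Hypothetical sample points (x, y, r); not project data.
samples = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0), (0.0, 2.0, 30.0)]
estimate = idw(samples, 1.0, 1.0)
print(f"IDW estimate at (1, 1): {estimate:.2f}")
```

Unlike kriging, IDW does not model spatial autocorrelation and gives no estimate of prediction error, which is one reason to compare the two on the same data.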
Hazard
• An indicator of the instantaneous risk or potential
for an event to take place
• Related to internal and external features
• Great flexibility
• At a time point or averaged over a time interval
• At individual or aggregate levels
How to derive hazard?
• Theoretical hazard curves
• Weibull Distribution
• Gompertz Distribution
• Exponential Distribution…
• Empirical hazard calculation
• Counts of events
• Timing of events
• Encapsulate historical events
[Figure: Weibull distribution hazard curves (hazard against time) for α = 1.5, α = 1, α = 0, and α = -0.5]
[Figure: hazard of change for Unit A, Unit B, and Unit C over 0 to 80 time units]
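The theoretical curves can be sketched as follows, assuming the α in the figure is the exponent in the log-linear form log h(t) = μ + α log t for the Weibull hazard (with α > -1) and log h(t) = μ + α t for the Gompertz hazard. The slides do not state the parametrization, so this form, the function names, and the default μ = 0 are assumptions.

```python
import math

def weibull_hazard(t, alpha, mu=0.0):
    """Weibull hazard in log-linear form: log h(t) = mu + alpha * log(t), t > 0."""
    if t <= 0:
        raise ValueError("t must be positive")
    return math.exp(mu) * t ** alpha

def gompertz_hazard(t, alpha, mu=0.0):
    """Gompertz hazard: log h(t) = mu + alpha * t."""
    return math.exp(mu + alpha * t)

def exponential_hazard(t, mu=0.0):
    """Exponential model: constant hazard, the alpha = 0 case of both forms above."""
    return math.exp(mu)

# Evaluate the Weibull hazard for the alpha values shown in the figure.
for alpha in (1.5, 1.0, 0.0, -0.5):
    values = [weibull_hazard(t, alpha) for t in (0.5, 1.0, 2.0, 4.0)]
    print(alpha, [round(v, 3) for v in values])
```

Under this form, α > 0 gives a hazard that rises with time, α = 0 a constant hazard, and α < 0 a hazard that falls with time.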
What is survival analysis?
A class of statistical methods for studying the occurrence and timing
of events
• Unique dependent variable: hazard
• Good at handling imprecise time measures
• Good at handling time-dependent variables
• Good at handling information uncertainty and imbalanced data
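An empirical hazard can be estimated from the counts and timing of events in the way a discrete-time life table does: in each interval, divide the number of events by the number of units still at risk. The sketch below is a minimal version under stated assumptions; the function name `empirical_hazard`, the record format, and the toy records are hypothetical, and censored units are counted as at risk throughout the interval in which they were last observed.

```python
def empirical_hazard(records, n_intervals):
    """Discrete-time hazard per interval: events / units at risk.

    records: list of (interval_index, event_observed) pairs, one per unit.
             interval_index is the interval of the event, or of the last
             observation if the unit was censored (event_observed is False).
    Returns a list with one hazard per interval (None if no unit is at risk).
    """
    hazards = []
    for k in range(n_intervals):
        at_risk = sum(1 for last, _ in records if last >= k)
        events = sum(1 for last, observed in records if last == k and observed)
        hazards.append(events / at_risk if at_risk else None)
    return hazards

# Hypothetical toy records for five units; not project data.
records = [(0, True), (1, True), (1, False), (2, True), (3, False)]
hazards = empirical_hazard(records, n_intervals=4)
print(hazards)
```

Censored units still contribute to the denominator up to their last observed interval, which is how this family of methods uses incomplete histories instead of discarding them.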