Data Science - Department of Statistics

Data Science:
A Personal View from the CS (DB) Perspective
Peixiang Zhao
Department of Computer Science
Florida State University
[email protected]
Tallahassee, Florida, Sept., 2016
Synopsis
1. Introduction to Data Science
– With a special focus from the computer science view: databases, data mining, etc.
2. How to prepare yourself for (data science) research
3. My research portfolio
4. Conclusions
1 / 27
Who am I?
• Peixiang Zhao
– Assistant Professor at CS @ FSU
– Homepage: http://www.cs.fsu.edu/~zhao
– Office: 262 Love Building, FSU
– Ph.D.: University of Illinois at Urbana-Champaign, Aug. 2012
– Research Interest:
• Database, data mining, data-intensive computation and analytics,
and Graph/Information Network Analysis!
2 / 27
Who am I?
• Courses I am offering
– COP4710: Introductory database systems
• Every fall semester
• What are (relational) databases and how to use databases
– COP4930: Data mining
• Spring 2016
– COP5725: Advanced database systems
• Every spring semester
• Database internals and advanced topics, such as MapReduce,
mining, and Web search
3 / 27
Data Science
• What is data science?
– The sub-area of statistics and computer science dealing with
the acquisition, management, understanding, querying, and mining
of data drawn from real-world applications
• https://www.youtube.com/watch?v=dKHz9LbgRmo
• http://www.youtube.com/watch?v=LrNlZ7-SMPk
4 / 27
What Is Involved? Data Scientists
5 / 27
Data Science – The CS Side
• Data science in Computer Science
– Includes, but is not limited to
• Database systems
• Machine learning
• Data mining
• Information retrieval
• Network science
• Big data
• Systems
• ……
6 / 27
Data + Science
• Data:
– Model: Fully structured or relational, semi-structured,
unstructured, graph-structured, spatial-temporal, ……
– Format: textual, numeric, categorical, sequential, graph,
audio/video, time-series, streaming data
– Scale: from megabytes to zettabytes
– Quality, resolution, privacy, usability ……
• Common Tasks:
– Data acquisition, sanitation, transformation, storage, maintenance
and integration
– Indexing, querying, and ranking
– Knowledge discovery, mining and machine learning
7 / 27
Data Sciences
• Skill Sets and Requirements
– Motivation and passion to work on the state-of-the-art
problems
– Strong mathematical reasoning and algorithm design abilities
– Good programming skills
• Your Bright Future
– DBA at Goldman Sachs or D. E. Shaw
– Data scientist at Google, Facebook, Twitter or Foursquare
– Data engineer at Oracle, IBM, or Microsoft
– Researcher at MSR, IBM Research or Yahoo! Labs
– Professor publishing at SIGMOD, VLDB, KDD, or SIGIR
8 / 27
Databases: Examples
9 / 27
Databases: In Industry
10 / 27
Databases: In Science
Turing Award winners from the database field:
CHARLES BACHMAN, 1973
EDGAR CODD, 1981
JAMES GRAY, 1998
MICHAEL STONEBRAKER, 2014
11 / 27
Database Systems
• System for providing EFFICIENT, CONVENIENT, and SAFE
MULTI-USER storage of and access to MASSIVE amounts
of PERSISTENT data
– http://cs.stanford.edu/people/widom/DB-mooc.html
12 / 27
Key Topics in Database Systems
• Modeling
– ER model vs. relational model
• Foundation
– Relational algebra, relational calculus, design principles
• SQL
– Implementation
• Storage & Representation
• Indexing
– B/B+/R tree, sorting, hashing ……
• Query processing & Optimization
• Transactions & Recovery
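Several of the topics above (SQL, indexing, transactions) can be seen together in a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and rows are hypothetical, chosen only for illustration; they are not from the slides.

```python
import sqlite3

# In-memory database; schema and rows below are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
conn.execute("CREATE TABLE enrolled (student_id INTEGER, course TEXT)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [(1, "Ann", "CS"), (2, "Bob", "Stat"), (3, "Cho", "CS")],
)
conn.executemany(
    "INSERT INTO enrolled VALUES (?, ?)",
    [(1, "COP4710"), (2, "COP4710"), (3, "COP5725")],
)

# Indexing: SQLite stores this index as a B-tree on the join column.
conn.execute("CREATE INDEX idx_enrolled_student ON enrolled(student_id)")
conn.commit()  # make the initial data durable before the transaction demo

# SQL: a join plus a selection (WHERE) and a projection (SELECT name).
query = """
    SELECT s.name
    FROM student AS s
    JOIN enrolled AS e ON s.id = e.student_id
    WHERE s.dept = 'CS' AND e.course = 'COP4710'
"""
cs_in_4710 = [row[0] for row in conn.execute(query)]

# Transactions & recovery: the connection used as a context manager commits
# on success and rolls back everything in the block if an exception is raised.
try:
    with conn:
        conn.execute("INSERT INTO student VALUES (4, 'Dee', 'CS')")
        conn.execute("INSERT INTO student VALUES (1, 'Dup', 'CS')")  # duplicate key
except sqlite3.IntegrityError:
    pass  # both inserts were rolled back together

student_count = conn.execute("SELECT COUNT(*) FROM student").fetchone()[0]
print(cs_in_4710, student_count)
```

The failed transaction leaves the student table unchanged, which is the "SAFE" part of the database system definition in miniature.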
13 / 27
How to prepare yourself for (data science) research
• What is research?
– Discover new knowledge
– Seek answers to non-trivial questions
• Research Process
1. Identification of the topic (e.g., Web search)
2. Hypothesis formulation (e.g., algorithm X is better than
Y=state-of-the-art)
3. Experiment design (measures, data, etc) (e.g., retrieval accuracy
on a sample of web data)
4. Test hypothesis (e.g., compare X and Y on the data)
5. Draw conclusions and repeat the cycle of hypothesis
formulation and testing if necessary (e.g., Y is better only for
some queries, now what?)
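Steps 3 to 5 of the process above can be illustrated with a small sketch, assuming per-query accuracy scores for algorithms X and Y have already been measured on the same queries. The scores are made up for illustration, and the exact sign-flip (paired permutation) test is one reasonable choice of test, not one prescribed by the slides.

```python
from itertools import product

def paired_sign_flip_test(scores_x, scores_y):
    """Exact two-sided sign-flip test on paired per-query differences.

    Null hypothesis: X and Y are interchangeable, so each difference
    is equally likely to have either sign.
    """
    diffs = [x - y for x, y in zip(scores_x, scores_y)]
    observed = abs(sum(diffs))
    n_extreme = 0
    n_total = 0
    # Enumerate every assignment of signs to the differences (2^n cases).
    for signs in product((1, -1), repeat=len(diffs)):
        n_total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed:
            n_extreme += 1
    mean_diff = sum(diffs) / len(diffs)
    return mean_diff, n_extreme / n_total

# Hypothetical retrieval accuracy (percent) on eight sample queries.
scores_x = [72, 65, 80, 55, 90, 61, 77, 68]  # algorithm X (proposed)
scores_y = [70, 60, 75, 57, 85, 60, 70, 66]  # algorithm Y (state of the art)

mean_diff, p_value = paired_sign_flip_test(scores_x, scores_y)

# Queries where Y still wins: the starting point for the next cycle.
y_better = [i for i, (x, y) in enumerate(zip(scores_x, scores_y)) if y > x]
print(mean_diff, p_value, y_better)
```

A small p-value suggests the observed difference would be unusual if X and Y were interchangeable; the list of queries where Y is still better is what prompts the "now what?" in step 5.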
14 / 27
What is Good Research?
• Solid work:
– A clear hypothesis (research question) with conclusive results (either positive or
negative)
– Clearly adds to our knowledge base (what can we learn from this work?)
– Implications: a solid, focused contribution is often better than a non-conclusive
broad exploration
• High impact = high-importance-of-problem * high-quality-of-solution
– high impact = open up an important problem
– high impact = close a problem with the best solution
– high impact = major milestones in between
– Implications: question the importance of the problem, and don’t be
satisfied with a merely good solution; make it the best
15 / 27
Challenge-Impact Analysis
[Figure: research problems plotted by "Level of Challenges" (vertical axis, from Known to Unknown) against "Impact/Usefulness" (horizontal axis)]
– Difficult basic research problems, but questionable impact
– High impact, high risk (hard): good long-term research problems
– Low impact, low risk: bad research problems (may not be publishable)
– High impact, low risk (easy): good short-term research problems
– Good applications: not interesting for research
– “Entry point” problems
16 / 27
How to Do Research in Data Sciences?
• Curiosity: allows you to ask questions
• Critical thinking: allows you to challenge assumptions
– Make sense of what you have read/heard
• Learning: takes you to the frontier of knowledge
– Start with textbooks and courses
– Read papers in top-notch conferences/journals
– Implement your prototype ideas
• Persistence: so that you don’t give up
• Respect data and truth: ensure your research is solid
– Don’t throw away negative results
• Communication: publish and present your work
17 / 27
Tuning the Problem
[Figure: a chart of "Level of Challenges" (Known to Unknown) against "Impact/Usefulness", annotated with three ways to tune a problem]
– Make an easy problem harder
– Make a hard problem easier
– Increase impact (more general)
18 / 27
Where to Publish?
• Databases
– SIGMOD, VLDB, ICDE
– ACM TODS, VLDB J., IEEE TKDE
• Data Mining
– KDD, ICDM, SDM
– ACM TKDD
• Information Retrieval
– SIGIR, CIKM
– ACM TOIS
• Web & Applications
– WWW, WSDM
19 / 27
My Research Portfolio
• What are information networks?
1. A large number of interacting physical, conceptual, and
human/societal entities
2. Entities are interconnected with relationships
• Information networks are ubiquitous
– Technological networks
– Social networks
– Biomedical, biochemical and ecological networks
– The Web
– ……
20/ 27
Real-world Information Networks
• The network structure of the Internet: Opte Project (http://www.opte.org/maps/)
– Entities: class C subnets
– Relationship: data packet routes
• Citation Networks (http://bluwiki.com/go/Citation)
– Entities: 5199 papers from SIGOPS, SIGPLAN, SIGART
– Relationship: 5343 citations
• Yeast protein interaction network (baker’s yeast) (http://www.bordalierinstitute.com/)
• Twitter network (http://yoan.dosimple.ch/blog/)
21 / 27
Information Networks: Model and Characteristics
• An information network can be modeled as a graph
comprising both vertices and edges
– G = (V, E)
• A real-world information network is
– massive (Jun. 2012)
• Web graph: 8.94 billion pages
• Facebook: 901 million active users and 125
billion friendship relations
– dynamic
• Facebook U.S. grew 149% in 2009
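The model G = (V, E) above can be sketched as adjacency sets, which also makes the "dynamic" property concrete: vertices and edges are added as the network grows. This is a minimal in-memory illustration for an undirected graph; the class name, method names, and the tiny friendship network are assumptions made for the example.

```python
from collections import defaultdict

class Graph:
    """Undirected graph G = (V, E) stored as adjacency sets."""

    def __init__(self):
        self.adj = defaultdict(set)

    def add_vertex(self, v):
        self.adj[v]  # touching the key creates an empty neighbor set

    def add_edge(self, u, v):
        # An undirected edge is recorded in both directions.
        self.adj[u].add(v)
        self.adj[v].add(u)

    def vertices(self):
        return set(self.adj)

    def edges(self):
        # frozenset makes {u, v} and {v, u} count as the same edge.
        return {frozenset((u, v)) for u in self.adj for v in self.adj[u]}

# A hypothetical five-person friendship network.
g = Graph()
for u, v in [("ann", "bob"), ("bob", "cho"), ("ann", "cho"), ("cho", "dee")]:
    g.add_edge(u, v)
g.add_vertex("eve")  # a user with no friends yet

num_vertices = len(g.vertices())
num_edges = len(g.edges())
degree_of_cho = len(g.adj["cho"])
print(num_vertices, num_edges, degree_of_cho)
```

At the scale quoted above (billions of vertices and edges), a single in-memory structure like this is no longer practical, which motivates the querying challenges discussed next.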
22/ 27
Querying Information Networks
• Motivation
– The most natural and easiest approach to managing and
accessing information networks is querying!
• Neighborhood query, keyword query, reachability query, shortest-path
query, graph query, frequency estimation query, ……
• Challenges
– The massive and dynamic nature of information networks precludes
the direct application of most well-studied, memory-resident
graph algorithms!
• Example queries
– Who are my friends in Google+?
– Which university is UIUC?
– What is the shortest route between UIUC and FSU?
– What are the largest phenotypic associations between rice and maize?
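Three of the query types named on this slide (neighborhood, reachability, shortest-path) can be sketched with breadth-first search on an unweighted graph. This is exactly the kind of memory-resident algorithm the slide says does not scale directly to massive networks; it is shown only to pin down what the queries mean. The toy graph and all function names are assumptions for illustration.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distance from source to every vertex reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def neighborhood(adj, source, k=1):
    """Neighborhood query: vertices within k hops of source."""
    return {v for v, d in bfs_distances(adj, source).items() if 0 < d <= k}

def reachable(adj, source, target):
    """Reachability query: is there any path from source to target?"""
    return target in bfs_distances(adj, source)

def shortest_path(adj, source, target):
    """Shortest-path query (fewest hops); returns None if unreachable."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            path = []
            while u is not None:  # walk parent links back to the source
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj.get(u, ()):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None

# Hypothetical route graph; intermediate vertices a..d are placeholders.
adj = {
    "UIUC": ["a", "b"],
    "a": ["UIUC", "d"],
    "b": ["UIUC", "c"],
    "c": ["b", "d"],
    "d": ["a", "c", "FSU"],
    "FSU": ["d"],
    "isolated": [],
}

friends = neighborhood(adj, "UIUC", k=1)
route = shortest_path(adj, "UIUC", "FSU")
print(friends, route, reachable(adj, "UIUC", "isolated"))
```

Each query here scans the graph from scratch; on a massive, changing network that cost is what indexing and approximation techniques aim to avoid.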
23/ 27
My Focus and Solutions
[Figure: tasks on information networks, each addressed by efficient, cost-effective and potentially scalable solutions]
• Tasks: Frequency Estimation, OLAP Aggregation, Subgraph Matching, Structural Similarity
• Solutions: gSketch, Graph Cube, Tree+δ, P-Rank, SPath, gSparsify, SimQuery
• Information networks: Unlabeled/Labeled, Disconnected/Connected, Unidimensional/Multidimensional, Static/Dynamic
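To make the frequency estimation task concrete, here is a minimal sketch of a generic count-min sketch applied to a stream of edges. It is not the gSketch method listed above and does not reproduce any of the listed solutions; it only illustrates the idea of answering "how often did this edge occur?" in fixed memory. The class, the edge_key helper, and the sample stream are assumptions for illustration.

```python
import hashlib

class CountMinSketch:
    """Fixed-size counter table; estimates never undercount."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, row, key):
        # One salted hash per row acts as an independent hash function.
        digest = hashlib.blake2b(
            key.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        for row in range(self.depth):
            self.table[row][self._bucket(row, key)] += count

    def estimate(self, key):
        # Collisions can only inflate a counter, so the minimum across
        # rows is the tightest available upper bound on the true count.
        return min(
            self.table[row][self._bucket(row, key)] for row in range(self.depth)
        )

def edge_key(u, v):
    """Hypothetical encoding of a directed edge as a string key."""
    return f"{u}->{v}"

# A made-up edge stream: each pair is one observed interaction.
stream = [("a", "b"), ("a", "b"), ("b", "c"), ("a", "b"), ("c", "a")]

sketch = CountMinSketch()
for u, v in stream:
    sketch.add(edge_key(u, v))

estimate_ab = sketch.estimate(edge_key("a", "b"))
print(estimate_ab)
```

The memory used depends only on width and depth, not on the number of distinct edges, at the price of possible overestimates when edges collide in every row.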
24/ 27
My Other Work
• Location-based mining and ranking
– [SIGIR’11], [CIKM’11], [TKDE’15]
• Text mining
– [SDM’12], [SIGIR’10], [KAIS’13]
• Mining large-scale information networks
– [ICDM’10], [EDBT’09], [SIGMOD’08], [CIKM’15]
• Mining structural patterns
– [WWW-J.’08], [DASFAA’07]
• Industry-strength systems
– Hadoop-ML at IBM Research
– Trinity at Microsoft Research
25/ 27
Future Research Agenda
• Foundations and models of Information Networks
– Model, manage and access multi-genre heterogeneous information networks
– Querying and mining volatile, noisy and uncertain information networks
– Cyber-physical information networks
• Efficient and scalable computation in Information
Networks
– A unified declarative language for graph and network data
– A distributed graph computational framework for large-scale information
networks
• Knowledge discovery in large Information Networks
26/ 27
Conclusions
• We are in an information network era!
– Internet, social networks, collaboration and recommender
networks, public health-care networks, technological/biological
networks ……
• Data are pervasive, big, and of great value
• Research in data sciences is interesting and highly
rewarding
• Follow your heart and don’t give up!
27/ 27
Good Luck!
Q&A
28/ 27