Transcript Slide 1

Yahoo!-DAIS Seminar (CS591DAI)
Orientation
ChengXiang (“Cheng”) Zhai
Department of Computer Science
University of Illinois at Urbana-Champaign
Basic Information about CS591DAI
(=Yahoo!-DAIS seminar)



Meets at 4-5pm, Tuesdays, 0216 SC
1 Credit Hour  Miss at most two talks
At each meeting, you can expect







An interesting research talk
Opportunity for research discussion
An attendance sheet for collecting signatures
And snacks (thanks to Yahoo!)
Seminar coordinators: Yucheng Chen, Hao Luo
Website: http://dais.cs.uiuc.edu/seminars.html (under
construction)
Mailing list: [email protected] (make sure you subscribe to it)
The DAIS Group: Data & Information Systems
+ many grad/undergrad students + few postdocs/visitors
Prof. Emerita
Kevin Chang
Jiawei Han
Aditya Parameswaran
Saurabh Sinha
Marianne Winslett
Tandy Warnow
(+ Bioenginering)
Jian Peng
Hari Sundaram
(+Advertising)
Bioinformatics
Cheng Zhai
Landscape of DAIS Research
Intelligence
Intelligent information agent
Decision/Task support
Information analysis & Data mining
Data/Information Access
Recommendation
Search
Browsing
Storage
Gigabytes Terabytes Petabytes
Productivity (Web, email, …)
Decision making (government, business, personal)
Education
Health/Medical/Biology
Scalability
Application impact
5
DAIS & Related Areas in CS
Intelligence
Artificial Intelligence
Statistics
Intelligent information agent
Decision/Task support
Information analysis & Data mining
Data/Information Access
Recommendation
Search
Browsing
Parallel Computing
Storage
Gigabytes Terabytes Petabytes
Theory &
Algorithms
Scalability
Productivity (Web, email, …)
Systems & Networking
Decision making (government, business, personal)
Education
Health/Medical/Biology
Application impact
Human-Computer Interaction, Graphics
6
Overview of DAIS Faculty Research
Kevin C. Chang: Bridging Structured
and Unstructured Data
How to bring structured/semantic-rich access to the
myriad and massive unstructured data which accounts
for most of the world's information?



How to search the Web that Google does not see?
How to reach into and across pages that Google only
takes you to?
How to listen to buzz of the world and make sense?
Project 1. MetaQuerier:
Exploring and Integrating the Deep Web
MetaExplorer
• source discovery
• source modeling
• source indexing
FIND sources
db of dbs
MetaIntegrator
• source selection
• schema integration
• query mediation
QUERY sources
unified query interface
Project 2. WISDM: Data-aware Search over the Web
Data Aware Search
10
Demo: Entity Search-- “university of california
#location”
Project BigSocial: Social Data Analytics
Social Data Analytics
Aditya Parameswaran


Started August 2014
i.stanford.edu/~adityagp

Interests: Data Management, Mining and
Algorithms

Specific topics of interest




Interactive Data Analytics
Crowd-powered Analytics
Approximate Analytics
Visual Analytics and Data Mining
Research Style
Build Real Data
Analytics Tools /
Systems
Data Science and
Applied ML
Data
Systems
Data
Mining
Theory
Design Algorithms with
Guarantees
Research Goal: Simplifying Data Analytics
“….(in the next few years) we project a need for 1.5 million
additional analysts in the United States who can analyze data
effectively…“,
-- McKinsey Big Data Study, 2012
How do we make it easier for novice data analysts to get insights from data?
Simplifying Data Analytics: Four Aspects
Unstructured
Crowd
Powered
Analytics
Querying
Visualizations
Interactive
Analytics
Visual
Analytics
Scale
Approximate
Analytics
Example Projects
•
Crowd-powered search
•
Crowd-powered data extraction & cleaning
•
Interactive query synthesis
•
Speculative querying and caching
•
Recommending visualizations automatically
•
Approximate visualizations with guarantees
•
Fundamental principles: cost, latency, error
•
Browsing-based query processing
Crowd
Query
Visual
Approx
Hari Sundaram

Prof. Hari Sundaram used a separate file for his
presentation, which is available at;
http://times.cs.uiuc.edu/czhai/pub/DAIS-fall14-Hari.pdf
Jiawei Han: Data Mining & Information
Networks
How to perform data mining effectively in massive data and in
heterogeneous information networks?
How to mine structures and construct networks from unstructured
real-world data?
Specific research subareas:

Effective methods for mining heterogeneous networks

Construction of heterogeneous networks from unstructured data

Multi-dimensional unstructured data summarization & OLAP

Truth discovery and outlier mining in networked data

Spatiotemporal and cyber-physical data mining (e.g., mobile objects,
sensor/mobile data mining)
Recent Research Focus on Data Mining



Network Construction, Search and Mining on Real datasets
 News Network: 10M news articles + news of last 70 years
 Computer Science Research Network: DBLP + citations +
abstracts + Web pages of researcher + other related web pages
 Tweet network and other social media network
 Bio-Medical Research Network: PubMed and other medical
sources
 Networkfication of Knowledge-Bases: Wikipedia, DBPedia,
Freebase
 Cyber-Physical Networks (internet of things)
Other frontiers: trajectory mining, truth finding, anomaly, …
Two ARL-funded projects in 2014-2016 for Network Science
Collaborative Technology Alliance (NSCTA)
20
Constructing Unified, Structured
Knowledge Networks
Current State-of-the-Art
 Natural language processing
 Graph-based mining
 Theory of planned behavior
Near perfect reliable network
construction through progressive source processing
and network refinement
Research Topics/Technical Approach
Automated Construction of adaptable knowledge
networks
 Preliminary network construction via NLP and social
network techniques
 Exploitation of links for network refinement
 Streaming updates
Multi-dimension truth analysis
Team: H. Ji (RPI) (lead), J. Han (UIUC), G. Cao (PSU), C.
 Credibility analysis by processing the interactions of the
Voss (ARL), W. Wallace (RPI)
physical and social/cognitive states of the social
Rotations: Ji: post-doc (Hong): 3 mo. at ARL; Han:
network and its interactions with the information
student (Liu/Brova): 3 mo. at ARL; Wallace: student (Yulia): 3
network
mo. at ARL; each PI: visits to ARL totaling 1 wk or more. RPI
Information processing & social/cognitive modeling
students visit UIUC for one week each.
 Build socio-cognitive models to predict human behavior
Army Needs and Benefits
 User-oriented and constraint-aware information
 Exploitation of unstructured data for improved
processing
situational analysis
 Predictive tools to understand adversarial intent
Construction of latent social and information networks require understanding connecting the
dots in documents by linking entities and understanding human behavior
21
Distributed, User-Oriented Multi-Scale
Network Summarization and OLAP
Current State-of-the-Art
 Summarization on relational and text data
 Social and cognitive computing
 Online analytical processing on data cubes
Support of distributed, multiscale, multi-genre network summarization, OLAP and
situation analysis for diverse user groups
Research Topics/Technical Approach
Multi-scale network summarization & aggregation
 Creation and enrichment of multi-dimension
information from text and unstructured data
 Network cube construction: conflict resolution, topical
hierarchy generation, multidimensional indexing &
selective, partial cube materialization
User-oriented adaptation of network cube views
Team: J. Han (UIUC) (lead), J. Hendler (RPI),
 User- or social network-oriented OLAP
T. Hanratty (ARL), H Ji (RPI), B. Welles (NEU)
 Accommodate dynamic updates of underlying social
Rotations: Han: student (Tao/Song): 3 mo. at ARL; Ji:
and/or communication networks
student (Zhang): 3 mo. at ARL; each PI: visits to ARL, total 1
Cost- and constraint-aware network cube
wk; 3 univ. mutual-visits: PIs + students
 Distributed, user- or community-oriented cube views
Army Needs and Benefits
 Cost-, constraint- and availability- aware drilling, search
 Summarization/visualization matched to the user
and analysis
cognitive abilities for improved situational awareness
Cognitive modeling, visualization, and supporting human
 Flexible and efficient situation analysis for diverse groups decision making
of users
 Experimental platform: news, blogs, tweets, etc.
Distributed, user-oriented, multi-scale summarization of networks to support online
information processing and situation analysis for diverse groups of users
22
From Data Mining to Mining Info. Networks
Han, Kamber and Pei,
Data Mining, 3rd ed. 2011
Yu, Han and Faloutsos (eds.),
Link Mining, 2010
Sun and Han, Mining Heterogeneous
Information Networks, 2012
23
ChengXiang Zhai: Intelligent text
information management & analysis
How can we develop intelligent algorithms and systems to help people
manage and exploit large amounts of text data (e.g., Web pages, blog
articles, news, email, literature…)?
Two subtopics:
 Information retrieval: how can we connect the right information with
the right users at the right time with minimum or no user effort?
 Text mining: How can we automatically discover useful knowledge
from text? How can we mine text data together with non-textual data in
an integrative manner?
Applications: Web, biomedical, health, education, …
Research methodology:
 Emphasize general & principled solutions without manual effort
 Mainly use statistical models, machine learning, and natural
language processing techniques
Sample Project: Latent Aspect Rating Analysis
How to infer aspect ratings?
How to infer aspect weights?
Value
Location
Service
…..
Value
Location
Service
Solution: Latent Rating Regression Model
Aspect Segmentation
Reviews + overall ratings
+
Latent Rating Regression
Aspect segments
Term weights
location:1
amazing:1
walk:1
anywhere:1
0.0
0.9
0.1
0.3
0.1
0.7
0.1
0.9
0.6
0.8
0.7
0.8
0.9
room:1
nicely:1
appointed:1
comfortable:1
nice:1
accommodating:1
smile:1
friendliness:1
attentiveness:1
Topic model for aspect discovery
Aspect Rating
Aspect Weight
1.3
0.2
1.8
0.2
3.8
0.6
Aspect-Based Opinion Summarization
Reviewer Behavior Analysis & Personalized
Ranking of Entities
People like
expensive hotels
because of
good service
Query: 0.9 value
0.1 others
Non-Personalized
Personalized
People like cheap
hotels because of
good value
Tandy Warnow
Computational Phylogenetics and
Metagenomics
Courtesy of the Tree of Life project
Gene Tree Estimation: first align, then
construct the tree
S1
S2
S3
S4
=
=
=
=
AGGCTATCACCTGACCTCCA
TAGCTATCACGACCGC
TAGCTGACCGC
TCACGACCGACA
S1
S1
S2
S3
S4
=
=
=
=
-AGGCTATCACCTGACCTCCA
TAG-CTATCAC--GACCGC-TAG-CT-------GACCGC--------TCAC--GACCGACA
S2
Sounds easy, but every good approach
is NP-hard, and statistical methods
(based on stochastic models of evolution)
are very slow.
S4
S3
Accuracy is essential, datasets are
big, and they are also messy.
Species tree estimation is even harder, because
gene trees can be different from the species tree!
Avian Phylogenomics Project
Erich Jarvis,
HHMI
MTP Gilbert,
Copenhagen
G Zhang,
BGI
• Approx. 50 species, whole genomes
T. Warnow
UT-Austin
S. Mirarab Md. S. Bayzid
UT-Austin
UT-Austin
Plus many many other people…
• 8000+ genes, UCEs
Challenges:
Maximum likelihood tree estimation on multi-million-site sequence alignments
Massive gene tree incongruence
My students and I developed a new technique (“Statistical Binning”)
to enable a statistical estimation of the avian species tree, taking
gene tree incongruence into account (both papers under review in
Science)
1kp: Thousand Transcriptome Project
G. Ka-Shu Wong
U Alberta
J. Leebens-Mack
U Georgia
N. Wickett
Northwestern
N. Matasci
iPlant
T. Warnow,
UT-Austin
S. Mirarab,
UT-Austin
N. Nguyen,
UT-Austin
Md. S.Bayzid
UT-Austin
Plus many many other people…
Plant Tree of Life based on transcriptomes of ~1200 species

More than 13,000 gene families (most not single copy)
Gene Tree Incongruence

Challenges:
Multiple sequence alignment of datasets with > 100,000 sequences
Gene tree incongruence
My students and I developed
ASTRAL – a technique to estimate species trees on large datasets (ECCB),
and used it to analyze this dataset (under review in PNAS)
UPP – new multiple sequence alignment method that can analyze up to
1,000,000 sequences (in preparation)
Computer science and mathematics issues:
Heuristics for NP-hard optimization problems
Graph algorithms
Statistical estimation on messy data
Mining sets of trees/alignments
High Performance Computing
Mathematical modelling
Probabilistic analysis of algorithms
Bioinformatics Problems
Multiple Sequence Alignment
Gene Tree Estimation
Species tree estimation (when gene trees conflic
Genome rearrangement phylogeny
Phylogenetic network estimation
Metagenomic data analysis
Also: Computational Historical Linguistics
Jian Peng: Machine learning for
computational biology
Biological
data
Machine
learning
Knowledge
Biological data integration
Disease-related genes
Functional homologs
Hypothesis
Experiments
Biological data integration: network biology
 Modeling information diffusion on biological networks
 Integrating networks from multiple species
 Inference of gene function from network data
Other computational biology projects

Protein science




Translational bioinformatics



Structure prediction
Protein folding
Viral proteins
Drug discovery and optimization
Drug repositioning
Genomics


Large-scale read mapping
Algorithms for genome assembly
Machine learning: modeling complex data

Graphical models





Learning representations for heterogeneous data


Latent variable models
Efficient learning and inference algorithms
Causal/correlation structures
Applications to protein folding, gene expression analysis and
biological network construction
Low-dimensional embedding for network, text and molecular
data sets
Learning structured prediction with complex loss functions

Applications to biology, computer vision, speech recognition
and natural language processing
Saurabh Sinha: Bioinformatics
How is information about us encoded in our DNA ?
How does this information evolve, giving rise to what Darwin called “endless forms
most beautiful”?
Research questions:
 Gene regulation: How are genes turned on and off in precisely orchestrated ways?
 Comparative genomics: What can we learn by comparing genomes of tens of
different species?
 Regulatory evolution: Can we build a mathematical model of evolution?
 Genomics of behavior: How does DNA encode animal behavior ?
http://www.sinhalab.net/
39
Genomics of behavior: honeybee
•
•
•
•
What causes older bees to be more aggressive than younger ones?
What causes Africanized bees to be more aggressive than European ones?
What causes a bee to become aggressive if you annoy them?
DNA sequence analysis shows that origins of aggression are the same !
Genomics of aging
Find genes associated
with aging, by searching
the DNA sequence for
certain patterns.
Knock down one such
gene; old cells became
young !
A complete bioinformatics pipeline
From cells … to data … to analysis … to hypotheses & experiments
QUESTIONS?
43