Big Data Research Progress

Chao
Jan 22, 2013
Big Data Lab
• Big Data@CSAIL, MIT
– http://bigdata.csail.mit.edu/
– 23 nodes
– GROWING BIG LINKED DATA FROM SEED: BUILDING A DEMO
– VISION MACHINE: LEARNING ONLINE FROM 25 MILLION IMAGES
– NATURAL LANGUAGE INTERFACE FOR BIG DATA
– SCIDB
– MACHINE LEARNING
– SOCIAL: CONDENSR
– SOCIAL: TWITINFO
– SOCIAL: INFLUENCE MODELING
– …
Big Data Lab
• NASA tournament lab
– http://www.nasa.gov/directorates/heo/ntl/
• Big data challenge
– http://open.nasa.gov/blog/2012/10/03/nasatournament-labs-big-data-challenge/
– Apply the process of open innovation to
conceptualizing new and novel approaches to
using “big data” information sets from various U.S.
government agencies, e.g., health, energy and
earth science.
Big Data People
• Jimmy Lin (University of Maryland)
– http://www.umiacs.umd.edu/~jimmylin/
• Ron Bekkerman (LinkedIn)
– http://people.cs.umass.edu/~ronb/
• Misha Bilenko (MSR)
– http://research.microsoft.com/en-us/um/people/mbilenko/
• John Langford (Yahoo! Research)
– http://hunch.net/~jl/
Tutorial
• Scaling Up Machine Learning: Parallel and
Distributed Approaches
• KDD’2011
• Ron Bekkerman (LinkedIn), Misha Bilenko
(MSR) and John Langford (Yahoo! Research)
• http://hunch.net/~large_scale_survey/
Tutorial
• State-of-the-art platforms and algorithm choices
• Hardware options (from FPGAs and GPUs to multi-core
systems and commodity clusters)
• Programming frameworks (including CUDA, MPI,
MapReduce, and DryadLINQ)
• Learning settings (e.g., semi-supervised and online
learning)
• Example-driven, covering a number of popular
algorithms (e.g., boosted trees, spectral clustering,
belief propagation) and diverse applications (e.g.,
speech recognition and object recognition in vision).
Parallelization: platform choices
Platform          Communication Scheme   Data size
Peer-to-Peer      TCP/IP                 Petabytes
Virtual Clusters  MapReduce / MPI        Terabytes
HPC Clusters      MPI / MapReduce        Terabytes
Multicore         Multithreading         Gigabytes
GPU               CUDA                   Gigabytes
FPGA              HDL                    Gigabytes
The Book
• Cambridge University Press
• Due in November 2011
• 21 chapters
• Covering
– Platforms
– Algorithms
– Learning setups
– Applications
Chapter contributors
[Figure: contributors to chapters 2 through 21]
New age of big data
• The world has gone mobile
– 5 billion cellphones produce daily data
• Social networks have gone online
– Twitter produces 200M tweets a day
• Crowdsourcing is the reality
– Labeling of 100,000+ data instances is doable
within a week
Big Data Data
• DATA.GOV
– http://www.data.gov/developers/community/developers
– Data portal provided by US government
Big Data in Q&A
• It is estimated that 2.5 quintillion bytes of new
data are created daily with an estimated 80% of
this produced as "unstructured" data
• IBM Watson deep Q&A
– http://www.research.ibm.com/articles/watson.shtml
– Evidence-based decision support
– Jeopardy!
– Provide a single correct answer with confidence
– Analyze over 200 million pages in three seconds
Big Data in Q&A
• IBM Watson deep Q&A
– Health care
• 2011, pilot program with WellPoint, whose affiliated
health plans cover one in nine Americans
• 2012, partnership with Memorial Sloan-Kettering
Cancer Center, where work is under way to teach
Watson about oncology diagnosis and treatment
options
Big Data Blog
• http://whatsthebigdata.com/
– News and events about Big Data
• http://www.greenplum.com/industrybuzz/big-data/research-papers
– News and research papers about Big Data
Big Data Publication
• Fast Data in the Era of Big Data: Twitter’s Real-Time Related Query
Suggestion Architecture
• http://arxiv.org/pdf/1210.7350v1.pdf
• Architecture behind Twitter's real-time related query suggestion
and spelling correction service
– First implementation: typical Hadoop-based analytics stack, did
not meet the latency requirement
– Second implementation: system deployed in production, custom
in-memory processing engine
Big Data Publication
• Fast Candidate Generation for Two-Phase Document Ranking:
Postings List Intersection with Bloom Filters
• http://www.umiacs.umd.edu/~jimmylin/publications/Asadi_Lin_CIKM2012.pdf
• Most modern web search engines employ a two-phase ranking
strategy: a candidate list of documents is generated using a “cheap”
but low-quality scoring function, which is then reranked by an
“expensive” but high-quality method
• Candidate generation for conjunctive query processing in this
context
• Fast, approximate postings list intersection algorithms based on
Bloom filters
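The candidate-generation idea above can be illustrated with a minimal sketch. It is not the algorithm from the paper: it only assumes that each term's postings list (sorted document IDs) has a companion Bloom filter, and that a conjunctive query scans the shortest list while probing the other terms' filters. The names BloomFilter, build_filter, and approximate_intersection, and the parameters bits_per_doc and num_hashes, are hypothetical and chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter over document IDs (illustrative only)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, doc_id):
        # Derive num_hashes bit positions from one digest (double hashing).
        digest = hashlib.sha256(str(doc_id).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, doc_id):
        for pos in self._positions(doc_id):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, doc_id):
        # May return True for an ID that was never added (false positive),
        # but never returns False for an ID that was added.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(doc_id)
        )


def build_filter(postings, bits_per_doc=16, num_hashes=4):
    """Build a Bloom filter sized in proportion to the postings list."""
    bloom = BloomFilter(max(8, bits_per_doc * len(postings)), num_hashes)
    for doc_id in postings:
        bloom.add(doc_id)
    return bloom


def approximate_intersection(postings_by_term, filters_by_term, query_terms):
    """Return candidate documents that probably contain every query term."""
    terms = sorted(query_terms, key=lambda t: len(postings_by_term[t]))
    shortest, rest = terms[0], terms[1:]
    return [
        doc_id
        for doc_id in postings_by_term[shortest]
        if all(doc_id in filters_by_term[t] for t in rest)
    ]


if __name__ == "__main__":
    # Toy index with made-up document IDs.
    index = {
        "big": [1, 2, 3, 5, 8, 13],
        "data": [2, 3, 5, 7, 11, 13],
        "lab": [3, 5, 13, 21],
    }
    filters = {term: build_filter(postings) for term, postings in index.items()}
    print(approximate_intersection(index, filters, ["big", "data", "lab"]))
```

The result always contains the true intersection and may contain a few extra documents. In a two-phase setup that is tolerable, because the second-phase reranker scores every candidate anyway.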
Big Data Publication
• Why Not Grab a Free Lunch? Mining Large Corpora for Parallel
Sentences to Improve Translation Modeling
– http://www.umiacs.umd.edu/~jimmylin/publications/Ture_Lin_NAACL-HLT2012.pdf
• Large-Scale Machine Learning at Twitter
– http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Kolcz_SIGMOD2012.pdf
• Smoothing Techniques for Adaptive Online Language Models: Topic
Tracking in Tweet Streams
– http://www.umiacs.umd.edu/~jimmylin/publications/Lin_etal_KDD2011.pdf
Big Data Book
• Data-Intensive Text Processing with
MapReduce
• http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf