Computational Challenges in BIG DATA
Takeaki Uno
National Institute of Informatics
& Graduate University for Advanced Studies
28/Apr/2012 China-Korea-Japan Workshop
Self Introduction,
for Better Understanding
Takeaki Uno: National Institute of Informatics
(an institute for promoting joint research)
Research Area: Algorithms (data mining, genome science)
+ Enumeration, Graph algorithms, Pattern mining, Similarity search
(Tree algorithms, Dynamic programming, NP-hardness, Reverse
search,…)
+ Implementations of data mining algorithms,
string similarity search, …
+ Web systems for small-business optimization
+ Problem solver, or implementer, for research colleagues
Difficulties of Big Data
• One of the main recent topics in informatics is BIG DATA
→ I will give my opinions on
“what kind of problems shall we study” and
“what kind of approaches shall we take”
from the viewpoint of theoretical computer science
• The difficulties of big data are, roughly:
- modeling: noise, diversity, overly symbolic data, ill-formulated problems
- computation: time/memory efficiency, online/dynamic settings, …
• Here we do not consider topics from the natural sciences/engineering
- devices, architectures, social implementations, …
Profiling Big Data
• The features of big data are:
- huge size (of course)
- noisy: accuracies differ, and the objectives for collecting the data differ
- individuality: many kinds of records (people)
- very sparse (as a graph), but locally dense (has localities)
• We have disadvantages, but at the same time we have advantages
- data can be re-used for other objectives (so the cost is low)
- accuracy increases automatically when many records are used
→ Locality/minority is important: we have to classify the data (or
extract the necessary parts) according to the objectives of the analysis
Objectives for Big Data
• Big data is used for many objectives by different users, and the life
span of a problem may not be long, owing to rapid changes in the world
→ as fundamental research, we should not focus on each
specific application
• To be useful as fundamental tools in application areas
with rapid change and diversity:
+ abstract general problems common to many applications
+ methods (algorithms, processes) should be as simple as possible,
and it should be possible to explain clearly what the output is
+ models should be determined according to computational efficiency
Algorithms as “Natural Science”
• For fast computation, algorithms have to fit real-world data
→ we have to observe/analyze the behavior of algorithms on
real-world data, and design the algorithms according to the results
→ this would be a “natural science” approach (there are not many):
feasible hypothesis → verify by experiments
• Parallelization is necessary, but…
since the best algorithms change, we have to redo it from scratch many times
(in optimization, improvements by algorithms > improvements by architecture)
→ extract simple, good problems and parallelize them
A Direction: Finding Similar String Pairs
Motivation: similarity search on string data (especially genomes);
index construction is heavy, LSH (approximate hashing) is not
good here, and parallelization is not easy
Modeling: for a given set S of strings of fixed length l, find all
pairs of strings whose Hamming distance is at most d
→ this avoids the overhead of executing many queries; big data needs
many queries, and Hamming distance is simple/basic
(a brute-force sketch of this problem follows the example below)
Example input strings:
ATGCCGCG
GCGTGTAC
GCCTCTAT
TGCGTTTC
TGTAATGA
...
Similar pairs found (Hamming distance ≤ 2):
• ATGCCGCG & AAGCCGCC
• GCCTCTAT & GCTTCTAA
• TGTAATGA & GGTAATGG
...
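To make the modeling concrete, here is a minimal brute-force sketch of the problem statement (illustrative Python, not the author's implementation; all names are mine): compare every pair and report those within Hamming distance d.

```python
from itertools import combinations

def hamming(a: str, b: str) -> int:
    # number of positions where two equal-length strings differ
    return sum(x != y for x, y in zip(a, b))

def similar_pairs_naive(strings, d):
    # brute-force O(n^2) baseline: report all pairs within Hamming distance d
    return [(s, t) for s, t in combinations(strings, 2) if hamming(s, t) <= d]

# the example strings from the slide (fixed length l = 8, threshold d = 2)
data = ["ATGCCGCG", "GCGTGTAC", "GCCTCTAT", "TGCGTTTC", "TGTAATGA"]
print(similar_pairs_naive(data, 2))
```

This quadratic baseline is what the block-combination method below avoids having to run on the whole input.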
Block Combination
• Consider a partition of each string into k (> d) blocks
→ if two strings are similar, they share at least k−d identical blocks
→ for each string, the candidates for similar strings are those sharing
at least k−d identical blocks
• For each combination of k−d blocks, we find the records having the
same blocks (exact search)
→ we have to do this several times (C(k, d) combinations), but not so
many (see the sketch below)
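A minimal sketch of the block-combination idea as described above (illustrative Python under my reading of the slide; the block boundaries and all names are assumptions): split each string into k blocks, bucket the strings by every combination of k−d block positions, and verify the candidates in each bucket exactly.

```python
from collections import defaultdict
from itertools import combinations

def block_combination_pairs(strings, l, k, d):
    # Find all pairs within Hamming distance d among strings of length l.
    # If dist(s, t) <= d, the mismatches touch at most d blocks, so s and t
    # agree on at least k - d blocks; bucketing by every combination of
    # k - d block positions therefore catches every similar pair.
    assert k > d
    bounds = [round(i * l / k) for i in range(k + 1)]  # k nearly equal blocks
    split = [tuple(s[bounds[i]:bounds[i + 1]] for i in range(k)) for s in strings]

    pairs = set()
    for pos in combinations(range(k), k - d):   # C(k, d) combinations in total
        buckets = defaultdict(list)
        for idx, blocks in enumerate(split):
            buckets[tuple(blocks[p] for p in pos)].append(idx)
        for group in buckets.values():          # exact verification per bucket
            for i, j in combinations(group, 2):
                if sum(x != y for x, y in zip(strings[i], strings[j])) <= d:
                    pairs.add((i, j))
    return [(strings[i], strings[j]) for i, j in pairs]
```

The same pair can show up in several bucketing passes, hence the set for deduplication; the number of passes is C(k, k−d) = C(k, d), which stays small for small d.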
Cost for Mismatches
• No index, thus memory efficient (working space = input data size)
• Worst-case time complexity is O(n²), but it works fine on real-world data
• Pairwise comparison can be parallelized easily (see the sketch below)
→ a good subproblem: linear speed-up even in GPU computing
• Combination with LSH works for many kinds of data:
Euclidean distance, Manhattan distance, …
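Since the slide only states that the pairwise verification parallelizes well, here is one hypothetical way that layout could look in Python (the buckets are independent tasks; the function and parameter names are mine):

```python
from concurrent.futures import ProcessPoolExecutor

def verify_bucket(task):
    # exact pairwise Hamming check within one candidate bucket;
    # buckets share no state, so each is an independent unit of work
    bucket, strings, d = task
    found = []
    for a in range(len(bucket)):
        for b in range(a + 1, len(bucket)):
            i, j = bucket[a], bucket[b]
            if sum(x != y for x, y in zip(strings[i], strings[j])) <= d:
                found.append((i, j))
    return found

def verify_all_parallel(buckets, strings, d):
    # distribute buckets across worker processes and merge the results
    with ProcessPoolExecutor() as pool:
        results = pool.map(verify_bucket, [(b, strings, d) for b in buckets])
    return sorted({pair for part in results for pair in part})
```

Shipping the whole string list with every task is wasteful; a real implementation would share the data (or, as the slide notes, run the comparisons on a GPU), but the point is that the tasks need no coordination beyond merging results.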
Implementation: Performance
• Comparison of mouse X and human X chromosomes
(150MB each, with 10% error): 15 min. on a PC
• Note: BLASTZ takes 2 weeks; MURASAKI takes 2-3 hours with 1% error