Computational Challenges in BIG DATA
Takeaki Uno
National Institute of Informatics
& Graduate School for Advanced Studies
28/Apr/2012 China-Korea-Japan Workshop
Self Introduction, for Better Understanding
Takeaki Uno: National Institute of Informatics
(institute for activating joint research)
Research Area: Algorithms (data mining, genome science)
+ Enumeration, graph algorithms, pattern mining, similarity search
(tree algorithms, dynamic programming, NP-hardness, reverse search, …)
+ Implementations of data mining algorithms, string similarity search, …
+ Web systems for small-business optimization
+ Problem solver, or implementer, for research colleagues
Difficulties with Big Data
• One of the main recent topics in informatics is BIG DATA
I will give my opinions on
"what kinds of problems we should study" and
"what kinds of approaches we should take",
from the viewpoint of theoretical computer science
• The difficulties with big data are, roughly:
modeling: noise, diversity, overly symbolic data, ill-formulated problems
computation: time / memory efficiency, online / dynamic settings, …
• Here we do not consider topics from the natural sciences / engineering
- devices, architectures, social implementations, …
Profiling Big Data
• The features of big data are:
- huge size; of course
- noisy; accuracies differ, and the objectives for collecting the data differ
- individuality; many kinds of records (people)
- very sparse (as a graph), but locally dense (has localities)
• We have disadvantages, but at the same time we have advantages
- data can be re-used for other objectives (so the cost is low)
- accuracy increases automatically when many records are used
Locality/minority is important: we have to classify the data (or
extract the necessary parts) according to the objectives of the analysis
Objectives for Big Data
• Big data is used for many objectives by different users, and the life span
of a problem may be short, owing to the rapid changes of the world
as fundamental research, we should not focus on each
specific application
• To be useful as fundamental tools in application areas
with rapid change and diversity:
+ abstract general problems that are common to many applications
+ methods (algorithms, processes) should be as simple as possible,
and what the output is should be clearly explainable
+ models should be determined according to computational efficiency
Algorithms as “Natural Science”
• For fast computation, algorithms have to fit real-world data
we have to observe / analyze the behavior of algorithms on
real-world data, and design the algorithms according to the results
this would be a "natural science approach" (there are not many):
form feasible hypotheses and verify them by experiments
• Parallelization is necessary, but…
since the best algorithms keep changing, we have to redo it from scratch many times
(in optimization, gains from algorithms > gains from architecture)
so we should extract simple, good problems and parallelize those
A Direction; find similar string pairs
Motivation: similarity search on string data (especially genomes)
index construction is heavy, LSH (approximate hashing) does not
work well, and parallelization is not easy
Modeling: given a set S of strings of fixed length l, find all
pairs of strings whose Hamming distance is at most d
this avoids the overhead of executing many queries (big data needs
many queries), and Hamming distance is simple/basic
Example input strings:
ATGCCGCG, GCGTGTAC, GCCTCTAT, TGCGTTTC, TGTAATGA, …
Example output pairs:
ATGCCGCG & AAGCCGCC, GCCTCTAT & GCTTCTAA, TGTAATGA & GGTAATGG, …
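As a reference point, here is a minimal Python sketch of the problem as stated: a brute-force baseline that checks every pair. The function names are illustrative (not from the talk); this is the quadratic computation that the block-combination method on the next slide is designed to beat in practice.

from itertools import combinations

def hamming(s, t):
    # number of positions where fixed-length strings s and t differ
    return sum(a != b for a, b in zip(s, t))

def all_pairs_naive(strings, d):
    # brute-force baseline: report every pair within Hamming distance d
    return [(s, t) for s, t in combinations(strings, 2) if hamming(s, t) <= d]

S = ["ATGCCGCG", "AAGCCGCC", "GCCTCTAT", "GCTTCTAA", "TGTAATGA", "GGTAATGG"]
print(all_pairs_naive(S, 2))  # reports the three example pairs above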
Block Combination
• Consider a partition of each string into k (> d) blocks
if two strings are similar, the d mismatches can touch at most d blocks,
so the two strings share at least k-d identical blocks (pigeonhole)
for each string, the candidate similar strings are those sharing
at least k-d identical blocks
• For every combination of k-d block positions, we find the records having the
same blocks there (exact search)
we have to repeat this C(k, k-d) times, but that is not so many
(a minimal sketch of the procedure follows below)
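Below is a minimal Python sketch of this block-combination idea, under my own assumptions about block splitting and pair deduplication; it is illustrative, not the implementation from the talk.

from collections import defaultdict
from itertools import combinations

def hamming(s, t):
    return sum(a != b for a, b in zip(s, t))

def block_combination(strings, l, k, d):
    # all pairs with Hamming distance <= d among strings of length l;
    # needs k > d so that a similar pair shares at least k-d exact blocks
    bounds = [i * l // k for i in range(k + 1)]      # block boundaries
    split = lambda s: [s[bounds[i]:bounds[i + 1]] for i in range(k)]
    found = set()
    for positions in combinations(range(k), k - d):  # C(k, k-d) rounds
        buckets = defaultdict(list)
        for idx, s in enumerate(strings):
            blocks = split(s)
            buckets[tuple(blocks[p] for p in positions)].append(idx)
        for group in buckets.values():               # exact match on chosen blocks
            for i, j in combinations(group, 2):      # verify the few candidates
                if hamming(strings[i], strings[j]) <= d:
                    found.add((i, j))
    return [(strings[i], strings[j]) for i, j in sorted(found)]

S = ["ATGCCGCG", "AAGCCGCC", "GCCTCTAT", "GCTTCTAA", "TGTAATGA", "GGTAATGG"]
print(block_combination(S, l=8, k=4, d=2))

With k = 4 and d = 2, a pair within distance 2 keeps at least 2 of its 4 blocks intact, so it is found in one of the C(4,2) = 6 rounds.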
Cost for Mismatches
• No index, thus memory efficient (work space = input data size)
• Worst-case time complexity is O(n²), but it works fine on real-world data
• The pairwise comparison can be parallelized easily
a good subproblem: linear speed-up even in GPU computing
(see the sketch after this list)
• Combination with LSH works for many kinds of data
Euclidean distance, Manhattan distance, …
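To illustrate the "good subproblem" point, here is a hedged sketch of the candidate verification step parallelized with Python's multiprocessing; the names are my own, and the same embarrassingly parallel pattern is what maps well to GPU threads.

from multiprocessing import Pool

def hamming_of_pair(pair):
    s, t = pair
    return sum(a != b for a, b in zip(s, t))

def verify_parallel(candidate_pairs, d, workers=4):
    # each comparison is independent, so the work splits evenly across
    # processes and the speed-up scales roughly linearly with workers
    with Pool(workers) as pool:
        dists = pool.map(hamming_of_pair, candidate_pairs)
    return [p for p, dist in zip(candidate_pairs, dists) if dist <= d]

if __name__ == "__main__":
    pairs = [("ATGCCGCG", "AAGCCGCC"), ("ATGCCGCG", "GCGTGTAC")]
    print(verify_parallel(pairs, d=2))  # keeps only the first pair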
Implementation; performance
Comparison of the mouse X and human X chromosomes
(150MB each, with 10% error):
- this method: 15 min. on a PC
- BLASTZ: 2 weeks
- MURASAKI: 2-3 hours, with 1% error