
Big Data Analytics
Carlos Ordonez
Big Data Analytics research
• Input? BIG DATA (large data sets, large files,
many documents, many tables, fast growing)
• How? Fast external algorithms; memory-efficient
data structures at two storage levels; parallel:
multi-threaded or multi-node
• Efficiency goal: linear time O(n) and linear
speedup
• Hardware? single node or parallel cluster
• Infrastructure? parallel file system; any large files
• Challenging: Theory+programming in action
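A minimal sketch of the "fast external algorithm" idea above, assuming the input is a file of raw binary doubles (the slides do not specify a format). The names Summary and summarize_stream are hypothetical. The data is read in fixed-size blocks, so memory stays constant while the scan runs in O(n) time with one sequential pass.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>  // only needed for the in-memory usage example
#include <string>
#include <vector>

// Running summary of a stream of doubles: count, sum, and sum of squares.
struct Summary {
    std::size_t n = 0;
    double sum = 0.0;
    double sum_sq = 0.0;
};

// Reads raw binary doubles from `in` in blocks of `block_size` values.
// Only one block is held in memory at a time, and every value is touched
// exactly once, so the scan is O(n) regardless of how large the input is.
Summary summarize_stream(std::istream& in, std::size_t block_size) {
    assert(block_size > 0);
    std::vector<double> buffer(block_size);
    Summary s;
    while (in) {
        in.read(reinterpret_cast<char*>(buffer.data()),
                static_cast<std::streamsize>(block_size * sizeof(double)));
        // gcount() reports how many bytes the last read actually delivered,
        // which may be a partial block at end of input.
        const std::size_t got =
            static_cast<std::size_t>(in.gcount()) / sizeof(double);
        for (std::size_t i = 0; i < got; ++i) {
            s.n += 1;
            s.sum += buffer[i];
            s.sum_sq += buffer[i] * buffer[i];
        }
    }
    return s;
}
```

The same function works on a large file opened with std::ifstream in binary mode or on a std::istringstream wrapping raw bytes for testing. The mean and variance can be derived from the three accumulated values without a second pass.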
Systems research today
• Transaction processing? Main memory, lock-free
• Efficient analysis? Optimal joins, compiled queries,
streams, exploit ample RAM, exploit multi-core
• Compiler versus interpreter?
• Massive storage? Posix, HDFS
• Fast external algorithms? Simple tasks.
• Parallel computation? Multi-core with threads, shared-nothing, message-passing
• Exploiting new hardware? Difficult/customized
• Analyzing: queries, cubes, statistics, machine learning
• Hot today: Information integration (database+files)
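As an illustration of "multi-core with threads", here is a small sketch, not taken from the slides, of a partitioned aggregate. The function name parallel_sum is hypothetical. Each thread sums its own slice and writes to its own slot, so no locks are needed; with enough data per thread this pattern is where near-linear speedup would typically come from.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sums `data` using `num_threads` worker threads. Each worker owns one
// contiguous slice of the input and one slot of `partial`, so the workers
// never write to shared state and need no synchronization besides join().
double parallel_sum(const std::vector<double>& data, std::size_t num_threads) {
    assert(num_threads > 0);
    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> workers;
    // Ceiling division so that every element falls in some slice.
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(t * chunk, data.size());
        const std::size_t end = std::min(begin + chunk, data.size());
        workers.emplace_back([&data, &partial, t, begin, end] {
            const auto first = data.begin() + static_cast<std::ptrdiff_t>(begin);
            const auto last = data.begin() + static_cast<std::ptrdiff_t>(end);
            partial[t] = std::accumulate(first, last, 0.0);
        });
    }
    for (std::thread& w : workers) w.join();
    // Merging the per-thread results costs only O(num_threads).
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

A call such as parallel_sum(values, 4) splits the work across four threads. With GCC or Clang on Linux the program typically needs to be built with the -pthread flag.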
DB Systems involve Core CS research:
Theory+Programming
Theory we use:
– Time complexity (big O()) and I/O cost (disk, solid state memory)
– Data structures (trees, hash tables, linked lists)
– Relational model and information retrieval models
– Multivariate statistics, machine learning, discrete mathematics, linear
algebra
– Compilers and programming languages: parsing/compiling/optimizing
code; recursion
Programming:
– Languages: mostly C++, but also R, SQL, Java
– Unix, but we have a lot of past work on MS Windows
– Systems: Threads, binary I/O, parallel file systems, code generation, code
optimization, interpreter runtime
Sample of target problems
• Business Intelligence: cubes, lattices
• Bayesian statistics: MCMC, classification, regression,
variable/feature selection
• Big Data summarization: vector outer products
• Graph transitive closure and linear recursion
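To make the last target problem concrete, here is a sketch of transitive closure computed as a linear recursion, T(x,y) <- E(x,y) and T(x,y) <- T(x,z), E(z,y). It uses semi-naive evaluation, a standard technique chosen here for illustration and not necessarily the method used in the group's work. The function name and the edge-list representation are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;

// Transitive closure of a directed graph with vertices 0..num_vertices-1.
// Semi-naive evaluation: each round joins only the paths found in the
// previous round (delta) with the edge relation E, not all of T again.
std::set<Edge> transitive_closure(int num_vertices,
                                  const std::vector<Edge>& edges) {
    assert(num_vertices >= 0);
    // Index E by source vertex so the join T(x,z), E(z,y) is a lookup on z.
    std::vector<std::vector<int>> out(static_cast<std::size_t>(num_vertices));
    for (const Edge& e : edges) {
        assert(e.first >= 0 && e.first < num_vertices);
        assert(e.second >= 0 && e.second < num_vertices);
        out[static_cast<std::size_t>(e.first)].push_back(e.second);
    }

    // Base case: every edge is a path of length one.
    std::set<Edge> closure(edges.begin(), edges.end());
    std::vector<Edge> delta(closure.begin(), closure.end());

    // Recursive case: extend each newly found path by one edge.
    while (!delta.empty()) {
        std::vector<Edge> next;
        for (const Edge& path : delta) {
            for (int y : out[static_cast<std::size_t>(path.second)]) {
                const Edge candidate{path.first, y};
                // insert() reports whether the pair was new; only new
                // pairs need to be extended in the following round.
                if (closure.insert(candidate).second) next.push_back(candidate);
            }
        }
        delta = std::move(next);
    }
    return closure;
}
```

For the chain 0 -> 1 -> 2 -> 3 the closure adds the pairs (0,2), (1,3), and (0,3) to the three original edges. The loop stops once a round discovers no new pair, so cycles in the input do not cause it to run forever.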
Why join the DBMS group?
• Just came back from AT&T Labs (formerly the famous AT&T Bell
Labs). My head is spinning with C++14 and Unix commands.
Currently programming with my PhD students.
• Balance between theory (mathematics) and programming (C++)
• Mature and stable CS research area
• Job prospects upon graduation are excellent. Great opportunity to
join industrial labs.
• Visit my web page or DBLP, Google “Ordonez SQL”, or stop by during
my office hours