Transcript lot
Hadoop Technical
Workshop
Academic Hadoop Usage
Overview
• University of Washington Curriculum
– Teaching Methods
– Reflections
– Student Background
– Course Staff Requirements
UW: Course Summary
• Course title: “Problem Solving on Large
Scale Clusters”
• Primary purpose: developing large-scale
problem solving skills
• Format: 6 weeks of lectures + labs, 4 week
project
UW: Course Goals
• Think creatively about large-scale
problems in a parallel fashion; design
parallel solutions
• Manage large data sets under memory,
bandwidth limitations
• Develop a foundation in parallel algorithms
for large-scale data
• Identify and understand engineering tradeoffs in real systems
Lectures
•
•
•
•
2 hours, once per week
Half formal lecture, half discussion
Mostly covered systems & background
Included group activities for reinforcement
Classroom Activities
• Worksheets included pseudo-code
programming, working through examples
– Performed in groups of 2—3
• Small-group discussions about
engineering and systems design
– Groups of ~10
– Course staff facilitated, but mostly openended
Readings
• No textbook
• One academic paper per week
– E.g., “Simplified Data Processing on Large
Clusters”
– Short homework covered comprehension
• Formed basis for discussion
Lecture Schedule
•
•
•
•
•
•
Introduction to Distributed Computing
MapReduce: Theory and Implementation
Networks and Distributed Reliability
Real-World Distributed Systems
Distributed File Systems
Other Distributed Systems
Intro to Distributed Computing
•
•
•
•
What is distributed computing?
Flynn’s Taxonomy
Brief history of distributed computing
Some background on synchronization and
memory sharing
MapReduce
• Brief refresher on functional programming
• MapReduce slides
– More detailed version of module I
• Discussion on MapReduce
Networking and Reliability
• Crash course in networking
• Distributed systems reliability
– What is reliability?
– How do distributed systems fail?
– ACID, other metrics
• Discussion: Does MapReduce provide
reliability?
Real Systems
• Design and implementation of Nutch
• Tech talk from Googler on Google Maps
Distributed File Systems
• Introduced GFS
• Discussed implementation of NFS and
AndrewFS for comparison
Other Distributed Systems
• BOINC: Another platform
• Broader definition of distributed systems
– DNS
– One Laptop per Child project
Labs
• Also 2 hours, once per week
• Focused on applications of distributed
systems
• Four lab projects over six weeks
Lab Schedule
• Introduction to Hadoop, Eclipse Setup,
Word Count
• Inverted Index
• PageRank on Wikipedia
• Clustering on Netflix Prize Data
Design Projects
• Final four weeks of quarter
• Teams of 1—3 students
• Students proposed topic, gathered data,
developed software, and presented
solution
Example: Geozette
Image © Julia Schwartz
Example: Galaxy Simulation
Image © Slava Chernyak, Mike Hoak
Other Projects
• Bayesian Wikipedia spam filter
• Unsupervised synonym extraction
• Video collage rendering
Ongoing research: traceroutes
• Analyze time-stamped traceroute data to
model changes in Internet router topology
– 4.5 GB of data/day * 1.5 years
– 12 billion traces from 200 PlanetLab sites
• Calculates prevalence and persistence of
routes between hosts
Ongoing research: dynamic
program traces
• Dynamic memory trace data from
simulators can reach hundreds of GB
• Existing work focuses on sampling
• New capability: record all accesses and
post-process with Hadoop
Common Features
• Hadoop!
• Used publicly-available web APIs for data
• Many involved reading papers for
algorithms and translating into
MapReduce framework
Background Topics
• Programming Languages
• Systems:
– Operating Systems
– File Systems
– Networking
• Databases
Programming Languages
• MapReduce is based on functional
programming map and fold
• FP is taught in one quarter, but not
reinforced
– “Crash course” necessary
– Worksheets to pose short problems in terms
of map and fold
– Immutable data a key concept
Multithreaded programming
• Taught in OS course at Washington
– Not a prerequisite!
• Students need to understand multiple
copies of same method running in parallel
File Systems
• Necessary to understand GFS
• Comparison to NFS, other distributed file
systems relevant
Networking
• TCP/IP
• Concepts of “connection,” network splits,
other failure modes
• Bandwidth issues
Other Systems Topics
• Process Scheduling
• Synchronization
• Memory coherency
Databases
• Concept of shared consistency model
• Consensus
• ACID characteristics
– Journaling
– Multi-phase commit processes
Course Staff
• Instructor (me!)
• Two undergrad teaching assistants
– Helped facilitate discussions, directed labs
• One student sys admin
– Worked only about three hours/week
Preparation
• Teaching assistants had taken previous
iteration of course in winter
• Lectures retooled based on feedback from
that quarter
– Added reasonably large amount of
background material
• Ran & solved all labs in advance
The Course: What Worked
• Discussions
– Often covered broad range of subjects
• Hands-on lab projects
• “Active learning” in classroom
• Independent design projects
Things to Improve: Coverage
• Algorithms were not reinforced during
lecture
– Students requested much more time be spent
on “how to parallelize an iterative algorithm”
• Background material was very fast-paced
Things to Improve: Projects
• Labs could have used a
moderated/scripted discussion component
– Just “jumping in” to the code proved difficult
– No time was devoted to Hadoop itself in
lecture
– Clustering lab should be split in two
• Design projects could have used more
time
Future Course Ideas Overview
•
•
•
•
•
Systems course
Web application design
Integration in other applications courses
Misc. content ideas
Making your own data sets
Systems Course
• Focused on parallel & distributed systems
• Hadoop included in comparison to other
cluster techniques
• Emphasis on performance, profiling, and
management
Topic Map
MapReduce
Distributed FS
Condor
IPC mechanisms
n-phase
commit
Communication networks
and topologies
consensus
sockets
Definitions of failure and reliability
MPI
Introductory Material
• Networking basics
• Multithreading
Distributed Reliability
• Reliability metrics
• Methods of failure
• Techniques to combat failure
– Journaling, n-phase commit
• Techniques to achieve consensus
– Leader election, voting
Parallel Processing
• How to parallelize algorithms
• Parallelization in one machine vs. across
several machines
– Techniques applicable to one vs. other
– Cache coherency
– Memory distribution
Parallelization Frameworks
• Multithreading on one machine
• RPC, MPI, PVM
• Higher-level scheduling
– Condor vs. Hadoop
• Tradeoffs in design
Algorithm Design Comparison
• Matrix multiplication, sorting, searching,
PageRank, etc
– … For a standard distributed system
– … For Hadoop
Distributed Storage
• NFS, AFS, GFS
• Database clustering techniques
– Distributed SQL databases
– HBase
– Distributed memory caches, object stores
Lab Focus
• Implementing parallel and distributed
algorithms
• Experiment with different frameworks
• Perform measurements
– Bandwidth consumption
– Latency & performance
• Code analysis
Final Thoughts
• Lots of low-level programming involved
• Appropriate mostly for last-year students
• Hadoop community would find scholarly
benchmarks useful
– wiki.apache.org/hadoop/ProjectSuggestions
– “JIRA” bug/feature request database
Web Application Design
Basic Web Development Topics
3-tier apps, e-commerce requirements
LAMP Stack / Ruby on Rails
HTML
XML
HTTP
Large-Scale Web Server
Technology
Amazon Web Services
Nutch/Lucene, Hadoop
RPC Systems
SOAP/XMLHttpRequest
Distributed System Basics
Next Steps
• RPC
– Internal RPC; message queues and
distributed back-ends
– Thrift, ProtocolBuffers
– SOAP and XMLHttpRequest
Scaling Really Big
• Nutch/Lucene
• Hadoop
• Amazon Web Services
Data Aggregation and Analysis
•
•
•
•
How to crawl and parse web pages
Generate link graphs
Perform analyses (e.g., PageRank)
Semantic analysis
Web Site Tuning
• Web page layout optimization
– Speed
– Accessibility
– Ease-of-use
• Server log analysis
– User-targeted site features
• Service replication
– Consistency, latency issues
Security and the Web
•
•
•
•
Data sanitization
SQL injection attacks
DOS attacks
Data collection methods & ethics
– User data privacy
Projects
• Code labs in Python, PHP, Ruby
• Simple database design
• Building a small search engine with
Nutch/Lucene
• Design scalable architecture and run on
Amazon EC2
• Web site design project
– Security/penetration analysis of other teams’
sites
Final Course Thoughts
• Web-based services are increasingly
relevant
– Exciting new opportunity for students
– Example course in action:
www.cs.washington.edu/education/courses/
cse454/07au/
Using Hadoop in Other Courses
• Hadoop is a natural component for many
existing courses
– Artificial intelligence
– Web search
– Data mining / information retrieval
– Databases (HBase)
– Networking
– Computational biology? Graphics?
Low Level Module
• “MapReduce in a week:”
code.google.com/edu/content/parallel.html
• 3-lecture series on distributed processing
and Hadoop; enough to get students
started
• … more discussion of online resources
next
AI/Data Mining Ideas
• Use Nutch to perform a web crawl and
classify pages using Bayesian analysis
• Hadoop makes processing easy
– Data sanitization
– Classifier engine (Use WEKA right in Hadoop)
– HDFS for document storage/retrieval/search
AI/Data Mining Ideas
• Extract semantically valuable data from
web pages
– E.g., match names to phone numbers,
– News articles to locations
• Hadoop allows students to explore a much
broader scale than previously possible
Graphics Examples
• Re-encode a render pipeline as a set of
MapReduce tasks
• Use feature detection + clustering on a
corpus of images to find images with
similar shapes/features
Student-Generated Ideas
• Data processing with Yahoo Pig
• Distributed SQL databases
• Distributed systems “ground-up” projects:
– Sockets, then RPC, then Hadoop
• Other concepts: Bittorrent, DHTs, P2P
• Other frameworks: e.g., BOINC projects
Making Datasets
• Your department is full of data!
– Graphics data
– Sensor data from RFID, Ubicomp, robotics…
– Measurements from networking lab
– Ask around: Someone has a few dozen gigs
of log files to donate
– (What happens if you leave Ethereal in
promiscuous mode for a week straight?)
Making Datasets
• Other departments are full of data!
– Biology
– Chemistry
– Physics (campus particle accelerator?)
Making Datasets
• The web is full of data!
– Use Nutch to crawl web sites
– Wikis are especially good (hmm..)
Conclusions
• Hadoop isn’t a full course in itself
– But it combines well with a lot of other ideas
• Can be used for at least a half a course
• … Or as little as a week or two
• Look around you – Hadoop can be applied
to more areas than you might think
Open Source Tools for Teaching
Overview
•
•
•
•
•
Slides
Lab Materials
Readings
Video Lectures
Datasets
http://code.google.com/edu
Slides
• Multiple short course outlines available:
• “MapReduce in a week”
• “Introduction to Problem Solving on Large Scale Clusters”
• “MapReduce Mini Lecture Series”
Labs
• Lab designs from UW course available
– “Introduction to MapReduce”
– “A Simple Inverted Index”
– “PageRank on the Wikipedia Corpus”
– “Clustering the Netflix Movie Data”
Readings
• Google has several papers available
– “Introduction to Distributed Systems”
– “MapReduce: Simplified Data Processing on
Large Scale Clusters”
– “The Google File System”
– “BigTable: A Distributed Storage System for
Structured Data”
http://research.google.com/pubs/papers.html
Lecture Videos
MapReduce Mini-series
Datasets: Wikipedia
• Wikipedia supports free “bulk download” of
data
– Current site snapshot (big)
– Entire revision history (massive)
• Eliminates need for Nutch crawls
• Good for indexing, search labs
http://download.wikimedia.org
Datasets: Netflix
• Netflix’s web site provides
recommendations
• Theory: Other people watched movie X,
then Y. You watched X, you might like Y.
• Open question: Can you provide more
useful recommendations than their current
system?
Datasets: Netflix
• The Netflix Prize: $1,000,000 if you can
find a better algorithm, based on their
criteria
• They provide you with a large dataset of
existing rental associations to work with
www.netflixprize.com
Conclusions
• Lots of starter materials available on the
web
– Good for reference
– Get teaching assistants up to speed
• Readings, sample worksheets and other
resources are open content & ready to use
Aaron Kimball
[email protected]