JavaZone 2005 Talk - Programvareverkstedet


Java @ Google
JavaZone 2005
Knut Magne Risvik
Google Inc.
September 14, 2005
Presentation Outline
• Background: Google’s mission and computing platform.
• GFS and MapReduce: Ebony and Ivory of our Infrastructure
• Java for computing: Coupling infrastructure and Java
• Java in Google products: apps and middle-tiers
• The Java expertise at Google: We host Java leadership.
• Giving it back: Google contributions to Java
• Closing notes and Q&A: Swag for good questions.
Google’s Mission
To organize the world’s information
and make it universally
accessible and useful
Explosive Computational Requirements
Every Google service sees continuing growth in computational needs:
• More queries: more users, happier users
• More data: bigger web, mailbox, blog, etc.
• Better results: find the right information, and find it faster
A Simple Challenge For Our Computing Platform
1. Create world’s largest computing infrastructure
2. Make sure we can afford it
Need to drive efficiency of the computing infrastructure
to unprecedented levels
Many Interesting Challenges
• Server design and architecture
• Power efficiency
• System software
• Large scale networking
• Performance tuning and optimization
• System management and repairs automation
Design Philosophy
Single-machine performance does not matter
• The problems we are interested in are too large for any single system
• Can partition large problems, so throughput beats peak performance
Stuff Breaks
• If you have one server, it may stay up three years (1,000 days)
• If you have 1,000 servers, expect to lose one a day
“Ultra-reliable” hardware makes programmers lazy
• A reliable platform will still fail – software still needs to be fault-tolerant
• Fault-tolerant software beats fault-tolerant hardware
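The arithmetic behind the "one a day" claim is worth spelling out: a ~1,000-day per-server uptime implies a daily failure probability of roughly 1/1,000, so a fleet of 1,000 servers loses about one machine per day. A tiny illustrative Java calculation (not from the talk, just the slide's numbers plugged in):

// FailureMath.java - back-of-the-envelope failure arithmetic (illustrative only).
public class FailureMath {
    public static void main(String[] args) {
        double mtbfDays = 1000.0;              // one server stays up ~3 years (slide figure)
        double dailyFailureRate = 1.0 / mtbfDays;
        int servers = 1000;
        double expectedFailuresPerDay = servers * dailyFailureRate;
        System.out.printf("Expected failures per day across %d servers: %.1f%n",
                servers, expectedFailuresPerDay);   // prints 1.0
    }
}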
Why Use Commodity PCs?
• Single high-end 8-way Intel server:
  – IBM eserver xSeries 440
  – 8 2-GHz Xeons, 64 GB RAM, 8 TB of disk
  – $758,000
• Commodity machines:
  – Rack of 88 machines
  – 176 2-GHz Xeons, 176 GB RAM, ~7 TB of disk
  – $278,000
• 1/3X price, 22X CPU, 3X RAM, 1X disk
Sources: racksaver.com, TPC-C performance results, both from late 2002
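The ratios in the last bullet follow directly from the two configurations; a small illustrative calculation (not from the talk):

// PriceRatios.java - derives the slide's price/CPU/RAM ratios (illustrative only).
public class PriceRatios {
    public static void main(String[] args) {
        // High-end server vs. rack of commodity machines (figures from the slide)
        double highEndPrice = 758_000, rackPrice = 278_000;
        int highEndCpus = 8, rackCpus = 176;
        int highEndRamGb = 64, rackRamGb = 176;

        System.out.printf("Price: %.2fx%n", rackPrice / highEndPrice);          // ~0.37 -> "1/3X"
        System.out.printf("CPUs:  %.0fx%n", (double) rackCpus / highEndCpus);   // 22x
        System.out.printf("RAM:   %.1fx%n", (double) rackRamGb / highEndRamGb); // ~2.8 -> "3X"
    }
}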
When Ultra-reliable Machines Won’t Help…
Take-home lesson: Murphy was right
[Photos: google.stanford.edu (circa 1997), the Lego disk case, google.com (1999), a Google data center (circa 2000), google.com’s new data center (2001) and the same racks 3 days later, and “When Servers Sleep…” (2004).]
Google Query Serving Infrastructure
[Diagram: a query arrives at the Google Web Server, which consults the spell checker, ad server, and misc. servers, then fans out to index servers (index shards I0…IN, each served by multiple replicas) and doc servers (doc shards D0…DM, each served by multiple replicas).]
Elapsed time: 0.25s, machines involved: 1000+
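The fan-out pattern in the diagram can be illustrated in a few lines of Java: send the query to one replica of every index shard in parallel and merge the partial results. This is a generic scatter/gather sketch only; the class names (ShardFanout, IndexShardClient) are hypothetical and this is not Google's serving code.

// ShardFanout.java - illustrative scatter/gather over sharded index servers.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class ShardFanout {
    interface IndexShardClient {                 // one replica of one index shard
        List<Long> search(String query);         // returns matching doc ids
    }

    private final List<List<IndexShardClient>> replicasByShard; // shard -> its replicas
    private final ExecutorService pool = Executors.newCachedThreadPool();

    ShardFanout(List<List<IndexShardClient>> replicasByShard) {
        this.replicasByShard = replicasByShard;
    }

    // Fan the query out to every shard in parallel, querying one replica per shard,
    // then gather and merge the partial results.
    List<Long> search(String query) throws Exception {
        List<Future<List<Long>>> futures = new ArrayList<>();
        for (List<IndexShardClient> replicas : replicasByShard) {
            IndexShardClient replica = pickReplica(replicas);
            futures.add(pool.submit(() -> replica.search(query)));
        }
        List<Long> merged = new ArrayList<>();
        for (Future<List<Long>> f : futures) {
            merged.addAll(f.get());              // real code would tolerate slow/failed shards
        }
        return merged;
    }

    private IndexShardClient pickReplica(List<IndexShardClient> replicas) {
        return replicas.get(ThreadLocalRandom.current().nextInt(replicas.size()));
    }

    public static void main(String[] args) throws Exception {
        // Two shards, two replicas each, backed by fixed in-memory results for the demo.
        IndexShardClient shard0 = q -> List.of(1L, 2L);
        IndexShardClient shard1 = q -> List.of(7L);
        ShardFanout fanout = new ShardFanout(
                List.of(List.of(shard0, shard0), List.of(shard1, shard1)));
        System.out.println(fanout.search("javazone"));   // [1, 2, 7]
        fanout.pool.shutdown();
    }
}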
Reliable Building Blocks
• Need to store data reliably
• Need to run jobs on pools of machines
• Need to make it easy to apply lots of computational resources to problems
In-house solutions:
• Storage: Google File System (GFS)
• Job scheduling: Global Work Queue (GWQ)
• MapReduce: simplified large-scale data processing
Google File System - GFS
[Diagram: clients obtain metadata from the replicated GFS masters (running on misc. servers), then transfer data directly to and from chunkservers 1…N; chunks C0–C5 are shown spread across the chunkservers.]
• Master manages metadata
• Data transfers happen directly between clients and chunkservers
• Files broken into chunks (typically 64 MB)
• Chunks triplicated across three machines for safety
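Given the 64 MB chunk size, locating a byte offset within a file is simple arithmetic; a minimal illustrative calculation (not GFS client code, and the 200 MB offset is just a made-up example):

// ChunkMath.java - maps a file offset to a GFS-style chunk index (illustrative only).
public class ChunkMath {
    static final long CHUNK_SIZE = 64L * 1024 * 1024;   // 64 MB chunks, per the slide

    // Which chunk holds this byte offset, and where inside the chunk does it fall?
    static long chunkIndex(long fileOffset)  { return fileOffset / CHUNK_SIZE; }
    static long chunkOffset(long fileOffset) { return fileOffset % CHUNK_SIZE; }

    public static void main(String[] args) {
        long offset = 200L * 1024 * 1024;                    // byte 200 MB into the file
        System.out.println("chunk " + chunkIndex(offset) +   // chunk 3
                ", offset " + chunkOffset(offset));          // 8 MB into that chunk
    }
}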
GoogleFile API access to GFS
• GoogleFile. Public API with two roles:
  – Creational class: static methods to obtain an InputStream, an OutputStream, and a GoogleChannel on top of a Google file.
  – File manipulation: a subset of the methods provided by the java.io.File class.
• GoogleInputStream. Implements the read method.
• GoogleOutputStream. Extends java.io.OutputStream and implements the write method.
• GoogleChannel. A public class that implements the ByteChannel interface and a subset of the methods of the FileChannel class; it provides random access.
• GoogleFile.Stats.
• The JNI layer is implemented by the class FileImpl and a set of SWIG-generated JNI wrappers produced during the build process.
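To make the shape of this API concrete, here is a minimal sketch of the surface described above. GoogleFile and its companions are Google-internal classes named on the slide; every method name and signature below is an assumption for illustration only, not the actual interface.

// GoogleFileSketch.java - illustrative sketch of the API surface described above.
// The real GoogleFile classes are internal to Google; the method names and
// signatures here are assumptions, not documented API.
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.channels.ByteChannel;

interface GoogleFileSketch {
    // Creational role: obtain stream/channel views of a file stored in GFS.
    InputStream  openInputStream()  throws IOException;   // cf. GoogleInputStream
    OutputStream openOutputStream() throws IOException;   // cf. GoogleOutputStream
    ByteChannel  openChannel()      throws IOException;   // cf. GoogleChannel (random access)

    // File-manipulation role: a java.io.File-like subset.
    boolean exists();
    boolean delete();
    long length();
}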
GFS Usage at Google
• 30+ clusters
• Clusters as large as 2000+ chunkservers
• Petabyte-sized filesystems
• 2000+ MB/s sustained read/write load
• All in the presence of HW failures
More information:
“The Google File System,” Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, 19th ACM Symposium on Operating Systems Principles (http://labs.google.com/papers/gfs.html)
MapReduce: Large Scale Data Processing
• Many tasks: Process lots of data to produce other data
• Want to use hundreds or thousands of CPUs, and it has to be easy
• MapReduce provides, for programs following a particular programming
model:
– Automatic parallelization and distribution
– Fault-tolerance
– I/O scheduling
– Status and monitoring
Example: Word Frequencies in Web Pages
A typical exercise for a new engineer in his or her first week
• Have files with one document per record
• Specify a map function that takes a key/value pair
key = document name
value = document text
• Output of map function is (potentially many) key/value pairs.
In our case, output (word, “1”) once per word in the document
Input: “document1”, “to be or not to be”
Map output:
“to”, “1”
“be”, “1”
“or”, “1”
…
Example continued: word frequencies in web pages
• MapReduce library gathers together all pairs with the same key
• The reduce function combines the values for a key
In our case, compute the sum
key = “be”,  values = “1”, “1”  →  “2”
key = “not”, values = “1”       →  “1”
key = “or”,  values = “1”       →  “1”
key = “to”,  values = “1”, “1”  →  “2”
• Output of reduce (usually 0 or 1 value) paired with key and saved
“be”, “2”
“not”, “1”
“or”, “1”
“to”, “2”
Example: Pseudo-code
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator intermediate_values):
  // key: a word, same for input and output
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));

Total 80 lines of code
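For comparison, here is the same word count written as plain Java in the MapReduce style: map emits (word, 1) pairs, a shuffle step groups them by key, and reduce sums the counts. This is a single-process sketch of the programming model only, not Google's MapReduce library.

// WordCount.java - the word-frequency example in MapReduce style, as a
// single-process Java sketch (not the distributed library itself).
import java.util.*;

public class WordCount {
    // Map phase: emit ("word", 1) once per word in the document.
    static List<Map.Entry<String, Integer>> map(String docName, String docText) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : docText.split("\\s+")) {
            out.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return out;
    }

    // Reduce phase: sum the counts gathered for one key.
    static int reduce(String word, Iterable<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    public static void main(String[] args) {
        // "Shuffle": group intermediate pairs by key, as the library would do.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : map("document1", "to be or not to be")) {
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            System.out.println(e.getKey() + ", " + reduce(e.getKey(), e.getValue()));
        }   // prints: be 2, not 1, or 1, to 2
    }
}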
Typical Google Cluster
• 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
• Limited bisection bandwidth
• Storage: local IDE disks and Google File System (GFS)
• GFS running on the same machines provides reliable, replicated storage of input and output data
• Job scheduling system: jobs made up of tasks, scheduler assigns tasks to machines
Execution
[Diagram: map tasks 1–3 read their inputs from GFS and emit intermediate pairs (k1:v, k2:v, k3:v, k4:v); a shuffle-and-sort stage groups values by key; reduce task 1 receives k1:v,v,v and k3:v, reduce task 2 receives k2:v and k4:v, and both write their output back to GFS.]
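The assignment of intermediate keys to reduce tasks is typically done by hashing the key modulo the number of reduce tasks; a minimal sketch of that partitioning step (an assumption about the usual technique, not the library's actual code):

// Partitioner.java - assigns an intermediate key to one of R reduce tasks
// (illustrative sketch of hash partitioning, not Google's code).
public class Partitioner {
    static int reduceTaskFor(String key, int numReduceTasks) {
        // Mask off the sign bit so the result is always in [0, numReduceTasks).
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int r = 2;  // two reduce tasks, as in the diagram above
        for (String key : new String[] {"k1", "k2", "k3", "k4"}) {
            System.out.println(key + " -> reduce task " + reduceTaskFor(key, r));
        }
    }
}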
Optimizations
• Shuffle stage is pipelined with mapping
• Many more tasks than machines, for load balancing
• Locality: map tasks scheduled near the data they read
• Backup copies of map & reduce tasks (avoids stragglers)
• Compress intermediate data
• Re-execute tasks on machine failure
[Screenshots: successive views of the MapReduce status page for job MR_Indexer-beta6-large-2003_10_28_00_03 as it runs.]
Results
Using 1800 machines:
• MR_Grep scanned 1 terabyte in 100 seconds
• MR_Sort sorted 1 terabyte of 100-byte records in 14 minutes
Rewrote Google's production indexing system:
• a sequence of 7, 10, 14, 17, 21, 24 MapReductions
• simpler
• more robust
• faster
• more scalable
Usage in March 2005
Number of jobs: 72,229
Average completion time: 934 secs
Machine days used: 358,528 days ≈ 1 millennium
Input data read: 12,571 TB
Intermediate data: 2,756 TB
Output data written: 941 TB
Average worker machines: 232
Average worker deaths per job: 1.9
Average map tasks per job: 3097
Average reduce tasks per job: 144
Unique map implementations: 309
Unique reduce implementations: 235
Unique map/reduce combinations: 411
Widely applicable at Google
– Implemented as a C++ library linked to user programs
– Java JNI interface similar to the GFS API (see the sketch below)
– Can read and write many different data types
Example uses:
• distributed grep
• distributed sort
• term-vector per host
• document clustering
• machine learning
• web access log stats
• web link-graph reversal
• inverted index construction
• statistical machine translation
• …
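As noted above, the library is C++ with a Java binding via JNI. A purely hypothetical sketch of what a Java-facing interface to such a library might look like (these names are invented for illustration; the real binding is internal and not shown in this talk):

// Sketch only: a hypothetical Java-facing shape for a MapReduce-style library.
// None of these interfaces are Google's actual JNI binding.
import java.util.Iterator;

interface Mapper {
    // Called once per input record; emits intermediate (key, value) pairs.
    void map(String key, String value, Emitter emitter);
}

interface Reducer {
    // Called once per intermediate key with all values gathered for that key.
    void reduce(String key, Iterator<String> values, Emitter emitter);
}

interface Emitter {
    void emit(String key, String value);
}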
Conclusion
• MapReduce has proven to be a useful abstraction
• Greatly simplifies large-scale computations at Google
• Fun to use: focus on the problem, let the library deal with the messy details
“MapReduce: Simplified Data Processing on Large Clusters,” Jeffrey Dean and Sanjay Ghemawat, OSDI'04: Sixth Symposium on Operating Systems Design and Implementation (search Google for “MapReduce”)
Java in Google Applications
AdWords FE
- Millions of ads
- Billions of transactions
- Extreme rates
Gmail middle tier
- UI and storage brokerage
- Content searching, analysis
- Tagging
Java expertise @ Google
• Joshua Bloch - Collections Framework, Java 5.0 language enhancements, java.math; author of "Effective Java," coauthor of "Java Puzzlers."
• Neal Gafter - Lead developer of javac, implementor of Java 5.0 language enhancements, coauthor of "Java Puzzlers."
• Robert Griesemer - Architect and technical lead of the HotSpot JVM.
• Doug Kramer - Javadoc architect, Java platform documentation lead.
• Tim Lindholm - Original member of the Java project, key contributor to the Java programming language, implementor of the classic JVM, coauthor of "The Java Virtual Machine Specification."
• Michael "madbot" McCloskey - Designer and implementer of java.util.regex.
• Srdjan Mitrovic - Co-implementor of the HotSpot JVM.
• David Stoutamire - Technical lead for Java performance, designer and implementer of parallel garbage collection.
• Frank Yellin - Original member of the Java project, co-implementor of the classic JVM, KVM and CLDC, coauthor of "The Java Virtual Machine Specification."
Giving it back – JCP Expert Groups
• Executive Committee for J2SE/J2EE
• JSR 166X: Concurrency Utilities (continuing) http://www.jcp.org/en/jsr/detail?id=166
• JSR 199: Java Compiler API http://www.jcp.org/en/jsr/detail?id=199
• JSR 220: Enterprise JavaBeans 3.0 http://www.jcp.org/en/jsr/detail?id=220
• JSR 250: Common Annotations for the Java Platform http://www.jcp.org/en/jsr/detail?id=250
• JSR 260: Javadoc Tag Technology Update http://www.jcp.org/en/jsr/detail?id=260
• JSR 269: Pluggable Annotation Processing API http://www.jcp.org/en/jsr/detail?id=269
• JSR 270: J2SE 6.0 ("Mustang") Release Contents http://www.jcp.org/en/jsr/detail?id=270
Google representative: gafter
• JSR 273: Design-Time API for JavaBeans JBDT http://www.jcp.org/en/jsr/detail?id=273
• JSR 274: The BeanShell Scripting Language http://www.jcp.org/en/jsr/detail?id=274
• JSR 277: Java Module System http://www.jcp.org/en/jsr/detail?id=277
Closing Notes
• Google = Computing infrastructure
• Java is becoming a first-class citizen at Google
• Essential native interfaces are being built
• API design is extremely important at our scale; the Java expertise is driving general API work
• Google brings high-scale industrial experience into JCP expert groups.
Q&A
Knut Magne Risvik
Google Inc.
September 14, 2005