Transcript ppt

EMTM 600
Software Development
Spring 2011
Lecture Notes 6
I enjoy hearing from my former students! In the future,
please feel free to contact me if you wish to discuss
other topics related to this class, or new technologies,
or issues that arise in your future projects!
Cloud computing was born out solutions
to the need to process GIGANTIC amounts of data
For its search applications, Google had to build large clusters of
commodity hardware. So did Amazon, for sales data. Then the
social networking sites!
At first, cloud computing was a way for these companies to recoup
some of their infrastructure investments. Now it is a business model!
At the same time, they had to develop programming techniques for the
processing of very large amounts of information.
MapReduce, first developed at Google is such a programming
paradigm.
You can do cloud computing without MapReduce. But if you need to
process large amounts of information, MapReduce is very useful.
And it is implemented very nicely on the main clouds.
EMTM 600 Spring 2011 Val Tannen
Search and Social Networking Applications
What are the technological challenges at Google, Yahoo, and other
Web search engines? How about at Facebook, YouTube, or other
social networking sites?
Gigantic amounts of DATA! Petabytes of data!
 Google, Yahoo!, etc., crawl the web and collect web pages
see Jeff Dean slides), web request logs as well as documents in
PDF, Word, Powerpoint, etc
 This data must be processed to create inverted indices, preference
graphs, summaries, query frequencies, input data for statistical models
used in advertisement pricing.
 Facebook, YouTube, etc., collect enormous amounts of user
provided content (Facebook estimated at 2.5PB in 2009).
 This data must be processed for the same reasons the search data
must be processed.
EMTM 600 Spring 2011 Val Tannen
(
How are petabytes of data stored and
processed?
 Stored: on many servers, owned or leased
 Either relatively few ( < 200 ) powerful expensive servers ( > 10K$ )
 Or (tens of) thousands of commodity servers (2-4K$)
 In both cases highly scalable DFS (distributed file systems) are needed,
eg.: Global File System (GFS) open-source (part of Linux) or Google
File Systems (GFS(!)) which is proprietary
 Processed:
 Either proprietary MPP (massively parallel processing) proprietary
solutions (Greenplum, Aster, Google’s MapReduce)
 Or using Hadoop, an open source implementation of the MapReduce
idea.
 In both cases, the processing may use clusters of thousands of
processors (Facebook has a >1000 node single cluster, Yahoo! had a
>5000 node single cluster; see next for Google)
EMTM 600 Spring 2011 Val Tannen
A (Very!) Scalable Distributed-Parallel
Programming Model: MapReduce
 In many circles, considered the key building block for much of Google’s
data analysis
 A programming language built on it: Sawzall,
http://labs.google.com/papers/sawzall.html
 … Sawzall has become one of the most widely used programming
languages at Google. … [O]n one dedicated Workqueue cluster with
1500 Xeon CPUs, there were 32,580 Sawzall jobs launched, using an
average of 220 machines each. While running those jobs, 18,636
failures occurred (application failure, network outage, system crash,
etc.) that triggered rerunning some portion of the job. The jobs read a
total of 3.2x1015 bytes of data (2.8PB) and wrote 9.9x1012 bytes
(9.3TB).
 Other similar languages: Yahoo! Pig Latin, Microsoft’s Dryad
 Cloned in open source: Hadoop, http://hadoop.apache.org/core/
EMTM 600 Spring 2011 Val Tannen
Parallel programming pattern: Map
2.3
3.4
4.5
5.3
4.2
1.6
6.1
2.07
0.67
2.60
map ( log2 )
1.20
1.76
2.16
EMTM 600 Spring 2011 Val Tannen
2.40
Parallel programming pattern: Reduce
12.86
reduce ( + )
7.52
2.96
1.20
1.76
5.34
4.56
2.16
EMTM 600 Spring 2011 Val Tannen
2.40
2.07
2.74
2.60
0.67
2.60
Parallel programming pattern: Pipeline
Example: the Sieve of Eratosthenes
...,6,5,4,3
main
Filter
divisible by
...,11,9,7,5
2
Filter
divisible by
3
7,11,13,17...
17,19,23,29...
11,13,17,19...
13,17,19,23...
Filter
divisible by
11
Filter
divisible by
Filter
divisible by
7
5
EMTM 600 Spring 2011 Val Tannen
MapReduce Primitives
Well-known since Lisp (1962):
map (apply function to all items in a collection) and
reduce (apply function to set of items with a common key)
We start with:
A user-defined function to be applied to all data,
map: (key,value)  (key, value) (or even a set of (key, value)
)
Another user-specified operation
reduce: (key, {set of values})  result
A set of n nodes, each with data
All nodes run map on all of their data, producing new data
with keys. This data is collected by key, then reduced
EMTM 600 Spring 2011 Val Tannen
Example
Bill
Id
Vend.
Name
Amt.
1234
ATT
600
1345
IBM
750
2235
SUN
200
2456
SUN
550
2534
SUN
800
2611
IBM
300
3123
IBM
250
3356
IBM
350
3467
ATT
400
Map
(and
group by
key)
EMTM 600 Spring 2011 Val Tannen
Vend.
Name
Amt.
Vend.
Name
Total
Amt.
ATT
600
ATT
1000
ATT
400
Vend.
Name
Amt.
IBM
750
IBM
300
Vend.
Name
Total
Amt.
IBM
250
IBM
1650
IBM
350
Total
Amt.
1550
Reduce
Vend.
Name
Amt.
SUN
200
Vend.
Name
SUN
550
SUN
SUN
800
Actual Example Tasks
Count word occurrences
Map: output word as key with count 1
Reduce: sum the counts
Distributed grep – all lines matching a pattern
Map: filter by pattern
Reduce: output set (union)
Count URL access frequency
Map: output each URL as key, with count 1
Reduce: sum the counts
For each IP address, get the document with the most inlinks (reduce is “max”)
Number of queries by IP address (requires multiple steps)
EMTM 600 Spring 2011 Val Tannen
MapReduce Dataflow Diagram
Coordinator
Data
partitions
by key
Map
computation
partitions
Reduce computation partitions
Redistribution by output’s key
EMTM 600 Spring 2011 Val Tannen
Some Details
 Fewer computation partitions than data partitions
(because TBs of data but only thousands of computers)
All data is accessible via a distributed filesystem with replication
Worker nodes produce data in key order (makes it easy to
merge)
The master is responsible for scheduling, keeping all nodes busy
The master knows how many data partitions there are, which
have completed – atomic commits to disk
 Fault tolerance: master triggers re-execution of work
originally performed by failed nodes – to make their
data available again
 Locality: master tries to do work on nodes that have
replicas of the data
EMTM 600 Spring 2011 Val Tannen
Hadoop: A “Modern” Open-Source “Clone”
of MapReduce + GlobalFileSystem
 Underlying Hadoop: HDFS, a page-level replicating filesystem
 Supports “streaming” page access from each site
 Master/Worker: “Namenode” vs “Datanodes”
Source: Hadoop HDFS architecture documentation
EMTM 600 Spring 2011 Val Tannen
Hadoop HDFS + MapReduce
Source: “Meet Hadoop”, Devaraj Das, Yahoo Bangalore & Apache
EMTM 600 Spring 2011 Val Tannen
Hadoop MapReduce Architecture
“Jobtracker” (Master):
Accepts jobs submitted by users
Gives tasks to Tasktrackers – makes scheduling
decisions, co-locates tasks to data
Monitors task, tracker status, re-executes tasks if
needed
“Tasktrackers” (Workers):
Run Map and Reduce tasks
Manage storage, transmission of intermediate output
EMTM 600 Spring 2011 Val Tannen
Programming Hadoop directly is
somewhat arcane
Luckily, some query languages have been developed
that compile into Hadoop, eg.,
Pig Latin at Yahoo! (open-source under Apache)
Hive at Facebook (also open-source under Apache)
An SQL program:
An equivalent Pig Latin program:
SELECT category, AVG(pagerank)
FROM urls
WHERE pagerank > 0.2
GROUP BY category
HAVING COUNT(*) > 106
good_urls = FILTER urls
BY pagerank > 0.2;
groups = GROUP good_urls
BY category;
big_groups = FILTER groups
BY COUNT(good_urls)>106 ;
output = FOREACH big_groups
GENERATE category, AVG(good_urls.pagerank);
EMTM 600 Spring 2011 Val Tannen
YouTube Technology
Problems: videos are large, need to be served reliably,
continuously
 Platform: mostly open-source!
 Apache, Linux, lighttpd (web server optimized for video)
 Python, MySQL
 Servers:








Web load balancing with NetScaler from Citrix
Web requests processed by app server written in Python
App server connects to databases and other info sources
Use of Python for flexible rapid development and deployment
Row-level caching in databases
Python objects cached and shared across app servers
Each video served by multiple machines with redundancy
Simple network path for video serving (few nodes from server to user)
EMTM 600 Spring 2011 Val Tannen
New trend: developer communities
 Both Facebook and YouTube offering users the ability to build
extensions!
 Facebook encourages users to set up their own Facebook miniservers with additional content they generate.
Facebook provides configuration standards that integrate the user’s
app with the whole site.
It also provides a web service with an API that gives the user’s app
access to some of the site’s content in order to link with friends,
etc. Likely it’s yet clunky but do not underestimate the enthusiasm
of the Facebook afficionados!
 YouTube offers users the “Data API Protocol”, essentially a web
service allowing them to perform functions normally executed on the
YouTube website: search for videos, retrieve standard video feeds,
comments and video responses. Also it lets users’ apps upload videos
to YouTube or update existing videos or submit authenticated requests
to create playlists, subscriptions, or contacts.
EMTM 600 Spring 2011 Val Tannen
.NET vs. Java EE
They are fundamentally different:
•
Java EE is a specification, controlled by Sun (on its way to be open
source) on which products by vendors like IBM, BEA, Oracle, Sun,
Borland, SAP, open source are based.
•
.NET is a family of products, with specifications used for the interfaces
between the products.
And they are fundamentally similar:
•
Both focus on web access, interoperability, availability, high-throughput
and scalability.
•
Both rely on the same three-tier architecture (presentation, business, data)
and many of the same design patterns.
EMTM 600 Spring 2011 Val Tannen
.NET vs. Java EE
Platform
•
.NET works only with Windows, IIS, and “officially” with SQLServer
(although an “open” data interface, part of ADO.NET, is specified and
implemented by other RDBMS vendors).
•
Java EE works on several platforms: Windows and Linux are
the most used ones.
Development Language
•
J2EE works only with Java (although CORBA connections are
possible to components written in other prog langs).
•
.NET works with several programming languages: C#, C++, VB, etc.
EMTM 600 Spring 2011 Val Tannen
.NET vs. Java EE
Interface specifications
•
In Java EE the interfaces between components are specified and supplied
in Java. The specification is completely open.
•
In .NET most of the interfaces are open and supplied in a new
programming language designed by Microsoft called the Common
Language. This language is also the target of the compilers from the
various development languages. It was designed at the same time as C#
and bears many similarities with it. If you know C# the specifications are
very easy to read. VB programmers who don’t want to learn C# must rely
on the .NET Studio wizzards for help (and may miss some details).
EMTM 600 Spring 2011 Val Tannen
.NET vs. Java EE
Runtime Execution Engine
•
Here .NET and Java EE are similar. The Common Language plays a
similar role to the Java Bytecode; both are intermediate languages.
Implementations of the framework components are supplied in the
intermediate language.
•
But there is an important difference. .NET intermediate language code is
compiled into native code and then executed (this is called just-in-time JIT
compilation). The compiler + runtime environment is the Common
Language Runtime (CLR).
Java EE does not specify JIT so many vendors and most open source
implementations use the good old Java Virtual Machine interpreter, which is
of course slower.
EMTM 600 Spring 2011 Val Tannen
.NET and the history of MS products (1)
.NET has evolved from COM (Component Object Model) which was introduced
to facilitate (among other things) development in multiple languages by
providing reusable functionality.
Major COM headaches:
•
Interface specification compatibility, especially between C++ and VB
•
Very limited inheritance
•
Memory leaks, since de-allocation is managed by the programmer
.NET solutions
•
Interfaces in Common Language
•
CL Specification describes minimal requirements of languages
•
CL has full inheritance support, essentially same as C#
•
Garbage collector for memory management
EMTM 600 Spring 2011 Val Tannen
.NET vs. Java EE
Application Server, Enterprise Components, Container
•
Container-like support with .NET Enterprise Services for .NET Serviced
Components (like EJBs)
•
.NET Services include transaction (DTC: Distributed Transaction
Coordinator, now System.Transaction) and security (role-based, quite
flexible) support.
•
.NET components are automatically multithreaded, each thread needs a
separate “context”.
EMTM 600 Spring 2011 Val Tannen
.NET and the history of MS products (2)
COM server in Windows (registry, DLL) --> MTS (Microsoft Transaction Server) -->
--> COM+ Services
-->
.NET Enterprise Services
•
Declarative support (especially transactional support) of components started with
MTS
•
MTS already had DTC, worked with database systems
•
MTS got rid of DLLs
•
But COM and MTS don’t work together well
•
COM+ 1.0 merges COM and MTS components
•
Product names COM+ Services to emphasize other services beyond transactions
•
.NET Enterprise Services combines COM+ Services with components defined in
.NET’s CL
•
Better management of concurrency contexts: .NET objects are “agile”
•
COM+ 1.5 new features that improve .NET Enterprise Services (see in the textbook
p. 28)
EMTM 600 Spring 2011 Val Tannen
.NET vs. Java EE
Remote Interoperation (eg. between Web server and App server)
•
Java EE uses RMI over IIOP. It’s supposed to be faster and allow for
better security.
•
.NET uses .NET Remoting which is an RMI-style technology (i.e., CORBA
without IDL)
•
Various transport protocols and remoting hosts can be used (e.g., Windows
services and TCP) but most used are IIS (virtual website) and HTTP.
•
Data formatters used for marshalling and unmarshalling are either binary
(as in RMI/IIOP) or SOAP (i.e., XML).
•
MS seems to be recommending SOAP over HTTP with ASP.NET hosting
process within HTTP. This transitions easily to a Web service!
•
But XML encoding is slow. And security with .NET Remoting is messy…
EMTM 600 Spring 2011 Val Tannen
.NET and the history of MS products (2)
.NET Remoting succeeds DCOM (Distributed COM objects) which was essentially the
MS answer to CORBA and to RMI.
It is still possible to do remote interoperation based on DCOM with .NET components,
using a .NET Remoting feature based on class inheritance.
DCOM is very fast and it works better with COM+ features.
It runs over various protocols such as TCP/IP, UDP (broadcast) or IPX (Novell)
But DCOM is not as flexible as .NET Remoting, in particular it needs more capabilities on
the client including the installation of an application proxy.
But SOAP/XML/HTTP is much slower.
EMTM 600 Spring 2011 Val Tannen
.NET vs. Java EE
Persistence
•
Java EE has entity beans, CMP or BMP as well as POJOs with DAOs.
BMP and DAOs rely on JDBC.
•
Originally persistence in .NET was very much hand-programmed using
OLE DB and ODBC. Later, ADO.NET was added to great effect.
•
ADO.NET may have been named after ADO (ActiveX Data Objects) but it’s
very different. ADO.NET has
•
Entity Classes implement business objects that are accessible by multiple
applications, and from outside the application server. Like Entity EJBs they need
access by context. Unlike Entity EJBs, there is no CMP, persistence is
implemented through DAOs.
•
.NET Data Providers are JDBC-like components: they come standard for native
access to SQL Server and Oracle, and for OLE DB and ODBC sources (plus
third-party products)
•
Datasets: new and interesting!
EMTM 600 Spring 2011 Val Tannen
.NET and the history of MS products (3)
ADO.NET datasets are based on some ideas of the OLE DB authors (Blakeley et al)
A memory-resident representation of relational data (tables with attributes, constraints)
It is independent of the data source (data provider in .NET lingo) and it can be populated
with data from multiple sources!
It can be passed around as an object, including in distributed settings
It has automatic XML serialization!
Thus, it can be see as a variant of the Composite Entity pattern (see our Java EE
textbook)
EMTM 600 Spring 2011 Val Tannen
.NET vs. Java EE
SOA
•
Java EE can make any session bean into a Web service with the help of
JAXP, JAX-WS, etc
•
.NET is even closer to the spirit of SOA. In effect, it can make all its
remotely accessible components into Web services and ensure that even
the Enterprise Services are in effect Web services
•
ASP.NET has been extended to facilitate the creation of Web services. It
already has strong XML serialization over HTTP support. It can now
produce WSDL descriptions.
•
The WSDL document can be processed by wsdl.exe on the client to create
a proxy class for the Web service. Calls to the methods of this proxy class
result in Web service invocations.
EMTM 600 Spring 2011 Val Tannen
A .NET view of the layered architecture
IIS Web Server
presentation
.NET App Server
controller
domain
ASP.NET
.NET/
COM+
data mapping
data source
ADO.NET
Entity
ASP.NET
EMTM 600 Spring 2011 Val Tannen
ADO.NET
.NET vs. J2EE: basic equivalences
Java EE
.NET
Web presentation
JSP
ASP.NET
Dynamic content
Servlets
ASP.NET
RDBMS access
JDBC
.NET Data Provider
Application server
components
Session EJB
.NET Serviced
Components (using
COM+ Services)
Persistent objects
Entity EJB
ADO.NET Entity
Asynchronous
components
Message-driven
Beans
COM+ queued
components
Web Services
JAX-WS, JAXP,
SAAJ, JBI, etc
ASP.NET Web
Services w/ XML
Security
JAAS, XWSSecurity
System.Security
EMTM 600 Spring 2011 Val Tannen
.NET vs. J2EE: the big issues
•
.NET: programming language neutrality (30+ PLs ?!?)
•
Java EE: platform and vendor neutrality
•
Performance: conflicting claims; many comparisons are apples to
oranges…
•
SOA-friendly: ,NET makes some things easier; its ADO.NET
DataSet objects have “built-in” XML serialization.
•
Cost: .NET is claimed to have lower development costs.
•
Platform: Linux is cheaper than Windows
•
Skills: work with whom you have :) Perhaps the most important!
EMTM 600 Spring 2011 Val Tannen
Five reasons against migrating from EJB to .NET
article by G. Baker on Builder.com
1. CLR does not support Java
2. IIS does not support JSP
3. Using ASP.NET server controls will require re-designs
4. No support for Container Managed Persistence
5. Different session handling implementations
EMTM 600 Spring 2011 Val Tannen
Java EE Applications Life-Cycle and VIPs
•
•
•
Development
•
Application architect and assembler
•
Web components provider
•
EJB provider
•
Database and persistence designer
Deployment (best done with IDE-based tools)
•
Web server and servlet deployer
•
EJB deployer
•
Database deployer
Administration
•
System administrator (sometimes can be done with IDE tools)
•
Database administrator (DBMS vendors provide tools)
EMTM 600 Spring 2011 Val Tannen
Can J2EE Applications be developed entirely with
open source tools?
I guess so, if (1) the developers are very disciplined, (2) the transaction rate is
reasonably low and (3) you are not locked into vendors.
By the way, for simple Web applications, this IS the way to go: Tomcat + JDBC
to MySQL or some other free RDBMS.
Open source tools for building J2EE applications:
1.
Application server: JBoss, which uses Tomcat as its Web container
(http://www.jboss.org) also, check for bargains at
http://www.theserverside.com/reviews/matrix.tss
2.
Relational Database System: MySQL (http://www.mysql.com)
3.
Integrated Development Environment: Eclipse
(http://www.eclipse.org)
JBoss comes with its own RDBMS; MySQL is a little better. None of them can
handle large transaction volume.
JBoss already has enhanced Eclipse with some Java EE wizzards; not many
though…
EMTM 600 Spring 2011 Val Tannen
Lightweight Open Source Alternatives to some
of JBoss capabilities
•
The Spring Framework:
According to http://www.springframework.org, Spring is a layered Java EE
application framework.
One advantage of this framework is that it will allow you to use parts of the
Spring in isolation, unlike Struts. We may use Spring only to simplify our
use of JDBC or we could use Spring to manage all our business objects.
Spring maintains an application container that manages
concurrency/transactions.
EMTM 600 Spring 2011 Val Tannen
Lightweight Open Source Alternatives to some
of JBoss capabilities
•
Hibernate:
Hibernate is a versatile object/relational persistence and query
service.
Hibernate lets you develop persistent classes following objectoriented idiom - including association, inheritance, polymorphism,
composition, and collections. Hibernate allows you to express
queries in its own portable SQL extension (HQL), as well as in
native SQL, or with an object-oriented Criteria and Example API.
Hibernate also does some XML data binding).
The official link is http://www.hibernate.org.
EMTM 600 Spring 2011 Val Tannen
What’s next?
More declarative programming.
Security built into software modularization.
Languages that do “everything” (eg. LINQ at MS).
More parallelism (to exploit multi-cores).
Machine-learning libraries widely used.
The race between complexity and reliability will
continue 
EMTM 600 Spring 2011 Val Tannen