Scalable Computing talk


Transcript: Scalable Computing talk

Scaleable Computing
Jim Gray
Researcher
US-WAT MSR San Francisco
Microsoft Corporation
[email protected]
Outline

Why scaleable servers?
Problems and solutions for scaleable servers:
  How Internet Information Server revolutionizes OLTP
  “Wolfpack” Windows NT® clusters for scaleability, availability, manageability
  ActiveX™ object model as structuring principle
  OLE DB (DAO) for data sources
  MTX as a new programming paradigm
  MTX as a server
  Distributed transactions to coordinate components
  “Falcon” queues for asynchronous processing

Kinds Of Information Processing

                  Immediate              Time-shifted
Point-to-point    conversation, money    mail, book
Broadcast         lecture, concert       newspaper

The network carries the immediate kinds; the database stores the time-shifted kinds.

It’s ALL going electronic
Immediate traffic is being stored for analysis (so it ALL ends up in a database)
Analysis and automatic processing are being added

Why Put Everything In Cyberspace?

Low rent: min $/byte
Shrinks time: now or later (immediate OR time-delayed)
Shrinks space: here or there (point-to-point OR broadcast)
Automate processing: knowbots locate, process, analyze, and summarize (via the network and the database)

Magnetic Storage Cheaper Than Paper

File cabinet:
  cabinet (four drawer)       $250
  paper (24,000 sheets)       $250
  space (2x3 ft @ $10/ft2)    $180
  total                       $700   =  3¢/sheet

Disk (4 GB = $800):
  Image: 200,000 pages        =  0.4¢/sheet    (8x cheaper)
  ASCII: 2 million pages      =  0.04¢/sheet   (80x cheaper)

Conclusion: store everything on disk
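
A quick back-of-the-envelope check of the per-sheet figures above (a sketch; the dollar amounts and page counts are the slide's own):

```python
# Cost per sheet for the file cabinet versus a 4 GB disk (slide's figures).
paper_total_dollars, paper_sheets = 700, 24_000
disk_dollars = 800
image_pages, ascii_pages = 200_000, 2_000_000

print(f"paper:        {100 * paper_total_dollars / paper_sheets:.1f} cents/sheet")  # ~2.9
print(f"disk (image): {100 * disk_dollars / image_pages:.2f} cents/sheet")          # 0.40
print(f"disk (ASCII): {100 * disk_dollars / ascii_pages:.3f} cents/sheet")          # 0.040
```
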
Databases: Information at Your Fingertips™
Information Network™, Knowledge Navigator™

All information will be in an online database (somewhere)
You might record everything you:
  Read: 10 MB/day,  400 GB/lifetime   (eight tapes today)
  Hear: 400 MB/day, 16 TB/lifetime    (three tapes/year today)
  See:  1 MB/s, 40 GB/day, 1.6 PB/lifetime   (maybe someday)
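
These lifetime totals come out if you assume roughly a 40,000-day (about 110-year) recording span; that span is an assumption of this sketch, not something stated on the slide:

```python
# Rough check of the per-lifetime figures, assuming ~40,000 days of recording.
days = 40_000
read_mb = 10 * days      # 400,000 MB    = 400 GB
hear_mb = 400 * days     # 16,000,000 MB = 16 TB
see_gb  = 40 * days      # 1,600,000 GB  = 1.6 PB
print(f"read {read_mb/1e3:.0f} GB, hear {hear_mb/1e6:.0f} TB, see {see_gb/1e6:.1f} PB")
```
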
Database Store ALL Data Types

The old world:
  Millions of objects
  100-byte objects

  People
  Name    Address
  David   NY
  Mike    Berk
  Won     Austin

The new world:
  Billions of objects
  Big objects (1 MB)
  Objects have behavior (methods)

  People
  Name    Address    Papers    Picture    Voice
  David   NY
  Mike    Berk
  Won     Austin
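
To make "objects have behavior (methods)" concrete, here is a small sketch in plain Python, not the product's object model; the class and method names are made up for illustration:

```python
# A "new world" record: big multimedia fields plus behavior attached to the data.
from dataclasses import dataclass, field

@dataclass
class Person:
    name: str
    address: str
    papers: list = field(default_factory=list)   # list of ~1 MB documents (bytes)
    picture: bytes = b""
    voice: bytes = b""

    def total_bytes(self) -> int:
        """A method on the object: how much storage does this person's data use?"""
        return sum(len(p) for p in self.papers) + len(self.picture) + len(self.voice)
```
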

Paperless office
Library of Congress online
All information online:
  Entertainment
  Publishing
  Business
  WWW and Internet
Billions Of Clients

Every device will be “intelligent”
  Doors, rooms, cars…
Computing will be ubiquitous

Billions Of Clients Need Millions Of Servers

All clients are networked to servers
  May be nomadic or on-demand
Fast clients want faster servers
Servers provide:
  Shared data
  Control
  Coordination
  Communication

[Diagram: mobile and fixed clients connect over the network to servers and super-servers]

Conclusion

Commodity hardware allows new applications
New applications need huge servers
Ideally, clients and servers are built of the same “stuff”
Servers should be built from:
  Commodity software and
  Commodity hardware
Servers should be able to:
  Scale UP (grow by adding CPUs, disks, networks)
  Scale DOWN (can start small)

Scaleable Systems: BOTH SMP And Cluster

Grow UP with SMP; 4xP6 is now standard
Grow OUT with cluster
Cluster has inexpensive parts

[Diagram: personal system, departmental server, SMP super server (grow up); cluster of PCs (grow out)]

SMPs Have Advantages

Single system image: easier to manage, easier to program (threads in shared memory, disk, net)
4x SMP is commodity
Software is capable of 16x

Problems:
  >4x is not commodity
  Scale-down problem (starter systems are expensive)
  There is a BIGGEST one

[Diagram: personal system, departmental server, SMP super server]

The TPC-C Revolution Shows How Far SMPs Have Come

Performance is amazing:
  2,000 users is the min!
  30,000 users on a 4x12 Alpha cluster (Oracle)
Prices are dropping fast

[Chart: vendors' tpmC and $/tpmC. Price ($/tpmC, $0 to $500) versus performance (tpmC, 0 to 35,000) for DB2, Informix, Oracle, Sybase, Microsoft, and Informix on NT; lower price and higher throughput is better. The UNIX results show a dis-economy of scale.]

TPC-C Web-Based Benchmarks

The client connects to the Web server (IIS) via HTTP
SQL Server executes the transaction and returns results via ODBC
The Web server builds the HTML page and sends it to the client via HTTP
6,750 transactions per minute (tpmC) on a 4xP6
Net: Internet server performance is GREAT!

[Diagram: client, HTTP, IIS (Web), ODBC, SQL Server]
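
A hedged sketch of the Web-server-to-database path described above, using Python's pyodbc module as a stand-in for the ODBC layer; the DSN, table, and query are hypothetical, and the point is the shape of the request path rather than any particular API:

```python
# Web request -> ODBC query -> HTML page, as in the IIS + SQL Server benchmark.
import pyodbc

def handle_order_status(order_id: int) -> str:
    """Build an HTML page for one order-status style request."""
    conn = pyodbc.connect("DSN=tpcc;UID=web;PWD=secret")   # hypothetical DSN
    try:
        cur = conn.cursor()
        cur.execute(
            "SELECT o_id, o_entry_d, o_carrier_id FROM orders WHERE o_id = ?",
            (order_id,),
        )
        row = cur.fetchone()
    finally:
        conn.close()
    # The Web server turns the ODBC result into HTML and returns it over HTTP.
    if row is None:
        return "<html><body>No such order</body></html>"
    return (f"<html><body><h1>Order {row[0]}</h1>"
            f"<p>Entered: {row[1]}, carrier: {row[2]}</p></body></html>")
```
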
What Happens To Prices?

No expensive UNIX front end (about $20/tpmC)
No expensive TP monitor software (about $10/tpmC)
=> $81/tpmC

[Chart: TPC price per tpmC, broken down into processor, disk, software, and network costs, for Informix on SNI, Oracle on DEC UNIX, Oracle on Compaq/NT, Sybase on Compaq/NT, Microsoft on Compaq with Visigenics, Microsoft on HP with Visigenics, Microsoft on Intergraph with IIS, and Microsoft on Compaq with IIS]

Scaleable Systems: Clusters Scale Beyond Largest SMP

[Diagram: personal system, departmental server, SMP super server; a cluster of PCs grows beyond the largest SMP]

Clusters Have Advantages

Clients and servers are made from the same stuff
Inexpensive: built with commodity components
Fault tolerance: spare modules mask failures
Modular growth: grow by adding small modules
Unlimited growth: no biggest one

Parallelism: The OTHER Aspect of Clusters

Clusters of machines allow two kinds of parallelism:
  Many little jobs: online transaction processing (TPC-A, B, C…)
  A few big jobs: data search and analysis (TPC-D, DSS, OLAP)
Both give automatic parallelism

Thesis: Many Little Beat Few Big

[Diagram: the price spectrum from the $1 million mainframe through the $100K mini and $10K micro down to pico processors; disc form factors from 14" through 9", 5.25", 3.5", and 2.5" to 1.8"; a storage hierarchy of 1 MB of 10-picosecond RAM, 100 MB of 10-nanosecond RAM, 10 GB of 10-microsecond RAM, 1 TB of 10-millisecond disc, and a 100 TB, 10-second tape archive]

The future processor is a "smoking, hairy golf ball":
  1 M SPECmarks, 1 TFLOP
  10^6 clocks to bulk RAM
  Event horizon on chip
  VM reincarnated
  Multiprogram cache, on-chip SMP

The open questions:
  How to connect the many little parts?
  How to program the many little parts?
  Fault tolerance?

Future Super Server: 4T Machine

Array of 1,000 "4B machines" (Cyber Bricks), each with:
  1 Bips processor
  1 BB of DRAM
  10 BB of disk
  1 Bbps comm lines
plus a 1 TB tape robot, for a few megabucks

Challenge: manageability, programmability, security, availability, scaleability, affordability
  As easy as a single system

Future servers are CLUSTERS of processors and discs
Distributed database techniques make clusters work

[Diagram: a Cyber Brick (one 4B machine): CPU, 5 GB RAM, 50 GB disc]

The Hardware Is In Place… And Then a Miracle Occurs?

SNAP: scaleable network and platforms
A commodity distributed OS built on:
  Commodity platforms
  Commodity network interconnect
Enables parallel applications

Two Scaleability Projects: 1-TB DB and 1 Billion TPD

1 Terabyte DB: grow UP (personal system, departmental server, SMP super server)
1 billion transactions per day: grow OUT (cluster)

Building The Biggest Node

There is a biggest node (its size grows over time)
Today, with Windows NT, it is probably 1 TB
We are building it (with help from DEC and SPOT):
  A 1 TB GeoSpatial SQL Server database (1.4 TB of disks = 280 drives)
  30K BTU, 8 KVA, 1.5 metric tons
We plan to put it on the Web as a demonstration application
It will hold satellite images of the entire planet:
  One pixel per 10 meters
  Better resolution in the U.S. (courtesy of USGS)

What’s A TeraByte?

1 Terabyte is:
  1,000,000,000 business letters       150 miles of book shelf
  100,000,000 book pages               15 miles of book shelf
  50,000,000 FAX images                7 miles of book shelf
  10,000,000 TV pictures (MPEG)        10 days of video
  4,000 LandSat images                 16 earth images (100m)

The Library of Congress (in ASCII) is about 25 TB

Cost of a terabyte:
  1980: $200 million of disc (10,000 discs); $5 million of tape silo (10,000 tapes)
  1996: $200,000 of magnetic disc (120 discs); $50,000 of nearline tape (50 tapes)

Terror Byte!

What The 1-Billion TPD Project Is Doing

Building a 20-node Windows NT cluster (with help from Intel)
All commodity parts
Using SQL Server and DTC distributed transactions
Each node has 1/20th of the DB
Each node does 1/20th of the work
15% of the transactions are “distributed”
Uses the “Viper” distributed transaction coordinator
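
A sketch of the partitioning idea behind "1/20th of the DB, 1/20th of the work": route each row to a node by key, and call a transaction "distributed" when it touches rows owned by different nodes. The node names and helper functions below are hypothetical, not the project's actual code:

```python
# Hash partitioning across a 20-node cluster.
NODES = [f"node{i:02d}" for i in range(20)]

def route(account_id: int) -> str:
    """Pick the node that owns this account's row."""
    return NODES[hash(account_id) % len(NODES)]

def classify(debit_account: int, credit_account: int) -> str:
    """A transaction is local if both rows live on one node, else distributed."""
    return "local" if route(debit_account) == route(credit_account) else "distributed"
```
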
How Much Is 1 Billion Transactions Per Day?

1 Btpd = 11,574 tps (transactions per second)
       ~ 700,000 tpm (transactions per minute)

For comparison:
  Visa (~20 M tpd): 400 M customers, 250,000 ATMs worldwide, 7 billion transactions per year (card + cheque) in 1994
  AT&T: 185 million calls on the peak day (worldwide)

[Chart: millions of transactions per day, log scale from 0.1 to 1,000, for NYSE, BofA, Visa, AT&T, and the 1 Btpd target]
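
The headline rates follow directly from the number of seconds and minutes in a day:

```python
# 1 billion transactions per day, expressed per second and per minute.
tpd = 1_000_000_000
tps = tpd / (24 * 60 * 60)    # 11,574 transactions per second
tpm = tpd / (24 * 60)         # ~694,444 transactions per minute (~700,000 tpm)
print(f"{tps:,.0f} tps, {tpm:,.0f} tpm")
```
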

Outline

Why scaleable servers?
Problems and solutions for scaleable servers:
  How Internet Information Server revolutionizes OLTP
  “Wolfpack” Windows NT clusters for scaleability, availability, manageability
  ActiveX object model as structuring principle
  OLE DB (DAO) for data sources
  MTX as a new programming paradigm
  MTX as a server
  Distributed transactions to coordinate components
  “Falcon” queues for asynchronous processing

“Wolfpack” Windows NT Clusters: The Great Hope

Tandem, Teradata, and VAX clusters are proprietary
Microsoft and 60 vendors are defining Windows NT Clusters:
  Code name “Wolfpack”
  Almost all big hardware and software vendors involved
  No special hardware needed, but it may help
Key goals:
  Easy: to install, manage, program
  Reliable: more reliable than a single node
  Scaleable: added parts add throughput

Initial “Wolfpack” is two-node failover:
  Each node can be a 4x (or more) SMP
  File, print, Internet, mail, DB, other services
  Easy to manage
Next (NT5) “Wolfpack” is a modest-size cluster:
  About 16 nodes (so 64 to 128 CPUs)
  No hard limit; the algorithms are designed to go further
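
A minimal sketch of the two-node failover idea: each node watches its partner's heartbeat and takes over the shared services if the partner goes silent. This illustrates the concept only; it is not the Wolfpack API, and the timeout, service list, and helper are hypothetical:

```python
import time

HEARTBEAT_TIMEOUT = 5.0          # seconds of silence before declaring the partner dead
SERVICES = ["file", "print", "web", "mail", "db"]

def services_to_run(last_heartbeat_from_partner: float, now=time.monotonic):
    """Return the services this node should run, given the partner's last heartbeat."""
    if now() - last_heartbeat_from_partner > HEARTBEAT_TIMEOUT:
        # Partner looks dead: this node takes over everything (failover).
        return SERVICES
    # Partner is alive: run only this node's own half of the services.
    return SERVICES[: len(SERVICES) // 2]
```
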
Outline

Why scaleable servers?
Problems and solutions for scaleable servers:
  How Internet Information Server revolutionizes OLTP
  “Wolfpack” Windows NT clusters for scaleability, availability, manageability
  ActiveX object model as structuring principle
  OLE DB (DAO) for data sources
  MTX as a new programming paradigm
  MTX as a server
  Distributed transactions to coordinate components
  “Falcon” queues for asynchronous processing

The BIG Picture: Components and Transactions

Software modules are objects
An Object Request Broker (a.k.a. Transaction Processing Monitor) connects objects (clients to servers)
Standard interfaces allow software plug-ins
A transaction ties execution of a “job” into an atomic unit: all-or-nothing, durable, isolated

Object Request Broker: Component Object Model

COM is the Microsoft model, the engine inside OLE; ALL Microsoft software is based on COM (ActiveX)
CORBA + OpenDoc is the equivalent
There is heated debate over which is best
Both share the same key goals:
  Encapsulation: hide implementation
  Polymorphism: generic operations (key to GUI and reuse)
  Versioning: allow upgrades
  Transparency: local/remote
  Security: invocation can be remote
  Shrink-wrap: minimal inheritance
  Automation: easy
COM is now managed by the Open Group
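
A tiny illustration (plain Python, not COM or CORBA) of the encapsulation and polymorphism goals listed above: callers program against a published interface, and any component that implements it can be plugged in. The interface and the two components are made up for this example:

```python
from abc import ABC, abstractmethod

class IStockQuote(ABC):
    """The published interface; the implementation behind it stays hidden."""
    @abstractmethod
    def price(self, symbol: str) -> float: ...

class LocalQuoteCache(IStockQuote):
    def __init__(self, table):
        self._table = table
    def price(self, symbol):
        return self._table[symbol]

class RemoteQuoteServer(IStockQuote):
    def __init__(self, host):
        self._host = host
    def price(self, symbol):
        raise NotImplementedError("would call the remote server here")

def show_price(quotes: IStockQuote, symbol: str):
    # Generic operation: works on any component exposing IStockQuote.
    print(symbol, quotes.price(symbol))
```
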
OLE DB: Objects Meet Databases
The basis for universal data servers, access, and integration

OLE DB is an object-oriented (COM-oriented) programming interface to data
It breaks the DBMS into components
Anything can be a data source
Optimization/navigation sits “on top of” other data sources
A way to componentize a DBMS
Makes an RDBMS an O-R DBMS (assumes the optimizer understands objects)

[Diagram: the DBMS engine sits on top of many data sources: database, spreadsheet, photos, mail, map, document]
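
A hedged sketch of the "anything can be a data source" idea: every source exposes the same minimal rowset interface, so one query operator can run over a CSV spreadsheet or a mail folder alike. This is illustrative Python, not the OLE DB COM interfaces:

```python
from abc import ABC, abstractmethod
from typing import Iterator

class RowSource(ABC):
    @abstractmethod
    def rows(self) -> Iterator[dict]: ...

class CsvSource(RowSource):
    def __init__(self, path):
        self._path = path
    def rows(self):
        import csv
        with open(self._path, newline="") as f:
            yield from csv.DictReader(f)

class MailSource(RowSource):
    def __init__(self, messages):
        self._messages = messages
    def rows(self):
        for m in self._messages:
            yield {"from": m["from"], "subject": m["subject"]}

def select(source: RowSource, predicate):
    """A tiny query operator that works over any data source."""
    return [row for row in source.rows() if predicate(row)]
```
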
Transactions Coordinate Components (ACID)

Programmer’s view: bracket a collection of actions
A simple failure model: only two outcomes

  Success:  Begin() ... action, action, action ... Commit()    =>  Success!
  Failure:  Begin() ... action, action, action ... Rollback()  =>  Failure!
            (either the program calls Rollback(), or a failure forces one)
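
A minimal sketch of the Begin/Commit/Rollback bracket, using Python's built-in sqlite3 module as a stand-in resource manager; the accounts table and the transfer are made up:

```python
import sqlite3

def transfer(db_path: str, from_acct: int, to_acct: int, amount: int) -> bool:
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.cursor()                                          # Begin()
        cur.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                    (amount, from_acct))                             # action
        cur.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                    (amount, to_acct))                               # action
        conn.commit()                                                # Commit() -> Success!
        return True
    except sqlite3.Error:
        conn.rollback()                                              # Rollback() -> Failure!
        return False
    finally:
        conn.close()
```
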
Distributed Transactions Enable Huge Throughput

Each node is capable of 7 KtpmC (7,000 active users!)
Nodes can be added to the cluster (to support 100,000 users)
Transactions coordinate the nodes
The ORB / TP monitor spreads work among the nodes

Distributed Transactions Enable Huge DBs

Distributed database technology spreads data among the nodes
Transaction processing technology manages the nodes
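
The coordination behind both slides is two-phase commit: the coordinator asks every participating node to prepare, and only if all vote yes does it tell them to commit. The sketch below is a bare-bones illustration of that idea, not the DTC protocol or API; the Node interface is hypothetical:

```python
class Node:
    def prepare(self, txn_id) -> bool: ...   # vote yes/no after making changes durable
    def commit(self, txn_id) -> None: ...
    def abort(self, txn_id) -> None: ...

def two_phase_commit(nodes, txn_id) -> bool:
    # Phase 1: every participant must vote yes.
    if all(node.prepare(txn_id) for node in nodes):
        for node in nodes:                   # Phase 2: commit everywhere.
            node.commit(txn_id)
        return True
    for node in nodes:                       # Any "no" vote aborts everywhere.
        node.abort(txn_id)
    return False
```
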
Microsoft Transaction Service: A New Programming Paradigm

Design and development phase (all on the desktop):
  Develop your ActiveX objects on the desktop
  Better yet: download them from the Net
  Script your work flows as invocations of ActiveX objects (sketched below)

Deployment phase:
  Then move the work flows and objects to the server(s)
  This gives desktop development and three-tier deployment

[Diagram: the client keeps the presentation layer; the server(s) run the workflow layer, application objects, and database layer inside the MTX execution environment]
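
A hedged sketch of "script your work flows as invocations of objects": the workflow layer is a small script that calls application objects, and the same script can later be moved to the server. The object names and methods are hypothetical, not actual ActiveX components:

```python
def order_workflow(catalog, inventory, billing, order):
    """Workflow layer: a scripted sequence of application-object calls."""
    item = catalog.lookup(order["sku"])               # application object 1
    inventory.reserve(item, order["quantity"])        # application object 2
    invoice = billing.charge(order["customer"],       # application object 3
                             item.price * order["quantity"])
    return invoice
```
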
MTX Provides Server-Side Execution Environment

Accepts ActiveX objects
Manages bindings (it’s an ORB)
Efficient (pre-bound servers)
Manages thread pools
Manages security
Includes transaction services
Provides an operator interface
GUI administrative interface

[Diagram: the structure of a scaleable server. Clients arrive over the network at a receiver and queue; a thread pool runs the service logic against shared data. Around it sit the classic TP-monitor concerns: connections, context, security, authentication, object handles, scheduling and load balancing, deadlocks and starvation, synchronization, configuration, directory registration, congestion and flow control, national language, and management.]
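
Two of the services above, thread pooling and pre-bound connections, in a small standard-library-only Python sketch; the service logic and connection factory are placeholders, not the MTX API:

```python
import queue
from concurrent.futures import ThreadPoolExecutor

class ConnectionPool:
    """Pre-bound servers: connections are opened once and reused per request."""
    def __init__(self, factory, size):
        self._pool = queue.Queue()
        for _ in range(size):
            self._pool.put(factory())
    def acquire(self):
        return self._pool.get()
    def release(self, conn):
        self._pool.put(conn)

def serve(requests, pool, service_logic, workers=8):
    def handle(req):
        conn = pool.acquire()
        try:
            return service_logic(conn, req)   # the application object's work
        finally:
            pool.release(conn)
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(handle, requests))
```
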
MTX Also Coordinates And Interoperates

Coordinates distributed transactions. The client application runs:

  Begin distributed transaction:
    Update sales
    Update inventory
    Update warranty
  Commit

[Diagram: the transaction spans three Windows NT Servers, each running DTC: a SQL Server holding Sales, a SQL Server holding Inventory, and another DBMS holding Warranty]

MTX Also Coordinates And Interoperates

Interoperates with the Internet and with legacy systems

[Diagram: a browser/client talks to Windows NT Server 4.0 running Internet Information Server, MTx, and ActiveX components; MTx reaches SQL Server and other DBMSs via OLETX and XA, and CICS/MVS via SNA Server over LU6.2]

“Falcon” Queue Management
Asynchronous transaction processing

Many tasks are time-shifted (mail, book, newspaper rather than conversation, money, lecture, concert)
“Falcon” gives a QUEUE mechanism: message-oriented middleware
Decouples the client from the server
The server works on priority queues

[Diagram: the client puts requests into a queue across the network; the server drains the queue and updates the database]
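
A minimal sketch of the queued, time-shifted style: the client enqueues a request and goes on with its work, while the server drains a priority queue and processes each request (for example, as its own transaction). The priorities and the handler are made up for illustration; this is not the "Falcon" API:

```python
import itertools
import queue

work = queue.PriorityQueue()
_seq = itertools.count()    # tie-breaker so equal priorities never compare payloads

def client_submit(priority: int, request: dict):
    """Client side: enqueue and return immediately (decoupled from the server)."""
    work.put((priority, next(_seq), request))

def server_loop(process):
    """Server side: take the highest-priority request and process it."""
    while not work.empty():
        _, _, request = work.get()
        process(request)
        work.task_done()
```
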
Outline

Why scaleable servers?
Problems and solutions for scaleable servers:
  How Internet Information Server revolutionizes OLTP
  “Wolfpack” Windows NT clusters for scaleability, availability, manageability
  ActiveX object model as structuring principle
  OLE DB (DAO) for data sources
  MTX as a new programming paradigm
  MTX as a server
  Distributed transactions to coordinate components
  “Falcon” queues for asynchronous processing