Building
Scalable Web Architectures
Aaron Bannert
[email protected] / [email protected]
Goal
To build a reliable, scalable, cheap, flexible,
extendable internet application.
The Age of LAMP
What does a LAMP architecture
give us?
Scalability
Grows in small steps
Stays up when it counts
Can grow with your traffic
Room for the future
Reliability
High Quality of Service
Minimal Downtime
Stability
Redundancy
Resilience
Low Cost
Little or no software licensing costs
Minimal hardware requirements
Abundance of talent
Reduced maintenance costs
Flexible
Modular Components
Public APIs
Open Architecture
Vendor Neutral
Many options at all levels
Extendable
Free/Open Source Licensing
Right to Use
Right to Inspect
Right to Improve
Plugins
Some Free
Some Commercial
Can always customize
Free as in Beer?
Price
Speed
Quality
Pick any two.
LAMP-like Architectures
The Big Picture
External Caching Tier
Web Serving Tier
Application Server Tier
Internal Cache Tier
Database Tier
Misc. Services (DNS, Mail, etc…)
The Glue
•Routers
•Switches
•Firewalls
•Load Balancers
Software Choices
Building LAMP Software
External Caching Tier
What is this?
Squid
Apache’s mod_proxy
Commercial HTTP Accelerator
External Caching Tier
What does it do?
Caches outbound HTTP objects
Images, CSS, XML, HTML, etc…
Flushes Connections
Useful for modem users, frees up web tier
Denial of Service Defense
External Caching Tier
Hardware Requirements
Lots of Memory
Moderate to little CPU
Fast Network
Moderate Disk Capacity
Room for cache, logs, etc… (disks are cheap)
One slow disk is OK
Two Cheapies > One Expensive
External Caching Tier
Other Questions
What to cache?
How much to cache?
Where to cache (internal vs. external)?
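One way to approach the "what to cache" question is a small policy table that maps content types to cache lifetimes, which the web tier then emits as Cache-Control headers for the caching tier to honor. The sketch below is illustrative only: the CACHE_POLICY table, its lifetimes, and the cache_control helper are assumptions, not something taken from the talk.

```python
# Hypothetical policy: content-type prefix -> lifetime in seconds.
# The values are placeholders; tune them for your own content.
CACHE_POLICY = {
    "image/": 86400,      # images rarely change
    "text/css": 86400,    # stylesheets rarely change
    "text/html": 60,      # pages change more often
}


def cache_control(content_type, personalized=False):
    """Return a Cache-Control header value for a response."""
    if personalized:
        # Per-user responses must never be stored by a shared cache.
        return "private, no-store"
    for prefix, max_age in CACHE_POLICY.items():
        if content_type.startswith(prefix):
            return f"public, max-age={max_age}"
    # Unknown types: let caches store them but revalidate every time.
    return "no-cache"
```

Keeping the policy in one table makes the "how much to cache" question a matter of changing numbers rather than code.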
Web Serving Tier
What is this?
Apache
thttpd
Tux Web Server
IIS
Netscape
Web Serving Tier
What does it do?
HTTP, HTTPS
Serves Static Content from disk
Generates Dynamic Content
CGI/PHP/Python/mod_perl/etc…
Dispatches requests to the App Server Tier
Tomcat, Weblogic, Websphere, JRun, etc…
Web Serving Tier
Hardware Requirements
Lots and lots of Memory
Memory is main bottleneck in web serving
Memory determines max number of users
Fast Network
CPU depends on usage
Dynamic content needs CPU
Static file serving requires very little CPU
Cheap slow disk, enough to hold your content
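The claim that memory determines the maximum number of users can be turned into rough arithmetic: divide the RAM left after the OS and other services by the footprint of one server process. This is a back-of-the-envelope sketch; the function name and the figures in the usage note are assumptions, and real per-process sizes have to be measured.

```python
def max_clients(total_ram_mb, reserved_mb, per_process_mb):
    """Estimate how many server processes fit in RAM without swapping.

    total_ram_mb   -- physical memory in the box
    reserved_mb    -- memory set aside for the OS, caches, other daemons
    per_process_mb -- measured resident size of one server process
    """
    usable = total_ram_mb - reserved_mb
    if usable <= 0 or per_process_mb <= 0:
        return 0
    return usable // per_process_mb
```

For example, a 4096 MB box with 512 MB reserved and 20 MB per process gives max_clients(4096, 512, 20) == 179 concurrent processes.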
Web Serving Tier: Zero-copy
Performance Hint
Dedicated static content servers
Modern web servers are very good at serving static
content such as
• HTML
• CSS
• Images
• Zip/GZ/Tar files
Web Serving Tier
Performance Hint
Stateless Sessions
Each connection is a fresh start
Server remembers nothing
Benefits?
Allows Better Caching
Scales Horizontally
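One common way to keep the server stateless is to put the session identity in a signed cookie, so any web server can verify it without shared memory. The following is a minimal sketch of that idea, not the approach described in the talk: SECRET_KEY, sign_session, and verify_session are hypothetical names, and a real deployment would also need expiry and key rotation.

```python
import hashlib
import hmac

# Hypothetical secret shared by every web server in the tier.
SECRET_KEY = b"change-me"


def _mac(user_id):
    """Hex HMAC-SHA256 of the user id under the shared secret."""
    return hmac.new(SECRET_KEY, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def sign_session(user_id):
    """Build a cookie value that carries the user id plus its signature."""
    return f"{user_id}:{_mac(user_id)}"


def verify_session(cookie):
    """Return the user id if the signature checks out, else None."""
    user_id, sep, mac = cookie.rpartition(":")
    if not sep:
        return None
    # Compare as bytes in constant time to avoid timing leaks.
    if hmac.compare_digest(mac.encode("utf-8"), _mac(user_id).encode("utf-8")):
        return user_id
    return None
```

Because verification needs only the cookie and the shared secret, a request can land on any box, which is what lets the tier grow by adding machines.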
Web Serving Tier
Choices
How much dynamic content?
When to offload dynamic processing?
When to offload database operations?
When to add more web servers?
Application Server Tier
What does it do?
Dynamic Page Processing
JSP
Servlets
Standalone mod_perl/PHP/Python engines
Internal Services
E.g. Search, Shopping Cart, Credit Card Processing
Application Server Tier
How does it work?
1. Web Tier generates the request using
HTTP (aka “REST”, sort of)
RPC/CORBA
Java RMI
XML-RPC/SOAP
(or something homebrewed)
2. App Server processes request and responds
Application Server Tier
Caveats
Decoupling of services is GOOD
Manage Complexity using well-defined APIs
Don’t decouple for scaling; change your algorithms!
Remote Calling overhead can be expensive
Marshaling of data
Sockets, net latency, throughput constraints…
XML, SOAP, XML-RPC, yuck (don’t scale well)
Better to use Java’s RMI, good old RPC, or even CORBA
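The marshaling cost mentioned above can be made visible by encoding the same arguments two ways and comparing payload sizes. This sketch uses Python's standard xmlrpc.client and struct modules; the encoded_sizes helper and the "sum" method name are assumptions for illustration, and payload size is only one part of remote-call overhead, not a benchmark.

```python
import struct
import xmlrpc.client


def encoded_sizes(values):
    """Return (xml_bytes, binary_bytes) for a list of 32-bit integers."""
    # XML-RPC wraps every value in tags inside a methodCall envelope.
    xml_payload = xmlrpc.client.dumps((values,), methodname="sum")
    # A fixed binary layout: one 4-byte big-endian integer per value.
    binary_payload = struct.pack(f"!{len(values)}i", *values)
    return len(xml_payload.encode("utf-8")), len(binary_payload)
```

For three integers the binary form is 12 bytes, while the XML form is many times larger because of its envelope and per-value tags.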
Application Server Tier
More Caveats
Remote Calling can introduce new failure
scenarios
Classic Distributed Problems
• How to detect remote failures?
• How long to wait until deciding it’s failed?
How to react to remote failures?
What do we do when all app servers have failed?
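The questions above (how to detect a failure, how long to wait, how to react) can be sketched as a caller that gives each app server a fixed timeout and moves on to the next one on error. This is one possible policy under stated assumptions, not the talk's prescription: call_with_failover, http_call, and the /status path are hypothetical.

```python
import urllib.request


def http_call(server, timeout_s):
    """Hypothetical remote call: fetch a status page from one app server."""
    url = f"http://{server}/status"  # assumed endpoint, for illustration only
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.read()


def call_with_failover(servers, call, timeout_s=2.0):
    """Try each server in order and return the first successful result.

    Detection: any OSError (refused connection, timeout, URL error).
    Waiting:   at most timeout_s per server.
    Reaction:  fall through to the next server; raise when all have failed.
    """
    errors = []
    for server in servers:
        try:
            return call(server, timeout_s)
        except OSError as exc:
            errors.append((server, exc))
    raise RuntimeError(f"all app servers failed: {errors}")
```

The final RuntimeError is where the "all app servers have failed" decision lives: the web tier can catch it and serve a cached or degraded page instead of hanging.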
Application Server Tier
Hardware Requirements
Lots and Lots and Lots of Memory
App Servers are very memory hungry
Java was hungry to begin with
Consider going to 64bit for larger memory-space
Disk depends on application, typically minimal needed
FAST CPU required, and lots of them
(This will be an expensive machine.)
Database Tier
Available DB Products
Free/Open Source DBs
PostgreSQL
GNU DBM
Ingres
SQLite
Commercial
Oracle
MS SQL
IBM DB2
Sybase
SleepyCat
MySQL
SQLite
mSQL
Berkeley DB
Database Tier
What does it do?
Data Storage and Retrieval
Data Aggregation and Computation
Sorting
Filtering
ACID properties
(Atomic, Consistent, Isolated, Durable)
Database Tier
Choices
How much logic to place inside the DB?
Use Connection Pooling?
Data Partitioning?
Spreading a dataset across multiple logical database
“slices” in order to achieve better performance.
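A simple way to spread a dataset across slices is to hash each row's key and take the result modulo the number of slices. The sketch below assumes hash-based partitioning, which is only one of several schemes; slice_for and the key format are hypothetical.

```python
import hashlib


def slice_for(key, num_slices):
    """Map a key (for example a user id) to a slice index in [0, num_slices)."""
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    # Use a stable hash: Python's built-in hash() varies between processes.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_slices
```

Every server computes the same slice for the same key, so no lookup table is needed. The trade-off is that changing num_slices remaps most keys, so the slice count is best chosen with growth in mind.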
Database Tier
Hardware Requirements
Entirely dependent upon application.
Likely to be your most expensive machine(s).
Tons of Memory
Spindles galore
RAID is useful (in software or hardware)
Reliability usually trumps Speed
• RAID levels 0, 5, 1+0, and 5+0 are useful
CPU also important
Dual power supplies
Dual Network
Internal Cache Tier
What is this?
Object Cache
What Applications?
Memcache
Local Lookup Tables
BDB, GDBM, SQL-based
Application-local Caching (e.g. LRU tables)
Homebrew Caching (disk or memory)
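The application-local LRU table mentioned above can be sketched in a few lines with an ordered dictionary: reads move an entry to the "recent" end, and inserts evict from the "old" end once the table is full. LRUTable is a hypothetical name, and this is a single-threaded illustration rather than a production cache.

```python
from collections import OrderedDict


class LRUTable:
    """Fixed-size lookup table that evicts the least recently used entry."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key, default=None):
        if key not in self._items:
            return default
        # Mark the entry as most recently used.
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            # Drop the entry at the least recently used end.
            self._items.popitem(last=False)
```

A table like this sits inside the application process, so a hit costs no network round trip at all. The price is that each process holds its own copy.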
Internal Cache Tier
What does it do?
Caches objects closer to the
Application or Web Tiers
Tuned for your application
Very Fast Access
Scales Horizontally
Internal Cache Tier
Hardware Requirements
Lots of Memory
Note that 32bit processes are typically limited to 2GB
of RAM
Little or no disk
Moderate to low CPU
Fast Network
Misc. Services (DNS, Mail, etc…)
Why mention these?
Every LAMP system has them
Crucial but often overlooked
Source of hidden problems
Misc. Services: DNS
Important Points
Always have an offsite NS slave
Always have an onsite NS slave
Minimize network latency
Don’t use NAT, load balancers, etc…
Misc. Services: Time Synchronization
Synchronize the clocks on your systems!
Hints:
Use ntpdate at boot time to set the clock
Use ntpd to stay in sync
Don’t ever change the clock on a running
system!
Misc. Services: Monitoring
System Health Monitoring
Nagios
Big Brother
Orcallator
Ganglia
Fault Notification
The Glue
•Routers
•Switches
•Firewalls
•Load Balancers
Routers and Switches
Expensive
Complex
Crucial Piece of the System
Hints
Use GigE if you can
Jumbo Frames are GOOD
VLANs to manage complexity
LACP (802.3ad) for failover/redundancy
Load Balancers
Hardware vs. Software
Software is complex to set up, but cheaper
Hardware is expensive, but dedicated
IMHO: Use SW at first, graduate to HW
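To give a feel for what a software balancer does at its core, here is a round-robin selector that skips backends marked unhealthy. It is a toy sketch under stated assumptions: RoundRobinBalancer and the backend names are hypothetical, and a real balancer would add health probes, connection handling, and locking.

```python
import itertools


class RoundRobinBalancer:
    """Hand out backends in rotation, skipping those marked down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._cycle = itertools.cycle(self.backends)

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def next_backend(self):
        # Look at each backend at most once per call.
        for _ in range(len(self.backends)):
            candidate = next(self._cycle)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends")
```

Starting in software like this keeps the cost low while traffic is small; the same rotation-plus-health-check logic is what a dedicated hardware unit performs at higher speed.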
Load Balancers
What services to balance?
HTTP Caches and Servers, App Servers, DB Slaves
What NOT to balance?
DNS
LDAP
NIS
Memcache
Spread
Anything with its own built-in balancing
Message Busses
What is out there?
Spread
JMS
MQSeries
Tibco Rendezvous
What does it do?
Various forms of distributed message delivery.
Guaranteed Delivery, Broadcasting, etc…
Useful for heterogeneous distributed systems
What about the OS?
Operating System Selection
Lots of OS choices
Linux
FreeBSD
NetBSD
OpenBSD
OpenSolaris
Commercial Unix
What’s Important?
Maintainability
Upgrade Path
Security Updates
Bug Fixes
Usability
Do your engineers like it?
Cost
Hardware Requirements
(you don’t need a commercial Unix anymore)
Features to look for
Multi-processor Support
64bit Capable
Mature Thread Support
Vibrant User Community
Support for your devices
Hardware Choices
Building LAMP Hardware
Commodity Hardware Discussion
Consistency vs. Specialization
Consistency reduces maintenance costs
Less Burn-in testing
Fewer drivers to support
Fewer OS variants
Fewer types of security updates, upgrades
In Short: “Don’t throw hardware at the problem.”
However, specialization may improve ROI
Put the money where best needed
Commodity Hardware Discussion
What I do when planning for growth:
Specialize in the beginning
When cost is more important
And designs aren’t yet mature
Design for horizontal scalability
Plan on machine-sized pieces
Want to grow by just adding more boxes
Eventually settle on two or three machine types
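Planning in "machine-sized pieces" reduces capacity planning to simple arithmetic: divide expected peak load by what one box can serve, round up, and add spares. The function below is a rough sketch; boxes_needed, its parameters, and the example figures are assumptions, and per-box throughput has to come from your own load tests.

```python
import math


def boxes_needed(peak_rps, rps_per_box, spares=1):
    """Number of identical machines needed for a given peak load.

    peak_rps    -- expected peak requests per second
    rps_per_box -- measured requests per second one box can sustain
    spares      -- extra boxes kept for failures and maintenance
    """
    if rps_per_box <= 0:
        raise ValueError("rps_per_box must be positive")
    return math.ceil(peak_rps / rps_per_box) + spares
```

For example, a peak of 950 requests per second on boxes that each sustain 200 gives boxes_needed(950, 200) == 6: five to carry the load plus one spare.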
In-House vs. Colocation
Almost no reason to stay in-house these
days
Colos keep getting cheaper
Leased lines are still expensive
Beige-Box vs. Name Brand
Determine your Req’s ahead of time
Talk to your engineers First
How important is a support plan?
Hardware will break, plan on it
Name Brand usually has fewer options
Works well if they have exactly what you need
Seek a neutral technical advisor
In the end it should come down to cost
Disk Drive Technologies
SCSI
Expensive
Big (300GB)
Fast
Reliable
IDE
Cheap
Huge (500GB!)
Slow
On-board support, often w/ RAID0/1
Use SCSI for Performance
Use IDE for cluster nodes
IDE w/ RAID for cheap speed
Disk Drive Technologies
PATA
Tried and Tested
Obsolete
SATA
Immature drivers
Particularly w/ OSS
• Linux has poor support
Prices coming down
Unnecessary addons
Hot Swap not often needed, costs more
SATA is not SCSI
The fast SATAs cost as much as SCSI
SATA not quite there for servers
Disk Drive Technology: Spindles
Number of Spindles
More spindles can give
Higher Throughput
Higher Concurrency
• Concurrency is crucial for Databases
Reliability
• Failover drives, mirrors
Memory Technologies
ECC
Expensive
Use only for keystone machines
Non-ECC
Cheap, Fast
Use for cluster nodes
Processor Technologies: SMP
Multiple Processors
Significantly higher cost
ECC, Dual Power
Expensive Chassis, Motherboard
Less Reliable
More parts to break
Requires MP-capable OS
Good in Linux 2.6, Solaris, FreeBSD 5.x
Dual CPU systems cost more than double
Possible exception: Dual-Core CPUs
Processor Technologies: 64bit
Most 32bit OSes limit each process to 2GB
Some 32bit BIOSes are limited to 3.6GB RAM
64bit chips are still expensive
64bit OSes are becoming quite mature
Solaris 10 (AMD64)
Linux 2.6 (x86_64)
Programs work but not yet tuned
Java looks good
MySQL not so good
Summary
Design for Horizontal Scalability
Design Stateless Systems
Decouple Internal Services
Write well-defined APIs
Use Commodity Parts
Standardize Hardware
Use Commodity Software
(Open Source!)
Avoid Fads
THE END
Thank You