CSE 190: Internet E-Commerce
Lecture 14: Operations
Operations
• Everything it takes to keep a web site up and running, 24x7
– Deployment Process
– Monitoring (SNMP)
– Build system
– Link rot
– Maintenance window
– Load testing
– Browser compliance
– Log rotation
– Database backups
– Disk failure
– Router failure
– Robots
– Staffing
– Data centers
• Expense of running a high availability site is comparable to running a physical store front
Deployment Process
• Proceeds in three phases
– Development
• Within corporation, not accessible outside
– Stage
• Within internet environment
• User acceptance testing (UAT) run here
• Only operations staff may access
– Live
• Accessible to outside world
Monitoring
• SNMP (Simple Network Management Protocol)
– Used to monitor both hardware and software
– Provides: Counters, Values, Triggers, Statistics
– Remote control of services
– Information stored in MIB (Management Information Base)
– RMON sometimes used as alternative to SNMPv2
• Software
– HP OpenView
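As a rough illustration of the counter and trigger ideas above, the sketch below turns two samples of an ever-increasing counter into a rate and tests it against a threshold. It does not speak SNMP itself; the sample values, the threshold, and the function names are all assumptions made for the example. The one protocol detail it relies on is that a 32-bit SNMP counter wraps around at 2^32.

```python
COUNTER32_MODULUS = 2 ** 32  # a 32-bit SNMP counter wraps back to 0 at this value


def counter_delta(previous, current, modulus=COUNTER32_MODULUS):
    """Increase between two counter samples, allowing for one wraparound."""
    return (current - previous) % modulus


def rate_per_second(previous, current, interval_seconds):
    """Average rate of increase over the polling interval."""
    return counter_delta(previous, current) / interval_seconds


def trigger_fired(rate, threshold):
    """In this sketch a trigger is just a threshold test on a derived value."""
    return rate > threshold


# Hypothetical samples of a byte counter taken 60 seconds apart;
# the counter wrapped between the two polls.
before, after = 4_294_960_000, 52_704
rate = rate_per_second(before, after, 60)
print(rate, trigger_fired(rate, threshold=500.0))
```

Taking the difference modulo 2^32 keeps the rate correct across a single wrap, which a plain subtraction would report as a large negative number.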
Maintenance Window
• Installation
– Standard: J2EE standard web service descriptor (XML file with tarball of
files)
– InstallShield
– Custom installation scripts
• Upgrades
– Defined time on Friday or weekend to upgrade site, posted on web site
– Process:
• Front page linked to ‘Site down’
• Load balancer redirected if appropriate
• Application stops accepting new clients
• (Pause) Application terminates all active sessions
• Application upgraded
• Sanity checks performed
• Servers rebooted
• Load balancer restored
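The upgrade process above is an ordered checklist, and one way to picture it is as a list of steps that stops at the first failure, so a failed sanity check never leads to the load balancer being restored. This is only a sketch: the step actions are hypothetical placeholders that return True or False, and stopping on failure is an assumption about how such a script might behave, not something the slides specify.

```python
def run_maintenance(steps):
    """Run (name, action) steps in order; stop at the first action that returns False.

    Returns the names of the steps that completed successfully.
    """
    completed = []
    for name, action in steps:
        if not action():
            break
        completed.append(name)
    return completed


def ok():
    """Hypothetical placeholder for an operational step that succeeded."""
    return True


def failed():
    """Hypothetical placeholder for an operational step that failed."""
    return False


STEP_NAMES = [
    "link front page to 'Site down'",
    "redirect load balancer",
    "stop accepting new clients",
    "terminate active sessions",
    "upgrade application",
    "run sanity checks",
    "reboot servers",
    "restore load balancer",
]

# Simulate a run in which the sanity checks fail.
steps = [(name, failed if name == "run sanity checks" else ok) for name in STEP_NAMES]
done = run_maintenance(steps)
print(done)
```

In the simulated run the site stays behind the 'Site down' page, because the steps after the failed check are never reached.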
Link Rot
• Link rot: the continual process by which links become invalid over time
• Tracked with custom tools
• Best practice: Pages have permanent URLs
• Referral field (the HTTP Referer request header):
– Tracking this in logs shows who’s linking to what URL on your site
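A custom tool for this can be quite small. The sketch below scans access-log lines, reads the referral field (the HTTP Referer header), and counts which outside pages link to URLs that now return 404. It assumes the server writes the Apache "combined" log layout; the sample lines, host names, and function name are made up for illustration.

```python
import re
from collections import Counter

# Assumed Apache "combined" layout:
# host ident user [time] "METHOD path protocol" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]*\] "\S+ (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)"'
)


def broken_inbound_links(lines):
    """Count (referer, path) pairs for requests that ended in a 404."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or match.group("status") != "404":
            continue
        referer = match.group("referer")
        if referer not in ("", "-"):  # "-" means no referer was sent
            counts[(referer, match.group("path"))] += 1
    return counts


# Made-up sample entries, for illustration only.
SAMPLE_LOG = [
    '10.0.0.1 - - [01/Jan/2002:10:00:00 -0800] "GET /catalog/old.html HTTP/1.0" '
    '404 209 "http://example.org/links.html" "Mozilla/4.0"',
    '10.0.0.2 - - [01/Jan/2002:10:00:05 -0800] "GET /index.html HTTP/1.0" '
    '200 5120 "-" "Mozilla/4.0"',
    '10.0.0.3 - - [01/Jan/2002:10:00:09 -0800] "GET /catalog/old.html HTTP/1.0" '
    '404 209 "http://example.org/links.html" "Mozilla/4.0"',
]

stale = broken_inbound_links(SAMPLE_LOG)
for (referer, path), hits in stale.most_common():
    print(hits, referer, "->", path)
```

Sorting by hit count puts the most-followed broken links first, which is a reasonable order in which to add redirects.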
Load Testing
• Network load (60% bandwidth max)
– Average page size (~20-30k)
• CPU load: Occurs on at least three levels
– HTTP level
– Application level
– DB query level
– Metrics: maximum number of simultaneous users, latency vs. users
• Memory usage (256 M – 1 G per machine)
• Disk I/O load
– 1 Gb per machine typical
• Tools
– Mercury Interactive: WinRunner
– Segue: SilkTest
– Rational: SiteLoad
– Microsoft: WCAT
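The network-load figures above lend themselves to a quick back-of-the-envelope calculation: cap the link at 60% of its bandwidth and divide by the average page size to get a sustainable page rate. The sketch below does that for a T1 line (1.544 Mbit/s). The choice of a T1, the reading of "k" as kilobytes of 1024 bytes, and the function name are assumptions made for the example.

```python
def max_pages_per_second(link_bits_per_second, page_bytes, utilization=0.60):
    """Pages per second a link can carry when capped at the given utilization."""
    usable_bytes_per_second = link_bits_per_second * utilization / 8
    return usable_bytes_per_second / page_bytes


T1_BITS_PER_SECOND = 1_544_000  # T1 line rate; the link type is an assumption

for page_kb in (20, 30):
    pages = max_pages_per_second(T1_BITS_PER_SECOND, page_kb * 1024)
    print(f"{page_kb} KB pages: about {pages:.1f} pages/second")
```

The same function can be reused with any link speed to see how far the 60% ceiling is from the peak page rate seen in the logs.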
Browser Compatibility
• Cost of testing proportional to the number of platforms you’re compatible with
• The same product isn’t the same on different operating systems
– E.g. IE4.5 isn’t the same on Mac vs. Windows
• Incompatible DOMs between MS, Netscape, Mozilla
• Browser archive
– http://browsers.evolt.org/
Robots
• Robots: Automatically traverse web pages to retrieve documents, link structure, data
• Used for:
– Indexing
– HTML validation
– Link validation
– Mirroring
• Problems:
– Too much rapid access from single IP
– May be indexing dynamic, obsolete data
• Robot exclusion file:
# /robots.txt file for mysite.com
User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /jsp
Disallow: /logs
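To see how a well-behaved robot would read the exclusion file above, the sketch below feeds the same rules to Python's standard urllib.robotparser module and asks whether a given agent may fetch a given path. The helper name allowed, the agent name somebot, and the example paths under /jsp and /logs are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# /robots.txt file for mysite.com
User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /jsp
Disallow: /logs
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, path):
    """True if the rules let this user agent fetch the path on mysite.com."""
    return parser.can_fetch(agent, "http://mysite.com" + path)


# Hypothetical agents and paths to probe each record in the file.
for agent, path in [
    ("webcrawler", "/jsp/cart.jsp"),
    ("lycra", "/index.html"),
    ("somebot", "/jsp/cart.jsp"),
    ("somebot", "/index.html"),
]:
    print(agent, path, allowed(agent, path))
```

An empty Disallow line means nothing is excluded, so webcrawler may fetch everything, lycra nothing, and every other robot everything outside /jsp and /logs. The file is advisory: it only restrains robots that choose to honor it.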
Failure Models
• Mean Time To Failure (MTTF) = average amount of time the system is up
• Mean Time Between Failures (MTBF) = average amount of time between failures
• Mean Time To Repair (MTTR) = average amount of time the system is down after it fails, counting active repair time only (diagnostics and repair)
• Mean Down Time (MDT) = average amount of time the system is down after it fails, counting active repair time + preventive maintenance + logistics time (time spent waiting for personnel, etc.)
• Intrinsic availability = MTTF / (MTTF + MTTR)
• Operational availability = MTBF / (MTBF + MDT)
[Figure: Hardware Failure Rate over time, with phases Burn in, Useful Life, Wear out]
[Figure: Software Failure Rate over time, with phases Integration & test, Useful Life, Obsolete]
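The two availability ratios are easy to work through with numbers. The sketch below applies intrinsic availability = MTTF / (MTTF + MTTR) and operational availability = MTBF / (MTBF + MDT) to a set of hypothetical figures in hours, then converts the operational figure into expected downtime per year. All of the input values are invented for the example.

```python
def availability(mean_uptime, mean_downtime):
    """Fraction of time the system is up: uptime / (uptime + downtime)."""
    return mean_uptime / (mean_uptime + mean_downtime)


# Hypothetical figures, all in hours.
mttf, mttr = 2000.0, 2.0   # time to failure, active repair time
mtbf, mdt = 2000.0, 8.0    # time between failures, total down time

intrinsic = availability(mttf, mttr)      # MTTF / (MTTF + MTTR)
operational = availability(mtbf, mdt)     # MTBF / (MTBF + MDT)
downtime_hours_per_year = (1 - operational) * 365 * 24

print(f"intrinsic   {intrinsic:.4%}")
print(f"operational {operational:.4%}")
print(f"expected downtime: {downtime_hours_per_year:.1f} hours/year")
```

With these inputs the extra six hours of logistics and maintenance per failure account for most of the downtime, which is why the operational figure is the one that matters for a 24x7 site.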
When things go wrong
• Network operations
– Software recovers from common failures
– Network staff paged by email if a server is not available (via SNMP)
– Usually rotating assignment
• Application developers may be called in if restarting servers, etc. fails completely, and only if it doesn’t look like a network problem
Data Centers
• Data centers: Host your machines in their own premises
– Also called “colocation”
• Features
– Security: controlled entrance, exit
– Weather: maintained temperature, humidity
– Power: Backup power, available circuits
– Bandwidth: OC-192 connections
– Monitoring: 24/7 staff, may reboot misbehaving machines
• Machines typically arranged in “cages”; 1U, 2U machines
• Server blades
• Examples
– NTT / Verio
– Exodus / Global Crossing