Modern Distributed Systems Design – Security and High Availability

1. Measuring Availability
2. Highly Available Data Management
3. Redundant System Design
Measuring Availability
• How are resiliency and high availability
interconnected?
• Define downtime and what causes
downtime.
• How to measure availability?
Measuring Availability
Percentage Uptime   Percentage Downtime   Downtime per year   Downtime per week
98%                 2%                    7.3 days            3h22m
99%                 1%                    3.65 days           1h41m
99.8%               0.2%                  17h30m              20m10s
99.9%               0.1%                  8h45m               10m5s
99.99%              0.01%                 52.5m               1m
99.999%             0.001%                5.25m               6s
99.9999%            0.0001%               31.5s               0.6s
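The figures in the table follow directly from the uptime percentage. A minimal sketch of the arithmetic, assuming a 365-day year and a 7-day week (the function and constant names are illustrative, not from the slides):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600
SECONDS_PER_WEEK = 7 * 24 * 3600


def downtime_seconds(uptime_percent: float, period_seconds: int) -> float:
    """Return the downtime (in seconds) allowed in a period at a given uptime percentage."""
    return (100.0 - uptime_percent) / 100.0 * period_seconds


# Recompute the table rows from the uptime percentages.
for uptime in (98.0, 99.0, 99.8, 99.9, 99.99, 99.999, 99.9999):
    per_year = downtime_seconds(uptime, SECONDS_PER_YEAR)
    per_week = downtime_seconds(uptime, SECONDS_PER_WEEK)
    print(f"{uptime}% uptime: {per_year / 3600:.2f} h/year, {per_week:.1f} s/week")
```

The table rounds these results, so small differences from the printed values are expected.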
Define Downtime
• Downtime can be defined as follows:
“If a user cannot get his job done on time,
the system is down”
What causes downtime?
• Planned – the easiest to reduce; includes
scheduled system maintenance, hot-swappable
hard drives, cluster upgrades and even failovers.
Usually 30% of all downtime;
• People or human factor – careless mistakes, plus
increasingly complex IT equipment, software and
protocols that demand greater knowledge from engineers.
Usually 15% of all downtime;
• Software failures – due to software bugs and
viruses. Usually 40% of all downtime.
How to measure availability?
Availability = MTBF / (MTBF + MTTR), where
MTBF – “mean time between failures” and
MTTR – “mean time to repair”
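The availability formula can be checked with a few lines of code. This is a minimal sketch; the function name and the sample MTBF and MTTR figures are hypothetical and chosen only for illustration:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), returned as a fraction between 0 and 1."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical system: one failure every 1000 hours, one hour to repair.
print(f"{availability(1000.0, 1.0):.4%}")
```

The formula shows two ways to raise availability: make failures rarer (larger MTBF) or make repairs faster (smaller MTTR). As MTTR approaches zero, availability approaches 100% regardless of MTBF.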
What can go wrong?
• Hardware
• Environmental and Physical Failures
• Network Failures
• Database System Failures
• Web Server Failures
• File and Print Server Failures
The Cost of Downtime
Industry         Business Operation                Average downtime cost per hour
Financial        Brokerage Operation               $6.45M
Financial        Credit Card/Sales Authorization   $2.6M
Media            Pay per view TV                   $150K
Retail           Catalog sales                     $90K-$115K
Transportation   Airlines                          $89.5K
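Hourly cost figures become more tangible when combined with an availability level. The sketch below estimates a yearly downtime cost from an uptime percentage and an hourly cost; the function name is illustrative, and a 365-day year with evenly spread downtime is assumed:

```python
HOURS_PER_YEAR = 365 * 24


def annual_downtime_cost(uptime_percent: float, cost_per_hour: float) -> float:
    """Estimate yearly downtime cost: downtime hours per year times hourly cost."""
    downtime_hours = (100.0 - uptime_percent) / 100.0 * HOURS_PER_YEAR
    return downtime_hours * cost_per_hour


# Airline figure from the table ($89.5K per hour) at 99% uptime.
print(f"${annual_downtime_cost(99.0, 89_500):,.0f} per year")
```

This is a rough estimate only: real outage cost depends on when the outage happens and how long each incident lasts.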
Levels of Availability:
1. Regular Availability
2. Increased Availability
3. High Availability
4. Disaster Recovery
5. Fault-Tolerant System
Highly Available Data Management
• Data management is the most sensitive area
of modern distributed systems.
• Quick overview of existing data topologies
Redundant System Design
• Redundant storage (RAID, multi-hosting,
multi-pathing, disk arrays, JBOD, etc.)
• Failover Configurations and Management
• Introduction to SAN and Fibre Channel
protocol
• Security aspects of data management in
Storage Area Networks
Redundant storage
Redundant Storage (RAID 5)
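RAID 5 gets its redundancy from XOR parity: the parity block is the XOR of the data blocks in a stripe, so any single lost block can be rebuilt from the survivors. A minimal sketch of that idea, using made-up four-byte blocks and ignoring the rotation of parity across disks that a real RAID 5 array performs:

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks (illustrative contents) plus their parity.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing the second disk and rebuild its block from the rest.
recovered = xor_blocks([data[0], data[2], parity])
print(recovered == data[1])
```

The same XOR operation serves for both computing parity and rebuilding a missing block, which is why a RAID 5 set survives exactly one disk failure.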
Failover Configurations and Management
Failover must meet the following requirements:
• Transparent to the client;
• Quick (no more than 5 min, ideally 0-2
min);
• Minimal manual intervention, guaranteed
data access.
Failover components:
• Two servers: one primary, the other a takeover server;
• Two network connections; a third is highly
recommended;
• All disks on a failover pair should have
some sort of redundancy;
• Application portability;
• No single point of failure.
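One common way to trigger a takeover is a heartbeat: the takeover server promotes itself when the primary stays silent for too long. The sketch below is an assumption about how such a monitor might look, not the mechanism from the slides; the class name, the timeout value and the explicit `now_s` timestamps are all hypothetical:

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    timeout_s: float                # hypothetical: silence tolerated before takeover
    last_heartbeat_s: float = 0.0   # time of the last heartbeat from the primary
    active: str = "primary"         # which server currently provides the service

    def heartbeat(self, now_s: float) -> None:
        """Record a heartbeat received from the primary server."""
        self.last_heartbeat_s = now_s

    def check(self, now_s: float) -> str:
        """Promote the takeover server if the primary has been silent too long."""
        if self.active == "primary" and now_s - self.last_heartbeat_s > self.timeout_s:
            self.active = "takeover"
        return self.active


monitor = FailoverMonitor(timeout_s=10.0)
monitor.heartbeat(5.0)
print(monitor.check(12.0))  # 7 s of silence, within the timeout
print(monitor.check(20.0))  # 15 s of silence, takeover is promoted
```

Timestamps are passed in explicitly so the logic can be tested without waiting. The sketch never fails back automatically; once the takeover server is active it stays active until an operator intervenes.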
Symmetric Failover
Asymmetric Failover
Fibre Channel, SAN, IP Storage
Security in IP Storage Networks
• Security in Fibre Channel SANs
• Security Options for IP Storage Networks
Fibre Channel SAN Security
• Port or hard zoning
• WWN Zoning
• LUN Masking
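LUN masking can be pictured as a lookup table from an initiator's WWN to the LUNs it is allowed to see; anything not listed stays hidden. A minimal sketch of that model, with made-up WWNs and LUN numbers (real masking is configured on the storage array or HBA, not in application code):

```python
# Hypothetical masking table: initiator WWN -> set of LUNs it may see.
LUN_MASKS = {
    "10:00:00:00:c9:aa:bb:01": {0, 1},
    "10:00:00:00:c9:aa:bb:02": {2},
}


def visible_luns(initiator_wwn: str) -> set:
    """Return the LUNs exposed to an initiator; unknown initiators see nothing."""
    return LUN_MASKS.get(initiator_wwn, set())


def may_access(initiator_wwn: str, lun: int) -> bool:
    """Check whether an initiator is allowed to access a given LUN."""
    return lun in visible_luns(initiator_wwn)


print(may_access("10:00:00:00:c9:aa:bb:01", 1))
print(may_access("10:00:00:00:c9:aa:bb:02", 1))
```

The default of an empty set is the important design choice: an initiator that is not in the table gets no access at all.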
Security Options for IP Storage Networks
• iSNS (Internet Storage Name Service)
• LUN masking (as in Fibre Channel) and
VLAN tagging
• IP Security (IPsec)
• Access control lists (ACLs)