
CIT 470: Advanced Network and
System Administration
Disks
CIT 470: Advanced Network and System Administration
Slide #1
Topics
1. Disk interfaces
2. Disk components
3. Performance
4. Reliability
5. Partitions
6. RAID
7. Adding a disk
8. Logical volumes
9. Filesystems
10. Storage Management
Volumes
A volume is a chunk of storage as seen by the server.
 A disk.
 A partition.
 A RAID set.
Logical Unit Numbers (LUNs) identify volumes.
A volume is formatted with a filesystem:
 ext2 and ext3
 ZFS
 FAT
 NTFS
 ISO9660
Disk Interfaces
SCSI
Standard interface for servers.
IDE
Standard interface for PCs.
Fibre Channel (FC-AL)
High bandwidth
Can run SCSI or IP
iSCSI
SCSI over fast (e.g., 10-gigabit) IP network equipment.
USB
Fast enough for slow devices on PCs.
SCSI
Small Computer Systems Interface
Fast, reliable, expensive.
A bus, not a simple PC-to-device interface.
Each device has a target # ranging from 0-7 or 0-15.
Devices can communicate directly w/o CPU.
Many versions
Original: SCSI-1 (1979) 5MB/s
Current: SCSI-3 (2001) 320MB/s
Serial Attached SCSI (SAS)
Up to 128 devices
Up to 750 MB/s full duplex.
IDE
Integrated Drive Electronics / AT attachment
Slower, less reliable, cheap.
Only allows 2 devices per interface.
ATAPI standard added removable devices.
Many versions
Original: IDE / ATA (1984)
Current: Ultra-ATA/133 133MB/s
Serial ATA
Up to 128 devices.
150 MB/s (SATA-1) and 300 MB/s (SATA-2)
IDE vs. SCSI
SCSI offers better performance/scale
Faster bus
Faster hard drives (up to 15,000 rpm).
Lower CPU usage
Better handling of multiple requests.
Cheaper IDE often best for workstations.
Convergence
SATA2 and SAS converging on a single std.
Hard Drive Components
Actuator
Moves arm across disk to read/write data.
Arm has multiple read/write heads (often 2 per platter).
Platters
Rigid substrate material.
Thin coating of magnetic material stores data.
Coating type determines areal density: Gbits/in²
Spindle Motor
Spins platters from 3600-15,000 rpm.
Speed determines disk latency.
Cache
8-32MB of cache memory
Reliability: write-back vs. write-through
Disk Information: hdparm
# hdparm -i /dev/hde
/dev/hde:
Model=WDC WD1200JB-00CRA1, FwRev=17.07W17, SerialNo=WD-WMA8C4533667
Config={ HardSect NotMFM HdSw>15uSec SpinMotCtl Fixed DTR>5Mbs FmtGapReq }
RawCHS=16383/16/63, TrkSize=57600, SectSize=600, ECCbytes=40
BuffType=DualPortCache, BuffSize=8192kB, MaxMultSect=16, MultSect=off
CurCHS=16383/16/63, CurSects=16514064, LBA=yes, LBAsects=234441648
IORDY=on/off, tPIO={min:120,w/IORDY:120}, tDMA={min:120,rec:120}
PIO modes: pio0 pio1 pio2 pio3 pio4
DMA modes: mdma0 mdma1 mdma2
UDMA modes: udma0 udma1 udma2 udma3 udma4 *udma5
AdvancedPM=no WriteCache=enabled
Drive conforms to: device does not report version:
* signifies the current active mode
Disk Performance
Seek Time
Time to move head to desired track (3-8 ms)
Rotational Delay
Time until head is over desired block (a full rotation takes ~8 ms at 7200 rpm)
Latency
Seek Time + Rotational Delay
Throughput
Data transfer rate (20-100 MB/s)
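The rotational-delay figure can be sanity-checked with a little arithmetic; this sketch uses awk for the floating-point division:

```shell
# One full rotation at 7200 rpm, in milliseconds:
# 60,000 ms per minute / 7200 rotations per minute.
awk 'BEGIN { printf "full rotation: %.2f ms\n", 60000 / 7200 }'
# On average the head waits half a rotation for the desired block:
awk 'BEGIN { printf "average delay: %.2f ms\n", 60000 / 7200 / 2 }'
```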
Latency vs. Throughput
Which is more important?
Depends on the type of load.
Sequential access – Throughput
Multimedia on a single user PC
Random access – Latency
Most servers
How to improve performance
Faster disks (15 krpm vs 7200rpm)
Caching (disk, controller, server OS, client OS)
More spindles (disks).
More disk controllers.
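A rough way to eyeball sequential write throughput is timing dd; a sketch (the /tmp/ddtest path is arbitrary, and conv=fdatasync is a GNU dd flag that flushes data to disk before the rate is reported, so the cache doesn't inflate the number):

```shell
# Write 64 MB and flush; dd prints the throughput on its last line.
dd if=/dev/zero of=/tmp/ddtest bs=1M count=64 conv=fdatasync 2>&1 | tail -n 1
rm -f /tmp/ddtest
```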
Disk Performance: hdparm
# hdparm -tT /dev/sda
/dev/sda:
Timing cached reads:   1954 MB in 2.00 seconds = 977.02 MB/sec
Timing buffered disk reads:  268 MB in 3.02 seconds = 88.66 MB/sec
Reliability
MTBF
Average time between failures (>100,000 hours).
Real failure curves
Early phase: high failure rate from defects.
Constant failure rate phase: MTBF valid.
Wearout phase: high failure rate from wear.
Failures more likely on traumatic events.
Power on/off.
Systems often wear out before MTBF.
Partitions and the MBR
4 primary partitions.
One can be used as an extended partition, which is a link to an extended boot record (EBR) on the 1st sector of that partition.
Each logical partition is described by its own EBR, which links to the next EBR.
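The MBR's layout can be poked at safely with an image file instead of a real disk; this sketch stamps and reads back the 0x55 0xAA boot signature that marks a valid MBR at bytes 510-511 of sector 0 (the /tmp path is arbitrary):

```shell
# Build a tiny blank "disk" image of four 512-byte sectors.
img=/tmp/demo_mbr.img
dd if=/dev/zero of="$img" bs=512 count=4 2>/dev/null
# Stamp the boot signature: \125\252 is octal for 0x55 0xAA.
printf '\125\252' | dd of="$img" bs=1 seek=510 conv=notrunc 2>/dev/null
# Read the last two bytes of sector 0 back as hex: prints "55 aa".
dd if="$img" bs=1 skip=510 count=2 2>/dev/null | od -An -tx1
```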
Extended Partitions and EBRs
There is only one extended partition.
– It is one of the primary partitions.
– It contains one or more logical partitions.
– It should contain all disk space not used by the other primary partitions.
EBRs contain two entries.
– The first entry describes a logical partition.
– The second entry points to the next EBR if there are more logical partitions after the current one.
Why Partition?
1. Separate OS from user files, to allow user backups and OS upgrades without problems.
2. Have a faster swap area for virtual memory.
3. Improve performance by keeping filesystem tables small and keeping frequently used files close together on the disk.
4. Limit the effect of disk-full issues, often caused by log or cache files.
5. Multi-boot systems with multiple OSes.
RAID
Redundant Array of Independent Disks
Combine physical disks into single logical unit.
Can be implemented in hardware or software.
Hardware RAID controllers:
Caching
Automate rebuilding of arrays
Advantages
Capacity
Reliability
Throughput
RAID Levels
Level   Min  Description
JBOD    2    Merge disks for capacity, no striping.
RAID 0  2    Striped for performance + capacity.
RAID 1  2    Mirrored for fault tolerance.
RAID 3  3    Striped set with dedicated parity disk.
RAID 4  3    Block instead of byte level striping.
RAID 5  3    Striped set with distributed parity.
RAID 6  4    Striped set with dual distributed parity.
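Usable capacity for the common levels follows directly from the table; a sketch with example values (4 disks of 500 GB each — figures chosen for illustration, not from the slides):

```shell
n=4; s=500   # n disks of s GB each
echo "RAID 0: $(( n * s )) GB"            # all capacity usable
echo "RAID 1: $(( n / 2 * s )) GB"        # mirrored pairs: half
echo "RAID 5: $(( (n - 1) * s )) GB"      # one disk's worth of parity
echo "RAID 6: $(( (n - 2) * s )) GB"      # two disks' worth of parity
```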
Striping
• Distribute data across multiple disks.
• Improve I/O by accessing disks in parallel.
– Independent requests can be serviced in parallel by separate disks.
– Single multi-block requests can be serviced by multiple disks.
• Performance vs. reliability
– Performance increases with # disks.
– Reliability decreases with # disks.
Parity
Store extra bit with each chunk of data.
Odd parity
 add 0 if # of 1s is odd
 add 1 if # of 1s is even
Even parity
 add 0 if # of 1s is even
 add 1 if # of 1s is odd

7-bit data  even parity  odd parity
0000000     00000000     10000000
1011011     11011011     01011011
1100110     01100110     11100110
1111111     11111111     01111111
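The table rows can be reproduced in shell; this sketch counts the 1 bits in the 7-bit data and prepends an even-parity bit (on the left, matching the table):

```shell
data=1011011
# Count the 1 bits: keep only '1' characters, then count them.
ones=$(printf %s "$data" | tr -cd 1 | wc -c)
# Even parity: add 1 if the count of 1s is odd, else 0.
pbit=$(( ones % 2 ))
echo "${pbit}${data}"    # prints 11011011, as in the table
```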
Error Detection with Parity
Even: every byte must have even # of 1s.
What if you read a byte with an odd # of 1s?
– It’s an error.
– An odd # of bits were flipped.
What if you read a byte with an even # of 1s?
– It may be correct.
– It may be an error where an even # of bits are bad.
Error Correction
XOR each block to get parity information.
XOR with parity block to retrieve missing block on bad drive.
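A one-byte sketch of the XOR scheme (the byte values 0xA5 and 0x3C are arbitrary examples):

```shell
# Two data "blocks" and their XOR parity.
d1=$(( 0xA5 )); d2=$(( 0x3C ))
parity=$(( d1 ^ d2 ))
printf 'parity    = 0x%02X\n' "$parity"       # prints parity    = 0x99
# Drive holding d2 fails: XOR the survivor with parity rebuilds it.
printf 'recovered = 0x%02X\n' "$(( d1 ^ parity ))"   # prints recovered = 0x3C
```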
RAID 0: Striping, no Parity
Performance
Throughput = n * disk speed
Reliability
 Lower reliability.
 If one disk lost, entire set is lost.
 MTBF = (avg MTBF)/# disks
Capacity
n * disk size
RAID 1: Disk Mirroring
Performance
– Reads are faster since read operations return after the first read is complete.
– Writes are slower because write operations return after the second write is complete.
Reliability
– System continues to work after one disk dies.
– Doesn't protect against disk or controller failure that corrupts data instead of killing the disk.
– Doesn't protect against human or software error.
Capacity
– n/2 * disk size
RAID 3: Striping + Dedicated Parity
Reliability
Survive failure of any 1 disk.
Performance
 Striping increases performance, but
 Parity disk must be accessed on every write.
 Parity calculation decreases write performance.
 Good for sequential reads (large graphics + video files).
Capacity
(n-1) * disk size
RAID 4: Stripe + Block Parity Disk
• Identical to RAID 3 except uses block striping instead of byte striping.
RAID 5: Stripe + Distributed Parity
Reliability
Survive failure of any 1 disk.
Performance
 Fast reads (RAID 0), but slow writes.
 Like RAID 4 but without the bottleneck of a single parity disk.
 Still have to read blocks + write the parity block when altering any data blocks.
Capacity
(n-1) * disk size
RAID 6: Striped w/ Dual Parity
• Like RAID 5 but with two parity blocks.
• Can survive failure of two drives at once.
Nested RAID Levels
Many RAID systems can use both
– Physical drives.
– RAID sets.
as RAID volumes:
– Allows admins to combine advantages of levels.
– Nested levels are named by the combination of levels, e.g. RAID 01 or RAID 0+1.
RAID 01 (0+1)
Mirror of stripes.
– If a disk fails in one RAID 0 array, it can be tolerated by using a disk from the other RAID 0.
– Cannot tolerate 2 disk failures unless both are from the same stripe.
RAID 10 (1+0)
Stripe of mirrors.
– Can tolerate failure of all but one drive from each RAID 1 set.
– Uses more disk space than RAID 5 but provides higher performance.
– Highest performance and cost.
RAID 51 (5+1)
Mirror of RAID 5s.
Capacity = (n/2 - 1) * disk size
Min disks: 6
RAID Failures
RAID sets work after single disk failure
 Except RAID 0
 Operate in degraded mode
RAID set rebuilds after bad disk replaced
 Can take hours to rebuild parity/mirror data.
 Some hardware allows hot swapping, so the server doesn't have to be rebooted to replace a disk.
 Some hardware supports a hot spare disk that will be used immediately on disk failure for rebuild.
You still need backups
Human and software errors
– RAID won't protect you from rm -rf / or copying over the wrong file.
System crash
– Crashes can interrupt write operations, leading to situations where data is updated but parity is not.
Correlated disk failures
– Accidents (power failures, dropping the machine) can impact all disks at once.
– Disks bought at the same time often fail at the same time.
Hardware data corruption
– If a disk controller writes bad data, all disks will have the bad data.
Logical Volumes
What are logical volumes?
Appear to user as a physical volume.
But can span multiple partitions and/or disks.
Why logical volumes?
Aggregate disks for performance/reliability.
Grow and shrink logical volumes on the fly.
Move logical volumes between physical devices.
Replace volumes w/o interrupting service.
LVM Components
Logical Volume Group (LVG)
Set of physical volumes (partitions or disks).
May be divided into logical volumes (LVs).
LVs made up of fixed sized logical extents
Each LE is 4MB.
Physical extents are the same size.
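The extent size fixes how many extents a volume of a given size consumes; for example, at the 4 MB default a 100 GB LV needs:

```shell
# 100 GB in MB, divided by the 4 MB extent size.
echo $(( 100 * 1024 / 4 ))    # prints 25600 extents
```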
Mapping Modes
Linear Mapping
LVs assigned to contiguous areas of PV space.
Striped Mapping
LEs interleaved across PVs to improve performance.
Setting up a LVG and LV
1. Initialize physical volumes
pvcreate /dev/hda1
pvcreate /dev/hdb1
2. Initialize a volume group
vgcreate nku_proj /dev/hda1 /dev/hdb1
Use vgextend to add more PVs later.
3. Create logical volumes
lvcreate -n nku1 --size 100G nku_proj
4. Create filesystem
mkfs -v -t ext3 /dev/nku_proj/nku1
Extending a LV
Set absolute size
lvextend -L120G /dev/nku_proj/nku1
Or set relative size
lvextend -L+20G /dev/nku_proj/nku1
Expand the filesystem without unmounting
ext2online -v /dev/nku_proj/nku1
Check size
df -k
Adding a Disk
Install new hardware
Verify disk recognized by BIOS.
Boot
Verify device exists in /dev
Partition
fdisk /dev/sdb
Create filesystem
mkfs -v -t ext3 /dev/sdb1
Add entry to /etc/fstab
/dev/sdb1 /proj ext3 defaults 0 2
mount -a
Filesystems
ext3fs
Current common Linux filesystem.
Journaling eliminates need for regular fscking.
ext2fs
Old Linux non-fragmenting fast filesystem.
inode
• File consists of inode + data blocks.
• inodes are static:
– table has fixed # inodes
– size: 128 bytes
• inode contains
– UID of owner
– GID of group
– Permissions
– Timestamps
– Reference count
– Block pointers
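Several of these fields can be read with stat; a sketch using GNU coreutils format directives (the file names are arbitrary), where the hard link shows the reference count climbing:

```shell
# Create a file, then a hard link to it: same inode, link count 2.
touch /tmp/inode_demo
ln /tmp/inode_demo /tmp/inode_link
stat -c 'inode=%i links=%h uid=%u' /tmp/inode_demo
rm -f /tmp/inode_demo /tmp/inode_link
```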
When don’t you need a filesystem?
Swap space
mkswap -v /dev/sdb1
Server applications
Oracle
VMWare Server
Swap
Can use swapfile instead of swap partition
dd if=/dev/zero of=/swapfile bs=1024k count=512
mkswap /swapfile
Enable swap
swapon /swapfile
swapon /dev/sda2
Disable swap
swapoff /swapfile
swapoff /dev/sda2
Check swap resource usage
cat /proc/swaps
Mounting
To use a filesystem
mount /dev/sda1 /mnt
df /mnt
Automatic mounting
Add an entry in /etc/fstab
Unmount
umount /dev/sda1
Cannot unmount a volume in use.
fstab
# /etc/fstab: static file system information.
#
# <file system> <mount point>   <type>   <options> <dump> <pass>
proc            /proc           proc     defaults  0      0
/dev/hdc1       /               ext3     defaults  0      1
/dev/hdc5       /win            vfat     user,rw   0      0
/dev/hdc7       none            swap     sw        0      0
/dev/hdc8       /var            ext3     defaults  0      2
/dev/hdc9       /home           ext3     defaults  0      2
/dev/hda        /media/cdrom0   iso9660  ro,user   0      0
/dev/fd0        /media/floppy0  auto     rw,user   0      0
UUIDs
Universally Unique IDentifiers
– 128-bit numbers written as 32 hex digits.
– 3.4 × 10^38 possible UUIDs
Used to identify devices on Linux
– To find UUID for a specific device: vol_id -u /dev/sda1
– All devices: ls -l /dev/disk/by-uuid
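The 8-4-4-4-12 hex layout can be demonstrated with a fresh UUID minted by the kernel (this /proc interface is Linux-specific):

```shell
# Ask the kernel for a new random UUID and check its shape.
uuid=$(cat /proc/sys/kernel/random/uuid)
echo "$uuid"
# Verify the 8-4-4-4-12 lowercase-hex layout.
echo "$uuid" | grep -Eq '^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$' && echo valid
```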
# /etc/fstab: static file system information.
#
# <file system>                             <mount point> <type> <options> <dump> <pass>
UUID=fbdfebe2-fbde-42c9-963d-12428b642f1d   /             ext3   defaults  0      1
UUID=a1858e04-78b9-460b-a6cb-3f1dfe3fa16e   /home         ext3   defaults  0      2
UUID=c4f14e27-96cd-420c-9860-4bd5298e3f76   none          swap   sw        0      0
fsck: check + repair fs
Filesystem corruption sources
Power failure
System crash
Types of corruption
Unreferenced inodes.
Bad superblocks.
Unused data blocks not recorded in block maps.
Data blocks listed as free that are used in files.
fsck can fix these and more
Asks user to make more complex decisions.
Stores unfixable files in lost+found.
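fsck can be tried without touching a real disk by building a filesystem inside an ordinary file; in this sketch -F forces mkfs to accept a non-device file and -n keeps fsck read-only (paths and sizes are arbitrary):

```shell
# Make a 16 MB ext3 image in a regular file, then force a full check.
img=/tmp/fsck_demo.img
dd if=/dev/zero of="$img" bs=1M count=16 2>/dev/null
mkfs -t ext3 -q -F "$img"
fsck -t ext3 -fn "$img"   # exits 0 on a clean filesystem
rm -f "$img"
```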
Cost of Storage
Disks are cheap
1TB SATA disks cost $100 in late 2008.
Storage is expensive
 20% of cost is hard disks
 80% of cost is overhead
• Servers
• Power
• AC
• Support
• Backups
Storage Management
Group-based Storage
– Ideally, each group has its own fileserver.
– Group activities can interfere with each other:
• Capacity (filling disks)
• Performance
Storage Needs Assessment
– Ask customers for anticipated storage growth.
– Monitor servers to measure current growth.
Storage SLA
– Availability
– Performance (response time)
– Cost and time to add new storage.
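Current growth can be measured with df; a minimal sketch reporting one filesystem (run from cron and logged, repeated snapshots give the growth trend):

```shell
# -P forces POSIX single-line output; row 2 is the filesystem itself.
df -kP /tmp | awk 'NR == 2 { print $5, "used on", $6 }'
```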
References
1. Aeleen Frisch, Essential System Administration, 3rd edition, O'Reilly, 2002.
2. Charles M. Kozierok, "Reference Guide—Hard Disk Drives," http://www.pcguide.com/ref/hdd/, 2005.
3. A.J. Lewis, LVM HOWTO, http://www.tldp.org/HOWTO/LVM-HOWTO/index.html, 2005.
4. H. Mauelson and M. O'Keefe, "The Linux Logical Volume Manager," Red Hat Magazine, http://www.redhat.com/magazine/009jul05/features/lvm2/, July 2005.
5. Evi Nemeth et al., UNIX System Administration Handbook, 3rd edition, Prentice Hall, 2001.
6. Octane, "SCSI Technology Primer," http://arstechnica.com/paedia/s/scsi-1.html, 2002.
7. RedHat, RHEL4 System Administration Guide, http://www.redhat.com/docs/manuals/enterprise/RHEL-4-Manual/sysadmin-guide/, 2005.
8. Wikipedia, "RAID," http://en.wikipedia.org/wiki/RAID.