Now: How to systematically deal with failures, or build "fault-tolerant" systems.
We'll allow more complicated failures and also try to recover from them.
Thinking about large, distributed systems. 100s, 1000s, even more machines, potentially located across the globe.
Will also have to think about what these applications are doing, what they need.
Building Fault-Tolerant Systems
General approach:
Identify possible faults (software, hardware, design, operation, environment, ...).
Detect and contain.
Handle the fault.
Do nothing, fail-fast (detect and report to the next-higher level), fail-stop (detect and stop), mask the fault, ...
Caveats
Components are always unreliable. We aim to build a reliable system out of them, but our guarantees will be probabilistic.
Reliability comes at a cost; always a tradeoff. Common tradeoff is reliability vs. simplicity.
All of this is tricky. It's easy to miss some possible faults in step 1, for example. Hence, we iterate.
We'll have to rely on *some* code to work correctly. In practice, there is only a small portion of mission-critical code. We have stringent development processes for those components.
Quantifying Reliability
Goal: Increase availability.
Metrics:
MTTF = mean time to failure
MTTR = mean time to repair
availability = MTTF / (MTTF + MTTR)
Example: Suppose my OS crashes once a month and takes 10 minutes to recover.
MTTF = 30 days = 720 hours = 43,200 minutes
MTTR = 10 minutes
availability = 43,200 / 43,210 ≈ .99977
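To make the arithmetic concrete, here's the same computation as a few lines of Python (a minimal sketch, using the numbers from the example):

    # Availability = MTTF / (MTTF + MTTR), with the OS-crash numbers above.
    mttf = 30 * 24 * 60   # one crash per month => 43,200 minutes of uptime
    mttr = 10             # 10 minutes to recover
    availability = mttf / (mttf + mttr)
    print(availability)   # ~0.99977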
Reliability via Replication
To improve reliability, add redundancy.
One way to add redundancy: replication.
Today: Replication within a single machine to deal with disk failures.
Tomorrow in recitation: replication across machines to deal with machine failures.
Dealing with Disk Failures
Why disks?
Starting from a single machine because we want to improve reliability there before moving to multiple machines.
Disks in particular because if a disk fails, your data is gone. Can replace other components like the CPU easily. Cost of a disk failure is high.
Are disk failures frequent?
Manufacturers claim an MTBF of 700K+ hours, which is bogus.
What likely happened: ran 1000 disks for 3000 hours (125 days) each => 3 million disk-hours total, saw 4 failures, and concluded: 1 failure every 750,000 hours.
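The arithmetic behind that kind of claim, as a sketch (the numbers are the hypothetical ones above):

    # A short test over many disks yields a huge (and misleading) MTBF estimate.
    disks = 1000
    hours_each = 3000                      # ~125 days per disk
    failures = 4
    total_disk_hours = disks * hours_each  # 3,000,000 disk-hours
    mtbf = total_disk_hours / failures     # 750,000 hours per failure
    print(mtbf)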
But failures aren't memoryless: a disk is more likely to fail at the beginning and end of its lifespan than in the middle (see slides).
Whole-Disk Failures
General scenario: Entire disk fails, all data on that disk is lost. What to do? RAID provides a suite of techniques.
RAID 1: Mirror data across 2 disks.
Pro: Can handle single-disk failures.
Pro: Performance improvement on reads (issue two in parallel), not a terrible performance hit on writes (have to issue two writes, but you can issue them in parallel too).
Con: To mirror N disks' worth of data, you need 2N disks.
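A minimal sketch of mirroring in Python, with dict-backed stand-ins for disks (the Disk and MirroredDisks classes are made up for illustration):

    # RAID 1 sketch: every write goes to both disks; a read can go to either.
    class Disk:
        def __init__(self):
            self.sectors = {}
        def write(self, sector, data):
            self.sectors[sector] = data
        def read(self, sector):
            return self.sectors[sector]

    class MirroredDisks:
        def __init__(self):
            self.disks = [Disk(), Disk()]
        def write(self, sector, data):
            for d in self.disks:   # in practice, the two writes go out in parallel
                d.write(sector, data)
        def read(self, sector):
            # Either copy is valid; alternating spreads read load over the pair.
            return self.disks[sector % 2].read(sector)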
RAID 4: With N disks, add an additional parity disk. Sector i on the parity disk is the XOR of all of the sector i's from the data disks.
Pro: Can handle single-disk failures (if one disk fails, XOR the remaining disks together to recover its data; see the sketch after this list).
Can use the same technique to recover from single-sector errors.
Pro: To store N disks' worth of data we only need N+1 disks.
Pro: Improved performance if you stripe files across the array. E.g., an N-sector-length file can be stored as one sector per disk. Reading the whole file means N parallel 1-sector reads instead of 1 long N-sector read.
RAID is a system for reliability, but we never forget about performance, and in fact performance influenced much of the design of RAID.
Con: Every write hits the parity disk.
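Before moving on, here's the parity-and-recovery idea in code (a sketch with byte strings standing in for sectors; the function names are made up):

    # RAID 4 sketch: sector i of the parity disk = XOR of sector i on every data disk.
    def xor_sectors(sectors):
        out = bytearray(len(sectors[0]))
        for s in sectors:
            for j, byte in enumerate(s):
                out[j] ^= byte
        return bytes(out)

    data = [b"AAAA", b"BBBB", b"CCCC"]  # sector i from each of N=3 data disks
    parity = xor_sectors(data)

    # Disk 1 fails: XOR the survivors with the parity sector to rebuild its data.
    recovered = xor_sectors([data[0], data[2], parity])
    assert recovered == data[1]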
RAID 5: Same as RAID 4, except intersperse the parity sectors amongst all N+1 disks to load balance writes.
You need a way to figure out which disk holds the parity sector for sector i, but it's not hard (one simple rule is sketched below).
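One simple rule, sketched here (rotating parity by stripe number; real implementations vary in the exact layout):

    # RAID 5 sketch: with N data disks + 1 parity disk = N+1 disks total,
    # rotate which disk holds the parity sector from stripe to stripe.
    def parity_disk(stripe, total_disks):
        return stripe % total_disks

    def data_disk(stripe, k, total_disks):
        # The k-th data sector of a stripe, skipping the parity disk's slot.
        p = parity_disk(stripe, total_disks)
        return (p + 1 + k) % total_disks

    # Sector i lives in stripe i // N, so its parity is on parity_disk(i // N, N + 1).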
RAID 5 is used in practice, but it is giving way to RAID 6, which uses the same techniques but protects against two disks failing at the same time.
Your Future
RAID, and even replication, don't solve everything.
E.g., what about failures that aren't independent?
Wednesday: We'll introduce transactions, which give us abstractions for reasoning about faults.
Next week: We'll get transaction-based systems to perform well on a single machine.
Week after: We'll get everything to work across machines.