Failure Resiliency - Haohan (Carrie) Zhang

Definitions:

Def: failure: have components the meet the specified behavior (deviation from specified behavior)
Def: Specified Behavior: corner case. Especially in many subtle interactions.
What can fail? Hardware and environment (earthquake).

Layering model of resiliency

Code and data modularity bounds impact of failure, a failure just detected might happened many layers bellow.
E.g.: Firefox/TCP/IP
- The application doesn’t know the failure on IP’s package missing, cause TCP retransmitted package - failure masking
E.g.: Bad block remapping
- OS/Driver -> Controller -> Disk
E.g.: Coda File System
- Emacs -> Venus -> Server
- If Failure happens on server, the emacs would just wait longer to get a response but knows nothing

Limits of Failure Masking

Not all failures can be masked
- Sometimes limited by fundamentals
- Often limited by “cost” - cost is too high (overhead, include usability)
  - physical limits, cache too small
  - TCP time out, impatient user
  - Length of error-correcting code, number of errors can be detected > number of errors can be corrected
  - Storage overhead of storing # of server replicas
  - Update overhead limits # of server replicas
Unmasked failures are visible to next higher layer
- Which might have own resiliency mechanism
- Hierarchical layers of resiliency may exist
- Upper layer mechanisms typically
  - More heavy
  - Involve greater semantic knowledges
  - Affect more system components
- E.g.: Coda resiliency mechanisms, user->Adobe Acrobat->Venus->RPC2->UDP socket

Failure Detection

Category of failures
- Transient - soft failures / Heisenbugs
  - E.g. deadlocks - can be corrected by undoing and retrying
  - Unlikely combination of circumstances
  - ROC - recovery-oriented computing
- Persistent
  - E.g. broken Ethernet cable
  - Retry code does not help
  - Failure happens until repaired
  - Duration of failures and repair are random variables
  - Variables: Means of distributions are MTBF and MTTR
    - MTBF - mean time between failures (mean in service times)
    - MTTR - mean time to repair (mean failure times)
  - If the failure lies in software, then the independence assumption of distributed system does not work! One entities fail would cost another’s (all) entities to fail.

Empirical Failure Data

Jim Gray - Turing Winner, related DB
- Tandem was a high availability OLTP system
- Only un maskable failures listed
- Some failures likely underreported
- Human is the major source of failure, “operator” failures could be “usability” failures
Oppenheimer Study
- Online
- Content
- Most-Read

Next Post → ← Previous Post