E.g. deadlocks - can be corrected by undoing and retrying
Unlikely combination of circumstances
ROC - recovery-oriented computing
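A minimal sketch of the "undo and retry" idea for transient failures. The names `TransientError`, `retry`, `flaky`, and `rollback` are hypothetical and chosen only for illustration; a real system would roll back a transaction rather than bump a counter.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for a failure that may succeed on retry (e.g. a deadlock victim)."""


def retry(operation, undo, attempts=3, base_delay=0.0):
    """Run operation; on TransientError, undo partial work and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            undo()  # roll back partial effects before the next attempt
            if attempt == attempts:
                raise  # out of attempts: the failure may not be transient after all
            time.sleep(base_delay * attempt)  # simple linear backoff


# Usage: an operation that fails twice (as if chosen as deadlock victim), then succeeds.
calls = {"tried": 0, "undone": 0}


def flaky():
    calls["tried"] += 1
    if calls["tried"] < 3:
        raise TransientError("deadlock victim")
    return "committed"


def rollback():
    calls["undone"] += 1


result = retry(flaky, rollback)
```

The retry limit matters: if the failure is actually persistent, retrying forever only hides it.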
Persistent
E.g. broken Ethernet cable
Retry code does not help
Failure persists until repaired
Time between failures and time to repair are random variables
The means of these distributions are MTBF and MTTR
MTBF - mean time between failures (mean in-service time)
MTTR - mean time to repair (mean time spent in the failed state)
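As a rough illustration, the two means can be estimated from observed up and down periods. The durations below are made up for the example, and the availability ratio MTBF / (MTBF + MTTR) is the standard steady-state formula, which assumes MTBF is measured as mean in-service time as defined above.

```python
from statistics import mean

# Made-up observations (hours); not real measurements.
up_hours = [700.0, 900.0, 800.0]   # in-service periods between failures
down_hours = [2.0, 6.0, 4.0]       # repair periods

mtbf = mean(up_hours)    # mean in-service time
mttr = mean(down_hours)  # mean time spent in the failed state

# Fraction of time the system is in service, in the long run.
availability = mtbf / (mtbf + mttr)

print(f"MTBF={mtbf} h, MTTR={mttr} h, availability={availability:.4f}")
```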
If the failure lies in software, the independence assumption for distributed systems does not hold! One entity's failure can cause other (possibly all) entities to fail.
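A simplified sketch of why this matters for replication. It compares the chance that every replica is down under the independence assumption with the extreme case where one shared software bug takes down all replicas together. The function names and the numbers are assumptions for illustration only.

```python
def p_all_fail_independent(p, n):
    """Probability that all n replicas fail, if failures are independent."""
    return p ** n


def p_all_fail_common_bug(p, n):
    """Probability that all n replicas fail, if one shared bug fails them together.

    Assumes the extreme case of perfect correlation: extra replicas do not help.
    """
    return p


p, n = 0.01, 3  # illustrative per-replica failure probability and replica count
independent = p_all_fail_independent(p, n)
correlated = p_all_fail_common_bug(p, n)
print(f"independent: {independent:.2e}, common bug: {correlated:.2e}")
```

Real systems fall somewhere between the two cases, but the gap shows how much redundancy depends on independence.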
Empirical Failure Data
Jim Gray - Turing Award winner, for work on databases
Tandem was a high-availability OLTP system
Only unmaskable failures are listed
Some failures likely underreported
Humans are the major source of failure; “operator” failures could really be “usability” failures