Software Reliability

May 9, 2020 · 1 min read · 199 words · engineering

A fault is one component of a system deviating from its spec. A failure is when the system as a whole stops providing the required service to the user. Systems that anticipate faults and cope with them are called fault-tolerant or resilient. We can't anticipate every type of fault, so it only makes sense to talk about tolerating certain types.

Hardware

  • Hard disks have a mean time to failure (MTTF) of about 10 to 50 years
  • Redundant components, such as RAID disks and dual power supplies, help
  • So does spreading the system across several machines instead of relying on a single server
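To get a feel for what that MTTF range means at scale, here is a rough back-of-the-envelope sketch. The function name and the 10,000-disk cluster size are illustrative assumptions, and the calculation assumes failures are independent and spread evenly over time, which real disks do not strictly follow.

```python
def expected_failures_per_day(num_disks: int, mttf_years: float) -> float:
    """Average disk failures per day for a fleet of num_disks.

    Assumes independent failures spread evenly over the MTTF
    (a simplification, not a model of real disk behaviour).
    """
    mttf_days = mttf_years * 365
    return num_disks / mttf_days


# Hypothetical cluster size, evaluated at both ends of the MTTF range.
for mttf_years in (10, 50):
    rate = expected_failures_per_day(10_000, mttf_years)
    print(f"MTTF {mttf_years} years: ~{rate:.2f} failures/day")
```

Even with optimistic numbers, a large enough fleet sees disks die routinely, which is why redundancy matters more as the number of machines grows.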

Software

  • Software bugs
  • Runaway processes
  • Third-party API failures

Carefully thinking about the assumptions the software makes, and about how its parts interact, can help.
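One way to make an assumption explicit is to write it into the code. The sketch below assumes a third-party call can fail transiently with a `ConnectionError` and retries it with exponential backoff. The names `call_with_retries` and `flaky_lookup` are hypothetical, and this is a minimal illustration rather than a production-ready retry policy.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on ConnectionError with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the fault
            # Wait base_delay, then 2x, then 4x, ... between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Hypothetical stand-in for a third-party API that fails twice, then works.
calls = {"count": 0}


def flaky_lookup() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated outage")
    return "ok"


print(call_with_retries(flaky_lookup, base_delay=0.01))
```

The `sleep` parameter is injectable only so the backoff can be tested without actually waiting. Retries suit transient faults; a bug that fails the same way every time needs a fix, not another attempt.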

Humans

  • Understand that all humans are unreliable
  • Design systems that minimize opportunities for error
  • Decouple the places where people make the most mistakes from the places that cause system failures
  • Test thoroughly at all levels
  • Set up monitoring/logs

Some tools deliberately trigger faults to reveal where the weak points in an application are. Netflix's Chaos Monkey is one example.
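The same idea can be tried on a small scale inside a test suite. The sketch below wraps a function so that it fails at random with a configurable probability. `InjectedFault` and `with_fault_injection` are hypothetical names made up for this example and are not taken from any real fault-injection tool.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised deliberately to simulate a component fault."""


def with_fault_injection(
    operation: Callable[[], T],
    failure_rate: float,
    rng: random.Random,
) -> Callable[[], T]:
    """Return a wrapper that raises InjectedFault with probability failure_rate."""

    def wrapper() -> T:
        # rng.random() is in [0.0, 1.0), so a rate of 1.0 always fails
        # and a rate of 0.0 never does.
        if rng.random() < failure_rate:
            raise InjectedFault("simulated fault")
        return operation()

    return wrapper


# Exercise a caller against a dependency that fails roughly 30% of the time.
rng = random.Random(42)  # seeded so a test run is repeatable
unreliable = with_fault_injection(lambda: "ok", failure_rate=0.3, rng=rng)

failures = 0
for _ in range(1000):
    try:
        unreliable()
    except InjectedFault:
        failures += 1
print(f"{failures} of 1000 calls failed")
```

Running the code that calls the dependency against a wrapper like this shows whether a fault in one component is handled or turns into a failure of the whole system.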


This post was based on the book Designing Data-Intensive Applications by Martin Kleppmann.


Authored by Anthony Fox on May 9, 2020