Software Reliability

May 9, 2020 · 1 min read · 199 words · engineering

A fault is when one component of a system deviates from its spec. A failure is when the system as a whole stops providing the required service to the user. Systems that anticipate faults and cope with them are called fault-tolerant or resilient. No system can tolerate every possible kind of fault, so it only makes sense to talk about tolerating specific types.

Hardware

Hard disks have a mean time to failure (MTTF) of about 10 to 50 years
Redundancy helps, for example RAID disk arrays and dual power supplies
So does moving away from single-server systems
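
To get a feel for what that MTTF range means at scale, here is a rough back-of-the-envelope sketch. It assumes disks fail independently at a constant rate, which is a simplification, and the cluster size of 10,000 disks is only an illustrative figure.

```python
def expected_failures_per_day(num_disks: int, mttf_years: float) -> float:
    """Expected disk failures per day, assuming independent failures
    at a constant rate (a simplifying assumption)."""
    mttf_days = mttf_years * 365
    return num_disks / mttf_days


# Illustrative cluster size; not a figure from any particular system.
for mttf_years in (10, 50):
    rate = expected_failures_per_day(10_000, mttf_years)
    print(f"MTTF {mttf_years} years: about {rate:.2f} failures per day")
# MTTF 10 years: about 2.74 failures per day
# MTTF 50 years: about 0.55 failures per day
```

Even with a long MTTF per disk, a large enough cluster sees a disk die somewhere between every other day and a few times a day, which is why single-server thinking stops working.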

Software

Software bugs
Runaway processes
Third-party API failures

Carefully thinking about the assumptions and interactions in the software can help
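
One way to make such an assumption explicit is to write down, in code, that a third-party call may fail and how many times it is worth retrying. The sketch below is a minimal illustration, not a recommendation from the book: `TransientError` and `call_with_retries` are hypothetical names, and the retry count and delays are arbitrary.

```python
import time


class TransientError(Exception):
    """Hypothetical error type standing in for a failed third-party call."""


def call_with_retries(operation, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run operation(), retrying on TransientError with exponential backoff.

    Re-raises the last error once max_attempts is exhausted, so a
    persistent fault surfaces instead of looping forever.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

Bounding the number of attempts matters as much as retrying: an unbounded retry loop is itself a candidate for becoming a runaway process.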

Humans

Understand that all humans are unreliable
Design systems that minimize opportunities for error
Decouple the places where people make the most mistakes from the places that cause system failures
Test thoroughly at all levels
Set up monitoring and logging

Some tools deliberately trigger faults to expose the weak points in an application; Netflix's Chaos Monkey is one example.
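
The idea of deliberately triggering faults can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any real fault-injection tool works: `with_fault_injection`, `InjectedFault`, `fetch_profile`, and `fetch_profile_or_default` are all hypothetical names.

```python
import random


class InjectedFault(Exception):
    """Raised deliberately to simulate a component failure."""


def with_fault_injection(func, failure_rate, rng=None):
    """Wrap func so each call fails with probability failure_rate."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected fault in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def fetch_profile(user_id):
    """Hypothetical dependency that normally succeeds."""
    return {"id": user_id, "name": "example"}


def fetch_profile_or_default(fetch, user_id):
    """Caller under test: falls back to a placeholder when the fetch fails."""
    try:
        return fetch(user_id)
    except InjectedFault:
        return {"id": user_id, "name": "unknown"}


# Force every call to fail and check that the caller degrades gracefully.
always_failing = with_fault_injection(fetch_profile, failure_rate=1.0)
print(fetch_profile_or_default(always_failing, 42))
# {'id': 42, 'name': 'unknown'}
```

Running the caller against the wrapped dependency shows whether the fallback path actually works, which is hard to know if the fault never occurs in testing.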

This post was based on the book Designing Data-Intensive Applications by Martin Kleppmann.

Authored by Anthony Fox on May 9, 2020