How Failing Fast Boosts the Resilience of Your Software

By Florentina Patrascu

What do you do when you encounter an error or failure? Do you handle it gracefully, do you anticipate it, or do you stop it in its tracks altogether? I’m here to tell you how failing fast can do it all.

“Failing fast” refers to a software development approach in which a system is designed to identify and handle errors as soon as they occur, rather than continuing to operate with faulty components or processes. The end goal is to prevent errors from spreading, which can lead to larger, more complex issues down the line and, by implication, to higher costs when those problems have to be solved in a later phase.
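
To make the idea concrete, here is a minimal sketch in Python. The `Order` type and the `order_total` function are hypothetical names invented for this illustration; the point is only that invalid data is rejected at the moment it enters the system, instead of travelling further and causing a harder-to-trace problem later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    """Hypothetical order record used only for this illustration."""

    order_id: str
    quantity: int
    unit_price: float

    def __post_init__(self) -> None:
        # Fail fast: validate at construction time, so a bad order can
        # never reach the code that computes totals or charges a customer.
        if not self.order_id:
            raise ValueError("order_id must not be empty")
        if self.quantity <= 0:
            raise ValueError(f"quantity must be positive, got {self.quantity}")
        if self.unit_price < 0:
            raise ValueError(f"unit_price must not be negative, got {self.unit_price}")


def order_total(order: Order) -> float:
    # No defensive checks needed here: an Order that exists is already valid.
    return order.quantity * order.unit_price
```

Constructing `Order("A-2", 0, 9.5)` raises a `ValueError` that names the offending field, which is usually far easier to diagnose than a wrong total discovered several steps later.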

Why “failing fast” might work for you

One of the main arguments for adopting such an approach is that it can boost the resilience of a software system. When errors are identified and fixed early in the process, even before going live in production, a system can recover more quickly and continue operating effectively in a stable environment. This can be particularly important in areas where uptime and reliability are paramount, such as in mission-critical systems or in systems that serve a large number of users.

However, this method works on a smaller scale just as well. Who hasn’t been in a situation where they implemented a change that ended up breaking a separate feature? Wouldn’t you have liked to be in a position where that breakage could be detected and reported immediately, so you could take care of it directly? Working on a variety of projects during my career taught me the value of applying a failing fast strategy, especially when I was new to a team and had few options to validate my work.

How to adopt a failing fast approach

It might not be for everyone, and it might come with its challenges, yet a good failing fast strategy should consider the following steps:

  1. Implement robust testing: To identify and address errors as soon as they occur, it is important to have robust testing in place. This can include unit tests, integration tests, and end-to-end tests, as well as testing in different stages of the development process. The key is to make your tests easy to run, in any phase of the development process.
  2. Use monitoring and alerting tools: Tools that monitor the performance and behavior of a software system can help identify errors. First, alerting systems can notify developers of issues in real time, allowing them to respond quickly. More importantly, though, they can connect to automated remediation tools that work on stabilizing your system again.
  3. Use fallback or recovery mechanisms: Implementing fallback or recovery mechanisms can help a system continue operating effectively even in the event of an error. For example, a system could use a backup server or database to continue serving users if the primary server or database experiences an issue.
  4. Implement rollback or roll forward strategies: In some cases, it may be necessary to roll back to a previous state, or to roll forward to a corrected version, to recover from an error. Implementing strategies for doing so can help a system recover more quickly and effectively. Another option is to implement new features behind a toggle that can easily be switched off in case of an issue.
  5. Use error handling and exception handling strategies: Implementing clear strategies for handling errors and exceptions can help a system recover more quickly and continue operating effectively.
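
Steps 3 and 4 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production design: `with_fallback`, `FeatureToggles`, `read_from_primary`, and `read_from_backup` are hypothetical names, and a real system would typically load its toggles from configuration rather than from an in-memory dictionary.

```python
import logging
from typing import Callable, Dict, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def with_fallback(primary: Callable[[], T], backup: Callable[[], T]) -> T:
    """Call the primary source; if it is unreachable, use the backup."""
    try:
        return primary()
    except ConnectionError as exc:
        # The failure is detected and reported immediately, not swallowed.
        logger.warning("primary failed (%s); switching to backup", exc)
        return backup()


class FeatureToggles:
    """In-memory feature toggles (assumption: real ones come from config)."""

    def __init__(self, flags: Dict[str, bool]) -> None:
        self._flags = dict(flags)

    def is_enabled(self, name: str) -> bool:
        # Fail fast on typos: an unknown toggle raises instead of
        # silently defaulting to on or off.
        if name not in self._flags:
            raise KeyError(f"unknown feature toggle: {name}")
        return self._flags[name]

    def disable(self, name: str) -> None:
        if name not in self._flags:
            raise KeyError(f"unknown feature toggle: {name}")
        self._flags[name] = False


# Hypothetical data sources used to demonstrate the fallback.
def read_from_primary() -> str:
    raise ConnectionError("primary database unreachable")


def read_from_backup() -> str:
    return "profile loaded from backup"
```

Calling `with_fallback(read_from_primary, read_from_backup)` logs the primary failure and returns the backup result, while `FeatureToggles({"new_checkout": True}).disable("new_checkout")` shows how a problematic feature could be switched off without a redeployment.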

Benefits of a failing fast system

A failing fast strategy can have a significant impact in making your systems more robust, with:

  1. Better reliability: It can minimize the impact of errors and allow your system to continue operating reliably.
  2. Improved troubleshooting: It provides detailed information about errors, which can help developers identify and fix issues more quickly and efficiently.
  3. Faster recovery: It can allow quicker recovery from errors, reducing downtime and minimizing the impact on users.
  4. Adaptability: It makes your system better equipped to handle change and unexpected events.
  5. Reduced risk: It minimizes the risk of errors cascading into more complex issues, reducing the potential for costly and time-consuming repairs.

Failing fast and DevOps

A failing fast strategy goes hand-in-hand with a strong DevOps mindset – the earlier you fail in continuous integration pipelines, the faster you find yourself with a reliable change, ready to be released.

The purpose of DevOps in this strategy is not to maximize failure but rather to give development teams a structured environment where the quicker they fail, the quicker they can discover ways to improve systems and products. If failures happen early in the development process, for example in the pull request phase, where you can run unit tests, developers are more likely to spot security defects and errors before a product goes into deployment. This minimizes the likelihood of finding a severe flaw in an application just before it is rolled out to the end users.
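
The behavior of such a pipeline can be modeled with a short Python sketch. It does not represent any particular CI product: `run_pipeline` and `PipelineResult` are hypothetical names, and each step is assumed to be a function that returns `True` on success.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# A step is a name plus a check that returns True when it passes.
Step = Tuple[str, Callable[[], bool]]


@dataclass
class PipelineResult:
    executed: List[str] = field(default_factory=list)
    failed_step: Optional[str] = None

    @property
    def succeeded(self) -> bool:
        return self.failed_step is None


def run_pipeline(steps: List[Step]) -> PipelineResult:
    """Run steps in order and stop at the first one that fails."""
    result = PipelineResult()
    for name, check in steps:
        result.executed.append(name)
        if not check():
            # Fail fast: later, more expensive steps (such as deployment)
            # never run on a change that is already known to be broken.
            result.failed_step = name
            break
    return result
```

Ordering the cheapest checks first, such as unit tests before integration tests before deployment, means that most broken changes are reported within the first step rather than at the end of the pipeline.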

Conclusion

In conclusion, failing fast should be considered a valuable influence on your software development process, with the potential to boost the resilience of a system by identifying and addressing errors early on. By implementing robust testing, monitoring, alerting tools, and error handling strategies, developers can build systems that are more reliable, adaptable, and better equipped to handle change and unexpected events.

About Florentina Patrascu

Florentina is a .NET Developer and a Team Lead, with over 8 years of experience in the field of IT. Her role as a team lead allows her to maintain a strategic perspective in helping her colleagues reach their maximum potential within the company, while also focusing on providing the best quality to the clients she works with.
