Availability & Recovery: When Your App Goes Down, What Saves Your Business

Quick Answer:

Availability is the percentage of time your application is running; recovery is how fast it returns when something breaks. According to Google's Site Reliability Engineering book, 99.9% availability allows 8.76 hours of downtime per year, while 99.99% cuts that budget to just 52.6 minutes annually. The difference costs real money.

Key Takeaways:

The "nines" budget: Google SRE documents that 99% allows 3.65 days of annual downtime; 99.9% allows 8.76 hours; 99.99% allows 52.6 minutes; 99.999% allows 5.26 minutes.
RPO is not RTO: RPO measures how much data you can lose; RTO measures how long you can be down. AWS describes recovery scenarios with RPO in seconds and RTO in minutes as the most demanding end of the spectrum.
The technical definition of HA: Wikipedia defines high availability as "ensuring an agreed level of operational performance, usually uptime, for a higher than normal period."
Active vs. passive redundancy: Wikipedia distinguishes active redundancy (multiple components with automatic failover) from passive redundancy (excess capacity that absorbs the failure, like a two-engine boat).
Non-disruptive testing: AWS recommends "non-disruptive recovery and failback drills" — a plan that is never exercised is not a plan, it is a wish.

Almost every business owner we meet — in Houston, Cypress, Monterrey, or Bogotá — gives the same answer when we ask about their disaster-recovery plan: "the developer has it." That is a dangerous answer, because when the online store is down at 8 a.m. on Black Friday, it is not the developer losing sales. It is you.

This article translates the terms that infrastructure teams use — availability, RPO, RTO, redundancy, failover — into business decisions you can make without being an engineer. Not so you can configure anything, but so you know what to demand from whoever does.

What "availability" actually means

Availability is expressed as a percentage of the total time a system responds correctly. It sounds simple, but the difference between 99% and 99.99% is not cosmetic: it is the difference between days and minutes of annual downtime.

According to the Site Reliability Engineering book published by Google, the canonical levels are:

The "nines" table (figures published by Google SRE; the monthly figures assume a 30-day month):

99% (two nines): 3.65 days of downtime per year, 7.2 hours per month.
99.9% (three nines): 8.76 hours per year, 43.2 minutes per month.
99.99% (four nines): 52.6 minutes per year, 4.32 minutes per month.
99.999% (five nines): 5.26 minutes per year, 25.9 seconds per month.
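
The arithmetic behind the table is simple enough to check yourself. The sketch below is only an illustration, assuming a 365-day year and a 30-day month; the function name `downtime_budget_hours` is ours, not something taken from the Google SRE book.

```python
# Sketch: turn an availability percentage into a downtime budget.
# Assumption: a 365-day year and a 30-day month, which matches the table above.

HOURS_PER_YEAR = 365 * 24
HOURS_PER_MONTH = 30 * 24


def downtime_budget_hours(availability_pct: float, period_hours: float) -> float:
    """Hours of downtime allowed in a period at the given availability."""
    return (1 - availability_pct / 100) * period_hours


for pct in (99.0, 99.9, 99.99, 99.999):
    per_year = downtime_budget_hours(pct, HOURS_PER_YEAR)
    per_month = downtime_budget_hours(pct, HOURS_PER_MONTH)
    print(f"{pct}%: {per_year:.2f} hours/year, {per_month * 60:.2f} minutes/month")
```

Run it against whatever number is in your hosting contract: the result is the amount of downtime the vendor can deliver while still meeting the agreement.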

Wikipedia, in its "High availability" article, presents the same orders of magnitude with slightly different rounding: 99.9% translates to roughly "9 hours" of annual downtime and 99.99% to "53 minutes." Both sources agree on what matters: every added nine cuts the downtime budget by a factor of ten.

What this means for the business owner: when a vendor promises "99.9% uptime," they are promising your site can be down for nearly 9 hours per year and still meet the contract. If those 9 hours fall during your peak season, the contract was honored and you lost sales. That is why the right question is not "how many nines do you promise?" but "when are the maintenance windows scheduled?" and "how is compliance measured?"

RPO and RTO: the two questions that matter

When something fails — and it will — two numbers define the damage: how much information you lose and how long it takes to come back. The industry calls them RPO and RTO.

RPO (Recovery Point Objective) is the maximum amount of data your business can afford to lose, measured in time. If you back up every 24 hours, your effective RPO is 24 hours: in the worst case, the failure hits just before the next backup and you lose a full day of orders, invoices, and customer records. If you replicate in real time, your RPO is measured in seconds.

RTO (Recovery Time Objective) is the maximum acceptable time between failure and the moment the service is operational again. An RTO of 4 hours means your team — or your vendor — commits to being back up in 4 hours or less.
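
To make the two definitions concrete, here is a minimal sketch that measures a single incident against an RPO and an RTO. The `Incident` class, its field names, and the timestamps are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Timestamps for one outage. All field names are hypothetical."""
    last_good_backup: datetime   # newest backup that can actually be restored
    failure: datetime            # when the service really stopped working
    service_restored: datetime   # when customers could use it again


def data_lost(incident: Incident) -> timedelta:
    """Window of data lost: everything written after the last good backup."""
    return incident.failure - incident.last_good_backup


def time_to_recover(incident: Incident) -> timedelta:
    """Downtime counted from the failure itself, not from when it was noticed."""
    return incident.service_restored - incident.failure


def meets_objectives(incident: Incident, rpo: timedelta, rto: timedelta) -> bool:
    """True only if both the data-loss and the downtime targets were met."""
    return data_lost(incident) <= rpo and time_to_recover(incident) <= rto


# Made-up example: backup at 2 a.m., failure at 8 a.m., back up at 1:30 p.m.
incident = Incident(
    last_good_backup=datetime(2024, 11, 29, 2, 0),
    failure=datetime(2024, 11, 29, 8, 0),
    service_restored=datetime(2024, 11, 29, 13, 30),
)
print("Data lost:", data_lost(incident))
print("Time to recover:", time_to_recover(incident))
print("Met RPO 24h / RTO 4h:",
      meets_objectives(incident, timedelta(hours=24), timedelta(hours=4)))
```

In this made-up incident the RPO is met (6 hours of data lost against a 24-hour target) but the RTO is not (5.5 hours down against a 4-hour target). The two numbers are independent, and a plan can pass one while failing the other.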

According to the AWS Disaster Recovery page, Amazon's managed replication services "enable RPOs of seconds and RTOs of minutes" for on-premises-to-cloud recovery scenarios. That does not mean you need that level — it means it exists and can be bought if your business model justifies it.

A numeric example:

Take an online store doing $20,000 USD a day in sales. If those sales are spread evenly over 24 hours, a 4-hour outage on an average day costs about $3,300 in lost sales; on a payday Friday it can triple. If your current RTO is "however long it takes Carlos to answer the WhatsApp," your real RTO is probably closer to 8 to 12 hours than the 4 you assume.
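
The same estimate as a short sketch you can rerun with your own numbers. It assumes sales are spread evenly across the day, which is a simplification; the `peak_multiplier` parameter is our own hypothetical knob for busy days.

```python
def outage_cost(daily_sales: float, outage_hours: float,
                peak_multiplier: float = 1.0) -> float:
    """Estimate lost sales for an outage.

    Assumption: sales are spread evenly over 24 hours. peak_multiplier is a
    rough, hypothetical adjustment for days when traffic is above average.
    """
    hourly_sales = daily_sales / 24
    return hourly_sales * outage_hours * peak_multiplier


average_day = outage_cost(20_000, 4)
payday_friday = outage_cost(20_000, 4, peak_multiplier=3.0)
print(f"Average day: ${average_day:,.0f}")
print(f"Payday Friday: ${payday_friday:,.0f}")
```

Real stores rarely sell evenly through the day, so an outage during business hours will usually cost more than this estimate and one at 3 a.m. will cost less.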

How availability is built: redundancy

High availability is not bought with a single bigger server; it is built with redundancy. Wikipedia distinguishes two models:

Active redundancy "uses multiple components with automatic failure detection and reconfiguration, enabling high availability without performance decline in complex systems." Think of two servers running in parallel: if one fails, the other is already processing traffic and the customer never notices.

Passive redundancy "accommodates performance decline through excess capacity," which Wikipedia illustrates with "a boat with dual engines continuing operation despite single-engine failure." Both engines share the work in normal operation; when one fails, the boat keeps moving on the other, only slower. The service survives, but performance drops until the failed part is repaired.

The difference matters when quoting. Active redundancy costs roughly double in infrastructure but delivers invisible failover. Passive redundancy costs less but introduces a degradation window of minutes to hours. The decision depends on how much you lose per minute of downtime — not on the technical pride of the team.
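
A standard back-of-the-envelope formula shows why redundancy beats a bigger server. If each copy is available a fraction a of the time and the copies fail independently, the chance that all n are down at once is (1 - a)^n. The sketch below assumes independent failures and instant, flawless failover, neither of which holds perfectly in practice, so treat the result as an upper bound.

```python
def redundant_availability(single_availability: float, copies: int) -> float:
    """Availability of a group where any one working copy keeps the service up.

    Assumptions: copies fail independently and failover is instant and perfect.
    Real systems share power, networks, and bugs, so actual figures are lower.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    all_down = (1 - single_availability) ** copies
    return 1 - all_down


for copies in (1, 2, 3):
    availability = redundant_availability(0.99, copies)
    print(f"{copies} server(s) at 99% each: {availability * 100:.4f}% combined")
```

Under these assumptions, two ordinary 99% servers reach four nines on paper. That is why the quote you receive should say what the copies share (data center, provider, database), because anything shared breaks the independence assumption.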

Replication, failback, and non-disruptive testing

AWS describes a pattern on its Disaster Recovery page that has become standard: data is replicated to "a staging area subnet in your AWS account" using "affordable storage and minimal compute resources." In practice, this means your site keeps running on its primary infrastructure while a copy sits ready, cheap, in another region. When something fails, that copy is activated.

Two details from the AWS documentation matter to the business owner:

Failback: AWS describes the ability to "initiate data replication back to your primary site when the issue is resolved" and "fail back to your primary site whenever you're ready." Recovery is not just bringing up the backup; it is returning to the original state without losing what was processed during the outage.
Non-disruptive testing: AWS recommends "non-disruptive tests to confirm that implementation is complete" and "non-disruptive recovery and failback drills." A recovery plan that is never exercised is a document, not a plan.
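
To show what a non-disruptive drill can look like at the smallest possible scale, here is a sketch using SQLite from Python's standard library. It is not the AWS process; the table name `orders` and the helper names are hypothetical. The idea is the same at any size: restore the backup somewhere other than production and verify that the data is really there.

```python
import os
import sqlite3
import tempfile


def backup_database(source: sqlite3.Connection, backup_path: str) -> None:
    """Copy a live SQLite database to a file without taking it offline."""
    dest = sqlite3.connect(backup_path)
    try:
        source.backup(dest)
    finally:
        dest.close()


def restore_drill(backup_path: str, table: str, expected_rows: int) -> bool:
    """Open the backup as a separate database and check that it is usable.

    The table name is trusted here because this is only a sketch.
    """
    restored = sqlite3.connect(backup_path)
    try:
        intact = restored.execute("PRAGMA integrity_check").fetchone()[0] == "ok"
        rows = restored.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        return intact and rows == expected_rows
    finally:
        restored.close()


# Stand-in for the production database, with three made-up orders.
live = sqlite3.connect(":memory:")
live.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
live.executemany("INSERT INTO orders (total) VALUES (?)",
                 [(19.99,), (45.00,), (120.50,)])
live.commit()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "orders_backup.db")
    backup_database(live, path)
    drill_passed = restore_drill(path, "orders", expected_rows=3)
    print("Restore drill passed:", drill_passed)

live.close()
```

The live database is never touched by the drill, which is what makes it non-disruptive. A real drill would also time the restore, since that duration is your measured RTO.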

The five questions every owner should be able to answer

If your business depends on an application, site, or system staying up, you should be able to answer these five questions without calling anyone:

What is our RPO? How much data can we lose in the worst case? If the answer is "I don't know," assume 24 hours.
What is our RTO? How long until we're back? Count from the failure, not from when someone noticed.
When was the last restore test? A backup that has never been restored is not a backup, it is a file.
What happens if our primary provider goes down? AWS, GoDaddy, and Shopify have had regional outages. Do you have an answer or do you have hope?
What does an hour of downtime cost in real sales? That number defines how much it is worth investing in availability.

The most expensive mistake: optimizing for nines instead of dollars

Technical teams without business context tend to chase nines like trophies. But five nines — which Google SRE documents at just 5.26 minutes of annual downtime — costs orders of magnitude more than three nines. For most mid-sized businesses, the investment to move from 99.9% to 99.99% does not pay back in extra sales: people do not abandon a brand because the site was down 8 hours a year if those hours were at 3 a.m.

The right business question is not "how do I get to 99.999%?" It is "what does an hour of downtime cost me, and when does it happen?" That question turns an abstract technical conversation into a concrete financial decision.
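
That financial decision can be sketched in a few lines. The example below assumes the full downtime budget is used every year and that each hour of downtime costs the same, both simplifications; the dollar figures are hypothetical placeholders for your own.

```python
HOURS_PER_YEAR = 365 * 24


def expected_downtime_hours(availability_pct: float) -> float:
    """Annual downtime if the whole budget for this availability is consumed."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


def annual_savings(current_pct: float, target_pct: float,
                   cost_per_hour: float) -> float:
    """Lost sales avoided per year by moving to the higher availability."""
    hours_saved = (expected_downtime_hours(current_pct)
                   - expected_downtime_hours(target_pct))
    return hours_saved * cost_per_hour


def upgrade_pays_off(current_pct: float, target_pct: float,
                     cost_per_hour: float, annual_upgrade_cost: float) -> bool:
    """True if the avoided losses exceed what the upgrade costs each year."""
    return annual_savings(current_pct, target_pct, cost_per_hour) > annual_upgrade_cost


# Hypothetical inputs: $1,000 lost per hour down, $30,000 a year for the upgrade.
savings = annual_savings(99.9, 99.99, cost_per_hour=1_000)
print(f"Avoided losses per year: ${savings:,.0f}")
print("Upgrade pays off:", upgrade_pays_off(99.9, 99.99, 1_000, 30_000))
```

With these placeholder numbers the extra nine avoids under $8,000 a year in losses and does not cover a $30,000 upgrade. Change the cost per hour to your peak-season figure and the answer can flip.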

"Nobody goes out of business for not having five nines. They go out of business for having zero tested backups on the day the thing that never happens, happens."
- Diego Medina F, Founder of MerchandisePROS

What this means for your business

MerchandisePROS' Website Consulting service combines a UI/UX audit with a Core Web Vitals and speed analysis — the real first step to understanding not just how fast your site is, but how resilient it is when something breaks. We document your actual RPO and RTO (not the ones you think you have), check whether your hosting is genuinely backed up, and hand you a prioritized list of the risks that can take your operation down.

If you have never done this exercise, start with the 60-second Free Audit: it shows where your site stands today against competitors and best practices, including load speed (the first public signal of infrastructure problems). It is not a marketing report; it is an initial diagnostic you can take to your developer or to your next operations meeting.

Frequently Asked Questions

What does 99.9% availability actually mean?

According to Google's Site Reliability Engineering book, 99.9% availability allows 8.76 hours of downtime per year, or roughly 43.2 minutes per month. It is not "almost always up" — it is an explicit downtime budget.

What is the difference between RPO and RTO?

RPO (Recovery Point Objective) is how much data you can afford to lose, measured in time. RTO (Recovery Time Objective) is how long you can afford to be offline. AWS describes scenarios with RPO in seconds and RTO in minutes as the most demanding end of the spectrum.

What is high availability?

Wikipedia defines high availability as "a characteristic of a system that aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period." In practice, it means designing with redundancy so a single component failure does not take down the whole service.

What is the difference between active and passive redundancy?

According to Wikipedia, active redundancy uses multiple components with automatic failure detection and reconfiguration, maintaining performance without degradation. Passive redundancy accommodates performance decline through excess capacity — like a boat with two engines continuing to run if one fails.

Does my business need five nines of availability?

Almost never. Five nines (99.999%) allows only 5.26 minutes of downtime per year per Google SRE and costs orders of magnitude more than three nines. The right question is not "how many nines do I want?" but "what does each hour of downtime cost me in real sales?"