Quick Answer:
Availability is the percentage of time your application is running; recovery is how fast it returns when something breaks. According to Google's Site Reliability Engineering book, 99.9% availability allows 8.76 hours of downtime per year, while 99.99% cuts that budget to just 52.6 minutes annually. The difference costs real money.
Key Takeaways:
Almost every business owner we meet — in Houston, Cypress, Monterrey, or Bogotá — gives the same answer when we ask about their disaster-recovery plan: "the developer has it." That is a dangerous answer, because when the online store is down at 8 a.m. on Black Friday, it is not the developer losing sales. It is you.
This article translates the terms that infrastructure teams use — availability, RPO, RTO, redundancy, failover — into business decisions you can make without being an engineer. Not so you can configure anything, but so you know what to demand from whoever does.
Availability is expressed as a percentage of the total time a system responds correctly. It sounds simple, but the difference between 99% and 99.99% is not cosmetic: it is the difference between days and minutes of annual downtime.
According to the Site Reliability Engineering book published by Google, the canonical levels are:
The "nines" table (figures published by Google SRE):
Wikipedia, in its "High availability" article, presents the same orders of magnitude with slightly different rounding: 99.9% translates to roughly "9 hours" of annual downtime and 99.99% to "53 minutes." Both sources agree on what matters: every added nine cuts the downtime budget by a factor of ten.
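The downtime budgets quoted above follow from simple arithmetic, and a short sketch makes the factor-of-ten rule easy to check. The function names here are illustrative, and the calculation assumes a 365-day year (8,760 hours), which is why results may differ slightly from sources that round differently.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored (assumption)


def annual_downtime_hours(availability_pct: float) -> float:
    """Hours per year a service can be down and still meet the availability target."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def format_duration(hours: float) -> str:
    """Express a duration in the most readable unit: days, hours, or minutes."""
    if hours >= 24:
        return f"{hours / 24:.2f} days"
    if hours >= 1:
        return f"{hours:.2f} hours"
    return f"{hours * 60:.2f} minutes"


# Two, three, four, and five nines.
for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {format_duration(annual_downtime_hours(pct))} per year")
```

Run as written, this gives about 3.65 days for 99%, 8.76 hours for 99.9%, 52.56 minutes for 99.99%, and 5.26 minutes for 99.999%.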
What this means for the business owner: when a vendor promises "99.9% uptime," they are promising your site can be down for nearly 9 hours per year and still meet the contract. If those 9 hours fall during your peak season, the contract was honored and you lost sales. That is why the right question is not "how many nines do you promise?" but "when are the maintenance windows scheduled?" and "how is compliance measured?"
When something fails — and it will — two numbers define the damage: how much information you lose and how long it takes to come back. The industry calls them RPO and RTO.
RPO (Recovery Point Objective) is the maximum amount of data your business can afford to lose, measured in time. If you back up every 24 hours, your RPO is 24 hours: in the worst case, you lose a full day of orders, invoices, and customer records. If you replicate in real time, your RPO is measured in seconds.
RTO (Recovery Time Objective) is the maximum acceptable time between failure and the moment the service is operational again. An RTO of 4 hours means your team — or your vendor — commits to being back up in 4 hours or less.
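Both definitions can be turned into a simple pass/fail check. The sketch below is a minimal illustration, not a monitoring tool: the function names are hypothetical, and it assumes periodic backups where a failure just before the next backup loses one full interval of data.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """With periodic backups, the worst case is a failure just before the next backup runs."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo_target: timedelta) -> bool:
    """True if the worst-case data loss stays within the RPO target."""
    return worst_case_data_loss(backup_interval) <= rpo_target


def meets_rto(measured_recovery: timedelta, rto_target: timedelta) -> bool:
    """True if a measured recovery (for example, from a restore drill) fits the RTO target."""
    return measured_recovery <= rto_target


# Hypothetical figures: nightly backups against a 1-hour RPO target,
# and a restore drill that took 6 hours against a 4-hour RTO target.
print(meets_rpo(timedelta(hours=24), timedelta(hours=1)))  # False
print(meets_rto(timedelta(hours=6), timedelta(hours=4)))   # False
```

The useful part is the inputs: the RTO check only means something if the recovery time comes from an actual restore drill rather than an estimate.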
According to the AWS Disaster Recovery page, Amazon's managed replication services "enable RPOs of seconds and RTOs of minutes" for on-premises-to-cloud recovery scenarios. That does not mean you need that level — it means it exists and can be bought if your business model justifies it.
A numeric example:
Consider an online store doing $20,000 USD a day in sales. If sales are spread evenly across the day, a 4-hour outage on an average day costs about $3,300 in lost sales (4/24 of $20,000); on a payday Friday it can triple. If your current RTO is "however long it takes Carlos to answer the WhatsApp," your real RTO is probably closer to 8-12 hours than the 4 you assume.
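The arithmetic behind this example fits in a few lines. This is a rough sketch: it assumes sales are spread evenly over 24 hours, and `peak_multiplier` is a hypothetical parameter standing in for busy days such as a payday Friday.

```python
def outage_cost(daily_sales: float, outage_hours: float, peak_multiplier: float = 1.0) -> float:
    """Estimated lost sales for an outage, assuming sales are spread evenly over 24 hours."""
    hourly_sales = daily_sales / 24
    return hourly_sales * outage_hours * peak_multiplier


average_day = outage_cost(20_000, 4)                       # roughly $3,333
payday_friday = outage_cost(20_000, 4, peak_multiplier=3)  # roughly $10,000
slow_response = outage_cost(20_000, 12)                    # roughly $10,000 if recovery really takes 12 hours

print(f"Average day, 4 h:   ${average_day:,.0f}")
print(f"Payday Friday, 4 h: ${payday_friday:,.0f}")
print(f"Average day, 12 h:  ${slow_response:,.0f}")
```

On these assumptions, a 12-hour recovery on an ordinary day costs about as much as a 4-hour outage on a day with triple the usual sales.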
High availability is not bought with a single bigger server; it is built with redundancy. Wikipedia distinguishes two models:
Active redundancy "uses multiple components with automatic failure detection and reconfiguration, enabling high availability without performance decline in complex systems." Think of two servers running in parallel: if one fails, the other is already processing traffic and the customer never notices.
Passive redundancy "accommodates performance decline through excess capacity — like a boat with dual engines continuing operation despite single-engine failure." The second engine exists but is not used until the first fails; there is a transition moment when performance drops.
The difference matters when quoting. Active redundancy costs roughly double in infrastructure but delivers invisible failover. Passive redundancy costs less but introduces a degradation window of minutes to hours. The decision depends on how much you lose per minute of downtime — not on the technical pride of the team.
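One way to frame that decision is a break-even check: does the downtime avoided by active redundancy outweigh its extra infrastructure cost? The sketch below rests on loudly stated assumptions: every figure is hypothetical, active failover is treated as losing nothing, and the passive degradation window is treated as full downtime (a worst case).

```python
def annual_failover_loss(cost_per_minute: float, failures_per_year: float,
                         degradation_minutes: float) -> float:
    """Expected yearly loss from passive failover, treating degradation as full downtime."""
    return cost_per_minute * failures_per_year * degradation_minutes


def active_redundancy_pays_off(extra_annual_infra_cost: float, cost_per_minute: float,
                               failures_per_year: float, degradation_minutes: float) -> bool:
    """True if the loss avoided by active redundancy exceeds its extra yearly cost."""
    avoided = annual_failover_loss(cost_per_minute, failures_per_year, degradation_minutes)
    return avoided > extra_annual_infra_cost


# Hypothetical inputs: $20,000/day in sales, 2 failures a year,
# a 2-hour passive failover window, $6,000/year of extra infrastructure.
cost_per_minute = 20_000 / (24 * 60)
print(active_redundancy_pays_off(6_000, cost_per_minute, 2, 120))  # False
```

With these numbers the avoided loss is about $3,333 a year, so the passive setup is the cheaper choice; a store losing ten times as much per minute would reach the opposite conclusion.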
AWS describes a pattern on its Disaster Recovery page that has become standard: data is replicated to "a staging area subnet in your AWS account" using "affordable storage and minimal compute resources." In practice, this means your site keeps running on its primary infrastructure while a low-cost copy sits ready in another region. When something fails, that copy is activated.
Two details from the AWS documentation matter to the business owner:
If your business depends on an application, site, or system staying up, you should be able to answer these five questions without calling anyone:
Technical teams without business context tend to chase nines like trophies. But five nines, which Google SRE documents at just 5.26 minutes of annual downtime, cost orders of magnitude more than three nines. For most mid-sized businesses, the investment to move from 99.9% to 99.99% does not pay back in extra sales: people do not abandon a brand because the site was down for under 9 hours a year if those hours fell at 3 a.m.
The right business question is not "how do I get to 99.999%?" It is "what does an hour of downtime cost me, and when does it happen?" That question turns an abstract technical conversation into a concrete financial decision.
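The "when does it happen" half of the question can be made concrete by weighting each hour of the day by its share of sales. The profile below is invented for illustration (1% of daily sales per hour overnight, the rest spread evenly over the daytime); a real one would come from your own order history.

```python
def outage_cost_at(daily_sales: float, hourly_share: dict[int, float],
                   start_hour: int, duration_hours: int) -> float:
    """Lost sales for an outage starting at start_hour, summing each affected hour's share."""
    return sum(
        daily_sales * hourly_share[(start_hour + offset) % 24]
        for offset in range(duration_hours)
    )


# Hypothetical profile: hours 0-7 carry 1% of daily sales each,
# and the remaining 92% is spread evenly over hours 8-23.
hourly_share = {hour: (0.01 if hour < 8 else 0.92 / 16) for hour in range(24)}

night_outage = outage_cost_at(20_000, hourly_share, start_hour=3, duration_hours=4)
morning_outage = outage_cost_at(20_000, hourly_share, start_hour=8, duration_hours=4)

print(f"4 h starting at 3 a.m.: ${night_outage:,.0f}")
print(f"4 h starting at 8 a.m.: ${morning_outage:,.0f}")
```

Under this profile the same 4-hour outage costs about $800 at 3 a.m. and about $4,600 at 8 a.m., which is why the timing of maintenance windows matters as much as the number of nines.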
"Nobody goes out of business for not having five nines. They go out of business for having zero tested backups on the day the thing that never happens, happens."
- Diego Medina F, Founder of MerchandisePROS
MerchandisePROS' Website Consulting service combines a UI/UX audit with a Core Web Vitals and speed analysis — the real first step to understanding not just how fast your site is, but how resilient it is when something breaks. We document your actual RPO and RTO (not the ones you think you have), check whether your hosting is genuinely backed up, and hand you a prioritized list of the risks that can take your operation down.
If you have never done this exercise, start with the 60-second Free Audit: it shows where your site stands today against competitors and best practices, including load speed (the first public signal of infrastructure problems). It is not a marketing report; it is an initial diagnostic you can take to your developer or to your next operations meeting.
According to Google's Site Reliability Engineering book, 99.9% availability allows 8.76 hours of downtime per year, or roughly 43.2 minutes per month. It is not "almost always up" — it is an explicit downtime budget.
RPO (Recovery Point Objective) is how much data you can afford to lose, measured in time. RTO (Recovery Time Objective) is how long you can afford to be offline. AWS describes scenarios with RPO in seconds and RTO in minutes as the most demanding end of the spectrum.
Wikipedia defines high availability as "a characteristic of a system that aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period." In practice, it means designing with redundancy so a single component failure does not take down the whole service.
According to Wikipedia, active redundancy uses multiple components with automatic failure detection and reconfiguration, maintaining performance without degradation. Passive redundancy accommodates performance decline through excess capacity — like a boat with two engines continuing to run if one fails.
A mid-sized business almost never needs five nines. Five nines (99.999%) allows only 5.26 minutes of downtime per year per Google SRE and costs orders of magnitude more than three nines. The right question is not "how many nines do I want?" but "what does each hour of downtime cost me in real sales?"
Start with a free audit and find out where your site stands on infrastructure risk, speed, and experience.