Rate Limiting: Why Free APIs Block You and How Real Apps Handle Traffic Spikes

API rate limiting explained: why servers return HTTP 429 and how it affects your business

Quick Answer:

Rate limiting is how a server tells an app "you asked too much, slow down." According to MDN, when the limit is exceeded the server responds with HTTP 429 Too Many Requests and optionally a Retry-After header indicating how many seconds to wait. Every serious API — from Stripe to OpenAI — applies it to protect its infrastructure.

Key Takeaways:

HTTP 429 is the official code: according to MDN, it indicates "the client has sent too many requests in a given amount of time." It is defined in IETF RFC 6585.
Retry-After tells you how long to wait: the response header gives either a number of seconds or a date after which you may retry. For example, Retry-After: 3600 means wait 3600 seconds (60 minutes).
Stripe rejects millions of requests a month: engineer Paul Tarjan reported on the Stripe blog that their rate limiter "has rejected millions of requests this month alone."
The dominant algorithm is token bucket: Stripe states it directly, "We use the token bucket algorithm to do rate limiting," and implements its limiters with Redis.
Four layers exist in production: Stripe uses a request rate limiter, a concurrent requests limiter, a fleet usage load shedder, and a worker utilization load shedder. The two load shedders drop lower-priority traffic first so critical traffic gets through.

If a store integration ever stopped working at noon on a Monday, if your AI chatbot started returning cut-off responses during a promotion, or if your payment gateway silently rejected a few transactions on Black Friday, you may well have hit a rate limit. That is not necessarily your team's fault. It is a defense mechanism every serious API in the world applies in order to survive.

This article explains exactly what rate limiting is, why it exists, how companies like Stripe — which processes billions of dollars in payments every year — apply it, and most importantly, what signals you should watch in your own business to detect when an API is throttling you before a customer notices. Whether you run a business in Houston, Cypress, Monterrey, or Bogotá, this is the kind of invisible technical friction that separates a smooth digital experience from one that silently bleeds revenue.

What rate limiting actually is

Rate limiting is a mechanism by which a server restricts how many requests it accepts from a client within a defined time window. When a client exceeds that limit, the server rejects further requests until the client is back under the limit.

According to Mozilla Developer Network (MDN), when this happens the server responds with HTTP 429 Too Many Requests, formally indicating that "the client has sent too many requests in a given amount of time." MDN notes the mechanism "is commonly called rate limiting — a way of asking the client to slow down the rate of requests." It is a standardized IETF code, defined in RFC 6585.

The 429 response usually carries a key header: Retry-After. This header tells the client how long it should wait before trying again, expressed either as a number of seconds or as a date. MDN gives a concrete example: a response with Retry-After: 3600 means the client should wait 3600 seconds (sixty minutes) before resuming requests.
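As a minimal sketch of reading that header, the Python function below turns a Retry-After value into a wait time in seconds. The function name `retry_after_seconds` and its `now` parameter are illustrative assumptions, not part of any library. It handles both forms the HTTP specification allows: a whole number of seconds and an HTTP date.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None):
    """Convert a Retry-After header value into a wait time in seconds.

    The value may be a whole number of seconds ("3600") or an HTTP date
    ("Wed, 21 Oct 2015 07:28:00 GMT"). Returns None if it cannot be parsed.
    """
    if header_value is None:
        return None
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    try:
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if retry_at.tzinfo is None:
        # Treat a date without a timezone as UTC (an assumption of this sketch).
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry now", so never return a negative wait.
    return max(0.0, (retry_at - now).total_seconds())
```

Returning None for an unreadable header lets the caller fall back to its own waiting strategy instead of guessing.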

What happens behind the scenes: the server does not drop your connection cold. It responds with an HTTP code and, if the API developer is disciplined, also with precise instructions on how to retry. The problem is that many poorly-built integrations ignore the header and retry immediately — making the situation worse and sometimes triggering longer blocks.
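A well-behaved integration does the opposite of retrying immediately. The sketch below shows one possible shape in Python, under stated assumptions: `send` is a hypothetical callable that performs the request and returns a status code plus a headers dictionary, and the defaults for attempts and delays are illustrative, not recommendations from any API provider. It waits for the time the server asks for, and falls back to exponential backoff with random jitter when no usable header is present.

```python
import random
import time


def call_with_retries(send, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call send() until it stops returning 429 or attempts run out.

    send is a hypothetical callable returning (status_code, headers_dict).
    Returns the last (status_code, headers_dict) received.
    """
    for attempt in range(1, max_attempts + 1):
        status, headers = send()
        if status != 429 or attempt == max_attempts:
            return status, headers
        # Plain dict lookup is case-sensitive; real HTTP clients usually
        # expose case-insensitive headers.
        retry_after = headers.get("Retry-After", "")
        if retry_after.isdigit():
            # The server said how long to wait, so honor it.
            delay = float(retry_after)
        else:
            # No usable header: double the ceiling each attempt, then pick
            # a random delay below it so clients do not retry in lockstep.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            delay = random.uniform(0, ceiling)
        sleep(delay)
```

This sketch only reads the seconds form of Retry-After and does not cap a very long server-requested wait; a production integration would probably need to handle both.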

How professional APIs apply it: the Stripe case

Few companies have documented their rate-limit architecture as transparently as Stripe, the payment processor that handles transactions for millions of merchants worldwide. In a Stripe engineering blog post written by Paul Tarjan, the company describes exactly what types of limiters it uses and why.

According to Stripe, the platform uses four different types of limiters, not one:

1. Request Rate Limiter (requests per second). Restricts each user to N requests per second. Stripe reports on this limiter: "Our rate limits for requests is constantly triggered. It has rejected millions of requests this month alone." This gives a sense of scale: the rate limiter is the first line of defense and fires millions of times.

2. Concurrent Requests Limiter. Limits how many simultaneous in-flight requests a client can have. Stripe describes it: "Our concurrent request limiter is triggered much less often (12,000 requests this month)." It is a secondary brake addressing a different pattern: clients that open many long requests in parallel.

3. Fleet Usage Load Shedder. Reserves a portion of infrastructure for critical requests. Stripe explains: "if our reservation number is 20%, then any non-critical request over their 80% allocation would be rejected." That is, if the fleet is under pressure, lower-priority requests are dropped first so critical payment requests still get through.

4. Worker Utilization Load Shedder. The last line of defense during serious incidents. Stripe categorizes traffic into "critical methods, POSTs, GETs, test mode traffic" — and sheds lower-priority traffic when the infrastructure is genuinely threatened.
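To make the reservation arithmetic in the third limiter concrete, here is a small Python sketch. It is not Stripe's implementation: the class name `FleetUsageShedder`, the notion of a fixed `capacity` of in-flight requests, and the `reserved_percent` parameter are all assumptions chosen for illustration. With a 20% reservation, non-critical requests are refused once they hold 80% of capacity, while critical requests can still use the rest.

```python
class FleetUsageShedder:
    """Illustrative load shedder that reserves capacity for critical requests."""

    def __init__(self, capacity, reserved_percent=20):
        self.capacity = capacity
        # Non-critical traffic may use at most this many slots.
        self.non_critical_limit = capacity * (100 - reserved_percent) // 100
        self.in_flight = 0
        self.non_critical_in_flight = 0

    def try_acquire(self, critical):
        """Return True if the request may proceed, False if it is shed."""
        if self.in_flight >= self.capacity:
            return False  # the fleet is completely full
        if not critical and self.non_critical_in_flight >= self.non_critical_limit:
            return False  # non-critical allocation is used up
        self.in_flight += 1
        if not critical:
            self.non_critical_in_flight += 1
        return True

    def release(self, critical):
        """Call when a previously admitted request finishes."""
        self.in_flight -= 1
        if not critical:
            self.non_critical_in_flight -= 1
```

With `capacity=10`, the ninth simultaneous non-critical request is rejected, but two critical requests can still be admitted.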

Why this matters:

A single rate-limit layer is not enough for an API that processes real payments
Separate layers attack different abuse patterns: bursts, parallelism, sustained overload, and crisis
The end customer may experience "silent slowness" rather than explicit errors

The most common algorithm: token bucket

The most widely used rate-limit algorithm in the industry, and the one Stripe says it uses, is the token bucket. In Stripe's own words: "We use the token bucket algorithm to do rate limiting."

The concept is simple. Picture a bucket that fills with tokens at a steady rate (say, one token per second) up to a maximum. Every incoming request spends a token. If the bucket is empty, the request is rejected with a 429. If it has tokens, the request passes and one token is consumed.
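The bucket described above can be sketched in a few lines of Python. This is a single-process illustration under assumptions of my own: the class name `TokenBucket`, the `rate` and `capacity` parameters, the injectable `clock`, and the choice to start with a full bucket are not taken from Stripe's post.

```python
import time


class TokenBucket:
    """Single-process token bucket: refills at `rate` tokens per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum tokens the bucket can hold
        self.clock = clock          # injectable so the bucket can be tested
        self.tokens = float(capacity)  # assumption: the bucket starts full
        self.updated = clock()

    def allow(self):
        """Spend one token if available; False means "respond with 429"."""
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Refill for the time that has passed, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A server would typically keep one bucket per client or API key and answer with HTTP 429 whenever `allow()` returns False. Because idle time refills the bucket up to `capacity`, a client that has been quiet can send a short burst before being limited.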

The advantage of token bucket over other algorithms is that it allows bursts. If your app made no requests for thirty seconds, at one token per second you accumulated thirty tokens, provided the bucket's maximum is at least that large. When a real spike arrives (a customer opening checkout, a promotion firing simultaneous purchases), those stored tokens let you push through the burst without getting blocked.

Stripe also describes that its system "added the ability to briefly burst above the cap for sudden spikes in usage during real-time events (e.g. a flash sale)." In other words, serious APIs explicitly design for legitimate traffic spikes so they do not block you.

Stripe also reveals the implementation detail: "We implement our rate limiters using Redis." Redis is an in-memory data store, fast enough to keep token counts consistent across many servers with very low latency, which is critical when you receive millions of requests per hour.

How this affects you as a business owner

So far this sounds like an engineering problem. It is not. These are the scenarios where silent rate limiting costs you real money:

Checkout that half-loads at peak hours. If your store integrates a payment gateway, a shipping API, and a tax service, all three apply rate limits. On payday Friday or Black Friday, one of those three will saturate first. If your developer did not implement Retry-After-aware retries, the customer sees a generic error — and abandons the cart.

AI chatbot or assistant that "stutters" during a promotion. Most AI APIs apply rate limits that vary by usage tier. When a campaign drives simultaneous conversations, responses can start taking 30, 60, 90 seconds. The customer closes the tab. No one tells you; conversion just drops.

Inventory or ERP integrations that fail every Monday. Sync jobs scheduled for the same time — usually start of business — saturate external APIs and get 429s. If the system does not retry with exponential backoff, your data stays stale all day.

Paid ads with poorly loaded pixels. If your site fires many events to the Meta or Google Ads pixel on a slow page, the browser can drop requests before sending them. You lose conversion signal and the ad algorithm optimizes against incomplete data.

The common pattern: the end customer does not see a technical error. They see "the site is slow" or "the pay button didn't load." They blame your brand, not someone else's API. And next time they buy elsewhere.

How to detect rate-limit problems in your business

There are signals a business owner can check without being technical:

Ask your team or agency to review the last 30 days of server and integration logs, searching for 429 codes. If they appear more than a handful of times a day, there is a saturation pattern your customers are experiencing. Have them show you the count by day and by hour.

Measure the response times of your critical integrations: payment gateway, internal search, chatbot, "add to cart." If any exceeds 2 seconds at peak hours, you are probably already losing customers even if you do not see explicit errors.

Audit your Core Web Vitals in Google Search Console. LCP (Largest Contentful Paint) and INP (Interaction to Next Paint) tend to worsen when integrations saturate under load, so they are a useful indirect signal. Google also uses Core Web Vitals as a ranking factor, so technical friction can become lower SEO visibility at the same time.

Ask your team whether your integrations implement retry with exponential backoff when they receive a 429. If the answer is "no" or "I don't know," you have an obvious improvement that can recover lost revenue without spending a peso on marketing.
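For the log review mentioned above, the count of 429s by day and hour can be produced with a short script. The Python sketch below assumes log lines in the common access-log format, with a bracketed timestamp followed by the quoted request and the status code. Your logs may look different, so the `LOG_LINE` pattern and the function name `count_429_by_hour` are illustrative and would need adapting.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed line shape (common access-log format):
# 203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET /api HTTP/1.1" 429 512
LOG_LINE = re.compile(r'\[(?P<ts>[^\]]+)\] "[^"]*" (?P<status>\d{3}) ')


def count_429_by_hour(lines):
    """Return a Counter mapping "YYYY-MM-DD HH:00" to the number of 429s."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match or match.group("status") != "429":
            continue  # skip lines that do not parse or are not 429s
        ts = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        counts[ts.strftime("%Y-%m-%d %H:00")] += 1
    return counts
```

Passing an open log file to `count_429_by_hour` and sorting the result by key gives the day-and-hour table a business owner can ask for.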

"The 429 error is not a business error — it is a message from the server saying 'your app isn't handling bursts well.' The difference between apps that scale and apps that break is in how they respond to the limit, not in whether they ever hit it."
- Diego Medina F, Founder of MerchandisePROS

What This Means for Your Business

If your website is the sales engine — your online store, your lead-capture site, your appointment portal — then silent rate limits are one of the most expensive and least visible sources of lost revenue. They do not show up in Google Analytics. Your agency does not report them. They only surface in metrics that are already bleeding: lower conversion, lower time on site, higher bounce, worse Google ranking.

MerchandisePROS Website Consulting is built exactly for this. We audit your site's complete technical experience: response times for every integration, 429 error handling in logs, Core Web Vitals under real load, and saturation patterns during peak hours. The deliverable is concrete: a report with each issue detected, its estimated impact on conversion, and the technical correction your team or agency must implement. We do not sell you code — we give you the map so whoever already writes your code knows exactly what to fix.

Start with the free 60-second audit. Find out immediately whether your site has hidden technical friction costing you sales.

Frequently Asked Questions

What is rate limiting?

Rate limiting is a mechanism where a server restricts how many requests a client can make in a given time window. When the limit is exceeded, the server responds with HTTP 429 Too Many Requests, which MDN defines as the code indicating "the client has sent too many requests in a given amount of time."

What does the HTTP 429 error mean?

HTTP 429 Too Many Requests is a client error code indicating that too many requests were sent in a given time window. The response typically includes a Retry-After header telling the client how long to wait before retrying, given as a number of seconds or as a date. It is defined in IETF RFC 6585.

Why does my app silently slow down during high traffic?

Many APIs use a "load shedder" that rejects low-priority requests when the system is under pressure. The API does return an error to the calling app, but if the app does not handle it, the end user never sees a clear message. Stripe, for example, reserves a fraction of its infrastructure for critical requests and rejects non-critical requests that exceed their allocation. The end customer experience: half-loaded pages, partial responses, or long waits with no visible error.

What algorithm do serious APIs use to rate-limit requests?

The most common one is the token bucket. Stripe publicly confirmed it on its engineering blog: "We use the token bucket algorithm to do rate limiting." It allows short bursts of traffic (consuming accumulated tokens) and then refills at a steady rate. Stripe implements it using Redis to coordinate across servers.

How do I know if my business is being rate-limited by an API?

Common signs: 429 errors in server logs, sudden slowdowns in checkout or search during peak hours, AI features that "think" for more than 30 seconds, and integrations that fail Monday morning when every shop logs in at once. A technical audit catches these failures before a customer feels them.