
How to rate limit AI features and avoid surprise costs

As AI-powered chat features and applications become increasingly common, it’s worth considering the risks. Whether you’re proxying calls to OpenAI, Anthropic, or another cloud-based LLM provider, a single user session can trigger dozens of inference requests in agent-driven workflows. And the consequences can be costly.

We’ve seen this go wrong in very ordinary ways. A single chat session kicks off an agent loop. That loop fans out into multiple inference calls. Suddenly what looked like “one request” turns into dozens. Multiply that by a few curious users and you’re staring at a bill you didn’t plan for.

This guide covers your options for rate limiting on Netlify to prevent those runaway bills and abuse.

Rate limiting is the process of capping the number of requests a client can make to your application within a time window. It’s a great tool for mitigating abuse of any endpoint and ensuring resources are not overwhelmed. If a user exceeds the limit, they’re blocked until the time window resets.
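
Under the hood, most rate limiters boil down to counting requests per client per window. The sketch below is purely illustrative, not Netlify’s implementation: a naive fixed-window counter in TypeScript, with the in-memory Map and clientKey parameter assumed for demonstration.

// Illustrative only: a naive in-memory fixed-window rate limiter.
type Window = { count: number; resetAt: number };

const windows = new Map<string, Window>(); // keyed by client, e.g. IP address

export function isAllowed(clientKey: string, limit = 20, windowMs = 60_000): boolean {
  const now = Date.now();
  const current = windows.get(clientKey);

  // Start a new window if none exists or the previous one has expired
  if (!current || now >= current.resetAt) {
    windows.set(clientKey, { count: 1, resetAt: now + windowMs });
    return true;
  }

  // Block once the client has used up this window's quota
  if (current.count >= limit) {
    return false;
  }

  current.count += 1;
  return true;
}

Netlify’s platform-level rules described below do this bookkeeping for you, enforced at the platform edge rather than inside a single process.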

Why AI endpoints need rate limiting

Traditional web endpoints serve static assets or make quick database queries. These simple responses are easy to cache and fairly cheap to serve, so even when traffic increases, your out-of-pocket costs remain predictable.

But AI endpoints are different. Every request consumes an indeterminate number of tokens, and those tokens add up quickly. Most major LLM providers price by the number of tokens consumed, and since model output is non-deterministic by nature, accurately forecasting costs is a major challenge. You might see usage spikes simply because a user pasted in a large document, or because an agent retried the same prompt over and over with minor variations. Those spikes won’t show up in your averages, but they will absolutely show up on your bill.
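
To make the cost math concrete, here’s a rough back-of-the-envelope sketch. The per-token prices are placeholders, not any provider’s actual rates; plug in your provider’s published pricing.

// Back-of-the-envelope cost estimate with placeholder pricing.
// Swap in your provider's actual per-token rates.
const PRICE_PER_1K_INPUT_TOKENS = 0.0005;  // USD, hypothetical
const PRICE_PER_1K_OUTPUT_TOKENS = 0.0015; // USD, hypothetical

function estimateCost(inputTokens: number, outputTokens: number): number {
  return (
    (inputTokens / 1000) * PRICE_PER_1K_INPUT_TOKENS +
    (outputTokens / 1000) * PRICE_PER_1K_OUTPUT_TOKENS
  );
}

// One pasted document can dwarf a normal chat turn:
console.log(estimateCost(500, 400).toFixed(4));      // typical chat message
console.log(estimateCost(50_000, 1_000).toFixed(4)); // user pastes a large document

// And an agent loop multiplies whatever the single-call cost is:
const callsPerSession = 30;
console.log((estimateCost(4_000, 800) * callsPerSession).toFixed(2));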

Latency is another unpredictable factor. LLM inference can take anywhere from 500ms to 30+ seconds. Without limits, a traffic spike can queue requests faster than they complete, causing timeouts across the board.

As bad as that sounds, imagine a malicious user whose goal is to simply impact business operations. They don’t need to orchestrate a DDoS attack anymore. They just need to trigger a lot of expensive inference calls and if your application goes offline, all the better. A simple script hammering your /api/chat endpoint could rack up thousands in unexpected charges before you even notice.

Understanding Netlify’s rate limiting options

Netlify offers two approaches to rate limiting:

  • Code-based rules work on all plans. You define limits directly in your function configuration, and they deploy with your code. This is what we’ll focus on below.
  • UI-based rules are available on Enterprise plans. These offer advanced targeting (by IP range, geolocation, headers) and team-wide policy enforcement.

Both approaches let you set request limits per time window, choose between blocking (returning a 429) or rewriting to a custom error page, and aggregate by IP address, domain, or both.

Project setup

Let’s build a rate-limited chat endpoint from scratch. First, make sure you’ve installed Node.js and the Netlify CLI.

Start with a fresh npm project:

$ mkdir ai-rate-limit-demo
$ cd ai-rate-limit-demo
$ npm init -y

Create a new Netlify project:

$ ntl projects:create --name ai-rate-limit-demo

Create a new function:

$ ntl functions:create --name chat

Building the chat endpoint

Here’s a basic serverless function that proxies requests to OpenAI via Netlify’s AI Gateway. The AI Gateway automatically injects API keys and handles routing, so your function stays clean:

netlify/functions/chat.ts
import type { Config, Context } from "@netlify/functions";
import OpenAI from "openai";

// No API key needed. AI Gateway provides it automatically
const openai = new OpenAI();

export default async (request: Request, context: Context) => {
  if (request.method !== "POST") {
    return new Response("Method not allowed", { status: 405 });
  }

  try {
    let body: { message?: string };
    try {
      body = await request.json();
    } catch (parseError) {
      return new Response(
        JSON.stringify({ error: "Invalid JSON in request body" }),
        {
          status: 400,
          headers: { "Content-Type": "application/json" },
        }
      );
    }

    const { message } = body;
    if (!message || typeof message !== "string") {
      return new Response(JSON.stringify({ error: "Message is required" }), {
        status: 400,
        headers: { "Content-Type": "application/json" },
      });
    }

    const completion = await openai.chat.completions.create({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: message }],
      max_tokens: 500,
    });

    return new Response(
      JSON.stringify({
        response: completion.choices[0]?.message?.content,
      }),
      { headers: { "Content-Type": "application/json" } }
    );
  } catch (error) {
    console.error("Chat error:", error);
    return new Response(
      JSON.stringify({ error: "Failed to process request" }),
      { status: 500, headers: { "Content-Type": "application/json" } }
    );
  }
};

export const config: Config = {
  path: "/api/chat",
};

Install the OpenAI SDK and the TypeScript utilities for Netlify Functions:

$ npm install openai @netlify/functions

Start the dev server:

$ ntl dev

Test the function:

$ curl -XPOST http://localhost:8888/api/chat \
-H "Content-Type: application/json" \
-d '{"message":"Hello, how are you?"}'
Request from ::1: POST /api/chat
Response with status 200 in 1923 ms.
{"response":"Hello! I'm just a computer program, so I don't have feelings, but I'm here and ready to help you. How can I assist you today?"}

This works, but it’s completely unprotected. Anyone can hit this endpoint as fast as their connection allows.

Adding basic rate limiting

Now let’s add some rate limiting. Modify the config export to include a rateLimit block:

export const config: Config = {
  path: "/api/chat",
  rateLimit: {
    windowLimit: 20,
    windowSize: 60,
    aggregateBy: ["ip", "domain"],
  },
};

This configuration:

  • Allows 20 requests per 60 seconds per IP address
  • Automatically returns HTTP 429 when the limit is exceeded
  • Counts requests per unique combination of IP and your domain

The windowSize can be set between 1 and 180 seconds. The aggregateBy array determines how requests are grouped: ["ip", "domain"] means each visitor gets their own quota. Enterprise users with High-Performance Edge can pool requests across all visitors by specifying ["domain"] alone.
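
For example, on an Enterprise plan with High-Performance Edge, one shared quota for the whole site might look like the snippet below. The numbers are placeholders; tune them to your own traffic.

// Enterprise / High-Performance Edge only: one pooled quota across all visitors.
export const config: Config = {
  path: "/api/chat",
  rateLimit: {
    windowLimit: 1000, // total requests across all visitors, placeholder value
    windowSize: 60,
    aggregateBy: ["domain"],
  },
};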

Now that we’ve got our configuration in place, let’s deploy and test our rate-limited function:

$ netlify deploy --prod

Once that’s complete, you can verify rate limiting works by hitting the API in quick succession. For example:

for i in {1..25}; do
  curl -s -o /dev/null -w "%{http_code}\n" \
    -X POST https://ai-rate-limit-demo.netlify.app/api/chat \
    -H "Content-Type: application/json" \
    -d '{"message": "Hello"}'
done

You should see 200 responses followed by 429 responses once you exceed 20 requests.

Customizing the rate limit response

A bare 429 response isn’t very user-friendly, so let’s add a custom error page that tells users when they can retry.

First, create the following file in your publish directory (this example assumes it’s the public/ folder):

public/rate-limited.html
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="UTF-8">
    <title>Slow down!</title>
  </head>
  <body>
    <h1>You're sending requests too fast</h1>
    <p>To keep this service running smoothly for everyone, we limit how many requests you can make. Please wait 60 seconds and try again.</p>
  </body>
</html>

Update your function config to rewrite to this page instead of returning 429:

export const config: Config = {
  path: "/api/chat",
  rateLimit: {
    action: "rewrite",
    to: "/rate-limited.html",
    windowLimit: 20,
    windowSize: 60,
    aggregateBy: ["ip", "domain"],
  },
};

Now users who hit the limit get a clear explanation of what happened and when they can try again.

Rate limiting for different use cases

The right limits depend on your application. Here are configurations for common scenarios:

1. Public chatbot

For a public-facing chatbot, where you expect casual usage, you’ll want to provide fairly generous limits:

rateLimit: {
  windowLimit: 30, // 30 requests
  windowSize: 60,  // per minute
  aggregateBy: ["ip", "domain"],
}

This allows roughly one request every 2 seconds per user—plenty for conversational interactions.

2. API with authenticated users

If your endpoint requires authentication, you might want higher limits for legitimate users while still protecting against abuse:

rateLimit: {
  windowLimit: 100, // 100 requests
  windowSize: 60,   // per minute
  aggregateBy: ["ip", "domain"],
}

Consider implementing tiered limits based on user roles in your application logic.
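
The platform-level rateLimit config is static, so role-based tiers have to live in your own code. Here’s one possible sketch, assuming you resolve the user’s plan from their auth token and track usage in your own store; the resolvePlan and getUsageCount helpers are hypothetical stubs, not a Netlify API.

// Sketch of tiered limits layered on top of the platform rateLimit rule.
// resolvePlan() and getUsageCount() are hypothetical stubs; wire them up to
// your real auth and usage store (e.g. Netlify Blobs or an external cache).
type Plan = { userId: string; tier: "free" | "pro" | "enterprise" };

const PLAN_LIMITS: Record<Plan["tier"], number> = {
  free: 20,        // requests per minute
  pro: 100,
  enterprise: 500,
};

// Hypothetical helpers: replace with your own implementations.
declare function resolvePlan(authHeader: string | null): Promise<Plan>;
declare function getUsageCount(userId: string): Promise<number>;

// Returns a 429 response if the user is over their tier's limit, or null to continue.
export async function enforceTieredLimit(request: Request): Promise<Response | null> {
  const plan = await resolvePlan(request.headers.get("authorization"));
  const used = await getUsageCount(plan.userId);

  if (used >= PLAN_LIMITS[plan.tier]) {
    return new Response(
      JSON.stringify({ error: "Plan limit reached. Try again shortly." }),
      {
        status: 429,
        headers: { "Content-Type": "application/json", "Retry-After": "60" },
      }
    );
  }

  return null; // within limits; proceed to the model call
}

Call it at the top of your handler and return its response if it’s non-null; otherwise continue to the inference call.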

3. High-cost model endpoint

For endpoints using expensive models (GPT-4, Claude Opus), you might want to be more conservative:

rateLimit: {
  windowLimit: 10, // 10 requests
  windowSize: 60,  // per minute
  aggregateBy: ["ip", "domain"],
}

Using Edge Functions for lower latency

For AI endpoints where you want rate limiting decisions made as close to the user as possible, consider Edge Functions. They run on Deno at the network edge and support the same rateLimit config syntax:

netlify/edge-functions/chat.ts
import type { Config, Context } from "@netlify/edge-functions";

export default async (request: Request, context: Context) => {
  // Edge Functions run on Deno, but npm packages work through bundling
  // The AI Gateway injects environment variables here too
  const OPENAI_API_KEY = Netlify.env.get("OPENAI_API_KEY");
  const OPENAI_BASE_URL = Netlify.env.get("OPENAI_BASE_URL");

  const body = await request.json();

  const response = await fetch(`${OPENAI_BASE_URL}/v1/chat/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: body.message }],
      max_tokens: 500,
    }),
  });

  return response;
};

export const config: Config = {
  path: "/api/edge-chat",
  rateLimit: {
    windowLimit: 20,
    windowSize: 60,
    aggregateBy: ["ip", "domain"],
  },
};

Edge Functions apply rate limits before the request reaches your function code, reducing wasted compute on rejected requests. Note that Edge Functions have a 50ms CPU execution limit (though network wait time doesn’t count), making them ideal for lightweight proxying rather than complex processing.

Rate limiting proxied external APIs

Sometimes you’re not running a serverless function—you’re proxying directly to an external API. You can rate limit these through redirects in netlify.toml:

[[redirects]]
from = "/api/external-ai"
to = "https://api.example.com/inference"
status = 200
force = true
[redirects.rate_limit]
window_limit = 50
window_size = 60
aggregate_by = ["ip", "domain"]

This protects the external API from abuse through your domain without writing any function code.

Protecting your whole site

If you’re building an AI-first application where most routes involve inference, you might want a blanket rate limit:

[[redirects]]
from = "/*"
to = "/:splat"
[redirects.rate_limit]
action = "rewrite"
to = "/rate-limited.html"
window_limit = 100
window_size = 60
aggregate_by = ["ip", "domain"]

This catches everything but still allows reasonable usage. Adjust the window_limit based on your application’s needs.

Handling rate limits gracefully in your frontend

Your frontend should anticipate 429 responses and handle them gracefully:

async function sendMessage(message) {
  try {
    const response = await fetch('/api/chat', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ message }),
    });

    if (response.status === 429) {
      // Rate limited
      const retryAfter = response.headers.get('Retry-After') || 60;
      showNotification(`Too many requests. Please wait ${retryAfter} seconds.`);
      return null;
    }

    if (!response.ok) {
      throw new Error(`HTTP ${response.status}`);
    }

    return await response.json();
  } catch (error) {
    console.error('Request failed:', error);
    showNotification('Something went wrong. Please try again.');
    return null;
  }
}

For a better UX, consider implementing exponential backoff for retries:

async function sendWithRetry(message, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const response = await fetch('/api/chat', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ message }),
    });

    if (response.status !== 429) {
      return response;
    }

    // Exponential backoff: 1s, 2s, 4s
    const delay = Math.pow(2, attempt) * 1000;
    await new Promise(resolve => setTimeout(resolve, delay));
  }

  throw new Error('Rate limit exceeded after retries');
}

Monitoring and tuning your limits

Rate limits aren’t set-and-forget. You need visibility into how they’re performing.

Check deploy logs

Netlify validates rate limit rules during post-processing. Check your deploy logs to confirm rules are applied:

Post-processing - Rate limiting rules applied:
/api/chat: 20 requests per 60 seconds per IP

Track 429 responses

This is a great opportunity to check out Netlify Observability, which surfaces aggregate response codes in a single, cohesive view of all your site traffic.

If necessary, add some logging to your function to provide additional detail. These values will be printed to your function logs:

export default async (request: Request, context: Context) => {
  // Log request metadata for analysis
  console.log(JSON.stringify({
    timestamp: new Date().toISOString(),
    path: new URL(request.url).pathname,
    ip: context.ip,
    userAgent: request.headers.get("user-agent"),
  }));

  // ... rest of your function
};

Adjust based on real usage

Once you’ve got a baseline of your average usage, you can dial in the limits to properly suit your application’s needs.

Consider the following strategies when evaluating your requirements (a quick way to estimate your 429 rate from logged response counts is sketched after this list):

  1. High 429 rate (>5%): Your limits might be too tight for legitimate usage. Consider increasing windowLimit or windowSize.
  2. Low 429 rate (<0.1%): Your limits might be too loose to catch abuse. Consider tightening, especially if you see cost spikes.
  3. Latency spikes: If you see p99 latency increasing, traffic might be overwhelming your backend despite staying under limits. Consider lowering limits or adding a global cap.
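
As a rough way to apply these thresholds, tally response codes over a window and compute the 429 share. A minimal sketch, assuming you’ve already aggregated counts from your function logs or Observability data:

// Classify a window of traffic by its 429 share, using the thresholds above.
// statusCounts is assumed to come from your own log aggregation.
function classify429Rate(statusCounts: Record<number, number>): string {
  const total = Object.values(statusCounts).reduce((sum, n) => sum + n, 0);
  const rate = total === 0 ? 0 : (statusCounts[429] ?? 0) / total;

  if (rate > 0.05) return "High 429 rate: limits may be too tight";
  if (rate < 0.001) return "Low 429 rate: limits may be too loose to catch abuse";
  return "429 rate looks healthy";
}

// Example: 1,000 requests in the last hour, 12 of them rate limited (1.2%)
console.log(classify429Rate({ 200: 970, 429: 12, 500: 18 }));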

Complete example

Here’s a production-ready chat function that combines everything we’ve covered:

netlify/functions/chat.ts
import type { Config, Context } from "@netlify/functions";
import OpenAI from "openai";

const openai = new OpenAI();

interface ChatRequest {
  message: string;
  conversationId?: string;
}

export default async (request: Request, context: Context) => {
  // Only allow POST
  if (request.method !== "POST") {
    return new Response(
      JSON.stringify({ error: "Method not allowed" }),
      { status: 405, headers: { "Content-Type": "application/json" } }
    );
  }

  // Log for monitoring
  console.log(JSON.stringify({
    event: "chat_request",
    timestamp: new Date().toISOString(),
    ip: context.ip,
  }));

  try {
    const body: ChatRequest = await request.json();

    // Validate input
    if (!body.message || typeof body.message !== "string") {
      return new Response(
        JSON.stringify({ error: "Message is required" }),
        { status: 400, headers: { "Content-Type": "application/json" } }
      );
    }

    // Limit message length to control token usage.
    // This number is intentionally conservative:
    // long prompts are the fastest way to blow up costs.
    if (body.message.length > 2000) {
      return new Response(
        JSON.stringify({ error: "Message too long (max 2000 characters)" }),
        { status: 400, headers: { "Content-Type": "application/json" } }
      );
    }

    const startTime = Date.now();

    const completion = await openai.chat.completions.create({
      model: "gpt-4o-mini",
      messages: [
        {
          role: "system",
          content: "You are a helpful assistant. Be concise.",
        },
        { role: "user", content: body.message },
      ],
      max_tokens: 500,
    });

    const duration = Date.now() - startTime;

    // Log completion for monitoring
    console.log(JSON.stringify({
      event: "chat_complete",
      duration,
      tokens: completion.usage?.total_tokens,
    }));

    return new Response(
      JSON.stringify({
        response: completion.choices[0]?.message?.content,
        usage: {
          tokens: completion.usage?.total_tokens,
        },
      }),
      {
        headers: {
          "Content-Type": "application/json",
          "Cache-Control": "no-store",
        },
      }
    );
  } catch (error) {
    console.error("Chat error:", error);

    // Check if it's a rate limit from OpenAI
    if (error instanceof Error && error.message.includes("rate limit")) {
      return new Response(
        JSON.stringify({ error: "Service temporarily unavailable" }),
        {
          status: 503,
          headers: {
            "Content-Type": "application/json",
            "Retry-After": "30",
          },
        }
      );
    }

    return new Response(
      JSON.stringify({ error: "Failed to process request" }),
      { status: 500, headers: { "Content-Type": "application/json" } }
    );
  }
};

export const config: Config = {
  path: "/api/chat",
  rateLimit: {
    action: "rewrite",
    to: "/rate-limited.html",
    windowLimit: 20,
    windowSize: 60,
    aggregateBy: ["ip", "domain"],
  },
};

Enterprise features

As mentioned above, Enterprise plans with High-Performance Edge unlock additional capabilities such as more rules per project and an admin UI for creating and managing rules without code deployments. This is especially handy for responding to abuse patterns in real time.

You also get access to advanced targeting features that let you rate limit by IP range, geolocation, request headers (useful for targeting specific user agents), or cookies (useful for fine-tuning limits based on a user’s session).

Additional benefits include team-wide policies and per-domain aggregation, which lets you define aggregate rules that apply across all users.

Summary

Rate limiting AI endpoints isn’t a nice-to-have. The cost and latency characteristics of LLM inference mean that unprotected endpoints are both expensive and fragile. With these strategies under your belt, implementing reasonable constraints is easier than ever.

Netlify’s code-based rate limiting gives you:

  • Protection against abuse: Per-IP limits prevent any single actor from monopolizing resources
  • Cost control: Hard caps prevent runaway spending from loops or attacks
  • Better UX: Custom error pages and proper HTTP status codes help users understand what’s happening
  • Flexibility: Adjust limits based on endpoint, model cost, or user type

Start with conservative limits, monitor your 429 rate, and adjust based on real usage patterns. Your budget and your users will thank you.

