As AI-powered chat and applications become more and more common, it’s worth considering the risks. Whether you’re proxying calls to OpenAI, Anthropic, or some other cloud-based LLM provider, a single user session can trigger dozens of inference requests in agent-driven workflows. And the consequences can be costly.
We’ve seen this go wrong in very ordinary ways. A single chat session kicks off an agent loop. That loop fans out into multiple inference calls. Suddenly what looked like “one request” turns into dozens. Multiply that by a few curious users and you’re staring at a bill you didn’t plan for.
This guide covers options for preventing those runaway bills and abuse by using rate limiting on Netlify.
Rate limiting is the process of capping the number of requests a client can make to your application within a time window. It’s a great tool for mitigating abuse of any endpoint and ensuring resources are not overwhelmed. If a user exceeds the limit, they’re blocked until the time window resets.
Why AI endpoints need rate limiting
Traditional web endpoints serve static assets or make quick database queries. Responses like these are easy to cache and cheap to serve, so even when traffic spikes, your out-of-pocket costs remain predictable.
But AI endpoints are different. Every request consumes an indeterminate number of tokens, and that can add up quickly. Most major LLM providers price by tokens consumed, and since model output is non-deterministic by nature, accurately forecasting costs is a major challenge. You might see usage spikes simply because a user pasted in a large document, or because an agent retries the same prompt over and over with minor variations. Those spikes won’t show up in your averages, but they will absolutely show up on your bill.
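To see how quickly that compounds, here’s a rough back-of-envelope sketch. The request counts and per-token prices below are hypothetical placeholders, not any provider’s published pricing; plug in your own numbers.

```typescript
// Back-of-envelope cost estimate for a single agent-driven session.
// All figures are hypothetical; substitute your provider's actual pricing.
const requestsPerSession = 30;         // agent loop fan-out per chat session
const inputTokensPerRequest = 5_000;   // e.g. a pasted document plus history
const outputTokensPerRequest = 500;

const inputPricePerMillionTokens = 1.0;  // hypothetical $ per 1M input tokens
const outputPricePerMillionTokens = 4.0; // hypothetical $ per 1M output tokens

const costPerSession =
  (requestsPerSession * inputTokensPerRequest * inputPricePerMillionTokens +
    requestsPerSession * outputTokensPerRequest * outputPricePerMillionTokens) /
  1_000_000;

console.log(`~$${costPerSession.toFixed(2)} per session`);              // ~$0.21
console.log(`~$${(costPerSession * 500).toFixed(2)} for 500 sessions`); // ~$105.00
```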
Latency is another unpredictable factor. LLM inference can take anywhere from 500ms to 30+ seconds. Without limits, a traffic spike can queue requests faster than they complete, causing timeouts across the board.
As bad as that sounds, imagine a malicious user whose goal is simply to disrupt business operations. They don’t need to orchestrate a DDoS attack anymore; they just need to trigger a lot of expensive inference calls, and if your application goes offline in the process, all the better. A simple script hammering your /api/chat endpoint could rack up thousands in unexpected charges before you even notice.
Understanding Netlify’s rate limiting options
Netlify offers two approaches to rate limiting:
- Code-based rules work on all plans. You define limits directly in your function configuration, and they deploy with your code. This is what we’ll focus on below.
- UI-based rules are available on Enterprise plans. These offer advanced targeting (by IP range, geolocation, headers) and team-wide policy enforcement.
Both approaches let you set request limits per time window, choose between blocking (returning a 429) or rewriting to a custom error page, and aggregate by IP address, domain, or both.
Project setup
Let’s build a rate-limited chat endpoint from scratch. First, make sure you have Node.js and the Netlify CLI installed.
Start with a fresh NPM project:
```
$ mkdir ai-rate-limit-demo
$ cd ai-rate-limit-demo
$ npm init -y
```

Create a new Netlify project:
```
$ ntl projects:create --name ai-rate-limit-demo
```

Create a new function:
```
$ ntl functions:create --name chat
```

Building the chat endpoint
Here’s a basic serverless function that proxies requests to OpenAI via Netlify’s AI Gateway. The AI Gateway automatically injects API keys and handles routing, so your function stays clean:
```typescript
import type { Config, Context } from "@netlify/functions";
import OpenAI from "openai";

const openai = new OpenAI(); // No API key needed. AI Gateway provides it automatically

export default async (request: Request, context: Context) => {
  if (request.method !== "POST") {
    return new Response("Method not allowed", { status: 405 });
  }

  try {
    let body: { message?: string };
    try {
      body = await request.json();
    } catch (parseError) {
      return new Response(
        JSON.stringify({ error: "Invalid JSON in request body" }),
        {
          status: 400,
          headers: { "Content-Type": "application/json" },
        }
      );
    }

    const { message } = body;

    if (!message || typeof message !== "string") {
      return new Response(JSON.stringify({ error: "Message is required" }), {
        status: 400,
        headers: { "Content-Type": "application/json" },
      });
    }

    const completion = await openai.chat.completions.create({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: message }],
      max_tokens: 500,
    });

    return new Response(
      JSON.stringify({
        response: completion.choices[0]?.message?.content,
      }),
      { headers: { "Content-Type": "application/json" } }
    );
  } catch (error) {
    console.error("Chat error:", error);
    return new Response(
      JSON.stringify({ error: "Failed to process request" }),
      { status: 500, headers: { "Content-Type": "application/json" } }
    );
  }
};

export const config: Config = {
  path: "/api/chat",
};
```

Install the OpenAI SDK and some TypeScript utilities for Netlify Functions:
```
$ npm install openai @netlify/functions
```

Start the dev server:
```
$ ntl dev
```

Test the function:
```
$ curl -XPOST http://localhost:8888/api/chat \
  -H "Content-Type: application/json" \
  -d '{"message":"Hello, how are you?"}'

Request from ::1: POST /api/chat
Response with status 200 in 1923 ms.
{"response":"Hello! I'm just a computer program, so I don't have feelings, but I'm here and ready to help you. How can I assist you today?"}
```

This works, but it’s completely unprotected. Anyone can hit this endpoint as fast as their connection allows.
Adding basic rate limiting
Now let’s add some rate limiting. Modify the config export to include a rateLimit block:
```typescript
export const config: Config = {
  path: "/api/chat",
  rateLimit: {
    windowLimit: 20,
    windowSize: 60,
    aggregateBy: ["ip", "domain"],
  },
};
```

This configuration:
- Allows 20 requests per 60 seconds per IP address
- Automatically returns HTTP 429 when the limit is exceeded
- Counts requests per unique combination of IP and your domain
The windowSize can be set between 1 and 180 seconds. The aggregateBy array determines how requests are grouped: ["ip", "domain"] means each visitor gets their own quota. Enterprise users with High-Performance Edge can pool requests across all visitors by specifying ["domain"] alone.
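As an illustration of the difference, a pooled quota for a hypothetical high-cost endpoint might look like this (the path and numbers are made up for the example):

```typescript
import type { Config } from "@netlify/functions";

// Pooled, site-wide quota (requires Enterprise with High-Performance Edge).
// Every visitor draws from the same 500-requests-per-minute budget.
export const config: Config = {
  path: "/api/expensive-model", // hypothetical endpoint, for illustration only
  rateLimit: {
    windowLimit: 500,
    windowSize: 60,
    aggregateBy: ["domain"], // no "ip": requests are counted per domain only
  },
};
```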
Now that we’ve got our configuration in place, let’s deploy and test our rate-limited function:
```
$ netlify deploy --prod
```

Once that’s complete, you can verify rate limiting works by hitting the API in quick succession. For example:
```
for i in {1..25}; do
  curl -s -o /dev/null -w "%{http_code}\n" \
    -X POST https://ai-rate-limit-demo.netlify.app/api/chat \
    -H "Content-Type: application/json" \
    -d '{"message": "Hello"}'
done
```

You should see 200 responses followed by 429 responses once you exceed 20 requests.
Customizing the rate limit response
A bare 429 response isn’t very user friendly, so let’s add a custom error page that tells users when they can retry.
First, create a rate-limited.html file in your publish directory (assumed here to be the public/ folder):
```html
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Slow down!</title>
</head>
<body>
  <h1>You're sending requests too fast</h1>
  <p>To keep this service running smoothly for everyone, we limit how many requests you can make. Please wait 60 seconds and try again.</p>
</body>
</html>
```

Update your function config to rewrite to this page instead of returning 429:
```typescript
export const config: Config = {
  path: "/api/chat",
  rateLimit: {
    action: "rewrite",
    to: "/rate-limited.html",
    windowLimit: 20,
    windowSize: 60,
    aggregateBy: ["ip", "domain"],
  },
};
```

Now users will understand why their requests were blocked and when they can try again.
Rate limiting for different use cases
The right limits depend on your application. Here are configurations for common scenarios:
1. Public chatbot
For a public-facing chatbot, where you expect casual usage, you’ll want to provide fairly generous limits:
```typescript
rateLimit: {
  windowLimit: 30,  // 30 requests
  windowSize: 60,   // per minute
  aggregateBy: ["ip", "domain"],
}
```

This allows roughly one request every 2 seconds per user—plenty for conversational interactions.
2. API with authenticated users
If your endpoint requires authentication, you might want higher limits for legitimate users while still protecting against abuse:
```typescript
rateLimit: {
  windowLimit: 100,  // 100 requests
  windowSize: 60,    // per minute
  aggregateBy: ["ip", "domain"],
}
```

Consider implementing tiered limits based on user roles in your application logic.
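The declarative rateLimit block applies one quota to everyone, so role-based tiers have to live in your application code. Here’s a minimal sketch with hypothetical names (TIER_LIMITS, consumeQuota); the in-memory Map is per function instance, so a real implementation would use a shared store.

```typescript
// Sketch of tiered rate limiting in application code (hypothetical names).
// The in-memory Map resets per function instance; production code should
// back this with a shared store.
type Role = "free" | "pro";

const TIER_LIMITS: Record<Role, number> = { free: 20, pro: 100 };
const WINDOW_MS = 60_000;

const usage = new Map<string, { windowStart: number; count: number }>();

function consumeQuota(userId: string, role: Role, now = Date.now()): boolean {
  const limit = TIER_LIMITS[role];
  const entry = usage.get(userId);

  // Start a fresh window on the first request or once the old window expires
  if (!entry || now - entry.windowStart >= WINDOW_MS) {
    usage.set(userId, { windowStart: now, count: 1 });
    return true;
  }

  if (entry.count >= limit) return false; // over this tier's quota
  entry.count += 1;
  return true;
}

// Inside your handler, before calling the model:
// if (!consumeQuota(user.id, user.role)) {
//   return new Response(JSON.stringify({ error: "Too many requests" }), {
//     status: 429,
//     headers: { "Content-Type": "application/json", "Retry-After": "60" },
//   });
// }
```

The platform-level rateLimit can stay in place as a hard outer cap, with the tier check refining behavior inside it.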
3. High-cost model endpoint
For endpoints using expensive models (GPT-4, Claude Opus), you might want to be more conservative:
```typescript
rateLimit: {
  windowLimit: 10,  // 10 requests
  windowSize: 60,   // per minute
  aggregateBy: ["ip", "domain"],
}
```

Using Edge Functions for lower latency
For AI endpoints where you want rate limiting decisions made as close to the user as possible, consider Edge Functions. They run on Deno at the network edge and support the same rateLimit config syntax:
```typescript
import type { Config, Context } from "@netlify/edge-functions";

export default async (request: Request, context: Context) => {
  // Edge Functions run on Deno, but npm packages work through bundling
  // The AI Gateway injects environment variables here too
  const OPENAI_API_KEY = Netlify.env.get("OPENAI_API_KEY");
  const OPENAI_BASE_URL = Netlify.env.get("OPENAI_BASE_URL");

  const body = await request.json();

  const response = await fetch(`${OPENAI_BASE_URL}/v1/chat/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: body.message }],
      max_tokens: 500,
    }),
  });

  return response;
};

export const config: Config = {
  path: "/api/edge-chat",
  rateLimit: {
    windowLimit: 20,
    windowSize: 60,
    aggregateBy: ["ip", "domain"],
  },
};
```

Edge Functions apply rate limits before the request reaches your function code, reducing wasted compute on rejected requests. Note that Edge Functions have a 50ms CPU execution limit (though network wait time doesn’t count), making them ideal for lightweight proxying rather than complex processing.
Rate limiting proxied external APIs
Sometimes you’re not running a serverless function—you’re proxying directly to an external API. You can rate limit these through redirects in netlify.toml:
```toml
[[redirects]]
  from = "/api/external-ai"
  to = "https://api.example.com/inference"
  status = 200
  force = true

  [redirects.rate_limit]
    window_limit = 50
    window_size = 60
    aggregate_by = ["ip", "domain"]
```

This protects the external API from abuse through your domain without writing any function code.
Protecting your whole site
If you’re building an AI-first application where most routes involve inference, you might want a blanket rate limit:
```toml
[[redirects]]
  from = "/*"
  to = "/:splat"

  [redirects.rate_limit]
    action = "rewrite"
    to = "/rate-limited.html"
    window_limit = 100
    window_size = 60
    aggregate_by = ["ip", "domain"]
```

This catches everything but still allows reasonable usage. Adjust the window_limit based on your application’s needs.
Handling rate limits gracefully in your frontend
Your frontend should anticipate 429 responses and handle them gracefully:
```javascript
async function sendMessage(message) {
  try {
    const response = await fetch('/api/chat', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ message }),
    });

    if (response.status === 429) {
      // Rate limited
      const retryAfter = response.headers.get('Retry-After') || 60;
      showNotification(`Too many requests. Please wait ${retryAfter} seconds.`);
      return null;
    }

    if (!response.ok) {
      throw new Error(`HTTP ${response.status}`);
    }

    return await response.json();
  } catch (error) {
    console.error('Request failed:', error);
    showNotification('Something went wrong. Please try again.');
    return null;
  }
}
```

For a better UX, consider implementing exponential backoff for retries:
```javascript
async function sendWithRetry(message, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const response = await fetch('/api/chat', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ message }),
    });

    if (response.status !== 429) {
      return response;
    }

    // Exponential backoff: 1s, 2s, 4s
    const delay = Math.pow(2, attempt) * 1000;
    await new Promise(resolve => setTimeout(resolve, delay));
  }

  throw new Error('Rate limit exceeded after retries');
}
```

Monitoring and tuning your limits
Rate limits aren’t set-and-forget. You need visibility into how they’re performing.
Check deploy logs
Netlify validates rate limit rules during post-processing. Check your deploy logs to confirm rules are applied:

```
Post-processing - Rate limiting rules applied: /api/chat: 20 requests per 60 seconds per IP
```

Track 429 responses
This is a great opportunity to check out Netlify Observability, which surfaces aggregate response codes in a single, cohesive view of all site traffic.
If necessary, add some logging to your function to provide additional detail. These values will be printed to your function logs:
```typescript
export default async (request: Request, context: Context) => {
  // Log request metadata for analysis
  console.log(JSON.stringify({
    timestamp: new Date().toISOString(),
    path: new URL(request.url).pathname,
    ip: context.ip,
    userAgent: request.headers.get("user-agent"),
  }));

  // ... rest of your function
};
```

Adjust based on real usage
Once you’ve got a baseline of your average usage, you can dial in the limits to properly suit your application’s needs.
Consider the following strategies when evaluating your requirements:
- High 429 rate (>5%): Your limits might be too tight for legitimate usage. Consider increasing windowLimit or windowSize. (A quick way to estimate your 429 rate from logged responses is sketched after this list.)
- Low 429 rate (<0.1%): Your limits might be too loose to catch abuse. Consider tightening, especially if you see cost spikes.
- Latency spikes: If you see p99 latency increasing, traffic might be overwhelming your backend despite staying under limits. Consider lowering limits or adding a global cap.
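To put numbers on the first two points, here’s a small sketch that takes a sample of logged status codes (in an assumed { status } shape, however you happen to export your logs) and reports which band the 429 share falls into:

```typescript
// Rough health check over a sample of logged responses (assumed shape).
interface LoggedResponse {
  status: number;
}

function rateLimitHealth(samples: LoggedResponse[]): string {
  if (samples.length === 0) return "No data yet";

  const limited = samples.filter((s) => s.status === 429).length;
  const share = limited / samples.length;

  if (share > 0.05) {
    return `High 429 rate (${(share * 100).toFixed(1)}%): consider loosening limits`;
  }
  if (share < 0.001) {
    return `Low 429 rate (${(share * 100).toFixed(2)}%): limits may be too loose to catch abuse`;
  }
  return `429 rate ${(share * 100).toFixed(1)}%: within the expected band`;
}

// Example: 3 rate-limited responses out of 40 sampled requests (7.5%)
console.log(
  rateLimitHealth([
    ...Array(37).fill({ status: 200 }),
    ...Array(3).fill({ status: 429 }),
  ])
);
```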
Complete example
Here’s a production-ready chat function that combines everything we’ve covered:
```typescript
import type { Config, Context } from "@netlify/functions";
import OpenAI from "openai";

const openai = new OpenAI();

interface ChatRequest {
  message: string;
  conversationId?: string;
}

export default async (request: Request, context: Context) => {
  // Only allow POST
  if (request.method !== "POST") {
    return new Response(
      JSON.stringify({ error: "Method not allowed" }),
      { status: 405, headers: { "Content-Type": "application/json" } }
    );
  }

  // Log for monitoring
  console.log(JSON.stringify({
    event: "chat_request",
    timestamp: new Date().toISOString(),
    ip: context.ip,
  }));

  try {
    const body: ChatRequest = await request.json();

    // Validate input
    if (!body.message || typeof body.message !== "string") {
      return new Response(
        JSON.stringify({ error: "Message is required" }),
        { status: 400, headers: { "Content-Type": "application/json" } }
      );
    }

    // Limit message length to control token usage
    // This number is intentionally conservative:
    // long prompts are the fastest way to blow up costs
    if (body.message.length > 2000) {
      return new Response(
        JSON.stringify({ error: "Message too long (max 2000 characters)" }),
        { status: 400, headers: { "Content-Type": "application/json" } }
      );
    }

    const startTime = Date.now();

    const completion = await openai.chat.completions.create({
      model: "gpt-4o-mini",
      messages: [
        { role: "system", content: "You are a helpful assistant. Be concise." },
        { role: "user", content: body.message }
      ],
      max_tokens: 500,
    });

    const duration = Date.now() - startTime;

    // Log completion for monitoring
    console.log(JSON.stringify({
      event: "chat_complete",
      duration,
      tokens: completion.usage?.total_tokens,
    }));

    return new Response(
      JSON.stringify({
        response: completion.choices[0]?.message?.content,
        usage: {
          tokens: completion.usage?.total_tokens,
        }
      }),
      {
        headers: {
          "Content-Type": "application/json",
          "Cache-Control": "no-store",
        }
      }
    );
  } catch (error) {
    console.error("Chat error:", error);

    // Check if it's a rate limit from OpenAI
    if (error instanceof Error && error.message.includes("rate limit")) {
      return new Response(
        JSON.stringify({ error: "Service temporarily unavailable" }),
        {
          status: 503,
          headers: {
            "Content-Type": "application/json",
            "Retry-After": "30",
          }
        }
      );
    }

    return new Response(
      JSON.stringify({ error: "Failed to process request" }),
      { status: 500, headers: { "Content-Type": "application/json" } }
    );
  }
};

export const config: Config = {
  path: "/api/chat",
  rateLimit: {
    action: "rewrite",
    to: "/rate-limited.html",
    windowLimit: 20,
    windowSize: 60,
    aggregateBy: ["ip", "domain"],
  },
};
```

Enterprise features
As mentioned above, Enterprise plans with High-Performance Edge unlock additional capabilities, such as more rules per project and an admin UI for creating and managing rules without code deployments. This is especially handy for responding to abuse patterns in real time.
You also get access to advanced targeting features that let you rate-limit based on IP range, geolocation, request headers (useful for targeting specific user agents), or cookies (useful for fine-tuning limits based on a user’s session).
Additional benefits include team-wide policies and per-domain aggregation, which lets you define aggregate rules that apply across all visitors.
Summary
Rate limiting AI endpoints isn’t a nice-to-have. The cost and latency characteristics of LLM inference mean that unprotected endpoints are both expensive and fragile. With these strategies under your belt, implementing reasonable constraints is easier than ever.
Netlify’s code-based rate limiting gives you:
- Protection against abuse: Per-IP limits prevent any single actor from monopolizing resources
- Cost control: Hard caps prevent runaway spending from loops or attacks
- Better UX: Custom error pages and proper HTTP status codes help users understand what’s happening
- Flexibility: Adjust limits based on endpoint, model cost, or user type
Start with conservative limits, monitor your 429 rate, and adjust based on real usage patterns. Your budget and your users will thank you.


