---
title: "Rate limiting AI features on Netlify to avoid surprise costs"
description: "AI chat features can quietly make many requests behind the scenes, leading to unexpected costs or slowdowns. This guide shows how to use rate limiting on Netlify to keep AI-powered apps reliable and affordable."
source: "https://www.netlify.com/blog/how-to-rate-limit-ai-features-and-avoid-surprise-costs/"
last_updated: "2026-07-05T04:57:00.000Z"
---
As AI-powered chat and applications become more and more common, it’s worth considering the risks. Whether you’re proxying calls to OpenAI, Anthropic, or some other cloud-based LLM provider, a single user session can trigger dozens of inference requests in agent-driven workflows. And the consequences can be costly.

We’ve seen this go wrong in very ordinary ways. A single chat session kicks off an agent loop. That loop fans out into multiple inference calls. Suddenly what looked like “one request” turns into dozens. Multiply that by a few curious users and you’re staring at a bill you didn’t plan for.

This guide covers your options for preventing runaway bills and abuse with rate limiting on Netlify.

> [Rate limiting](https://docs.netlify.com/manage/security/secure-access-to-sites/rate-limiting/) is the process of capping the number of requests a client can make to your application within a time window. It’s a great tool for mitigating abuse of any endpoint and ensuring resources are not overwhelmed. If a user exceeds the limit, they’re blocked until the time window resets.

## Why AI endpoints need rate limiting

Traditional web endpoints serve static assets or make quick database queries. These types of simple responses are easy to cache and fairly cheap. In times of increased traffic, your out-of-pocket costs remain predictable.

But AI endpoints are different. Every request consumes an indeterminate number of tokens which can add up quickly. Most major LLM providers are priced according to the number of tokens consumed and since their output is non-deterministic by nature, accurately forecasting costs is a major challenge. You might see usage spikes simply due to a user pasting in a large document. Or perhaps an agent retries the same prompt over and over, with minor variations. Those spikes won’t show up in your averages but they will absolutely show up on your bill.

Latency is another unpredictable factor. LLM inference can take anywhere from 500ms to 30+ seconds. Without limits, a traffic spike can queue requests faster than they complete, causing timeouts across the board.
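
A rough way to see why latency matters is Little's law: the average number of requests in flight equals the arrival rate multiplied by the average latency. The sketch below is a back-of-the-envelope estimate, not a capacity model; `estimateInFlight` and the traffic figures are made up for illustration.

```typescript
// Little's law: average requests in flight = arrival rate x average latency.
function estimateInFlight(
  requestsPerSecond: number,
  avgLatencySeconds: number,
): number {
  return requestsPerSecond * avgLatencySeconds;
}

// Same traffic, very different pressure depending on latency
// (both traffic figures are hypothetical):
const fastEndpoint = estimateInFlight(10, 0.5); // 5 requests in flight
const slowEndpoint = estimateInFlight(10, 30); // 300 requests in flight
```

At the same request rate, a 30-second inference call keeps 60 times as many requests open as a 500ms one, which is why a modest spike can exhaust concurrency on an AI endpoint.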

As bad as that sounds, imagine a malicious user whose goal is to simply impact business operations. They don’t need to orchestrate a DDoS attack anymore. They just need to trigger a lot of expensive inference calls and if your application goes offline, all the better. A simple script hammering your `/api/chat` endpoint could rack up thousands in unexpected charges before you even notice.

## Understanding Netlify’s rate limiting options

Netlify offers two approaches to rate limiting:

-   **Code-based rules** work on all plans. You define limits directly in your function configuration, and they deploy with your code. This is what we’ll focus on below.
-   **UI-based rules** are available on Enterprise plans. These offer advanced targeting (by IP range, geolocation, headers) and team-wide policy enforcement.

Both approaches let you set request limits per time window, choose between blocking (returning a `429`) or rewriting to a custom error page, and aggregate by IP address, domain, or both.
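
To make those moving parts concrete, here's a minimal in-memory sketch of a fixed-window counter. It is not Netlify's implementation (the platform enforces the real rules for you), and `createFixedWindowLimiter` and `aggregationKey` are hypothetical names. It only illustrates how a limit, a window size, and an aggregation key interact.

```typescript
// Illustrative only: a fixed-window counter keyed by an aggregation key.
// Netlify enforces real limits at the platform level; these helpers are hypothetical.
type WindowState = { windowStart: number; count: number };

function createFixedWindowLimiter(
  windowLimit: number, // max requests per window
  windowSizeSeconds: number, // window length in seconds
): (key: string, nowMs: number) => boolean {
  const windows = new Map<string, WindowState>();
  const windowMs = windowSizeSeconds * 1000;

  // Returns true if the request is allowed, false if it should be blocked.
  return (key: string, nowMs: number): boolean => {
    const state = windows.get(key);

    // Start a fresh window if none exists or the previous one has expired.
    if (!state || nowMs - state.windowStart >= windowMs) {
      windows.set(key, { windowStart: nowMs, count: 1 });
      return true;
    }

    if (state.count < windowLimit) {
      state.count += 1;
      return true;
    }

    return false; // over the limit until the window resets
  };
}

// Mirrors aggregating by IP and domain: one counter per visitor per domain.
function aggregationKey(ip: string, domain: string): string {
  return `${ip}|${domain}`;
}
```

Aggregating by domain alone would amount to using the same key for every visitor, so all traffic would share one counter.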

## Project setup

Let’s build a rate-limited chat endpoint from scratch. First make sure you’ve installed Node.js and the [Netlify CLI](https://docs.netlify.com/api-and-cli-guides/cli-guides/get-started-with-cli/).

Start with a fresh npm project:

```
$ mkdir ai-rate-limit-demo
$ cd ai-rate-limit-demo
$ npm init -y
```

Create a new Netlify project:

```
$ ntl projects:create --name ai-rate-limit-demo
```

Create a new function:

```
$ ntl functions:create --name chat
```

## Building the chat endpoint

Here’s a basic serverless function that proxies requests to OpenAI via [Netlify’s AI Gateway](https://docs.netlify.com/build/ai-gateway/overview/). The AI Gateway automatically injects API keys and handles routing, so your function stays clean:

netlify/functions/chat.ts

```
import type { Config, Context } from "@netlify/functions";
import OpenAI from "openai";

const openai = new OpenAI(); // No API key needed. AI Gateway provides it automatically

export default async (request: Request, context: Context) => {
  if (request.method !== "POST") {
    return new Response("Method not allowed", { status: 405 });
  }

  try {
    let body: { message?: string };
    try {
      body = await request.json();
    } catch (parseError) {
      return new Response(
        JSON.stringify({ error: "Invalid JSON in request body" }),
        {
          status: 400,
          headers: { "Content-Type": "application/json" },
        }
      );
    }

    const { message } = body;

    if (!message || typeof message !== "string") {
      return new Response(JSON.stringify({ error: "Message is required" }), {
        status: 400,
        headers: { "Content-Type": "application/json" },
      });
    }

    const completion = await openai.chat.completions.create({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: message }],
      max_tokens: 500,
    });

    return new Response(
      JSON.stringify({
        response: completion.choices[0]?.message?.content,
      }),
      { headers: { "Content-Type": "application/json" } }
    );
  } catch (error) {
    console.error("Chat error:", error);
    return new Response(
      JSON.stringify({ error: "Failed to process request" }),
      { status: 500, headers: { "Content-Type": "application/json" } }
    );
  }
};

export const config: Config = {
  path: "/api/chat",
};
```

Install the [OpenAI SDK](https://www.npmjs.com/package/openai) and some TypeScript utilities for [Netlify Functions](https://docs.netlify.com/build/functions/overview/):

```
$ npm install openai @netlify/functions
```

Start the dev server:

```
$ ntl dev
```

Test the function:

```
$ curl -XPOST http://localhost:8888/api/chat \
  -H "Content-Type: application/json" \
  -d '{"message":"Hello, how are you?"}'
Request from ::1: POST /api/chat
Response with status 200 in 1923 ms.
{"response":"Hello! I'm just a computer program, so I don't have feelings, but I'm here and ready to help you. How can I assist you today?"}
```

This works, but it’s completely unprotected. Anyone can hit this endpoint as fast as their connection allows.

## Adding basic rate limiting

Now let’s add some rate limiting. Modify the config export to include a `rateLimit` block:

```
export const config: Config = {
  path: "/api/chat",
  rateLimit: {
    windowLimit: 20,
    windowSize: 60,
    aggregateBy: ["ip", "domain"],
  },
};
```

This configuration:

-   Allows **20 requests per 60 seconds** per IP address
-   Automatically returns `HTTP 429` when the limit is exceeded
-   Counts requests per unique combination of IP and your domain

The `windowSize` can be set between `1` and `180` seconds. The `aggregateBy` array determines how requests are grouped: `["ip", "domain"]` means each visitor gets their own quota. Enterprise users with High-Performance Edge can pool requests across all visitors by specifying `["domain"]` alone.

Now that we’ve got our configuration in place, let’s deploy and test our rate-limited function:

```
$ netlify deploy --prod
```

Once that’s complete, you can verify rate limiting works by hitting the API in quick succession. For example:

```
for i in {1..25}; do
  curl -s -o /dev/null -w "%{http_code}\n" \
    -X POST https://ai-rate-limit-demo.netlify.app/api/chat \
    -H "Content-Type: application/json" \
    -d '{"message": "Hello"}'
done
```

You should see `200` responses followed by `429` responses once you exceed 20 requests.

## Customizing the rate limit response

A bare `429` response isn’t very user-friendly, so let’s add a custom error page that tells users when they can retry.

First, create the following file in [your publish directory](https://docs.netlify.com/build/configure-builds/overview/#set-the-publish-directory) (the following assumes this to be the `public/` folder):

public/rate-limited.html

```
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>Slow down!</title>
</head>
<body>
  <h1>You're sending requests too fast</h1>
  <p>To keep this service running smoothly for everyone, we limit how many requests you can make. Please wait 60 seconds and try again.</p>
</body>
</html>
```

Update your function config to rewrite to this page instead of returning `429`:

```
export const config: Config = {
  path: "/api/chat",
  rateLimit: {
    action: "rewrite",
    to: "/rate-limited.html",
    windowLimit: 20,
    windowSize: 60,
    aggregateBy: ["ip", "domain"],
  },
};
```

Now users will understand why things are running slower than expected and can plan accordingly.
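
One thing to keep in mind: with a rewrite, a client calling `/api/chat` may receive an HTML page where it expected JSON, so calling `response.json()` blindly could throw. Here's a minimal sketch of a guard, assuming your endpoint otherwise only returns JSON. The `classifyChatResponse` helper is hypothetical, and it checks both the status code and the `Content-Type` because this guide doesn't confirm which status the rewritten response carries.

```typescript
type ChatOutcome = "ok" | "rate_limited" | "error";

// Hypothetical helper: decide how to treat a response before parsing it.
// Assumes the chat endpoint itself only ever answers with JSON.
function classifyChatResponse(
  status: number,
  contentType: string | null,
): ChatOutcome {
  const isHtml = (contentType ?? "").toLowerCase().includes("text/html");

  if (status >= 500) return "error"; // server-side failure, whatever the body
  if (status === 429 || isHtml) return "rate_limited"; // blocked or rewritten
  if (status >= 200 && status < 300) return "ok"; // safe to parse as JSON
  return "error";
}
```

In a browser you would call it as `classifyChatResponse(response.status, response.headers.get("Content-Type"))` and only parse the body when the result is `"ok"`.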

## Rate limiting for different use cases

The right limits depend on your application. Here are configurations for common scenarios:

### 1\. Public chatbot

For a public-facing chatbot where you expect casual usage, you’ll want to provide fairly generous limits:

```
rateLimit: {
  windowLimit: 30,      // 30 requests
  windowSize: 60,       // per minute
  aggregateBy: ["ip", "domain"],
}
```

This allows roughly one request every 2 seconds per user—plenty for conversational interactions.

### 2\. API with authenticated users

If your endpoint requires authentication, you might want higher limits for legitimate users while still protecting against abuse:

```
rateLimit: {
  windowLimit: 100,     // 100 requests
  windowSize: 60,       // per minute
  aggregateBy: ["ip", "domain"],
}
```

Consider implementing tiered limits based on user roles in your application logic.
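
As a starting point, here's a minimal sketch of what a role-based check might look like inside your function. The role names, quotas, and helper functions are all hypothetical, and the sketch assumes you already track each user's request count for the current window in a shared store, since function instances don't share memory.

```typescript
// Hypothetical roles and per-minute quotas; tune these for your own app.
type Role = "free" | "pro" | "admin";

const REQUESTS_PER_MINUTE: Record<Role, number> = {
  free: 5,
  pro: 30,
  admin: 100,
};

// usedThisWindow is assumed to come from a shared store (database, KV, etc.).
function remainingQuota(role: Role, usedThisWindow: number): number {
  return Math.max(0, REQUESTS_PER_MINUTE[role] - usedThisWindow);
}

// True when the user still has at least one request left in this window.
function isAllowed(role: Role, usedThisWindow: number): boolean {
  return remainingQuota(role, usedThisWindow) > 0;
}
```

The platform-level `rateLimit` rule still acts as the outer guard per IP; a check like this only narrows things further per user.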

### 3\. High-cost model endpoint

For endpoints using expensive models (GPT-4, Claude Opus), you might want to be more conservative:

```
rateLimit: {
  windowLimit: 10,      // 10 requests
  windowSize: 60,       // per minute
  aggregateBy: ["ip", "domain"],
}
```
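
To sanity-check a limit against your budget, it can help to work out the worst case: what a single client could cost you if it maxed out its quota for a full hour. The sketch below is just arithmetic. The prices are placeholders rather than any provider's real rates, and `maxInputTokens` assumes you cap prompt size in your own code.

```typescript
interface CostInputs {
  windowLimit: number; // requests allowed per window
  windowSizeSeconds: number; // window length in seconds
  maxInputTokens: number; // per request; assumes you cap prompt size yourself
  maxOutputTokens: number; // per request, e.g. your max_tokens setting
  inputPricePerMillion: number; // USD per 1M input tokens (placeholder)
  outputPricePerMillion: number; // USD per 1M output tokens (placeholder)
}

// Worst-case hourly spend for ONE client that always hits the limit.
function worstCaseCostPerHour(c: CostInputs): number {
  const requestsPerHour = (3600 / c.windowSizeSeconds) * c.windowLimit;
  const costPerRequest =
    (c.maxInputTokens * c.inputPricePerMillion +
      c.maxOutputTokens * c.outputPricePerMillion) /
    1e6;
  return requestsPerHour * costPerRequest;
}

// 10 requests per 60 seconds with made-up prices of $1 and $2 per million tokens.
const hourlyWorstCase = worstCaseCostPerHour({
  windowLimit: 10,
  windowSizeSeconds: 60,
  maxInputTokens: 1000,
  maxOutputTokens: 500,
  inputPricePerMillion: 1,
  outputPricePerMillion: 2,
});
```

With these placeholder numbers, one client tops out at 600 requests an hour at $0.002 each, or roughly $1.20. Multiply by the number of distinct IPs you expect an attacker to control to get a feel for total exposure.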

## Using Edge Functions for lower latency

For AI endpoints where you want rate limiting decisions made as close to the user as possible, consider [Edge Functions](https://docs.netlify.com/build/edge-functions/overview/). They run on Deno at the network edge and support the same `rateLimit` config syntax:

netlify/edge-functions/chat.ts

```
import type { Config, Context } from "@netlify/edge-functions";

export default async (request: Request, context: Context) => {
  // Edge Functions run on Deno, but npm packages work through bundling
  // The AI Gateway injects environment variables here too

  const OPENAI_API_KEY = Netlify.env.get("OPENAI_API_KEY");
  const OPENAI_BASE_URL = Netlify.env.get("OPENAI_BASE_URL");

  const body = await request.json();

  const response = await fetch(`${OPENAI_BASE_URL}/v1/chat/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: body.message }],
      max_tokens: 500,
    }),
  });

  return response;
};

export const config: Config = {
  path: "/api/edge-chat",
  rateLimit: {
    windowLimit: 20,
    windowSize: 60,
    aggregateBy: ["ip", "domain"],
  },
};
```

Edge Functions apply rate limits before the request reaches your function code, reducing wasted compute on rejected requests. Note that Edge Functions have a 50ms CPU execution limit (though network wait time doesn’t count), making them ideal for lightweight proxying rather than complex processing.

## Rate limiting proxied external APIs

Sometimes you’re not running a serverless function—you’re proxying directly to an external API. You can rate limit these through redirects in `netlify.toml`:

```
[[redirects]]
  from = "/api/external-ai"
  to = "https://api.example.com/inference"
  status = 200
  force = true
  [redirects.rate_limit]
    window_limit = 50
    window_size = 60
    aggregate_by = ["ip", "domain"]
```

This protects the external API from abuse through your domain without writing any function code.

## Protecting your whole site

If you’re building an AI-first application where most routes involve inference, you might want a blanket rate limit:

```
[[redirects]]
  from = "/*"
  to = "/:splat"
  [redirects.rate_limit]
    action = "rewrite"
    to = "/rate-limited.html"
    window_limit = 100
    window_size = 60
    aggregate_by = ["ip", "domain"]
```

This catches everything but still allows reasonable usage. Adjust the `window_limit` based on your application’s needs.

## Handling rate limits gracefully in your frontend

Your frontend should anticipate `429` responses and handle them gracefully:

```
async function sendMessage(message) {
  try {
    const response = await fetch('/api/chat', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ message }),
    });

    if (response.status === 429) {
      // Rate limited
      const retryAfter = response.headers.get('Retry-After') || 60;
      showNotification(`Too many requests. Please wait ${retryAfter} seconds.`);
      return null;
    }

    if (!response.ok) {
      throw new Error(`HTTP ${response.status}`);
    }

    return await response.json();
  } catch (error) {
    console.error('Request failed:', error);
    showNotification('Something went wrong. Please try again.');
    return null;
  }
}
```

For a better UX, consider implementing exponential backoff for retries:

```
async function sendWithRetry(message, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const response = await fetch('/api/chat', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ message }),
    });

    if (response.status !== 429) {
      return response;
    }

    // Exponential backoff: 1s, 2s, 4s
    const delay = Math.pow(2, attempt) * 1000;
    await new Promise(resolve => setTimeout(resolve, delay));
  }

  throw new Error('Rate limit exceeded after retries');
}
```
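
If the `429` response carries a `Retry-After` header, it's usually better to honor it than to guess. Here's a small sketch of a delay calculator you could swap into the loop above. It assumes `Retry-After` is given in seconds (the header can also be an HTTP date, which this ignores), and `backoffDelayMs` along with its defaults is hypothetical.

```typescript
// Hypothetical helper: pick how long to wait before the next retry.
// Prefers the server's Retry-After hint (in seconds) when it is usable,
// otherwise falls back to capped exponential backoff.
function backoffDelayMs(
  attempt: number, // 0 for the first retry, 1 for the second, ...
  retryAfterHeader: string | null,
  baseMs: number = 1000,
  maxMs: number = 30000,
): number {
  if (retryAfterHeader !== null && retryAfterHeader.trim() !== "") {
    const seconds = Number(retryAfterHeader);
    if (Number.isFinite(seconds) && seconds >= 0) {
      return seconds * 1000; // trust the server's hint as-is
    }
  }

  // Exponential backoff: base, 2x base, 4x base, ... capped at maxMs.
  return Math.min(baseMs * 2 ** attempt, maxMs);
}
```

Inside the retry loop, that would look like `const delay = backoffDelayMs(attempt, response.headers.get('Retry-After'));`.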

## Monitoring and tuning your limits

Rate limits aren’t set-and-forget. You need visibility into how they’re performing.

### Check deploy logs

Netlify validates rate limit rules during post-processing. Check your deploy logs to confirm rules are applied:

```
Post-processing - Rate limiting rules applied:
  /api/chat: 20 requests per 60 seconds per IP
```

### Track 429 responses

This is a great opportunity to check out [Netlify Observability](https://docs.netlify.com/manage/monitoring/observability/overview/) which surfaces aggregate response codes from a single cohesive view of all site traffic.

If necessary, add some logging to your function to provide additional detail. These values will be printed to your function logs:

```
export default async (request: Request, context: Context) => {
  // Log request metadata for analysis
  console.log(JSON.stringify({
    timestamp: new Date().toISOString(),
    path: new URL(request.url).pathname,
    ip: context.ip,
    userAgent: request.headers.get("user-agent"),
  }));

  // ... rest of your function
};
```

### Adjust based on real usage

Once you’ve got a baseline of your average usage, you can dial in the limits to properly suit your application’s needs.

Consider the following strategies when evaluating your requirements:

1.  **High 429 rate (>5%)**: Your limits might be too tight for legitimate usage. Consider increasing `windowLimit` or shortening `windowSize`.
2.  **Low 429 rate (<0.1%)**: Your limits might be too loose to catch abuse. Consider tightening, especially if you see cost spikes.
3.  **Latency spikes**: If you see p99 latency increasing, traffic might be overwhelming your backend despite staying under limits. Consider lowering limits or adding a global cap.
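
The first two thresholds above are easy to turn into a quick check. Here's a minimal sketch that takes a list of response status codes (for example, exported from your logs) and applies the same rules of thumb. The `assessLimits` helper and its verdict names are hypothetical.

```typescript
type Verdict = "no_data" | "too_tight" | "possibly_too_loose" | "ok";

// Fraction of responses that were rate limited (HTTP 429).
function rateLimitedShare(statusCodes: number[]): number {
  if (statusCodes.length === 0) return 0;
  const limited = statusCodes.filter((code) => code === 429).length;
  return limited / statusCodes.length;
}

// Applies the rules of thumb: above 5% is too tight, below 0.1% may be too loose.
function assessLimits(statusCodes: number[]): Verdict {
  if (statusCodes.length === 0) return "no_data";
  const share = rateLimitedShare(statusCodes);
  if (share > 0.05) return "too_tight";
  if (share < 0.001) return "possibly_too_loose";
  return "ok";
}
```

Treat the verdict as a prompt to look closer rather than a decision: a low `429` rate on a quiet endpoint is perfectly healthy if costs are flat.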

## Complete example

Here’s a production-ready chat function that combines everything we’ve covered:

netlify/functions/chat.ts

```
import type { Config, Context } from "@netlify/functions";
import OpenAI from "openai";

const openai = new OpenAI();

interface ChatRequest {
  message: string;
  conversationId?: string;
}

export default async (request: Request, context: Context) => {
  // Only allow POST
  if (request.method !== "POST") {
    return new Response(
      JSON.stringify({ error: "Method not allowed" }),
      { status: 405, headers: { "Content-Type": "application/json" } }
    );
  }

  // Log for monitoring
  console.log(JSON.stringify({
    event: "chat_request",
    timestamp: new Date().toISOString(),
    ip: context.ip,
  }));

  try {
    const body: ChatRequest = await request.json();

    // Validate input
    if (!body.message || typeof body.message !== "string") {
      return new Response(
        JSON.stringify({ error: "Message is required" }),
        { status: 400, headers: { "Content-Type": "application/json" } }
      );
    }

    // Limit message length to control token usage
    // This number is intentionally conservative
    // long prompts are the fastest way to blow up costs
    if (body.message.length > 2000) {
      return new Response(
        JSON.stringify({ error: "Message too long (max 2000 characters)" }),
        { status: 400, headers: { "Content-Type": "application/json" } }
      );
    }

    const startTime = Date.now();

    const completion = await openai.chat.completions.create({
      model: "gpt-4o-mini",
      messages: [
        {
          role: "system",
          content: "You are a helpful assistant. Be concise."
        },
        { role: "user", content: body.message }
      ],
      max_tokens: 500,
    });

    const duration = Date.now() - startTime;

    // Log completion for monitoring
    console.log(JSON.stringify({
      event: "chat_complete",
      duration,
      tokens: completion.usage?.total_tokens,
    }));

    return new Response(
      JSON.stringify({
        response: completion.choices[0]?.message?.content,
        usage: {
          tokens: completion.usage?.total_tokens,
        }
      }),
      {
        headers: {
          "Content-Type": "application/json",
          "Cache-Control": "no-store",
        }
      }
    );
  } catch (error) {
    console.error("Chat error:", error);

    // Check if it's a rate limit from OpenAI.
    // The SDK throws RateLimitError on HTTP 429, which is more reliable
    // than matching on the (case-sensitive) error message text.
    if (error instanceof OpenAI.RateLimitError) {
      return new Response(
        JSON.stringify({ error: "Service temporarily unavailable" }),
        {
          status: 503,
          headers: {
            "Content-Type": "application/json",
            "Retry-After": "30",
          }
        }
      );
    }

    return new Response(
      JSON.stringify({ error: "Failed to process request" }),
      { status: 500, headers: { "Content-Type": "application/json" } }
    );
  }
};

export const config: Config = {
  path: "/api/chat",
  rateLimit: {
    action: "rewrite",
    to: "/rate-limited.html",
    windowLimit: 20,
    windowSize: 60,
    aggregateBy: ["ip", "domain"],
  },
};
```

## Enterprise features

As mentioned above, Enterprise plans with [High-Performance Edge](https://www.netlify.com/platform/core/high-performance-edge/) unlock additional capabilities such as more rules per project and an admin UI for creating and managing rules without code deployments. This is especially handy for responding to abuse patterns in real-time.

You also get access to advanced targeting features allowing you to rate-limit based on IP range, geolocation, request headers (this is useful for targeting specific user-agents) or cookies (this is useful for fine-tuning limits based on a user’s session).

Additional benefits include team-wide policies and per-domain aggregation which gives you the ability to define aggregate rules that apply to all users.

## Summary

Rate limiting AI endpoints isn’t a nice-to-have. The costs and latency characteristics of LLM inference mean that unprotected endpoints are both expensive and fragile. With these strategies under your belt, implementing reasonable constraints is easier than ever.

Netlify’s code-based rate limiting gives you:

-   **Protection against abuse**: Per-IP limits prevent any single actor from monopolizing resources
-   **Cost control**: Hard caps prevent runaway spending from loops or attacks
-   **Better UX**: Custom error pages and proper HTTP status codes help users understand what’s happening
-   **Flexibility**: Adjust limits based on endpoint, model cost, or user type

Start with conservative limits, monitor your `429` rate, and adjust based on real usage patterns. Your budget and your users will thank you.
