I’m on a project where we’re connecting to OpenAI to build a chat bot.
Because OpenAI’s rate limits are more complicated than those of most APIs, we proactively avoid hitting them by keeping track of how many tokens we’re using. However, this approach is incomplete: it does nothing to limit an individual user’s usage.
A chat bot feature is unique in that it can quickly exhaust your organization’s rate limits across two dimensions: requests per minute (RPM) and tokens per minute (TPM). This is because…
- Chatting enables the user to send messages in rapid succession, which increases RPM.
- Messages can be long, contain attachments, and as the conversation grows, so does the context window, which increases TPM.
Once a user hits the organization’s limit, everything that uses OpenAI stops working. On my project, that means not only would the chat bot feature be down, but so would the other tools we use that leverage OpenAI.
In short, one user can trigger a denial of service simply by using a feature as it was intended to be used.
Fortunately, there are a few solutions to this problem.
Our base
For the sake of this demonstration, we’ll use RubyLLM, but the concepts we’ll learn apply to any implementation and platform.
chat = RubyLLM.chat
response = chat.ask "What's a good rate limit strategy for a chat bot?"
puts response.content
Rate limit messages
The first thing we can do is limit the number of messages a user can send in a given time frame. This is known as “fixed-window rate limiting”. Although Rails ships with a rate limit mechanism, that won’t help us when we need to rate limit token usage. Instead, we can rely on a cache store like Redis, since its built-in INCR and EXPIRE APIs lend themselves well to a rate limiting mechanism.
# app/models/usage.rb
class Usage
MAX_RPM = ENV.fetch("USAGE_MAX_RPM", 10).to_i
def initialize(user, cache_store = ActiveSupport::Cache::RedisCacheStore.new)
@user = user
@cache_store = cache_store
end
def track!
track_rpm!
end
def exceeded?
rpm_exceeded?
end
private
attr_reader :user, :cache_store
def rpm_key
"usage:user:#{user.id}:rpm"
end
def track_rpm!
cache_store.increment(rpm_key, 1, expires_in: 1.minute)
end
def rpm_exceeded?
cache_store.read(rpm_key, raw: true).to_i >= MAX_RPM
end
end
# app/models/user.rb
class User < ApplicationRecord
def usage
Usage.new(self)
end
end
This simple object tracks a user’s requests per minute by incrementing a counter for each request made, making sure to expire the key after one minute.
Here’s how that might look when used with our chat bot:
chat = RubyLLM.chat
user = User.last!
usage = user.usage
unless usage.exceeded?
response = chat.ask "What's a good rate limit strategy for a chat bot?"
usage.track!
puts response.content
else
# Alert user that they've exceeded their usage.
end
Before making a request, we check to see if the user has exhausted their individual rate limit. If not, we make the request and track the usage.
Limit token usage
Limiting requests is only half of the problem; token usage also needs rate limiting.
We can’t just validate the length of a message, since token usage doesn’t map 1:1 to character length. Output tokens count toward usage too, and we can’t know those ahead of time.
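That said, if a pre-flight guess is ever useful, a common rule of thumb (an assumption about English text, not an API guarantee) is roughly four characters per token:

```ruby
# Rough pre-flight token estimate. This is only a heuristic — the
# authoritative numbers come from the usage data in the API response.
def estimated_tokens(text)
  (text.length / 4.0).ceil
end

estimated_tokens("What's a good rate limit strategy for a chat bot?") # => 13
```

A heuristic like this can reject obviously oversized messages early, but it should never replace tracking real usage.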
Fortunately, OpenAI returns usage data in the response, which
RubyLLM exposes in a Message instance.
class Usage
MAX_RPM = ENV.fetch("USAGE_MAX_RPM", 10).to_i
+ MAX_TPM = ENV.fetch("USAGE_MAX_TPM", 10_000).to_i
def initialize(user, cache_store = ActiveSupport::Cache::RedisCacheStore.new)
@user = user
@cache_store = cache_store
end
- def track!
- track_rpm!
+ def track!(total_tokens)
+ [ track_rpm!, track_tpm!(total_tokens) ]
end
def exceeded?
- rpm_exceeded?
+ rpm_exceeded? || tpm_exceeded?
end
private
@@ -29,4 +30,16 @@ class Usage
def rpm_exceeded?
cache_store.read(rpm_key, raw: true).to_i >= MAX_RPM
end
+
+ def tpm_key
+ "usage:user:#{user.id}:tpm"
+ end
+
+ def track_tpm!(total_tokens)
+ cache_store.increment(tpm_key, total_tokens, expires_in: 1.minute)
+ end
+
+ def tpm_exceeded?
+ cache_store.read(tpm_key, raw: true).to_i >= MAX_TPM
+ end
end
We can use the same pattern we used for tracking requests to track tokens. The only difference is that we must supply the token count ourselves, since it comes from the response rather than from counting requests.
Here’s how that would look in our chat bot:
unless usage.exceeded?
response = chat.ask "What's a good rate limit strategy for a chat bot?"
- usage.track!
+ tokens = response.tokens
+ total_tokens = tokens.input + tokens.output + tokens.thinking
+
+ usage.track!(total_tokens)
puts response.content
else
After making a request, we extract the token usage from the response and pass it to our Usage instance, where it counts toward the limit checked on the next request.
Calculating per-user rate limits
Since the rate limits set by OpenAI and other providers are at the organization level, how might we evenly distribute those values on a per-user basis?
A simple approach suitable for most early-stage products is to divide the organization limit by the expected number of concurrent users. To account for unexpected spikes in traffic, we can add a buffer.
Per-user limit = (Organization limit × buffer) / Expected concurrent users
Below is what that looks like using the values in this demonstration.
| Metric | Org Limit | Concurrent Users | Buffer | Per-user Limit |
|---|---|---|---|---|
| RPM | 125 | 10 | 0.8 | (125 × 0.8) / 10 = 10 |
| TPM | 125,000 | 10 | 0.8 | (125,000 × 0.8) / 10 = 10,000 |
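Plugging in the numbers (the org limits here are illustrative — real values come from your provider’s tier), the arithmetic works out to match the defaults used earlier:

```ruby
# Illustrative org-level limits and assumptions, not real tier values.
ORG_RPM = 125
ORG_TPM = 125_000
BUFFER  = 0.8               # leave 20% headroom for traffic spikes
EXPECTED_CONCURRENT_USERS = 10

per_user_rpm = (ORG_RPM * BUFFER) / EXPECTED_CONCURRENT_USERS # => 10.0
per_user_tpm = (ORG_TPM * BUFFER) / EXPECTED_CONCURRENT_USERS # => 10000.0
```

These are exactly the `USAGE_MAX_RPM` and `USAGE_MAX_TPM` defaults used in the Usage class above.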
Additional considerations
The core problem is that a chat bot is inherently expensive. If we stick to these numbers, we’re constrained to 10 concurrent users, which feels pretty limited, even for an early-stage product.
One solution is to queue the requests rather than reject them, which might look something like this:
class ProcessPromptJob < ApplicationJob
queue_as :default
MAX_ATTEMPTS = 2
rescue_from(Usage::RateLimitExhaustedError) do
if executions < MAX_ATTEMPTS
retry_job wait: 15.seconds
else
# Broadcast failure message to user
end
end
def perform(user, prompt)
raise Usage::RateLimitExhaustedError if user.usage.exceeded?
chat = RubyLLM.chat
response = chat.ask(prompt)
tokens = response.tokens
total_tokens = tokens.input + tokens.output + tokens.thinking
user.usage.track!(total_tokens)
# Broadcast LLM response to user
end
end
class Usage
+ class RateLimitExhaustedError < StandardError; end
+
MAX_RPM = ENV.fetch("USAGE_MAX_RPM", 10).to_i
MAX_TPM = ENV.fetch("USAGE_MAX_TPM", 10_000).to_i
The flow would be something like this:
User sends message
└─▶ Show loading state
└─▶ Enqueue background job
└─▶ Rate limit exceeded?
├─ No ──▶ Process request ──▶ Broadcast response
└─ Yes ──▶ Wait 15 seconds
└─▶ Rate limit exceeded?
├─ No ──▶ Process request ──▶ Broadcast response
└─ Yes ──▶ Broadcast failure
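Outside of Active Job, the branching above can be sketched in plain Ruby (the helper name and the two-attempt limit are assumptions mirroring the job):

```ruby
# Hypothetical sketch of the retry flow, to make the control flow concrete.
# `exceeded_checks` stands in for successive `usage.exceeded?` results.
MAX_ATTEMPTS = 2

def process_with_retry(exceeded_checks)
  attempts = 0
  loop do
    attempts += 1
    return :processed unless exceeded_checks.shift # rate limit clear: process
    return :failed if attempts >= MAX_ATTEMPTS     # out of attempts: give up
    # In the job, this is `retry_job wait: 15.seconds`.
  end
end

process_with_retry([true, false]) # => :processed (second attempt succeeds)
process_with_retry([true, true])  # => :failed   (both attempts rate limited)
```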
Wrapping up
Limiting a user’s application usage is not a new problem, but it’s more relevant today with the rise of chat bot features.
The examples above highlight simple solutions, but they’re just a start — a more nuanced approach will likely be needed.
For example, if a user hits their limit, do you attempt to upsell them? Do you flat out block them? Or, should you fall back to a cheaper model? Maybe a combination of these suggestions?
Regardless, these decisions should involve stakeholders such as designers and the product team.