💰 Setting Team Budgets

Track spend and set budgets for your internal teams.

Setting Monthly Team Budgets

1. Create a team

  • Set max_budget=0.000000001 (the $ value the team is allowed to spend)
  • Set budget_duration="1d" (how frequently the budget should reset)

Create a new team and set max_budget and budget_duration

curl -X POST '' \
-H 'Authorization: Bearer sk-1234' \
-H 'Content-Type: application/json' \
-d '{
  "team_alias": "QA Prod Bot",
  "max_budget": 0.000000001,
  "budget_duration": "1d"
}'

Response

{
  "team_alias": "QA Prod Bot",
  "team_id": "de35b29e-6ca8-4f47-b804-2b79d07aa99a",
  "max_budget": 0.0001,
  "budget_duration": "1d",
  "budget_reset_at": "2024-06-14T22:48:36.594000Z"
}

Possible values for budget_duration

budget_duration          When the budget resets
budget_duration="1s"     every 1 second
budget_duration="1m"     every 1 minute
budget_duration="1h"     every 1 hour
budget_duration="1d"     every 1 day
budget_duration="1mo"    every 1 month
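
To make the duration strings concrete, here is a minimal sketch of how such values could be turned into a reset interval. The helper name parse_budget_duration is hypothetical and this is not the proxy's own parser; treating "mo" as a fixed 30 days is an assumption made only for this illustration.

```python
import re
from datetime import timedelta

# Seconds per unit suffix. Treating "mo" as 30 days is an assumption
# made for this sketch; a real implementation may use calendar months.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "mo": 30 * 86400}


def parse_budget_duration(value: str) -> timedelta:
    """Convert a string such as "1d" or "30s" into a timedelta."""
    # "mo" is listed first so that "1mo" is not read as "1m" plus a stray "o".
    match = re.fullmatch(r"(\d+)(mo|s|m|h|d)", value.strip())
    if match is None:
        raise ValueError(f"unsupported budget_duration: {value!r}")
    count, unit = int(match.group(1)), match.group(2)
    return timedelta(seconds=count * _UNIT_SECONDS[unit])


for duration in ("1s", "1m", "1h", "1d", "1mo"):
    print(duration, "->", parse_budget_duration(duration))
```

Adding the resulting interval to the time the budget was created gives a value comparable to the budget_reset_at field in the response.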

2. Create a key for the team

Create a key for Team=QA Prod Bot and team_id="de35b29e-6ca8-4f47-b804-2b79d07aa99a" from Step 1

💡 The budget for Team="QA Prod Bot" will apply to this key

curl -X POST '' \
-H 'Authorization: Bearer sk-1234' \
-H 'Content-Type: application/json' \
-d '{"team_id": "de35b29e-6ca8-4f47-b804-2b79d07aa99a"}'


{"team_id":"de35b29e-6ca8-4f47-b804-2b79d07aa99a", "key":"sk-5qtncoYjzRcxMM4bDRktNQ"}

3. Test It

Use the key from step 2 and run this request twice

curl -X POST '' \
-H 'Authorization: Bearer sk-5qtncoYjzRcxMM4bDRktNQ' \
-H 'Content-Type: application/json' \
-d '{
  "model": "llama3",
  "messages": [
    {
      "role": "user",
      "content": "hi"
    }
  ]
}'

On the second request, expect to see the following exception

{
  "error": {
    "message": "Budget has been exceeded! Current cost: 3.5e-06, Max budget: 1e-09",
    "type": "auth_error",
    "param": null,
    "code": 400
  }
}

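Why the first request succeeds and the second fails can be sketched as follows. This is a simplified illustration, assuming the budget is checked before each request and the request's cost is recorded only afterwards; TeamBudget and check_budget are hypothetical names, not the proxy's internals.

```python
from dataclasses import dataclass


@dataclass
class TeamBudget:
    max_budget: float   # USD the team may spend per budget period
    spend: float = 0.0  # USD spent so far in the current period


def check_budget(team: TeamBudget) -> None:
    """Raise if the team's recorded spend has reached its budget."""
    if team.spend >= team.max_budget:
        raise PermissionError(
            f"Budget has been exceeded! Current cost: {team.spend}, "
            f"Max budget: {team.max_budget}"
        )


team = TeamBudget(max_budget=1e-09)
check_budget(team)      # first request: nothing spent yet, so it passes
team.spend += 3.5e-06   # the first request's cost is recorded afterwards
try:
    check_budget(team)  # second request: spend now exceeds the budget
except PermissionError as err:
    print(err)
```
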

Prometheus metrics for remaining_budget

See the Prometheus metrics documentation for more info.

You'll need the following in your proxy config.yaml

litellm_settings:
  success_callback: ["prometheus"]
  failure_callback: ["prometheus"]

Expect to see this metric in Prometheus. It tracks the remaining budget for the team

litellm_remaining_team_budget_metric{team_alias="QA Prod Bot",team_id="de35b29e-6ca8-4f47-b804-2b79d07aa99a"} 9.699999999999992e-06
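
If you want to read this value in a script rather than in a dashboard, the sample line can be parsed with a few lines of Python. This is a minimal sketch for simple label values like the ones shown; parse_metric_line is a hypothetical helper and it does not handle escaped quotes inside labels.

```python
import re

SAMPLE = (
    'litellm_remaining_team_budget_metric{team_alias="QA Prod Bot",'
    'team_id="de35b29e-6ca8-4f47-b804-2b79d07aa99a"} 9.699999999999992e-06'
)


def parse_metric_line(line: str):
    """Split one labelled metric sample into (name, labels, value)."""
    match = re.fullmatch(r"(\w+)\{(.*)\}\s+(\S+)", line.strip())
    if match is None:
        raise ValueError(f"not a labelled metric sample: {line!r}")
    name, raw_labels, value = match.groups()
    # Collect key="value" pairs; escaped quotes are not supported here.
    labels = dict(re.findall(r'(\w+)="([^"]*)"', raw_labels))
    return name, labels, float(value)


name, labels, remaining = parse_metric_line(SAMPLE)
print(labels["team_alias"], remaining)
```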

Dynamic TPM/RPM Allocation

Prevent any single project from consuming too much TPM/RPM.

Dynamically allocate TPM/RPM quota to API keys, based on the number of keys active in that minute.

  1. Setup config.yaml
model_list:
  - model_name: my-fake-model
    litellm_params:
      model: gpt-3.5-turbo
      api_key: my-fake-key
      mock_response: hello-world
      tpm: 60

litellm_settings:
  callbacks: ["dynamic_rate_limiter"]

general_settings:
  master_key: sk-1234 # OR set `LITELLM_MASTER_KEY=".."` in your .env
  database_url: postgres://.. # OR set `DATABASE_URL=".."` in your .env
  2. Start proxy
litellm --config /path/to/config.yaml
  3. Test it!
- Run 2 concurrent teams calling the same model
- The model has 60 TPM
- The mock response returns 30 total tokens per request
- With 2 active keys sharing 60 TPM, each team can make only 1 request per minute

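import requests
from openai import OpenAI, RateLimitError

# Base URL of your running proxy. It is blank in the original; fill it in.
PROXY_BASE_URL = ""


def create_key(api_key: str, base_url: str) -> str:
    """Create a new key on the proxy and return it."""
    # The endpoint path was lost from the original snippet;
    # /key/generate is the proxy's key-creation route.
    response = requests.post(
        url="{}/key/generate".format(base_url),
        json={},
        headers={"Authorization": "Bearer {}".format(api_key)},
    )
    _response = response.json()
    return _response["key"]


key_1 = create_key(api_key="sk-1234", base_url=PROXY_BASE_URL)
key_2 = create_key(api_key="sk-1234", base_url=PROXY_BASE_URL)

# call proxy with key 1 - works
openai_client_1 = OpenAI(api_key=key_1, base_url=PROXY_BASE_URL)

# with_raw_response exposes the HTTP headers; .parse() returns the completion
response = openai_client_1.chat.completions.with_raw_response.create(
    model="my-fake-model", messages=[{"role": "user", "content": "Hello world!"}],
)

print("Headers for call 1 - {}".format(response.headers))
_response = response.parse()
print("Total tokens for call - {}".format(_response.usage.total_tokens))

# call proxy with key 2 - works
openai_client_2 = OpenAI(api_key=key_2, base_url=PROXY_BASE_URL)

response = openai_client_2.chat.completions.with_raw_response.create(
    model="my-fake-model", messages=[{"role": "user", "content": "Hello world!"}],
)

print("Headers for call 2 - {}".format(response.headers))
_response = response.parse()
print("Total tokens for call - {}".format(_response.usage.total_tokens))

# call proxy with key 2 - fails
try:
    openai_client_2.chat.completions.with_raw_response.create(
        model="my-fake-model",
        messages=[{"role": "user", "content": "Hey, how's it going?"}],
    )
    raise Exception("This should have failed!")
except RateLimitError as e:
    print("This was rate limited b/c - {}".format(str(e)))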

Expected Response

This was rate limited b/c - Error code: 429 - {'error': {'message': {'error': 'Key=<hashed_token> over available TPM=0. Model TPM=0, Active keys=2'}, 'type': 'None', 'param': 'None', 'code': 429}}
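
The arithmetic behind this result can be sketched as follows. This is a minimal illustration, assuming the limiter splits the model's TPM evenly among the keys active in the current minute; the function names are hypothetical and this is not the proxy's actual code.

```python
def available_tpm_per_key(model_tpm: int, active_keys: int) -> int:
    """Evenly divide a model's TPM among the keys active this minute."""
    if active_keys <= 0:
        # No other active keys: the whole quota is available.
        return model_tpm
    return model_tpm // active_keys


def requests_allowed(model_tpm: int, active_keys: int, tokens_per_request: int) -> int:
    """Number of whole requests one key can make per minute."""
    return available_tpm_per_key(model_tpm, active_keys) // tokens_per_request


# Values from the test setup: 60 TPM, 2 active keys, 30 tokens per request.
share = available_tpm_per_key(model_tpm=60, active_keys=2)
print("TPM per key:", share)
print("Requests per key per minute:", requests_allowed(60, 2, 30))
```

Under this assumption each key gets 30 TPM, so one 30-token request uses up its share and the second call from the same key is rejected.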

[BETA] Set Priority / Reserve Quota

Reserve tpm/rpm capacity for projects in prod.


  1. Setup config.yaml
model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: "gpt-3.5-turbo"
      api_key: os.environ/OPENAI_API_KEY
      rpm: 100

litellm_settings:
  callbacks: ["dynamic_rate_limiter"]
  priority_reservation: {"dev": 0, "prod": 1}

general_settings:
  master_key: sk-1234 # OR set `LITELLM_MASTER_KEY=".."` in your .env
  database_url: postgres://.. # OR set `DATABASE_URL=".."` in your .env


priority_reservation:

  • Dict[str, float]
    • str: can be any string
    • float: from 0 to 1. Specify the fraction of tpm/rpm to reserve for keys of this priority.
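
As a rough illustration of what these values mean, the sketch below multiplies the model's limit by the reserved fraction. It is a simplified sketch with a hypothetical reserved_rpm helper; how the proxy further divides a priority's share among active keys is not covered here.

```python
from typing import Dict


def reserved_rpm(model_rpm: int, priority_reservation: Dict[str, float], priority: str) -> int:
    """Return the RPM reserved for keys of the given priority."""
    share = priority_reservation[priority]
    if not 0.0 <= share <= 1.0:
        raise ValueError(f"reservation for {priority!r} must be between 0 and 1")
    return int(model_rpm * share)


# Values from the config in this section: rpm=100, dev=0, prod=1.
reservation = {"dev": 0, "prod": 1}
print("dev:", reserved_rpm(100, reservation, "dev"))
print("prod:", reserved_rpm(100, reservation, "prod"))
```

With "dev" set to 0, keys of that priority have no reserved capacity, while "prod" keys reserve the full 100 RPM.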

Start Proxy

litellm --config /path/to/config.yaml
  2. Create a key with that priority

Set "priority" in the key's metadata (👈 key change).

curl -X POST '' \
-H 'Authorization: Bearer <your-master-key>' \
-H 'Content-Type: application/json' \
-d '{
  "metadata": {"priority": "dev"}
}'

Expected Response

{
  "key": "sk-.."
}
  3. Test it!

Use the key from step 2 in the Authorization header.

curl -X POST '' \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer sk-...' \
-d '{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "user",
      "content": "what llm are you"
    }
  ]
}'

Expected Response

Key=... over available RPM=0. Model RPM=100, Active keys=None