I am implementing an API Management (APIM) solution to integrate two Azure OpenAI deployments (PTU and Pay-per-go). My goal is to manage token usage effectively and avoid 429 errors by dynamically switching between these deployments. Below is the current logic and the issue I’m encountering. I am utilizing guidance from the APIM documentation on Limit Azure OpenAI API token usage.

Objective:

  • Primary Service: PTU deployment with a 10,000 tokens-per-minute quota.
  • Fallback Service: Pay-per-go deployment, triggered when:
    1. Remaining tokens for PTU drop below 600.
    2. PTU deployment returns a 429 status code.

Inbound Policy Snippet:

<inbound>
    <base />
    <choose>
        <when condition="@(context.Request.Url.Path.Contains('/chat/completions'))">
            <azure-openai-token-limit 
                counter-key="@(context.Request.Headers.GetValueOrDefault('Ocp-Apim-Subscription-Key'))" 
                tokens-per-minute="10000" 
                estimate-prompt-tokens="false" 
                remaining-tokens-variable-name="remainingTokens" />
            <choose>
                <when condition="@(context.Variables.ContainsKey('remainingTokens') && context.Variables.GetValueOrDefault<int>('remainingTokens') <= 600)">
                    <set-variable name="backendUrl" value="https://pay-per-go-backend-url.com/" />
                    <set-variable name="backendApiKey" value="PAY_PER_GO_API_KEY" />
                </when>
                <otherwise>
                    <set-variable name="backendUrl" value="https://ptu-url.com/" />
                    <set-variable name="backendApiKey" value="PTU_API_KEY" />
                </otherwise>
            </choose>
        </when>
    </choose>
</inbound>

Backend Policy Snippet:

<backend>
    <choose>
        <when condition="@(context.Response.StatusCode == 429 && context.Variables.GetValueOrDefault<string>('backendUrl') == 'https://ptu-url.com/')">
            <set-variable name="backendUrl" value="https://pay-per-go-backend-url.com/" />
            <set-variable name="backendApiKey" value="PAY_PER_GO_API_KEY" />
            <set-backend-service base-url="@(context.Variables.GetValueOrDefault<string>('backendUrl'))" />
            <forward-request buffer-request-body="true" />
        </when>
        <otherwise>
            <set-backend-service base-url="@(context.Variables.GetValueOrDefault<string>('backendUrl'))" />
            <forward-request buffer-request-body="true" />
        </otherwise>
    </choose>
</backend>

Problem: Despite having the above logic in place, I am still receiving 429 status codes. It seems that the conditions for switching backends are not being triggered as expected.

Here is a trace from a load test run with Locust (3 users, ramp-up rate 3):

backend-url = https://ptu-url.com/ Remaining Tokens = 8854 Consumed Tokens = 1146
backend-url = https://ptu-url.com/ Remaining Tokens = 8697 Consumed Tokens = 1303
backend-url = https://ptu-url.com/ Remaining Tokens = 8702 Consumed Tokens = 1298
backend-url = https://ptu-url.com/ Remaining Tokens = 7536 Consumed Tokens = 1318
backend-url = https://ptu-url.com/ Remaining Tokens = 4973 Consumed Tokens = 1280
backend-url = https://ptu-url.com/ Remaining Tokens = 4904 Consumed Tokens = 1349
backend-url = https://ptu-url.com/ Remaining Tokens = 3544 Consumed Tokens = 1391
backend-url = https://ptu-url.com/ Remaining Tokens = 1136 Consumed Tokens = 1170
backend-url = https://ptu-url.com/ Remaining Tokens = 1066 Consumed Tokens = 1240
{'Content-Length': '85', 'Content-Type': 'application/json', 'Retry-After': '28', 'remainingTokens': '0', 'Date': 'Fri, 22 Nov 2024 07:53:50 GMT'}
{'Content-Length': '85', 'Content-Type': 'application/json', 'Retry-After': '27', 'remainingTokens': '0', 'Date': 'Fri, 22 Nov 2024 07:53:51 GMT'}
{'Content-Length': '85', 'Content-Type': 'application/json', 'Retry-After': '26', 'remainingTokens': '0', 'Date': 'Fri, 22 Nov 2024 07:53:51 GMT'}
{'Content-Length': '85', 'Content-Type': 'application/json', 'Retry-After': '25', 'remainingTokens': '0', 'Date': 'Fri, 22 Nov 2024 07:53:53 GMT'}
{'Content-Length': '85', 'Content-Type': 'application/json', 'Retry-After': '25', 'remainingTokens': '0', 'Date': 'Fri, 22 Nov 2024 07:53:53 GMT'}

There are times when a request is switched to pay-per-go, though. As I understand it, the remaining-token value is only assigned after a request completes successfully. Since these are parallel requests, the token count drops all at once, and before it can satisfy the remaining-tokens condition (<= 600) it exceeds the tokens-per-minute limit of 10k, so the 429 is thrown at that point.

What can be done so that I never get a 429 error while switching between the two Azure OpenAI services?

Asked Nov 22, 2024 at 9:14 by silenthunter25; edited Nov 29, 2024 at 8:26.
  • It might be happening due to concurrent requests – Ikhtesam Afrin Commented Nov 25, 2024 at 12:25
  • @IkhtesamAfrin Yes, true. Is there a way to implement a fallback mechanism? I couldn't find anything for azure-openai-token-limit. Whenever the tokens exceed the tokens-per-minute capacity, processing goes directly to the <on-error> block and we get a 429. Can something be done from there? – silenthunter25 Commented Nov 28, 2024 at 8:17

1 Answer

The azure-openai-token-limit policy prevents Azure OpenAI Service API usage spikes on a per key basis by limiting consumption of language model tokens to a specified number per minute. When the token usage is exceeded, the caller receives a 429 Too Many Requests response status code.

This is how rate-limit policies such as azure-openai-token-limit are designed to work: whenever the specified token usage is exceeded, you get a 429 Too Many Requests status code.

You could be getting this error because high throughput and parallel requests hit the policy together and exhaust the token quota before the Retry-After interval has elapsed.

"Since these are parallel requests, the token count drops all at once, and before it can satisfy the remaining-tokens condition (<= 600) it exceeds the tokens-per-minute limit of 10k, so the 429 is thrown at that point."

Yes, indeed. You only get a 429 status code when the token quota is exhausted. You can verify this in your own logs: the status code appears whenever the remaining token count is zero.
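
To make that concrete, here is a rough Python model of the routing logic in the posted inbound policy. It is only a sketch under stated assumptions, not a description of APIM internals: requests arrive in batches, every request in a batch sees the same remaining count, usage is charged only once the responses return, and every served request is charged no matter which backend handled it. The function name `simulate` and all of its parameters are made up for illustration.

```python
from collections import Counter


def simulate(limit, threshold, tokens_per_request, concurrency, batches):
    """Toy model of the posted routing logic (hypothetical, not APIM itself).

    Assumptions: requests arrive in batches of `concurrency`; each request in
    a batch sees the same remaining count; usage is charged only after the
    batch completes; every served request is charged, whichever backend
    handled it.
    """
    used = 0
    outcomes = Counter()
    for _ in range(batches):
        remaining = max(limit - used, 0)
        served = 0
        for _ in range(concurrency):
            if remaining == 0:
                outcomes["429"] += 1        # limiter rejects before routing
            elif remaining <= threshold:
                outcomes["fallback"] += 1   # the "remaining <= threshold" branch
                served += 1
            else:
                outcomes["primary"] += 1    # the <otherwise> branch
                served += 1
        used += served * tokens_per_request  # charged after responses return
    return dict(outcomes)


# Numbers loosely based on the trace: 3 users, roughly 1,300 tokens per call.
print(simulate(10_000, 600, 1_300, 3, 4))
print(simulate(10_000, 3_900, 1_300, 3, 4))
```

With a threshold of 600, the model serves nine requests from the primary backend and then rejects the next three with 429, never taking the fallback branch. The remaining count jumps from well above 600 straight to zero because a single batch consumes about 3,900 tokens. Raising the threshold to 3,900 lets the fallback branch fire, but under the assumption that fallback traffic is charged against the same counter, the limiter is still exhausted one batch later.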

So I would suggest that you either increase the tokens-per-minute value in your policy or wait for the retry-after logic to take effect.
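
If you go the waiting route, the caller has to honor the Retry-After header that the 429 responses in your trace carry. A minimal client-side sketch in Python could look like the following. The helper names `parse_retry_after` and `call_with_retry_after` are hypothetical, and `send` stands in for whatever function actually issues the HTTP request and returns a status code plus a headers mapping.

```python
import time


def parse_retry_after(value, default):
    """Return the wait in seconds, or `default` if the header is missing or not numeric."""
    try:
        return max(float(value), 0.0)
    except (TypeError, ValueError):
        return default


def call_with_retry_after(send, max_attempts=3, sleep=time.sleep, default_wait=1.0):
    """Call `send()` until it stops returning 429 or attempts run out.

    `send` is a placeholder for the real request function; it must return
    (status_code, headers). Header lookup here is case-sensitive, so a real
    client should use a case-insensitive headers object.
    """
    for attempt in range(1, max_attempts + 1):
        status, headers = send()
        if status != 429 or attempt == max_attempts:
            return status, headers
        sleep(parse_retry_after(headers.get("Retry-After"), default_wait))


# Demo with canned responses and a recording sleep, so nothing actually waits.
canned = iter([(429, {"Retry-After": "2"}), (200, {})])
recorded_waits = []
status, _ = call_with_retry_after(lambda: next(canned), sleep=recorded_waits.append)
print(status, recorded_waits)
```

Passing `sleep` as a parameter keeps the helper easy to test. In a load test with several parallel users, each user would back off independently, so the quota can still be exhausted again as soon as the window resets.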
