Anyone facing latency issues using AsyncAzureOpenAI client on LLM deployments?

Question

Anyone facing latency issues using AsyncAzureOpenAI client on LLM deployments?

Stanislav Rubint 5

Ive been using the AsyncAzureOpenAI python client for LLM calls succesfully for last months and suddenly and abruptly the latency of response is terrible.

Everything is fine using synchronous client that is telling us that there is no problem with the subscription, deployed model nor application itself, but we spent months building our system on async features of language and we cannot just switch to the synchronous version. I tried 4o, 4.1 and 5 models, mini and nano versions as well without success.

Any hints? Recognised problems?

Thank you for anything 🙏

Stanislav

Stanislav Rubint 5

using this snipped I pinpointed issue to the responses API:

async def main():
    import time, httpx

    tstart = time.time()
    print("Generating content...")
    resp = await client.chat.completions.create(
        model="YOUR_DEPLOYMENT_NAME", messages=[{"role": "user", "content": "Hi"}], max_tokens=50
    )
    print(resp.choices[0].message.content)
    tend = time.time()
    print(f"Response time: {tend - tstart:.2f} seconds\n\n")

    tstart = time.time()
    response = await client.responses.create(
        input="Create itinerary for a France vacation, include 3 cities.",
        instructions="you are a helpful travel planner",
        model=model_name,
        max_output_tokens=10000,
    )
    tend = time.time()

    print(response.output_text)
    print(f"Response time: {tend - tstart:.2f} seconds")

❯ python main.py
DEBUG:asyncio:Using selector: EpollSelector
Generating content...
DEBUG:openai._base_client:Request options: {'method': 'post', 'url': '/chat/completions', 'headers': {'api-key': '<redacted>'}, 'files': None, 'idempotency_key': 'stainless-python-retry-b9ab847d-d524-4b20-b16a-faee6c70aa24', 'json_data': {'messages': [{'role': 'user', 'content': 'Hi'}], 'model': 'YOUR_DEPLOYMENT_NAME', 'max_tokens': 50}}
DEBUG:openai._base_client:Sending HTTP Request: POST https://webygroup-ai-resource.cognitiveservices.azure.com/openai/deployments/gpt-4.1-mini/chat/completions?api-version=2025-04-01-preview
DEBUG:httpcore.connection:connect_tcp.started host='webygroup-ai-resource.cognitiveservices.azure.com' port=443 local_address=None timeout=5.0 socket_options=None
DEBUG:httpcore.connection:connect_tcp.complete return_value=<httpcore._backends.anyio.AnyIOStream object at 0x730a4ada4080>
DEBUG:httpcore.connection:start_tls.started ssl_context=<ssl.SSLContext object at 0x730a4b2b6650> server_hostname='webygroup-ai-resource.cognitiveservices.azure.com' timeout=5.0
DEBUG:httpcore.connection:start_tls.complete return_value=<httpcore._backends.anyio.AnyIOStream object at 0x730a4af3dbb0>
DEBUG:httpcore.http11:send_request_headers.started request=<Request [b'POST']>
DEBUG:httpcore.http11:send_request_headers.complete
DEBUG:httpcore.http11:send_request_body.started request=<Request [b'POST']>
DEBUG:httpcore.http11:send_request_body.complete
DEBUG:httpcore.http11:receive_response_headers.started request=<Request [b'POST']>
DEBUG:httpcore.http11:receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Content-Length', b'1245'), (b'Content-Type', b'application/json'), (b'apim-request-id', b'091e77b1-f4d5-42ad-b9b3-ee10ba51ac80'), (b'Strict-Transport-Security', b'max-age=31536000; includeSubDomains; preload'), (b'x-content-type-options', b'nosniff'), (b'x-ms-region', b'Sweden Central'), (b'x-ratelimit-remaining-requests', b'1998'), (b'x-ratelimit-limit-requests', b'2000'), (b'x-ratelimit-remaining-tokens', b'1999998'), (b'x-ratelimit-limit-tokens', b'2000000'), (b'azureml-model-session', b'd006-20250728230048'), (b'x-accel-buffering', b'no'), (b'x-ms-rai-invoked', b'true'), (b'x-request-id', b'54955eea-8f21-4901-98dd-ec04b83b9715'), (b'x-ms-deployment-name', b'gpt-4.1-mini'), (b'Date', b'Fri, 15 Aug 2025 08:26:22 GMT')])
INFO:httpx:HTTP Request: POST https://webygroup-ai-resource.cognitiveservices.azure.com/openai/deployments/gpt-4.1-mini/chat/completions?api-version=2025-04-01-preview "HTTP/1.1 200 OK"
DEBUG:httpcore.http11:receive_response_body.started request=<Request [b'POST']>
DEBUG:httpcore.http11:receive_response_body.complete
DEBUG:httpcore.http11:response_closed.started
DEBUG:httpcore.http11:response_closed.complete
DEBUG:openai._base_client:HTTP Response: POST https://webygroup-ai-resource.cognitiveservices.azure.com/openai/deployments/gpt-4.1-mini/chat/completions?api-version=2025-04-01-preview "200 OK" Headers({'content-length': '1245', 'content-type': 'application/json', 'apim-request-id': '091e77b1-f4d5-42ad-b9b3-ee10ba51ac80', 'strict-transport-security': 'max-age=31536000; includeSubDomains; preload', 'x-content-type-options': 'nosniff', 'x-ms-region': 'Sweden Central', 'x-ratelimit-remaining-requests': '1998', 'x-ratelimit-limit-requests': '2000', 'x-ratelimit-remaining-tokens': '1999998', 'x-ratelimit-limit-tokens': '2000000', 'azureml-model-session': 'd006-20250728230048', 'x-accel-buffering': 'no', 'x-ms-rai-invoked': 'true', 'x-request-id': '54955eea-8f21-4901-98dd-ec04b83b9715', 'x-ms-deployment-name': 'gpt-4.1-mini', 'date': 'Fri, 15 Aug 2025 08:26:22 GMT'})
DEBUG:openai._base_client:request_id: 54955eea-8f21-4901-98dd-ec04b83b9715
Hello! How can I assist you today?
Response time: 0.80 seconds


DEBUG:openai._base_client:Request options: {'method': 'post', 'url': '/responses', 'headers': {'api-key': '<redacted>'}, 'files': None, 'idempotency_key': 'stainless-python-retry-72c504b1-6b64-4942-bae9-f82959a722d3', 'json_data': {'input': 'Create itinerary for a France vacation, include 3 cities.', 'instructions': 'you are a helpful travel planner', 'max_output_tokens': 10000, 'model': 'gpt-4.1-mini'}}
DEBUG:openai._base_client:Sending HTTP Request: POST https://webygroup-ai-resource.cognitiveservices.azure.com/openai/responses?api-version=2025-04-01-preview
DEBUG:httpcore.http11:send_request_headers.started request=<Request [b'POST']>
DEBUG:httpcore.http11:send_request_headers.complete
DEBUG:httpcore.http11:send_request_body.started request=<Request [b'POST']>
DEBUG:httpcore.http11:send_request_body.complete
DEBUG:httpcore.http11:receive_response_headers.started request=<Request [b'POST']>

And here it hangs from hundreds of seconds up to indefinitely.

Stanislav Rubint 5

further pinpointed issue to the response being stuck in "response.in_progress":

❯ time curl -v -X POST "https://###.cognitiveservices.azure.com/openai/responses?api-version=2025-04-01-preview" \
  -H "Content-Type: application/json" \
  -H "api-key: $AZURE_API_KEY" \
  -d '{
    "input": "Create itinerary for a France vacation, include 3 cities.",
    "instructions": "you are a helpful travel planner",
    "max_output_tokens": 10000,
    "model": "gpt-4.1-mini","stream":true
  }'

Note: Unnecessary use of -X or --request, POST is already inferred.
* Host ###.cognitiveservices.azure.com:443 was resolved.
* IPv6: (none)
* IPv4: 51.12.73.214
*   Trying 51.12.73.214:443...
* ALPN: curl offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384 / secp384r1 / RSASSA-PSS
* ALPN: server accepted h2
* Server certificate:
*  subject: C=US; ST=WA; L=Redmond; O=Microsoft Corporation; CN=swedencentral.api.cognitive.microsoft.com
*  start date: Jul 17 20:45:14 2025 GMT
*  expire date: Jan 13 20:45:14 2026 GMT
*  subjectAltName: host "webygroup-ai-resource.cognitiveservices.azure.com" matched cert's "*.cognitiveservices.azure.com"
*  issuer: C=US; O=Microsoft Corporation; CN=Microsoft Azure RSA TLS Issuing CA 04
*  SSL certificate verify ok.
*   Certificate level 0: Public key type RSA (2048/112 Bits/secBits), signed using sha384WithRSAEncryption
*   Certificate level 1: Public key type RSA (4096/152 Bits/secBits), signed using sha384WithRSAEncryption
*   Certificate level 2: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
* Connected to webygroup-ai-resource.cognitiveservices.azure.com (51.12.73.214) port 443
* using HTTP/2
* [HTTP/2] [1] OPENED stream for https://###-ai-resource.cognitiveservices.azure.com/openai/responses?api-version=2025-04-01-preview
* [HTTP/2] [1] [:method: POST]
* [HTTP/2] [1] [:scheme: https]
* [HTTP/2] [1] [:authority: webygroup-ai-resource.cognitiveservices.azure.com]
* [HTTP/2] [1] [:path: /openai/responses?api-version=2025-04-01-preview]
* [HTTP/2] [1] [user-agent: curl/8.12.1]
* [HTTP/2] [1] [accept: */*]
* [HTTP/2] [1] [content-type: application/json]
* [HTTP/2] [1] [api-key: ###]
* [HTTP/2] [1] [content-length: 209]
> POST /openai/responses?api-version=2025-04-01-preview HTTP/2
> Host: ###.cognitiveservices.azure.com
> User-Agent: curl/8.12.1
> Accept: */*
> Content-Type: application/json
> api-key: ###
> Content-Length: 209
> 
* upload completely sent off: 209 bytes
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
< HTTP/2 200 
< content-type: text/event-stream; charset=utf-8
< x-ms-middleware-request-id: 96d91594-9f64-4e9e-a368-a0b8b439928a
< x-ms-region: Sweden Central
< apim-request-id: 96d91594-9f64-4e9e-a368-a0b8b439928a
< x-request-id: 3dd1b89c-0fe8-46b0-941c-21233b265480
< x-content-type-options: nosniff
< strict-transport-security: max-age=31536000; includeSubDomains; preload
< x-ms-deployment-name: gpt-4.1-mini
< x-ratelimit-remaining-requests: 1997
< x-ratelimit-limit-requests: 2000
< date: Fri, 15 Aug 2025 08:52:48 GMT
< 
event: response.created
data: {"type":"response.created","sequence_number":0,"response":{"id":"resp_689ef55bc0b88190876825ba5c8677b507c49072c25f8674","object":"response","created_at":1755247965,"status":"in_progress","background":false,"content_filters":null,"error":null,"incomplete_details":null,"instructions":"you are a helpful travel planner","max_output_tokens":10000,"max_tool_calls":null,"model":"gpt-4.1-mini","output":[],"parallel_tool_calls":true,"previous_response_id":null,"prompt_cache_key":null,"reasoning":{"effort":null,"summary":null},"safety_identifier":null,"service_tier":"auto","store":true,"temperature":1.0,"text":{"format":{"type":"text"}},"tool_choice":"auto","tools":[],"top_p":1.0,"truncation":"disabled","usage":null,"user":null,"metadata":{}}}

event: response.in_progress
data: {"type":"response.in_progress","sequence_number":1,"response":{"id":"resp_689ef55bc0b88190876825ba5c8677b507c49072c25f8674","object":"response","created_at":1755247965,"status":"in_progress","background":false,"content_filters":null,"error":null,"incomplete_details":null,"instructions":"you are a helpful travel planner","max_output_tokens":10000,"max_tool_calls":null,"model":"gpt-4.1-mini","output":[],"parallel_tool_calls":true,"previous_response_id":null,"prompt_cache_key":null,"reasoning":{"effort":null,"summary":null},"safety_identifier":null,"service_tier":"auto","store":true,"temperature":1.0,"text":{"format":{"type":"text"}},"tool_choice":"auto","tools":[],"top_p":1.0,"truncation":"disabled","usage":null,"user":null,"metadata":{}}}

2 answers

Your answer

Answer 1

Hello Stanislav Rubint,

Thank you for the excellent detailed report. Your troubleshooting steps, especially isolating the issue to the /responses endpoint and confirming it with curl, are helpful and point strongly towards a service-side issue rather than a problem with your code or the Python library.

From what you describe, the key point is that:

chat.completions endpoint is returning normally (~0.8s latency).
responses endpoint consistently hangs in "response.in_progress" and never completes, even when streaming is enabled.
The issue reproduces across different models (gpt-4o, 4.1, 5, mini, nano), which suggests it’s not specific to a single deployment.

Several developers have reported similar issues: Please refer Q&A thread

Suggestions to try:

Lower max_output_tokens (e.g., from 10000 down to 2048) to see if the response completes — in some cases, high values trigger buffering behavior that delays or prevents output.
Ensure your client is reading the stream fully — the SDK should consume until response.completed event. If the stream consumer closes early, the response can hang indefinitely.
Check regional behavior — your logs show Sweden Central. If possible, try deploying in another region (e.g., East US, West Europe) to see if latency differs.

Monitoring and Updates:

Azure Status Page: Monitor for official updates about Sweden Central region.
Community Threads: Several users are tracking this issue—check for updates on the linked threads.
Your Approach: Your debugging methodology looks good and has helped identify the issue.

Reference links (for better understanding):

Please take a minute to accept this answer if you appreciated the inputs.

Thank you 😊

Nikhil Jha (Accenture International Limited) 240 Reputation points Microsoft External Staff Moderator

2025-08-20T07:04:10.8066667+00:00

Hello Stanislav Rubint,

I hope this has been helpful! We appreciate hearing from you and would love to help others who may have the same question. Accepting answers helps increase visibility of this question for other members of the Microsoft Q&A community. Thank you for helping to improve Microsoft Q&A!

Answer 2

Hello,

thank you for suggestions. But even basic prompts does not work on swedencentral. Sometimes I even get this error:

You can retry your request, or contact us through an Azure support request at: https://go.microsoft.com/fwlink/?linkid=2213926 if you keep seeing this error. (Please include the request ID ea027b70-e548-44f6-b9cd-37ea9368a359 in your email.)', 'type': 'server_error', 'param': None, 'code': None}}

I tried also Poland server with great success. Using this snippet:

API_KEY: str = "<MY_API_KEY>"
ENDPOINT: str = "https://<MY_RESOURCE>.cognitiveservices.azure.com/openai/v1/responses?api-version=preview"
N_REQUESTS = 100

HEADERS = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}",
}

PAYLOAD = {
    "model": "gpt-4.1-mini",
    "input": "Be concise. Respond with exactly 9 words: 'The quick brown fox jumps over the lazy dog.'",
}

async def fetch(session, i):
    """Send a single request and measure latency"""
    start = time.perf_counter()
    async with session.post(ENDPOINT, headers=HEADERS, json=PAYLOAD) as resp:
        response = await resp.json()  # consume response (optional)
    end = time.perf_counter()
    latency = (end - start) * 1000  # ms
    try:
        print(f"{response['output'][0]['content'][0]['text']}")
    except KeyError:
        print("Error: Response format unexpected or missing content.")
        print(f"Response: {response}")
    print(f"Request {i} latency: {latency:.2f} ms")
    return latency


async def run_load_test():
    async with aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=60)) as session:
        tasks = [fetch(session, i) for i in range(N_REQUESTS)]
        latencies = await asyncio.gather(*tasks)

    avg_latency = statistics.mean(latencies)
    jitter = statistics.pstdev(latencies)  # population stddev
    min_latency = min(latencies)
    max_latency = max(latencies)
    print("\n--- Results ---")
    print(f"Total requests: {N_REQUESTS}")
    print(f"Average latency: {avg_latency:.2f} ms")
    print(f"Jitter (stddev): {jitter:.2f} ms")
    print(f"Min latency: {min_latency:.2f} ms")
    print(f"Max latency: {max_latency:.2f} ms")


if __name__ == "__main__":
    asyncio.run(run_load_test())

I was able to check that problem lies on responses api in swedencentral:

Server Poland:
--- Results ---
Total requests: 100
Average latency: 1563.12 ms
Jitter (stddev): 492.06 ms
Min latency: 1064.30 ms
Max latency: 3316.79 ms

Server Swedencentral:
--- Results ---
Total requests: 100
Average latency: 13854.43 ms
Jitter (stddev): 9689.98 ms
Min latency: 2154.94 ms
Max latency: 57902.09 ms

Thank you for update in advance,
Stanley

Stanislav Rubint 5 Reputation points

2025-08-19T10:06:15.1233333+00:00

the tests above were on gpt-4.1-mini. When I tried 4.1, which works on swedencentral, in poland sometimes I got this error:

{'error': {'code': 'InternalServerError', 'message': 'Backend returned unexpected response. Please contact Microsoft for help.'}}
Nikhil Jha (Accenture International Limited) 240 Reputation points Microsoft External Staff Moderator

2025-08-20T07:01:41.6633333+00:00

Hello Stanislav Rubint,
Thank you for your prompt response. You can use resources deployed in different region; whichever is providing you with low latency and better performance. It would work better for you without troubling your code with synchronous client and continuing your system with async features. I’ve followed up with you in a private chat for further clarification.
Nikhil Jha (Accenture International Limited) 240 Reputation points Microsoft External Staff Moderator

2025-08-20T11:29:31.12+00:00

Hello Stanislav Rubint,
There was latency issue in Sweden Central region last week, it was reported and mitigated, there were no new issues reported on the same. Could you please re-deploy in the same region and let me know if you are still facing the issue.

Share via

Anyone facing latency issues using AsyncAzureOpenAI client on LLM deployments?

2 answers

Your answer