Anyone facing latency issues using AsyncAzureOpenAI client on LLM deployments?

Stanislav Rubint 5 Reputation points
2025-08-15T07:44:53.18+00:00

I've been using the AsyncAzureOpenAI Python client for LLM calls successfully for the last few months, and suddenly the response latency became terrible.

Everything is fine with the synchronous client, which tells us there is no problem with the subscription, the deployed model, or the application itself. But we spent months building our system on the language's async features and cannot simply switch to the synchronous version. I tried the 4o, 4.1, and 5 models, mini and nano versions as well, without success.

Any hints? Known issues?

Thank you for anything 🙏

Stanislav

Azure AI Language
An Azure service that provides natural language capabilities including sentiment analysis, entity extraction, and automated question answering.

2 answers

Sort by: Most helpful
  1. Nikhil Jha (Accenture International Limited) 240 Reputation points Microsoft External Staff Moderator
    2025-08-19T08:53:13.45+00:00

    Hello Stanislav Rubint,

    Thank you for the excellent detailed report. Your troubleshooting steps, especially isolating the issue to the /responses endpoint and confirming it with curl, are helpful and point strongly towards a service-side issue rather than a problem with your code or the Python library.

    From what you describe, the key point is that:

    • chat.completions endpoint is returning normally (~0.8s latency).
    • responses endpoint consistently hangs in "response.in_progress" and never completes, even when streaming is enabled.
    • The issue reproduces across different models (gpt-4o, 4.1, 5, mini, nano), which suggests it’s not specific to a single deployment.
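
While you wait for a service-side fix, one way to guard against the indefinite "response.in_progress" hang is to wrap each call in a client-side timeout so a stuck request fails fast instead of blocking your pipeline. A minimal sketch (the deadline value and the stand-in coroutine are arbitrary examples, not part of any SDK):

```python
import asyncio

async def with_timeout(coro, seconds: float = 30.0):
    """Run an awaitable, cancelling it if it hangs past the deadline."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        # The call never completed; surface that instead of hanging forever.
        return None

# Demo with a stand-in coroutine instead of a real API call:
async def slow_call():
    await asyncio.sleep(5)
    return "done"

async def main():
    result = await with_timeout(slow_call(), seconds=0.1)
    print(result)  # None: the call exceeded the deadline and was cancelled

asyncio.run(main())
```

In a real setup you would pass the AsyncAzureOpenAI call in place of `slow_call()` and log or retry on `None`.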

    Several developers have reported similar issues; please refer to the related Q&A thread.

    Suggestions to try:

    1. Lower max_output_tokens (e.g., from 10000 down to 2048) to see if the response completes — in some cases, high values trigger buffering behavior that delays or prevents output.
    2. Ensure your client is reading the stream fully — the SDK should consume events until the response.completed event arrives. If the stream consumer closes early, the response can hang indefinitely.
    3. Check regional behavior — your logs show Sweden Central. If possible, try deploying in another region (e.g., East US, West Europe) to see if latency differs.
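
To make point (2) concrete, here is a minimal, SDK-agnostic sketch of a consumer that drains an event stream until the terminal response.completed event. The `Event` dataclass and `fake_stream` here are simplified stand-ins for the SDK's streaming event objects, not real library types:

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Event:
    # Simplified stand-in for the SDK's streaming event objects.
    type: str
    delta: str = ""

async def consume_response_stream(events) -> str:
    """Collect text deltas, stopping only at the terminal event.

    Returning (or closing the connection) before response.completed
    arrives is exactly the early-close pattern that can leave the
    server-side response stuck in progress.
    """
    chunks = []
    async for event in events:
        if event.type == "response.output_text.delta":
            chunks.append(event.delta)
        elif event.type == "response.completed":
            break  # terminal event: the stream is fully drained
    return "".join(chunks)

# Demo with a fake stream:
async def fake_stream():
    for ev in (Event("response.output_text.delta", "Hello, "),
               Event("response.output_text.delta", "world."),
               Event("response.completed")):
        yield ev

print(asyncio.run(consume_response_stream(fake_stream())))  # Hello, world.
```

The key point is that the loop exits on the terminal event, not when the text "looks complete".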

    Monitoring and Updates:

    • Azure Status Page: Monitor for official updates about Sweden Central region.
    • Community Threads: Several users are tracking this issue—check for updates on the linked threads.
    • Your Approach: Your debugging methodology looks good and has helped identify the issue.

    Please take a minute to accept this answer if you appreciated the inputs.

    Thank you 😊


  2. Stanislav Rubint 5 Reputation points
    2025-08-19T09:54:13.4166667+00:00

    Hello,

    thank you for the suggestions. But even basic prompts do not work on swedencentral. Sometimes I even get this error:

    You can retry your request, or contact us through an Azure support request at: https://go.microsoft.com/fwlink/?linkid=2213926 if you keep seeing this error. (Please include the request ID ea027b70-e548-44f6-b9cd-37ea9368a359 in your email.)', 'type': 'server_error', 'param': None, 'code': None}}
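
Since the message itself suggests retrying, a simple retry-with-backoff guard for transient server_error responses can at least paper over occasional failures (a sketch; the attempt count, delays, and `RuntimeError` stand-in for the SDK's error type are arbitrary examples):

```python
import asyncio

async def call_with_retries(func, attempts: int = 3, base_delay: float = 1.0):
    """Retry a transient-failure-prone async call with exponential backoff."""
    for attempt in range(attempts):
        try:
            return await func()
        except RuntimeError:  # stand-in for the SDK's server_error exception
            if attempt == attempts - 1:
                raise  # out of attempts: let the error propagate
            await asyncio.sleep(base_delay * (2 ** attempt))

# Demo: fails twice, then succeeds.
calls = {"n": 0}

async def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("server_error")
    return "ok"

print(asyncio.run(call_with_retries(flaky, base_delay=0.01)))  # ok
```

Of course this does not help when the region hangs on every request, as below.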
    

    I also tried the Poland region, with great success. Using this snippet:

    import asyncio
    import statistics
    import time

    import aiohttp

    API_KEY: str = "<MY_API_KEY>"
    ENDPOINT: str = "https://<MY_RESOURCE>.cognitiveservices.azure.com/openai/v1/responses?api-version=preview"
    N_REQUESTS = 100
    
    HEADERS = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    }
    
    PAYLOAD = {
        "model": "gpt-4.1-mini",
        "input": "Be concise. Respond with exactly 9 words: 'The quick brown fox jumps over the lazy dog.'",
    }
    
    async def fetch(session, i):
        """Send a single request and measure latency"""
        start = time.perf_counter()
        async with session.post(ENDPOINT, headers=HEADERS, json=PAYLOAD) as resp:
            response = await resp.json()  # consume response (optional)
        end = time.perf_counter()
        latency = (end - start) * 1000  # ms
        try:
            print(f"{response['output'][0]['content'][0]['text']}")
        except KeyError:
            print("Error: Response format unexpected or missing content.")
            print(f"Response: {response}")
        print(f"Request {i} latency: {latency:.2f} ms")
        return latency
    
    
    async def run_load_test():
        async with aiohttp.ClientSession(timeout=aiohttp.ClientTimeout(total=60)) as session:
            tasks = [fetch(session, i) for i in range(N_REQUESTS)]
            latencies = await asyncio.gather(*tasks)
    
        avg_latency = statistics.mean(latencies)
        jitter = statistics.pstdev(latencies)  # population stddev
        min_latency = min(latencies)
        max_latency = max(latencies)
        print("\n--- Results ---")
        print(f"Total requests: {N_REQUESTS}")
        print(f"Average latency: {avg_latency:.2f} ms")
        print(f"Jitter (stddev): {jitter:.2f} ms")
        print(f"Min latency: {min_latency:.2f} ms")
        print(f"Max latency: {max_latency:.2f} ms")
    
    
    if __name__ == "__main__":
        asyncio.run(run_load_test())
    

    I was able to confirm that the problem lies in the Responses API in swedencentral:

    Server Poland:
    --- Results ---
    Total requests: 100
    Average latency: 1563.12 ms
    Jitter (stddev): 492.06 ms
    Min latency: 1064.30 ms
    Max latency: 3316.79 ms
    
    Server Swedencentral:
    --- Results ---
    Total requests: 100
    Average latency: 13854.43 ms
    Jitter (stddev): 9689.98 ms
    Min latency: 2154.94 ms
    Max latency: 57902.09 ms
    

    Thank you in advance for an update,
    Stanley

