Hello Stanislav Rubint,
Thank you for the excellent, detailed report. Your troubleshooting steps, especially isolating the issue to the `/responses` endpoint and confirming it with curl, are helpful and point strongly towards a service-side issue rather than a problem with your code or the Python library.
From what you describe, the key points are:

- The `chat.completions` endpoint is returning normally (~0.8s latency).
- The `responses` endpoint consistently hangs in `"response.in_progress"` and never completes, even when streaming is enabled.
- The issue reproduces across different models (`gpt-4o`, `4.1`, `5`, `mini`, `nano`), which suggests it's not specific to a single deployment.
Several developers have reported similar issues; please refer to the linked Q&A thread below.
Suggestions to try:
- Lower `max_output_tokens` (e.g., from `10000` down to `2048`) to see if the response completes; in some cases, high values trigger buffering behavior that delays or prevents output.
- Ensure your client is reading the stream fully: the SDK should consume events until the `response.completed` event. If the stream consumer closes early, the response can hang indefinitely.
- Check regional behavior: your logs show `Sweden Central`. If possible, try deploying in another region (e.g., `East US`, `West Europe`) to see if latency differs.
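The "consume the stream fully" point above can be sketched as a loop like the one below. The event type names (`response.output_text.delta`, `response.completed`) are the standard Responses API streaming events; the mock events are stand-ins so the shape of the loop is clear without a live endpoint. With the real SDK, the iterable would come from `client.responses.create(..., stream=True)`.

```python
import time
from types import SimpleNamespace

def consume_stream(events, timeout_s=60.0):
    """Drain a Responses-style event stream until `response.completed`.

    `events` is any iterable of objects with a `.type` attribute,
    mirroring the SDK's streaming events. Returns the concatenated
    text deltas. Raises TimeoutError if the terminal event has not
    arrived within `timeout_s` (checked between events; a stream that
    goes fully silent also needs a transport-level timeout on the
    HTTP client).
    """
    deadline = time.monotonic() + timeout_s
    chunks = []
    for event in events:
        if time.monotonic() > deadline:
            raise TimeoutError("no response.completed within timeout; stream appears stuck")
        if event.type == "response.output_text.delta":
            chunks.append(event.delta)
        elif event.type == "response.completed":
            return "".join(chunks)
    raise RuntimeError("stream ended without a response.completed event")

# Mock events standing in for a healthy stream:
healthy = [
    SimpleNamespace(type="response.created"),
    SimpleNamespace(type="response.output_text.delta", delta="Hel"),
    SimpleNamespace(type="response.output_text.delta", delta="lo"),
    SimpleNamespace(type="response.completed"),
]
print(consume_stream(healthy))  # Hello
```

The key design point is that the loop only stops on the terminal `response.completed` event (or an explicit timeout), never after the first delta, which is the early-close behavior that can leave the service-side response hanging.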
Monitoring and Updates:
- Azure Status Page: Monitor for official updates about Sweden Central region.
- Community Threads: Several users are tracking this issue; check the linked threads for updates.
- Your Approach: Your debugging methodology looks good and has helped identify the issue.
Reference links (for better understanding):
- From the Microsoft Tech Community
- From GitHub, on AsyncAzureClient
- From Microsoft Q&A, a customer facing a similar issue
Please take a minute to accept this answer if you appreciated the inputs.
Thank you 😊