Hello dillon.bailey,
Welcome to the Microsoft Q&A and thank you for posting your questions here.
I understand that you are seeing instability with the Azure AI Foundry Agent service.
There have been several similar reports, but what you describe in your scenario can be managed. How to resolve it depends on the following:
- Use single-call streaming or `createAndPoll` / `createThreadAndRun`. This eliminates the race where the run exists server-side but is not yet streamable; the SDK provides helpers that either stream the run in one call or poll on your behalf. Use these methods instead of `create()` followed immediately by `.stream()`. Here is a JS/TS example (documented in the Microsoft docs):
```typescript
// Install:
// npm install @azure/ai-projects @azure/identity
import { AIProjectClient } from "@azure/ai-projects";
import { DefaultAzureCredential } from "@azure/identity";

const endpoint = process.env.PROJECT_ENDPOINT!;
const client = new AIProjectClient(endpoint, new DefaultAzureCredential(), {
  // optional: tune client retry policy
  retryOptions: { maxRetries: 5, retryDelayInMs: 1000, maxRetryDelayInMs: 10000 }
});

// Preferred: create + stream in one operation (SDK example)
async function runAgentAndStream(threadId: string, agentId: string) {
  try {
    // SDK sample pattern (create then stream) that avoids the manual race when used as shown in the docs:
    const stream = await client.runs.create(threadId, agentId).stream();
    for await (const event of stream) {
      // event is one of the RunStreamEvent types (message.delta, error, done, etc.)
      // handle messages and tool calls here
      console.log("Stream event:", event);
      if (event.type === "error") {
        console.error("Run error:", event);
      }
      if (event.type === "done") {
        console.log("Run finished.");
      }
    }
  } catch (err) {
    console.error("Stream failed:", err);
    throw err;
  }
}
```

This exact approach (`client.runs.create(...).stream()` / `createAndPoll(...)` / `createThreadAndRun(...)`) is present in Microsoft's JS docs and samples. Use `createAndPoll` if you want the SDK to poll for readiness for you.
  - https://learn.microsoft.com/en-us/javascript/api/overview/azure/ai-agents-readme
- A robust fallback pattern is another option if you must separate create and stream. If your runtime forces you to call `create()` separately (legacy code or tooling), use:
  - Idempotency metadata: before create, generate a client `requestId` and include it in the run `metadata`. The REST/SDK create-run operation supports `metadata`. If `create()` times out, call `listRuns(threadId)` and search for a run whose metadata includes your `requestId`. That tells you the server actually created the run.
    - https://learn.microsoft.com/en-us/rest/api/aifoundry/aiagents/runs/create-run?view=rest-aifoundry-aiagents-v1
  - Exponential backoff polling for run readiness, not a blind immediate `.stream()` call. TS/Node sample (robust):
Check the following links:
- https://learn.microsoft.com/en-us/rest/api/aifoundry/aiagents/runs/list-runs?view=rest-aifoundry-aiagents-v1
- https://learn.microsoft.com/en-us/azure/ai-foundry/agents/how-to/tools/function-calling

```typescript
import { AIProjectClient } from "@azure/ai-projects";
import { DefaultAzureCredential } from "@azure/identity";
import { setTimeout as sleep } from "node:timers/promises";
import { v4 as uuidv4 } from "uuid";

const client = new AIProjectClient(process.env.PROJECT_ENDPOINT!, new DefaultAzureCredential(), {
  retryOptions: { maxRetries: 5, retryDelayInMs: 1000, maxRetryDelayInMs: 10000 }
});

async function createRunWithIdempotency(threadId: string, agentId: string, payload?: any) {
  const requestId = uuidv4();
  // attach metadata.clientRequestId so we can detect a server-created run
  const createBody = { ...payload, metadata: { clientRequestId: requestId } };

  let run: any = undefined;
  try {
    // SDK call - exact parameter shape depends on SDK version; many SDKs accept body or options
    run = await client.runs.create(threadId, agentId, { body: createBody });
  } catch (err) {
    // If create timed out or errored, check whether the server created the run anyway
    console.warn("create() failed; checking if run was created anyway:", err);
    const runsIter = client.runs.list(threadId, { order: "desc", limit: 10 });
    // page through recent runs
    for await (const r of runsIter) {
      if (r.metadata?.clientRequestId === requestId) {
        run = r;
        console.log("Found server-side run created despite create() error:", run.id);
        break;
      }
    }
    if (!run) throw err; // rethrow original error if no server-side run found
  }

  // Poll run readiness with exponential backoff
  const deadline = Date.now() + 5 * 60_000; // 5 minutes
  let delay = 1000;
  while (Date.now() < deadline) {
    const current = await client.runs.get(threadId, run.id);
    if (current.status === "in_progress" || current.status === "queued") {
      // not ready yet; keep polling
    } else {
      // terminal (completed/failed/etc.): break to process
      break;
    }
    await sleep(delay);
    delay = Math.min(8000, Math.floor(delay * 1.8));
  }

  // Now stream events (or GET messages if streaming is not supported);
  // the exact streaming helper name may vary by SDK version
  try {
    const eventStream = await client.runs.events(threadId, run.id);
    for await (const ev of eventStream) {
      console.log("event:", ev);
    }
  } catch (evErr) {
    // fallback: read final messages
    console.warn("Streaming events failed, falling back to messages list:", evErr);
    const messages = await client.threads.listMessages(threadId);
    return messages;
  }
}
```
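As an aside, the backoff schedule in the polling loop above (start at 1 second, multiply by 1.8, cap at 8 seconds) can be factored into a small pure helper, which also makes it easy to unit-test. This is an illustrative sketch with a hypothetical helper name, not part of the Azure SDK:

```typescript
// Hypothetical helper: compute an exponential backoff schedule.
// startMs: initial delay; factor: growth multiplier; capMs: maximum delay;
// attempts: how many delays to produce.
function backoffSchedule(startMs: number, factor: number, capMs: number, attempts: number): number[] {
  const delays: number[] = [];
  let delay = startMs;
  for (let i = 0; i < attempts; i++) {
    delays.push(delay);
    // same update rule as the polling loop: grow, floor, cap
    delay = Math.min(capMs, Math.floor(delay * factor));
  }
  return delays;
}

// With the values from the sample above:
console.log(backoffSchedule(1000, 1.8, 8000, 6));
// [1000, 1800, 3240, 5832, 8000, 8000]
```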
- Set client retry options when constructing `AIProjectClient`, e.g. `retryOptions: { maxRetries: 5, retryDelayInMs: 1000, maxRetryDelayInMs: 10000 }`. Use them to safely retry transient 5xx/408/504 failures.
  - https://learn.microsoft.com/en-us/javascript/api/%40azure/ai-projects/aiprojectclientoptionalparams?view=azure-node-preview
- Do not blindly retry `create()` on client-side timeouts; use the metadata `requestId` + `listRuns` to detect whether the server already created the run (de-duplication).
  - https://learn.microsoft.com/en-us/rest/api/aifoundry/aiagents/runs/create-run?view=rest-aifoundry-aiagents-v1
- In the Azure AI Foundry project portal: Tracing > connect an Application Insights resource. Instrument your Node.js app with the Azure Monitor OpenTelemetry distro.
  - https://learn.microsoft.com/en-us/azure/ai-foundry/agents/concepts/tracing
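To make the "retry transient failures only" rule concrete, here is a tiny illustrative predicate that mirrors the 5xx/408/504 guidance above. It is a hypothetical helper for your own wrapper code, not an SDK API:

```typescript
// Hypothetical helper: decide whether an HTTP status is safe to retry
// under the guidance above (transient server errors and timeouts only).
function isTransientStatus(status: number): boolean {
  if (status === 408) return true;                  // request timeout
  if (status >= 500 && status <= 599) return true;  // 5xx, includes 504
  return false;                                     // 4xx client errors: do not retry
}

console.log(isTransientStatus(504)); // true  - retry (with the requestId de-dup check)
console.log(isTransientStatus(403)); // false - fix permissions instead of retrying
```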
- Create alerts for spikes of `StatusCode` 5xx or run failures.
  - https://learn.microsoft.com/en-us/azure/ai-foundry/agents/how-to/metrics
- Enable continuous evaluation and the Foundry dashboards to watch run success rate, latency, and tool failures.
  - https://learn.microsoft.com/en-us/azure/ai-foundry/how-to/continuous-evaluation-agents
- If you are using Managed Identity, ensure the identity has the right roles on the target resource. If the error is a 403/500 from the internal token service, gather tracing data and Request-IDs and file a support ticket with Microsoft via the Azure portal.
I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.
Please don't forget to close out the thread by upvoting and accepting this as an answer if it is helpful.