Does the Azure Content Understanding classifier support confidence scores?

Cameron Kenny 0 Reputation points
2025-05-26T11:21:01.4633333+00:00

We're testing the 2025-05-01-preview classifier, and it's working well, but there’s no confidence score in the classification/splitting output. Is this expected? And is there a roadmap to include per-category confidence like Document Intelligence has for field extraction?

Example output:

{
  "id": "8a62e2b6-d021-4b6a-b767-7fdbf9c34227",
  "status": "Succeeded",
  "result": {
    "classifierId": "mortgage_classifier",
    "apiVersion": "2025-05-01-preview",
    "createdAt": "2025-05-26T10:34:30.0071401Z",
    "contents": [
      {"category": "company_filing", "kind": "document", "startPageNumber": 1, "endPageNumber": 22},
      {"category": "employment_contract", "kind": "document", "startPageNumber": 23, "endPageNumber": 35},
      {"category": "company_filing", "kind": "document", "startPageNumber": 36, "endPageNumber": 36},
      {"category": "payslips", "kind": "document", "startPageNumber": 37, "endPageNumber": 37}
    ]
  }
}

Azure AI Document Intelligence

2 answers

  1. Suwarna S Kale 3,951 Reputation points
    2025-08-28T02:24:08.69+00:00

    Hello Brikesh Kumar,

    Thank you for posting your question in the Microsoft Q&A forum. 

    This is a critical observation, and the behavior you're describing points to a specific and important characteristic of the Azure Document Intelligence classifier when operating in splitMode="auto" on very large files. 

    The key is to understand that the process involves two distinct steps: 

    1. Splitting: First, the service must find the boundaries between documents within your large PDF. It analyzes pages to determine where one document ends and the next begins. 
    2. Classification: Second, for each identified document span (page range), it must classify it into one of your trained categories. 

    Your problem is likely occurring in Step 1 (Splitting), not Step 2 (Classification). 

    Why is confidence zero for many splits? When you submit a 750-page PDF, the splitMode="auto" algorithm has to find logical break points across the whole file. It looks for signals like: 

    • Changes in layout, fonts, and formatting. 
    • The presence of what look like document headers or footers. 
    • Patterns that suggest a natural boundary. 

    For a file of that size, especially if it contains many similar-looking documents (like thousands of payslips or invoices from the same company), the splitting algorithm can become less confident about the exact boundaries. 

    When the splitting service has low confidence that it has found a true, distinct document, it still must return something. It will often return a page range, but it assigns a classification confidence of 0. This is a clear indicator that: "I found a block of pages, but I am not confident enough that it is a coherent document to even try classifying it against your trained models." 

    This is very different from the service being confident that the content is not one of your classes. It is saying that the foundational step, splitting, failed for that segment. 

    The "confidence": 0 is not a classification confidence problem per se; it's a splitting confidence problem. The service is telling you that it could not reliably isolate a discrete document from that specific page range in your massive file, so it doesn't trust the subsequent classification result enough to give it a score. 
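If a zero confidence does signal an unreliable split as described above, one practical response is to route those segments to manual review instead of trusting their category. The following is only a minimal sketch: the `docType`, `confidence`, and `pages` keys and the sample values are hypothetical placeholders for illustration, not real service output, so adapt them to the shape of your actual result.

```python
from typing import Any


def split_by_confidence(
    documents: list[dict[str, Any]], threshold: float = 0.0
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Separate classified segments into trusted and needs-review lists."""
    trusted, review = [], []
    for doc in documents:
        # Treat a missing or null confidence the same as zero.
        confidence = doc.get("confidence") or 0.0
        if confidence > threshold:
            trusted.append(doc)
        else:
            review.append(doc)
    return trusted, review


# Hypothetical segments for illustration only (assumed shape, not service output).
sample = [
    {"docType": "payslip", "confidence": 0.93, "pages": [1, 2]},
    {"docType": "payslip", "confidence": 0, "pages": [3, 7]},
    {"docType": "invoice", "confidence": 0.81, "pages": [8, 8]},
]

trusted, review = split_by_confidence(sample)
for doc in review:
    first, last = doc["pages"]
    print(f"Needs review: pages {first}-{last} (reported as {doc['docType']})")
```

Raising `threshold` above zero would also catch weak but non-zero scores; the right cutoff depends on your data.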

    Please let me know whether this response answers your question. If the above answer helped, please do not forget to "Accept Answer", as this may help other community members who face a similar issue. 🙂 

    1 person found this answer helpful.

  2. Suwarna S Kale 3,951 Reputation points
    2025-05-26T13:06:54.2966667+00:00

    Hello Cameron Kenny,

    Thank you for posting your question in the Microsoft Q&A forum. 

    The absence of confidence scores in the classification/splitting output of the 2025-05-01-preview classifier appears to be expected behavior, as confidence scores are typically provided for field extraction rather than classification. Currently, there is no explicit mention of per-category confidence scores for classification in the roadmap, but improvements to Document Intelligence are continuously being made. 

    In the meantime, you may consider the following options: 

    1. Reviewing the latest Azure AI Document Intelligence release notes for updates. 
    2. Checking if confidence scores are available in a different API version. 
    3. Exploring alternative methods, such as custom models, which provide confidence scores for extracted fields. 
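When comparing API versions, it can help to check programmatically whether any entry in `contents` carries a score. The sketch below uses a trimmed copy of the response posted in the question. The `confidence` key name is an assumption about what such a field might be called; the posted output contains no such field.

```python
import json

# Trimmed copy of the classifier response from the question.
response_text = """
{"status": "Succeeded",
 "result": {"classifierId": "mortgage_classifier",
            "apiVersion": "2025-05-01-preview",
            "contents": [
              {"category": "company_filing", "kind": "document",
               "startPageNumber": 1, "endPageNumber": 22},
              {"category": "employment_contract", "kind": "document",
               "startPageNumber": 23, "endPageNumber": 35}]}}
"""


def summarize_contents(response: dict) -> list[dict]:
    """Return one row per classified segment, with confidence if present."""
    rows = []
    for item in response.get("result", {}).get("contents", []):
        rows.append({
            "category": item.get("category"),
            "pages": (item.get("startPageNumber"), item.get("endPageNumber")),
            # "confidence" is an assumed key name; None means it was absent.
            "confidence": item.get("confidence"),
        })
    return rows


rows = summarize_contents(json.loads(response_text))
for row in rows:
    print(row)

has_confidence = any(row["confidence"] is not None for row in rows)
print("confidence present:", has_confidence)
```

Running the same check against responses from other API versions shows quickly whether a score has been added.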

    Some reference documentation that may help:

    If the above answer helped, please do not forget to "Accept Answer", as this may help other community members who face a similar issue. Your contribution to the Microsoft Q&A community is highly appreciated. 

