Hello Brikesh Kumar,
Thank you for posting your question in the Microsoft Q&A forum.
This is a critical observation, and the behavior you're describing points to a specific and important characteristic of the Azure Document Intelligence classifier when operating in splitMode="auto" on very large files.
The key is to understand that the process involves two distinct steps:
- Splitting: First, the service must find the boundaries between documents within your large PDF. It analyzes pages to determine where one document ends and the next begins.
- Classification: Second, for each identified document span (page range), it must classify it into one of your trained categories.
Your problem is likely occurring in Step 1 (Splitting), not Step 2 (Classification).
Why is confidence zero for many splits?
When you submit a 750-page PDF, the splitMode="auto" algorithm has to find logical break points across the whole file. It is likely looking for signals such as:
- Changes in layout, fonts, and formatting.
- The presence of what look like document headers or footers.
- Patterns that suggest a natural boundary.
For a file of that size, especially if it contains many similar-looking documents (such as hundreds of payslips or invoices from the same company), the splitting algorithm can become less confident about the exact boundaries.
When the splitting service has low confidence that it has found a true, distinct document, it still must return something. It will often return a page range, but it assigns a classification confidence of 0. This is a clear indicator that: "I found a block of pages, but I am not confident enough that it is a coherent document to even try classifying it against your trained models."
This is very different from the service being confident that the content is not one of your classes. It is saying that the foundational step, splitting, failed for that segment.
The "confidence": 0 is not a classification confidence problem per se; it's a splitting confidence problem. The service is telling you that it could not reliably isolate a discrete document from that specific page range in your massive file, so it doesn't trust the subsequent classification result enough to give it a score.
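To see which page ranges are affected, you could scan the classifier result for documents whose confidence is exactly 0. Below is a minimal sketch, not an official sample. It assumes you have already loaded the result as a plain Python dict shaped like the analyzeResult section of the response, with a "documents" list whose entries carry "docType", "confidence", and "boundingRegions" (each with a "pageNumber"). The helper names and the sample data are hypothetical and only for illustration.

```python
from __future__ import annotations

from typing import Any


def page_range(document: dict[str, Any]) -> tuple[int, int]:
    """Return (first_page, last_page) covered by one classified document."""
    pages = [region["pageNumber"] for region in document.get("boundingRegions", [])]
    if not pages:
        raise ValueError("document has no boundingRegions")
    return min(pages), max(pages)


def zero_confidence_spans(analyze_result: dict[str, Any]) -> list[dict[str, Any]]:
    """List the page ranges of documents reported with confidence exactly 0.

    Entries that have no "confidence" field at all are skipped rather than
    treated as zero.
    """
    spans = []
    for document in analyze_result.get("documents", []):
        if document.get("confidence") == 0:
            first, last = page_range(document)
            spans.append(
                {
                    "docType": document.get("docType"),
                    "firstPage": first,
                    "lastPage": last,
                }
            )
    return spans


# Hand-written sample in the assumed shape; this is not real service output.
sample_result = {
    "documents": [
        {
            "docType": "invoice",
            "confidence": 0.93,
            "boundingRegions": [{"pageNumber": 1}, {"pageNumber": 2}],
        },
        {
            "docType": "payslip",
            "confidence": 0,
            "boundingRegions": [{"pageNumber": 3}, {"pageNumber": 4}, {"pageNumber": 5}],
        },
        {
            "docType": "invoice",
            "confidence": 0.0,
            "boundingRegions": [{"pageNumber": 6}],
        },
    ]
}

for span in zero_confidence_spans(sample_result):
    print(f"pages {span['firstPage']}-{span['lastPage']}: {span['docType']} (confidence 0)")
```

Listing the affected page ranges this way lets you check whether the zero-confidence spans cluster in particular parts of the file, which would be consistent with a splitting issue rather than a classification issue.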
Please let me know whether this response answers your question. If it helped, please do not forget to click "Accept Answer", as this may help other community members who are facing a similar issue. 🙂