Hello Ashkan Tirgar,
Welcome to the Microsoft Q&A and thank you for posting your questions here.
I understand that you are facing difficulties in reliably identifying wage stubs within a mixed set of documents using Azure AI Document Intelligence, especially given the impracticality of training a custom classifier on the many variations of wage stubs.
The short answer to your first question is no: Microsoft's documentation confirms that composed models in Azure Document Intelligence can only combine custom models, so a prebuilt model such as `prebuilt-payStub.us` cannot be added to a composed model in the Studio interface. Instead, it must be integrated through orchestration in your application logic. The good news is that there is a best-practice workflow that achieves your goal without retraining on known formats.
Therefore, the most reliable architecture is a two-stage classification-and-extraction pipeline. In this approach, your application first calls the prebuilt pay stub model (`prebuilt-payStub.us` – docs), which returns `docType`, `confidence`, and per-field confidence values in the Analyze response. You then apply a calibrated confidence threshold based on your own validation data (Microsoft recommends deriving this from accuracy/confidence baselines rather than using a fixed universal number) and check coverage of key pay stub fields such as Employee, Employer, PayPeriod, GrossPay, and NetPay. If the document-level confidence and field coverage meet your criteria, you accept the document as a pay stub; otherwise, you route it to your custom classifier or composed custom extraction models for further processing.
This method ensures that known document types like pay stubs benefit from the robustness of Microsoft’s prebuilt models, while the rest of your corpus is handled by a tailored classifier trained on your actual document types. For ongoing accuracy, you should set up confidence-based routing rules, implement a review queue for low-confidence cases, and leverage incremental training on your custom classifier as new document types emerge (custom classification docs). Below is a simplified orchestration pattern you can adapt:
```python
def classify(client, document):
    # Stage 1: try the prebuilt pay stub model first.
    # Analysis is a long-running operation, so wait on the poller.
    poller = client.begin_analyze_document("prebuilt-payStub.us", document)
    analysis = poller.result()
    doc = analysis.documents[0] if analysis.documents else None
    if doc and doc.confidence >= PAYSTUB_THRESHOLD and coverage_ok(doc.fields):
        return "PayStub", doc  # accept as a pay stub
    # Stage 2: fall back to your custom classifier / extraction models.
    return classify_with_custom_model(document)
```
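For the review-queue piece of the routing rules mentioned earlier, a simple three-way split can work well. The cutoffs and labels below are illustrative assumptions you would tune against your own validation data, not values from the service:

```python
# Illustrative three-way routing: accept confidently classified pay
# stubs, send borderline cases to a human review queue, and fall back
# to the custom classifier for everything else. Example cutoffs only.
ACCEPT_THRESHOLD = 0.85
REVIEW_THRESHOLD = 0.60

def route(confidence):
    if confidence >= ACCEPT_THRESHOLD:
        return "accept"            # treat as a pay stub automatically
    if confidence >= REVIEW_THRESHOLD:
        return "review-queue"      # borderline: human verification
    return "custom-classifier"     # likely not a pay stub
```

Tracking how often documents land in each bucket over time also gives you an early signal when a new pay stub format starts slipping past the prebuilt model.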
By following this confidence-and-coverage-based decision flow, you avoid the pitfalls of trying to compose prebuilt and custom models directly, maintain accuracy across document types, and ensure your pipeline is both scalable and easy to maintain.
I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.
Please don't forget to close the thread by upvoting and accepting this as the answer if it was helpful.