What to consider when choosing the best file format for scans for a custom model

Will B 0

Hello,

We have been using a custom model to parse text from scans. We have noticed that the file format of the scans can have a significant impact on the OCR itself. Handwritten text within fields is often recognized as different characters depending on the format.

We noticed this because in some cases we were processing PDFs and in other cases PNGs. In all cases we were using Azure to interpret handwritten text on forms. In most cases PDFs performed better than 300 DPI PNGs. But in some cases, when we exported a PDF as a 300 DPI PNG, the model correctly interpreted some text that it had misread in the PDF version.

Typically the PDF version led to better results, but this has us wondering: what have you at Microsoft found to be the best file formats for HTR results? I assume higher image resolution in whatever format we use can lead to better results, but are there any other guidelines you could share?

I was able to find some information on which file types can be processed at the links below, but nothing on differences in results between file types.

https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/prebuilt/layout?view=doc-intel-4.0.0&tabs=rest%2Csample-code

https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/train/custom-template?view=doc-intel-4.0.0

Ravada Shivaprasad 1,115 Reputation points Microsoft External Staff Moderator

2025-08-19T20:55:40.09+00:00

Hi Will B

You've raised a highly relevant and technically nuanced point regarding the impact of file formats on handwritten text recognition (HTR) using Azure Document Intelligence. Microsoft's internal guidance and public documentation indicate that while PDFs generally yield superior results due to their ability to preserve layout, embedded fonts, and vector data, there are cases where high-resolution image formats like PNGs—especially at 300 DPI—can outperform PDFs. This typically occurs when the PDF is a rasterized scan rather than a native digital document, as rasterization can enhance contrast and clarity of handwritten strokes, improving OCR accuracy.

For optimal HTR performance, Microsoft recommends a minimum resolution of 150 DPI, with 300 DPI being ideal for handwritten content. The text height should be at least 12 pixels in a 1024x768 image, which corresponds to roughly 8pt font at 150 DPI. Clear contrast and legibility are critical, as low-quality scans or poor lighting conditions can significantly degrade recognition accuracy. Among supported formats, PDFs are preferred for multi-page structured documents, while PNGs and TIFFs are suitable for single-page or image-only scenarios. JPEGs are also supported but may suffer from compression artifacts that affect OCR precision.
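The resolution figures above can be sanity-checked with simple arithmetic. The following is a minimal sketch, not part of any Azure SDK: the function names are hypothetical, and it relies only on the standard definition of 72 typographic points per inch. It converts a font size to a pixel height at a given scan resolution and estimates the effective DPI of an image from its pixel width and the physical page width.

```python
POINTS_PER_INCH = 72  # standard typographic definition


def text_height_pixels(point_size: float, dpi: float) -> float:
    """Height in pixels of the nominal font size (the em box) at a scan resolution."""
    return point_size / POINTS_PER_INCH * dpi


def effective_dpi(pixel_width: int, page_width_inches: float) -> float:
    """Resolution implied by an image's pixel width and the physical page width."""
    return pixel_width / page_width_inches


if __name__ == "__main__":
    # 8 pt at 150 DPI: the em box is about 16.7 px tall. The letters themselves
    # are smaller than the em box, which is consistent with a 12-pixel minimum.
    print(round(text_height_pixels(8, 150), 1))
    # A US Letter page (8.5 in wide) scanned to 2550 px across is 300 DPI.
    print(effective_dpi(2550, 8.5))
```

Running a check like this on exported PNGs can confirm that an export step really produced the intended resolution before the files are sent for analysis.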

When training custom models, it’s important to use diverse samples with varied handwriting styles and ensure consistent, precise annotation. Retraining with updated samples helps the model adapt to evolving handwriting patterns. Additionally, language settings and contextual cues play a role in recognition accuracy, as seen in cases where punctuation and locale settings affected results. Microsoft also advises avoiding password-protected or corrupted PDFs, and ensuring that scanned documents are clean and well-aligned.

Hope it helps!

Thank you
Ravada Shivaprasad 1,115 Reputation points Microsoft External Staff Moderator

2025-08-22T10:37:17.94+00:00

Hello Will B

Did you get a chance to check the response above?

Thank you

Answer 1

Will B 0

Hello Ravada,

Yes I did. Thank you for your response. When you say the following:

"there are cases where high-resolution image formats like PNGs—especially at 300 DPI—can outperform PDFs. This typically occurs when the PDF is a rasterized scan rather than a native digital document, as rasterization can enhance contrast and clarity of handwritten strokes, improving OCR accuracy."

All of these documents are scans. None are native digital documents. Would Microsoft advise using PDFs in these instances, or high-resolution PNGs, or other file types? If it depends, what other factors can help us make the best decision?

Thank you!

Ravada Shivaprasad 1,115 Reputation points Microsoft External Staff Moderator

2025-08-29T01:02:54.73+00:00

Hi Will B

Sorry for the delayed response.

For scanned documents, Microsoft generally recommends using high-resolution PNGs (at least 300 DPI) over PDFs when OCR accuracy is critical—especially for handwritten content. This is because rasterized images like PNGs can enhance contrast and stroke clarity, which improves text recognition. However, if the document is structured or multi-page, PDFs may be more practical. Key factors to consider include image resolution, text size, contrast, document layout, and whether the content is handwritten or printed. The choice also depends on the OCR engine’s capabilities and how the output will be used.
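Since the best format can vary by document, one practical way to decide is to measure it on your own scans. The sketch below is an illustration under stated assumptions, not a Microsoft-recommended method: it assumes you already have manually verified field values and the text the model extracted from each format, and it scores each format by average character-level similarity using Python's standard `difflib`. The function names and the dictionary layout are hypothetical.

```python
from difflib import SequenceMatcher


def char_similarity(expected: str, recognized: str) -> float:
    """Similarity in [0, 1] between the true text and the recognized text."""
    return SequenceMatcher(None, expected, recognized, autojunk=False).ratio()


def score_formats(expected_by_field: dict, results_by_format: dict) -> dict:
    """Average per-field similarity for each file format.

    expected_by_field: {field name: verified text}
    results_by_format: {format name: {field name: recognized text}}
    A field missing from a format's results counts as an empty string.
    """
    scores = {}
    for fmt, fields in results_by_format.items():
        per_field = [
            char_similarity(truth, fields.get(name, ""))
            for name, truth in expected_by_field.items()
        ]
        scores[fmt] = sum(per_field) / len(per_field)
    return scores


if __name__ == "__main__":
    # Illustrative values only, not real recognition results.
    expected = {"last_name": "Smith"}
    results = {
        "pdf": {"last_name": "Smith"},
        "png_300dpi": {"last_name": "Sm1th"},
    }
    print(score_formats(expected, results))
```

Running this over a representative sample of forms, rather than a single page, gives a more reliable basis for choosing between PDF and PNG input.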

Hope it helps!

Thank you
