What to consider when choosing the best file format for scans for a custom model

Will B 0 Reputation points
2025-08-19T16:18:41.8166667+00:00

Hello,

We have been using a custom model to parse text from scans. We have noticed that the file format of the scans can have a significant impact on the OCR itself: the same handwritten text within a field is often recognized as different characters depending on the format.

We noticed this because in some cases we were processing PDFs and in other cases we were processing PNGs. In all cases we were using Azure to interpret handwritten text on forms. In most cases PDFs performed better than PNGs at 300 DPI. But in some cases, when we exported a PDF as a PNG at 300 DPI, the model correctly interpreted some text that it had interpreted incorrectly in the PDF version.
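
One way to put numbers on differences like these is to score each format's extracted fields against hand-checked values. The following is a minimal sketch, not an Azure API: `edit_distance`, `char_error_rate`, and the sample field values are all hypothetical and made up for illustration. In practice the dictionaries would be filled from the custom model's output for the PDF and PNG versions of the same scan.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(truth: dict[str, str], extracted: dict[str, str]) -> float:
    """Total edit distance divided by total ground-truth characters."""
    errors = sum(edit_distance(v, extracted.get(k, "")) for k, v in truth.items())
    total = sum(len(v) for v in truth.values())
    return errors / total if total else 0.0


# Hypothetical values for illustration only; these are not real results.
truth = {"name": "Jordan Lee", "date": "2025-08-19"}
from_pdf = {"name": "Jordan Lee", "date": "2025-08-l9"}  # "1" read as "l"
from_png = {"name": "Jordan Lce", "date": "2025-08-19"}  # "e" read as "c"

cer_by_format = {
    "pdf": char_error_rate(truth, from_pdf),
    "png": char_error_rate(truth, from_png),
}
```

Running this over a larger set of labeled scans, rather than a single form, would show whether one format is consistently better or whether the difference varies by document.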

Typically the PDF version led to better results, but this has us wondering: what has Microsoft found to be the best file format for HTR (handwritten text recognition) results? I assume higher image resolution in whatever format we use can lead to better results, but are there any other guidelines you could share?

I was able to find some information on which file types can be processed at the links below, but nothing on how results differ between file types.

https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/prebuilt/layout?view=doc-intel-4.0.0&tabs=rest%2Csample-code

https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/train/custom-template?view=doc-intel-4.0.0

Azure AI Document Intelligence

1 answer

  1. Will B 0 Reputation points
    2025-08-22T22:19:16.18+00:00

    Hello Ravada,

    Yes, I did. Thank you for your response. You wrote the following:

    "there are cases where high-resolution image formats like PNGs—especially at 300 DPI—can outperform PDFs. This typically occurs when the PDF is a rasterized scan rather than a native digital document, as rasterization can enhance contrast and clarity of handwritten strokes, improving OCR accuracy."

    All of these documents are scans. None are native digital documents. Would Microsoft advise using PDFs in these instances, or high-resolution PNGs, or some other file type? If it depends, what other factors can help us make the best decision?

    Thank you!

