Hello,
We have been using a custom model to parse text from form scans. We have noticed that the file format of a scan can have a significant impact on the OCR results: the same handwritten text within a field is often recognized as different characters depending on the format.
We noticed this because in some cases we were processing PDFs and in other cases PNGs. In all cases we were using Azure to interpret handwritten text on forms. In most cases, PDFs performed better than 300 dpi PNGs. But in some cases, when we exported a PDF as a 300 dpi PNG, the model correctly interpreted some text that it had interpreted incorrectly in the PDF version.
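To make the comparison concrete, here is a minimal sketch of how the per-field output of a PDF run and a PNG run could be diffed. This is only an illustration: `diff_fields` is a hypothetical helper (not part of any Azure SDK), it assumes each run has already been reduced to a plain field-name-to-text mapping, and the field names and values are made up rather than real model output.

```python
def diff_fields(pdf_fields, png_fields):
    """Return {field: (pdf_value, png_value)} for fields where the two runs disagree.

    A field missing from one run is reported with None on that side.
    """
    diffs = {}
    for name in sorted(set(pdf_fields) | set(png_fields)):
        pdf_value = pdf_fields.get(name)
        png_value = png_fields.get(name)
        if pdf_value != png_value:
            diffs[name] = (pdf_value, png_value)
    return diffs


# Made-up example values for illustration only.
pdf_run = {"name": "J0hn Smith", "date": "2023-04-01"}
png_run = {"name": "John Smith", "date": "2023-04-01"}

for field, (pdf_value, png_value) in diff_fields(pdf_run, png_run).items():
    print(f"{field}: pdf={pdf_value!r} png={png_value!r}")
```

Comparing field by field like this is what showed us which format got a given field right.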
Typically the PDF version led to better results, but this has us wondering: what have you at Microsoft found to be the best file formats for handwritten text recognition (HTR)? I assume that higher image resolution, in whatever format we use, can lead to better results, but are there any other guidelines you could share?
I was able to find some information on which file types can be processed at the links below, but nothing on how results differ between file types.
https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/prebuilt/layout?view=doc-intel-4.0.0&tabs=rest%2Csample-code
https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/train/custom-template?view=doc-intel-4.0.0