Indexer Duplication When Re‑Indexing Blob Container—How Does It Determine Unique Documents?

Mark Gergis 0 Reputation points
2025-08-29T17:49:47.17+00:00

Hello,

I'm running into a puzzling issue with an Azure AI Search indexer: I have a blob container with about 10,000 files, and whenever I rerun the indexer (after updating some blobs), every entry is duplicated, leaving roughly 20,000 documents in my search index. I'm trying to understand exactly how the indexer decides what counts as a new document, and why it duplicates rather than updates.
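If it helps, my rough understanding is that the indexer upserts by document key, and that blob indexers typically map `metadata_storage_path` (base64-encoded) to the key field. Is a field mapping like the one below what determines identity? This is just a sketch with placeholder names (`my-indexer`, `id`, etc.), not my exact configuration:

```json
{
  "name": "my-indexer",
  "dataSourceName": "my-blob-datasource",
  "targetIndexName": "my-index",
  "fieldMappings": [
    {
      "sourceFieldName": "metadata_storage_path",
      "targetFieldName": "id",
      "mappingFunction": { "name": "base64Encode" }
    }
  ]
}
```

If the key is derived from the blob path like this, I don't see how the same 10,000 blobs could produce 20,000 documents, which is why I suspect something about my key setup is off.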

Specifically:

  1. What does the indexer use to identify a document? Does it rely on the file name, on blob metadata, or on the content itself?
  2. Why does re-indexing duplicate files instead of updating them? Is there a built‑in change detection mechanism, and if so, what are its limits? Are deletions detected automatically, or do I need to configure anything specifically?
  3. How can I avoid duplicates on re-index runs? Are there settings or policies, like soft delete detection (native blob soft‑delete or custom metadata) that must be configured from the very first indexer run to prevent duplication?
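On point 3, is a data source definition along these lines what's required? This is a sketch assuming native blob soft delete is enabled on the storage account; `my-blob-datasource`, the container name, and the connection string are placeholders:

```json
{
  "name": "my-blob-datasource",
  "type": "azureblob",
  "credentials": { "connectionString": "<storage-connection-string>" },
  "container": { "name": "my-container" },
  "dataDeletionDetectionPolicy": {
    "@odata.type": "#Microsoft.Azure.Search.NativeBlobSoftDeleteDeletionDetectionPolicy"
  }
}
```

And if so, does this policy have to be in place before the very first indexer run, or can it be added to an existing data source later?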

Thanks in advance!

Azure AI Search
An Azure search service with built-in artificial intelligence capabilities that enrich information to help identify and explore relevant content at scale.