Indexer Duplication When Re‑Indexing Blob Container—How Does It Determine Unique Documents?

Mark Gergis 0 Reputation points
2025-08-29T17:49:47.17+00:00

Hello,

I'm running into a puzzling issue with an Azure AI Search indexer: I have a blob container with about 10,000 files, and whenever I rerun the indexer (after updating some blobs), every entry is duplicated, leaving roughly 20,000 documents in my search index. I'm trying to understand exactly how the indexer decides what counts as a new document, and why it duplicates rather than updates.
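If it helps, my rough understanding is that the indexer upserts by document key, and that blob indexers typically map `metadata_storage_path` (base64-encoded) to the key field. Is a field mapping like the one below what determines identity? This is just a sketch with placeholder names (`my-indexer`, `id`, etc.), not my exact configuration:

```json
{
  "name": "my-indexer",
  "dataSourceName": "my-blob-datasource",
  "targetIndexName": "my-index",
  "fieldMappings": [
    {
      "sourceFieldName": "metadata_storage_path",
      "targetFieldName": "id",
      "mappingFunction": { "name": "base64Encode" }
    }
  ]
}
```

If the key is derived from the blob path like this, I don't see how the same 10,000 blobs could produce 20,000 documents, which is why I suspect something about my key setup is off.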

Specifically:

  1. What does the indexer use to identify a document? Does it rely on the file name, on blob metadata, or on the content itself?
  2. Why does re-indexing duplicate files instead of updating them? Is there a built‑in change detection mechanism, and if so, what are its limits? Are deletions detected automatically, or do I need to configure anything specifically?
  3. How can I avoid duplicates on re-index runs? Are there settings or policies, like soft delete detection (native blob soft‑delete or custom metadata) that must be configured from the very first indexer run to prevent duplication?
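On point 3, is a data source definition along these lines what's required? This is a sketch assuming native blob soft delete is enabled on the storage account; `my-blob-datasource`, the container name, and the connection string are placeholders:

```json
{
  "name": "my-blob-datasource",
  "type": "azureblob",
  "credentials": { "connectionString": "<storage-connection-string>" },
  "container": { "name": "my-container" },
  "dataDeletionDetectionPolicy": {
    "@odata.type": "#Microsoft.Azure.Search.NativeBlobSoftDeleteDeletionDetectionPolicy"
  }
}
```

And if so, does this policy have to be in place before the very first indexer run, or can it be added to an existing data source later?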

Thanks in advance!

Azure AI Search
An Azure search service with built-in artificial intelligence capabilities that enrich information to help identify and explore relevant content at scale.