Semantic Deduplication¶

Remove near-duplicate generated records using embedding-based similarity as a graph post-processor

Overview¶

SyGra supports semantic deduplication as a graph post-processing step via:

sygra.core.graph.graph_postprocessor.SemanticDedupPostProcessor

It embeds a configured output field (e.g., answer, description) and removes items whose cosine similarity is above a configurable threshold.

This is useful when:

Your generation workflow tends to repeat the same/very similar answers.
You are generating multiple records and want to reduce redundant samples.
You want a report of duplicate pairs to inspect or tune dedup behavior.

Quick Start¶

Add the post processor under graph_post_process in your task graph_config.yaml.

Example (dedup over answer, see tasks/examples/semantic_dedup/graph_config.yaml):

graph_post_process:
  - processor: sygra.core.graph.graph_postprocessor.SemanticDedupPostProcessor
    params:
      field: answer
      similarity_threshold: 0.92
      id_field: id
      embedding_backend: sentence_transformers
      embedding_model: all-MiniLM-L6-v2
      dedup_mode: nearest_neighbor
      vectorstore_k: 20
      keep: first
      max_pairs_in_report: 1000

Example (dedup over description, see tasks/examples/semantic_dedup_no_seed/graph_config.yaml):

graph_post_process:
  - processor: sygra.core.graph.graph_postprocessor.SemanticDedupPostProcessor
    params:
      field: description
      similarity_threshold: 0.85
      id_field: id
      embedding_backend: sentence_transformers
      embedding_model: all-MiniLM-L6-v2
      keep: first
      max_pairs_in_report: 1000

Configuration Reference¶

Parameters¶

All parameters are provided under params:.

Parameter	Type	Description	Default
`field`	string	Field to embed and compare for similarity. If the field value is a list/tuple, values are joined with newlines.	`text`
`similarity_threshold`	float	Cosine similarity threshold. Higher values drop fewer items.	`0.9`
`id_field`	string	Optional ID field used in the report for readability. If missing, indices are used.	`id`
`embedding_backend`	string	Embedding backend. Currently only `sentence_transformers` is supported.	`sentence_transformers`
`embedding_model`	string	SentenceTransformers model name to use for embeddings.	`all-MiniLM-L6-v2`
`report_filename`	string	Optional report JSON filename. If relative, it is written next to the graph output file. If omitted, the report name is derived from the output file name.	(derived)
`keep`	string	Which item to keep when duplicates are found: `first` or `last`.	`first`
`max_pairs_in_report`	int	Max number of duplicate pairs written to the report.	`2000`
`dedup_mode`	string	Dedup implementation to use: `nearest_neighbor` (default) or `all_pairs`. Any other value is unsupported and will raise an error. `nearest_neighbor` avoids building a full similarity matrix by only comparing against nearest neighbors / kept items. `all_pairs` computes a full similarity matrix (exact, but O(n^2)).	`nearest_neighbor`
`vectorstore_k`	int	Number of nearest neighbors to retrieve/consider when `dedup_mode: nearest_neighbor`.	`20`

How dedup is applied¶

A greedy pass keeps an item if it is not too similar to a previously kept one.
Similarity is computed via cosine similarity over normalized embeddings.
keep: first keeps the earlier item, keep: last prefers the later item.

Output report¶

If SyGra provides metadata["output_file"] at runtime, the post processor writes a JSON report next to the output file.

Report naming¶

If report_filename is provided:
absolute paths are used as-is
relative paths are resolved relative to the output directory
Otherwise, the report filename is derived from the output filename:
output_*.json -> semantic_dedup_report_*.json

Report format (high level)¶

The report includes:

input_count, output_count, dropped_count
configuration (field, similarity_threshold, embedding_model, etc.)
a bounded list of duplicate pairs under duplicates

Each entry in duplicates contains:

kept_index, dropped_index
kept_id, dropped_id
similarity

Dependencies¶

When using embedding_backend: sentence_transformers, this feature requires the sentence-transformers package to be available in your environment.

Performance considerations¶

When dedup_mode: nearest_neighbor (default), dedup runs incrementally and does not build a full similarity matrix. This is typically faster and uses less memory for larger outputs.

When dedup_mode: all_pairs, the implementation computes a full similarity matrix (O(n^2) time/memory), so it is intended for relatively small output lists.

If you plan to deduplicate very large outputs, consider:

generating in smaller batches
using a higher threshold to reduce comparisons
implementing an approximate/streaming dedup strategy

Troubleshooting¶

Unsupported embedding backend¶

If you set embedding_backend to anything other than sentence_transformers, SyGra will raise:

ValueError: Unsupported embedding_backend: ...

No report is written¶

A report is only written if metadata["output_file"] is present. If you are running in a context where SyGra does not set it, the post processor will still deduplicate in-memory but will not persist the report.