Semantic Deduplication¶
Remove near-duplicate generated records using embedding-based similarity as a graph post-processor
Overview¶
SyGra supports semantic deduplication as a graph post-processing step via:
sygra.core.graph.graph_postprocessor.SemanticDedupPostProcessor
It embeds a configured output field (e.g., answer, description) and removes items whose cosine similarity to an already-kept item is above a configurable threshold.
This is useful when:
- Your generation workflow tends to repeat the same/very similar answers.
- You are generating multiple records and want to reduce redundant samples.
- You want a report of duplicate pairs to inspect or tune dedup behavior.
Quick Start¶
Add the post processor under graph_post_process in your task graph_config.yaml.
Example (dedup over answer, see tasks/examples/semantic_dedup/graph_config.yaml):
graph_post_process:
  - processor: sygra.core.graph.graph_postprocessor.SemanticDedupPostProcessor
    params:
      field: answer
      similarity_threshold: 0.92
      id_field: id
      embedding_backend: sentence_transformers
      embedding_model: all-MiniLM-L6-v2
      dedup_mode: nearest_neighbor
      vectorstore_k: 20
      keep: first
      max_pairs_in_report: 1000
Example (dedup over description, see tasks/examples/semantic_dedup_no_seed/graph_config.yaml):
graph_post_process:
  - processor: sygra.core.graph.graph_postprocessor.SemanticDedupPostProcessor
    params:
      field: description
      similarity_threshold: 0.85
      id_field: id
      embedding_backend: sentence_transformers
      embedding_model: all-MiniLM-L6-v2
      keep: first
      max_pairs_in_report: 1000
Configuration Reference¶
Parameters¶
All parameters are provided under params:.
| Parameter | Type | Description | Default |
|---|---|---|---|
| field | string | Field to embed and compare for similarity. If the field value is a list/tuple, values are joined with newlines. | text |
| similarity_threshold | float | Cosine similarity threshold. Higher values drop fewer items. | 0.9 |
| id_field | string | Optional ID field used in the report for readability. If missing, indices are used. | id |
| embedding_backend | string | Embedding backend. Currently only sentence_transformers is supported. | sentence_transformers |
| embedding_model | string | SentenceTransformers model name to use for embeddings. | all-MiniLM-L6-v2 |
| report_filename | string | Optional report JSON filename. If relative, it is written next to the graph output file. If omitted, the report name is derived from the output file name. | (derived) |
| keep | string | Which item to keep when duplicates are found: first or last. | first |
| max_pairs_in_report | int | Max number of duplicate pairs written to the report. | 2000 |
| dedup_mode | string | Dedup implementation to use: nearest_neighbor (default) or all_pairs. Any other value is unsupported and will raise an error. nearest_neighbor avoids building a full similarity matrix by only comparing against nearest neighbors / kept items. all_pairs computes a full similarity matrix (exact, but O(n^2)). | nearest_neighbor |
| vectorstore_k | int | Number of nearest neighbors to retrieve/consider when dedup_mode: nearest_neighbor. | 20 |
How dedup is applied¶
- A greedy pass keeps an item if it is not too similar to a previously kept one.
- Similarity is computed via cosine similarity over normalized embeddings.
- keep: first keeps the earlier item; keep: last prefers the later item.
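The greedy pass described above can be sketched in plain Python. This is a minimal illustration rather than SyGra's actual implementation: the names normalize and greedy_dedup and the toy 2-D vectors are hypothetical, real embeddings would come from the configured model, and modeling keep: last as a reverse scan is an assumption.

```python
import math
from typing import List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def greedy_dedup(
    embeddings: Sequence[Sequence[float]],
    threshold: float = 0.9,
    keep: str = "first",
) -> Tuple[List[int], List[Tuple[int, int, float]]]:
    """Return (kept indices, [(kept_index, dropped_index, similarity), ...])."""
    unit = [normalize(v) for v in embeddings]
    # Assumption: keep="last" is modeled by scanning the items from the end.
    n = len(unit)
    order = range(n) if keep == "first" else range(n - 1, -1, -1)
    kept: List[int] = []
    pairs: List[Tuple[int, int, float]] = []
    for i in order:
        match = None
        for j in kept:
            sim = sum(a * b for a, b in zip(unit[i], unit[j]))
            if sim > threshold:  # too similar to an item we already kept
                match = (j, i, sim)
                break
        if match is None:
            kept.append(i)
        else:
            pairs.append(match)
    return sorted(kept), pairs


# Toy vectors: items 0 and 1 point in almost the same direction.
vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
kept_first, pairs_first = greedy_dedup(vectors, threshold=0.9, keep="first")
kept_last, pairs_last = greedy_dedup(vectors, threshold=0.9, keep="last")
print(kept_first, kept_last)
```

In this sketch every candidate is compared against all kept items, which corresponds most closely to an exhaustive comparison; a nearest-neighbor variant would restrict the inner loop to the closest few kept items.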
Output report¶
If SyGra provides metadata["output_file"] at runtime, the post processor writes a JSON report next to the output file.
Report naming¶
- If report_filename is provided:
  - absolute paths are used as-is
  - relative paths are resolved relative to the output directory
- Otherwise, the report filename is derived from the output filename:
  - output_*.json -> semantic_dedup_report_*.json
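The naming rules above can be expressed as a small helper. This is a sketch under stated assumptions: resolve_report_path is a hypothetical name, not a SyGra function, and the fallback for output files that do not start with output_ is a guess, since the rule above only covers the output_*.json case.

```python
from pathlib import Path
from typing import Optional


def resolve_report_path(output_file: str, report_filename: Optional[str] = None) -> Path:
    """Mirror the documented report naming rules (illustrative only)."""
    output_path = Path(output_file)
    if report_filename:
        candidate = Path(report_filename)
        # Absolute paths are used as-is; relative ones land in the output directory.
        if candidate.is_absolute():
            return candidate
        return output_path.parent / candidate
    name = output_path.name
    if name.startswith("output_"):
        # output_*.json -> semantic_dedup_report_*.json
        derived = "semantic_dedup_report_" + name[len("output_"):]
    else:
        # Assumption: behavior for other file names is not documented.
        derived = "semantic_dedup_report_" + name
    return output_path.parent / derived


print(resolve_report_path("/tmp/run/output_2024.json"))
print(resolve_report_path("/tmp/run/output_2024.json", "my_report.json"))
```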
Report format (high level)¶
The report includes:
- input_count, output_count, dropped_count
- configuration (field, similarity_threshold, embedding_model, etc.)
- a bounded list of duplicate pairs under duplicates
Each entry in duplicates contains:
- kept_index, dropped_index
- kept_id, dropped_id
- similarity
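To inspect a report when tuning the threshold, something like the following may help. It relies only on the keys listed above; summarize_report and the sample values are hypothetical, and the sample stands in for a report you would normally load from disk with json.load.

```python
import json
from typing import Any, Dict, List


def summarize_report(report: Dict[str, Any], top_n: int = 3) -> List[str]:
    """Return counts plus the lowest-similarity dropped pairs (borderline cases)."""
    lines = [
        f"input={report['input_count']} "
        f"output={report['output_count']} "
        f"dropped={report['dropped_count']}"
    ]
    # Pairs closest to the threshold are the most informative when tuning it.
    pairs = sorted(report.get("duplicates", []), key=lambda p: p["similarity"])
    for pair in pairs[:top_n]:
        lines.append(
            f"kept={pair['kept_id']} dropped={pair['dropped_id']} "
            f"similarity={pair['similarity']:.3f}"
        )
    return lines


# Made-up sample data, shaped after the fields described above.
sample = json.loads("""
{
  "input_count": 4, "output_count": 2, "dropped_count": 2,
  "duplicates": [
    {"kept_index": 0, "dropped_index": 1, "kept_id": "a", "dropped_id": "b", "similarity": 0.97},
    {"kept_index": 0, "dropped_index": 3, "kept_id": "a", "dropped_id": "d", "similarity": 0.93}
  ]
}
""")
summary = summarize_report(sample)
print("\n".join(summary))
```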
Dependencies¶
When using embedding_backend: sentence_transformers, this feature requires the sentence-transformers package to be available in your environment.
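A quick way to confirm the dependency is importable before running a task is a check like the one below. The helper name has_sentence_transformers is hypothetical; the check only tests whether the sentence_transformers module can be found, not whether a particular model can be downloaded or loaded.

```python
import importlib.util


def has_sentence_transformers() -> bool:
    """Return True if the sentence_transformers module can be imported."""
    return importlib.util.find_spec("sentence_transformers") is not None


available = has_sentence_transformers()
if not available:
    print("sentence-transformers is not installed in this environment")
```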
Performance considerations¶
When dedup_mode: nearest_neighbor (default), dedup runs incrementally and does not build a full similarity matrix. This is typically faster and uses less memory for larger outputs.
When dedup_mode: all_pairs, the implementation computes a full similarity matrix (O(n^2) time/memory), so it is intended for relatively small output lists.
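To get a feel for the O(n^2) memory cost, a rough estimate can be computed as follows. This assumes a dense matrix with 4 bytes per value (float32), which is an assumption about the storage format rather than a documented detail; similarity_matrix_bytes is a hypothetical helper.

```python
def similarity_matrix_bytes(n_items: int, bytes_per_value: int = 4) -> int:
    """Approximate size of a dense n x n similarity matrix in bytes."""
    return n_items * n_items * bytes_per_value


for n in (1_000, 10_000, 100_000):
    gib = similarity_matrix_bytes(n) / 2**30
    print(f"{n:>7} items -> {gib:.2f} GiB")
```

Under these assumptions, growing the output list tenfold increases the matrix size a hundredfold, which is why all_pairs is best reserved for small lists.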
If you plan to deduplicate very large outputs, consider:
- generating in smaller batches
- using a higher threshold to reduce comparisons
- implementing an approximate/streaming dedup strategy
Troubleshooting¶
Unsupported embedding backend¶
If you set embedding_backend to anything other than sentence_transformers, SyGra will raise:
ValueError: Unsupported embedding_backend: ...
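If you want to catch a bad configuration before a long generation run, a pre-flight check along these lines is one option. It is a sketch, not SyGra code: check_dedup_params is a hypothetical helper, the allowed values come from the parameter table, and the dedup_mode error message is invented for illustration.

```python
from typing import Any, Mapping

SUPPORTED_BACKENDS = {"sentence_transformers"}
SUPPORTED_MODES = {"nearest_neighbor", "all_pairs"}


def check_dedup_params(params: Mapping[str, Any]) -> None:
    """Raise ValueError early if a documented option has an unsupported value."""
    backend = params.get("embedding_backend", "sentence_transformers")
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"Unsupported embedding_backend: {backend}")
    mode = params.get("dedup_mode", "nearest_neighbor")
    if mode not in SUPPORTED_MODES:
        # Hypothetical message; the actual wording raised by SyGra may differ.
        raise ValueError(f"Unsupported dedup_mode: {mode}")


# A valid configuration passes silently.
check_dedup_params({"field": "answer", "embedding_backend": "sentence_transformers"})

try:
    check_dedup_params({"field": "answer", "embedding_backend": "openai"})
    error_message = ""
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```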
No report is written¶
A report is only written if metadata["output_file"] is present. If you are running in a context where SyGra does not set it, the post processor will still deduplicate in-memory but will not persist the report.