Similarity Analysis
What is it?
Similarity analysis is based on the relative locations of utterances in embedding space. This analysis can be quite powerful given that no trained ML model is needed; only a dataset needs to be supplied.
Within Azimuth, different similarity analyses are provided to determine how similar utterances are within a class, across classes, and so on. This can help indicate whether classes are well-defined, or whether changes should be made to improve the dataset, such as by redefining classes, relabeling or omitting data, or augmenting the dataset.
Where is this used in Azimuth?
In Azimuth, the similarity analysis is used to derive Smart Tags, and also to show the most similar utterances in both dataset splits on the Utterance Details page.
How is it Computed?
Similarity Computation
To get utterance embeddings, Azimuth uses a sentence encoder (all-MiniLM-L12-v2 from sentence-transformers) based on a BERT architecture (Reimers and Gurevych, 2019)[1]. It then computes the cosine similarity (via a dot product on normalized embeddings) between each utterance in the dataset and all other utterances in both dataset splits (training and evaluation).
On the Utterance Details page, the most similar examples are presented in descending order (i.e., most similar first), along with the cosine similarity to the selected utterance. A cosine similarity of 1 indicates that the two utterances are identical in embedding space, while 0 indicates that they are unrelated.
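For illustration, the sketch below reproduces this computation with the sentence-transformers library: utterances are encoded with all-MiniLM-L12-v2, the embeddings are L2-normalized so that a dot product equals the cosine similarity, and the neighbors of one utterance are listed in descending order. The example utterances and variable names are hypothetical, and Azimuth's internal implementation may differ.

```python
# Minimal sketch of the similarity computation described above (not Azimuth's actual code).
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L12-v2")

# Hypothetical utterances standing in for the two dataset splits.
train_utterances = ["how do I reset my password", "my card was declined"]
eval_utterances = ["I forgot my password", "what is the weather today"]

# Encode and L2-normalize, so a dot product equals cosine similarity.
train_emb = encoder.encode(train_utterances, normalize_embeddings=True)
eval_emb = encoder.encode(eval_utterances, normalize_embeddings=True)

# Cosine similarity between every evaluation utterance and every training utterance.
similarities = eval_emb @ train_emb.T  # shape: (n_eval, n_train)

# Most similar training utterances for the first evaluation utterance, most similar first.
order = np.argsort(-similarities[0])
for idx in order:
    print(f"{similarities[0, idx]:.3f}  {train_utterances[idx]}")
```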
Smart Tags
No Close Tags
Some utterances may have no close neighbors in a dataset split; that is, their most similar utterances have low cosine similarities. When the cosine similarity of an utterance's closest neighbor is below a threshold (default = 0.5), the utterance gets tagged with no_close_train and/or no_close_eval, according to the dataset split being assessed (training or evaluation). Note that this tag is class label-agnostic.
Few Similar Tags
It can be useful to assess whether the most similar data samples to an utterance (its neighbors) come from the same or different classes. When most of its neighboring utterances are from a different class, it might indicate a mislabeling issue, overlapping classes, data drift, or simply an utterance that is difficult to predict.
Two Smart Tags highlight these sorts of utterances, based on the label heterogeneity of the neighborhood in each dataset split (training or evaluation). If 90% or more of an utterance's most similar data samples (neighbors) in a dataset split belong to a different class, it will be tagged as conflicting_neighbors_train and/or conflicting_neighbors_eval, based on which dataset split is being examined. (E.g., an utterance in the evaluation set will be compared to its neighbors in both the training and evaluation dataset splits.)
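The sketch below illustrates this rule under some assumptions: embeddings are pre-computed and L2-normalized, and the neighborhood is taken to be a fixed number k of most similar utterances (k = 20 here is purely illustrative, not Azimuth's documented value). The function and variable names are hypothetical.

```python
# Sketch of the conflicting_neighbors_* rule described above (not Azimuth's implementation).
import numpy as np

CONFLICTING_THRESHOLD = 0.9  # default fraction of differently-labeled neighbors


def conflicting_neighbors_tag(
    utterance_emb: np.ndarray,   # L2-normalized embedding of the utterance
    utterance_label: int,        # class label of the utterance
    split_emb: np.ndarray,       # L2-normalized embeddings of one dataset split
    split_labels: np.ndarray,    # class labels of that split
    k: int = 20,                 # neighborhood size (assumed for illustration)
) -> bool:
    """True if >= 90% of the k most similar utterances in the split have a different label."""
    similarities = split_emb @ utterance_emb        # cosine similarities to every utterance
    neighbor_idx = np.argsort(-similarities)[:k]    # indices of the k most similar utterances
    different = split_labels[neighbor_idx] != utterance_label
    return different.mean() >= CONFLICTING_THRESHOLD


# Usage: the same utterance is checked against both splits, yielding
# conflicting_neighbors_train and/or conflicting_neighbors_eval.
```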
Configuration
Similarity Analysis Configuration details how to change the encoder for the embeddings on which similarity is computed, as well as the two thresholds used to determine the Smart Tags.
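For orientation, the snippet below sketches what the relevant part of such a configuration might look like, expressed here as a Python dict. The field names mirror the defaults described above but are assumptions on my part; the Similarity Analysis Configuration reference is the authoritative source for the exact schema.

```python
# Illustrative excerpt of a similarity configuration, expressed as a Python dict.
# Field names are assumptions; consult the Similarity Analysis Configuration reference.
similarity_config = {
    "similarity": {
        "faiss_encoder": "all-MiniLM-L12-v2",      # sentence encoder used for the embeddings
        "conflicting_neighbors_threshold": 0.9,    # fraction of differently-labeled neighbors
        "no_close_threshold": 0.5,                 # minimum cosine similarity to the closest neighbor
    }
}
```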
[1] Reimers, Nils, and Iryna Gurevych. "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." arXiv preprint arXiv:1908.10084 (2019).