Similarity Analysis
What is it?
Similarity analysis is based on the relative locations of utterances in embedding space. This analysis can be quite powerful given that no trained ML model is needed; only a dataset needs to be supplied.
Within Azimuth, different similarity analyses are provided to determine how similar utterances are within a class, across classes, and so on. This can help indicate whether classes are well-defined, or whether changes should be made to improve the dataset, such as by redefining classes, relabeling or omitting data, or augmenting the dataset.
Where is this used in Azimuth?
In Azimuth, the similarity analysis is used to derive Smart Tags, and also to show the most similar utterances in both dataset splits on the Utterance Details page.
How is it Computed?
Similarity Computation
To get utterance embeddings, Azimuth uses a sentence encoder (all-MiniLM-L12-v2 from sentence-transformers) based on a BERT architecture (Reimers and Gurevych, 2019)[1]. It then computes the cosine similarity (via a dot product on normalized embeddings) between each utterance in the dataset and all other utterances in both dataset splits (training and evaluation).
On the Utterance Details page, the most similar examples are presented in descending order (i.e., most similar first), along with the cosine similarity to the selected utterance. A cosine similarity of 1 indicates that the two utterances are identical in embedding space, while 0 indicates that they are unrelated.
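For illustration, the sketch below reproduces this computation with the sentence-transformers library: utterances are encoded with all-MiniLM-L12-v2, the embeddings are L2-normalized so that a dot product equals the cosine similarity, and the neighbors of one utterance are listed in descending order. The example utterances and variable names are hypothetical, and Azimuth's internal implementation may differ.

```python
# Minimal sketch of the similarity computation described above (not Azimuth's actual code).
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L12-v2")

# Hypothetical utterances standing in for the two dataset splits.
train_utterances = ["how do I reset my password", "my card was declined"]
eval_utterances = ["I forgot my password", "what is the weather today"]

# Encode and L2-normalize, so a dot product equals cosine similarity.
train_emb = encoder.encode(train_utterances, normalize_embeddings=True)
eval_emb = encoder.encode(eval_utterances, normalize_embeddings=True)

# Cosine similarity between every evaluation utterance and every training utterance.
similarities = eval_emb @ train_emb.T  # shape: (n_eval, n_train)

# Most similar training utterances for the first evaluation utterance, most similar first.
order = np.argsort(-similarities[0])
for idx in order:
    print(f"{similarities[0, idx]:.3f}  {train_utterances[idx]}")
```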
Smart Tags
No Close Tags
Some utterances may have no close neighbors in a dataset split; that is, their most similar utterances have low cosine similarities. When the cosine similarity of an utterance's closest neighbor is below a threshold (default = 0.5), the utterance gets tagged with no_close_train and/or no_close_eval, according to the dataset split being assessed (training or evaluation). Note that this tag is class label-agnostic.
Few Similar Tags
It can be useful to assess whether the most similar data samples to an utterance (its neighbors) come from the same or different classes. When most of its neighboring utterances are from a different class, it might indicate a mislabeling issue, overlapping classes, data drift, or simply an utterance that is difficult to predict.
Two Smart Tags highlight these sorts of utterances, based on the label heterogeneity of the neighborhood in each dataset split (training or evaluation). If 90% or more of an utterance's most similar data samples (neighbors) in a dataset split belong to a different class, it will be tagged as conflicting_neighbors_train and/or conflicting_neighbors_eval, based on which dataset split is being examined. (E.g., an utterance in the evaluation set will be compared to its neighbors in both the training and evaluation dataset splits.)
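The sketch below illustrates this rule under some assumptions: embeddings are pre-computed and L2-normalized, and the neighborhood is taken to be a fixed number k of most similar utterances (k = 20 here is purely illustrative, not Azimuth's documented value). The function and variable names are hypothetical.

```python
# Sketch of the conflicting_neighbors_* rule described above (not Azimuth's implementation).
import numpy as np

CONFLICTING_THRESHOLD = 0.9  # default fraction of differently-labeled neighbors


def conflicting_neighbors_tag(
    utterance_emb: np.ndarray,   # L2-normalized embedding of the utterance
    utterance_label: int,        # class label of the utterance
    split_emb: np.ndarray,       # L2-normalized embeddings of one dataset split
    split_labels: np.ndarray,    # class labels of that split
    k: int = 20,                 # neighborhood size (assumed for illustration)
) -> bool:
    """True if >= 90% of the k most similar utterances in the split have a different label."""
    similarities = split_emb @ utterance_emb        # cosine similarities to every utterance
    neighbor_idx = np.argsort(-similarities)[:k]    # indices of the k most similar utterances
    different = split_labels[neighbor_idx] != utterance_label
    return different.mean() >= CONFLICTING_THRESHOLD


# Usage: the same utterance is checked against both splits, yielding
# conflicting_neighbors_train and/or conflicting_neighbors_eval.
```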
Configuration
Similarity Analysis Configuration details how to change the encoder for the embeddings on which similarity is computed, as well as the two thresholds used to determine the Smart Tags.
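For orientation, the snippet below sketches what the relevant part of such a configuration might look like, expressed here as a Python dict. The field names mirror the defaults described above but are assumptions on my part; the Similarity Analysis Configuration reference is the authoritative source for the exact schema.

```python
# Illustrative excerpt of a similarity configuration, expressed as a Python dict.
# Field names are assumptions; consult the Similarity Analysis Configuration reference.
similarity_config = {
    "similarity": {
        "faiss_encoder": "all-MiniLM-L12-v2",      # sentence encoder used for the embeddings
        "conflicting_neighbors_threshold": 0.9,    # fraction of differently-labeled neighbors
        "no_close_threshold": 0.5,                 # minimum cosine similarity to the closest neighbor
    }
}
```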
[1] Reimers, Nils, and Iryna Gurevych. "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." arXiv preprint arXiv:1908.10084 (2019).