Dataset Warnings

Datasets can suffer from a variety of issues, such as class imbalance, classes with low sample counts, and dataset shift. These warnings help detect some of these issues.

Missing samples

In this first analysis, the application flags when a class has fewer than X (default is 20) samples in either the training or the evaluation set. The plot helps to visualize the values for each class.

Class Imbalance

In this second analysis, Azimuth detects class imbalance issues. It raises a flag for all classes where the relative difference between the number of samples in that class and the mean sample count per class in a dataset split is above a certain threshold Y. The default is 50%.

Dataset Shift

A discrepancy between the training and evaluation splits can cause problems with a model. For example, the model may not have a representative sample of examples to train on, making it generalize poorly in production.

Alternatively, if your evaluation set does not come from the same data distribution as the data in production, measuring model performance on this evaluation set may not be a good indicator of the performance in production. Distribution analysis aims to give warnings when the training and evaluation sets look too different in some aspect of the data.

Representation mismatch

This analysis flags when a class is over-represented in the evaluation set (relative to other classes) or the training set. If the delta between the percentage of a class in each set is above Z% (default is 5%), the analysis flags it.

Length mismatch

Length mismatch compares the number of tokens per utterance in both sets. The application flags a warning if the mean and/or standard deviation between the 2 distributions is above A and B ( default is 3 for both) respectively.

Configuration

All thresholds mentioned (X/Y/Z/A/B) can be modified in the config file, as explained in Dataset Warnings Configuration.