Syntax Analysis

What is it?

The syntax of an utterance usually refers to its structure, such as how ideas and words are ordered in a sentence. Azimuth provides smart tags based on the length and syntactic structure of utterances.

Where is this used in Azimuth?

Based on the syntax of each utterance, Azimuth computes syntactic Smart Tags (Extreme Length and Partial Syntax families). Additionally, the length mismatch plot in the Dataset Class Distribution Analysis compares the length of the utterances in the training set and the evaluation set.

How is it computed?

POS Tags

Part-of-speech (POS) tagging is a common technique that assigns each word in a text to a category according to its grammatical properties, such as 'verb' or 'noun'. Dependency tagging, in turn, labels the syntactic role a word plays in the sentence, such as 'subject' or 'direct object'.

Azimuth uses spaCy, an open-source library, to perform part-of-speech (POS) and dependency tagging on each token of an utterance. This tagging is set up for all languages supported by Azimuth. Azimuth then computes the smart tags missing_subj, missing_verb, and missing_obj based on whether certain tags are present. Subjects and objects are identified by dependency tags, which are language-dependent and specified in the Syntax Analysis Config, whereas verbs are identified by POS tags (["VERB", "AUX"]), which are consistent across languages.
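
As a rough illustration of the logic described above, the sketch below derives the three tags from a list of (POS tag, dependency tag) pairs, such as those spaCy exposes through token.pos_ and token.dep_. The function name, the SUBJ_TAGS and OBJ_TAGS values, and the rule that a tag is set when no token carries a matching tag are assumptions made for this example; the actual subject and object tags come from the Syntax Analysis Config.

```python
from typing import Dict, List, Tuple

# Verb POS tags, as listed above; consistent across languages.
VERB_POS = {"VERB", "AUX"}
# Hypothetical dependency tags for illustration only; the real values are
# language-dependent and come from the Syntax Analysis Config.
SUBJ_TAGS = {"nsubj", "nsubjpass"}
OBJ_TAGS = {"dobj", "pobj", "obj"}


def partial_syntax_tags(tokens: List[Tuple[str, str]]) -> Dict[str, bool]:
    """Return the three tags for a list of (pos, dep) pairs.

    Assumption: a tag is set when no token carries a matching tag.
    """
    pos_tags = {pos for pos, _ in tokens}
    dep_tags = {dep for _, dep in tokens}
    return {
        "missing_subj": not (dep_tags & SUBJ_TAGS),
        "missing_verb": not (pos_tags & VERB_POS),
        "missing_obj": not (dep_tags & OBJ_TAGS),
    }


# "I lost my card": pronoun subject, verb, direct object.
full = [("PRON", "nsubj"), ("VERB", "ROOT"), ("PRON", "poss"), ("NOUN", "dobj")]
# "lost card": no subject.
partial = [("VERB", "ROOT"), ("NOUN", "dobj")]
```

With a parsed spaCy doc, the input pairs could be built as [(token.pos_, token.dep_) for token in doc].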

Word Count

To count the words in an utterance, Azimuth tokenizes it with the spaCy model specified in the config and excludes punctuation tokens.

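import spacy

# `config` is the Azimuth config and `utterance` is the text of one utterance.
spacy_model = spacy.load(config.syntax.spacy_model)
doc = spacy_model(utterance)
# Keep every token except punctuation, then count what remains.
tokens = [token.text for token in doc if not token.is_punct]
word_count = len(tokens)

The long_utterance and short_utterance smart tags are then computed by comparing this word count against the thresholds set in the config.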
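
The thresholding step could look like the following sketch. The function name, the threshold values, and the inclusive comparisons are placeholders chosen for this example, not necessarily what Azimuth uses; the actual thresholds are set in the Syntax Analysis Config.

```python
from typing import Dict

# Hypothetical thresholds for illustration; the real ones come from the config.
SHORT_UTTERANCE_MAX_WORDS = 3
LONG_UTTERANCE_MIN_WORDS = 12


def length_tags(word_count: int) -> Dict[str, bool]:
    """Map a word count to the two length-based smart tags.

    Assumption: both bounds are inclusive.
    """
    return {
        "short_utterance": word_count <= SHORT_UTTERANCE_MAX_WORDS,
        "long_utterance": word_count >= LONG_UTTERANCE_MIN_WORDS,
    }
```

Under these assumed values, an utterance of 7 words would receive neither tag, while one of 2 words would be tagged short_utterance.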

Sentence Count

The smart tag multiple_sentences is based on a spaCy sentencizer:

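from spacy.lang.en import English

# Rule-based sentencizer; English() should work for other languages that have similar sentence conventions.
spacy_sentencizer_en = English()
spacy_sentencizer_en.add_pipe("sentencizer")

# Sketch of how the sentence count can then be obtained for an utterance (a string).
doc = spacy_sentencizer_en(utterance)
sentence_count = len(list(doc.sents))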

Configuration

Syntax Analysis Config explains how to edit the thresholds to determine what is considered a short or long utterance, the tags used to detect subjects and objects, and the spaCy model used to parse utterances.