# Syntax Analysis

## What is it?

The syntax of an utterance refers to its structure, such as how words and ideas are ordered in a sentence. Azimuth provides smart tags based on the length and syntactic structure of utterances.
## Where is this used in Azimuth?

Based on the syntax of each utterance, Azimuth computes syntactic smart tags. Additionally, the length mismatch plot in the Dataset Class Distribution Analysis compares the lengths of the utterances in the training set and the evaluation set.
## How is it computed?

### POS Tags
Part-of-speech (POS) tagging is a common technique that labels each word in a given text with a category according to its grammatical properties. Examples include 'verb' and 'direct object'.
Azimuth uses spaCy, an open-source library, to perform POS tagging on each token of an utterance. It is currently set up for English only.
```python
import spacy
from spacy.lang.en import English

# Sentencizer
spacy_pipeline = English()
spacy_pipeline.add_pipe("sentencizer")  # (5)

# Part of Speech
subj_tags = ["nsubj", "nsubjpass"]  # (1)
obj_tags = ["dobj", "pobj"]  # (2)
verb_tags = ["VERB", "AUX"]  # (3)
spacy_pos = spacy.load("en_core_web_sm")  # (4)
```
1. Tags to detect a subject in a sentence.
2. Tags to detect an object in a sentence.
3. Tags to detect a verb in a sentence.
4. Parser to determine the POS tags in an utterance.
5. Used to compute the number of sentences in an utterance.
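For illustration, running the parser on a made-up utterance shows the dependency and POS labels that these lists match against:

```python
# Hypothetical example utterance; labels shown are typical en_core_web_sm output.
doc = spacy_pos("She booked a flight to Montreal.")
for token in doc:
    print(token.text, token.pos_, token.dep_)
# e.g., "She" -> PRON / nsubj, "booked" -> VERB / ROOT, "flight" -> NOUN / dobj
```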
Based on this, the following smart tags are computed: `multiple_sentences`, `missing_subj`, `missing_verb`, and `missing_obj`.
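The exact logic lives in the Azimuth source; as a rough sketch, reusing the pipelines and tag lists defined above, the tags could be derived as follows:

```python
def syntax_smart_tags(utterance: str) -> list[str]:
    """Rough sketch of the syntactic smart tags; not Azimuth's exact implementation."""
    tags = []

    # Count sentences with the lightweight sentencizer pipeline.
    doc = spacy_pipeline(utterance)
    if len(list(doc.sents)) > 1:
        tags.append("multiple_sentences")

    # Look for a subject, object, and verb using the dependency and POS labels.
    doc = spacy_pos(utterance)
    deps = {token.dep_ for token in doc}
    pos = {token.pos_ for token in doc}
    if not deps.intersection(subj_tags):
        tags.append("missing_subj")
    if not pos.intersection(verb_tags):
        tags.append("missing_verb")
    if not deps.intersection(obj_tags):
        tags.append("missing_obj")

    return tags
```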
### Token Count
To compute the number of tokens per utterance, the following tokenizer is used:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
```
Based on the token count, the `long_sentence` (> 15 tokens) and `short_sentence` (<= 3 tokens) smart tags are computed.
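As a rough sketch, reusing the tokenizer loaded above with the thresholds from the text:

```python
def token_count_smart_tags(utterance: str) -> list[str]:
    """Rough sketch; counts sub-word tokens, excluding special tokens."""
    num_tokens = len(tokenizer.tokenize(utterance))
    if num_tokens > 15:
        return ["long_sentence"]
    if num_tokens <= 3:
        return ["short_sentence"]
    return []

# token_count_smart_tags("Hi!") -> ["short_sentence"]  ("hi" and "!" are 2 tokens)
```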