Data
Functions:
-
mask_labels
–This function creates labels from a sequence of input ids by masking
-
validate_spans
–Make sure the spans are valid, don't overlap, and are in order.
mask_labels(input_ids, offset_mapping, predicted_spans, masked_token_id=MASKED_TOKEN_ID)
This function creates labels from a sequence of input ids by masking the tokens that do not have any overlap with the character spans that are designated for prediction. The labels can then be used to train a model to predict everything except the masked tokens.
The function also returns a list of midpoints for splitting the labels into a source and a target. The source is the part of the labels that is used to predict the target. There is one midpoint for each span that is designated for prediction. Each midpoint is the index of the first token that overlaps with the corresponding span.
Parameters:
-
input_ids
(Sequence[int]
) –A sequence of token ids.
-
offset_mapping
(Iterable[tuple[int, int]]
) –The offset mapping returned by the tokenizer.
-
predicted_spans
(Iterable[Iterable[int]]
) –The character spans that are designated for prediction. The spans are given as a sequence of two-element sequences, where the first element is the beginning of the span (inclusive) and the second element is the end of the span (not inclusive).
Returns:
-
tuple[list[int], list[int]]
–tuple[list[int], list[int]]: A tuple of masked labels and corresponding midpoints for splitting the labels into a source and a target.
Source code in tapeagents/finetune/data.py
47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 |
|
validate_spans(text, predicted_spans)
Make sure the spans are valid, don't overlap, and are in order.
Source code in tapeagents/finetune/data.py
96 97 98 99 100 101 102 103 104 105 106 107 108 |
|