Schema Validator¶
Introduction¶
The schema validator lets users check the correctness of generated data before uploading it to HF or the file system.
The key features supported for schema validation are:
- YAML-based schema check: Users can define their schema using YAML config files in either of the following ways:
  - Define a custom schema class inside custom_schemas.py and add its path in the schema key inside schema_config.
  - Add the expected schema config as a list of dicts inside the fields key inside schema_config.
- Rule-based validation support: Aside from adding validator rules inside a custom class, users can choose from the supported validation methods (details in the additional validation rules section) and add one as a key in a particular field's dict.
Usage Illustration¶
Let's assume the following record has been generated and we want to validate it:
{
    "id": 130426,
    "conversation": [
        {
            "role": "user",
            "content": "I am trying to get the CPU cycles at a specific point in my code."
        },
        {
            "role": "assistant",
            "content": "The `rdtsc` function you're using gives you the number of cycles since the CPU was last reset, which is not what you want in this case."
        }
    ],
    "taxonomy": [
        {
            "category": "Coding",
            "subcategory": ""
        }
    ],
    "annotation_type": [
        "mistral-large"
    ],
    "language": [
        "en"
    ],
    "tags": [
        "glaiveai/glaive-code-assistant-v2",
        "reannotate",
        "self-critique"
    ]
}
A custom schema class in custom_schemas.py defines the expected keys and values, along with any additional validation rules:
from typing import Any

from pydantic import BaseModel, root_validator


class CustomUserSchema(BaseModel):
    '''
    This demonstrates an example of a customizable user schema that can be modified or redefined by the end user.
    Below is a sample schema with associated validator methods.
    '''
    # Expected keys and their types
    id: int
    conversation: list[dict[str, Any]]
    taxonomy: list[dict[str, Any]]
    annotation_type: list[str]
    language: list[str]
    tags: list[str]

    # Runs before field validation. Despite its name, it only checks
    # that id is present and not empty.
    @root_validator(pre=True)
    def check_non_empty_lists(cls, values):
        if not values.get('id'):
            raise ValueError('id cannot be empty')
        return values
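To see how such a class behaves on the sample record, the sketch below instantiates it with a shortened copy of that record and with a copy whose id is empty. It assumes pydantic is installed and uses the same `root_validator` API as the class above (pydantic 2.x still accepts it but marks it deprecated in favor of `model_validator`). How the pipeline itself invokes the class is not shown in this document, so the direct instantiation here is only an illustration.

```python
from typing import Any

from pydantic import BaseModel, ValidationError, root_validator


class CustomUserSchema(BaseModel):
    id: int
    conversation: list[dict[str, Any]]
    taxonomy: list[dict[str, Any]]
    annotation_type: list[str]
    language: list[str]
    tags: list[str]

    @root_validator(pre=True)
    def check_non_empty_lists(cls, values):
        if not values.get('id'):
            raise ValueError('id cannot be empty')
        return values


# Shortened copy of the sample record from this page.
record = {
    "id": 130426,
    "conversation": [
        {
            "role": "user",
            "content": "I am trying to get the CPU cycles at a specific point in my code.",
        }
    ],
    "taxonomy": [{"category": "Coding", "subcategory": ""}],
    "annotation_type": ["mistral-large"],
    "language": ["en"],
    "tags": ["reannotate", "self-critique"],
}

# A well-formed record is accepted and exposed as typed attributes.
validated = CustomUserSchema(**record)

# A record with an empty id is rejected by the pre-validator.
bad_record = {**record, "id": None}
try:
    CustomUserSchema(**bad_record)
    rejected = False
except ValidationError:
    rejected = True
```

Note that the `not values.get('id')` check also rejects an id of 0, since 0 is falsy in Python.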
Sample YAML configuration to use a custom schema defined in custom_schemas.py¶
schema_config:
  schema: grasp.validators.custom_schemas.CustomUserSchema
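A dotted path like the one in the schema key can be turned into a class object with Python's `importlib`. The sketch below is an assumption about the general mechanism, not the library's actual loader: `load_schema_class` is a hypothetical helper, and a standard-library class stands in for grasp.validators.custom_schemas.CustomUserSchema so that the snippet runs on its own.

```python
import importlib


def load_schema_class(dotted_path: str) -> type:
    """Resolve 'package.module.ClassName' to the class object (hypothetical helper)."""
    module_path, _, class_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"Expected 'module.ClassName', got {dotted_path!r}")
    # import_module raises ImportError when the module part of the path is wrong.
    module = importlib.import_module(module_path)
    try:
        return getattr(module, class_name)
    except AttributeError as exc:
        raise ValueError(f"{class_name!r} not found in {module_path!r}") from exc


# A standard-library class stands in for the custom schema path here.
schema_cls = load_schema_class("collections.OrderedDict")
```

Failing early with a clear message when the path is malformed matches the rule on this page that an invalid schema path raises an exception.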
Sample YAML configuration to define the schema in YAML¶
schema_config:
  fields:
    - name: id
      type: int
      is_greater_than: 99999
    - name: conversation
      type: list[dict[str, any]]
    - name: taxonomy
      type: list[dict[str, any]]
    - name: annotation_type
      type: list[str]
    - name: language
      type: list[str]
    - name: tags
      type: list[str]
fields is expected to be a list of dicts. Each dict must contain name and type, and can optionally contain a validation key. In the example above, is_greater_than is a validation key, shown for demonstration purposes, which ensures that the id key in each record has a value greater than 99999, that is, a value with 6 digits or more.
Additional Validation Rules Supported¶
The following validation rules are currently supported and can be used directly:
- is_greater_than: Ensures the value in a given field is greater than the value provided by the user in the schema definition.
- is_equal_to: Ensures the value in a given field is exactly the same as the value provided by the user in the schema definition.
- is_less_than: Ensures the value in a given field is less than the value provided by the user in the schema definition.
More rules will be added in subsequent releases for users to use directly in their schema.
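The three rules map naturally onto Python's comparison operators. The following is a minimal sketch of how rule keys in a field's dict could be applied to a record, assuming a simple lookup table; `RULES` and `check_rules` are hypothetical names, not part of the library.

```python
import operator

# Rule name -> comparison function; the names mirror the three rules listed above.
RULES = {
    "is_greater_than": operator.gt,
    "is_equal_to": operator.eq,
    "is_less_than": operator.lt,
}


def check_rules(record: dict, field_spec: dict) -> list[str]:
    """Return error messages for every rule in field_spec that the record fails."""
    name = field_spec["name"]
    value = record[name]
    errors = []
    for key, expected in field_spec.items():
        # Keys such as 'name' and 'type' are not rules and are skipped.
        if key in RULES and not RULES[key](value, expected):
            errors.append(f"{name}: {key} {expected} failed for value {value!r}")
    return errors


spec = {"name": "id", "type": "int", "is_greater_than": 99999}
ok_errors = check_rules({"id": 130426}, spec)  # 130426 > 99999, so nothing is reported
bad_errors = check_rules({"id": 4521}, spec)   # 4521 is not > 99999, so one error is reported
```

Keeping the rules in a table means a new rule only needs one new entry, which fits the stated plan to add more rules in later releases.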
Rules for using schema validation¶
Now that we have covered an example of how to define and use a schema, here are some rules to keep in mind:
- Schema validation is skipped if the schema_config key is not present in graph_config.yaml. In that case it is assumed that the user does not want schema validation.
- If the schema_config key is present in graph_config.yaml, either the schema or the fields key must be present inside schema_config and be defined correctly. If both are absent, or if the schema path or the fields definition is invalid, an exception is raised.
- The type values defined in either custom_schemas.py or inside fields must be valid Python types. A typo in a type, for example lisr instead of list, raises an invalid type error that stops pipeline execution, and the user has to correct the definition.
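The three rules can be summarized as a small decision procedure. The sketch below operates on an already-parsed config dict and is only an illustration under stated assumptions: `check_schema_config` and `resolve_outer_type` are hypothetical helpers, the exception types are chosen for the example, and only the outermost type name (the part before any `[`) is checked.

```python
import builtins


def resolve_outer_type(type_name: str) -> type:
    """Map the outer part of a type string such as 'list[str]' to a builtin type."""
    outer = type_name.split("[", 1)[0].strip()
    candidate = getattr(builtins, outer, None)
    if not isinstance(candidate, type):
        raise TypeError(f"Invalid type: {type_name!r}")
    return candidate


def check_schema_config(graph_config: dict) -> bool:
    """Return False when validation is skipped, True when the config looks usable."""
    if "schema_config" not in graph_config:
        return False  # rule 1: no schema_config, so validation is skipped
    schema_config = graph_config["schema_config"] or {}
    if "schema" in schema_config:
        return True  # the dotted path would be resolved separately
    if "fields" in schema_config:
        for field in schema_config["fields"]:
            resolve_outer_type(field["type"])  # rule 3: a typo such as 'lisr' raises
        return True
    # rule 2: schema_config is present but defines neither key
    raise ValueError("schema_config must define either 'schema' or 'fields'")


skipped = check_schema_config({})
usable = check_schema_config(
    {"schema_config": {"fields": [{"name": "tags", "type": "list[str]"}]}}
)
```

Running these checks once, before any records are processed, stops the pipeline before it validates data against a schema that can never be satisfied.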