GPTHuggingfaceDatasetConfig
Module: fast_llm.data.preparation.gpt_memmap.config
Fields
path (core)
Type: str or Path
Default: None
Name or path of the dataset.
config_name (optional)
Type: str or None
Default: None
Specific configuration name for the dataset.
data_directory (optional)
Type: str or None
Default: None
data_dir argument passed to load_dataset.
data_files (optional)
Type: str or list[str] or None
Default: None
data_files argument passed to load_dataset.
data_type (optional)
Type: DataType or None
Default: None
Data type of the dataset field. If not provided, it is inferred from the tokenizer vocabulary size.
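The inference rule for data_type is not spelled out on this page. As a rough illustration only, the sketch below assumes the rule is "pick the smallest unsigned integer type that can hold every token ID". Both that rule and the helper name infer_token_dtype are assumptions, not taken from the Fast-LLM source.

```python
def infer_token_dtype(vocab_size: int) -> str:
    """Return the name of the smallest unsigned integer type that can hold
    every token ID in the range [0, vocab_size - 1].

    Hypothetical helper: the rule actually used by Fast-LLM may differ.
    """
    if vocab_size <= 0:
        raise ValueError("vocab_size must be positive")
    max_token_id = vocab_size - 1
    # Try each width in increasing order and stop at the first that fits.
    for bits in (8, 16, 32, 64):
        if max_token_id < 2**bits:
            return f"uint{bits}"
    raise ValueError("vocab_size does not fit in 64 bits")
```

Under this assumed rule, a 50,257-entry vocabulary fits in 16 bits, while a 128,256-entry vocabulary needs 32. Setting data_type explicitly avoids relying on the inference.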
disable_disk_space_check (optional)
Type: bool
Default: False
Disable the disk space check. Useful for environments where disk space is not accurately reported.
load_from_disk (feature)
Type: bool
Default: False
Use the load_from_disk method for datasets saved with save_to_disk.
source_schema (optional)
Type: LanguageModelSourceConfig
Default: (sub-fields optional)
Configuration for the data source.
split (optional)
Type: str
Default: "train"
Split of the dataset to use.
trust_remote_code (optional)
Type: bool
Default: False
Trust remote code when downloading the dataset.
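Several of these fields are described as arguments passed through to the Hugging Face datasets library. The sketch below shows one plausible mapping onto the keyword arguments of load_dataset and load_from_disk. The HFDatasetFields class and describe_load_call function are hypothetical stand-ins, the mapping of config_name to name is an assumption, and the code only builds the arguments without calling the library.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Optional, Union


@dataclass
class HFDatasetFields:
    """Hypothetical stand-in for a subset of the fields documented above."""

    path: Union[str, Path, None] = None
    config_name: Optional[str] = None
    data_directory: Optional[str] = None
    data_files: Union[str, list[str], None] = None
    split: str = "train"
    trust_remote_code: bool = False
    load_from_disk: bool = False


def describe_load_call(fields: HFDatasetFields) -> tuple[str, dict[str, Any]]:
    """Return the datasets function name and keyword arguments that these
    fields would plausibly translate to (assumed mapping, not confirmed)."""
    if fields.path is None:
        raise ValueError("path must be set")
    if fields.load_from_disk:
        # Datasets written with save_to_disk are read back from a directory.
        # How the split is then selected is not shown here.
        return "load_from_disk", {"dataset_path": str(fields.path)}
    return "load_dataset", {
        "path": str(fields.path),
        "name": fields.config_name,  # assumed: config_name maps to name
        "data_dir": fields.data_directory,
        "data_files": fields.data_files,
        "split": fields.split,
        "trust_remote_code": fields.trust_remote_code,
    }


if __name__ == "__main__":
    # "my_org/my_dataset" and "subset_a" are placeholder values.
    example = HFDatasetFields(path="my_org/my_dataset", config_name="subset_a")
    print(describe_load_call(example))
```

Returning the call description instead of invoking the library keeps the sketch runnable without network access or the datasets package installed.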