Data Handlers¶
Components¶
Base Interface (DataHandler)¶
The abstract base class that defines the core interface for all data handlers:
- read(): Read data from a source
- write(): Write data to a destination
- get_files(): List available files in the data source
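A minimal sketch of this interface is shown below. It assumes Python's abc module and the call patterns used in the examples later on this page; the optional path arguments and exact signatures are assumptions, not the library's actual definitions.
from abc import ABC, abstractmethod
from typing import Any, List, Optional

class DataHandler(ABC):
    """Abstract base class for all data handlers (illustrative sketch)."""

    @abstractmethod
    def read(self, path: Optional[str] = None) -> Any:
        """Read data from the configured source."""

    @abstractmethod
    def write(self, data: Any, path: Optional[str] = None) -> None:
        """Write data to the configured destination."""

    @abstractmethod
    def get_files(self) -> List[str]:
        """List available files in the data source."""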
HuggingFace Handler¶
Specializes in interacting with HuggingFace datasets, supporting:
- Reading from public/private datasets
- Streaming large datasets
- Sharded dataset handling
- Dataset card (README) management
- Multiple data splits
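For example, a source that streams a private dataset might be configured as follows. The token and streaming keys are assumed to map onto the DataSourceConfig fields of the same name described under Source Configuration below.
data_config:
  source:
    type: "hf"
    repo_id: "your-username/private-dataset"
    split: "train"
    token: "your_hf_token"   # assumed key, mirrors DataSourceConfig.token
    streaming: true          # assumed key, mirrors DataSourceConfig.streaming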
File System Handler¶
Manages local file operations with support for:
- JSON, JSONL (JSON Lines), Parquet files
- Special data type handling (datetime, numpy arrays)
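As a sketch of the special data type handling, the snippet below assumes that records are plain dicts and that the file handler serializes datetime and numpy values when writing JSON; the import path and write call mirror the local-file examples later on this page.
from datetime import datetime

import numpy as np

from grasp.core.dataset.dataset_config import OutputConfig
from data_handlers import FileHandler  # import path as used in the examples below

# Records containing values that plain json.dump cannot serialize;
# the handler's special data type handling is expected to convert them.
records = [
    {"id": 1, "created_at": datetime(2024, 1, 1, 12, 0), "embedding": np.array([0.1, 0.2])},
]

handler = FileHandler(output_config=OutputConfig(encoding="utf-8"))
handler.write(records, path="data/special_types.json")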
Configuration¶
The YAML data configuration consists of the following:
- data_config: Defines the data source and sink configurations with source and sink keys.
- The source key specifies the input data config, and the sink key specifies the output data config.
Basic Structure¶
data_config:
  source:
    type: "hf" # or "disk" for local filesystem
    # source-specific configurations
  sink:
    type: "hf" # or "disk" for local filesystem
    # sink-specific configurations
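As a concrete example, a configuration that reads a public HuggingFace dataset and writes the result to a local JSONL file combines the source and sink keys shown in the use cases and examples below:
data_config:
  source:
    type: "hf"
    repo_id: "google-research-datasets/mbpp"
    config_name: "sanitized"
    split: "train"
  sink:
    type: "disk"
    file_path: "data/output.jsonl"
    encoding: "utf-8"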
Use Cases¶
1. Reading from HuggingFace Public Datasets¶
data_config:
  source:
    type: "hf"
    repo_id: "google-research-datasets/mbpp"
    config_name: "sanitized"
    split: ["train", "validation", "prompt"]
Source Configuration¶
Configure your data source using DataSourceConfig:
from grasp.core.dataset.dataset_config import DataSourceConfig
# For HuggingFace datasets
hf_config = DataSourceConfig(
    repo_id="datasets/dataset-name",
    config_name="default",
    split="train",
    token="your_hf_token",  # Optional for private datasets
    streaming=True,         # Optional for large datasets
    shard=None              # Optional for sharded datasets
)
# For local files
file_config = DataSourceConfig(
    file_path="/path/to/data.json",
    encoding="utf-8",  # Optional, defaults to utf-8
)
Output Configuration¶
Configure your output destination using OutputConfig:
from grasp.core.dataset.dataset_config import OutputConfig
# For HuggingFace datasets
hf_output = OutputConfig(
    repo_id="your-username/your-dataset",
    config_name="default",
    split="train",
    token="your_hf_token",
    private=True  # Optional, defaults to False
)
# For local files
file_output = OutputConfig(
    encoding="utf-8"  # Optional, defaults to utf-8
)
Usage Examples¶
Working with HuggingFace Datasets¶
- Reading from a public dataset:
YAML:
data_config:
  source:
    type: "hf"
    repo_id: "google-research-datasets/mbpp"
    config_name: "sanitized"
    split: ["train", "validation", "prompt"]
Python:
from grasp.core.dataset.huggingface_handler import HuggingFaceHandler
from grasp.core.dataset.dataset_config import DataSourceConfig
# Configure source
config = DataSourceConfig(
    repo_id="databricks/databricks-dolly-15k",
    config_name="default",
    split="train"
)
# Initialize handler
handler = HuggingFaceHandler(source_config=config)
# Read data
data = handler.read()
- Writing to your private dataset:
YAML:
data_config:
  sink:
    type: "hf"
    repo_id: "your-username/your-dataset"
    config_name: "custom_config"
    split: "train"
    push_to_hub: true
    private: true
Python:
# Configure output
output_config = OutputConfig(
    repo_id="your-username/your-dataset",
    config_name="default",
    split="train",
    token="your_hf_token",
    private=True
)
handler = HuggingFaceHandler(output_config=output_config)
handler.write(data)
- Working with sharded datasets:
YAML:
data_config:
  source:
    type: "hf"
    repo_id: "large-dataset"
    shard:
      regex: "-.*\\.parquet$"
      index: [0, 1, 2] # Only process first 3 shards
Python:
config = DataSourceConfig(
    repo_id="large-dataset",
    shard={"regex": "-.*\\.parquet$", "index": [0, 1, 2]}  # Only process first 3 shards
)
handler = HuggingFaceHandler(source_config=config)
shard_files = handler.get_files()
for shard_path in shard_files:
    shard_data = handler.read(path=shard_path)
    # Process shard data
Field Transformations¶
YAML:
data_config:
  source:
    type: "hf"
    repo_id: "dataset/name"
    transformations:
      - transform: grasp.processors.data_transform.RenameFieldsTransform
        params:
          mapping:
            old_field: new_field
          overwrite: false
Python:
from grasp.processors.data_transform import RenameFieldsTransform
config = DataSourceConfig(
    repo_id="dataset/name",
    transformations=[
        {
            "transform": RenameFieldsTransform,
            "params": {
                "mapping": {"old_field": "new_field"},
                "overwrite": False
            }
        }
    ]
)
handler = HuggingFaceHandler(source_config=config)
data = handler.read()
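As a rough illustration of the effect (not the library's implementation): the mapping renames old_field to new_field in each record, and the overwrite flag presumably controls whether an existing target field may be replaced.
# Illustrative only: the effect of renaming "old_field" to "new_field" on one record
record = {"old_field": "value", "other": 1}
renamed = {("new_field" if key == "old_field" else key): value for key, value in record.items()}
# renamed == {"new_field": "value", "other": 1}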
Working with Local Files¶
- Reading from JSON/Parquet/JSONL files:
YAML:
data_config:
  source:
    type: "disk"
    file_path: "data/input.json" # also supports Parquet, JSONL
Python:
from data_handlers import FileHandler
config = DataSourceConfig(file_path="/data/input.parquet")
handler = FileHandler(source_config=config)
data = handler.read()
- Writing to JSONL with custom encoding:
YAML:
data_config:
  sink:
    type: "disk"
    file_path: "data/output.jsonl"
    encoding: "utf-16"
Python:
output_config = OutputConfig(encoding="utf-16")
handler = FileHandler(output_config=output_config)
handler.write(data, path="/data/output.jsonl")
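Putting the pieces together, a minimal sketch of reading from HuggingFace and writing locally might look like the following; it assumes that read() returns records that FileHandler.write() accepts directly.
from grasp.core.dataset.dataset_config import DataSourceConfig, OutputConfig
from grasp.core.dataset.huggingface_handler import HuggingFaceHandler
from data_handlers import FileHandler

# Read a split from a public HuggingFace dataset
source_config = DataSourceConfig(
    repo_id="google-research-datasets/mbpp",
    config_name="sanitized",
    split="train",
)
data = HuggingFaceHandler(source_config=source_config).read()

# Write the records to a local JSONL file
output_config = OutputConfig(encoding="utf-8")
FileHandler(output_config=output_config).write(data, path="data/mbpp_train.jsonl")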