HuggingFace Handler

The HuggingFace handler specializes in interacting with HuggingFace datasets, supporting:

  • Reading from public/private datasets
  • Streaming large datasets
  • Sharded dataset handling
  • Dataset card (README) management
  • Multiple data splits

Reading from Public HuggingFace Datasets

YAML:

data_config:
  source:
    type: "hf"
    repo_id: "google-research-datasets/mbpp"
    config_name: "sanitized"
    split: ["train", "validation", "prompt"]

Python:

from sygra.core.dataset.huggingface_handler import HuggingFaceHandler
from sygra.core.dataset.dataset_config import DataSourceConfig

# Configure source
config = DataSourceConfig(
    repo_id="databricks/databricks-dolly-15k",
    config_name="default",
    split="train"
)

# Initialize handler
handler = HuggingFaceHandler(source_config=config)

# Read data
data = handler.read()
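
The YAML example above requests several splits at once. Assuming split also accepts a list in Python, as the YAML form suggests, the equivalent configuration would be:

# Assumed: split accepts a list of split names, mirroring the YAML above
config = DataSourceConfig(
    repo_id="google-research-datasets/mbpp",
    config_name="sanitized",
    split=["train", "validation", "prompt"]
)
handler = HuggingFaceHandler(source_config=config)
data = handler.read()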

Writing to Your Private Dataset

YAML:

data_config:
  sink:
    type: "hf"
    repo_id: "your-username/your-dataset"
    config_name: "custom_config"
    split: "train"
    push_to_hub: true
    private: true

Python:

from sygra.core.dataset.dataset_config import OutputConfig
from sygra.core.dataset.huggingface_handler import HuggingFaceHandler

# Configure the sink (push_to_hub mirrors the YAML key above)
output_config = OutputConfig(
    repo_id="your-username/your-dataset",
    config_name="default",
    split="train",
    push_to_hub=True,
    token="your_hf_token",  # see the environment-variable pattern below
    private=True
)

handler = HuggingFaceHandler(output_config=output_config)
handler.write(data)  # pushes to the Hub when push_to_hub is enabled
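
Hardcoded tokens leak easily. A safer pattern, using only standard Python, reads the token from an environment variable (HF_TOKEN is the variable the Hugging Face libraries conventionally honor):

import os

# export HF_TOKEN=... in the shell first; keeps secrets out of source control
output_config = OutputConfig(
    repo_id="your-username/your-dataset",
    config_name="default",
    split="train",
    push_to_hub=True,
    token=os.environ["HF_TOKEN"],
    private=True
)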

Working with Sharded Datasets

YAML:

data_config:
  source:
    type: "hf"
    repo_id: "large-dataset"
    shard:
      regex: "-.*\\.parquet$"
      index: [0, 1, 2]

Python:

from sygra.core.dataset.dataset_config import DataSourceConfig
from sygra.core.dataset.huggingface_handler import HuggingFaceHandler

# Select shard files by filename regex and by shard index
config = DataSourceConfig(
    repo_id="large-dataset",
    shard={"regex": "-.*\\.parquet$", "index": [0, 1, 2]}
)

handler = HuggingFaceHandler(source_config=config)
shard_files = handler.get_files()

# Read each selected shard individually
for shard_path in shard_files:
    shard_data = handler.read(path=shard_path)
    # Process shard data
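
Streaming Large Datasets

Streaming is listed among the handler's capabilities, but its handler-level configuration is not documented here. As a sketch of the underlying pattern, the datasets library can stream records lazily; note this calls datasets directly rather than the handler's API:

Python:

from datasets import load_dataset

# streaming=True returns an IterableDataset; records are fetched lazily
# instead of downloading the whole dataset up front
stream = load_dataset(
    "google-research-datasets/mbpp",
    name="sanitized",
    split="train",
    streaming=True
)

for record in stream:
    print(record)  # each record is a plain dict
    break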

Field Transformations

YAML:

data_config:
  source:
    type: "hf"
    repo_id: "dataset/name"
    transformations:
      - transform: sygra.processors.data_transform.RenameFieldsTransform
        params:
          mapping:
            old_field: new_field
          overwrite: false

Python:

from sygra.processors.data_transform import RenameFieldsTransform
from sygra.core.dataset.dataset_config import DataSourceConfig
from sygra.core.dataset.huggingface_handler import HuggingFaceHandler

# Rename old_field to new_field as the data is read
config = DataSourceConfig(
    repo_id="dataset/name",
    transformations=[
        {
            "transform": RenameFieldsTransform,
            "params": {
                "mapping": {"old_field": "new_field"},
                "overwrite": False
            }
        }
    ]
)

handler = HuggingFaceHandler(source_config=config)
data = handler.read()
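
Transformations configured on the source are applied as part of read(), so the returned records already contain new_field; with overwrite set to false, a pre-existing new_field is left untouched.

Dataset Card (README) Management

Dataset card management is listed among the handler's capabilities. The handler-level API is not shown here, so this sketch uses huggingface_hub directly to load and update a card; the repo ID and token are placeholders:

Python:

from huggingface_hub import DatasetCard

# Load the existing card (the repo's README.md)
card = DatasetCard.load("your-username/your-dataset")

# Append a section and push the updated card back to the Hub
card.text += "\n## Notes\nRegenerated with SyGra.\n"
card.push_to_hub("your-username/your-dataset", token="your_hf_token")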