
Audio to Text Data Generation

This module introduces support for multimodal data generation pipelines that convert audio to text. SyGra supports two distinct approaches for audio-to-text conversion:

  1. Audio Understanding LLMs - Models like Qwen2-Audio-7B that can reason about, analyze, and answer questions about audio content
  2. Dedicated Transcription Models - Models like Whisper and gpt-4o-transcribe, optimized specifically for accurate speech-to-text conversion

Note: For gpt-4o-audio multimodal generation, see the GPT-4o Audio documentation.

Key Features

Audio Understanding LLMs

  • Supports audio-only and audio+text prompts
  • Audio reasoning, classification, and Q&A capabilities
  • Uses standard chat completions API
  • Contextual understanding of audio content

Dedicated Transcription Models

  • Accurate speech-to-text conversion
  • Multilingual support (50+ languages)
  • Multiple output formats (JSON, SRT, VTT, text)
  • Word and segment-level timestamps
  • Optimized for transcription accuracy

Common Features

  • Converts audio fields into base64-encoded data URLs compatible with LLM APIs
  • Compatible with HuggingFace datasets, streaming, and on-disk formats
  • Automatically handles lists of audio per field
  • Seamless round-tripping between loading, prompting, and output publishing

Choosing the Right Approach

Use Case                                   Recommended Approach
Accurate speech-to-text transcription      Transcription Models
Generating subtitles with timestamps       Transcription Models
Multilingual transcription                 Transcription Models
Audio classification or event detection    Audio Understanding LLMs
Answering questions about audio            Audio Understanding LLMs
Audio reasoning or analysis                Audio Understanding LLMs
Combining audio with text context          Audio Understanding LLMs

Part 1: Audio Understanding with LLMs

This section covers audio understanding using LLMs like Qwen2-Audio-7B that can reason about audio content.

Supported Audio Input Types

Each audio field in a dataset record may be one of the following:

  • Local file path (e.g., "data/aud.wav")
  • Supported Extensions: .wav, .flac, .ogg, .mp3, .m4a, .aac, .aiff
  • HTTP(S) URL (e.g., "https://example.com/audio.wav")
  • Raw bytes
  • HuggingFace datasets.Audio object
  • Dictionary: { "bytes": <byte_data> }
  • A list of any of the above
  • A base64-encoded data URL (e.g., "data:audio/wav;base64,...")
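
For illustration, a single audio column can mix several of these forms. The records below are hypothetical (paths, URLs, and the truncated base64 payload are placeholders), shown here as Python literals:

records = [
    {"id": "1", "audio": "data/000001.wav"},                      # local file path
    {"id": "2", "audio": "https://example.com/audio.wav"},        # HTTP(S) URL
    {"id": "3", "audio": open("data/000001.wav", "rb").read()},   # raw bytes
    {"id": "4", "audio": {"bytes": b"..."}},                      # dictionary with byte data
    {"id": "5", "audio": ["data/000001.wav", "data/000002.wav"]}, # list of audio inputs
    {"id": "6", "audio": "data:audio/wav;base64,..."},            # pre-encoded data URL
]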

Input Source: Local Disk Dataset

Supports .json, .jsonl, or .parquet datasets with local or remote audio paths.

File Layout

project/
├── data/
│   ├── 000001.wav
│   ├── 000002.wav
│   └── input.json

data/input.json

[
  { "id": "1", "audio": "data/000001.wav" },
  { "id": "2", "audio": "https://example.com/audio.wav" }
]

Configuration

data_config:
  source:
    type: "disk"
    file_path: "data/input.json"

  • Local paths are resolved relative to file_path.
  • Remote URLs are fetched and encoded to base64 automatically.

Input Source: HuggingFace Dataset

Supports datasets hosted on the HuggingFace Hub in streaming or download mode.

Example Record

{ "id": "1", "audio": "HuggingFace datasets.Audio object or URL" }

Configuration

data_config:
  source:
    type: "hf"
    repo_id: "myorg/my-dataset"
    config_name: "default"
    split: "train"
    streaming: true

  • Handles both datasets.Audio fields and string URLs.
  • Audio is resolved and encoded to base64.

Multiple Audio Fields

If a record has more than one audio field (e.g., "bird_sounds" and "animal_sounds"), reference them individually:

- type: audio_url
  audio_url: "{bird_sounds}"
- type: audio_url
  audio_url: "{animal_sounds}"

How Audio Transformation Works

  1. Detects audio-like fields from supported types.
  2. Converts each to a base64-encoded data:audio/... string.
  3. Expands fields containing a list of audio clips into multiple prompt entries.

    Input:

    { "audio": ["data/000001.wav", "data/000002.wav"] }
    

    Prompt config:

    - type: audio_url
      audio_url: "{audio}"
    

    Will expand to:

    - type: audio_url
      audio_url: "data:audio/wav;base64,..."
    - type: audio_url
      audio_url: "data:audio/wav;base64,..."
    
  4. Leaves already-encoded data URLs unchanged (see the sketch below).
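
Conceptually, this transformation behaves like the following standalone sketch (illustrative only; this is not the actual SyGra internals, and load_audio_as_data_url and expand_audio_field are hypothetical helpers):

import base64
import mimetypes
import urllib.request
from pathlib import Path

def load_audio_as_data_url(value):
    """Turn one audio reference into a base64-encoded data URL."""
    if isinstance(value, str) and value.startswith("data:audio/"):
        return value                                    # already encoded: pass through
    if isinstance(value, str) and value.startswith(("http://", "https://")):
        raw = urllib.request.urlopen(value).read()      # remote URL: download the bytes
        mime = "audio/wav"
    elif isinstance(value, (bytes, bytearray)):
        raw, mime = bytes(value), "audio/wav"           # raw bytes: assume WAV
    elif isinstance(value, dict) and "bytes" in value:
        raw, mime = value["bytes"], "audio/wav"         # {"bytes": ...} dictionary
    else:
        raw = Path(value).read_bytes()                  # local file path
        mime = mimetypes.guess_type(str(value))[0] or "audio/wav"
    return f"data:{mime};base64,{base64.b64encode(raw).decode()}"

def expand_audio_field(value):
    """A list of audio inputs expands into one prompt entry per clip."""
    items = value if isinstance(value, list) else [value]
    return [{"type": "audio_url", "audio_url": load_audio_as_data_url(v)} for v in items]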


HuggingFace Sink Round-Tripping

When saving output back to HuggingFace datasets:

sink:
  type: "hf"
  repo_id: "<your_repo>"
  config_name: "<your_config>"
  split: "train"
  push_to_hub: true
  private: true
  token: "<hf_token>"

Each field that originally contained a data:audio/... base64 string will be:

  • Decoded back into a HuggingFace datasets.Audio object.
  • Stored in its native audio format in the output dataset.
  • Uploaded to the dataset repo as proper audio entries (not strings).
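
Conceptually, the decode step resembles this sketch built on the datasets library (illustrative only; the real sink performs this internally):

import base64
import io
import wave

from datasets import Audio, Dataset

# Build a tiny silent WAV so the sketch is self-contained; real pipelines carry genuine audio.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)  # one second of silence
data_url = "data:audio/wav;base64," + base64.b64encode(buf.getvalue()).decode()

# Decode the data URL back into the dict form that the datasets Audio feature accepts.
payload = base64.b64decode(data_url.split(",", 1)[1])
ds = Dataset.from_list([{"id": "1", "audio": {"bytes": payload, "path": None}}])
ds = ds.cast_column("audio", Audio())  # stored as a real Audio feature, not a string
# ds.push_to_hub("<your_repo>", config_name="<your_config>", private=True, token="<hf_token>")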


Example Configuration: Identify the animal in the audio

data_config:
  source:
    type: "hf"
    repo_id: "datasets-examples/doc-audio-1"
    split: "train"
    streaming: true

  sink:
    type: "hf"
    repo_id: ServiceNow-AI/SyGra
    config_name: MM-doc-audio-1
    split: train
    push_to_hub: true
    private: true
    token: "<hf_token>"

graph_config:
  nodes:
    identify_animal:
      output_keys: animal
      node_type: llm
      prompt:
        - user:
            - type: text
              text: |
                Identify the animal in the provided audio.
            - type: audio_url
              audio_url: "{audio}"

      model:
        name: qwen_2_audio_7b
        parameters:
          max_tokens: 1000
          temperature: 0.3
  edges:
    - from: START
      to: identify_animal
    - from: identify_animal
      to: END

output_config:
    output_map:
        id:
          from: "id"
        audio:
          from: "audio"
        animal:
          from: "animal"

Part 2: Speech-to-Text Transcription

This section covers dedicated transcription models optimized for accurate speech-to-text conversion.

Supported Transcription Models

  • whisper-1 - OpenAI's Whisper model, general-purpose transcription
  • gpt-4o-transcribe - OpenAI's GPT-4o-based transcription model with improved accuracy

Transcription Model Configuration

Configure the transcription model in your sygra/config/models.yaml:

transcribe:
  model: gpt-4o-transcribe  # or whisper-1
  input_type: audio  # Required for transcription routing
  model_type: azure_openai  # or openai
  api_version: 2025-03-01-preview
  # URL and auth_token from environment variables:
  # SYGRA_TRANSCRIBE_URL and SYGRA_TRANSCRIBE_TOKEN
  parameters:
    language: en  # Optional: ISO-639-1 language code
    response_format: json  # json, verbose_json, text, srt, vtt
    temperature: 0  # 0-1, controls randomness

Critical Configuration: input_type: audio

Transcription requires input_type: audio in the model configuration to route to the transcription API:

# ✓ Correct - Routes to transcription API
transcribe:
  model: whisper-1
  input_type: audio
  model_type: openai

# ✗ Incorrect - Will not route to transcription API
transcribe:
  model: whisper-1
  model_type: openai

Supported Languages

Transcription models support 50+ languages including:

Language    Code    Language      Code
English     en      Spanish       es
French      fr      German        de
Italian     it      Portuguese    pt
Dutch       nl      Russian       ru
Chinese     zh      Japanese      ja
Korean      ko      Arabic        ar
Hindi       hi      Turkish       tr

For a complete list, see OpenAI Whisper Documentation.

Response Formats

Format         Description                                 Use Case
json           JSON with transcribed text only             Simple transcription
verbose_json   JSON with text, timestamps, and metadata    Detailed analysis
text           Plain text only                             Direct text output
srt            SubRip subtitle format with timestamps      Video subtitles
vtt            WebVTT subtitle format with timestamps      Web video subtitles

Example Outputs

JSON Format:

{
  "text": "Hello, how are you today?"
}

Verbose JSON Format:

{
  "task": "transcribe",
  "language": "english",
  "duration": 2.5,
  "text": "Hello, how are you today?",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 2.5,
      "text": " Hello, how are you today?",
      "temperature": 0.0,
      "avg_logprob": -0.2
    }
  ]
}

SRT Format:

1
00:00:00,000 --> 00:00:02,500
Hello, how are you today?
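
VTT Format (for reference, the same content in WebVTT syntax):

WEBVTT

00:00:00.000 --> 00:00:02.500
Hello, how are you today?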

Transcription Example Configuration

Based on tasks/examples/transcription_apis/graph_config.yaml:

Input Data (test.json)

[
  {
    "id": "1",
    "audio": "/path/to/audio/meeting_recording.mp3"
  },
  {
    "id": "2",
    "audio": "/path/to/audio/interview.wav"
  }
]

Graph Configuration

data_config:
  source:
    type: "disk"
    file_path: "tasks/examples/transcription_apis/test.json"

graph_config:
  nodes:
    audio_to_text:
      output_keys: transcription
      node_type: llm
      prompt:
        - user:
            - type: audio_url
              audio_url: "{audio}"
      model:
        name: transcribe

  edges:
    - from: START
      to: audio_to_text
    - from: audio_to_text
      to: END

output_config:
  output_map:
    id:
      from: id
    audio:
      from: audio
    transcription:
      from: transcription

Output

[
  {
    "id": "1",
    "audio": "/path/to/audio/meeting_recording.mp3",
    "transcription": "Welcome everyone to today's meeting. Let's start with the agenda..."
  },
  {
    "id": "2",
    "audio": "/path/to/audio/interview.wav",
    "transcription": "Thank you for joining us today. Can you tell us about your background?"
  }
]

Advanced Transcription Features

Language Specification

Specifying the language improves accuracy and speed:

model:
  name: transcribe
  parameters:
    language: es  # Spanish
    response_format: json
    temperature: 0

Timestamps (Verbose JSON)

For detailed timestamp information:

model:
  name: transcribe
  parameters:
    response_format: verbose_json
    timestamp_granularities: ["word", "segment"]  # Word and segment-level timestamps
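
With word-level granularity enabled, the verbose JSON response also includes a words array. An illustrative (abridged) example:

{
  "task": "transcribe",
  "language": "english",
  "duration": 2.5,
  "text": "Hello, how are you today?",
  "words": [
    { "word": "Hello", "start": 0.0, "end": 0.42 },
    { "word": "how", "start": 0.55, "end": 0.7 }
  ]
}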

Context Prompt

Provide context to improve accuracy on specific terms:

prompt:
  - user:
      - type: audio_url
        audio_url: "{audio}"
      - type: text
        text: "The audio contains technical terms like Kubernetes, Docker, and CI/CD."

The text prompt is automatically passed as the prompt parameter to the transcription API.
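
Under the hood, this corresponds roughly to the following call against the OpenAI Python SDK (a simplified sketch; SyGra's client wiring and parameter passing may differ):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("meeting_recording.mp3", "rb") as audio_file:
    result = client.audio.transcriptions.create(
        model="whisper-1",  # or "gpt-4o-transcribe"
        file=audio_file,
        language="en",
        response_format="json",
        temperature=0,
        prompt="The audio contains technical terms like Kubernetes, Docker, and CI/CD.",
    )
print(result.text)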

Comparison: Transcription vs Audio-Understanding LLMs

Feature            Transcription Models                               Audio LLMs (Qwen2-Audio)
Primary Use        Speech-to-text conversion                          Audio understanding, reasoning, Q&A
API Endpoint       audio.transcriptions.create                        chat.completions.create
Output             Transcribed text only                              Contextual text responses
Timestamps         Yes (word/segment level)                           No
Multiple Formats   Yes (JSON, SRT, VTT, text)                         No (text only)
Language Support   50+ languages                                      Varies by model
Best For           Accurate transcription, subtitles                  Audio reasoning, classification, Q&A
Configuration      input_type: audio required                         Standard LLM config
Supported Audio    MP3, MP4, MPEG, MPGA, M4A, WAV, WEBM, FLAC, OGG    Same

Best Practices for Transcription

1. Language Specification

Always specify the language if known:

parameters:
  language: en  # or es, fr, de, etc.

2. Temperature Setting

Use temperature 0 for deterministic transcription:

parameters:
  temperature: 0  # Recommended for transcription

3. Audio Quality

  • Use high-quality audio files (16kHz or higher sample rate)
  • Minimize background noise for better accuracy
  • Ensure clear speech with minimal overlapping speakers

4. Context Prompts

Provide context for technical terms or specific vocabulary:

- type: text
  text: "This audio discusses machine learning models including BERT, GPT, and transformers."

5. File Size Limits

  • Maximum audio file size: 25 MB (OpenAI limit)
  • For longer audio, split it into chunks before transcription (one approach is sketched below)
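
One way to pre-chunk long recordings is with a library such as pydub (an illustrative sketch; the chunk length and file naming are arbitrary, and ffmpeg must be installed):

from pydub import AudioSegment

def split_audio(path: str, chunk_minutes: int = 10) -> list[str]:
    """Split a long recording into smaller files that stay under the size limit."""
    audio = AudioSegment.from_file(path)
    chunk_ms = chunk_minutes * 60 * 1000
    chunk_paths = []
    for i, start in enumerate(range(0, len(audio), chunk_ms)):
        chunk_path = f"{path}.part{i}.mp3"
        audio[start:start + chunk_ms].export(chunk_path, format="mp3")
        chunk_paths.append(chunk_path)
    return chunk_paths

# Each returned chunk path can then be listed as a separate record in the input dataset.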

Notes

  • Audio generation is not supported in this module. The audio_url type is strictly for passing existing audio inputs (e.g., loaded from datasets), not for generating new audio via model output.
  • Transcription models require input_type: audio in model configuration to route to the transcription API.
  • For audio understanding LLM examples, see: tasks/examples/audio_to_text
  • For transcription examples, see: tasks/examples/transcription_apis

See Also