Audio to Text Data Generation¶
This module introduces support for multimodal data generation pipelines that convert audio to text. SyGra supports two distinct approaches for audio-to-text conversion:
- Audio Understanding LLMs - Models like Qwen2-Audio-7B that can reason about, analyze, and answer questions about audio content
- Dedicated Transcription Models - Models like Whisper and gpt-4o-transcribe optimized specifically for accurate speech-to-text conversion
Note: For gpt-4o-audio multimodal generation, see the GPT-4o Audio documentation.
Key Features¶
Audio Understanding LLMs¶
- Supports audio-only and audio+text prompts
- Audio reasoning, classification, and Q&A capabilities
- Uses standard chat completions API
- Contextual understanding of audio content
Dedicated Transcription Models¶
- Accurate speech-to-text conversion
- Multilingual support (50+ languages)
- Multiple output formats (JSON, SRT, VTT, text)
- Word and segment-level timestamps
- Optimized for transcription accuracy
Common Features¶
- Converts audio fields into base64-encoded data URLs compatible with LLM APIs
- Compatible with HuggingFace datasets, streaming, and on-disk formats
- Automatically handles lists of audio files per field
- Seamless round-tripping between loading, prompting, and output publishing
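The data-URL conversion described above can be approximated with a few lines of standard-library Python. This is only a sketch of the idea, not SyGra's actual implementation:

import base64
import mimetypes

def audio_to_data_url(path: str) -> str:
    # Guess the MIME type from the extension (e.g., audio/mpeg for .mp3); fall back to audio/wav.
    mime, _ = mimetypes.guess_type(path)
    mime = mime or "audio/wav"
    # Read the raw bytes and wrap them in a base64 data URL.
    with open(path, "rb") as f:
        payload = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{payload}"

# audio_to_data_url("data/audio.mp3") -> "data:audio/mpeg;base64,..."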
Choosing the Right Approach¶
| Use Case | Recommended Approach |
|---|---|
| Accurate speech-to-text transcription | Transcription Models |
| Generating subtitles with timestamps | Transcription Models |
| Multilingual transcription | Transcription Models |
| Audio classification or event detection | Audio Understanding LLMs |
| Answering questions about audio | Audio Understanding LLMs |
| Audio reasoning or analysis | Audio Understanding LLMs |
| Combining audio with text context | Audio Understanding LLMs |
Part 1: Audio Understanding with LLMs¶
This section covers audio understanding using LLMs like Qwen2-Audio-7B that can reason about audio content.
Supported Audio Input Types¶
Each audio field in a dataset record may be one of the following:
- Local file path (e.g., "data/aud.wav") - Supported Extensions: .wav, .flac, .ogg, .mp3, .m4a, .aac, .aiff
- HTTP(S) URL (e.g., "https://example.com/audio.wav")
- Raw bytes
- HuggingFace datasets.Audio object
- Dictionary: { "bytes": <byte_data> }
- A list of any of the above
- A base64-encoded data URL (e.g., "data:audio/wav;base64,...")
Input Source: Local Disk Dataset¶
Supports .json, .jsonl, or .parquet datasets with local or remote audio paths.
File Layout¶
project/
├── data/
│ ├── 000001.wav
│ ├── 000002.wav
│ └── input.json
data/input.json¶
[
{ "id": "1", "audio": "data/000001.wav" },
{ "id": "2", "audio": "https://example.com/audio.wav" }
]
Configuration¶
data_config:
source:
type: "disk"
file_path: "data/input.json"
- Local paths are resolved relative to file_path.
- Remote URLs are fetched and encoded to base64 automatically.
Input Source: HuggingFace Dataset¶
Supports datasets hosted on the HuggingFace Hub in streaming or download mode.
Example Record¶
{ "id": "1", "audio": "HuggingFace datasets.Audio object or URL" }
Configuration¶
data_config:
source:
type: "hf"
repo_id: "myorg/my-dataset"
config_name: "default"
split: "train"
streaming: true
- Handles both datasets.Audio fields and string URLs.
- Audio is resolved and encoded to base64.
Multiple Audio Fields¶
If a record has more than one audio field (e.g., "bird_sounds" and "animal_sounds"), reference each one individually:
- type: audio_url
audio_url: "{bird_sounds}"
- type: audio_url
audio_url: "{animal_sounds}"
How Audio Transformation Works¶
1. Detects audio-like fields from supported types.
2. Converts each to a base64-encoded data:audio/... string.
3. Internally expands fields containing a list of audio into multiple prompt entries.

Input:

{ "audio": ["data/000001.wav", "data/000002.wav"] }

Prompt config:

- type: audio_url
  audio_url: "{audio}"

Will expand to:

- type: audio_url
  audio_url: "data:audio/wav;base64,..."
- type: audio_url
  audio_url: "data:audio/wav;base64,..."

4. Leaves already-encoded data URLs unchanged.
HuggingFace Sink Round-Tripping¶
When saving output back to HuggingFace datasets:
sink:
type: "hf"
repo_id: "<your_repo>"
config_name: "<your_config>"
split: "train"
push_to_hub: true
private: true
token: "<hf_token>"
Each field that originally contained a data:audio/... base64 string will be:
- Decoded back into a HuggingFace datasets.Audio object.
- Stored in its native audio format in the output dataset.
- Uploaded to the dataset repo as proper audio entries (not strings).
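Conceptually, the decode step resembles the following sketch built on the datasets library. SyGra handles this automatically; the data URL and repo name here are placeholders:

import base64
from datasets import Audio, Dataset

def decode_data_url(data_url: str, out_path: str) -> str:
    # Strip the "data:audio/...;base64," prefix and write the decoded bytes to disk.
    _, payload = data_url.split(",", 1)
    with open(out_path, "wb") as f:
        f.write(base64.b64decode(payload))
    return out_path

# Placeholder: in practice the data URL comes from the pipeline output.
wav_path = decode_data_url("data:audio/wav;base64,...", "decoded.wav")

ds = Dataset.from_dict({"id": ["1"], "audio": [wav_path]}).cast_column("audio", Audio())
# ds.push_to_hub("<your_repo>", token="<hf_token>")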
Example Configuration: Identify the animal in the audio¶
data_config:
source:
type: "hf"
repo_id: "datasets-examples/doc-audio-1"
split: "train"
streaming: true
sink:
type: "hf"
repo_id: ServiceNow-AI/SyGra
config_name: MM-doc-audio-1
split: train
push_to_hub: true
private: true
token: "<hf_token>"
graph_config:
nodes:
identify_animal:
output_keys: animal
node_type: llm
prompt:
- user:
- type: text
text: |
Identify the animal in the provided audio.
- type: audio_url
audio_url: "{audio}"
model:
name: qwen_2_audio_7b
parameters:
max_tokens: 1000
temperature: 0.3
edges:
- from: START
to: identify_animal
- from: identify_animal
to: END
output_config:
output_map:
id:
from: "id"
audio:
from: "audio"
animal:
from: "animal"
Part 2: Speech-to-Text Transcription¶
This section covers dedicated transcription models optimized for accurate speech-to-text conversion.
Supported Transcription Models¶
- whisper-1 - OpenAI's Whisper model, general-purpose transcription
- gpt-4o-transcribe - OpenAI's GPT-4o-based transcription model with improved accuracy
Transcription Model Configuration¶
Configure the transcription model in your sygra/config/models.yaml:
transcribe:
model: gpt-4o-transcribe # or whisper-1
input_type: audio # Required for transcription routing
model_type: azure_openai # or openai
api_version: 2025-03-01-preview
# URL and auth_token from environment variables:
# SYGRA_TRANSCRIBE_URL and SYGRA_TRANSCRIBE_TOKEN
parameters:
language: en # Optional: ISO-639-1 language code
response_format: json # json, verbose_json, text, srt, vtt
temperature: 0 # 0-1, controls randomness
Critical Configuration: input_type: audio¶
Transcription requires input_type: audio in the model configuration to route to the transcription API:
# ✓ Correct - Routes to transcription API
transcribe:
model: whisper-1
input_type: audio
model_type: openai
# ✗ Incorrect - Will not route to transcription API
transcribe:
model: whisper-1
model_type: openai
Supported Languages¶
Transcription models support 50+ languages including:
| Language | Code | Language | Code |
|---|---|---|---|
| English | en | Spanish | es |
| French | fr | German | de |
| Italian | it | Portuguese | pt |
| Dutch | nl | Russian | ru |
| Chinese | zh | Japanese | ja |
| Korean | ko | Arabic | ar |
| Hindi | hi | Turkish | tr |
For a complete list, see OpenAI Whisper Documentation.
Response Formats¶
| Format | Description | Use Case |
|---|---|---|
| json | JSON with transcribed text only | Simple transcription |
| verbose_json | JSON with text, timestamps, and metadata | Detailed analysis |
| text | Plain text only | Direct text output |
| srt | SubRip subtitle format with timestamps | Video subtitles |
| vtt | WebVTT subtitle format with timestamps | Web video subtitles |
Example Outputs¶
JSON Format:
{
"text": "Hello, how are you today?"
}
Verbose JSON Format:
{
"task": "transcribe",
"language": "english",
"duration": 2.5,
"text": "Hello, how are you today?",
"segments": [
{
"id": 0,
"seek": 0,
"start": 0.0,
"end": 2.5,
"text": " Hello, how are you today?",
"temperature": 0.0,
"avg_logprob": -0.2
}
]
}
SRT Format:
1
00:00:00,000 --> 00:00:02,500
Hello, how are you today?
Transcription Example Configuration¶
Based on tasks/examples/transcription_apis/graph_config.yaml:
Input Data (test.json)¶
[
{
"id": "1",
"audio": "/path/to/audio/meeting_recording.mp3"
},
{
"id": "2",
"audio": "/path/to/audio/interview.wav"
}
]
Graph Configuration¶
data_config:
source:
type: "disk"
file_path: "tasks/examples/transcription_apis/test.json"
graph_config:
nodes:
audio_to_text:
output_keys: transcription
node_type: llm
prompt:
- user:
- type: audio_url
audio_url: "{audio}"
model:
name: transcribe
edges:
- from: START
to: audio_to_text
- from: audio_to_text
to: END
output_config:
output_map:
id:
from: id
audio:
from: audio
transcription:
from: transcription
Output¶
[
{
"id": "1",
"audio": "/path/to/audio/meeting_recording.mp3",
"transcription": "Welcome everyone to today's meeting. Let's start with the agenda..."
},
{
"id": "2",
"audio": "/path/to/audio/interview.wav",
"transcription": "Thank you for joining us today. Can you tell us about your background?"
}
]
Advanced Transcription Features¶
Language Specification¶
Specifying the language improves accuracy and speed:
model:
name: transcribe
parameters:
language: es # Spanish
response_format: json
temperature: 0
Timestamps (Verbose JSON)¶
For detailed timestamp information:
model:
name: transcribe
parameters:
response_format: verbose_json
timestamp_granularities: ["word", "segment"] # Word and segment-level timestamps
Context Prompt¶
Provide context to improve accuracy on specific terms:
prompt:
- user:
- type: audio_url
audio_url: "{audio}"
- type: text
text: "The audio contains technical terms like Kubernetes, Docker, and CI/CD."
The text prompt is automatically passed as the prompt parameter to the transcription API.
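For reference, a standalone call against the same endpoint with the official openai Python client looks roughly like this; the file name and prompt text are placeholders taken from the examples above:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("interview.wav", "rb") as audio_file:
    result = client.audio.transcriptions.create(
        model="whisper-1",      # or "gpt-4o-transcribe"
        file=audio_file,
        language="en",          # optional ISO-639-1 hint
        response_format="json",
        temperature=0,
        # The text entry from the prompt config is passed through here:
        prompt="The audio contains technical terms like Kubernetes, Docker, and CI/CD.",
    )

print(result.text)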
Comparison: Transcription vs Audio-Understanding LLMs¶
| Feature | Transcription Models | Audio LLMs (Qwen2-Audio) |
|---|---|---|
| Primary Use | Speech-to-text conversion | Audio understanding, reasoning, Q&A |
| API Endpoint | audio.transcriptions.create | chat.completions.create |
| Output | Transcribed text only | Contextual text responses |
| Timestamps | Yes (word/segment level) | No |
| Multiple Formats | Yes (JSON, SRT, VTT, text) | No (text only) |
| Language Support | 50+ languages | Varies by model |
| Best For | Accurate transcription, subtitles | Audio reasoning, classification, Q&A |
| Configuration | input_type: audio required | Standard LLM config |
| Supported Audio | MP3, MP4, MPEG, MPGA, M4A, WAV, WEBM, FLAC, OGG | Same |
Best Practices for Transcription¶
1. Language Specification¶
Always specify the language if known:
parameters:
language: en # or es, fr, de, etc.
2. Temperature Setting¶
Use temperature 0 for deterministic transcription:
parameters:
temperature: 0 # Recommended for transcription
3. Audio Quality¶
- Use high-quality audio files (16kHz or higher sample rate)
- Minimize background noise for better accuracy
- Ensure clear speech with minimal overlapping speakers
4. Context Prompts¶
Provide context for technical terms or specific vocabulary:
- type: text
text: "This audio discusses machine learning models including BERT, GPT, and transformers."
5. File Size Limits¶
- Maximum audio file size: 25 MB (OpenAI limit)
- For longer audio, split into chunks before transcription
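One way to pre-split long recordings is with a library such as pydub (an illustrative sketch; the 10-minute chunk length is an assumption, and any audio toolkit with export support works):

from pydub import AudioSegment  # requires ffmpeg for mp3 support

def split_audio(path: str, chunk_minutes: int = 10) -> list[str]:
    audio = AudioSegment.from_file(path)
    chunk_ms = chunk_minutes * 60 * 1000
    paths = []
    for i, start in enumerate(range(0, len(audio), chunk_ms)):
        chunk_path = f"{path}.part{i:03d}.mp3"
        # Export each slice as its own file, ready to transcribe separately.
        audio[start:start + chunk_ms].export(chunk_path, format="mp3")
        paths.append(chunk_path)
    return paths

# split_audio("meeting_recording.mp3") -> ["meeting_recording.mp3.part000.mp3", ...]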
Notes¶
- Audio generation is not supported in this module. The audio_url type is strictly for passing existing audio inputs (e.g., loaded from datasets), not for generating new audio via model output.
- Transcription models require input_type: audio in the model configuration to route to the transcription API.
- For audio understanding LLM examples, see: tasks/examples/audio_to_text
- For transcription examples, see: tasks/examples/transcription_apis
See Also¶
- GPT-4o Audio - Multimodal audio generation and understanding with GPT-4o
- Text to Speech - Text-to-speech generation
- Image to Text - Vision-based multimodal pipelines
- OpenAI Whisper Documentation - Official OpenAI Whisper API reference