Audio Classification¶
This tutorial demonstrates how to build a multimodal pipeline for processing audio files and generating textual output using the GraSP framework. You’ll learn to integrate audio-capable LLMs for audio classification, speech recognition, or content analysis.
Key Features You’ll Learn
- Multimodal processing
- Audio classification
- Base64 encoding
- Audio-capable LLMs
- HuggingFace dataset integration
Prerequisites¶
- GraSP framework installed (see Installation Guide)
- Access to an LLM that supports audio input (e.g., Qwen2-Audio-7B)
- Basic knowledge of audio file formats
What You’ll Build¶
You’ll create a pipeline that:
- Loads audio samples from a HuggingFace dataset
- Processes audio files (detects, encodes, and prepares them for LLM input)
- Sends the audio and instructions to an LLM
- Receives and structures the LLM’s analysis
Step 1: Project Structure¶
audio_to_text/
├── graph_config.yaml # Workflow for audio processing
Step 2: Pipeline Implementation¶
Graph Configuration (graph_config.yaml)¶
The graph_config.yaml file defines the workflow for the audio-to-text task. Here’s what it does:
- Data Source: Loads audio samples from the HuggingFace datasets-examples/doc-audio-1 repository, with streaming enabled for efficiency.
- Nodes: Defines a single node named identify_animal of type llm. This node is configured with:
  - Prompt: Combines a text instruction ("Identify the animal in the provided audio.") with the audio file (as a base64-encoded data URL) for multimodal input.
  - Model: Uses the qwen_2_audio_7b model with parameters suitable for audio analysis.
- Edges: Sets up a simple workflow from START to identify_animal and then to END.
- Output Config: Maps the output fields (id, audio, animal) from the state to the final output structure.
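The sketch below shows how these pieces might fit together. The key names (data_config, nodes, prompt, model, edges) and the {audio} placeholder are illustrative assumptions rather than the authoritative GraSP schema; see the reference implementation below for the exact format. The output mapping is sketched separately under Step 3.

# Illustrative sketch only -- key names and placeholders are assumptions, not the exact GraSP schema
data_config:
  source:
    type: hf                                  # HuggingFace dataset source
    repo_id: datasets-examples/doc-audio-1    # repository with the audio samples
    streaming: true                           # stream records instead of downloading the full dataset

nodes:
  identify_animal:
    node_type: llm
    prompt:
      - role: user
        content:
          - type: text
            text: "Identify the animal in the provided audio."
          - type: audio_url
            audio_url: "{audio}"              # base64-encoded data URL built from the audio column
    model:
      name: qwen_2_audio_7b
      parameters:
        temperature: 0.1                      # keep the classification deterministic
        max_tokens: 128

edges:
  - from: START
    to: identify_animal
  - from: identify_animal
    to: END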
Reference Implementation¶
See the GraSP repository for the complete example:
- Graph configuration: audio_to_text/graph_config.yaml
Step 3: Output Collection¶
- The system captures the LLM’s analysis (e.g., an animal identification) and structures the results in a standardized JSON format for downstream use; the mapping that controls which fields appear in each record is sketched below.
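A minimal sketch of such an output mapping, again with assumed key names rather than the exact GraSP schema:

output_config:
  output_map:
    id:
      from: "id"                  # sample identifier from the dataset
    audio:
      from: "audio"               # the audio input (base64-encoded data URL)
    animal:
      from: "identify_animal"     # the LLM’s textual analysis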
Step 4: Running the Pipeline¶
From your GraSP project root, run:
python main.py --task path/to/your/audio_to_text
Example Output¶
[
{
"id": "sample1",
"audio_url": "data:audio/wav;base64,UklGRuQAAABXQVZFZm10IBAAAAABAAEA...",
"analysis": "The audio contains the sound of a dog barking."
},
{
"id": "sample2",
"audio_url": "data:audio/wav;base64,UklGRuQAAABXQVZFZm10IBAAAAABAAEA...",
"analysis": "The audio contains the sound of a cat meowing."
}
]
Try It Yourself¶
- Add new audio samples or use your own recordings
- Modify the prompt to instruct the LLM to perform different analyses, such as speech-to-text (see the sketch below)
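For example, a speech-to-text variant could swap the instruction while leaving the rest of the node unchanged (a hypothetical sketch using the same assumed keys as above):

prompt:
  - role: user
    content:
      - type: text
        text: "Transcribe the speech in the provided audio verbatim."
      - type: audio_url
        audio_url: "{audio}"      # same base64-encoded data URL as before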
Next Steps¶
- Explore the image-to-QnA tutorial for multimodal image processing
- Learn about structured output for standardized results