Image to QnA¶
This tutorial demonstrates how to build a multimodal Question and Answer (QnA) system for images using the GraSP framework. You’ll learn to extract text from images, generate questions, and provide detailed answers using LLMs.
Key Features You'll Learn: multimodal LLMs, text extraction, question generation, image processing, multi-step reasoning
Prerequisites¶
- GraSP framework installed (see Installation Guide)
- Access to multimodal LLMs (e.g., Qwen VL 72B)
- Basic understanding of image and text data
What You’ll Build¶
You'll create a system that:
- Processes images and extracts text
- Generates diverse questions based on the text
- Answers questions with reasoning and evidence
- Handles multiple images as a document set
Step 1: Project Structure¶
```
image_to_qna/
├── graph_config.yaml   # Workflow for image processing, QnA generation
└── task_executor.py    # Custom processors and logic
```
Step 2: Pipeline Implementation¶
Parent Graph (graph_config.yaml)¶
The main pipeline is defined in image_to_qna/graph_config.yaml:
- Data Source: Loads images and metadata from the HuggingFaceM4/Docmatix dataset, applying transformations for image metadata and loop counters.
- Nodes:
  - extract_text: An LLM node with custom pre- and post-processors; extracts text from each image.
  - update_loop_count: Updates the loop counter for image processing.
  - generate_questions: Generates questions from the extracted text.
  - generate_answers: Answers each generated question using the document content and the question.
- Edges: The graph loops over images and questions, processing each in turn.
- Output Config: Custom output formatting is handled by the output generator in task_executor.py (a sketch of the full file follows below).
Reference: image_to_qna/graph_config.yaml
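To make the wiring concrete, here is a minimal sketch of what such a configuration might look like. Only the node names, the dataset, and the task_executor component names come from this tutorial; every other key (data_config, node_type, model name, edge syntax) is an assumption standing in for your GraSP version's actual schema.

```yaml
# Hypothetical sketch only -- the schema below is assumed, not GraSP's
# documented format. Check your installed version for the real keys.
data_config:
  source:
    type: hf
    repo_id: HuggingFaceM4/Docmatix
  transformations:
    - transform: task_executor.ImagesMetadata   # image metadata + loop counter

graph_config:
  nodes:
    extract_text:
      node_type: llm
      pre_process: task_executor.ImagesPreProcessor
      post_process: task_executor.ExtractTextPostProcessor
      prompt: "Extract all readable text from the given image."
      model: { name: qwen-vl-72b }               # any multimodal LLM
    update_loop_count:
      node_type: lambda
      lambda: task_executor.update_loop_count
    generate_questions:
      node_type: llm
      prompt: "Generate diverse questions about the following text: {ocr_texts}"
      model: { name: qwen-vl-72b }
    generate_answers:
      node_type: llm
      prompt: "Answer with reasoning and evidence.\nDocument: {ocr_texts}\nQuestion: {question}"
      model: { name: qwen-vl-72b }
  edges:
    - from: extract_text
      to: update_loop_count
    - from: update_loop_count
      to: extract_text                           # loop while images remain
      condition: task_executor.ImageLoopChecker
    - from: update_loop_count
      to: generate_questions
    - from: generate_questions
      to: generate_answers

output_config:
  generator: task_executor.build_output          # hypothetical name, see below
```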
Task Executor (task_executor.py)¶
This file implements custom logic for the pipeline (a sketch follows the reference below):
- ImagesMetadata, ImagesPreProcessor, ExtractTextPostProcessor: Handle image metadata, pre-processing, and text extraction.
- ImageLoopChecker: Edge condition for looping through images.
- Output formatting: Assembles all image references, extracted text, questions, answers, and reasoning in a single output.
Reference: task_executor.py
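The following is a minimal sketch of how these components might look. GraSP defines its own base classes and hook signatures, which may differ; only the class and function names come from this tutorial, so treat the apply/check signatures, the state keys, and the logic as illustrative assumptions (ImagesMetadata, the dataset transformation, is omitted here).

```python
# Hypothetical sketch -- GraSP's real base classes and method signatures
# may differ; the `state` dict keys below are assumptions.

class ImagesPreProcessor:
    """Attach the current image (by loop index) to the LLM request."""

    def apply(self, state: dict) -> dict:
        idx = state.get("loop_count", 0)
        state["current_image"] = state["images"][idx]
        return state


class ExtractTextPostProcessor:
    """Collect the text extracted from the current image."""

    def apply(self, state: dict) -> dict:
        state.setdefault("ocr_texts", []).append(state["llm_response"])
        return state


class ImageLoopChecker:
    """Edge condition: keep looping while unprocessed images remain."""

    def check(self, state: dict) -> bool:
        return state.get("loop_count", 0) < len(state["images"])


def update_loop_count(state: dict) -> dict:
    """Advance the image loop counter after each extraction step."""
    state["loop_count"] = state.get("loop_count", 0) + 1
    return state
```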
Step 3: Output Collection¶
- Assembles all image references, extracted text, questions, answers, and reasoning into a single output record per document set (a sketch follows below)
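As a rough illustration, the assembly step might look like the following. The function name build_output matches the hypothetical name used in the configuration sketch above, and the state keys and sha256-based id are assumptions chosen to mirror the example output shown later.

```python
import hashlib


def build_output(state: dict) -> dict:
    """Assemble one output record per document set (hypothetical field sources)."""
    return {
        "id": hashlib.sha256(repr(state["images"]).encode()).hexdigest(),
        "num_images": len(state["images"]),
        "ocr_texts": state["ocr_texts"],
        "num_questions": len(state["questions"]),
        "generated_questions": state["questions"],
        "generated_answers": state.get("answers", []),
    }
```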
Step 4: Running the Pipeline¶
From your GraSP project root, run:
```bash
python main.py --task path/to/your/image_to_qna
```
Example Output¶
```json
[
  {
    "id": "de850e9019beb83118db75f247a9b17dda378a98abb83c99562593af00a461af",
    "num_images": 1,
    "ocr_texts": ["WISE COUNTY BOARD OF SUPERVISORS..."],
    "num_questions": 3,
    "generated_questions": ["What specific topics...", "Considering the agenda items...", "When and where is the Wise County Board..."]
    // ...
  }
]
```
Try It Yourself¶
- Use your own images or datasets
- Adjust the prompt for different question types
Next Steps¶
- Explore audio classification to learn how to process audio inputs
- Learn about structured output for standardized results
- Explore agent simulation for multi-agent conversations