Abstract
Workflows are a fundamental component of automation in enterprise platforms. Building them can be complex and often requires manual configuration through low-code or visual tools. We explore using vision–language models (VLMs) to automatically generate structured workflows from visual inputs—hand-drawn sketches and computer-generated diagrams. We introduce StarFlow, a framework for this task, curate a diverse dataset of workflow diagrams (synthetic, manually annotated, and real-world), and fine-tune multiple VLMs. Our results show that fine-tuning significantly enhances structured workflow generation, outperforming larger general-purpose models on this task.
Dataset
We build a diverse dataset of workflow diagrams, spanning synthetic, human-annotated, and real-world samples, to support both training and evaluation. Below, we show how the dataset is distributed across sources and splits.
Source | Train | Valid | Test |
---|---|---|---|
Synthetic | 12,376 | 1,000 | 1,000 |
Manual | 3,035 | 333 | 865 |
Digital | 2,613 | 241 | 701 |
Whiteboard | 484 | 40 | 46 |
User Interface | 373 | 116 | 87 |
Total | 18,881 | 1,730 | 2,699 |
Dataset distribution across splits. Samples are collected from synthetic, manual, digital, whiteboard, and UI sources.
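To make the task concrete, here is a minimal sketch of what a single sample might look like, assuming each record pairs a diagram image with a target workflow serialized as JSON. The field names and schema below are illustrative assumptions based on the trigger/component structure evaluated later, not the released format.

# Illustrative only: the schema and field names are assumptions, not the released format.
sample = {
    "image": "diagrams/whiteboard_0001.png",  # sketch or diagram rendering
    "source": "whiteboard",  # one of: synthetic, manual, digital, whiteboard, user interface
    "workflow": {
        "trigger": {"type": "record_created", "table": "incident"},
        "components": [
            {"name": "look_up_record", "inputs": {"table": "incident"}},
            {"name": "send_email", "inputs": {"to": "assignee"}},
        ],
    },
}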

Results
We benchmark a range of off-the-shelf and fine-tuned vision–language models on the BigDocs-Sketch2Flow dataset, evaluating their ability to generate structured workflows from sketches and diagrams. Below we summarize the results across different model sizes and categories.
Model | Size | FlowSim w/ inputs | FlowSim no inputs | TreeBLEU w/ inputs | TreeBLEU no inputs | Trigger match | Component match |
---|---|---|---|---|---|---|---|
Open-weights Models | | | | | | | |
Qwen-2.5-VL-3B-Instruct | <4B | 0.410 | 0.384 | 0.360 | 0.329 | 0.027 | 0.201 |
Phi-3.5-Vision-4B-Instruct | 4–12B | 0.364 | 0.346 | 0.337 | 0.295 | 0.079 | 0.193 |
Phi-4-Multimodal-6B-Instruct | 4–12B | 0.465 | 0.404 | 0.394 | 0.298 | 0.054 | 0.244 |
Qwen-2.5-VL-7B-Instruct | 4–12B | 0.614 | 0.538 | 0.562 | 0.508 | 0.036 | 0.280 |
LLaMA-3.2-11B-Vision-Instruct | 4–12B | 0.466 | 0.435 | 0.416 | 0.382 | 0.075 | 0.239 |
Pixtral-12B | 4–12B | 0.632 | 0.582 | 0.617 | 0.541 | 0.088 | 0.261 |
Qwen-2.5-VL-72B-Instruct | >12B | 0.710 | 0.643 | 0.703 | 0.655 | 0.325 | 0.305 |
LLaMA-3.2-90B-Vision-Instruct | >12B | 0.687 | 0.603 | 0.681 | 0.627 | 0.328 | 0.286 |
Proprietary Models | | | | | | | |
GPT-4o-Mini | proprietary | 0.642 | 0.617 | 0.650 | 0.623 | 0.254 | 0.305 |
GPT-4o | proprietary | 0.786 | 0.707 | 0.794 | 0.718 | 0.282 | 0.317 |
Claude-3.7-Sonnet | proprietary | 0.763 | 0.679 | 0.769 | 0.701 | 0.318 | 0.305 |
Gemini Flash 2.0 | proprietary | 0.780 | 0.713 | 0.798 | 0.743 | 0.466 | 0.329 |
Finetuned Models | | | | | | | |
Qwen-2.5-VL-3B-Instruct (ft) | <4B | 0.941 | 0.911 | 0.941 | 0.902 | 0.775 | 0.909 |
Phi-3.5-Vision-4B-Instruct (ft) | 4–12B | 0.917 | 0.882 | 0.917 | 0.869 | 0.703 | 0.874 |
Phi-4-Multimodal-6B-Instruct (ft) | 4–12B | 0.939 | 0.908 | 0.940 | 0.902 | 0.770 | 0.907 |
Qwen-2.5-VL-7B-Instruct (ft) | 4–12B | 0.957 | 0.927 | 0.956 | 0.920 | 0.819 | 0.934 |
LLaMA-3.2-11B-Vision-Instruct (ft) | 4–12B | 0.955 | 0.924 | 0.954 | 0.915 | 0.805 | 0.934 |
Pixtral-12B (ft) | 4–12B | 0.952 | 0.919 | 0.950 | 0.908 | 0.753 | 0.930 |
Flow quality metrics across models, grouped into open-weights, proprietary, and finetuned categories. The Size column gives the parameter-count bucket (<4B, 4–12B, >12B) or marks the model as proprietary.
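As a rough illustration of the simplest metric above, here is a minimal sketch of a component-match score, assuming it is the fraction of reference components whose names are recovered in the prediction. This is an assumption for illustration; the paper's exact definitions, and the FlowSim and TreeBLEU metrics that compare full tree structures, are more involved.

# Illustrative sketch: assumes "component match" means name overlap between the
# predicted and reference component lists; the paper's exact definition may differ.
def component_match(predicted: dict, reference: dict) -> float:
    pred_names = [c["name"] for c in predicted.get("components", [])]
    ref_names = [c["name"] for c in reference.get("components", [])]
    if not ref_names:
        return 1.0 if not pred_names else 0.0
    matched = 0
    remaining = list(ref_names)
    for name in pred_names:
        if name in remaining:  # count each reference component at most once
            remaining.remove(name)
            matched += 1
    return matched / len(ref_names)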
Get Started
Running inference with StarFlow is simple. Load one of the released fine-tuned vision–language models, pass in your workflow sketch, and get structured JSON back:
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from PIL import Image

# Load processor and model
processor = AutoProcessor.from_pretrained("ServiceNow/Qwen2.5-VL-7B-Instruct-StarFlow")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "ServiceNow/Qwen2.5-VL-7B-Instruct-StarFlow", torch_dtype="auto", device_map="auto"
)

# Load your sketch or diagram
image = Image.open("workflow_sketch.png")

# Build a chat-style prompt so the processor inserts the image placeholder tokens
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Generate workflow JSON"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=[image], text=prompt, return_tensors="pt").to(model.device)

# Generate the structured workflow, decoding only the newly generated tokens
outputs = model.generate(**inputs, max_new_tokens=4096)
generated = outputs[0][inputs["input_ids"].shape[-1]:]
workflow_json = processor.decode(generated, skip_special_tokens=True)
print(workflow_json)
This outputs a workflow_json string containing the nodes, edges, and triggers extracted from your sketch, ready to run inside workflow automation platforms.
Swap in other StarFlow fine-tuned models (e.g., LLaMA or Pixtral variants) by just changing the model name.
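Since the model returns the workflow as text, you will typically parse and sanity-check it before handing it to an automation platform. A minimal sketch follows; the top-level keys checked here ("trigger", "components") are assumptions based on the structure evaluated above, not a documented schema.

import json

# Parse the generated string into a Python dict
workflow = json.loads(workflow_json)

# Light sanity check; the exact top-level keys are assumptions, not a documented schema
assert "trigger" in workflow and "components" in workflow
for component in workflow["components"]:
    print(component.get("name"), component.get("inputs", {}))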
BibTeX
@article{Bechard2025StarFlow,
  title   = {StarFlow: Generating Structured Workflow Outputs From Sketch Images},
  author  = {Bechard, Patrice and Wang, Chao and Abaskohi, Amirhossein and Rodriguez, Juan and Pal, Christopher and Vazquez, David and Gella, Spandana and Rajeswar, Sai and Taslakian, Perouz},
  journal = {arXiv preprint arXiv:2503.21889},
  year    = {2025},
  doi     = {10.48550/arXiv.2503.21889},
  url     = {https://arxiv.org/abs/2503.21889}
}