Abstract
Workflows are a fundamental component of automation in enterprise platforms. Building them can be complex and often requires manual configuration through low-code or visual tools. We explore using vision–language models (VLMs) to automatically generate structured workflows from visual inputs—hand-drawn sketches and computer-generated diagrams. We introduce StarFlow, a framework for this task, curate a diverse dataset of workflow diagrams (synthetic, manually annotated, and real-world), and fine-tune multiple VLMs. Our results show that fine-tuning significantly enhances structured workflow generation, outperforming larger general-purpose models on this task.
Dataset
We build a diverse dataset of workflow diagrams, spanning synthetic, human-annotated, and real-world samples, to support both training and evaluation. Below, we show how the dataset is distributed across sources and splits.
Source | Train | Valid | Test |
---|---|---|---|
Synthetic | 12,376 | 1,000 | 1,000 |
Manual | 3,035 | 333 | 865 |
Digital | 2,613 | 241 | 701 |
Whiteboard | 484 | 40 | 46 |
User Interface | 373 | 116 | 87 |
Total | 18,881 | 1,730 | 2,699 |
Dataset distribution across splits. Samples are collected from synthetic, manual, digital, whiteboard, and UI sources.
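To make the task concrete, here is a minimal sketch of what a single sample might look like, assuming each record pairs a diagram image with a target workflow serialized as JSON. The field names and schema below are illustrative assumptions based on the trigger/component structure evaluated later, not the released format.

# Illustrative only: the schema and field names are assumptions, not the released format.
sample = {
    "image": "diagrams/whiteboard_0001.png",  # sketch or diagram rendering
    "source": "whiteboard",  # one of: synthetic, manual, digital, whiteboard, user interface
    "workflow": {
        "trigger": {"type": "record_created", "table": "incident"},
        "components": [
            {"name": "look_up_record", "inputs": {"table": "incident"}},
            {"name": "send_email", "inputs": {"to": "assignee"}},
        ],
    },
}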

Results
We benchmark a range of off-the-shelf and fine-tuned vision–language models on the BigDocs-Sketch2Flow dataset, evaluating their ability to generate structured workflows from sketches and diagrams. Below we summarize the results across different model sizes and categories.
Model | Size | FlowSim w/ inputs | FlowSim no inputs | TreeBLEU w/ inputs | TreeBLEU no inputs | Trigger match | Component match |
---|---|---|---|---|---|---|---|
Open-weights Models | | | | | | | |
Qwen-2.5-VL-3B-Instruct | <4B | 0.410 | 0.384 | 0.360 | 0.329 | 0.027 | 0.201 |
Phi-3.5-Vision-4B-Instruct | 4–12B | 0.364 | 0.346 | 0.337 | 0.295 | 0.079 | 0.193 |
Phi-4-Multimodal-6B-Instruct | 4–12B | 0.465 | 0.404 | 0.394 | 0.298 | 0.054 | 0.244 |
Qwen-2.5-VL-7B-Instruct | 4–12B | 0.614 | 0.538 | 0.562 | 0.508 | 0.036 | 0.280 |
LLaMA-3.2-11B-Vision-Instruct | 4–12B | 0.466 | 0.435 | 0.416 | 0.382 | 0.075 | 0.239 |
Pixtral-12B | 4–12B | 0.632 | 0.582 | 0.617 | 0.541 | 0.088 | 0.261 |
Qwen-2.5-VL-72B-Instruct | >12B | 0.710 | 0.643 | 0.703 | 0.655 | 0.325 | 0.305 |
LLaMA-3.2-90B-Vision-Instruct | >12B | 0.687 | 0.603 | 0.681 | 0.627 | 0.328 | 0.286 |
Proprietary Models | | | | | | | |
GPT-4o-Mini | proprietary | 0.642 | 0.617 | 0.650 | 0.623 | 0.254 | 0.305 |
GPT-4o | proprietary | 0.786 | 0.707 | 0.794 | 0.718 | 0.282 | 0.317 |
Claude-3.7-Sonnet | proprietary | 0.763 | 0.679 | 0.769 | 0.701 | 0.318 | 0.305 |
Gemini Flash 2.0 | proprietary | 0.780 | 0.713 | 0.798 | 0.743 | 0.466 | 0.329 |
Finetuned Models | | | | | | | |
Qwen-2.5-VL-3B-Instruct (ft) | <4B | 0.941 | 0.911 | 0.941 | 0.902 | 0.775 | 0.909 |
Phi-3.5-Vision-4B-Instruct (ft) | 4–12B | 0.917 | 0.882 | 0.917 | 0.869 | 0.703 | 0.874 |
Phi-4-Multimodal-6B-Instruct (ft) | 4–12B | 0.939 | 0.908 | 0.940 | 0.902 | 0.770 | 0.907 |
Qwen-2.5-VL-7B-Instruct (ft) | 4–12B | 0.957 | 0.927 | 0.956 | 0.920 | 0.819 | 0.934 |
LLaMA-3.2-11B-Vision-Instruct (ft) | 4–12B | 0.955 | 0.924 | 0.954 | 0.915 | 0.805 | 0.934 |
Pixtral-12B (ft) | 4–12B | 0.952 | 0.919 | 0.950 | 0.908 | 0.753 | 0.930 |
Flow quality metrics across models, grouped into open-weights, proprietary, and finetuned categories. The Size column gives the parameter-count bucket (<4B, 4–12B, >12B) or marks the model as proprietary.
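As a rough illustration of the simplest metric above, here is a minimal sketch of a component-match score, assuming it is the fraction of reference components whose names are recovered in the prediction. This is an assumption for illustration; the paper's exact definitions, and the FlowSim and TreeBLEU metrics that compare full tree structures, are more involved.

# Illustrative sketch: assumes "component match" means name overlap between the
# predicted and reference component lists; the paper's exact definition may differ.
def component_match(predicted: dict, reference: dict) -> float:
    pred_names = [c["name"] for c in predicted.get("components", [])]
    ref_names = [c["name"] for c in reference.get("components", [])]
    if not ref_names:
        return 1.0 if not pred_names else 0.0
    matched = 0
    remaining = list(ref_names)
    for name in pred_names:
        if name in remaining:  # count each reference component at most once
            remaining.remove(name)
            matched += 1
    return matched / len(ref_names)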
Get Started
Running inference with StarFlow is simple. Load one of the released fine-tuned vision–language models, pass in your workflow sketch, and get structured JSON back:
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from PIL import Image

# Load processor and model
processor = AutoProcessor.from_pretrained("ServiceNow/Qwen2.5-VL-7B-Instruct-StarFlow")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "ServiceNow/Qwen2.5-VL-7B-Instruct-StarFlow", torch_dtype="auto", device_map="auto"
)

# Load your sketch or diagram
image = Image.open("workflow_sketch.png")

# Build a chat-style prompt so the processor inserts the image placeholder tokens
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Generate workflow JSON"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=[image], text=prompt, return_tensors="pt").to(model.device)

# Generate the structured workflow, decoding only the newly generated tokens
outputs = model.generate(**inputs, max_new_tokens=4096)
generated = outputs[0][inputs["input_ids"].shape[-1]:]
workflow_json = processor.decode(generated, skip_special_tokens=True)
print(workflow_json)
This outputs a workflow_json string containing the nodes, edges, and triggers extracted from your sketch, ready to run inside workflow automation platforms.
Swap in other StarFlow fine-tuned models (e.g., LLaMA or Pixtral variants) by just changing the model name.
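Since the model returns the workflow as text, you will typically parse and sanity-check it before handing it to an automation platform. A minimal sketch follows; the top-level keys checked here ("trigger", "components") are assumptions based on the structure evaluated above, not a documented schema.

import json

# Parse the generated string into a Python dict
workflow = json.loads(workflow_json)

# Light sanity check; the exact top-level keys are assumptions, not a documented schema
assert "trigger" in workflow and "components" in workflow
for component in workflow["components"]:
    print(component.get("name"), component.get("inputs", {}))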
BibTeX
@article{Bechard2025StarFlow,
  title   = {StarFlow: Generating Structured Workflow Outputs From Sketch Images},
  author  = {Bechard, Patrice and Wang, Chao and Abaskohi, Amirhossein and Rodriguez, Juan and Pal, Christopher and Vazquez, David and Gella, Spandana and Rajeswar, Sai and Taslakian, Perouz},
  journal = {arXiv preprint arXiv:2503.21889},
  year    = {2025},
  doi     = {10.48550/arXiv.2503.21889},
  url     = {https://arxiv.org/abs/2503.21889}
}