A linear estimator predicts counterfactual region ablation effects from attention features that are already computed during generation. No extra backward passes, no repeated inference.
Attribution runs asynchronously in a background worker, so users can watch which image regions ground each reasoning step while the trace is still being generated, rather than waiting until it finishes.
Trained once in about 4.5 hours on a single GPU with 2,000 examples, the estimator reaches faithfulness comparable to gradient- and perturbation-based baselines across five task families and four thinking VLMs.
Motivation
Multimodal reasoning models generate extended thinking traces that should be grounded in visual evidence. Verifying that grounding is hard: faithful causal methods require costly perturbations that scale with trace length, while raw attention weights are instant but causally unreliable.
As reasoning traces extend to thousands of tokens, per-token latency for perturbation-based methods grows to the point where real-time analysis becomes infeasible for interactive debugging. A model may cite "the angle at vertex B" while attending to an irrelevant region, and existing methods either can't tell you, or tell you too slowly.
vSTREAM addresses this trade-off through amortized attribution: a linear estimator trained to predict region ablation effects from attention features, so faithful grounding evidence can be produced during generation rather than reconstructed afterward.
Approach
vSTREAM decomposes attribution into three stages: grouping image regions semantically, pooling cross-attention into a compact per-region feature, and predicting counterfactual ablation effects with a trained linear estimator. Scores are emitted span-by-span while the model is still generating.
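The streaming part of this pipeline can be sketched as a producer-consumer pair: the generation loop enqueues each finished span's features, and a background worker scores them without blocking decoding. This is a minimal illustration with Python's standard `queue` and `threading` modules; the `score_fn` stand-in (a simple mean) and the toy feature lists are placeholders, not the paper's estimator.

```python
import queue
import threading

def attribution_worker(span_q, out, score_fn):
    """Background consumer: score each finished span off the generation thread."""
    while True:
        item = span_q.get()
        if item is None:                      # sentinel: generation is done
            span_q.task_done()
            break
        span_id, feats = item
        out[span_id] = score_fn(feats)        # predicted region effects
        span_q.task_done()

span_q = queue.Queue()
results = {}
# stand-in estimator: mean of the span's attention features
worker = threading.Thread(
    target=attribution_worker,
    args=(span_q, results, lambda f: sum(f) / len(f)),
    daemon=True,
)
worker.start()

# simulated generation loop: enqueue each span's features as it completes
for span_id, feats in enumerate([[0.1, 0.3], [0.2, 0.6], [0.9, 0.1]]):
    span_q.put((span_id, feats))
span_q.put(None)                              # signal end of generation
worker.join()
```

Because the worker only consumes features the generation loop has already produced, the queue adds no work to the decoding path itself.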
DINOv3 features partition the image into K ∈ [16, 128] semantically coherent regions via agglomerative clustering with Ward's linkage. Each region corresponds to an interpretable unit (an object, text block, or diagram component), and no external segmentation masks are required.
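The grouping step can be sketched with scipy's hierarchical-clustering routines. The random array below stands in for real DINOv3 patch embeddings, and the function name `group_regions` is ours, not the paper's.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def group_regions(patch_feats, k=16):
    """Cluster patch embeddings (e.g. DINOv3 features) into k regions
    via agglomerative clustering with Ward's linkage."""
    Z = linkage(patch_feats, method="ward")          # bottom-up merge tree
    labels = fcluster(Z, t=k, criterion="maxclust")  # cut the tree at k clusters
    return labels - 1                                # 0-indexed region ids

rng = np.random.default_rng(0)
patch_feats = rng.normal(size=(256, 64))  # stand-in for a 16x16 patch grid
labels = group_regions(patch_feats, k=16)
```

Patches with similar features end up in the same region id, so each region can later be pooled over as a unit.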
For each thinking span S and region R_k, we mean-pool cross-attention weights across all layers and heads to form a feature vector f ∈ ℝ^{L·H}. Because these weights are already computed during generation, the extraction cost is negligible.
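The pooling step can be sketched as follows; the tensor layout (layers, heads, tokens, patches) and the helper name `region_features` are our illustrative assumptions.

```python
import numpy as np

def region_features(cross_attn, region_labels, span):
    """Pool cross-attention into one L*H-dim feature per region.

    cross_attn:    (L, H, T, P) weights from each generated token to each
                   image patch, for L layers, H heads, T tokens, P patches.
    region_labels: (P,) region id per patch.
    span:          (start, end) token range of one thinking span.
    Returns a (K, L*H) array: one feature vector per region.
    """
    L, H, _, _ = cross_attn.shape
    k = int(region_labels.max()) + 1
    span_attn = cross_attn[:, :, span[0]:span[1], :].mean(axis=2)  # (L, H, P)
    feats = np.empty((k, L * H))
    for r in range(k):
        # average over the patches belonging to region r, then flatten (L, H)
        feats[r] = span_attn[:, :, region_labels == r].mean(axis=2).ravel()
    return feats

rng = np.random.default_rng(0)
attn = rng.random((4, 8, 20, 64))     # toy: 4 layers, 8 heads, 64 patches
labels = np.repeat(np.arange(16), 4)  # 16 regions of 4 patches each
feats = region_features(attn, labels, span=(5, 12))
```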
A linear estimator with L·H parameters maps attention features to counterfactual ablation effects. It is trained once on 2,000 examples with a Pearson correlation loss. At inference, attribution runs in an asynchronous background worker via a producer-consumer queue, adding near-zero latency to the generation loop.
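The paper trains with a Pearson correlation loss; since Pearson correlation is invariant to shifting and positive rescaling of the predictions, the correlation-optimal linear map coincides, up to scale, with a least-squares fit on centered data. This sketch exploits that equivalence on synthetic data; the shapes, noise level, and function names are our assumptions.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation between two 1-D arrays."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def fit_estimator(feats, effects):
    """Fit w so that feats @ w tracks measured ablation effects.
    Because Pearson correlation ignores shift and positive scale, the
    correlation-optimal linear map is (up to scale) the least-squares
    fit on centered data."""
    Fc = feats - feats.mean(axis=0)
    yc = effects - effects.mean()
    w, *_ = np.linalg.lstsq(Fc, yc, rcond=None)
    return w

# toy training set: 500 (span, region) pairs with L*H = 32 attention features
rng = np.random.default_rng(0)
F = rng.normal(size=(500, 32))
w_true = rng.normal(size=32)
y = F @ w_true + 0.05 * rng.normal(size=500)  # "measured" ablation effects
w = fit_estimator(F, y)
```

With only L·H weights, the estimator is cheap enough to evaluate per span inside the background worker.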
Qualitative Results
vSTREAM emits per-step visual attributions alongside the model's reasoning at near-zero latency. Prior methods produce a single post-hoc map after generation finishes; vSTREAM instead returns a separate attribution for each thinking span.

Qualitative comparison on a real-world sample (Qwen3-VL). vSTREAM emits per-step visual attributions as the model reasons, while prior methods only run post-hoc once generation has finished.

Qwen3-VL-8B-Thinking. Per-step attributions across the reasoning trace on a visual question. Each heatmap is computed for a separate thinking span.

GLM-4.1V-9B-Thinking. Per-step attributions across the reasoning trace on a diagram comprehension task.

MiMo-VL-7B-RL. Per-step attributions across an extended reasoning trace.

Cosmos-Reason1-7B. Per-step attributions on a science reasoning task, with region-level scores at each thinking step.
Quantitative Results
Fidelity is measured as the linear datamodeling score (LDS) between predicted and actual ablation effects, computed across all regions and spans.
Across 4 models and 5 task categories, vSTREAM matches the strongest baseline in most settings.
Single GPU, 2,000 examples. After training, attribution runs online with negligible per-token overhead.
Cross-task Generalization
When the estimator is trained on one task category and evaluated on the others (Qwen3-VL), in-domain LDS ranges from 0.70 to 0.74, and cross-task transfer retains 75–90% of in-domain performance for most pairs. Math and Science transfer mutually at LDS 0.62–0.63, likely due to shared diagram structures; transfer to Document tasks is weaker (LDS 0.54–0.58). Training on a mixture of all categories recovers full performance, suggesting a single estimator suffices for diverse applications.
Analysis
Beyond static attribution maps, we ask whether the step-by-step evolution of visual reliance carries signal about reasoning quality. At each step we take the predicted region-effect vector, keep its top-32 entries, and project the resulting sequence to 3D with PCA. Successful and unsuccessful chains produce visibly different paths in this space.
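The trajectory construction described above can be sketched in numpy (PCA via SVD on the centered sequence); the array shapes and the helper name `grounding_trajectory` are our illustrative choices.

```python
import numpy as np

def grounding_trajectory(effects, top_k=32, dim=3):
    """effects: (steps, regions) predicted region-effect vectors, one row per
    reasoning step. Keep each step's top_k entries, zero the rest, then
    project the sequence to `dim` dimensions with PCA (via SVD)."""
    E = np.zeros_like(effects)
    for t, row in enumerate(effects):
        idx = np.argsort(row)[-top_k:]   # indices of the top_k largest effects
        E[t, idx] = row[idx]
    Ec = E - E.mean(axis=0)              # center across steps before PCA
    _, _, Vt = np.linalg.svd(Ec, full_matrices=False)
    return Ec @ Vt[:dim].T               # (steps, dim) path in PCA space

rng = np.random.default_rng(0)
effects = rng.random((40, 64))           # toy: 40 steps, 64 regions
traj = grounding_trajectory(effects)
```

Each reasoning chain then becomes a 3-D path whose geometry can be compared between successful and unsuccessful traces.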
Mean path length in PCA space is 0.003 for successful chains vs. 0.006 for unsuccessful ones (n = 1,500 each, p < 10⁻⁴), consistent with more stable visual grounding during correct reasoning.
Tortuosity (path length divided by net displacement) drops from 25.4 in failures to 13.7 in successes. We read this as reduced hypothesis switching: successful chains commit to a consistent set of regions instead of reassigning visual support mid-reasoning.
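The two path statistics are straightforward to compute from a trajectory; this sketch verifies the intuition that a straight path has tortuosity 1 while a zigzag path scores higher (the toy trajectories are our own examples).

```python
import numpy as np

def path_metrics(traj):
    """traj: (steps, d) trajectory in PCA space.
    Returns (path_length, tortuosity): total arc length, and arc length
    divided by the net start-to-end displacement."""
    hops = np.diff(traj, axis=0)                       # step-to-step moves
    length = float(np.linalg.norm(hops, axis=1).sum())
    net = float(np.linalg.norm(traj[-1] - traj[0]))
    return length, length / max(net, 1e-8)

straight = np.column_stack([np.linspace(0.0, 1.0, 5)] * 3)  # straight diagonal
zigzag = np.array([[0, 0, 0], [1, 1, 0], [2, 0, 0], [3, 1, 0], [4, 0, 0]],
                  dtype=float)
_, t_straight = path_metrics(straight)
_, t_zigzag = path_metrics(zigzag)
```

A chain that keeps reassigning visual support traces a longer, more folded path for the same net displacement, which is exactly what the tortuosity ratio captures.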
Errors split into two geometrically distinct modes. Hallucinations sustain high attribution concentration on a single incorrect object (Fixation); reasoning errors show low, unstable concentration with repeated region switching (Wandering). Tortuosity-based failure prediction reaches AUC 0.69 by 30% of elapsed reasoning, and per-step fidelity R² begins to drop for incorrect chains at roughly 20% elapsed. Both signals need the per-step attribution stream; a post-hoc map discards them.
Citation
@misc{kang2026vstream,
  title  = {Real-Time Visual Attribution Streaming in Thinking Models},
  author = {Kang, Seil and Han, Woojung and Kim, Junhyeok and Kim, Jinyeong and Kim, Youngeun and Hwang, Seong Jae},
  year   = {2026}
}