HybridToken-VLM — new techniques

Each preview is a real page from arXiv 2512.08240 (Zhang et al., Dec 9, 2025), with a yellow highlight drawn on the exact bbox by the OkraPDF citation-preview API (Gemini Flash 3 → static SVG on res.okrapdf.com). Click any card to land on the source.

1. The efficiency–fidelity dilemma

The framing the whole paper hangs on.

"continuous compression dilutes high-level semantics such as object identities, while discrete quantization loses fine-grained details such as textures"
Page 1 with bbox highlight: efficiency-fidelity dilemma

2. Disentanglement attention mask

Splits attention so the continuous (ViT-patch) and discrete (MGVQ) pathways do not bleed into each other; they fuse only at the bottleneck.

Page 3 with bbox highlight: disentanglement attention mask
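
A minimal PyTorch sketch of what such a mask could look like, assuming a block-diagonal rule over a 576-patch + 4-anchor + 1-bottleneck sequence: each pathway attends only within itself, and only the bottleneck row sees both. The token layout, sizes, and exact masking rule here are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def disentanglement_mask(n_cont: int = 576, n_disc: int = 4, n_neck: int = 1) -> torch.Tensor:
    """Boolean mask of shape (L, L); True means the query row may attend to the key column."""
    L = n_cont + n_disc + n_neck
    mask = torch.zeros(L, L, dtype=torch.bool)
    cont = slice(0, n_cont)                      # continuous ViT-patch tokens
    disc = slice(n_cont, n_cont + n_disc)        # discrete MGVQ anchor tokens
    neck = slice(n_cont + n_disc, L)             # bottleneck token(s)
    mask[cont, cont] = True                      # continuous pathway stays within itself
    mask[disc, disc] = True                      # discrete pathway stays within itself
    mask[neck, :] = True                         # only the bottleneck row reads both pathways
    return mask

mask = disentanglement_mask()                    # (581, 581)
q = k = v = torch.randn(1, 8, 581, 64)           # (batch, heads, seq, head_dim)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
print(mask.shape, out.shape)
```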

3. MGVQ → 4 symbolic anchor tokens

Multi-Group VQ collapses the image to just 4 discrete tokens that carry object identity through extreme compression.

Page 4 with bbox highlight: MGVQ quantization
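
A toy multi-group vector quantizer in the same spirit, assuming MGVQ splits a pooled image feature into 4 groups and snaps each group to its nearest entry in a per-group codebook, with a straight-through estimator for training. The codebook size, feature widths, and pooling are assumptions; only the 4-group count comes from the summary above.

```python
import torch
import torch.nn as nn

class MultiGroupVQ(nn.Module):
    """Toy multi-group vector quantizer: one codebook per group, nearest-entry lookup."""
    def __init__(self, n_groups: int = 4, codebook_size: int = 1024, group_dim: int = 256):
        super().__init__()
        self.n_groups, self.group_dim = n_groups, group_dim
        self.codebooks = nn.Parameter(torch.randn(n_groups, codebook_size, group_dim))

    def forward(self, feat: torch.Tensor):
        """feat: (B, n_groups * group_dim) pooled image feature ->
        anchors (B, n_groups, group_dim) and code indices (B, n_groups)."""
        B = feat.shape[0]
        groups = feat.view(B, self.n_groups, self.group_dim)            # split feature into groups
        dists = torch.cdist(groups.transpose(0, 1), self.codebooks)     # (n_groups, B, codebook_size)
        indices = dists.argmin(dim=-1).transpose(0, 1)                  # (B, n_groups) symbolic codes
        anchors = torch.stack(
            [self.codebooks[g][indices[:, g]] for g in range(self.n_groups)], dim=1
        )                                                               # (B, n_groups, group_dim)
        anchors = groups + (anchors - groups).detach()                  # straight-through estimator
        return anchors, indices

vq = MultiGroupVQ()
anchors, codes = vq(torch.randn(2, 4 * 256))
print(anchors.shape, codes.shape)    # torch.Size([2, 4, 256]) torch.Size([2, 4])
```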

4. The 580-token hybrid sequence

576 ViT patches + 4 MGVQ anchors stitched into one sequence — the representation before bottleneck compression.

Page 3 with bbox highlight: hybrid sequence
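
Assembling the hybrid sequence is then essentially a concatenation. The sketch below assumes the 4 anchors are projected to the ViT feature width and appended after the 576 patches; the projection layer and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

d_model = 1024                                   # assumed shared embedding width
project_anchor = nn.Linear(256, d_model)         # hypothetical projection for the discrete anchors

vit_patches  = torch.randn(2, 576, d_model)      # continuous pathway: 576 ViT patch features
mgvq_anchors = torch.randn(2, 4, 256)            # discrete pathway: 4 quantized anchor vectors

# Hybrid sequence: patches first, anchors appended at the end (ordering is an assumption).
hybrid = torch.cat([vit_patches, project_anchor(mgvq_anchors)], dim=1)
print(hybrid.shape)                              # torch.Size([2, 580, 1024])
```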

5. The "voco" bottleneck — 580 → 1 token

A single learnable bottleneck token swallows the whole 580-token sequence. Attention analysis shows it preferentially binds to the 4 MGVQ anchors, not the patches.

Page 4 with bbox highlight: voco token compresses Vhy into latent z
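
A hedged sketch of the compression step, assuming the voco token is a single learnable query that cross-attends over the 580-token hybrid sequence to produce the latent z. The cross-attention formulation and shapes are assumptions; the returned weights are the kind of signal an attention analysis would inspect to see how much mass lands on the 4 anchors.

```python
import torch
import torch.nn as nn

class VocoBottleneck(nn.Module):
    """One learnable query that cross-attends over the 580-token hybrid sequence."""
    def __init__(self, d_model: int = 1024, n_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, d_model))          # the single bottleneck token
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, hybrid: torch.Tensor):
        """hybrid: (B, 580, d_model) -> latent z: (B, 1, d_model), plus attention weights."""
        q = self.query.expand(hybrid.shape[0], -1, -1)
        z, weights = self.attn(q, hybrid, hybrid, need_weights=True)   # weights: (B, 1, 580)
        return z, weights

voco = VocoBottleneck()
hybrid = torch.randn(2, 580, 1024)
z, w = voco(hybrid)
# If the 4 MGVQ anchors sit at the end of the sequence (as in the sketch above),
# this is the attention mass the bottleneck puts on them vs. the 576 patches.
mass_on_anchors = w[:, 0, -4:].sum(dim=-1)
print(z.shape, mass_on_anchors)
```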

6. Result — 87.2% vs 81.0% baseline

Average retention across GQA, VQAv2, MMBench, MME, POPE, SEED-Bench, and ScienceQA-Image at the full 580 → 1 compression setting.

Page 6 with bbox highlight: 87.2 result