HybridToken-VLM — new techniques

Each preview is a real page from arXiv 2512.08240 (Zhang et al., Dec 9, 2025), with a yellow highlight drawn on the exact bbox by the OkraPDF citation-preview API (Gemini Flash 3 → static SVG on res.okrapdf.com). Click any card to land on the source.

1. The efficiency–fidelity dilemma

The framing the whole paper hangs on.

"continuous compression dilutes high-level semantics such as object identities, while discrete quantization loses fine-grained details such as textures"
Page 1 with bbox highlight: efficiency-fidelity dilemma

2. Disentanglement attention mask

Splits attention so the continuous (ViT-patch) and discrete (MGVQ) pathways do not bleed into each other; they fuse only at the bottleneck.

Page 3 with bbox highlight: disentanglement attention mask
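
A minimal PyTorch sketch of what such a mask could look like, assuming a block-diagonal rule over a 576-patch + 4-anchor + 1-bottleneck sequence: each pathway attends only within itself, and only the bottleneck row sees both. The token layout, sizes, and exact masking rule here are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def disentanglement_mask(n_cont: int = 576, n_disc: int = 4, n_neck: int = 1) -> torch.Tensor:
    """Boolean mask of shape (L, L); True means the query row may attend to the key column."""
    L = n_cont + n_disc + n_neck
    mask = torch.zeros(L, L, dtype=torch.bool)
    cont = slice(0, n_cont)                      # continuous ViT-patch tokens
    disc = slice(n_cont, n_cont + n_disc)        # discrete MGVQ anchor tokens
    neck = slice(n_cont + n_disc, L)             # bottleneck token(s)
    mask[cont, cont] = True                      # continuous pathway stays within itself
    mask[disc, disc] = True                      # discrete pathway stays within itself
    mask[neck, :] = True                         # only the bottleneck row reads both pathways
    return mask

mask = disentanglement_mask()                    # (581, 581)
q = k = v = torch.randn(1, 8, 581, 64)           # (batch, heads, seq, head_dim)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
print(mask.shape, out.shape)
```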

3. MGVQ → 4 symbolic anchor tokens

Multi-Group VQ collapses the image to just 4 discrete tokens that carry object identity through extreme compression.

Page 4 with bbox highlight: MGVQ quantization
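
A toy multi-group vector quantizer in the same spirit, assuming MGVQ splits a pooled image feature into 4 groups and snaps each group to its nearest entry in a per-group codebook, with a straight-through estimator for training. The codebook size, feature widths, and pooling are assumptions; only the 4-group count comes from the summary above.

```python
import torch
import torch.nn as nn

class MultiGroupVQ(nn.Module):
    """Toy multi-group vector quantizer: one codebook per group, nearest-entry lookup."""
    def __init__(self, n_groups: int = 4, codebook_size: int = 1024, group_dim: int = 256):
        super().__init__()
        self.n_groups, self.group_dim = n_groups, group_dim
        self.codebooks = nn.Parameter(torch.randn(n_groups, codebook_size, group_dim))

    def forward(self, feat: torch.Tensor):
        """feat: (B, n_groups * group_dim) pooled image feature ->
        anchors (B, n_groups, group_dim) and code indices (B, n_groups)."""
        B = feat.shape[0]
        groups = feat.view(B, self.n_groups, self.group_dim)            # split feature into groups
        dists = torch.cdist(groups.transpose(0, 1), self.codebooks)     # (n_groups, B, codebook_size)
        indices = dists.argmin(dim=-1).transpose(0, 1)                  # (B, n_groups) symbolic codes
        anchors = torch.stack(
            [self.codebooks[g][indices[:, g]] for g in range(self.n_groups)], dim=1
        )                                                               # (B, n_groups, group_dim)
        anchors = groups + (anchors - groups).detach()                  # straight-through estimator
        return anchors, indices

vq = MultiGroupVQ()
anchors, codes = vq(torch.randn(2, 4 * 256))
print(anchors.shape, codes.shape)    # torch.Size([2, 4, 256]) torch.Size([2, 4])
```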

4. The 580-token hybrid sequence

576 ViT patches + 4 MGVQ anchors stitched into one sequence — the representation before bottleneck compression.

Page 3 with bbox highlight: hybrid sequence
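
Assembling the hybrid sequence is then essentially a concatenation. The sketch below assumes the 4 anchors are projected to the ViT feature width and appended after the 576 patches; the projection layer and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

d_model = 1024                                   # assumed shared embedding width
project_anchor = nn.Linear(256, d_model)         # hypothetical projection for the discrete anchors

vit_patches  = torch.randn(2, 576, d_model)      # continuous pathway: 576 ViT patch features
mgvq_anchors = torch.randn(2, 4, 256)            # discrete pathway: 4 quantized anchor vectors

# Hybrid sequence: patches first, anchors appended at the end (ordering is an assumption).
hybrid = torch.cat([vit_patches, project_anchor(mgvq_anchors)], dim=1)
print(hybrid.shape)                              # torch.Size([2, 580, 1024])
```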

5. The "voco" bottleneck — 580 → 1 token

A single learnable bottleneck token swallows the whole 580-token sequence. Attention analysis shows it preferentially binds to the 4 MGVQ anchors, not the patches.

Page 4 with bbox highlight: voco token compresses Vhy into latent z
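
A hedged sketch of the compression step, assuming the voco token is a single learnable query that cross-attends over the 580-token hybrid sequence to produce the latent z. The cross-attention formulation and shapes are assumptions; the returned weights are the kind of signal an attention analysis would inspect to see how much mass lands on the 4 anchors.

```python
import torch
import torch.nn as nn

class VocoBottleneck(nn.Module):
    """One learnable query that cross-attends over the 580-token hybrid sequence."""
    def __init__(self, d_model: int = 1024, n_heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, d_model))          # the single bottleneck token
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, hybrid: torch.Tensor):
        """hybrid: (B, 580, d_model) -> latent z: (B, 1, d_model), plus attention weights."""
        q = self.query.expand(hybrid.shape[0], -1, -1)
        z, weights = self.attn(q, hybrid, hybrid, need_weights=True)   # weights: (B, 1, 580)
        return z, weights

voco = VocoBottleneck()
hybrid = torch.randn(2, 580, 1024)
z, w = voco(hybrid)
# If the 4 MGVQ anchors sit at the end of the sequence (as in the sketch above),
# this is the attention mass the bottleneck puts on them vs. the 576 patches.
mass_on_anchors = w[:, 0, -4:].sum(dim=-1)
print(z.shape, mass_on_anchors)
```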

6. Result — 87.2% vs 81.0% baseline

Average retention across GQA, VQAv2, MMBench, MME, POPE, SEED-Bench, and ScienceQA-Image at the full 580 → 1 compression setting.

Page 6 with bbox highlight: 87.2 result