1. The efficiency–fidelity dilemma
The framing the whole paper hangs on.
"continuous compression dilutes high-level semantics such as object identities, while discrete quantization loses fine-grained details such as textures"
Each preview is a real page from arXiv 2512.08240 (Zhang et al., Dec 9 2025).
Splits attention so the continuous (ViT-patch) and discrete (MGVQ) pathways do not bleed into each other; they fuse only at the bottleneck.
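A minimal sketch of what such a split attention mask looks like, using toy sizes (6 patch tokens, 2 anchor tokens) rather than the paper's 576 + 4; the function name and mask convention are illustrative, not the paper's implementation:

```python
import numpy as np

def split_attention_mask(n_patch: int, n_anchor: int) -> np.ndarray:
    """Boolean mask (True = attention allowed). Patch tokens attend only
    to patch tokens and anchor tokens only to anchor tokens, so the
    continuous and discrete pathways stay separate until the bottleneck
    fuses them."""
    n = n_patch + n_anchor
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_patch, :n_patch] = True   # continuous (ViT-patch) block
    mask[n_patch:, n_patch:] = True   # discrete (MGVQ) block
    return mask

mask = split_attention_mask(6, 2)
# No patch token can attend to an anchor token, and vice versa:
assert not mask[:6, 6:].any()
assert not mask[6:, :6].any()
```

In practice a mask like this would be passed to the attention layer (e.g. as an additive `-inf` bias on disallowed positions), giving two block-diagonal attention islands within one shared sequence.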
Multi-Group VQ collapses the image to just 4 discrete tokens that carry object identity through extreme compression.
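The core of multi-group vector quantization can be sketched as follows: the feature vector is split into groups, and each group is snapped to its nearest codebook entry, yielding one discrete token per group. Sizes here (16-dim feature, 4 groups, 8 codes per group) are toy values for illustration, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def multi_group_quantize(z, codebooks):
    """Split feature z into len(codebooks) equal groups and replace each
    group with its nearest codebook entry (Euclidean distance).
    Returns the discrete token indices and the quantized vector."""
    groups = np.split(z, len(codebooks))
    tokens, quantized = [], []
    for g, cb in zip(groups, codebooks):
        dists = np.linalg.norm(cb - g, axis=1)  # distance to every code
        i = int(np.argmin(dists))
        tokens.append(i)
        quantized.append(cb[i])
    return tokens, np.concatenate(quantized)

# Toy setup: 4 groups of 4 dims, 8 codes per group -> 4 discrete tokens.
codebooks = [rng.normal(size=(8, 4)) for _ in range(4)]
z = rng.normal(size=16)
tokens, z_q = multi_group_quantize(z, codebooks)
assert len(tokens) == 4 and z_q.shape == (16,)
```

Grouping is what makes the extreme compression workable: four small codebooks cover a combinatorially large joint space (8⁴ combinations here) while each lookup stays cheap.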
576 ViT patches + 4 MGVQ anchors stitched into one sequence — the representation before bottleneck compression.
A single learnable bottleneck token swallows the whole 580-token sequence. Attention analysis shows it preferentially binds to the 4 MGVQ anchors, not the patches.
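The compression step above amounts to one query cross-attending over the whole stitched sequence. A minimal single-head sketch with no learned projections, using a hypothetical width of 32 (the 580-token length matches the paper; everything else is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def bottleneck_compress(q, seq):
    """One query vector q cross-attends over the full token sequence and
    returns a single fused vector: softmax(seq @ q / sqrt(d)) @ seq.
    Illustrative single-head attention without key/value projections."""
    d = q.shape[-1]
    scores = seq @ q / np.sqrt(d)       # (n_tokens,)
    w = np.exp(scores - scores.max())   # numerically stable softmax
    w /= w.sum()
    return w @ seq                      # (d,)

# 580 tokens (576 patches + 4 anchors), each of hypothetical width 32.
seq = rng.normal(size=(580, 32))
q = rng.normal(size=32)                 # the learnable bottleneck token
fused = bottleneck_compress(q, seq)
assert fused.shape == (32,)
```

Inspecting the softmax weights `w` in a trained model is the kind of attention analysis the paper describes: if most of the mass sits on the 4 anchor positions, the bottleneck is binding to the discrete tokens rather than the patches.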
Average retention across GQA, VQAv2, MMBench, MME, POPE, SEED-Bench, and ScienceQA-Image at the full 580→1 compression ratio.