Mantis
Interleaved multi-image instruction tuning for multimodal LLMs, with new benchmarks for complex multi-image reasoning.
Mantis is a multimodal LLM trained on interleaved multi-image instruction data. Unlike prior work that focuses on single-image understanding, Mantis targets tasks that require reasoning over several images at once, such as comparison, sequencing, and cross-image question answering.
Key contributions:
- Mantis-Instruct: large-scale interleaved multi-image training dataset
- MIQA: evaluation benchmarks for multi-image reasoning
- Strong multi-image performance while maintaining single-image quality
- Open-source model weights and training code
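
To make "interleaved multi-image" concrete, here is a minimal sketch of how a prompt that mixes text and several images could be split into an ordered sequence of segments. It is an illustration only: the `<image>` placeholder, the `Segment` type, and the `interleave` helper are assumptions made for this example and are not taken from the Mantis codebase or the Mantis-Instruct schema.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical placeholder marking where an image sits in the prompt.
# The real Mantis-Instruct format may use a different convention.
IMAGE_TOKEN = "<image>"


@dataclass
class Segment:
    kind: str   # "text" or "image"
    value: str  # text content, or a path to the image


def interleave(prompt: str, image_paths: List[str]) -> List[Segment]:
    """Split a prompt on image placeholders and weave the images back in order."""
    parts = prompt.split(IMAGE_TOKEN)
    # N placeholders split the prompt into N + 1 text parts.
    if len(parts) - 1 != len(image_paths):
        raise ValueError(
            f"{len(parts) - 1} placeholders but {len(image_paths)} images"
        )
    segments: List[Segment] = []
    for i, text in enumerate(parts):
        if text.strip():
            segments.append(Segment("text", text.strip()))
        if i < len(image_paths):
            segments.append(Segment("image", image_paths[i]))
    return segments


example = interleave(
    "Image A: <image> Image B: <image> Which image shows more people?",
    ["a.jpg", "b.jpg"],
)
for seg in example:
    print(seg.kind, seg.value)
```

The count check is what keeps a sample well-formed: a cross-image question is only answerable if every placeholder in the text is paired with exactly one image, in order.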