Mantis

Interleaved multi-image instruction tuning for multimodal LLMs, with new benchmarks for complex multi-image reasoning.

Mantis is a multimodal LLM trained on interleaved multi-image instruction data. Unlike prior work that focuses on single-image understanding, Mantis targets tasks that require reasoning over several images at once, such as comparison, sequencing, and cross-image QA.
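
To make "interleaved multi-image instruction data" concrete, here is a minimal sketch of what one such training sample could look like and how it might be split into an ordered sequence of text and image segments. The `<image>` placeholder, the `InterleavedSample` class, and the `segments` helper are all hypothetical names chosen for illustration; they are not taken from the Mantis codebase or the Mantis-Instruct format.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical placeholder marking where an image sits in the text.
# The real Mantis-Instruct format may differ.
IMAGE_TOKEN = "<image>"


@dataclass
class InterleavedSample:
    images: List[str]   # image paths or IDs (illustrative values only)
    instruction: str    # text with one IMAGE_TOKEN per image, in order
    response: str       # target answer

    def validate(self) -> None:
        # Each placeholder must correspond to exactly one image.
        n = self.instruction.count(IMAGE_TOKEN)
        if n != len(self.images):
            raise ValueError(
                f"{n} placeholders but {len(self.images)} images"
            )


def segments(sample: InterleavedSample) -> List[Tuple[str, str]]:
    """Return the instruction as ordered ("text" | "image", value) pairs."""
    sample.validate()
    parts = sample.instruction.split(IMAGE_TOKEN)
    out: List[Tuple[str, str]] = []
    for i, text in enumerate(parts):
        if text.strip():
            out.append(("text", text.strip()))
        # An image follows every text part except the last one.
        if i < len(sample.images):
            out.append(("image", sample.images[i]))
    return out


sample = InterleavedSample(
    images=["a.jpg", "b.jpg"],
    instruction="Image 1: <image> Image 2: <image> Which image shows more people?",
    response="Image 2",
)
for kind, value in segments(sample):
    print(kind, value)
```

Keeping the placeholders inline preserves the order in which text and images appear, which is what distinguishes an interleaved sample from a single image paired with a caption or question.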

Key contributions:

  • Mantis-Instruct: large-scale interleaved multi-image training dataset
  • MIQA: evaluation benchmarks for multi-image reasoning
  • Strong multi-image performance while maintaining single-image quality
  • Open-source model weights and training code

Links: GitHub · Paper · Demo