TapTechNews, August 7 — Mianbi Intelligence open-sourced the MiniCPM-V 2.6 model yesterday, stating that it raises on-device AI multimodal capability to a level that fully benchmarks against GPT-4V.
According to the company, the MiniCPM-V 2.6 model, with only 8B parameters, achieves three SOTA results among models under 20B parameters in single-image, multi-image, and video understanding, and has the following characteristics:
Strongest three-in-one on-device multimodality: for the first time on-device, it delivers single-image, multi-image, and video understanding and other core multimodal capabilities that fully surpass GPT-4V, while its single-image understanding punches above its weight class against the multimodal leader Gemini 1.5 Pro and the popular newcomer GPT-4o mini.
Multiple first-on-device features: real-time video understanding, multi-image joint understanding, multi-image in-context learning (ICL) for visual analogy, multi-image OCR, and more.
Highest multimodal pixel density: analogous to knowledge density, MiniCPM-V 2.6 ("Little Cannon" 2.6) achieves twice the pixels encoded per visual token (token density) of GPT-4o.
On-device friendly: after quantization, the model runs in 6 GB of device memory; on-device inference reaches 18 tokens/s, 33% faster than the previous-generation model. llama.cpp, ollama, and vLLM inference are supported from day one, as are multiple languages.
Unified high-definition framework: OCR, the series' traditional strength, maintains its SOTA performance level and now extends across single-image, multi-image, and video understanding.
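The "token density" metric above can be made concrete with a quick back-of-the-envelope calculation: it is simply the number of image pixels represented by each visual token fed to the language model. The resolution and token counts below are illustrative assumptions, not official figures for either model.

```python
# Token density = pixels encoded per visual token (higher means more
# visual information packed into each token the LLM must process).
# The concrete numbers below are illustrative assumptions only.

def token_density(width: int, height: int, num_visual_tokens: int) -> float:
    """Return the number of pixels represented by each visual token."""
    return (width * height) / num_visual_tokens

# Suppose a 1344x1344 image is encoded into 640 visual tokens:
density = token_density(1344, 1344, 640)
print(f"{density:.0f} pixels per token")  # 2822 pixels per token

# A model encoding the same image into twice as many tokens would
# have half the token density (and twice the LLM-side cost):
assert token_density(1344, 1344, 1280) == density / 2
```

A higher token density means fewer tokens per image at the same resolution, which directly reduces inference cost and latency on memory-constrained devices.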
TapTechNews has attached the open-source addresses:
GitHub: https://github.com/OpenBMB/MiniCPM-V
HuggingFace: https://huggingface.co/openbmb/MiniCPM-V-2_6