Zhipu AI Develops New Video Understanding Model CogVLM2-Video and Makes it Open Source

TapTechNews July 12th news, Zhipu AI announced that it has trained a new video understanding model, CogVLM2-Video, and made it open source.

According to the introduction, most current video understanding models rely on frame averaging and video token compression, which discard temporal information and leave the models unable to answer time-related questions accurately. Models that instead specialize in temporal question-answering datasets are often tied to narrow formats and application domains, sacrificing broader question-answering ability.
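To see why frame averaging discards temporal information, consider this toy Python sketch (an illustration for this article, not code from the model): mean pooling is order-invariant, so two clips containing the same frames in opposite temporal order produce identical pooled features.

```python
import numpy as np

# Toy illustration: mean pooling over frames is order-invariant.
clip = np.random.rand(8, 512)   # 8 frame feature vectors, 512-dim each
reversed_clip = clip[::-1]      # same frames, reversed in time

# The averaged features are identical, so any question about
# "what happened first" becomes unanswerable after pooling.
assert np.allclose(clip.mean(axis=0), reversed_clip.mean(axis=0))
```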


Zhipu AI proposed an automatic temporal-localization data construction method based on visual models and used it to generate 30,000 time-related video question-answering samples. Building on this new dataset and existing open-domain question-answering data, the team introduced multi-frame video images together with their timestamps as encoder inputs to train the CogVLM2-Video model, as sketched below.
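For illustration, here is a minimal sketch of what pairing sampled frames with their timestamps as encoder input might look like. The sampling helper and the `encode_video(frames, timestamps=...)` interface are assumptions made for this example, not the actual CogVLM2-Video API.

```python
import numpy as np

def sample_frames_with_timestamps(video_frames, fps, num_frames=24):
    """Uniformly sample frames and record each frame's timestamp in seconds.

    Hypothetical preprocessing step for illustration only -- not the
    actual CogVLM2-Video pipeline.
    """
    total = len(video_frames)
    indices = np.linspace(0, total - 1, num_frames).round().astype(int)
    frames = [video_frames[i] for i in indices]
    timestamps = [i / fps for i in indices]  # when each frame occurs
    return frames, timestamps

# Usage sketch: the encoder receives both the frames and their timestamps,
# so the model knows *when* each frame occurs, not just what it shows.
# frames, ts = sample_frames_with_timestamps(decoded_frames, fps=30)
# outputs = model.encode_video(frames, timestamps=ts)  # hypothetical interface
```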

Zhipu AI stated that CogVLM2-Video not only achieves state-of-the-art performance on public video understanding benchmarks but also excels at video caption generation and temporal localization.


TapTechNews has attached the relevant links:

Code: https://github.com/THUDM/CogVLM2

Project website: https://cogvlm2-video.github.io

Online trial: http://36.103.203.44:7868/
