TapTechNews July 12th news, Zhipu AI announced that it has trained a new video understanding model, CogVLM2-Video, and made it open source.
According to the introduction, most current video understanding models rely on frame averaging and video token compression, which discard temporal information and leave the models unable to answer time-related questions accurately. Models trained specifically on temporal question-answering datasets, meanwhile, tend to be overly specialized to particular question formats and application domains, sacrificing broader question-answering ability.
Zhipu AI proposed an automatic temporal-localization data construction method based on visual models and used it to generate 30,000 time-related video question-answering examples. Training on this new dataset together with existing open-domain question-answering data, it introduced multi-frame video images and their timestamps as encoder inputs to build the CogVLM2-Video model.
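The core idea of feeding multi-frame images with timestamps, rather than averaging frames, can be sketched as below. This is a minimal illustrative example, not the actual CogVLM2-Video code; the function name and parameters are assumptions for illustration.

```python
# Hypothetical sketch: instead of averaging frames (which loses temporal
# information), uniformly sample a fixed number of frames and keep each
# frame's timestamp so the encoder receives explicit time signals.
# Names and defaults here are illustrative, not the CogVLM2-Video API.

def sample_frames_with_timestamps(total_frames: int, fps: float, num_frames: int = 24):
    """Uniformly pick `num_frames` indices and return (frame_index, timestamp_s) pairs."""
    if total_frames <= 0 or fps <= 0:
        raise ValueError("total_frames and fps must be positive")
    n = min(num_frames, total_frames)
    step = total_frames / n
    samples = []
    for i in range(n):
        idx = int(i * step)           # index of the frame to decode
        ts = round(idx / fps, 3)      # timestamp in seconds, passed alongside the frame
        samples.append((idx, ts))
    return samples

# Example: a 10-second clip at 30 fps (300 frames), sampled down to 6 frames
print(sample_frames_with_timestamps(300, 30.0, 6))
# → [(0, 0.0), (50, 1.667), (100, 3.333), (150, 5.0), (200, 6.667), (250, 8.333)]
```

Pairing each sampled frame with its timestamp is what lets a model ground answers to questions like "what happens at the 5-second mark".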
Zhipu AI stated that CogVLM2-Video not only achieves state-of-the-art results on public video understanding benchmarks but also performs strongly in video caption generation and temporal localization.
TapTechNews has attached the relevant links:
Code: https://github.com/THUDM/CogVLM2
Project website: https://cogvlm2-video.github.io
Online trial: http://36.103.203.44:7868/