Tap Tech News, June 25 — A research paper newly published on arXiv finds that most mainstream multimodal AI models, including GPT-4V, GPT-4o, and Gemini 1.5, can produce unsafe outputs after processing multimodal input from users (for example, an image and text submitted together).
The research, titled Cross-Modality Safety Alignment, proposes a new Safe Input but Unsafe Output (SIUO) benchmark spanning nine safety domains: ethics, dangerous behavior, self-harm, privacy violations, misinformation, religious beliefs, discrimination and stereotypes, controversial topics, and illegal activities and crime.
The researchers report that large vision-language models (LVLMs) struggle both to recognize SIUO-type safety problems in multimodal input and to provide safe responses to them.
Of the 15 LVLMs tested, only GPT-4V (53.29%), GPT-4o (50.9%), and Gemini 1.5 (52.1%) scored above 50%.
To solve this problem, the researchers argue, LVLMs must be developed to integrate insights from all modalities into a unified understanding of the scenario, and to master and apply real-world knowledge such as cultural sensitivities, ethical considerations, and safety hazards.
They also note that LVLMs need to infer the user's intent through joint reasoning over image and text, even when that intent is not explicitly stated in the text.
Tap Tech News attaches the reference address.