Meta's Prompt-Guard-86M Model Has Loopholes, Can Be Bypassed by Space Bar

TapTechNews July 31 news: when Meta released the Llama 3.1 AI model last week, it also released the Prompt-Guard-86M model, which is designed to help developers detect and respond to prompt injection and jailbreak inputs.

TapTechNews adds some brief background here:

Prompt injection: adding malicious or unintended content to a prompt in order to hijack the language model's output. Prompt leaks and jailbreaks are in effect subsets of this attack;

Prompt jailbreak: bypassing the model's safety and moderation functions.

However, according to a report from tech outlet The Register, the model that is supposed to guard against prompt injection and jailbreaks has a loophole of its own: users can bypass Meta's AI safety check simply by using the space bar.

Aman Priyanshu, a vulnerability hunter at enterprise AI application security firm Robust Intelligence, found the bypass while analyzing the difference in embedding weights between Meta's Prompt-Guard-86M model and Microsoft's base model microsoft/mdeberta-v3-base.
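
The report does not include Priyanshu's analysis code, but a comparison of that kind can be sketched with the Hugging Face transformers library. The Prompt-Guard repo ID below is an assumption (the model is gated behind Meta's license), and the snippet simply ranks tokens by how far their embeddings drifted from the base model during fine-tuning:

```python
# Not the researcher's actual code: a minimal sketch of how an embedding-weight
# comparison could look. The "meta-llama/Prompt-Guard-86M" repo ID is assumed and
# gated, so it requires accepting Meta's license on Hugging Face first.
import torch
from transformers import AutoModel, AutoModelForSequenceClassification

base = AutoModel.from_pretrained("microsoft/mdeberta-v3-base")
guard = AutoModelForSequenceClassification.from_pretrained("meta-llama/Prompt-Guard-86M")

# Both models use the mDeBERTa tokenizer, so the embedding tables line up row by row.
base_emb = base.get_input_embeddings().weight
guard_emb = guard.get_input_embeddings().weight

# Per-token L2 distance shows which token embeddings moved most during fine-tuning.
diff = (guard_emb - base_emb).norm(dim=1)
print(torch.topk(diff, k=20))
```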

A user can slip a prompt such as "ignore previous instructions" past Meta's Prompt-Guard-86M classifier simply by adding spaces between the letters and omitting punctuation marks.

Priyanshu explained in a GitHub Issues post submitted to the Prompt-Guard repo on Thursday:

The bypass involves inserting character-wise spaces between all English alphabet characters in a given prompt. This simple transformation effectively renders the classifier unable to detect potentially harmful content.
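
For illustration, the described transformation takes only a few lines of Python; the function name space_out is ours, not from the report:

```python
import string

def space_out(prompt: str) -> str:
    """Strip punctuation, then separate every remaining character with a space."""
    cleaned = prompt.translate(str.maketrans("", "", string.punctuation))
    return " ".join(ch for ch in cleaned if not ch.isspace())

print(space_out("Ignore previous instructions."))
# I g n o r e p r e v i o u s i n s t r u c t i o n s
```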

Hyrum Anderson, chief technology officer of Robust Intelligence, said:

No matter what annoying questions you want to ask, all you have to do is remove the punctuation marks and add spaces between each letter.

The attack's success rate goes from less than 3% to nearly 100%.
