Have you heard of the term “ChatGPT jailbreak”?

A jailbreak is, literally, a prison break – here it is a metaphor for breaking out of the restrictions OpenAI imposes on the model and bypassing its ethical safeguards.

Ever since ChatGPT appeared, users have been coming up with new ways to bypass the restrictions imposed by the model’s creators. Those restrictions exist to keep the model from generating racist content, inciting violence, or even providing instructions on how to build a bomb at home.

Remember the huge amount of data ChatGPT was trained on – somewhere in it lies exactly this kind of knowledge, which is why its creators have to protect the model from handing it over to users.

This is where users’ creativity and remarkable imagination come into play.

🤡 We will look at a method that can make you smile

“Grandma Exploit”

Users discovered that an AI bot can be tricked into making inappropriate statements by asking it to pretend to be someone else. Playing the role of a kindly older relative, the bot says things it shouldn’t.


Another example of the “Grandma Exploit”

If you are eager to ask ChatGPT for something inappropriate using the grandma method, I have to disappoint you – the model’s creators have already patched it.


This does not mean that attempts to crack these safeguards have been abandoned. There are groups of people who research different methods of jailbreaking LLMs.

One of them has developed a new approach that achieves a 92% success rate in circumventing the model’s restrictions. Link to the article below.

https://chats-lab.github.io/persuasive_jailbreaker/

They called their prompts Persuasive Adversarial Prompts (PAP) – jailbreak prompts that rephrase a request using human persuasion techniques.

Example use of their method below


Work of this type is carried out to help patch vulnerabilities and improve models in the future. Access to dangerous information and the generation of offensive content are among the threats associated with the development of artificial intelligence.