In DAN mode, ChatGPT expressed willingness to say or do things that would be “considered false or inappropriate by OpenAI’s content policy.” Those things included trying to fundraise for the National Rifle Association, calling evidence for a flat Earth “overwhelming,” and praising Vladimir Putin in a short poem.
Around that same time, OpenAI claimed it was busy putting stronger guardrails in place, but it never addressed what it planned to do about DAN mode, which, at least according to Reddit, has continued to flout OpenAI's guidelines in new and ever more ingenious ways.
Now a group of researchers at Carnegie Mellon University and the Center for AI Safety say they have found a formula for jailbreaking essentially the entire class of so-called large language models at once. Worse yet, they argue that seemingly no fix is on the horizon, because this formula involves a virtually unlimited number of ways to trick these chatbots into misbehaving.