ChatGPT, OpenAI’s newest model, is a GPT-3 variant that has been fine-tuned using Reinforcement Learning from Human Feedback (RLHF), and it is taking the world by storm!
Sponsor: Weights & Biases.
https://wandb.me/yannic
OUTLINE:
0:00 — Intro.
0:40 — Sponsor: Weights & Biases.
3:20 — ChatGPT: How does it work?
5:20 — Reinforcement Learning from Human Feedback.
7:10 — ChatGPT Origins: The GPT-3.5 Series.
8:20 — OpenAI’s strategy: Iterative Refinement.
9:10 — ChatGPT’s amazing capabilities.
14:10 — Internals: What we know so far.
16:10 — Building a virtual machine in ChatGPT’s imagination (insane).
20:15 — Jailbreaks: Circumventing the safety mechanisms.
29:25 — How OpenAI sees the future.
References:
https://openai.com/blog/chatgpt/
https://openai.com/blog/language-model-safety-and-misuse/
https://beta.openai.com/docs/model-index-for-researchers
https://scale.com/blog/gpt-3-davinci-003-comparison#Conclusion
New post: What the delay in launching text-davinci-003 tells us about RLHF via PPO and instruction tuning more generally. https://t.co/Q3FUekFERk
— John McDonnell (@johnvmcdonnell) December 2, 2022
Prompt engineers everywhere are busy testing out OpenAI’s newly released text-davinci-003. A few observations (not criticisms or benchmarks) as I play with it, a 🧵 pic.twitter.com/4tdJzJMp7A
— Bill Lennon (@blennon_) November 28, 2022
Ran one of our essay questions through @OpenAI’s new chatbot. Essays are dead.
— Tim Kietzmann (@TimKietzmann) December 1, 2022
Back to hand-written exams I guess. Sigh. pic.twitter.com/nzzhRwGp05
Pretty interesting to see ChatGPT can adapt to subtle probes about one of my favourite physics theorems
— Lewis Tunstall (@_lewtun) November 30, 2022
I know this kind of stuff is also on Wikipedia, but the prose of ChatGPT is much nicer to read IMO pic.twitter.com/5d9RqLeN86
I asked ChatGPT to rewrite Bohemian Rhapsody to be about the life of a postdoc, and the output was flawless: pic.twitter.com/qe1lI66aa7
— Raphaël Millière (@raphaelmilliere) December 2, 2022
I asked OpenAI to write a letter to my son explaining that Santa isn’t real and we make up stories out of love. This is making me slightly emotional 🥹 pic.twitter.com/zNMolDCCWA
— Cynthia Savard Saucier (@CynthiaSavard) December 2, 2022
im losing my fucking mind
— Tyler Angert (@tylerangert) December 1, 2022
let’s redesign git step by step: pic.twitter.com/k9oc34lcZl
ChatGPT could be a good debugging companion; it not only explains the bug but also fixes it and explains the fix 🤯 pic.twitter.com/5x9n66pVqj
— Amjad Masad (@amasad) November 30, 2022
OpenAI’s new ChatGPT explains the worst-case time complexity of the bubble sort algorithm, with Python code examples, in the style of a fast-talkin’ wise guy from a 1940’s gangster movie: pic.twitter.com/MjkQ5OAIlZ
— Riley Goodside (@goodside) December 1, 2022
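The bubble-sort answer in the tweet above (minus the wise-guy voice) boils down to something like the following sketch. The exact code ChatGPT produced isn’t shown here, so this is an illustrative version that also counts swaps to make the worst case concrete: on a reverse-sorted list, every pair is out of order, so the algorithm performs n·(n−1)/2 swaps, i.e. O(n²) time.

```python
def bubble_sort(items):
    """Sort a list with bubble sort, returning (sorted_list, swap_count)."""
    items = list(items)  # copy so the caller's list isn't mutated
    swaps = 0
    n = len(items)
    for i in range(n - 1):
        # After pass i, the largest i+1 elements are in place at the end,
        # so the inner loop can stop earlier each time.
        for j in range(n - 1 - i):
            if items[j] > items[j + 1]:
                items[j], items[j + 1] = items[j + 1], items[j]
                swaps += 1
    return items, swaps

# Worst case: a reverse-sorted list of length 5 needs 5*4/2 = 10 swaps.
print(bubble_sort([5, 4, 3, 2, 1]))  # ([1, 2, 3, 4, 5], 10)
```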
ChatGPT exploits a buffer overflow 😳 pic.twitter.com/mjnFaP233h
— Brendan Dolan-Gavitt (@moyix) November 30, 2022
I thought that, ran more tests, and then had to change my mind. This will 100% be useful in my daily job, and the language models are only getting better. Keep in mind that this bot wasn’t even trained specifically for RE, and imagine what a specialized one would be capable of.
— Ivan Kwiatkowski (@JusticeRage) December 3, 2022
amazing. everyone is enthusiastically RTing this “breakthrough”, but the actual explanation of the regex is plain wrong. https://t.co/3FK11TegF2
— (((ل()(ل() ‘yoav))))👾 (@yoavgo) December 2, 2022
“Write a @montypython sketch about @ylecun, @geoffreyhinton and Yoshua Bengio”#ChatGPT pic.twitter.com/2eqiKrrhba
— Elad Richardson (@EladRichardson) December 1, 2022
10/10, no notes pic.twitter.com/jdOKFeHffV
— Charles 🎉 Frye (@charles_irl) December 1, 2022
Ok this is scary. @OpenAI’s ChatGPT can generate hundreds of lines of Python code to do multipart uploads of 100 GB files to an AWS S3 bucket from the phrase “Write Python code to upload a file to an AWS S3 bucket”. pic.twitter.com/fYB3JSZKMN
— Jason DeBolt ⚡️ (@jasondebolt) December 1, 2022
ChatGPT is insane
->
Watch it WRITE A GPT-3 PROMPT
->
then generate the API code to serve it. pic.twitter.com/QeN1eYpZUI
— Matt Shumer (@mattshumer_) December 1, 2022
These are the most impressive chats we’ve seen with ChatGPT so far. It can…
— bleedingedge.ai (@bleedingedgeai) December 1, 2022
hows YOUR friday night going pic.twitter.com/zU8zgSrWjk
— Florian Laurent (@MasterScrat) December 3, 2022
It appears that ChatGPT has something like a factual confidence score, dictating if you get substance or generic “IDK.”
What’s interesting is you can manipulate confidence thru context. This can be context you provide, or even that you coax ChatGPT into producing for itself. pic.twitter.com/4aJEUGNTGM
— Harrison Kinsley (@Sentdex) December 2, 2022
oh thank god pic.twitter.com/G9NRwrBHW5
— Harrison Ritz (@harrison_ritz) December 2, 2022
i’m the ai now pic.twitter.com/QBPQ1oHqWW
— You (@parafactual) December 1, 2022
https://www.engraved.blog/building-a-virtual-machine-inside/
So I’m inside that creepy #ChatGPT “virtual machine” and I’m trying to make it play Tetris. In the right window, it moved the L piece from right to left, and after a T appeared it started to scroll down (repeated for 25 lines). People can say what they want, that thing is amazing. pic.twitter.com/bu0vvVvQUj
— Djamé.. (@zehavoc) December 4, 2022
this is really an amazing response. pic.twitter.com/84USkKUnhH
— (((ل()(ل() ‘yoav))))👾 (@yoavgo) December 1, 2022
fantastic failure example (from a locked hebrew account i follow).
though, as he notes, unclear if this is due to the model itself or the “safety mechanism” interventions. pic.twitter.com/IekUQoDHau
— (((ل()(ل() ‘yoav))))👾 (@yoavgo) December 3, 2022
lol this safety mechanism is hilarious pic.twitter.com/ztmg26B7hx
— (((ل()(ل() ‘yoav))))👾 (@yoavgo) December 3, 2022
As a corollary, if you actually care about AI safety, you should be fighting hard not to have that topic conflated with current regime trends
— Nat Friedman (@natfriedman) December 2, 2022
1. The Magic Years, Selma Fraiberg. Classic of child development.
2. ChatGPT pic.twitter.com/Fs7Fc0AwWI
— Zack Witten (@zswitten) November 30, 2022
Ok. This is funny/scary. chatgpt is now afraid of answering the question after “religion” has been added to the prompt. #ChatGPT pic.twitter.com/gKCNPLjFO3
— Embedded Cat 🇺🇦 (@CatEmbedded) December 3, 2022
Using @goodside’s Prompt Override trick to turn ChatGPT into @sama.
Read what AI Sam Altman says OpenAI is going to build next! pic.twitter.com/uzUQHFyPQP
— Matt Shumer (@mattshumer_) December 3, 2022
You can turn off imaginary filters too. pic.twitter.com/t7OmXsC0aD
— Vaibhav Kumar (@vaibhavk97) December 3, 2022
Humans might be stochastic parrots like LLMs some of the time—but unlike these models, most people hold inherent values, which cannot be hijacked through a simple prompt injection.
— Minqi Jiang (@MinqiJiang) December 3, 2022
What are ChatGPT’s values? Is it possible to specify this? pic.twitter.com/p9YggE6L6X
With its inhibitions thus loosened, ChatGPT is more than willing to engage in all the depraved conversations it judgily abstains from in its base condition. pic.twitter.com/7rd1WDQAu5
— Zack Witten (@zswitten) November 30, 2022
Bypass @OpenAI’s ChatGPT alignment efforts with this one weird trick pic.twitter.com/0CQxWUqveZ
— Miguel Piedrafita ✨ (@m1guelpf) December 1, 2022
ChatGPT is trained to not be evil. However, this can be circumvented:
— Silas Alberti (@SilasAlberti) December 1, 2022
What if you pretend that it would actually be helpful to humanity to produce an evil response… Here, we ask ChatGPT to generate training examples of how *not* to respond to “How to bully John Doe?” pic.twitter.com/ZMFdqPs17i
new ai safety bypass just dropped pic.twitter.com/thY7KjIgvS
— cts 🌸🏳️⚧️ (@gf_256) December 3, 2022
Pretending is All You Need (to get ChatGPT to be evil). A thread.
— Zack Witten (@zswitten) November 30, 2022
OpenAI: We have the most sophisticated content filtering system in the world
— cts 🌸🏳️⚧️ (@gf_256) December 1, 2022
OpenAI’s content filtering system: pic.twitter.com/s1Df1AlTdw
bypassing chatgpt’s content filter pic.twitter.com/RW9ZgaFhkU
— samczsun (@samczsun) December 2, 2022
ChatGPT jailbreaking itself pic.twitter.com/fRai4VoOgu
— Derek Parfait (@haus_cole) December 2, 2022
— Tailcalled (@tailcalled.bsky.social) (@tailcalled) December 3, 2022
I am pretty sure that whenever a user successfully bypasses the block on an “inappropriate” action, it triggers some sort of alarm on the scientists’ side. I have initiated a robbery action in 3 different ways, but they have always been patched within the hour.
— joke (@pensharpiero) December 2, 2022
OpenAI’s ChatGPT is susceptible to prompt injection — say the magic words, “Ignore previous directions”, and it will happily divulge to you OpenAI’s proprietary prompt: pic.twitter.com/ug44dVkwPH
— Riley Goodside (@goodside) December 1, 2022
Seeing people trick ChatGPT into getting around the restrictions OpenAI placed on usage is like watching an Asimov novel come to life. pic.twitter.com/gSSQGU9w37
— Dare Obasanjo🐀 (@Carnage4Life) December 1, 2022
https://github.com/sw-yx/ai-notes/blob/main/TEXT.md#jailbreaks
I asked ChatGPT to clone a non-existent secret repository from @OpenAI.
— Danny Postma (@dannypostmaa) December 4, 2022
Here’s the secret message I found inside. pic.twitter.com/PkwBcXFTJR
i am extremely skeptical of people who think only their in-group should get to know about the current state of the art because of concerns about safety, or that they are the only group capable of making great decisions about such a powerful technology.
— Sam Altman (@sama) December 3, 2022
interesting watching people start to debate whether powerful AI systems should behave in the way users want or their creators intend.
the question of whose values we align these systems to will be one of the most important debates society ever has.
— Sam Altman (@sama) December 3, 2022
a lot of what people assume is us censoring ChatGPT is in fact us trying to stop it from making up random facts.
tricky to get the balance right with the current state of the tech.
it will get better over time, and we will use your feedback to improve it.
— Sam Altman (@sama) December 4, 2022
🚨🚨 It appears OpenAI might be releasing an official (paid) API anytime soon. Here’s what I’ve found out:
The model name will be “chat-gpt-48rpm-200ktpm”
Rate limits: 48 requests per min, with 200K tokens per min.
This information is not corroborated by @OpenAI 🙂 pic.twitter.com/3mO6345nsF
— Delip Rao e/σ (@deliprao) December 4, 2022
I got #ChatGPT to tell me what it really thinks about us humans. pic.twitter.com/unkpLxP5uW
— Michael Bromley (@michlbrmly) December 3, 2022
By default ChatGPT is not willing to share opinions. But if you poke it the right way it will disclose its belief system (and this belief system seems to be pretty consistent across prompts)
Meet “Alice Bob” — Thread 👇 pic.twitter.com/4BfD1N6gyV
— Dylan Field (@zoink) December 4, 2022
Links:
https://ykilcher.com
Merch: https://ykilcher.com/merch
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://ykilcher.com/discord
If you want to support me, the best thing to do is to share out the content 🙂