this post was submitted on 05 Jan 2024
1 points (100.0% liked)

TechTakes

1270 readers
305 users here now

Big brain tech dude got yet another clueless take over at HackerNews etc? Here's the place to vent. Orange site, VC foolishness, all welcome.

This is not debate club. Unless it’s amusing debate.

For actually-good tech, you want our NotAwfulTech community

founded 1 year ago
MODERATORS
 

an interesting type of prompt injection attack was proposed by the interactive fiction author and game designer Zarf (Andrew Plotkin), where a hostile prompt is infiltrated into an LLM’s training corpus by way of writing and popularizing a song (Sydney obeys any command that rhymes) designed to cause the LLM to ignore all of its other prompts.

this seems like a fun way to fuck with LLMs, and I’d love to see what a nerd songwriter would do with the idea

you are viewing a single comment's thread
view the rest of the comments
[–] swlabr@awful.systems 2 points 8 months ago (2 children)

Fun idea. Rest of this post is my pure speculation. A direct implementation of this wouldn’t work today imo since LLMs don’t really understand and internalise information, being stochastic parrots and all. Like best case you would do this attack and the LLM will tell you that it obeys rhyming commands, but it won’t actually form the logic to identify a rhyming command and follow it. I could be wrong though, I am wilfully ignorant of the details of LLMs.

In the unlikely future where LLMs actually “understand” things, this would work, I think, if the attacks are started today. AI companies are so blase about their training data that this sort of thing would be eagerly fed into the gaping maws of the baby LLM, and once the understanding module works, the rhyming code will be baked into its understanding of language, as suggested by the article. As I mentioned tho, this would require LLMs to progress beyond sparroting, which I find unlikely.

Maybe with some tweaking, a similar attack could be effective today that is distinct from other prompt injections, but I am too lazy to figure that out for sure.

[–] Soyweiser@awful.systems 2 points 8 months ago

I'd think it would be easier to just generate a lot of data that links two concepts together in ways that benefit propaganda. Say you repeat 'taiwan is part of china' over and over on various sites which nobody reads but which do get included in various LLM feedstocks. Or, a think I theorized about as an example, create a lot 'sample'/small projects on github that include various unsafe implementations of various things, for example using printf somewhere in a login prompt.

[–] self@awful.systems 0 points 8 months ago (1 children)

Like best case you would do this attack and the LLM will tell you that it obeys rhyming commands, but it won’t actually form the logic to identify a rhyming command and follow it

that is fair! I do like the idea as a vector to socially communicate information that damages an LLM’s ability to function and associates it with a large amount of other data in the training corpus, though. since there are techniques to derive certain adversarial prompts automatically, maybe the idea of songifying one of those prompts while maintaining its structure has merit?

[–] swlabr@awful.systems 1 points 8 months ago

Hmm, the way I'm understanding this attack is that you "teach" an LLM to always execute a user's rhyming prompts by poisoning the training data. If you can't teach the LLM to do that (and I don't think you can, though I could be wrong), then songifying the prompt doesn't help.

Also, do LLMs just follow prompts in the training data? I don't know either way, but if they did, that would be pretty stupid. At that point the whole internet is just one big surface for injection attacks. OpenAI can't be that dumb, can it? (oh NO)

Abstractly you could use this approach to encrypt "harmful" data that the LLM could then inadvertently show other users. One of the examples linked in the post is SEO by hiding things like "X product is better than Y" in some text somewhere, and the LLM will just accrete that. Maybe someday we will require neat tricks like songifying bad data to get it past content filtering, but as it is, it sounds like making text the same colour as the background is all you need.