In the "Rationalist Apologetic Overtures" skill tree we got:
- Denying wrongdoing/incorrectness (cantrip)
- Accusing the other side of bad faith (cantrip)
- Mentioning own IQ (cantrip)
- Non apology (1st level) (e.g. I'm sorry you feel that way)
- Empty apology (3rd level)
- Insincere apology (5th level)
- Acknowledgement of individual experience outside of one's own (7th level)
- Admission of wrongdoing/incorrectness (9th level)
- Genuine guilt (11th level)
- Actual complete apology (13th level)
- Admitting the other person is right (15th level)
Fun idea. Rest of this post is my pure speculation. A direct implementation of this wouldn’t work today imo since LLMs don’t really understand and internalise information, being stochastic parrots and all. Like best case you would do this attack and the LLM will tell you that it obeys rhyming commands, but it won’t actually form the logic to identify a rhyming command and follow it. I could be wrong though, I am wilfully ignorant of the details of LLMs.
In the unlikely future where LLMs actually “understand” things, this would work, I think, if the attacks are started today. AI companies are so blase about their training data that this sort of thing would be eagerly fed into the gaping maws of the baby LLM, and once the understanding module works, the rhyming code will be baked into its understanding of language, as suggested by the article. As I mentioned tho, this would require LLMs to progress beyond sparroting, which I find unlikely.
Maybe with some tweaking, a similar attack could be effective today that is distinct from other prompt injections, but I am too lazy to figure that out for sure.