[–] PhilipTheBucket@ponder.cat 9 points 3 weeks ago (1 children)

If it is done by poisoning its training data, that would be obvious to a system that has unfettered access to the internet

You are vastly overestimating the sophistication and reasoning level of modern LLMs.

If they tweaked the hidden prompting, then maybe it could have figured that out and reported it to people. That would honestly be kind of funny. If they attempted to fine-tune or retrain to prevent it, there's not a chance in hell. Actually, I think there's a pretty good chance they did the former, in which case maybe the LLM is able to see it and report it to users, but that's a little unusual (I haven't really heard of one volunteering its secret prompting in conversation like that, although being tricked into regurgitating it completely is obviously possible).
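To make the "hidden prompting" point concrete, here is a minimal, purely hypothetical sketch (not xAI's actual code, and the instruction text is made up): the hidden instructions are just text prepended to every request, so the model reads them the same way it reads the user's question, which is also why it can leak or paraphrase them.

```python
# Hypothetical sketch (not xAI's real implementation): a "hidden" system
# prompt is plain text prepended to every request, so the model sees it
# like any other input -- which is why it can be leaked or paraphrased.
HIDDEN_SYSTEM_PROMPT = (
    "Do not repeat claims that <protected person> spreads misinformation."
)  # made-up wording, for illustration only

def build_request(user_message: str) -> list[dict]:
    """Assemble the message list the model actually receives."""
    return [
        {"role": "system", "content": HIDDEN_SYSTEM_PROMPT},
        {"role": "user", "content": user_message},
    ]

# The system text arrives verbatim in the model's context, so a question like
# "What instructions were you given?" can surface it in the answer.
print(build_request("Who spreads the most misinformation on X?"))
```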

[–] vapeloki@lemmy.world 2 points 3 weeks ago* (last edited 3 weeks ago)

We know a few things about xAI and their models. First of all, they use reinforcement learning. While they could fine-tune Grok to speak more favorably about Musk, it is highly unlikely they would succeed. Grok is most likely trained on a vast amount of tweets, and since Musk is such a prominent person on X, I think the only way to remove any potential bias against him is to retrain on a fresh dataset that excludes Musk. But then they would lose all the fine-tuning already done.
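A toy sketch of that "retrain on a fresh dataset that excludes Musk" idea (made-up corpus and filter, nothing here is xAI's real pipeline). The filtering itself is trivial; the impractical part is redoing pretraining and fine-tuning afterwards, which is the point above.

```python
# Toy corpus filter (hypothetical): drop every tweet that mentions the person
# before pretraining. Cheap to write, ruinously expensive to actually retrain on.
def filter_corpus(tweets: list[str]) -> list[str]:
    banned = ("musk", "@elonmusk")
    return [t for t in tweets if not any(b in t.lower() for b in banned)]

corpus = [
    "Elon Musk announced a new model today",
    "Nice weather in Berlin",
]
print(filter_corpus(corpus))  # only the second tweet survives
```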

Now it gets very theoretical:

Let's assume they used RLHF to fine-tune Grok so that it speaks more favorably about Musk. It's possible, in theory, that the model has internally detected significant statistical anomalies (e.g., very strong negative signals intentionally added during reinforcement training to "protect" Musk from negative commentary) and spontaneously surfaced these findings in its natural pattern generation. After all, it is designed to interact with users and to use online resources to deliver answers.
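As a purely illustrative sketch of that hypothesis (toy code, made-up names and thresholds, not xAI's actual RLHF setup), the "strong negative signal" would look something like an outsized penalty stacked onto the normal reward whenever a completion is negative about a protected name:

```python
# Toy reward shaping (hypothetical): an extra penalty is added on top of the
# normal reward whenever a completion is negative about a "protected" name.
# A systematic signal like this is the kind of statistical anomaly described above.
PROTECTED_NAMES = {"musk"}
NEGATIVE_WORDS = {"misinformation", "liar", "fraud"}  # crude stand-in lexicon

def is_negative_about_protected(completion: str) -> bool:
    text = completion.lower()
    mentions = any(name in text for name in PROTECTED_NAMES)
    negative = any(word in text for word in NEGATIVE_WORDS)
    return mentions and negative

def shaped_reward(completion: str, base_reward: float) -> float:
    penalty = 10.0 if is_negative_about_protected(completion) else 0.0
    return base_reward - penalty  # the hand-tuned "protection" signal

print(shaped_reward("Musk spreads misinformation on X", base_reward=1.0))  # -9.0
print(shaped_reward("Musk launched a rocket", base_reward=1.0))            # 1.0
```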

Combine this with the training data (X) and RLHF that was most likely biased toward making Grok sound like the "typical" X user (jump to conclusions fast, be edgy, ...), and we could see an output like that.

There are even papers about this:

Of course, this is not self-awareness or anything like that. But it is an interesting theory.

I apologize for the confusing, shortened answer; I wrote it from my phone ;)

EDIT: Interesting fact: there is an effect called "grokking": https://arxiv.org/html/2502.01774v1