this post was submitted on 17 Jun 2025

TechTakes


I love to show that kind of shit to AI boosters. (In case you're wondering, the numbers were chosen randomly and the answer is incorrect).

They go "waaa waaa, it's not a calculator", and then I can point out that it got the leading 6 digits and the last digit correct, which is a lot better than it did on the "softer" parts of the test.

[–] scruiser@awful.systems 5 points 7 hours ago (2 children)

Have they fixed it, as in it genuinely uses Python completely reliably, or "fixed" it, as in they tweaked the prompt and now it uses Python 95% of the time instead of 50/50? I'm betting on the latter.

[–] diz@awful.systems 2 points 3 hours ago

Yeah, I'd also bet on the latter. They also added a fold-out button that shows you the code it wrote (folded by default), but you have to unfold it, or notice when the button is absent, to tell whether any code was actually run.

[–] aramova@infosec.pub 4 points 7 hours ago (1 children)

Non-deterministic LLMs will always have randomness in their output. The best they can hope for is layers of sanity checks, which slow things down and cost more.

[–] scruiser@awful.systems 5 points 6 hours ago

If you wire the LLM directly into a proof-checker (like with AlphaGeometry) or evaluation function (like with AlphaEvolve) and the raw LLM outputs aren't allowed to do anything on their own, you can get reliability. So you can hope for better, it just requires a narrow domain and a much more thorough approach than slapping some extra firm instructions in an unholy blend of markup languages in the prompt.
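To make the "raw outputs aren't allowed to do anything on their own" idea concrete, here is a minimal sketch of a checker-gated loop. Everything in it is an illustrative assumption, not how AlphaGeometry or AlphaEvolve actually work: `noisy_proposer` is a hypothetical stand-in for the model, and factoring a small integer is just a toy narrow domain where checking an answer is cheap and exact.

```python
import random
from typing import Callable, Optional, Tuple

Pair = Tuple[int, int]


def check_factors(n: int, pair: Pair) -> bool:
    """Cheap, exact checker: accept only a nontrivial factorization of n."""
    p, q = pair
    return p > 1 and q > 1 and p * q == n


def gated(n: int, propose: Callable[[int], Pair],
          max_tries: int = 1000) -> Optional[Pair]:
    """Return the first proposal the checker accepts, else None.

    Unchecked proposals never leave this function.
    """
    for _ in range(max_tries):
        pair = propose(n)
        if check_factors(n, pair):
            return pair
    return None


def noisy_proposer(n: int) -> Pair:
    """Hypothetical stand-in for a raw model: it guesses, usually wrongly."""
    p = random.randint(2, n - 1)
    return p, n // p


if __name__ == "__main__":
    # Either a verified factor pair of 91 or None; never an unverified guess.
    print(gated(91, noisy_proposer))
```

The reliability comes entirely from `check_factors`. The proposer can be as flaky as you like, and the cost is paid in retries.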

In this case, solving math problems is actually something Google search could previously do (before dumping AI into it) and Wolfram Alpha can do, so it really seems like Google should be able to offer a product that does math problems right. Of course, this solution would probably involve bypassing the LLM altogether through preprocessing and postprocessing.
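A rough sketch of what that kind of preprocessing could look like, assuming the query is plain integer arithmetic. The function name `exact_answer`, the supported operators, and the example numbers are all made up for illustration; this says nothing about what Google actually does.

```python
import ast
import operator
import re
from typing import Optional

# Only exact integer operations; anything else is rejected.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}
_ARITH = re.compile(r"[\d\s+\-*()]+")


def _eval(node: ast.AST) -> int:
    """Evaluate a whitelisted arithmetic AST with Python's exact integers."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval(node.operand)
    raise ValueError("unsupported expression")


def exact_answer(query: str) -> Optional[int]:
    """Return an exact result for pure arithmetic queries, else None.

    None means "not arithmetic, hand it to something else" (not shown here).
    """
    text = query.strip().rstrip("=? ")
    if not text or not _ARITH.fullmatch(text):
        return None
    try:
        return _eval(ast.parse(text, mode="eval").body)
    except (SyntaxError, ValueError):
        return None


if __name__ == "__main__":
    print(exact_answer("12345 * 67890 = ?"))      # exact integer product
    print(exact_answer("why is the sky blue?"))   # None: not arithmetic
```

The point of the whitelist is that the arithmetic path never touches a model at all, so there is nothing to get "95% right".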

Also, btw, an LLM can be (technically speaking) deterministic if the temperature is set all the way down to 0, it's just that this doesn't actually improve its performance at math or anything else. And it would still be "random" in the sense that minor variations in the prompt or previous context can induce seemingly arbitrary changes in output.
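For anyone who wants to see both halves of that in miniature, here is a toy sketch of temperature sampling over made-up logits (the numbers and the `sample_token` name are assumptions for illustration, not taken from any real model). At temperature 0 it reduces to picking the top-scoring token every time, yet a tiny nudge to the scores still flips which token wins.

```python
import math
import random
from typing import List


def sample_token(logits: List[float], temperature: float,
                 rng: random.Random) -> int:
    """Pick a token index from logits at the given temperature."""
    if temperature == 0.0:
        # Greedy decoding: always the highest logit, no randomness involved.
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random()
    logits = [2.00, 2.01, 0.5]    # made-up scores for three tokens
    nudged = [2.02, 2.01, 0.5]    # same scores after a tiny context change
    print({sample_token(logits, 0.0, rng) for _ in range(100)})  # one index only
    print(sample_token(nudged, 0.0, rng))                        # a different index
    print([sample_token(logits, 1.0, rng) for _ in range(10)])   # varies run to run
```

So temperature 0 removes the dice roll, but not the sensitivity to whatever came before in the prompt.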