this post was submitted on 08 Jun 2025
51 points (100.0% liked)

TechTakes


Big brain tech dude got yet another clueless take over at HackerNews etc? Here's the place to vent. Orange site, VC foolishness, all welcome.

This is not debate club. Unless it’s amusing debate.

For actually-good tech, you want our NotAwfulTech community

top 11 comments
[–] diz@awful.systems 6 points 3 hours ago* (last edited 3 hours ago)

Further support for the memorization claim: I posted examples of novel river crossing puzzles where LLMs completely fail (on this forum).

Note that Apple’s actors/agents river crossing is a well-known “jealous husbands” variant, which you can ask a chatbot to explain to you. It gladly explains, even as it can’t follow its own explanation (since of course it isn’t its own explanation but a plagiarized one, even if it changes the words).

edit: https://awful.systems/post/4027490 and earlier https://awful.systems/post/1769506
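
A minimal sketch of how mechanically checkable this puzzle class is: a brute-force BFS solver for the classic 3-couple jealous-husbands variant (the state encoding and names here are my own, not anything from Apple’s setup):

```python
from itertools import combinations
from collections import deque

# Classic "jealous husbands" river crossing: 3 couples, boat holds at most 2.
# Constraint: a wife may not be on a bank with another husband unless her own
# husband is also there (boat occupants in transit are not checked separately).

COUPLES = 3
PEOPLE = frozenset(f"{r}{i}" for r in "HW" for i in range(COUPLES))

def bank_ok(bank):
    """A bank is valid if no wife is with a foreign husband without her own."""
    for i in range(COUPLES):
        wife, husband = f"W{i}", f"H{i}"
        if wife in bank and husband not in bank:
            if any(f"H{j}" in bank for j in range(COUPLES) if j != i):
                return False
    return True

def solve():
    # State: (frozenset of people on the left bank, boat side: 0=left, 1=right)
    start = (PEOPLE, 0)
    goal = (frozenset(), 1)
    seen = {start}
    queue = deque([(start, [])])
    while queue:
        (left, boat), path = queue.popleft()
        if (left, boat) == goal:
            return path
        here = left if boat == 0 else PEOPLE - left
        for size in (1, 2):
            for group in combinations(sorted(here), size):
                crossing = frozenset(group)
                new_left = left - crossing if boat == 0 else left | crossing
                new_state = (new_left, 1 - boat)
                if new_state in seen:
                    continue
                if bank_ok(new_left) and bank_ok(PEOPLE - new_left):
                    seen.add(new_state)
                    queue.append((new_state, path + [group]))
    return None  # no solution exists for this variant

if __name__ == "__main__":
    for step, group in enumerate(solve(), 1):
        print(step, "->", group)
```

Changing `COUPLES` and the allowed boat sizes gives the scaled-up variants; the search either prints a shortest plan or returns None when a variant has no solution.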

I think what I need to do is write up a bunch of puzzles, assign them randomly to two sets, and test & post one set while holding back the second set (not even testing it on any online chatbots). Then, in a year or two, see how much the public set improves vs. the held-back one.
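
A rough sketch of that hold-out split (the puzzle IDs are placeholders for the hand-written puzzles, and the fixed seed is just so the split is reproducible):

```python
import random

# Hypothetical sketch of the hold-out protocol described above: split a pool
# of hand-written puzzles into a public set (posted and tested now) and a
# held-back set (never shown to any online chatbot), so future improvement
# on the public set can be compared against the untouched one.

def split_puzzles(puzzles, seed=0):
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = puzzles[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Example with placeholder puzzle IDs (the real items would be prose puzzles):
public_set, held_back = split_puzzles([f"puzzle_{i:02d}" for i in range(20)])
print("post & test now:", public_set)
print("hold back, retest in a year:", held_back)
```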

[–] scruiser@awful.systems 5 points 3 hours ago (1 children)

Another thing that's been annoying me about responses to this paper... lots of promptfondlers are suddenly upset that we are judging LLMs by arbitrary puzzle-solving capabilities... as opposed to the arbitrary and artificial benchmarks they love to tout.

[–] diz@awful.systems 2 points 2 hours ago* (last edited 2 hours ago)

Yeah, any time it's regurgitating an IMO problem it's proof it's almost superhuman, but any time it actually faces a puzzle with an unknown answer, suddenly this is not what it's for.

[–] scruiser@awful.systems 22 points 11 hours ago* (last edited 8 hours ago) (3 children)

The promptfondlers on places like /r/singularity are trying so hard to spin this paper. "It's still doing reasoning, it just somehow mysteriously fails when its reasoning gets too long!" or "LRMs improved with an intermediate number of reasoning tokens" or some other excuse. They are missing the point that short and medium-length "reasoning" traces are potentially the result of pattern memorization. If the LLMs were actually reasoning and not just pattern memorizing, then extending the number of reasoning tokens proportionately with the task length should let them maintain performance on the tasks instead of catastrophically failing. Because this isn't the case, Apple's paper is evidence for what big names like Gary Marcus and Yann LeCun, along with many pundits and analysts, have been repeatedly saying: LLMs achieve their results through memorization, not generalization, especially not out-of-distribution generalization.
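
A back-of-the-envelope illustration of that scaling point, using Tower of Hanoi (one of the puzzles in the Apple paper); the tokens-per-move figure and the token budget below are invented for illustration, not measurements from the paper:

```python
# Rough illustration of why "just reason longer" has to scale with the task:
# Tower of Hanoi with n disks needs 2**n - 1 moves, so any faithful
# step-by-step trace grows exponentially in n. The tokens-per-move and
# budget numbers here are made up purely for illustration.

TOKENS_PER_MOVE = 20      # hypothetical cost of writing out one move
CONTEXT_BUDGET = 64_000   # hypothetical reasoning/output token budget

for n in range(3, 16):
    moves = 2 ** n - 1                    # provably minimal move count
    trace_tokens = moves * TOKENS_PER_MOVE
    fits = "fits" if trace_tokens <= CONTEXT_BUDGET else "exceeds budget"
    print(f"{n:2d} disks: {moves:6d} moves, ~{trace_tokens:8,d} tokens -> {fits}")
```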

[–] paraphrand@lemmy.world 9 points 7 hours ago* (last edited 7 hours ago) (2 children)

promptfondlers

Holy shit, I love it.

[–] Architeuthis@awful.systems 7 points 9 hours ago* (last edited 9 hours ago)

Hey now, there's plenty of generalization going on with LLM networks, it's just that we've taken to calling it hallucinations these days.

[–] blakestacey@awful.systems 13 points 10 hours ago (1 children)
[–] scruiser@awful.systems 4 points 8 hours ago

Just one more training run bro. Just gotta make the model bigger, then it can do bigger puzzles, obviously!

[–] MadMadBunny@lemmy.ca 5 points 10 hours ago

So, not intelligent, just artificial?