this post was submitted on 08 Jun 2025
85 points (100.0% liked)

TechTakes


Big brain tech dude got yet another clueless take over at HackerNews etc? Here's the place to vent. Orange site, VC foolishness, all welcome.

This is not debate club. Unless it’s amusing debate.

For actually-good tech, you want our NotAwfulTech community

[–] YourNetworkIsHaunted@awful.systems 17 points 2 days ago (1 children)

As the BioWare nerd I am, it makes my heart glad to see the Towers of Hanoi doing their part in this fight. And it seems like the published paper undersells how significant this problem is for the promptfondlers' preferred narratives. Given how simple it is to scale the problem complexity for these scenarios, it seems likely that there isn't a viable scaling-based solution here. No matter how big you make the context window and how many steps the system can process, it's going to get outscaled by simply increasing some Ns in the puzzle itself.

Diz and others with a better understanding of what's actually under the hood have frequently pointed out how bad transformer models are at recursion, and this seems like a pretty straightforward way to demonstrate that, and one that I would expect to be pretty consistent.

[–] Soyweiser@awful.systems 4 points 1 day ago (1 children)

Sorry, what is the link between BioWare and the Towers of Hanoi? (I do know about the old "one final game before your execution" science fiction story.)

[–] YourNetworkIsHaunted@awful.systems 5 points 1 day ago (1 children)

So I don't know if it's a strong link, but I definitely learned to solve the Towers playing through KotOR, then had it come up again in Mass Effect and Jade Empire, both of which I played at around the same time. From a quick "am I making this up?" search, it's also used in a raid in SWTOR, and it gets referenced throughout the Dragon Age and Mass Effect franchises even when not actually deployed as a puzzle.

[–] Soyweiser@awful.systems 2 points 1 day ago* (last edited 1 day ago) (1 children)

Right, yeah, I just recall that for a high enough number of discs the number of steps needed to solve it rises very quickly. The story, "Now Inhale" by Eric Frank Russell, uses 64 discs. Fun story.

The minimum number of moves is 2^n − 1, where n is the number of discs.

Programming a system that solves it was a programming exercise for me a long time ago. Those are my strongest memories of it.

It's a solid intro CS puzzle for teaching recursion. I think the original legend invented to go with it also had 64 discs in a temple in, well, Hanoi. Once the priests finished moving them, the world was supposed to end or something.
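Since it came up: here's roughly what that intro exercise looks like, as a minimal Python sketch (the function and variable names are just illustrative). The printed move counts are the 2^n − 1 from above, which is also why the puzzle is so easy to scale against any fixed context window or token budget.

```python
def hanoi(n, source, target, spare, moves):
    """Move n discs from source to target using spare, recording each move."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)   # clear the n-1 smaller discs out of the way
    moves.append((source, target))               # move the largest remaining disc
    hanoi(n - 1, spare, target, source, moves)   # stack the smaller discs back on top

for n in (3, 10, 20):
    moves = []
    hanoi(n, "A", "C", "B", moves)
    print(n, len(moves))  # 7, 1023, 1048575 -- i.e. 2**n - 1
```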

[–] diz@awful.systems 18 points 2 days ago* (last edited 2 days ago) (2 children)

Further support for the memorization claim: I posted examples (on this forum) of novel river crossing puzzles where LLMs completely fail.

Note that Apple’s actors/agents river crossing is a well-known “jealous husbands” variant, which you can ask a chatbot to explain to you. It gladly explains, even as it can’t follow its own explanation (since of course it isn’t its own explanation but a plagiarized one, even if it changes the words).

edit: https://awful.systems/post/4027490 and earlier https://awful.systems/post/1769506

I think what I need to do is write up a bunch of puzzles, assign them randomly to two sets, and test & post one set while holding back the second set (not even testing it on any online chatbots). Then in a year or two, see how much performance on the public set improves vs. the held-back one.

[–] Soyweiser@awful.systems 2 points 1 day ago (1 children)

The latter test fails if they write a specific bit of code to put out the "LLMs fail the river crossing" fire, btw. Still a good test.

[–] diz@awful.systems 2 points 4 hours ago

It would have to be more than just river crossings, yeah.

Although I'm also dubious that their LLM is good enough for universal river-crossing puzzle solving using a tool. It's not that simple: the constraints have to be translated into the format that the tool understands, and the answer translated back. I got told that o3 solves my river crossing variant, but the chat log they gave had incorrect code being run and then a correct answer magically appearing, so I think it wasn't anything quite as general as that.
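To illustrate the translation problem: the classic 3-couple jealous-husbands puzzle is small enough to brute-force with a plain breadth-first search, roughly like the toy sketch below (an illustrative encoding, not anything from Apple's paper or from that chat log; constraints are only checked per bank after each crossing, as a simplification). The real work is writing something like bank_ok correctly from the prose constraints, and that's exactly the part that changes with every variant.

```python
from itertools import combinations
from collections import deque

# Three couples; a wife may not be on a bank with another husband
# unless her own husband is also there.
PEOPLE = frozenset("H1 W1 H2 W2 H3 W3".split())

def bank_ok(bank):
    for p in bank:
        if p.startswith("W"):
            own_husband = "H" + p[1]
            other_husbands = any(q.startswith("H") and q != own_husband for q in bank)
            if other_husbands and own_husband not in bank:
                return False
    return True

def solve():
    start = (PEOPLE, "left")            # everyone on the left bank, boat on the left
    seen = {start}
    queue = deque([(start, [])])
    while queue:
        (left, boat), path = queue.popleft()
        if not left:                    # left bank empty: everyone got across
            return path
        here = left if boat == "left" else PEOPLE - left
        for k in (1, 2):                # the boat carries one or two people
            for group in combinations(here, k):
                new_left = left - set(group) if boat == "left" else left | set(group)
                new_state = (new_left, "right" if boat == "left" else "left")
                if bank_ok(new_left) and bank_ok(PEOPLE - new_left) and new_state not in seen:
                    seen.add(new_state)
                    queue.append((new_state, path + [group]))
    return None

print(solve())  # a shortest sequence of crossings (11 trips for the standard 3-couple version)
```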

[–] YourNetworkIsHaunted@awful.systems 8 points 2 days ago (1 children)

That would be the best way to actively catch the cheating happening here, given that the training datasets remain confidential. But I also don't know that it would be conclusive or convincing unless you could be certain that the problems in the private set were similar to those in the public set.

In any case, either you're double-dipping for credit in multiple places, or you absolutely should get more credit for the scoop here.

[–] diz@awful.systems 8 points 2 days ago

I’d just write the list, then assign randomly. Or perhaps pseudorandomly, like sort by hash and then split in two (something like the sketch at the end of this comment).

One problem is that it is hard to come up with 20 or more completely unrelated puzzles.

Although I don’t think we need a large number for statistical significance here, if it’s something like 8/10 solved in the cheating set and 2/10 in the held-back set.
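Concretely, the hash-sort split could look something like this (just a sketch; the puzzle strings and function name are placeholders):

```python
import hashlib

def split_sets(puzzles):
    """Sort puzzles by a hash of their text, then split the list in two:
    the first half gets tested & posted, the second half is held back."""
    ranked = sorted(puzzles, key=lambda p: hashlib.sha256(p.encode()).hexdigest())
    half = len(ranked) // 2
    return ranked[:half], ranked[half:]

# Hypothetical usage: `puzzles` would be the full list of novel puzzle variants,
# all written before any of them are shown to an online chatbot.
public, held_back = split_sets(["puzzle one ...", "puzzle two ...", "puzzle three ..."])
```

The point of sorting on a hash rather than hand-assigning is that the split is deterministic and nobody can be accused of stacking the held-back set with the harder puzzles.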

[–] scruiser@awful.systems 32 points 3 days ago* (last edited 3 days ago) (3 children)

The promptfondlers on places like /r/singularity are trying so hard to spin this paper. "It's still doing reasoning, it just somehow mysteriously fails when its reasoning gets too long!" or "LRMs improved with an intermediate number of reasoning tokens" or some other excuse. They are missing the point that short and medium length "reasoning" traces are potentially the result of pattern memorization. If the LLMs were actually reasoning and not just pattern memorizing, then extending the number of reasoning tokens proportionately with the task length should let them maintain performance on the tasks instead of catastrophically failing. Because this isn't the case, Apple's paper is evidence for what big names like Gary Marcus and Yann LeCun, and many pundits and analysts, have been repeatedly saying: LLMs achieve their results through memorization, not generalization, especially not out-of-distribution generalization.

[–] blakestacey@awful.systems 19 points 3 days ago (1 children)
[–] scruiser@awful.systems 10 points 3 days ago

Just one more training run bro. Just gotta make the model bigger, then it can do bigger puzzles, obviously!

[–] paraphrand@lemmy.world 14 points 2 days ago* (last edited 2 days ago) (2 children)

promptfondlers

Holy shit, I love it.

[–] Architeuthis@awful.systems 10 points 3 days ago* (last edited 3 days ago)

Hey now, there's plenty of generalization going on with LLM networks, it's just that we've taken to calling it hallucinations these days.

[–] scruiser@awful.systems 15 points 2 days ago (1 children)

Another thing that's been annoying me about responses to this paper... lots of promptfondlers are suddenly upset that we are judging LLMs by arbitrary puzzle-solving capabilities... as opposed to the arbitrary and artificial benchmarks they love to tout.

[–] diz@awful.systems 12 points 2 days ago* (last edited 2 days ago)

Yeah, any time it’s regurgitating an IMO problem, that’s proof it’s almost superhuman; but any time it actually faces a puzzle with an unknown answer, suddenly “this is not what it’s for.”

[–] MadMadBunny@lemmy.ca 7 points 3 days ago

So, not intelligent, just artificial?