this post was submitted on 20 Jul 2024

389 points (94.7% liked)

Some bad code just broke a billion Windows machines (www.youtube.com)

submitted 3 months ago by TimeSquirrel@kbin.melroy.org to c/technology@lemmy.world

120 comments

Cybersecurity firm Crowdstrike pushed an update that caused millions of Windows computers to enter recovery mode, triggering the blue screen of death. Learn ...

[–] dan@upvote.au 66 points 3 months ago* (last edited 3 months ago) (6 children)

Are there really a billion systems in the world that run Crowdstrike? That seems implausible. Is it just hyperbole?

[–] MeekerThanBeaker@lemmy.world 44 points 3 months ago (1 children)

Probably includes a bunch of virtual machines.

[–] Joelk111@lemmy.world 21 points 3 months ago (1 children)

Yeah, our VMs completely died at work. Had to set up temporary stuff on hardware we had lying around today. Was kinda fun, but stressful haha.

[–] dan@upvote.au 9 points 3 months ago (3 children)

Could you just revert VMs to a snapshot before the update? Or do you not take periodic snapshots? You could probably also mount the VM's drive on the host and delete the relevant file that way.
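
A rough sketch of the "mount the drive and delete the file" idea, in Python. Assumptions, none of which come from this thread: the directory and the `C-00000291*.sys` pattern are taken from CrowdStrike's published workaround, the function names are made up for illustration, and the guest disk is already mounted read-write on the host. It defaults to a dry run so nothing is deleted by accident.

```python
from pathlib import Path

# Assumption: location and pattern follow CrowdStrike's published workaround;
# neither is named in this thread. On a case-sensitive host filesystem the
# directory casing on the mounted volume may differ and need adjusting.
CROWDSTRIKE_DIR = Path("Windows/System32/drivers/CrowdStrike")
BAD_FILE_PATTERN = "C-00000291*.sys"


def find_offending_files(mount_point):
    """Return matching channel files under a mounted Windows volume, sorted."""
    target = Path(mount_point) / CROWDSTRIKE_DIR
    if not target.is_dir():
        return []
    return sorted(target.glob(BAD_FILE_PATTERN))


def remove_offending_files(mount_point, dry_run=True):
    """Delete matching files; with dry_run=True only report what would go."""
    matches = find_offending_files(mount_point)
    for path in matches:
        if not dry_run:
            path.unlink()
    return matches
```

Calling `remove_offending_files("/mnt/guest-disk")` (a hypothetical mount point) only lists candidates; pass `dry_run=False` to actually delete them. This still has to be repeated per VM disk, which is the scaling problem raised in the reply below.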

[–] EncryptKeeper@lemmy.world 10 points 3 months ago (1 children)

Yes, you can just go into safe mode on an affected machine and delete the offending file. The problem is it took a couple of hours before that resolution was found, and it has to be done by hand on every VM. I can’t just run an Ansible playbook against hundreds of non-booted VMs. Then, in the case of servers, you have to consider that there might be a specific startup order: certain things might have to be started before other things, and further fixing might be required given that every VM hard crashed. At a minimum it took many companies 6-12 hours to get back up and running, and for many more it could take days.
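
To make the startup-order point concrete, here is a minimal sketch that treats it as a dependency graph and derives a safe boot sequence with a topological sort. The service names and the `boot_order` helper are hypothetical, not anyone's real environment; it uses `graphlib` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical services; each maps to the set of services it must wait for.
DEPENDS_ON = {
    "domain-controller": set(),
    "database": {"domain-controller"},
    "app-server": {"database"},
    "web-frontend": {"app-server"},
}


def boot_order(depends_on):
    """Return an order in which every service follows all its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for name in boot_order(DEPENDS_ON):
        print("start", name)
```

A circular dependency makes `static_order()` raise `graphlib.CycleError`, which is arguably what you want to find out before a mass recovery rather than during one.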

[–] dan@upvote.au 4 points 3 months ago

Makes sense - thanks for the details.

[–] biggerbogboy@sh.itjust.works 27 points 3 months ago (1 children)

I doubt it's too much of a stretch. Even here in Australia, we've had multiple airlines, news stations, banks, supermarkets and many others, including the aluminium extrusion business my father works at, all go down. Scale this to hundreds of countries with populations tenfold of ours, and it puts into perspective that there may even be more than a billion machines affected.

[–] Imgonnatrythis@sh.itjust.works 16 points 3 months ago (1 children)

Despite how it may seem on Lemmy, most people have not yet actually switched to Linux. This stat is legit.

[–] dan@upvote.au 9 points 3 months ago (1 children)

I know that Windows is everywhere, I just don't know the percentage of Windows computers that run Crowdstrike.

[–] TheDarksteel94@sopuli.xyz 10 points 3 months ago (3 children)

Keep in mind, it's not just clients, but servers too. A friend of mine works for a decently sized company that has about 1600 (virtual) servers internationally. And yes, all of them were affected.

[–] Coasting0942@reddthat.com 12 points 3 months ago

Yes

[–] TexMexBazooka@lemm.ee 5 points 3 months ago

Sounds pretty plausible to me. An organization doesn’t have to be very big to get into the hundreds or thousands of devices on a network when you account for servers and VMs.

A company with 40 employees all accessing an RDS server using a company laptop is looking at 85+ devices already.

[–] JeeBaiChow@lemmy.world 63 points 3 months ago* (last edited 3 months ago) (3 children)

Whoda thunk automatic updates to critical infrastructure was a good idea? Just hope healthcare life support was not affected.

[–] Toribor@corndog.social 68 points 3 months ago (6 children)

Many compliance frameworks require security utilities to receive automatic updates. It's pretty essential for effective endpoint protection considering how fast new threats spread.

The problem is not the automated update, it's why it wasn't caught in testing and how the update managed to break the entire OS.

[–] jbloggs777@discuss.tchncs.de 7 points 3 months ago* (last edited 3 months ago) (1 children)

It is pretty easy to imagine separate streams of updates that affect each other negatively.

CrowdStrike does its own 0-day updates, Microsoft does its own 0-day updates. There is probably limited if any testing at that critical intersection.

If Microsoft 100% controlled the release stream, otoh, there'd be a much better chance to have caught it. The responsibility would probably lie with MS in such a case.

(edit: not saying that this is what happened, hence the conditionals)

[–] Toribor@corndog.social 13 points 3 months ago (1 children)

I don't think that is what happened here in this situation though, I think the issue was caused exclusively by a Crowdstrike update but I haven't read anything official that really breaks this down.

[–] barsquid@lemmy.world 15 points 3 months ago (1 children)

Some comments yesterday were claiming the offending file was several kb of just 0s. All signs are pointing to a massive fuckup from an individual company.
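
For anyone who wants to check a claim like that against a file they actually have, here is a small sketch. It only tests whether a file consists entirely of zero bytes; whether the claim about the update file is accurate is not established in this thread, and `is_all_zero_bytes` is just an illustrative name.

```python
def is_all_zero_bytes(path, chunk_size=65536):
    """Return True if the file is non-empty and every byte is 0x00."""
    saw_data = False
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            saw_data = True
            # bytes.count(0) counts zero-valued bytes in the chunk.
            if chunk.count(0) != len(chunk):
                return False
    return saw_data
```

Reading in chunks keeps memory use flat even for large files, and an empty file is reported as `False` rather than trivially "all zeros".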

[–] Wiz@midwest.social 4 points 3 months ago

Which makes me wonder, did the company even test it at all on their own machines first?

[–] LodeMike 18 points 3 months ago (1 children)

Hospital stuff was affected. Most engineers are smart enough to not connect critical equipment to the Internet, though.

[–] arunwadhwa@lemmy.world 20 points 3 months ago (4 children)

I’m not in the US, but my medical peers who are there mentioned that EPIC (the software most hospitals use to manage patient records) was not affected, but Dragon (the software by Nuance that we doctors use for dictation so we don’t have to type notes) was down. Someone I know complained that they had to “type notes like a medieval peasant.” But I’m glad that the critical infrastructure was up and running. At my former hospital, we always maintained physical records simultaneously for all our current inpatients, accessible only to the medical team responsible for those specific patients, just to be on the safe side.

[–] JeeBaiChow@lemmy.world 5 points 3 months ago (6 children)

That's actually a very smart idea, keeping physical records of every inpatient. Wonder why the AI companies don't do transcription of medical notes, instead of trying to add AI features to my washer/dryer combo. Just seems like a very practical use of the tech.

[–] deranger@sh.itjust.works 4 points 3 months ago

I’m an Epic analyst - while Epic was fine, many of our third party integrations shit the bed. Cardiology (where I work) was mostly unaffected aside from Omnicell being down, but the laboratory was massively fucked due to all the integrations they have. Multiple teams were quite busy, I just got to talk to them about it eventually.

[–] RunningInRVA@lemmy.world 4 points 3 months ago (4 children)

This is pretty much correct. I work in an Epic shop and we had about 150 servers to remediate and some number of workstations (I’m not sure how many). While Epic may not have been impacted, it is a highly integrated system, and when things are failing around it, that can have an impact on care delivery. For example, if a provider places a stat lab order in Epic, that lab order gets transmitted to an integration middleware which then routes it to the lab system. If the integration middleware or the lab system is down, then the provider has no idea the stat order went into a black hole.
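
A toy sketch of the "black hole" failure mode and one generic way to surface it: never report an order as placed until something downstream acknowledges it. Everything here (`Middleware`, `place_order`, the status strings) is hypothetical and has nothing to do with how Epic or any real integration engine is built.

```python
class Middleware:
    """Hypothetical integration engine routing orders to downstream systems."""

    def __init__(self, routes):
        self.routes = routes  # maps order type -> callable that delivers it
        self.up = True

    def forward(self, order):
        if not self.up:
            raise ConnectionError("middleware unavailable")
        deliver = self.routes[order["type"]]
        return deliver(order)  # downstream returns an acknowledgement string


def place_order(order, middleware):
    """Return (status, detail); never report success without a downstream ack."""
    try:
        ack = middleware.forward(order)
    except (ConnectionError, KeyError) as exc:
        return ("UNDELIVERED", str(exc))
    if not ack:
        return ("UNDELIVERED", "no acknowledgement from downstream")
    return ("ACKED", ack)
```

The point of the sketch is only that the caller gets an explicit `UNDELIVERED` status it can show to the person who placed the order, instead of silence.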

[–] ansiz@lemmy.world 22 points 3 months ago (3 children)

There is no learning; companies just move to a different antivirus. It's the new hotness, and the cycle repeats over and over until the new antivirus does this same shit. Look at McAfee in 2010; in fact, the CEO of Crowdstrike was the CTO of McAfee then. That easily took down millions of Windows XP machines.

[–] SpikesOtherDog@ani.social 18 points 3 months ago

[–] snownyte@kbin.run 14 points 3 months ago

Combing over its Wikipedia article, this company already had a series of other issues.

Sucks to anyone who ever relied on them. Oh look at that, they've been acquiring other security startups and companies. Perhaps that should also be looked into as well?

[–] corsicanguppy@lemmy.ca 10 points 3 months ago* (last edited 3 months ago) (2 children)

There is learning here.

As companies, we put faith in an external entity with goals not identical to our own: a lot of faith, and a lot of control.

That company had the power to destroy our businesses, cripple travel and medicine and our courts, and delay daily work that could include some timely and critical tasks.

This is not crowdstrike's fault; for the bad code yes, but for the indirect effects of that no. We knew - please tell me we had the brains god gave a gnat and we knew - that putting so much control in the hands of outsiders not concerned or aware of our detailed needs and priorities, was a negligent and foolish thing to do.

The lesson is to do our jobs: we need to ensure we have the ability to make the decisions with which we're entrusted, and that the power that authority gives us, and our decisions once accepted, are not threatened by a negligent mistake so boneheaded it's all but the whim of a simpleton. We cannot choose to manage our part of our organization effectively, no matter how (un)important that organization or part is, and then share control with a force that we've seen can run roughshod over it.

It's exactly like the leopards eating our face, except people didn't see they were leopards. No one blames the leopards, as they're just conforming to their nature, eventually.

And no one should blame this company for a small mistake, just because we let the jaws get so close to our faces that we became complacent.

[–] BeardedGingerWonder@feddit.uk 13 points 3 months ago (2 children)

Have you never worked in corporate IT or something? Of course we should blame Crowdstrike, that way we don't get a sev 1 on our scorecard.

[–] stephen01king@lemmy.zip 6 points 3 months ago (1 children)

It's funny that corporate IT will be one of the groups getting the blame in this case, despite it being in most cases not their decision that a company lacks a separate test and production environment. The executives that decided that usually get off scot-free.

[–] Yaztromo@lemmy.world 7 points 3 months ago

That company had the power to destroy our businesses, cripple travel and medicine and our courts, and delay daily work that could include some timely and critical tasks.

Unless you have the ability and capacity to develop your own ISA/CPU architecture, firmware, OS, and every tool you use from the ground up, you will always be, at some point, “relying on others’ stuff” which can break on you at a moment’s notice.

That could be Intel, or Microsoft, or OpenSSH, or CrowdStrike^0. Very, very, very few organizations can exist in the modern computing world without relying on others’ code/hardware (with the main two that come to mind that could, outside smaller embedded systems, being IBM and Apple).

I do wish that consumers had held Microsoft more to account over the last few decades to properly use the Intel Protection Rings (if the CrowdStrike driver were able to run in Ring 1, then it’s possible the OS could have isolated it and prevented a BSOD, but instead it runs in Ring 0 with the kernel and has access to damage anything and everything) — but that horse appears to be long out of the gate (enough so that X86S proposes only having Ring 0 and Ring 3 for future processors).

But back to my basic thesis: saying “it’s your fault for relying on other people’s code” is unhelpful and overly reductive, as in the modern day it’s virtually impossible not to. Even fully auditing your stacks is prohibitive. There is a good argument to be made about not living in a compute monoculture^1, and lots of good arguments against ever using Windows^2 (especially in the cloud), but those aren’t the arguments you’re making. Saying “this is your fault for relying on other people’s stuff” is unhelpful, and I somehow doubt you designed your own ISA, CPU architecture, firmware, OS, network stack, and application code to post your comment.

-----
^0: Indeed, all four of these organizations/projects have let us down like this; Intel with Spectre/Meltdown, Microsoft with the 28 day 32-bit Windows reboot bug, and OpenSSH just announced regreSSHion.
^1: My organization was hit by the Falcon Sensor outage. Our app tier layers running on Linux and developer machines running on macOS were unaffected, but our DBMS is still a legacy MS SQL box, so the outage hammered our stack pretty badly. We’ve fortunately been well funded to remove our dependency on MS SQL (and Windows in general), but that’s a multi-year effort that won’t pay off for some time yet.
^2: My Windows hate is well documented elsewhere.