this post was submitted on 21 Aug 2024
Technology
It's not really criticism, it's competitors claiming they will never fuck up.
Like, if you found a mouse in your hamburger at McDonald's, that's a massive fuckup. If Burger King then started saying "you'll never find anything gross in Burger King food!" that would be both crass opportunism and patently false.
It's reasonable to criticize CrowdStrike. They fucked up huge. The incident was a fuckup, and creating an environment where one incident could cause total widespread failure was a systemic fuckup. And it's not even their first fuckup, just the most impactful and public.
But also Microsoft fucked up. And the clients, those who put all of their trust into Microsoft and CrowdStrike without regard to testing, backups, or redundancy, they fucked up, too. Delta shut down, cancelling 4,600 flights. American Airlines cancelled 43 flights, 10 of which would have been cancelled even without the outage.
Like, imagine if some diners at McDonald's connected their mouths to a chute that delivers pre-chewed food sight-unseen into their gullets, and then got mad when they fell ill from eating a mouse. Don't do that, not at any restaurant.
All that said, if you fuck up, you don't get to complain about your competitors being crass opportunists.
Even if that's the case, how is it Crowdstrike's place to call these other companies out for claiming something similar will never happen to them? Thus far, it had only ever happened to CS.
No, we had Sentinelone take down our company a few months ago. Granted, not a global outage, but it's something similar. I'm sure that if you went back in news archives, you'd find articles about major Sentinelone outages. I think Crowdstrike is just the biggest one in recent history. It's certainly not unprecedented.
It feels like a pattern though. I’ve not seen too much from them but they seem to be saying factually correct stuff. But neither worded correctly nor at the right time.
I agree completely, which is why I added that last sentence in an edit. This is a bad look for CrowdStrike, even if I agree with the sentiment.
Everybody fucks up now and then. That's my point. It's why you shouldn't trust one company to automatically push security updates to critical production servers without either a testing environment or disaster recovery procedures in place.
I doubt you'll find any software company, or any company in any industry, that has not fucked up something really important. That's the nature of commerce. It's why many security protocols exist in the first place. If everyone could be trusted to do their jobs right 100% of the time, you would only need to worry about malicious attacks which make up only a small fraction of security incidents.
The difference here is that CrowdStrike sold a bunch of clients on the idea that they could be trusted to push security updates to production servers without testing environments. I doubt they told Delta that they didn't need DRP or any redundancy, but either way, the failure was amplified by a collective technical debt that corporations have been building into their budget sheets to pad their stock prices.
By all means, switch from CrowdStrike to a competitor. Or sue them for the loss of value resulting from their fuckup. Sort that out in the contracts and courts, because that's not my area. But we should all recognize that the lesson learned is not to switch to another threat prevention software company that won't fuck up. Such a company does not exist.
If you stub your toe, you don't start walking on your hands. You move the damn coffee table out of the pathway and watch where you're walking. The lesson is to invest in your infrastructure, build in redundancy, and protect your critical systems from shit like this.
Resiliency and security have a lot of layers. The CrowdStrike bungle was very bad, but more than anything it shone a bright spotlight on the fact that certain organizations' IT departments are just a house of cards waiting to get blown away.
I'm looking at Delta in particular. Airlines are a critical transportation service and to have issues with one software vendor bring your entire company screeching to a halt is nothing short of embarrassing.
If I were on the board, my first question would be, "where's our DRP and why was this situation not accounted for?"
House of cards is exactly right. At every IT job I've worked, the bosses want to check the DRP box as long as it costs as close to zero dollars as possible, and a day or two of 1-2 people writing it up. I do my best to cover my own ass, and regularly do actual restores, limit potential blast radii, and so on. But at a high level, bosses don't give AF about defense, they are always on offense (i.e. make more money faster).
This is the first time I've heard someone call it a house of cards and I think that fits it perfectly!
Number fifteen...
That’s the first thing I heard in my head lmao
In what way did Microsoft fuck up? They don't control Crowdstrike updates. Short of the OS files being immutable it seems unlikely they can stop things like this.
Microsoft gave CrowdStrike unfettered access to push an update that can BSOD every Windows machine without a bypass or failsafe in place. That turned out to be a bad idea.
CrowdStrike pushed an errant update. Microsoft allowed a single errant update to cause an unrecoverable boot loop. CrowdStrike is the market leader in their sector and brings in hundreds of millions of dollars every year, but Microsoft is older than the internet and creates hundreds of billions of dollars. CrowdStrike was the primary cause, but Microsoft enabled the meltdown.
Microsoft did not "give Crowdstrike access to push updates". The IT departments of the companies did.
The security features that CrowdStrike offers force them to run in kernel-space, which means that they will have code running that can crash the OS. They crashed Debian in an almost identical way (forced boot loop) about a month before they did the same to Windows.
Yes, there are ways that Microsoft could rewrite the Windows kernel architecture to make it resistant to this type of failure. But I don't think there are very many other commercial OS's that could stop this from happening.
You're absolutely right, here is an in-depth explanation from Dave Plummer, the guy who wrote the task manager: https://youtu.be/ZHrayP-Y71Q
They have to give that access by EU ruling:
Well there's a provocative anecdote if I've ever seen one. Well done.
Not in all cases [podcast warning]. Sometimes it's just them pointing out they're doing silly things like how they test every update and don't let it out the door with <98% positive returns, or having actual deployment rings instead of yeeting an update to millions of systems in less than an hour.
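For anyone who hasn't run into deployment rings before, here is a rough sketch of the idea in Python. It is not any vendor's actual pipeline: `staged_rollout`, `apply_update`, and the host names are made up for illustration, and the 0.98 threshold is just the "98% positive returns" figure from the comment above. The update goes to a small ring first, and the rollout halts as soon as a ring's pass rate falls under the threshold.

```python
from typing import Callable, List, Sequence

# Hypothetical gate, borrowed from the "<98% positive returns" remark above.
PASS_THRESHOLD = 0.98


def staged_rollout(
    rings: Sequence[Sequence[str]],
    apply_update: Callable[[str], bool],
    threshold: float = PASS_THRESHOLD,
) -> List[str]:
    """Push an update ring by ring and stop at the first unhealthy ring.

    `apply_update` is a stand-in for "install the update and report whether
    the host came back healthy". Returns every host that received the update
    before the rollout finished or was halted.
    """
    updated: List[str] = []
    for ring in rings:
        if not ring:
            continue  # nothing to update or measure in an empty ring
        results = [apply_update(host) for host in ring]
        updated.extend(ring)
        pass_rate = sum(results) / len(ring)
        if pass_rate < threshold:
            break  # halt here: later, larger rings never get the bad update
    return updated


# Example: an update that breaks every host only ever reaches the canary ring.
example_rings = [
    ["canary-1", "canary-2"],
    ["office-1", "office-2", "office-3"],
    ["prod-1", "prod-2"],
]
print(staged_rollout(example_rings, lambda host: False))
# prints ['canary-1', 'canary-2']
```

The point of the design is the blast radius: a bad update still breaks something, but only the first ring, not millions of machines inside an hour.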
Clownstrike deserves every bit of shit they're getting, and it amazes me that people are buying the bullshit they're selling. They had no real testing or quality control in place, because if that update had touched test windows boxes it would have tipped them over and they'd have actually known about it ahead of time. Fucking up is fine, we all do it. But when your core practices are that slap dash, bitching about criticism just brings more attention to how badly your processes are designed.
How did Microsoft fuck up? Giving a security vendor kernel access? Like they're obligated to from previous lawsuits?
Customers can't test clownstrike updates ahead of time or in a nonprod environment, because clownstrike knows best lol.
Redundancy is not relevant here because what company is going to use different IDR products for primary and secondary tech stacks?
Backups are also not relevant (mostly) because it's quicker to remediate the problem than restore from backup (unless you had super regular DR snaps and enough resolution to roll back to before the problem).
IMO, clownstrike is the issue, and customers have only the slightest blame for using clownstrike and for not spending extra money on a second IDR on redundant stacks.