I'm just so exhausted these days. We have formal SLAs, but it's not like they're ever followed. After all, Customer X needs to be notified within 5 minutes of any anomalous events in their cluster, and Customer Y is our biggest customer, so we give them the white-glove treatment.
Yadda yadda, blah blah. So on and so forth; almost every customer has some exception/difference in SLAs.
I was hired on to be an SRE, but I'm just a professional dashboard starer at this point. The number of times I've been alerted in the middle of the night because CPU was running high for 5 minutes is too damn high. All just so I can apologize to Mr. Customer that they maybe had a teensy slowdown during that time.
If I try to get us back to fundamentals and suggest we should only alert on impact, not short-lived anomalies, there is some surface-level agreement, but everyone seems to think "well, we might miss something, so we need to keep it".
It's like we're trying to prevent outages by monitoring for potential issues rather than actually making our system more robust and automatable.
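To be concrete about what "alert on impact" could look like, here is a rough sketch of paging on error-budget burn rate instead of on a CPU blip. Everything in it is an illustrative assumption: the 99.9% target, the names `Window`, `burn_rate`, and `should_page`, and the 14.4 threshold (a commonly cited starting point for a 1-hour window checked against a 5-minute window, not a rule).

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one lookback window (hypothetical shape)."""
    total_requests: int
    failed_requests: int

    def error_rate(self) -> float:
        if self.total_requests == 0:
            return 0.0
        return self.failed_requests / self.total_requests


def burn_rate(window: Window, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    error_budget = 1.0 - slo_target
    return window.error_rate() / error_budget


def should_page(long_window: Window, short_window: Window,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only when customers are being hurt fast, and still are right now.

    The long window shows the burn is significant; the short window shows
    it has not already recovered. CPU usage never enters the decision.
    """
    return (burn_rate(long_window, slo_target) >= threshold
            and burn_rate(short_window, slo_target) >= threshold)


# A CPU spike with essentially no failed requests: nobody gets woken up.
print(should_page(Window(100_000, 10), Window(8_000, 1)))      # False
# Sustained 2% or more of requests failing: that is worth a page.
print(should_page(Window(100_000, 2_000), Window(8_000, 200)))  # True
```

The point of the two windows is that a five-minute anomaly that has already cleared never pages anyone, which is exactly the category of alert that keeps waking me up.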
How do I convince these people that this isn't sustainable? That trying to "catch" incidents before they happen is a fool's errand? It's like that chart about the "war on drugs" that shows exponential expense growth as you try to prevent ALL drug usage (which is impossible). Yet this tech company seems to think we should be trying to prevent all outages with excessive monitoring.
And that doesn't even get into the bonkers agreements we make with customers, like committing to deep-dive research into why 2 different environments have response times that differ by 1ms.
Or the agreements that force us to complete customer-provided training, without anyone assessing how much training we've already committed to. It's entirely normal for us to do 3-4 sets of HIPAA / PCI / compliance trainings when everyone else in the org only has to do one set.
I'm at a point where I'm considering moving on. This job just isn't sustainable and there's no interest in the org to make it sustainable.
But perhaps one of y'all managed to fix something similar in your org with a few key conversations and some effort? What other things could I try as a sort of final "Hail Mary" before looking to greener pastures?
Your on-call experience is not the norm. That alone should cause you to seek another position. Experienced SREs are always in high demand. Find a place that isn’t abusing your off-hours.
In general, if my on-call engineers are paged outside of business hours, I do not expect them to come in on time the next day and we’re having a postmortem ASAP. If we can’t fix the page, we’re evaluating the page’s necessity. It’s either something we can fix, something we can’t fix (and so don’t care about, which means we’re going to kill the alarm that causes the page), or ephemeral enough that we don’t think it’s worth the time chasing down. My team’s off-hours are not to be abused by stakeholders not giving us the resources we need to resolve issues, and I will back that hard. In your case, you need more money and your company needs to either devote serious R&D resources to fix this shit, pass the support cost on to the customer at such a high level that it’s actually painful for them, making them get off the fucking pot, or both. For example, if a contract will affect my team’s off-hours and they’re making a bullshit alarm, they will pay us a huge amount of money for that support. Usually the contract gets signed because stakeholders are dumb, and then the first fucking time that fee hits, that stupid alarm gets redlined out because financial stakeholders are smarter.
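Sketched as code, that postmortem triage looks roughly like this. The names (`Outcome`, `review_page`) are made up for illustration, and checking "ephemeral" before "fixable" is just one reasonable reading of the order.

```python
from enum import Enum


class Outcome(Enum):
    FIX_IT = "fix the underlying problem"
    KILL_ALARM = "delete the alarm that caused the page"
    LET_IT_GO = "not worth the time chasing down"


def review_page(ephemeral: bool, fixable: bool) -> Outcome:
    """Decide what happens after an off-hours page (hypothetical helper)."""
    if ephemeral:
        # A one-off blip: note it in the postmortem and move on.
        return Outcome.LET_IT_GO
    if fixable:
        return Outcome.FIX_IT
    # We cannot fix it, so being woken up for it helps nobody.
    return Outcome.KILL_ALARM


print(review_page(ephemeral=False, fixable=False).value)
```

Note that none of the three outcomes is "keep getting paged for it and do nothing."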
It's not even a steaming pile of crap or anything. Since it's basically a managed distributed database solution, there are limits to what we can do while maintaining strong consistency. Things generally take a long time and are very sequentially dependent. So we have automation of course! Buuuut there's very little comfort or trust in what is now very well exercised automation, which is the number 1 barrier to removing many sources of toil. Too many human "check this thing visually before proceeding" steps blocking an otherwise well automated process.
We are so damn close, but some key stakeholders keep wanting just one more thing in our platform support (we need ARM support, we need customer-managed PKI support, etc.) and we just don't get the latitude we need to actually make things reliable. It's like we're Cloud Platform / DevOps / QA / and SRE rolled into one and they can't seem to make up their damn mind on which rubric to grade us on.
Hell, they keep asking us to cut back our testing environment costs but demand that new platform features be tested at scale. We could solve it with a set of automated and standardized QA environments, but it's almost impossible to get that type of work prioritized.
My direct manager is actually pretty great, but she found herself completely powerless after a recent reorg that changed the director she reports to. So all the organizational progress we made was completely reset and we're back to square one of having to explain what we want, except now we're having "Kubernetes!" shouted at us while we try to chart a path.
I'm already brushing up my resume, but I must say, the new Gen-AI dominated hiring landscape is weird and bad. Until then, I just have to do the best I can with this business politics hell.