Alerts
Debugging poorly performing SQL queries is not most folks' idea of a good time, but I usually volunteer for that kind of stuff. It may be because I grew up when computers really needed to be coaxed into performing.
One such instance involved a data pipeline pattern built around queries that scanned for the maximum date in a large table. It was clearly doing a full table scan on every run and needed to be fixed. A coworker of mine from the DevOps team came up with what I’d consider the absolute worst solution to this problem:
“How about I write a cron job that looks for long-running queries once every hour and sends you an email alert?”
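For concreteness, the check itself might have looked something like this (a hypothetical sketch against Postgres’s pg_stat_activity view; the hourly cron schedule, the five-minute threshold, and the email step are all my assumptions):

```sql
-- Hypothetical check the cron job might run once an hour:
-- list queries that have been running for more than five minutes.
SELECT pid,
       now() - query_start AS runtime,
       query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '5 minutes'
ORDER BY runtime DESC;
```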
Let’s pick apart exactly why this is so awful:
- It doesn’t solve the problem.
- The alert could fire up to an hour after a poorly performing query started, so it’s unlikely I’d be notified in time to do anything about it.
- There’s no prescribed solution for what to do when I’m alerted.
- It adds a cron job that someone now has to manage.
In the end I just changed the filter criteria and the problem went away.
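To give a flavor of the kind of change, here’s a hypothetical version of the pattern — the table and column names are made up, but it’s the classic case where wrapping an indexed column in a function forces a full scan:

```sql
-- Hypothetical before: wrapping the indexed column in a function
-- means the planner can't use the index, so every run scans the table.
SELECT max(loaded_at)
FROM events
WHERE date(loaded_at) >= date '2024-01-01';

-- Hypothetical after: compare against the raw column so a b-tree
-- index on loaded_at can satisfy both the filter and the max().
SELECT max(loaded_at)
FROM events
WHERE loaded_at >= timestamp '2024-01-01';
```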
But I did put into place a few guidelines for managing these types of problems:
- The problem is understood and can be solved –> Just go solve it. Put in a regression test afterwards.
- The condition is something worth knowing about but not fixing –> You want observability. Create a dashboard of some sort that gives you all the information you need to gauge the health of the system.
- The problem occurs sporadically or under unknown conditions and needs to be fixed –> Send an alert to someone who is investigating. Use this only when you intend to fix the thing causing these errors.
Alerts are a bit like unit tests: if one fires, it needs to be addressed. There are few things worse than a system where only some alerts matter. If you’re sending alerts they must be actionable, and there needs to be a requirement that they are acted on. Otherwise you’re just doing observability by alerting.
2024-03-12 16:30