My first couple of experiences in analytics revolved around commercially available software. The consistency between installations was nice, but there would inevitably be some odd choices in how the data was organized. I would always opine, “Ugh, if we worked with the product team that wrote this we could get these problems fixed and we wouldn’t have to do all these silly workarounds.”

Then I started working in Silicon Valley for product companies. I was thrilled to be put in the big leagues, and figured we’d have a much easier time with the data modeling since we could work directly with the engineering teams.

The naivete of youth, eh?

It turns out the engineering teams have their own priorities. And when push comes to shove, getting features out the door that keep the company in business will always get priority over quality-of-life improvements for internal users. And rightly so. Executive teams are of sound enough mind to know that when the data team can make something work, even with duct tape, it’s probably a better use of company time than having the product engineering team refactor it.

Data teams typically have three things that cause them headaches:

  1. Upstream Data Issues. There’s no end to the strangeness that can come from upstream data: duplicate rows, primary key conflicts, unexpected data type changes, deleted columns. The best way to handle these issues is to run data quality tests on everything coming in. Prevention may be out of your hands, but alerting can go a long way toward mitigating the damage.
  2. Inconsistent Usage and Rules in Data Transformation Systems. This is where we have to look in the mirror. Data pipelines have a well-deserved reputation for being big balls of mud. Most of the time this is caused by a lack of internal standards for how systems will be used. A developer will write a query, see that it works, then move on. The next request is slightly different, so the developer adds another pipeline. Pretty soon you have hundreds of pipelines that all work differently and everyone is afraid to change them. Well thought-out systems put rules in place about how pieces of the system interact, so that we can safely make assumptions about which parts of the system are affected by a change. Having no rules means you can never be sure what will happen.
  3. Requirements from Downstream Users. Tooling for business users of data is still in an immature state. When SQL became a standard there was an expectation that business users would embrace it for their needs. It’s been over forty years and it’s time to give up on that idea. This lack of autonomy for business users is a major cause of headaches for data teams. Their needs should be analyzed just like those of product end users, and data teams need to take the lead in prioritizing them. The industry still revolves around “Have a question? Make the data team create a dashboard.”
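The first headache above is the most mechanical to address. A minimal sketch of what data quality tests on incoming data might look like, assuming rows arrive as a list of dicts; the function and column names here are illustrative, not any particular tool’s API:

```python
def check_batch(rows, primary_key, expected_columns):
    """Return a list of human-readable issues found in a batch of rows."""
    issues = []
    seen_keys = set()
    for i, row in enumerate(rows):
        # Deleted or renamed columns upstream show up as missing keys.
        missing = expected_columns.keys() - row.keys()
        if missing:
            issues.append(f"row {i}: missing columns {sorted(missing)}")
        # Duplicate rows / primary key conflicts.
        key = row.get(primary_key)
        if key in seen_keys:
            issues.append(f"row {i}: duplicate primary key {key!r}")
        seen_keys.add(key)
        # Unexpected data type changes.
        for col, expected_type in expected_columns.items():
            if col in row and not isinstance(row[col], expected_type):
                issues.append(
                    f"row {i}: {col} is {type(row[col]).__name__}, "
                    f"expected {expected_type.__name__}"
                )
    return issues


rows = [
    {"id": 1, "amount": 9.99},
    {"id": 1, "amount": "9.99"},  # duplicate key, type drift
    {"id": 2},                    # column went missing upstream
]
problems = check_batch(rows, "id", {"id": int, "amount": float})
```

The point isn’t the checks themselves; it’s that they run on *everything* coming in, so the alert fires before a stakeholder notices.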

Since we don’t have control over upstream data, and we can’t tell business users what to do, we have to drive as much benefit as possible from improving the rules in our transformation systems. No one else is going to do it for us.
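What might one such rule look like in practice? A minimal sketch, assuming nothing about any specific orchestration tool: every transformation must declare its inputs and output up front, so dependencies are explicit and the blast radius of a change is knowable rather than guessed at. All names below are illustrative.

```python
# Registry of transformations, keyed by the table each one produces.
REGISTRY = {}

def transform(output, inputs):
    """Decorator enforcing the rule: declare inputs and output to register."""
    def wrap(fn):
        REGISTRY[output] = {"fn": fn, "inputs": inputs}
        return fn
    return wrap

@transform(output="daily_revenue", inputs=["orders"])
def daily_revenue(orders):
    # Toy aggregation: sum order amounts per day.
    totals = {}
    for o in orders:
        totals[o["day"]] = totals.get(o["day"], 0) + o["amount"]
    return totals

def downstream_of(table):
    """Every registered transformation that reads from `table`."""
    return [name for name, t in REGISTRY.items() if table in t["inputs"]]
```

With hundreds of undeclared pipelines, “what breaks if `orders` changes?” is an archaeology project; with a rule like this, it’s a lookup.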