Famous Last Words: “That Should Never Happen”

When it comes to managing data, there is often a difference between what should happen and what can happen. That space between should and can is often challenging, forcing data professionals to balance risk with business value. A few common examples:

“There should not be any sales transactions without a valid customer, but because the OLTP system doesn’t use foreign keys, it could theoretically happen.”

“Location IDs should be less than 20 characters, but those aren’t curated so they could exceed 20 characters.”

“This list of product IDs from System A should match those in System B, but because they are entered manually it is possible to have a few typos.”

“This data warehouse should only contain data loaded as part of our verified ETL process, but since our entire data team has read-write permission on the data warehouse database, it’s possible that manual data imports or transformations can be done.”

“This field in the source system should always contain a date, but the data type is set to ASCII text so it might be possible to find other data in there.”

“The front-end application should validate user input, but since it’s a vendor application we don’t know for certain that it does.”

Managing data and the processes that move data from one system to another requires careful attention to the data safeguards and the things that can happen during input, storage, and movement. As a consultant, I spend a lot of time in design sessions with clients, discussing where data comes from, how (if at all) it is curated in those source systems, and what protections should be built into the process to ensure data integrity. In that role, I’ve had this conversation, almost verbatim, on dozens of occasions:

Me: “Is it possible that <data entity> might not actually be <some data attribute>?”

Client: “No, that should never happen.”

Me: “I understand that it shouldn’t. But could it?”

Client (after a long silence): “Well…. maybe.”

Building robust systems requires planning not just for what should happen, but for what could happen. Source systems may not include referential integrity to avoid situations that are impossible in business but technically possible inside the data store. Fields that appear to store one type of data might be structured as a more generic type, such as text. Data that should be in curated lists can sometimes contain unvalidated user input. None of these things should happen, but they do. When designing a data model or ETL process, be sure that you’re asking questions about what protections are in place to make sure that the things that shouldn’t happen, don’t.
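As a rough illustration of what such protections might look like, here is a minimal Python sketch that checks incoming rows against three of the "should never happen" rules from the examples above: a sale without a valid customer, a location ID longer than 20 characters, and a text field that is supposed to hold a date. The function name, the column names (customer_id, location_id, sale_date), and the date format are all hypothetical, chosen only for this example; a real process would more likely enforce these rules in the database or the ETL tool itself.

```python
from datetime import datetime


def find_violations(sales, customer_ids, max_location_len=20,
                    date_format="%Y-%m-%d"):
    """Return (row_index, rule, value) tuples for rows that break a rule.

    All names here are hypothetical. Each row is assumed to be a dict
    whose values are already present; sale_date is assumed to be text.
    """
    violations = []
    for i, row in enumerate(sales):
        # Rule 1: every sale should reference a known customer.
        if row["customer_id"] not in customer_ids:
            violations.append((i, "orphaned_customer", row["customer_id"]))

        # Rule 2: location IDs should not exceed the expected length.
        if len(row["location_id"]) > max_location_len:
            violations.append((i, "location_id_too_long", row["location_id"]))

        # Rule 3: a text field that should hold a date must parse as one.
        try:
            datetime.strptime(row["sale_date"], date_format)
        except ValueError:
            violations.append((i, "invalid_date", row["sale_date"]))
    return violations


# Example usage with made-up data: the second row breaks all three rules.
sample_sales = [
    {"customer_id": 1, "location_id": "DAL-01", "sale_date": "2020-01-15"},
    {"customer_id": 99, "location_id": "X" * 25, "sale_date": "N/A"},
]
sample_violations = find_violations(sample_sales, customer_ids={1, 2})
for violation in sample_violations:
    print(violation)
```

Reporting the violations rather than silently dropping the rows is a deliberate choice in this sketch: it keeps the decision about what to do with bad data (reject, quarantine, or load with a flag) in the hands of the business.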

Designing in Absolutes

There are absolutes that are true of data modeling and architecture, but these are fewer in number than most people think. The words “always” and “never” are used liberally in technical advice, and while that advice is usually well-meaning, it can lead to a design myopia that limits one’s ability to adapt to atypical application needs.

“You should always have a restorable backup for your production databases.” It would be hard to find anyone to argue a counterpoint to that statement. Similarly, a declaration that all source code should be stored in some form of source control is a generally accepted truism for any data project (or any other initiative based on code, for that matter). Most such absolutes are broad and generalized, and are applicable regardless of architecture, operating system, deployment platform (cloud or on-prem), or geographic location.

It’s much more rare to find absolutes that apply to specific design principles. However, that doesn’t keep some folks from incorrectly asserting absolutes. As I wrote in a post last year entitled Technical Dogma, we are creatures of habit and tend to favor tools or solutions we already know. This tendency coupled with repetition leads to a sort of muscle memory in which we become loyal – sometimes to a fault – to the methods we prefer to build things.

When we assume that a particular way of doing things is the only way to do it, we make assertions such as the following:

  • Every dimensional design should be built as a star schema. There are no valid reasons to build a snowflake schema.
  • You should never use the T-SQL MERGE statement to load data.
  • Anything with more than a terabyte of data belongs on premises, not in the cloud.
  • I’ll never use ETL again. Big data tools can do everything ETL can, and more.
  • Database triggers should never be used.

These aren’t anecdotal examples. I’ve heard every one of these recently. To be fair, those who declare such preferences to be truisms rarely do so with nefarious intent, but such statements can have negative consequences. Building a solution with the assumption that a particular design pattern must always be used is risky, as it can lead to an inflexible solution that does not account for nuances of the particular application.

When I write about best practices, I am very cautious about speaking in absolutes. Even in my ETL Best Practices series, which represents my experience building hundreds of ETL processes over the past decade, I generously use the terms “usually”, “typically”, and “with few exceptions”. I do so not out of a fear of committing, but to be as accurate as possible. As with any other collection of best practices, there will be exceptions and edge cases which may seem to violate one of the principles of a typical design, but are entirely appropriate for some less-common design patterns. Providing the business with the data it needs, not adherence to a particular set of design patterns, is the ultimate measure of success for any data project.

There are some absolute always-or-never cases in solution design. However, these are few in number and typically vague. Try to focus less on what should always (or never) be true, and more on the needs and nuances of the project at hand.

~~

This post was originally published in my Data Geek Newsletter.