Webinar: Getting Started with Change Tracking in SQL Server

Change TrackingStart your summer off right by brushing up on a highly effective change detection technique! We will be hosting a webinar, Getting Started with Change Tracking in SQL Server, on Friday, June 8th at 11:00am CDT.

In this webinar, I’ll walk you through the essentials of change tracking in SQL Server: what it is, why it’s important, and how it fits into your data movement strategy. I’ll walk through demos to give you realistic examples of how to use change tracking.

Registration is free and is open now. I hope to see you there!

Data Lineage Tracking in ETL Processes

Data Lineage Tracking in ETL ProcessesAnyone who has worked in data integration for a significant length of time has almost certainly been confronted with the following question:

“Where did this data come from?”

Confronting this question brings to the front burner the need for establishing data lineage tracking in ETL processes.

Data Lineage Tracking in ETL Processes

Extract, transform, and load (ETL) operations frequently involve the movement and consolidation of data from multiple sources through numerous transformation steps before being routed to its final resting place. As the ETL touchpoints increase in number and complexity, so does the difficulty in tracking the data back to its origins. The concept of data lineage in ETL is intended to make this process easier. Data lineage tracking involves building ETL elements in such a way that each row of data in the destination tables is unambiguously traceable back to it source, including any transformation process through which it passed during its travels.

Building ETL processes that include data lineage tracking takes extra work and careful planning. Sadly, the majority of ETL processes I’ve found in the wild do not have provisions for capturing data lineage. Because data lineage tracking does not add to the core functions of extraction, transformation, and loading, this design is often skipped during architecture or build. Much like the related ETL best practices of logging and auditing, data lineage is a typically unseen yet still valuable component of a well-designed ETL architecture.

Why Use Data Lineage Tracking?

There are several benefits that are realized with proper ETL data lineage tracking:

  • Trustworthiness. When the origin of each row of data and the path it took to arrive is systematically tracked, users and administrators of the data will have more reason to trust the data.
  • Easier troubleshooting. When the ETL data path is self-describing, it makes testing and troubleshooting far easier.
  • Expose leaky processes. When each row of data is trackable from source to destination, it helps to reveal any holes in the process where data might be lost.
  • Visibility. I can’t count the number of times I have been asked by a client to help them find and document business rules in their ETL processes. A side benefit of data lineage tracking is that the places where the business rules hide become more evident.

Not all load mechanisms require data lineage tracking. Very simple processes, those that load volatile tables, and some single-use, “throwaway” code does not have the same need for tracking data lineage. However, these are the exception rather than the rule.

Conclusion

Data lineage tracking in ETL processes is a best practice for most loads. Although it takes time to properly design and implement this pattern, the value gained is almost always worth the effort.

Keeping Data Quality Simple

Keeping data quality simpleA universal truth in business is that bad data costs money, and can even be dangerous. Data quality processes are often an afterthought rather than a central component of the architecture, due in part to the fear over the complexity of checking and cleansing data quality. In many cases, that fear is warranted; the longer data quality is delayed (in time as well as in data lineage), the more time and money it costs to make it right.

Some data quality needs require a formal project and specialized tools to address. Other issues, can be mitigated or avoided entirely with small changes to the data model or load processes. We like to look for these “low-hanging fruit” opportunities when designing a data or ETL infrastructure, because some of these simple modifications can save dozens of hours (or more) and significantly improve data quality very quickly.

Keeping Data Quality Simple

Among the passive changes that can help prevent or fix data quality issues:

Use of proper data types. Many of the data quality issues we fix are related to numerical data stored as text, and then some non-numerical data ends up inadvertently loaded. Even harder to detect is when numerical data of an incorrect precision is stored. The same goes for date values, date + time values, geolocation data, among others. Storing data in the type that most closely represents its real use will avoid a lot of downstream problems that are often hard to diagnose and expensive to fix.

Non-nullable fields. Every major RDBMS platform supports non-nullable fields, which require a value in said field before an insert or update operation will complete. If a particular field must have a value before that record can be considered valid, marking the column as non-nullable can avoid data consistency issues where that value is missing.

Foreign key validation. The use of foreign keys for data validation is a best practice in most any relational database architecture, and that is doubly true when improving data quality is a main objective. Using foreign keys to limit values to only those entries explicitly allowed in the table referenced by the foreign key prevents stray values from sneaking into a constrained field.

Check constraints. Preventing the insertion of data with values outside a define range can be accomplished through check constraints found in every major database platform. These, like foreign keys, limit the values that can be entered for a column but the check constraint does not use a separate lookup table. Also, you have flexibility to set a range of allowable entries rather than a discrete list of values. An example of this would be using a check constraint to enforce that all dates of birth are on or after a certain date.

ETL cleanup. Most ETL tools have built-in functionality allowing for lightweight data cleansing. Assuming the data in question is being processed through a structured ETL tool, adding in logic to correct minor data quality issues is relatively easy to do with low risk. Emphasize a light touch here – you aren’t going to want to do address standardization or name deduplication without some formal codified process.

No Substitute for Formal Data Quality Processes

Even when taking these precautions to prevent or correct some issues, you’ll still run into cases where a more rigid and comprehensive data quality initiative will be needed. The above suggestions will not eliminate the need for proper data quality tooling, but can help reduce the pain from or delay the need for in-depth data quality remediation.

Conclusion

Data quality requires a multifaceted strategy. Taking care of some of the simple problems with easy-to-use tools already at your fingertips can have a significant and immediate impact on the quality of your data.

Introducing the Pinch Hit Service

I am happy to announce the launch of a new service designed to help with very short term consultation needs. Although most consulting engagements are weeks or months in duration, we’ve discovered that some client needs are simple and do not require a traditional consulting approach. In response to this need, Tyleris has created the Pinch Hit service as a simple, no-commitment, 2-hour remote consultation.

The Pinch Hit was created to assist clients who are handling their own data warehousing, ETL, and reporting infrastructure. They may be looking for a second set of eyes to look at a problem, assistance with troubleshooting a specific problem, or a focused training session. Much like the use of a pinch hitter in baseball, Tyleris brings a specialized skillset to help deal with a clutch situation.

Not every business or technical need is suitable for this service, but in cases where the problem domain is narrow, the Pinch Hit can deliver outstanding value in a short time. If you find yourself in need of a Pinch Hit engagement from Tyleris, just let us know how we can help.

Request a Pinch Hit