Data Curator

The Value of the Data Curator

Generally speaking, the state of data curation is a mess. If you want to make a real and positive change with your employer or client, volunteer to be the curator of a critical area of data.

Data Curation, In a Nutshell

Although much has been written about proper care and feeding of critical data, very few organizations allocate resources specifically for data curation. The process of cleansing, pruning, and standardizing master data is often an afterthought, if it is included in the project planning at all. Without some measure of data curation, the cornerstones of data-driven organizations begin to deteriorate. Although the absence of a master data curation strategy is rarely fatal, bad data undoubtedly costs time and money, and leads to distrust of the data.

On the other hand, properly caring for critical data improves both quality of and trust in said data. When commonly-used reference data is consistent, atomic, and predictable, data consumers can spend more time focusing on their core functions rather than trying to reconcile questionable or inconsistent information.

Those of us working with data can and should be data curators. No, we probably won’t have that title on our business cards, but the fact is that each of us has domain-specific knowledge of at least one area that we could use to improve the quality and completeness of the data in that domain. For example, when I worked in healthcare, I took on the task of helping to normalize the charge master (the list of items and procedures that would be billed to patients). Even though I had never worked as a caregiver or hospital billing professional, I had learned enough in my four years of managing healthcare data to contribute to the curation of this critical list. From a technical perspective, the work was small, but the positive impact on the organization was significant.

Data curation doesn’t have to be a disruptive process. By definition, the word curation implies slow, methodical work. In fact, much of what you’ll do as a data curator is rooted in process and education rather than technology. A thorough data curation process will often reveal some technical or workflow changes that need to be made, which can then be evaluated on the business value of the change versus the cost of implementing it. Even if no major technical changes are made, the deficiencies and risks will have been identified, which in itself is valuable.

Be prepared for the fact that being a data curator is usually a thankless job. Sure, there will be short-term acknowledgements when positive changes are made (“Hey, those duplicates have been resolved – thanks!!”), but these disappear over time. Well-curated data is a bit like a kitchen faucet: it just works properly every time. The bottom line is to build a data curation process for business value, not professional acknowledgement.

We Are All Data Curators

Even though the official title of data curator is very rare, each of us working in this field can contribute to the data curation process. Though it’s often a thankless task, its value to data quality and business processes in general is significant.

Data Bias

Each of us as data professionals will bring some bias into our work. Even the most objective among us has a tendency to lean toward the answers we assume to be true, whether we are troubleshooting a technical problem or creating a new user-facing data solution.

Is data bias bad? The answer is more complicated than a simple yes or no.

Solution bias

Take one example of data bias: trying to find a solution to a problem. This might be troubleshooting a data anomaly or, on a larger scale, choosing a software product or service to address a business pain point. If a person has solved a similar type of problem in the past, they’ll likely assume that the solution is something they’ve used before. Using anecdotal data from their own experience, they’ll start with the likeliest cause that they know about. If that doesn’t work, they’ll move on to other possible root causes they are aware of; if those are exhausted, they’ll then move into research mode for possible causes they haven’t yet experienced.

In most cases, experienced professionals will benefit from this data bias. For example, if a DBA learns of a full SQL Server transaction log, one of the first things they’ll check is whether the log has been backed up recently. This type of solution bias is often a time-saver.
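The troubleshooting sequence described above – likeliest known cause first, then the other known causes, then research mode – can be sketched in code. This is a minimal Python illustration under assumed names (the `diagnose` function, the cause list, and the `check`/`research` callbacks are all hypothetical), not a real diagnostic tool:

```python
def diagnose(symptom, known_causes, check, research):
    """Walk the solution-bias troubleshooting order: test known causes
    in order of assumed likelihood, falling back to research mode only
    when every cause we've seen before has been ruled out."""
    for cause in known_causes:          # likeliest known cause first
        if check(symptom, cause):
            return cause
    return research(symptom)            # causes we haven't experienced yet

# Hypothetical example: a DBA triaging a full transaction log.
known = [
    "log not backed up recently",       # the first thing a DBA checks
    "long-running open transaction",
    "replication not caught up",
]
actual = "long-running open transaction"

found = diagnose(
    "transaction log full",
    known,
    check=lambda symptom, cause: cause == actual,
    research=lambda symptom: "unknown cause; begin research",
)
print(found)  # prints: long-running open transaction
```

When the bias is well calibrated, the answer surfaces on the first or second check; the cost shows up only when the real cause isn’t on the list at all.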

However, the most obvious solution isn’t always the correct one, so we have to guard against getting cut by Occam’s Razor. Going too far down the assumption path can lead to wasted time or other bad outcomes.

Interpretation bias

Interpretation bias is when individuals can look at the same data and draw very different conclusions. While everyone has interpretation bias – it’s impossible to thumb through Facebook or Twitter without seeing examples of it – the issue is particularly problematic when it occurs among data professionals. Because we are very often the gatekeepers of the data, it is essential to recognize any biases we have and prevent them from seeping into the queries, reports, and systems we create to deliver information to data consumers.

In most cases, this interpretation bias is accidental and not malicious. I can remember a few projects in which I began my data analysis with the assumption that some fact was true, and would work backwards from there to find the data to support this conclusion. If you start with a conclusion in mind, you can always find a creative way to interrogate the data to support that conclusion.

Avoiding data bias

It’s almost impossible to rid oneself of data bias entirely, but there are some ways to mitigate it:

  • Recognize and acknowledge your potential data biases.
  • Start with a question, not an answer.
  • Use your experience to guide you, but don’t be blinded by it.
  • Ask for a second opinion from others who may not share your biases.

Data bias can taint your perspective, but it doesn’t have to take away from your effectiveness as a data professional. Recognizing and guarding against it will ensure that bias doesn’t leak into your work product.


Upcoming Webinars for July-Aug 2020

We’ve got a couple of exciting Wednesday webinars coming up in the next few weeks. Join me as we walk through the basics of the SSIS Catalog, and compare SQL Server Integration Services to Azure Data Factory.

Introduction to the SSIS Catalog

The development of ETL processes requires careful attention to things such as event logging, code versioning, and external configurations. For organizations using SQL Server Integration Services, the SSIS catalog is the ideal way to store, execute, and log SSIS packages. Using the SSIS catalog to manage your ETL infrastructure eliminates much of the manual work required for package development and administration.

In this webinar, we’ll cover:
• What is the SSIS catalog?
• Creating the catalog in an existing SQL Server instance
• Deploying packages to the catalog
• Executing packages from the SSIS catalog
• Using the built-in SSIS catalog reports

This webinar is ideally suited for data professionals who are familiar with the basics of SSIS but are new to (or not yet using) the SSIS catalog.

You can register here for the Introduction to the SSIS Catalog webinar.

Head to Head: SSIS Versus Azure Data Factory

Anyone working in the SQL Server ecosystem has likely heard of both SQL Server Integration Services (SSIS) and Azure Data Factory (ADF). Both of these are used to move and transform data, and are both very capable and mature tools. But how do the two compare to each other? How does one determine which is better for a given task?

In this session, we will do a head-to-head comparison of SSIS and ADF. We’ll compare the features of each of these tools, calling out the functionality common to both as well as the behaviors exclusive to one or the other. We will discuss the benefits and the challenges of working with SSIS and ADF, exploring topics including: performance, ease of administration, monitoring and troubleshooting, and cost. Finally, we will wrap up with demos showing examples of using SSIS and ADF for similar tasks to see how they stack up.

You can register for the Head to Head: SSIS Versus Azure Data Factory webinar here.

We need this built NOW!

One of the biggest worldwide stories of the last few weeks has been the spread of the deadly coronavirus, which has infected tens of thousands and killed hundreds. In the wake of the panic around this illness, governments around the world have resorted to drastic measures to restrict its spread. In the Chinese city of Wuhan, the epicenter of this outbreak, officials have scrambled to quarantine and treat those exposed to the coronavirus. The most significant accomplishment of this effort is that they were able to build an entire hospital in the span of just ten days.

Think about that for a second: a brand new, multi-story, 1,000-bed hospital was designed and built in just a week and a half. It took longer than that for the body shop to repair my wife’s car after a minor fender bender. That the officials in Wuhan were able to construct such a facility in so little time, without much time to plan ahead, speaks to both the creativity and resourcefulness of the architects, engineers, and laborers involved in this project.

Ordinarily, building such a facility would require years: clearing the land, laying the foundation, building out the structure, installing utilities, and finishing out the interior each require many months of planning and labor, and this work largely happens consecutively, not concurrently.

So this raises the question: If we can build such a facility in days rather than years, why don’t we always do it that way?

The answer, of course, is that a hospital designed to be built in 10 days is constructed with speed as the only consideration. Treating as many patients as possible, as quickly as possible, is the only goal. As a result, other attributes – quality, durability, maintainability, and comfort – are all ignored to satisfy the only metric that really matters in such a project: time. The interior photos show a facility that looks more like the inside of a U-Haul truck than a hospital. Outside, the exposed ductwork and skeleton-like walls reveal a structure that is unlikely to withstand the rigors of use.

As a data guy, I see this same pattern when building data architectures. Everyone involved in a data project wants to have a perfectly working data pipeline, with validated metrics and user-tested access tools, delivered at or under budget, and ready for use tomorrow. The challenge comes when deadlines (whether legitimate or invented on the fly) become the only priority, and architects and developers are asked to sacrifice all else to meet a target date. Sure, you can add a lot of hands to the project, as they did by engaging 7,000 people to build the Wuhan hospital. Throwing more people at the problem might get you a solution more quickly, but the same shortcuts sacrificing quality, durability, and maintainability will still need to be made.

When setting schedules with my clients, I sometimes have to work through this same thought exercise. Yes, we could technically build a data warehouse in a week, but it’s going to be lacking in what one would normally expect of such a structure: many important features would be left out, it’ll likely be difficult to maintain, and there would be no room for customization of any type. And, like the temporary Wuhan hospital, it would likely be gone or abandoned in 18 months.

Building something with speed as the only metric is occasionally necessary, but only under the most extreme of circumstances. Creating a data architecture that delivers accuracy, performance, functionality, and durability requires time – time to design, time to develop, time to test, and time to make revisions. Don’t sacrifice quality for the sake of speed.

Building Processes That Fail

“I build processes that never fail.”

As I was chatting with a peer who was pitching me on the robustness of the systems they developed, I was struck by the boldness of those words I had just heard. As we chatted about data in general and data pipelines in particular, this person claimed that they prided themselves on building processes that simply did not fail, for any reason. “Tell me more…“, said the curious technologist in me, as I wondered whether there was some elusive design magic I had been missing out on all these years.

As the conversation continued, I quickly surmised that this bold claim was a recipe for disaster: one part wishful thinking, one part foolish overconfidence, with a side of short-sightedness. I’ve been a data professional for seventeen-odd years now, and every data process I have ever seen has one thing in common: they have all failed at some point. Just like every application, every batch file, every operating system that has ever been written.

Any time I build a new data architecture, or modify an existing one, one of my principal goals is to create as robust an architecture as possible: minimize downtime, prevent errors, avoid logical flaws in the processing of data. But my experience has taught me that one should never expect that any such process will never fail. There are simply too many things that can go wrong, many of which are out of the control of the person or team building the process: internet connections go down, data types change unexpectedly, service account passwords expire, software updates break previously-working functionality. It’s going to happen at some point.

Failing gracefully

Rather than predicting a failure-proof outcome, architects and developers can build a far more resilient system by first asking, “What are the possible ways in which this could fail?” and then building contingencies to minimize the impact of a failure. With data architectures, this means anticipating delays or failure in the underlying hardware and software, coding for changes to the data structures, and identifying potential points of user error. Some such failures can be corrected as part of the data process; in other cases, there should be a soft landing to limit the damage.
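As a rough illustration of that contingency-first mindset, here is a minimal Python sketch (the `run_with_retries` helper, the flaky step, and the retry parameters are all hypothetical) of a process built to retry transient failures and land softly when retries are exhausted, rather than crash:

```python
import time

def run_with_retries(step, max_attempts=3, base_delay=0.01):
    """Run a pipeline step, retrying transient failures with backoff.

    Returns (success, result). When retries are exhausted it returns a
    soft failure instead of raising, so the caller can limit the damage.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return True, step()
        except (ConnectionError, TimeoutError) as exc:
            # Transient failure: note it, wait, and try again.
            print(f"Attempt {attempt} failed: {exc}")
            if attempt < max_attempts:
                time.sleep(base_delay * 2 ** (attempt - 1))
    return False, None  # soft landing: report failure, don't crash

# Hypothetical step that fails twice before succeeding.
calls = {"n": 0}
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("source unavailable")
    return [1, 2, 3]

ok, rows = run_with_retries(flaky_fetch)
print(ok, rows)  # prints: True [1, 2, 3]
```

The point isn’t the retry loop itself; it’s that the failure mode was decided on up front, so when the source is unreachable the process degrades predictably instead of surprising everyone downstream.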

Data processes, and applications in general, should be built to fail. More specifically, they should be built to be as resilient as possible, but with enough smarts to address the inevitable failure or anomaly.

[This post first appeared in the Data Geek Newsletter.]

Welcome, Joshua Ferguson!

Here we grow! Thanks to the numerous clients we have partnered with in the past year, Tyleris Data Solutions is expanding to add another skilled data architect to our team.

We are proud to welcome Joshua Ferguson as the newest member of the Tyleris team. Joshua is a highly skilled technologist and a pragmatic problem solver, with a keen ability to bridge the gap between business needs and technical specifications. He studied informatics as an undergraduate, and later earned his Master’s Degree in Computer Science from Arizona State University.

Joshua has worked in various industries throughout his technical career, most recently having worked as a business intelligence architect at a healthcare company. He currently resides in Japan where his wife teaches English to second-language learners.

Joshua has already gotten plugged in to some exciting work with Tyleris clients, and you will likely see more from him both in our professional engagements as well as through our blog and on social media. We are delighted to have him on board!

Join us in Consultant Corner at SQL Saturday Dallas

Do you have questions about business intelligence, analytics, Power BI, or data architecture? If so, we would love to chat with you at the SQL Saturday Dallas event this spring.

On June 1st of this year, we will be hosting a Consultant Corner at SQL Saturday Dallas. Consultant Corner is a casual space where you can have one-on-one conversations with data experts. If you have specific “how do I …?” questions, or if you are just looking for general advice about the business intelligence and analytics landscape, we would love to chat.

We are co-hosting this event with our friends over at 28twelve Consulting. Like us, they are focused on building outstanding solutions in the Microsoft stack, and are great at helping folks navigate the multifaceted world of business intelligence.

Registration for SQL Saturday Dallas is free, with an optional on-site lunch for $12. We will be set up in the Consultant Corner in the vendor area all day. We look forward to seeing you there!

What To Look For When Hiring A Data Professional

Finding just the right data professional to hire is one of the most challenging tasks an organization can undertake. While hiring a team member for any role requires a great deal of work and care, the role of the data professional is particularly challenging to fill. From day 1, the data professional will have access to and responsibility over the company’s most valuable asset. These roles usually require a mix of hard skills and soft skills, and often require engagement with people at every level, from peers to executive leadership.

Finding the right person

Here at Tyleris Data Solutions, we are getting ready to grow our team this year. In preparation, I have been thinking a lot about the attributes that we should look for in our new team member. While there will always be a longer and more specific list of needs for each role, these are the attributes I have identified that I look for in every data professional.

Integrity. This one is first on the list for a reason, and is the one attribute where compromise is not acceptable. Data professionals have vast access to an organization’s data, and if that information were to be lost or stolen, it could literally end the company. The thing about integrity is that it is almost impossible to fully assess in an interview. Learning about a person’s level of integrity takes time and effort, which is why hiring a data professional should be a slow process.

Intellectual curiosity. Among all of the technical professionals I’ve worked with, I’ve learned that those with a strong intellectual curiosity tend to be more effective. Team members with this attribute often go out of their way to learn about other areas of the business or technical architecture that aren’t necessarily required for the job, leading to a better big-picture view of how the organization uses data.

A positive and empathetic attitude. Increasingly, data professionals have highly visible roles, requiring them to engage with peers, superiors, customers, and clients. Their attitude is the backdrop for each of those interactions, so it is essential that the data professional come to the table in the right frame of mind. Having empathy for one’s constituents will improve the quality of the job one performs.

Technical aptitude. The data field is rapidly evolving, requiring data professionals to be willing and able to learn new things quickly. Hiring staff members with technical aptitude will help to build a team that is adaptable and can pick up new technologies quickly.

Initiative. There are folks who wait to be told exactly what to do, and others who go figure out what needs to be done and then do it. Not every team member has to have this go-getter attitude, but each team needs at least a few people with this characteristic.

Experience. I put this one at the bottom of the list for a reason. It’s not that experience isn’t important – it is! – but of all the items on this list, experience is the one thing that the organization can give to the team member after they are hired. A person with minimal experience but who possesses all of the other attributes on this list is going to be a very compelling candidate.

Hiring is hard. Hiring technical professionals is especially challenging, and is critical to get right. While technical skills are important, finding the person with integrity, attitude, and aptitude will help to build a solid team.

This post was originally published in my Data Geek Newsletter.

Our Relationship with Facebook

At Tyleris Data Solutions, we are data people, and by extension, our first and primary role is that of data stewards. With each and every one of our relationships, our overriding concern beyond all other tasks is the security and privacy of data. In the partnerships that we build with other companies, we look for a similar level of care and concern around protecting the data.

Since our inception, we have used Facebook, both as a social media platform and as an advertising outlet. During the past year, we have become aware of a number of serious security and privacy issues around Facebook’s protection and use of data. We strive to only do business with organizations whose partnerships reflect well on us, and vice versa. Based on the data breaches and the privacy decisions within Facebook, we feel that we can no longer engage with them in any capacity.

Starting today, we will no longer be updating or monitoring our Facebook page, nor will we be responding to any messages sent through Facebook Messenger on that page. In addition, we will discontinue indefinitely all advertising on Facebook.

We are still available for our clients and followers on our website, our newsletter, or by telephone at 214/509-6570. We are also on Twitter, and have recently established a presence on MeWe, a promising new social media platform that is very focused on data privacy.

As always, thanks for your business and for your attention. Feel free to contact us with any questions.

Webinar: Getting Started with Change Tracking in SQL Server

Start your summer off right by brushing up on a highly effective change detection technique! We will be hosting a webinar, Getting Started with Change Tracking in SQL Server, on Friday, June 8th at 11:00am CDT.

In this webinar, I’ll walk you through the essentials of change tracking in SQL Server: what it is, why it’s important, and how it fits into your data movement strategy. I’ll walk through demos to give you realistic examples of how to use change tracking.

Registration is free and is open now. I hope to see you there!