Data Bias

Each of us as data professionals will bring some bias into our work. Even the most objective among us has a tendency to lean toward the answers we assume to be true, whether we are troubleshooting a technical problem or creating a new user-facing data solution.

Is data bias bad? The answer is more complicated than a simple yes or no.

Solution bias

Take one example of data bias in which one is trying to find a solution to a problem. This might be to troubleshoot a data anomaly, or on a larger scale, to choose a software or service to address a business pain point. If they have solved a similar type of problem in the past, they’ll likely assume that the solution is something they’ve used before. Using anecdotal data from their own experience, they’ll start with the likeliest cause that they know about. If that doesn’t work, they’ll move onto other possible root causes they are aware of; if those are exhausted, they’ll then move into research mode for possible causes they haven’t yet experienced.

In most cases, experienced professionals will benefit from this data bias. For example, if a DBA learns of a full SQL Server transaction log, one of the first things they’ll check is whether the log has been backed up recently. This type of solution bias is often a time-saver.
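
To make that example concrete, here is a minimal sketch of how a DBA might check on that hunch. The database name (SalesDB) is a placeholder; the query simply reports the most recent log backup and the current log reuse wait reason:

    -- Hedged sketch: when was the last log backup, and what is holding the log?
    -- 'SalesDB' is a placeholder database name.
    SELECT d.name,
           d.log_reuse_wait_desc,                     -- why the log cannot be truncated
           MAX(b.backup_finish_date) AS last_log_backup
    FROM sys.databases AS d
    LEFT JOIN msdb.dbo.backupset AS b
           ON b.database_name = d.name
          AND b.type = 'L'                            -- 'L' = transaction log backup
    WHERE d.name = 'SalesDB'
    GROUP BY d.name, d.log_reuse_wait_desc;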

However, the most obvious solution isn’t always the correct one, so we have to guard against getting cut by Occam’s Razor. Going too far down the assumption path can lead to wasted time or other bad outcomes.

Interpretation bias

Interpretation bias occurs when individuals look at the same data and draw very different conclusions. While everyone has interpretation bias – you can’t thumb through Facebook or Twitter without seeing examples of it – the issue is particularly problematic when it occurs with data professionals. Because we are very often the gatekeepers of the data, it is essential to recognize any biases we have to prevent them from seeping into the queries, reports, and systems we create to deliver information to data consumers.

In most cases, this interpretation bias is accidental and not malicious. I can remember a few projects in which I began my data analysis with the assumption that some fact was true, and would work backwards from there to find the data to support this conclusion. If you start with a conclusion in mind, you can always find a creative way to interrogate the data to support that conclusion.

Avoiding data bias

It’s almost impossible to rid oneself of data bias entirely, but there are some ways to mitigate it:

  • Recognize and acknowledge your potential data biases.
  • Start with a question, not an answer.
  • Use your experience to guide you, but don’t be blinded by it.
  • Ask for a second opinion from others who may not share your biases.

Data bias can taint your perspective, but it doesn’t have to take away from your effectiveness as a data professional. Recognizing and guarding against it will ensure that bias doesn’t leak into your work product.

Webinars

Upcoming Webinars for July-Aug 2020

We’ve got a couple of exciting Wednesday webinars coming up in the next few weeks. Join me as we walk through the basics of the SSIS Catalog, and compare SQL Server Integration Services to Azure Data Factory.

Introduction to the SSIS Catalog

The development of ETL processes requires careful attention to things such as event logging, code versioning, and external configurations. For organizations using SQL Server Integration Services, the SSIS catalog is the ideal way to store, execute, and log SSIS packages. Using the SSIS catalog to manage your ETL infrastructure eliminates a lot of the manual work required for package development and administration.

In this webinar, we’ll cover:
• What is the SSIS catalog?
• Creating the catalog in an existing SQL Server instance
• Deploying packages to the catalog
• Executing packages from the SSIS catalog
• Using the built-in SSIS catalog reports

This webinar is ideally suited for data professionals who are familiar with the basics of SSIS but are new to (or not yet using) the SSIS catalog.
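
As a small preview of the package-execution topic, here is a hedged sketch of running a deployed package by calling the catalog’s stored procedures from T-SQL; the folder, project, and package names are hypothetical:

    -- Hypothetical folder/project/package names, for illustration only.
    DECLARE @execution_id BIGINT;

    EXEC SSISDB.catalog.create_execution
         @folder_name  = N'Finance',
         @project_name = N'NightlyLoad',
         @package_name = N'LoadSales.dtsx',
         @execution_id = @execution_id OUTPUT;

    -- Optional: run synchronously so the call waits for the package to finish.
    EXEC SSISDB.catalog.set_execution_parameter_value
         @execution_id   = @execution_id,
         @object_type     = 50,              -- 50 = system parameter
         @parameter_name  = N'SYNCHRONIZED',
         @parameter_value = 1;

    EXEC SSISDB.catalog.start_execution @execution_id = @execution_id;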

You can register here for the Introduction to the SSIS Catalog webinar.

Head to Head: SSIS Versus Azure Data Factory

Anyone working in the SQL Server ecosystem has likely heard of both SQL Server Integration Services (SSIS) and Azure Data Factory (ADF). Both of these are used to move and transform data, and are both very capable and mature tools. But how do the two compare to each other? How does one determine which is better for a given task?

In this session, we will do a head-to-head comparison of SSIS and ADF. We’ll compare the features of each of these tools, calling out the functionality common to both as well as the behaviors exclusive to one or the other. We will discuss the benefits and the challenges of working with SSIS and ADF, exploring topics including: performance, ease of administration, monitoring and troubleshooting, and cost. Finally, we will wrap up with demos showing examples of using SSIS and ADF for similar tasks to see how they stack up.

You can register for the Head to Head: SSIS Versus Azure Data Factory webinar here.

We need this built NOW!

One of the biggest worldwide stories of the last few weeks has been the spread of the deadly coronavirus, which has infected tens of thousands and killed hundreds. In the wake of the panic around this illness, governments around the world have taken drastic measures to restrict its spread. In the Chinese city of Wuhan, which is the epicenter of this outbreak, officials have quickly scrambled to quarantine and treat those exposed to the coronavirus. The most significant accomplishment of this effort is that they were able to build an entire hospital in the span of just ten days.

Think about that for a second: a brand new, multi-story, 1,000-bed hospital was designed and built in just a week and a half. It took longer than that for the body shop to repair my wife’s car after a minor fender bender. That the officials in Wuhan were able to construct such a facility in so little time, without much time to plan ahead, speaks to both the creativity and resourcefulness of the architects, engineers, and laborers involved in this project.

Ordinarily, building such a facility would require years: clearing the land, laying the foundation, building out the structure, installing utilities, and finishing out the interior each require many months of planning and labor, and this work largely happens consecutively, not concurrently.

So this raises the question: If we can build such a facility in days rather than years, why don’t we always do it that way?

The answer, of course, is that a hospital designed to be built in 10 days is constructed with speed as the only consideration. Treating as many patients as possible, as quickly as possible, is the only goal. As a result, other attributes – quality, durability, maintainability, and comfort – are all ignored to satisfy the only metric that really matters in such a project: time. The interior photos show a facility that looks more like the inside of a U-Haul truck than a hospital. Outside, the exposed ductwork and skeleton-like walls reveal a structure that is unlikely to withstand the rigors of use.

As a data guy, I see this same pattern when building data architectures. Everyone involved in a data project wants to have a perfectly working data pipeline, with validated metrics and user-tested access tools, delivered at or under budget, and ready for use tomorrow. The challenge comes in when deadlines (whether legitimate or invented on the fly) become the only priority, and architects and developers are asked to sacrifice all else to meet a target date. Sure, you can add a lot of hands to the project, like they did by engaging 7,000 people to build the Wuhan hospital. Throwing more people at the problem might get you a solution more quickly, but the same shortcuts that sacrifice quality, durability, and maintainability will still have to be made.

When setting schedules with my clients, I sometimes have to work through this same thought exercise. Yes, we could technically build a data warehouse in a week, but it’s going to be lacking in what one would normally expect of such a structure: many important features would be left out, it’ll likely be difficult to maintain, and there would be no room for customization of any type. And, like the temporary Wuhan hospital, it would likely be gone or abandoned in 18 months.

Building something with speed as the only metric is occasionally necessary, but only under the most extreme of circumstances. Creating a data architecture that delivers accuracy, performance, functionality, and durability requires time – time to design, time to develop, time to test, and time to make revisions. Don’t sacrifice quality for the sake of speed.

Building Processes That Fail

“I build processes that never fail.”

As I was chatting with a peer who was pitching me on the robustness of the systems they developed, I was struck by the boldness of those words I had just heard. As we chatted about data in general and data pipelines in particular, this person claimed that they prided themselves on building processes that simply did not fail, for any reason. “Tell me more…“, said the curious technologist in me, as I wondered whether there was some elusive design magic I had been missing out on all these years.

As the conversation continued, I quickly surmised that this bold prediction was a recipe for disaster: one part wishful thinking, one part foolish overconfidence, with a side of short-sightedness. I’ve been a data professional for 17-some-odd years now, and every data process I have ever seen has one thing in common: they have all failed at some point. Just like every application, every batch file, every operating system that has ever been written.

Any time I build a new data architecture, or modify an existing one, one of my principal goals is to create as robust an architecture as possible: minimize downtime, prevent errors, avoid logical flaws in the processing of data. But my experience has taught me that one should never expect that any such process will never fail. There are simply too many things that can go wrong, many of which are out of the control of the person or team building the process: internet connections go down, data types change unexpectedly, service account passwords expire, software updates break previously-working functionality. It’s going to happen at some point.

Failing gracefully

Rather than predicting a failure-proof outcome, architects and developers can build a far more resilient system by first asking, “What are the possible ways in which this could fail?” and then building contingencies to minimize the impact of a failure. With data architectures, this means anticipating delays or failure in the underlying hardware and software, coding for changes to the data structures, and identifying potential points of user error. Some such failures can be corrected as part of the data process; in other cases, there should be a soft landing to limit the damage.
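
As one illustration of a soft landing, here is a minimal T-SQL sketch that wraps a load step in TRY/CATCH so a failure is rolled back, logged, and re-raised rather than silently swallowed. The staging tables and the dbo.etl_error_log table are assumed names for illustration:

    BEGIN TRY
        BEGIN TRANSACTION;

        INSERT INTO dbo.SalesStage (OrderID, Amount)   -- assumed target table
        SELECT OrderID, Amount
        FROM   dbo.SalesRaw;                           -- assumed source table

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;

        -- Record what went wrong in an assumed logging table.
        INSERT INTO dbo.etl_error_log (error_time, error_number, error_message)
        VALUES (SYSDATETIME(), ERROR_NUMBER(), ERROR_MESSAGE());

        THROW;   -- re-raise so the orchestrator (job, SSIS, ADF) sees the failure
    END CATCH;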

Data processes, and applications in general, should be built to fail. More specifically, they should be built to be as resilient as possible, but with enough smarts to address the inevitable failure or anomaly.

[This post first appeared in the Data Geek Newsletter.]

Welcome, Joshua Ferguson!

Here we grow! Thanks to the numerous clients we have partnered with in the past year, Tyleris Data Solutions is expanding to add another skilled data architect to our team.

We are proud to welcome Joshua Ferguson as the newest member of the Tyleris team. Joshua is a highly skilled technologist and a pragmatic problem solver, with a keen ability to bridge the gap between business needs and technical specifications. He studied informatics as an undergraduate, and later earned his Master’s Degree in Computer Science from Arizona State University.

Joshua has worked in various industries throughout his technical career, most recently having worked as a business intelligence architect at a healthcare company. He currently resides in Japan where his wife teaches English to second-language learners.

Joshua has already gotten plugged in to some exciting work with Tyleris clients, and you will likely see more from him both in our professional engagements as well as through our blog and on social media. We are delighted to have him on board!

Join us in Consultant Corner at SQL Saturday Dallas

Do you have questions about business intelligence, analytics, Power BI, or data architecture? If so, we would love to chat with you at the SQL Saturday Dallas event this spring.

On June 1st of this year, we will be hosting a Consultant Corner at SQL Saturday Dallas. Consultant Corner is a casual space where you can have one-on-one conversations with data experts. If you have specific “how do I …?” questions, or if you are just looking for general advice about the business intelligence and analytics landscape, we would love to chat.

We are co-hosting this event with our friends over at 28twelve Consulting. Like us, they are focused on building outstanding solutions in the Microsoft stack, and are great at helping to navigate folks through the multifaceted world of business intelligence.

Registration for SQL Saturday Dallas is free, with an optional on-site lunch for $12. We will be set up in the Consultant Corner in the vendor area all day. We look forward to seeing you there!

What To Look For When Hiring A Data Professional

What To Look For When Hiring A Data Professional

Finding just the right data professional to hire is one of the most challenging tasks an organization can undertake. While hiring a team member for any role requires a great deal of work and care, the role of the data professional is particularly challenging to fill. From day 1, the data professional will have access to and responsibility for the company’s most valuable asset. These roles usually require a mix of hard skills and soft skills, and often require engagement with people at every level, from peers to executive leadership.

Finding the right person

Here at Tyleris Data Solutions, we are getting ready to grow our team this year. In preparation, I have been thinking a lot about the attributes that we should look for in our new team member. While there will always be a longer and more specific list of needs for each role, these are the attributes I have identified that I look for in every data professional.

Integrity. This one is first on the list for a reason, and is the one attribute where compromise is not acceptable. Data professionals have vast access to an organization’s data, and if that information were to be lost or stolen, it could literally end the company. The thing about integrity is that it is almost impossible to fully assess in an interview. Learning about a person’s level of integrity takes time and effort, which is why hiring a data professional should be a slow process.

Intellectual curiosity. Among all of the technical professionals I’ve worked with, I’ve learned that those with a strong intellectual curiosity tend to be more effective. Team members with this attribute often go out of their way to learn about other areas of the business or technical architecture that aren’t necessarily required for the job, leading to a better big-picture view of how the organization uses data.

A positive and empathetic attitude. Increasingly, data professionals have highly visible roles, requiring them to engage with peers, superiors, customers, and clients. Their attitude is the backdrop for each one of those interactions, so it is essential that the data professional come to the table in the right frame of mind. Having empathy for one’s constituents will improve the quality of the job one performs.

Technical aptitude. The data field is rapidly evolving and requires data professionals who are willing and able to learn new things quickly. Hiring staff members with technical aptitude will help to build an adaptable team that can assimilate new technologies quickly.

Initiative. There are folks who wait to be told exactly what to do, and others who go figure out what needs to be done and then do it. Not every team member has to have this go-getter attitude, but each team needs at least a few people with this characteristic.

Experience. I put this one at the bottom of the list for a reason. It’s not that experience isn’t important – it is! – but of all the items on this list, experience is the one thing that the organization can give to the team member after they are hired. A person with minimal experience but who possesses all of the other attributes on this list is going to be a very compelling candidate.

Hiring is hard. Hiring technical professionals is especially challenging, and is critical to get right. While technical skills are important, finding the person with integrity, attitude, and aptitude will help to build a solid team.

This post was originally published in my Data Geek Newsletter.

Our Relationship with Facebook

At Tyleris Data Solutions, we are data people, and by extension, our first and primary role is that of data stewards. With each and every one of our relationships, our overriding concern beyond all other tasks is the security and privacy of data. In the partnerships that we build with other companies, we look for a similar level of care and concern around protecting the data.

Since our inception, we have used Facebook, both as a social media platform and as an advertising outlet. During the past year, we have become aware of a number of serious security and privacy issues around Facebook’s protection and use of data. We strive to only do business with organizations whose partnerships reflect well on us, and vice versa. Based on the data breaches and the privacy decisions within Facebook, we feel that we can no longer engage with them in any capacity.

Starting today, we will no longer be updating or monitoring our Facebook page, nor will we be responding to any messages sent through Facebook Messenger on that page. In addition, we will discontinue indefinitely all advertising on Facebook.

We are still available for our clients and followers on our website, our newsletter, or by telephone at 214/509-6570. We are also on Twitter, and have recently established a presence on MeWe, a promising new social media platform that is very focused on data privacy.

As always, thanks for your business and for your attention. Feel free to contact us with any questions.

Webinar: Getting Started with Change Tracking in SQL Server

Start your summer off right by brushing up on a highly effective change detection technique! We will be hosting a webinar, Getting Started with Change Tracking in SQL Server, on Friday, June 8th at 11:00am CDT.

In this webinar, I’ll walk you through the essentials of change tracking in SQL Server: what it is, why it’s important, and how it fits into your data movement strategy. I’ll also include demos to give you realistic examples of how to use change tracking.
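
For a taste of what the demos will cover, here is a minimal sketch of enabling change tracking and reading changed rows; the SalesDB database and dbo.Customer table (with CustomerID as its primary key) are placeholder names:

    -- Enable change tracking at the database level, then for a table.
    ALTER DATABASE SalesDB
    SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);

    ALTER TABLE dbo.Customer
    ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = ON);

    -- On each incremental load, ask for rows changed since the last sync version.
    DECLARE @last_sync_version BIGINT = 0;   -- persisted by your ETL process

    SELECT ct.CustomerID, ct.SYS_CHANGE_OPERATION, c.*
    FROM   CHANGETABLE(CHANGES dbo.Customer, @last_sync_version) AS ct
    LEFT JOIN dbo.Customer AS c
           ON c.CustomerID = ct.CustomerID;

    -- Record the current version to use as the starting point next time.
    SELECT CHANGE_TRACKING_CURRENT_VERSION() AS current_version;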

Registration is free and is open now. I hope to see you there!

Famous Last Words: “That Should Never Happen”

When it comes to managing data, there is often a difference between what should happen and what can happen. That space between should and can is often challenging, forcing data professionals to balance risk with business value. A few common examples:

“There should not be any sales transactions without a valid customer, but because the OLTP system doesn’t use foreign keys, it could theoretically happen.”

“Location IDs should be less than 20 characters, but those aren’t curated so they could exceed 20 characters.”

“This list of product IDs from System A should match those in System B, but because they are entered manually it is possible to have a few typos.”

“This data warehouse should only contain data loaded as part of our verified ETL process, but since our entire data team has read-write permission on the data warehouse database, it’s possible that manual data imports or transformations can be done.”

“This field in the source system should always contain a date, but the data type is set to ASCII text so it might be possible to find other data in there.”

“The front-end application should validate user input, but since it’s a vendor application we don’t know for certain that it does.”

Managing data and the processes that move data from one system to another requires careful attention to the data safeguards and the things that can happen during input, storage, and movement. As a consultant, I spend a lot of time in design sessions with clients, discussing where data comes from, how (if at all) it is curated in those source systems, and what protections should be built into the process to ensure data integrity. In that role, I’ve had this conversation, almost verbatim, on dozens of occasions:

Me: “Is it possible that <data entity> might not actually be <some data attribute>?”

Client: “No, that should never happen.”

Me: “I understand that it shouldn’t. But could it?”

Client (after a long silence): “Well…. maybe.”

Building robust systems requires planning not just for what should happen, but for what could happen. Source systems may lack the referential integrity needed to prevent situations that are impossible in business terms but technically possible inside the data store. Fields that appear to store one type of data might be structured as a more generic type, such as text. Data that should come from curated lists can sometimes contain unvalidated user input. None of these things should happen, but they do. When designing a data model or ETL process, be sure you’re asking questions about what protections are in place to make sure that the things that shouldn’t happen, don’t.
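
As an illustration of that kind of question, here is a minimal sketch of two defensive checks based on the examples above; the table and column names are assumed for illustration:

    -- Flag sales transactions whose CustomerID has no match in the customer table.
    SELECT s.TransactionID, s.CustomerID
    FROM   staging.SalesTransaction AS s
    LEFT JOIN dbo.Customer AS c
           ON c.CustomerID = s.CustomerID
    WHERE  c.CustomerID IS NULL;

    -- Flag location IDs that violate the "should be under 20 characters" assumption.
    SELECT LocationID
    FROM   staging.Location
    WHERE  LEN(LocationID) > 20;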