Delivering Trustworthy Data and Analytics

Organizations of every size and in every vertical have a vested interest in unlocking the power of data. Among other benefits, reliable and timely data delivery will reveal which business processes are (or aren’t) working well, help you understand customers’ motivations and behavior, and make internal workflows faster and more efficient.

Harnessing and arranging the data is only half the battle, however. Once it’s in your hands, you also need to transform it into a digestible report with actionable insights that audiences of every technical background can parse. Just as importantly, this system of reporting and analytics must be trusted to be reliable and accurate.


Unfortunately, problems with data reporting plague even the most well-prepared organization. In a recent survey, 73 percent of respondents indicated that they had to wait days or weeks for the insights they required, while almost 70 percent frequently encountered questions that their current analytics setup could not answer. Even worse, another survey revealed that only one third of executives trust the data and analytics generated from their business operations.

Building trust in the data is critical for the success of any information delivery project. If the data is inaccurate, difficult to understand, or improperly shaped for the task at hand, its consumers will lose trust in it. When that trust is lost, users will find other ways to get the data they need (often by building their own offline data store from other sources).

Here’s a look at five key actions you need to take to start creating reports that you (and your data consumers) can believe in.

Establish the Source of Truth

A symptom of an immature information delivery strategy is that there is no single and clear source of truth. This fact cannot be overstated: Unless there is a single trusted source of truth for every key business metric, your information delivery architecture cannot be trusted to deliver accurate, validated data. Defining the gold standard for each measure, KPI, and reference list is a prerequisite for any data delivery initiative. Without this, you’re just spinning your wheels rather than making real progress.

Once the source and/or formula for each key piece of information has been defined, communicating it to the wider audience of data consumers is essential. Just as it is important to define each term, you must also spell out in that messaging which sources are to be used to validate any derivative reports, dashboards, or manual queries. Lacking that, your reports will tell conflicting stories.

Get User Input

You can’t deliver high-quality reports or dashboards without understanding what it is that your users need from the data. Getting user input is critical to the process, but all too often this necessary part of the project is shortened (or skipped entirely) to save time. It’s not enough to try to infer requirements from existing reports or dashboards; spend time with those who will be consuming the data to fully understand their needs.

When gathering user input for data delivery requirements, you must understand the following:

  • How do you use this data? Gathering input for information delivery requirements is a business exercise, not a technical one. When you begin gathering input from data consumers, seek first to understand how data consumption helps them do their jobs. With a clearer understanding of their business processes, you’ll be better equipped to deliver a solution that solves their business problem.
  • In the current system (reports, dashboards, etc.) for data delivery, what is working well? What is lacking? It’s important to learn from successes and failures of the past. Invest the time needed to understand what is currently being used to satisfy data delivery needs, and how well those components are working.
  • What are the business rules? Business rules describe relationships, data quality processes, exception and anomaly handling, and other structures and triggers that help turn raw data into actionable information. It is important to understand what business rules exist (or should exist) on the data, and at what point in the data lineage they should reside. Be aware, though, that non-technical users may be turned off by the term “business rules”. Instead, ask about data exceptions, manual clean-up tasks, and other open-ended topics that can reveal what business rules should be included in report delivery.

Make Things Clear

Context is key when it comes to data analytics. You wouldn’t give your users the number “8” without somehow contextualizing that figure. Does it refer to a number of hours, a percentage of utilization, or one of a million other possibilities? Is that number good or bad? From where does that number originate?

Any data delivery solution must be built with clarity and transparency in mind. For each data source, make clear its intended meaning, its origin and lineage, and any processing or transformation logic that has been applied. By providing this contextual information, you help users understand the reliability of any given datum and find the source of errors or outliers.
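
One lightweight way to capture this context is a simple metadata table maintained alongside the data itself. The sketch below is a hypothetical T-SQL example (the table and column names are illustrative assumptions, not a prescribed standard):

    -- Hypothetical data dictionary table documenting each source feeding a
    -- report or dashboard: what it means, where it came from, how it was shaped.
    CREATE TABLE dbo.DataSourceDictionary (
        SourceName          varchar(100)  NOT NULL PRIMARY KEY,
        BusinessDefinition  varchar(1000) NOT NULL,  -- intended meaning, in business terms
        OriginSystem        varchar(100)  NOT NULL,  -- where the data originates
        TransformationNotes varchar(1000) NULL,      -- processing or reshaping applied
        LastValidatedDate   date          NULL       -- when the definition was last reviewed
    );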

Define Everything

Just as you provide precise and accurate data to your users, the language that you use to describe that information needs to be equally clear. Before you begin analytics and reporting, agree upon the terms that you use to measure outcomes, including metrics and key performance indicators (KPIs).

Even the most fundamental metrics should be defined. Something as basic as “a day” might be assumed to be a 24-hour period, but this definition can vary. For any essential metric, KPI, or descriptive term, be clear about what each represents.
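
To make the ambiguity concrete, here is a small hypothetical T-SQL illustration (the dbo.Orders table and its OrderDate column are assumptions made for the example): two perfectly reasonable definitions of “orders per day” that can return different numbers from the same data.

    -- Definition 1: a "day" is a calendar day (midnight to midnight).
    SELECT CAST(OrderDate AS date) AS OrderDay,
           COUNT(*)                AS OrderCount
    FROM dbo.Orders
    GROUP BY CAST(OrderDate AS date);

    -- Definition 2: a "day" is the trailing 24-hour window ending right now.
    SELECT COUNT(*) AS OrdersLast24Hours
    FROM dbo.Orders
    WHERE OrderDate >= DATEADD(HOUR, -24, SYSDATETIME());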

Document Changes

Over time, data delivery solutions will evolve. Business needs will change, processes will mature, and occasionally the shape or grain of the underlying data will be modified. To maintain trust in your reporting or dashboard solution, be clear and proactive about any changes.

One of the worst outcomes for such a solution is for the data consumers to abandon it because they do not get consistent answers. Let there be no surprises for your reporting audience when critical changes need to be made. Communicating these changes and their impact requires extra time and effort, but that investment will help protect the trust in the data.

Cloud Adoption Concerns (And How To Overcome Them)

Cloud technology has become pervasive: 89 percent of organizations globally identify it as part of their IT strategies, and 75 percent say they have already implemented at least one cloud application. Key benefits for organizations that adopt cloud technology include lower costs, faster time-to-market, improved reliability, and better business intelligence insights.

In spite of this rise in usage, any potential new cloud-based solution still meets a significant amount of concern and push-back: security worries, integration challenges, industry compliance issues, information governance questions, and difficulty in measuring the real economic value of cloud solutions. In many cases, however, organizations think these challenges are much bigger than they really are.


Here are five of the most common cloud adoption challenges and how to overcome them:

1. Security Concerns

Problem: Although security is becoming less of a concern for cloud adoption, there are still those who believe that data stored in the cloud is inherently less secure than on-prem data storage.

Solution: Your approach to security for your cloud workloads should not be that different from how you approach security for your on-premises workloads. Develop overarching security policies based on the particular needs of your organization, then use these as yardsticks to evaluate different cloud solution providers. Major cloud providers such as Amazon and Microsoft can typically deliver security capabilities far beyond what a small or mid-sized organization can develop internally.

2. Integration with Existing Systems

Problem: It’s rare to migrate all of an organization’s infrastructure to the cloud at once, so there is naturally some concern over how migrated cloud applications will integrate with existing on-prem systems.

Solution: While cloud vendors were initially known for pushing an “all-cloud” approach, the industry has since shifted to supporting a hybrid on-prem/cloud model as the default architecture. As part of that shift, the major cloud vendors have made integrating from on-prem to cloud (and cloud to cloud) much easier through APIs, specialized integration frameworks, and hybrid authentication architectures.

3. Cost Concerns

Problem: Committing to ongoing services (rather than one-time hardware purchases) locks the organization into a recurring subscription fee, and it becomes harder to predict spending.

Solution: Organizations need to understand that transition and implementation costs are not unique to the cloud. You need to treat your cloud implementation just like you would any other IT project, regardless of whether it’s SaaS, IaaS, or PaaS.

Remember that running your own data center isn’t a one-time capital investment. Servers and supporting equipment have ongoing costs as well, including electricity, cooling, physical security, and the labor to install, maintain, and upgrade the hardware. Cloud models typically involve smaller one-time capital expenditures than on-premises models, and they also allow you to scale up your architecture (and costs) as your business grows. The cloud pricing model means you’re paying for what you need today, without overbuying to anticipate what you might need tomorrow.

4. Concerns About Loss of Control

Problem: Organizations with on-prem systems often feel they are giving up too much control to cloud providers, surrendering the ability to define granular settings for system resources, networking, and security.

Solution: The cloud isn’t just one thing. One of the best analogies of the different cloud offerings is the pizza-as-a-service model. You can make your own pizza at home, buy a take-and-bake pizza, have a pie delivered, or go out to eat at a pizza place. These different services, much like cloud architectures, allow you to pick the level of service that meets your needs. For workloads that require the organization to have more granular control over how the application runs, the IaaS model makes it possible to migrate to a virtual machine in the cloud without giving up the ability to configure the machine environment, storage, etc. For other workloads requiring less hands-on configuration, using a managed PaaS model makes more sense.

5. Legal and Compliance Concerns

Problem: A common challenge to cloud adoption is that it may not meet compliance or regulatory demands.

Solution: The major cloud vendors hold certifications covering most regulated domains of data. Further, most vendors allow you to specify the geographic region(s) in which your application and data will be stored, addressing the concern of violating geographic boundaries of compliance or regulation.

Conclusion

With cloud architecture becoming more and more common, most of the classic objections to migrating to the cloud no longer hold water. Security, integration, and compliance used to be blockers to cloud adoption, but the platforms have matured to the point that these concerns are rarely barriers anymore.

[POSTPONED] Event: Is Your Data-Driven Organization Ready for Azure?


UPDATE 8/29/2017:

Due to the ongoing weather situation in Houston and its likely long-term impact, we have decided to postpone this event. Our thoughts go out to our friends in Houston and the surrounding areas who have been impacted by Hurricane Harvey.


Is your organization exploring a move to the cloud? If so, you’ll be interested in this half-day event we are cohosting in Houston in September.

Along with our friends at Straight Path Solutions and Opifex Solutions, we’ll walk you through the benefits of using Azure for data storage and computing. In this seminar, we’ll help you to answer the following questions:

  • Are we ready to move to the cloud?
  • Which of our workloads are suitable for the cloud?
  • Why choose Azure over the other cloud providers?
  • How can I integrate on-premises data with cloud data and applications?

This seminar is ideally suited for data architects, managers, and CxOs who are cloud-curious and are interested in learning how Azure can help save time and money.

Admittance to this event is free, but does require advance registration. To reserve your seat, visit https://azurehouston2017.eventbrite.com. We look forward to seeing you there!

Which Is More Secure: On-Prem or Cloud Data?

The short answer is this: Data, regardless of storage location or medium, is as secure as you make it.

The use of cloud-based data processing and storage has grown significantly in the past decade, a trend that is expected to continue. A recent IDG Enterprise survey reported that 70% of organizations surveyed had moved at least part of their data or infrastructure to the cloud, with another 16% planning to do so within the next year. Cloud data storage is no longer something that “other companies” do. The value in both cost savings and time-to-market means that most organizations can realize a benefit from moving at least some data and applications to the cloud.

When comparing cloud to on-prem, each has a number of benefits and shortcomings. On-prem data storage is inherently more flexible, because you can configure your environment exactly how you want it. Cloud data storage reduces up-front costs by eliminating some hardware purchases and data center build-out costs, and can often be configured in hours as opposed to weeks or months for on-premises storage. There is no one-size-fits-all solution; some apps will be a better fit for the cloud, while others work better with on-prem storage and processing.


Of the reasons given to resist cloud data storage, by far the one I’ve heard the most is concern over security. The traditional approach for applications and data storage has been to keep those things within the walls of the organization, particularly when the data is especially sensitive (such as HIPAA-regulated health information or other PII). I can’t count the number of times I’ve heard someone say that they’d never migrate a particular workload to the cloud because it isn’t secure enough.

I really do sympathize with the emotion behind such a statement. Moving sensitive data from a location over which you have physical control (a data center) to one that you don’t (cloud storage) feels like surrendering control of that data. For those unaccustomed to trusting an outside vendor to host their data, this can be an uncomfortable feeling.

However, to look at this in context, think about all of the ways your data is currently being used even when stored on-prem. The odds are high that one or more of these statements is true:

  • Some internal “super users” have carte blanche access to data
  • Some systems allow for generic password-based logins (such as using SQL logins rather than Windows authentication), permitting largely unaudited data access (a quick way to inventory these logins is sketched after this list)
  • Access to the database itself is controlled and audited, but the backups do not have the same level of security
  • Those responsible for monitoring your firewalls and other barriers also wear a lot of other hats, and/or do not provide 24x7x365 coverage for security monitoring
  • Backups of data are physically transported offsite
  • Service accounts that run core database services (such as SQL Server Agent) have broad permissions for data access for convenience
  • Some data extracts are shared with outside vendors, who may in turn be permitted to share them with their own vendors as needed
  • Some vendors even have direct access to your systems
  • Data from third parties is imported into your systems on a scheduled basis using automated tools
  • Nontechnical vendor staff (such as janitorial or building maintenance personnel) have physical access to locations where data is stored
  • Password length and/or complexity is not systemically enforced
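
As one example of how easy these gaps are to check, the following query lists a SQL Server instance’s enabled SQL-authenticated logins and shows whether password policy and expiration checks are enforced for each (a sketch only; adjust for your own environment and platform):

    -- Enabled SQL-authenticated logins and their password policy settings.
    SELECT name,
           is_policy_checked,       -- 0 = complexity/length policy not enforced
           is_expiration_checked    -- 0 = password expiration not enforced
    FROM sys.sql_logins
    WHERE is_disabled = 0
    ORDER BY name;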

This is a small subset of the innumerable activities that require a good deal of trust in people, both internal and external. Your data is only as secure as the weakest link in that virtual chain, and a loss or breach is far more likely to result from human error or flawed system design than from the simple fact that you chose a cloud platform over storing data in-house. A bad guy will have a much easier time crafting a social engineering or dictionary attack than hacking the platform itself, whether on-prem or cloud.

Security Depends on You

This doesn’t mean you shouldn’t ask tough questions regarding security when selecting a cloud vendor. When evaluating those services, ask about topics like colocation, multitenancy, penetration testing and monitoring, and physical security. Ask how and when their vendors and partners would have access to your data. Apply at least as much scrutiny in this decision as you apply to other vendors who will have access to your data. But don’t assume that the data is not secure just because you don’t have physical control of the data center or media.

When choosing the cloud for application or data storage, you’re simply trusting a different team to handle what your in-house staff would manage for on-prem storage. In many cases, the cloud vendor will have deeper expertise, better proactive monitoring, and round-the-clock coverage that an in-house team may not be able to provide.

As a business owner, I trust my application and data storage needs to cloud solutions. Tyleris Data Solutions does not own any physical servers, instead relying on trustworthy cloud vendors to store our mission-critical data. There may be some workloads I would want to handle in-house or in a local colocation facility, but with respect to security, my vendors are far better equipped to provide round-the-clock protection than my team is.

The fact that you can walk into your data center and physically touch the server hosting your on-prem data does not make that data more secure. Proximity does not imply security. Data and applications are only as secure as you make them, whether hosted in-house or in the cloud.

Designing in Absolutes


There are absolutes that are true of data modeling and architecture, but these are fewer in number than most people think. The words “always” and “never” are handed out liberally as technical advice, and while such advice is usually well-meaning, it can lead to a design myopia that limits one’s ability to adapt to atypical application needs.

“You should always have a restorable backup for your production databases.” It would be hard to find anyone to argue a counterpoint to that statement. Similarly, a declaration that all source code should be stored in some form of source control is a generally accepted truism for any data project (or any other initiative based on code, for that matter). Most such absolutes are broad and generalized, and are applicable regardless of architecture, operating system, deployment platform (cloud or on-prem), or geographic location.

It’s much rarer to find absolutes that apply to specific design principles. However, that doesn’t keep some folks from incorrectly asserting them. As I wrote in a post last year entitled Technical Dogma, we are creatures of habit and tend to favor tools or solutions we already know. This tendency, coupled with repetition, leads to a sort of muscle memory in which we become loyal – sometimes to a fault – to the methods we prefer for building things.


When we assume that a particular way of doing things is the only way to do it, we make assertions such as the following:

  • Every dimensional design should be built as a star schema. There are no valid reasons to build a snowflake schema.
  • You should never use the T-SQL MERGE statement to load data.
  • Anything with more than a terabyte of data belongs on premises, not in the cloud.
  • I’ll never use ETL again. Big data tools can do everything ETL can, and more.
  • Database triggers should never be used.

These aren’t hypothetical examples; I’ve heard every one of them recently. To be fair, those who declare such preferences to be truisms rarely do so with nefarious intent, but such statements can have negative consequences. Building a solution on the assumption that a particular design pattern must always be used is risky, as it can lead to an inflexible solution that does not account for the nuances of the particular application.

When I write about best practices, I am very cautious about speaking in absolutes. Even in my ETL Best Practices series, which represents my experience of building hundreds of ETL processes over the past decade, I generously use the terms “usually”, “typically”, and “with few exceptions”. I do so not out of a fear of commitment, but to be as accurate as possible. As with any other collection of best practices, there will be exceptions and edge cases that may seem to violate one of the principles of a typical design but are entirely appropriate for some less-common design patterns. Providing the business with the data it needs, not adherence to a particular set of design patterns, is the ultimate measure of success for any data project.

There are some absolute always-or-never cases in solution design. However, these are few in number and typically broad. Try to focus less on what should always (or never) be true, and more on the needs and nuances of the project at hand.

~~

This post was originally published in my Data Geek Newsletter.

Data Lineage Tracking in ETL Processes

Anyone who has worked in data integration for a significant length of time has almost certainly been confronted with the following question:

“Where did this data come from?”

Confronting this question puts the need for data lineage tracking in ETL processes front and center.


Extract, transform, and load (ETL) operations frequently involve the movement and consolidation of data from multiple sources through numerous transformation steps before it is routed to its final resting place. As the ETL touchpoints increase in number and complexity, so does the difficulty of tracing the data back to its origins. The concept of data lineage in ETL is intended to make this easier. Data lineage tracking involves building ETL elements in such a way that each row of data in the destination tables is unambiguously traceable back to its source, including any transformation processes it passed through along the way.

Building ETL processes that include data lineage tracking takes extra work and careful planning. Sadly, the majority of ETL processes I’ve found in the wild do not have provisions for capturing data lineage. Because data lineage tracking does not add to the core functions of extraction, transformation, and loading, this design is often skipped during architecture or build. Much like the related ETL best practices of logging and auditing, data lineage is a typically unseen yet still valuable component of a well-designed ETL architecture.
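
What this looks like in practice varies by tool, but a minimal relational sketch of the pattern is shown below. All object and column names are illustrative assumptions, not a prescribed standard: a load audit table records each ETL execution, and every destination row carries a pointer back to the load (and, where possible, the source row) that produced it.

    -- One row per ETL execution: where the data came from and when it was loaded.
    CREATE TABLE dbo.LoadAudit (
        LoadID        int IDENTITY(1,1) PRIMARY KEY,
        SourceSystem  varchar(50)   NOT NULL,  -- e.g., CRM, ERP, flat file drop
        SourceObject  varchar(200)  NOT NULL,  -- file name, table, or endpoint
        LoadStartTime datetime2(0)  NOT NULL DEFAULT SYSDATETIME(),
        LoadEndTime   datetime2(0)  NULL,
        RowsLoaded    int           NULL
    );

    -- Destination table with lineage columns stored alongside the business data.
    CREATE TABLE dbo.FactSales (
        SalesKey    bigint IDENTITY(1,1) PRIMARY KEY,
        SalesAmount decimal(18,2) NOT NULL,
        -- Lineage: which load produced this row, and what it was in the source.
        LoadID      int NOT NULL REFERENCES dbo.LoadAudit (LoadID),
        SourceRowID varchar(100) NULL
    );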

Why Use Data Lineage Tracking?

There are several benefits that are realized with proper ETL data lineage tracking:

  • Trustworthiness. When the origin of each row of data and the path it took to arrive are systematically tracked, users and administrators of the data will have more reason to trust it.
  • Easier troubleshooting. When the ETL data path is self-describing, it makes testing and troubleshooting far easier.
  • Expose leaky processes. When each row of data is trackable from source to destination, it helps to reveal any holes in the process where data might be lost.
  • Visibility. I can’t count the number of times I have been asked by a client to help them find and document business rules in their ETL processes. A side benefit of data lineage tracking is that the places where the business rules hide become more evident.

Not all load mechanisms require data lineage tracking. Very simple processes, those that load volatile tables, and some single-use, “throwaway” code do not have the same need for tracking data lineage. However, these are the exception rather than the rule.

Conclusion

Data lineage tracking in ETL processes is a best practice for most loads. Although it takes time to properly design and implement this pattern, the value gained is almost always worth the effort.

Benefits of Cloud-Based Database Applications

I still remember the week I realized that the cloud would most likely live up to its hype. It was a couple of years ago at the PASS Summit, and Microsoft had shifted their marketing message from being all-in on the cloud to a more pragmatic hybrid approach. In many of my offline interactions that week, I heard story after story of successes in moving some workloads to the cloud, resulting in significant cost savings and greater flexibility. That was the week that turned this skeptic into a believer that the cloud is the real deal, and it’s here to stay.

Although not every application is a perfect fit for cloud deployment, the fact is that cloud-based services are rapidly becoming the standard for new development. Platform as a Service (PaaS) vendors are also making it very easy to move existing on-premises applications to the cloud through myriad tools for data and metadata migration.


As the PaaS offerings from vendors such as Amazon and Microsoft have matured over the last few years, the benefits of moving workloads to the cloud have become clearer.

Decreased cost. Let’s say you decide to run your own standalone Microsoft SQL Server machine. Assuming you’re buying modest hardware, you’re going to be out $5,000 for the box, not to mention the peripherals (network equipment, UPS units, etc.) required to set up a server and keep it connected and running. Then there’s SQL Server licensing, which for Standard Edition alone was around $4,000 per core as of this writing. Assuming your modest server is a quad-core machine, you’re at $16,000 just for SQL Server licensing. If you plan to have redundant hardware for high availability or disaster recovery, expect to double (or more) these fixed costs. Also, keep in mind that servers don’t live forever; plan to repeat this purchase every 3-5 years. In contrast, the recurring cost model of PaaS applications replaces that steep capital expenditure with a more finely tuned pay-as-you-go model, often resulting in lower total cost.

Time to market. Think about the things that will slow down your development or deployment timeline. It takes time – sometimes weeks or months – just to get approval for a capital expenditure for the hardware and software. After that, the new machine has to be built by the manufacturer, delivered, and set up in your data center. Install the operating system, configure it on your network, and then you can finally install SQL Server. In the absolute best case, this is a weeks-long process; in reality, going from zero to on-premises SQL Server will take a month, perhaps much longer. Building a PaaS database can be completed in minutes rather than weeks or months; in fact, earlier today I built a SQL Database in Azure in about the same length of time it took me to type out this paragraph. If time is a factor (and really, when is it not?), the speed at which a new cloud database can be created is a very compelling argument for using a PaaS architecture.
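
For example, assuming an Azure SQL logical server already exists, a new database can be created with a single statement run against the master database (the database name and service tier below are placeholders, not recommendations):

    -- Creates a Standard-tier (S0) database on an existing Azure SQL logical server.
    CREATE DATABASE SalesReporting
    (
        EDITION = 'Standard',
        SERVICE_OBJECTIVE = 'S0'
    );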

Managed services. If you’re hosting your own database server for a 24×7 application, your organization is on the hook for handling things like operating system hiccups, hardware failures, network snafus, and power outages. Keeping resources and staff to handle these issues takes time, training, and most importantly, money. Add to all of that the ongoing environmental costs of running a data center: cooling, power, backup power apparatus, and physical security, among others. By using a cloud platform to develop and run your database applications, you’ll shift the responsibility of data center management over to your cloud vendor, saving a great deal of effort and money. You’ll also rest a bit easier, since software updates and service SLAs will also be handled by the PaaS host.

Easy scalability. With your on-premises database server, what happens in six months when you discover that the hardware can no longer keep up with demand? You could upgrade the server hardware (assuming it’s got headroom for that) or just buy a new larger server. Similarly, I’ve seen a few cases of buyer’s remorse where the needs were overstated and the server hardware was much more robust (and therefore, more expensive) than it needed to be. Overbuying or underbuying hardware is always costly. Conversely, when building a database application in the cloud, it is relatively easy to set – and later change, if needed – the resources allocated to it. As the processing and storage needs change, the resources can be scaled up or down in seconds with just a few mouse clicks.
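
On Azure SQL Database, for instance, that resizing can be done with a single statement run against the master database (the database name and target tier here are placeholders), and the change can be reversed just as easily when demand drops:

    -- Scale the hypothetical SalesReporting database to a larger service objective.
    ALTER DATABASE SalesReporting
    MODIFY (EDITION = 'Standard', SERVICE_OBJECTIVE = 'S3');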

Conclusion

Platform as a Service offerings have evolved and improved significantly in the past couple of years. These cloud services make a strong argument for cost savings, speed, and flexibility for both new development and migrated applications. As the volume of customers migrating to the cloud continues to increase, prices will continue to fall, furthering the cost advantage of cloud-based applications over their on-premises counterparts.

When Data Warehouse Projects Fail


Getting a data warehouse over the finish line is hard. Data warehouses are complex organisms, requiring intense collaboration and technical expertise across every level of the organization. When it all comes together, it’s a beautiful thing. However, many data warehouse initiatives never make it to user acceptance.

On my technical blog, I have cataloged some of the reasons I’ve found that data warehouses fail. Avoiding these pitfalls can reduce the possibility of a data warehouse project going off the tracks. But things can still go wrong even with the best planning and adherence to best practices.


The odds are good that something will go wrong during every data warehouse implementation: the due date for a deliverable gets pushed out, a dimension table has to be refactored, the granularity of a fact table changes. If things slide to the point that the project is no longer moving forward, it is critical to respond properly and, if possible, get the project moving in a positive direction again.

Triage

First and foremost, focus on determining status and next steps. Is the project truly ended, or has it just stalled? That distinction will drive most of the remaining decisions. The road forward after a design or development impasse looks very different from that of a project shelved due to budget cutbacks or other factors. Assess whether there is room to salvage the project; if so, use that time to isolate and minimize the speed bumps that slowed the project down the first time.

When triaging, don’t be afraid to issue an “all-stop” directive while you reassess next steps. However, don’t let the project founder in that state for long. Figure out what went wrong, fix it, and move forward.

Take inventory

Regardless of whether the project is salvageable, take stock of the individual deliverables and the status of each. If the project has simply stalled but the plug has not been pulled, a clearly identified status for the technical and nontechnical assets will make the process of restarting the project far easier. If the project is not salvageable, there is almost certainly some business value in the work that has already been completed. Properly classifying and archiving those assets can provide a jump start for related initiatives in the future.

Communicate, communicate, communicate

Whether it’s a stalled project or one that has been stopped entirely, timely communication is essential to managing expectations. Clearly communicate the status of the project, what to expect next, and any timelines. Make sure that everyone involved – business analysts, executives, technical staff, and other stakeholders – is clear on the status and timeline. Don’t cast blame here; keep the updates fact-based and simple.

Renew the focus as a business project

For stalled data warehouse projects, it is important to refresh the focus of the project. Data warehouses should always be driven by business teams, not technical teams. Because of the technical complexity involved, it is common to lose focus and steer data warehouse projects as technical initiatives. Although technical architecture is critical, the business stakeholders should be the ones driving the design and deliverables.

Scale back on deliverables

Of the reasons I’ve found that data warehouse projects fail, trying to do too much in one iteration is a common factor. Big-bang data warehouse projects don’t leave much flexibility for design changes or refactoring after user acceptance. If a stalled or failed data warehouse has many concurrent development initiatives, consider cutting back on the initial set of deliverables and deploying in phases. This can add time to the overall schedule, but it yields a better product.

Bring in a hired gun

Insourcing your data warehouse project is often the right solution: you aren’t spending money on external help, you don’t lose that project-specific knowledge when the project is done, and your team gains the experience of building out the technical and nontechnical assets of the project. However, if a data warehouse project has stalled, bringing in the right partner to get back on track can help to save the project, and save time and money in the long run.

Conclusion

Like any technical project, data warehouse initiatives can stall or even fail. If this happens, it is important to properly set the project back on track, or wind it down as gracefully as possible if the project has been abandoned.

Upcoming Webinars on SQL Server Reporting Services 2016

With last year’s release of SQL Server 2016, those of us who trade in the reporting and analytics space got a big gift: a brand-new version of SQL Server Reporting Services. Unlike prior versions, SSRS 2016 brought many much-needed improvements that immediately increased the usability and relevance of Reporting Services as a reporting and analytics tool. For the first time in many years, SQL Server Reporting Services is exciting again!

For the next three weeks, we will be hosting weekly webinars to demonstrate the new features of Reporting Services 2016. Please join us for the following webinars:

Introduction to the Report Portal in SSRS 2016 (Tuesday, January 24th)

Introduction to KPIs in SSRS 2016 (Tuesday, January 31st)

Introduction to Mobile Reports in SSRS 2016 (Tuesday, February 7th)

We look forward to seeing you there!

Keeping Data Quality Simple

A universal truth in business is that bad data costs money, and can even be dangerous. Data quality processes are often an afterthought rather than a central component of the architecture, due in part to fear over the complexity of checking and cleansing data. In many cases, that fear is warranted; the longer data quality work is delayed (in time as well as in data lineage), the more time and money it costs to make things right.

Some data quality needs require a formal project and specialized tools to address. Other issues can be mitigated or avoided entirely with small changes to the data model or load processes. We like to look for these “low-hanging fruit” opportunities when designing a data or ETL infrastructure, because some of these simple modifications can save dozens of hours (or more) and significantly improve data quality very quickly.


Among the passive changes that can help prevent or fix data quality issues:

Use of proper data types. Many of the data quality issues we fix involve numerical data stored as text, into which non-numerical data is inadvertently loaded. Even harder to detect is numerical data stored at an incorrect precision. The same goes for date values, date + time values, and geolocation data, among others. Storing data in the type that most closely represents its real use avoids a lot of downstream problems that are often hard to diagnose and expensive to fix.

Non-nullable fields. Every major RDBMS platform supports non-nullable fields, which require a value in said field before an insert or update operation will complete. If a particular field must have a value before that record can be considered valid, marking the column as non-nullable can avoid data consistency issues where that value is missing.

Foreign key validation. The use of foreign keys for data validation is a best practice in most any relational database architecture, and that is doubly true when improving data quality is a main objective. A foreign key limits a column to the values explicitly allowed in the referenced table, preventing stray values from sneaking into the constrained field.

Check constraints. Preventing the insertion of values outside a defined range can be accomplished through check constraints, found in every major database platform. Like foreign keys, these limit the values that can be entered for a column, but a check constraint does not use a separate lookup table; instead, you define a range (or other rule) of allowable entries rather than a discrete list of values. An example would be using a check constraint to enforce that all dates of birth fall on or after a certain date.
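
The sketch below pulls the preceding suggestions together in one hypothetical table definition (the names, the referenced dbo.StateLookup table, and the date rule are illustrative assumptions): proper types, non-nullable columns, a foreign key, and a check constraint.

    -- Assumes a dbo.StateLookup reference table with a StateCode primary key.
    CREATE TABLE dbo.Patient (
        PatientID   int          NOT NULL PRIMARY KEY,
        LastName    varchar(100) NOT NULL,           -- required value (non-nullable)
        DateOfBirth date         NOT NULL,           -- stored as a date, not as text
        StateCode   char(2)      NOT NULL
            REFERENCES dbo.StateLookup (StateCode),  -- foreign key validation
        CONSTRAINT CK_Patient_DateOfBirth
            CHECK (DateOfBirth >= '1900-01-01')      -- check constraint on a range
    );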

ETL cleanup. Most ETL tools have built-in functionality for lightweight data cleansing. Assuming the data in question is being processed through a structured ETL tool, adding logic to correct minor data quality issues is relatively easy to do with low risk. Emphasize a light touch here: you don’t want to attempt address standardization or name deduplication without a formal, codified process.
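
As a sketch of the “light touch” level of cleanup that fits comfortably inside a load step (the staging and target table names are assumptions), the statement below trims stray whitespace and converts empty strings to NULL on the way from staging to the target:

    -- Light-touch cleanup while moving rows from staging to the target table.
    INSERT INTO dbo.Customer (CustomerCode, Email)
    SELECT LTRIM(RTRIM(CustomerCode)),          -- trim stray whitespace
           NULLIF(LTRIM(RTRIM(Email)), '')      -- treat empty strings as missing
    FROM staging.Customer;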

No Substitute for Formal Data Quality Processes

Even when taking these precautions to prevent or correct some issues, you’ll still run into cases where a more rigid and comprehensive data quality initiative is needed. The above suggestions will not eliminate the need for proper data quality tooling, but they can reduce the pain of, or delay the need for, in-depth data quality remediation.

Conclusion

Data quality requires a multifaceted strategy. Taking care of some of the simple problems with easy-to-use tools already at your fingertips can have a significant and immediate impact on the quality of your data.