Data Quality Is Still An Issue For Big Data


You often hear the argument that data quality isn’t a major concern in big data projects, since the volume of information being analysed can smooth over any problems. However, Forrester researcher Michele Goetz points out this doesn’t mean IT shouldn’t work hard to ensure the data is as accurate as possible.


In a blog post, Goetz notes that “analysis can fill gaps by emphasising pattern recognition over master data”. That’s especially common in marketing, where data sources are often sketchy.

Despite that, it would be a mistake to use this as an excuse to minimise data quality efforts, she suggests:

IT still needs to support and certify data quality in the access and integration of data. It isn’t a question of good enough data, it is about data quality efforts that matter to outcomes.

Good point. Hit the full post to read more, and check out our top 10 rules for working with big data for more insights.

Data quality and data science are not polar opposites [Forrester Blogs]


  • IT’s role in data quality is purely that of a gatekeeper. IT should have processes in place to identify poor data, prevent bad data from being entered, and facilitate fixing it. But IT should never change the data itself. Decisions should always go back to the business owner: it’s their data, they’re closest to it, and they need to decide what is important, what the values should be and how issues should be handled.

  • I disagree. Data quality has matured to the point where there are business processes that can be fully automated based on agreed-upon standards. Traditional DQ, such as name and address cleansing, can be fully automated based on agreed-upon postal standards. Product information can be standardized based on agreed-upon product codes. Where it gets trickier is when business rules are applied to conditions that can only be determined by a business analyst. For instance: “When should a product code identifier be retired based on the sunsetting of a product, and therefore no longer be accepted in a cash-to-order system?” In this case, analysts may see a retired product code still being used by a functional area of the business (sales, marketing, shipping), and thus inaccurate revenue may be recorded. IT’s role needs to be defined in collaboration with the business, just as the business’s role does. This type of collaboration usually occurs in the function of data governance.
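The retired-product-code scenario the commenter describes can be sketched as an automatable data-quality rule. This is a minimal illustration only; the product codes, dates and function names are invented for the example, not drawn from the article or any real system.

```python
from datetime import date

# Hypothetical product catalogue: code -> retirement date (None = still active).
PRODUCT_CODES = {
    "SKU-100": None,                # active product
    "SKU-200": date(2013, 1, 31),   # sunset product, code retired
}

def validate_order(product_code, order_date):
    """Flag orders that use an unknown or retired product code.

    This is the kind of agreed-standard check that can run automatically;
    deciding *when* a code should be retired remains a business call.
    """
    if product_code not in PRODUCT_CODES:
        return "unknown product code"
    retired_on = PRODUCT_CODES[product_code]
    if retired_on is not None and order_date >= retired_on:
        return "retired product code"
    return "ok"
```

An order placed in June 2013 against `SKU-200` would be flagged as using a retired code, while the same order against `SKU-100` passes; the rule itself is mechanical once the business has agreed on the retirement dates.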
