Data Quality and Machine Learning: What’s the Connection?

Even as big data grows to the scale of zettabytes, poor data quality is hindering organizations from performing to their full potential. Gartner’s Data Quality Market Survey estimated that data quality issues alone cost the average organization approximately $15 million in 2017. Clearly, this is a concern that needs to be addressed.

Traditionally, organizations have used a combination of manual and automated methods to tackle this issue. Most of these solutions, however, are siloed approaches rather than part of a comprehensive data governance strategy spanning the entire organization. In today’s big data world, the problem has grown in complexity, necessitating out-of-the-box solutions. This is where machine learning (ML) assumes its crucial role.

A look into the current data quality scenario

Data protection legislation such as the European Union’s General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) has underscored the need to approach data management and governance with rigor. Organizations today are accordingly becoming more cognizant of the need to prioritize data quality.

Most enterprises have taken to addressing data quality issues by defining tight rules in their databases, developing in-house data cleansing applications, and leveraging manual operations. However, there are several limitations to this approach.

  • The 3Vs of big data—variety, velocity, and volume—have made data quality a tough problem to crack. Multiple sources and types of data require customized approaches. For example, companies have access to data from technologies like IoT sensors, which present the challenge of unforeseen volumes and non-standardized data formats across devices.
  • Semi-structured and unstructured data types introduce additional complexity in processes such as data validation. For instance, how do you differentiate between incorrect and outlier data? Add to that the sheer volume of data, and manual cleansing is clearly not a sustainable solution.

Although automated solutions help overcome challenges such as high volumes to an extent, plain automation is not without its limitations and biases. Human biases can be built into an automated algorithm, which is risky. The resulting problems may not surface until it’s too late, and they can manifest as a negative impact on the customer experience and even the bottom line.

For example, organizations that use machine learning for recruiting may end up employing an algorithm that weeds out candidates from a certain zip code or college based on historical data. Although this filtering may seem innocuous, the embedded bias can lead to passing up a candidate who is a perfect fit or, conversely, hiring an individual who is not, ultimately wasting the time and resources invested in onboarding.

The business landscape and the corresponding rules regarding data change so frequently that you need systems intelligent and agile enough to keep pace. This is why machine learning systems, which have the capability to teach themselves, can prove an ideal solution for dealing with data quality issues. As rules and standards change, machines can evaluate data, assess its quality, predict missing inputs, and provide recommendations.

Without leveraging machine learning or any system to ensure good data quality, organizations are apt to lose out significantly. Poor data quality that involves numerous errors such as duplicate data entries, incomplete entries, and broken formats hinders an organization’s ability to gain accurate and timely insights that could drive business decisions and reveal operational inefficiencies.

Machine learning explained

Machine learning is the process of applying statistical techniques to data to train computers to make decisions. Unlike typical software, which is explicitly programmed to behave in a certain way, machine learning software learns from data.

Machine learning software also improves its logic as it gains experience through exposure to more data and scenarios. This is comparable to how humans think and learn: we get better at playing a game or speaking a new language as we practice.
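
To make the contrast concrete, here is a minimal sketch, assuming scikit-learn and a toy dataset of labeled records, of a model inferring a decision rule from examples rather than having one hand-coded:

```python
# A toy illustration of learning from data versus explicit programming.
# The record fields and labels below are hypothetical.
from sklearn.tree import DecisionTreeClassifier

# Each record: [order_amount, customer_age]; label 1 = valid, 0 = suspect.
X = [[120, 34], [15, 22], [9800, 19], [240, 45], [8700, 21], [60, 30]]
y = [1, 1, 0, 1, 0, 1]

# Instead of hand-coding a rule like "if amount > 5000: flag",
# the model infers a similar boundary from the labeled examples.
model = DecisionTreeClassifier(max_depth=2).fit(X, y)

print(model.predict([[9100, 20]]))  # -> [0]: flagged, with no explicit rule written
```

As more labeled examples arrive, the model can simply be retrained, which is the “practice” effect described above.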

Although they have been around for decades, machine learning programs have become popular more recently due to their use in mainstream commercial applications such as Facebook’s facial recognition, Netflix’s user recommendations, and Google’s speech recognition and predictive search, among others.

Improving data quality using machine learning

Several firms today have started implementing machine learning solutions as part of their data strategy. In a recent survey, 61% of respondents acknowledged AI and ML as their top data initiatives for the year. Given the number of unknowns that data management systems have to deal with, and the challenges introduced by big data, this is not a surprise.

The biggest strength of machine learning is that it considerably expedites data cleansing: what typically takes weeks or months can now be finished in hours or days. Also, volume, which was a disadvantage for manual data operations, is actually an advantage for machine learning programs, as they improve when trained with more data.

Let’s take a look at a few specific areas under the umbrella of data quality that machine learning helps to address:

Fill data gaps

While many automation systems can cleanse data based on explicit programming rules, it’s almost impossible for them to fill in missing data without manual intervention or additional data source feeds. Machine learning, however, can make calculated estimates of missing values based on the surrounding data.
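
As a minimal sketch of what gap filling can look like, assuming scikit-learn and pandas are available (the column names and values below are hypothetical), a k-nearest-neighbors imputer estimates each missing value from the most similar complete records:

```python
# Hedged sketch: ML-based gap filling with scikit-learn's KNNImputer.
# All field names and values are made up for illustration.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "monthly_usage":  [310.0, 295.0, np.nan, 480.0, 305.0],
    "household_size": [2.0,   2.0,   2.0,    4.0,   np.nan],
    "tenure_years":   [3.0,   4.0,   3.0,    7.0,   4.0],
})

# Each gap is estimated from the k most similar complete rows,
# rather than from a fixed default or a manual lookup.
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled)
```

In production, the same idea scales to far wider tables, and model-based imputers can be swapped in where nearest-neighbor estimates are too crude.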

For example, Calor Gas, a leading supplier of liquefied petroleum gas (LPG) in the UK, faced fierce competition in acquiring and retaining customers. The company sought to create a more personalized and relevant experience to engage current and potential customers. However, with more than 100,000 domestic customers, this proved to be a significant challenge.

Using machine learning, Calor Gas created an algorithm based on three key metrics: churn risk, customer value, and segment. This algorithm helped fill in the gaps to provide a 360-degree view of the organization’s customer base, detailing each customer’s lifetime value and likelihood of churn.

Assess relevance

At the opposite end of the spectrum to missing data, organizations often accumulate a large amount of redundant data over the years that does not have any use in a business context.

For example, in the financial industry, machine learning is being used to speed up the usually lengthy mortgage application process. The process typically involves a great deal of paperwork and document signing. While automatic data capture eliminates some of these routine manual steps, machine learning takes it a step further by determining the relevance of the data gathered.

Using machine learning, the system is able to teach itself which data points are required and which can be eliminated. Analysis of this kind can help revamp the process and, eventually, make it simpler.
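
One common way to approximate this kind of relevance assessment is to inspect a trained model’s feature importances. The sketch below assumes scikit-learn and uses synthetic data as a stand-in for captured application fields (the field names are hypothetical):

```python
# Hedged sketch: ranking data points by how much a model actually uses them.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for captured application data: some columns are
# informative, others are noise the process could stop collecting.
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_redundant=1, random_state=0)
fields = ["income", "credit_score", "loan_amount",
          "fax_number", "middle_initial", "branch_code"]

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
for name, score in sorted(zip(fields, model.feature_importances_),
                          key=lambda pair: -pair[1]):
    print(f"{name:15s} {score:.3f}")  # consistently low scores suggest candidates to drop
```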

Detect anomalies

Machine learning programs are decidedly effective at spotting patterns, associations, and rare occurrences in a pool of data. This can be useful in several real-life situations.

In healthcare, for instance, such an algorithm was able to identify 52% of breast cancer cases almost a year earlier than they would otherwise have been diagnosed. By training the program on early mammography scans from women who later developed cancer, it was possible to teach it to detect the disease in advance.

Identifying fraudulent transactions among a sea of financial data and detecting malware are other applications of such programs.

Here is an example of how to use AWS infrastructure (a combination of Amazon SageMaker, AWS Glue, and AWS Lambda) to build, train, and deploy ML programs that detect outliers in your data.
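
For a smaller-scale illustration of the same idea, here is a minimal local sketch using scikit-learn’s IsolationForest; the transaction amounts are made up, and a real pipeline would add feature engineering and evaluation:

```python
# Hedged sketch: unsupervised outlier detection on synthetic transactions.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
amounts = np.concatenate([rng.normal(50, 10, 500),   # typical transactions
                          [900.0, 1250.0, -40.0]])   # injected anomalies

detector = IsolationForest(contamination=0.01, random_state=0)
labels = detector.fit_predict(amounts.reshape(-1, 1))  # -1 marks outliers

print(amounts[labels == -1])  # surfaces roughly the most extreme ~1% of values
```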

Identify and remove duplicates

Duplicate data has always been a menace for data stewards, eating into their productivity. For marketing teams, identifying when several records point to the same customer is crucial for creating targeted marketing strategies. However, 81% of marketers acknowledged in a survey that developing a single customer view is a huge challenge.

Inconsistent data entry across systems, typos, and stale data (e.g., changed addresses) make this challenge even more difficult to tackle manually. This is where machine learning programs can be trained to perform fuzzy matching, a process in which the program compares several additional attributes and makes a statistical determination of whether two records refer to the same entity.
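
The core building block of fuzzy matching is a similarity score. Here is a minimal sketch using Python’s standard-library difflib (the customer records are hypothetical); a trained system would typically combine many such scores rather than rely on a single fixed threshold:

```python
# Hedged sketch: string similarity as a fuzzy-matching signal.
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a ratio in [0, 1] based on longest matching subsequences."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

pairs = [
    ("Jon Smith, 12 High St, Leeds", "John Smith, 12 High Street, Leeds"),
    ("Jon Smith, 12 High St, Leeds", "Jane Doe, 4 Mill Lane, York"),
]

for a, b in pairs:
    score = similarity(a, b)
    verdict = "likely duplicate" if score > 0.8 else "distinct"
    print(f"{score:.2f}  {verdict}")
```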

Match and validate data

Coming up with rules to match data collected from various sources can be a time-consuming process, and it only becomes more challenging as the number of sources increases. ML models can be trained to learn the rules and predict matches for new data. There is no restriction on the volume of data; in fact, more data works favorably in fine-tuning the model.
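
As a hedged sketch of what “learning the rules” can mean, the example below trains a simple classifier on similarity features of labeled record pairs (scikit-learn is assumed, and the features, labels, and records are all made up):

```python
# Hedged sketch: learning match rules from labeled pairs instead of
# hand-writing them. All data here is hypothetical.
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def sim(x, y):
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()

def features(a, b):
    # [name similarity, address similarity, exact postcode match]
    return [sim(a["name"], b["name"]), sim(a["addr"], b["addr"]),
            1.0 if a["zip"] == b["zip"] else 0.0]

# Labeled training pairs, expressed as feature vectors: 1 = same entity.
X = [[0.95, 0.90, 1.0],   # near-identical records -> match
     [0.90, 0.40, 1.0],   # same name, new address  -> match
     [0.30, 0.20, 0.0],   # unrelated               -> no match
     [0.50, 0.10, 0.0]]
y = [1, 1, 0, 0]

model = LogisticRegression().fit(X, y)

pair = ({"name": "A. Patel",   "addr": "7 Oak Rd",   "zip": "LS1"},
        {"name": "Amit Patel", "addr": "7 Oak Road", "zip": "LS1"})
print(model.predict([features(*pair)]))  # -> [1]: predicted to be the same entity
```

With real labeled pairs, the model effectively tunes the matching thresholds and weights that would otherwise be maintained by hand.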

Machine learning can also be employed effectively to clean up data errors. For example, some physicists trained a machine learning model on the errors that can potentially occur in a quantum computing protocol. The model was then able to generate error chains that can be used to recover the correct quantum states. In essence, all activities pertaining to data management can take advantage of machine learning models.

The cost of bad data

Bad data can prove to be quite expensive for companies, and attempts to quantify the financial impact have produced some shocking numbers. According to a research study published in MIT Sloan Management Review, companies lose around 15% to 25% of their revenues due to poor data quality. IBM estimated that the annual impact to the US economy alone is a staggering $3.1 trillion.

From a productivity perspective, too, the situation appears bleak. Data scientists spend 80% of their time finding, cleansing, and organizing data, leaving only 20% of their time to perform analysis. That’s a lot of clock-hours wasted by highly paid professionals on peripheral work.

It’s also important to remember that decisions based on flawed data can have severe consequences. For example, governments may implement policy decisions based on incorrect data, which can impact generations to come. For commercial enterprises, bad decisions can mean damaging customer relationships or even losing customers.

Machine learning algorithms can flag some of these situations before they get too far. Financial companies use them to identify fraudulent transactions. In fact, it’s estimated that ML models can deliver $12 billion in savings for card issuers and banks.

How machine learning helped Travis Perkins see a 60% increase in web traffic

Let us look at an example of the impact that integrating machine learning with your big data strategy can make. Travis Perkins, the U.K.’s largest building materials supplier, struggled with incomplete and inconsistent product information, which required manual validations and corrections. These manual updates did not always happen consistently and were prone to errors. As the company moved to online sales, these data quality issues needed to be fixed at a faster pace than before.

The company chose Talend Data Services and integrated it with its product information management tool. First, Talend’s data quality firewall performed validation, cleansing, and deduplication before letting product details in. Moreover, where data was missing, Talend was able to report the gaps and backfill entries.

Within the first month of implementing the Talend data platform, 10,000 updates were made to product entries. With product descriptions now accurate and consistently available for all materials, customers can find products and pages online more easily than ever. The company has since enjoyed a 60% increase in web traffic and a 30% increase in sales.

The future of machine learning

Machine learning has already gone mainstream, with many businesses employing it as part of their data management strategy. The good news is that not every company has to write its own machine learning models; it’s possible to achieve your objectives using off-the-shelf products.

Talend Data Fabric ensures data quality by employing built-in machine learning in its end-to-end data management platform. With a focus on digital transformation and data integration, Talend Data Fabric uses machine learning standardization methods to prevent duplication and perform data validation on data from external sources. If you’re ready to experience the power of machine learning at its finest, try Talend Data Fabric today.

Ready to get started with Talend?