Profiling, Now Available for Big Data

In version 5.2, Talend introduces a key new feature for big data: profiling. Big data quality has been part of the big data story since day one, with a number of key data quality components being ported to Hadoop in the first version of Talend Big Data earlier this year – and we are continuing this effort to streamline big data management.

Profiling is the initial phase of any data integration project. Or at least it should be. Data profiling helps to discover and understand the data that is available, and takes the guesswork out of finding and identifying problem data. Duplicate, incomplete and inconsistent records undermine the efficiency and usefulness of that data. This is true of “traditional” data sources, and even more true of big data.

The big data profiling features in Talend v5.2 allow users to analyze data stored in their Hive database on Hadoop. Profiling is performed “in place”, which means that data does not need to be extracted from Hadoop before being profiled. Instead, big data profiling leverages the power of the Hadoop cluster, allowing users to scale out without limit. In fact, tests have shown that profiling time does not increase significantly when data volumes increase.
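To make “in place” a little more concrete, here is a minimal sketch of the kind of aggregate query a profiler could send to Hive so that the cluster does the counting and only a single summary row comes back. This is an illustration only, not the query Talend actually generates: the function name build_profile_query and the table and column names are hypothetical, and the sketch assumes trusted identifiers (it does no quoting or escaping).

```python
def build_profile_query(table: str, column: str) -> str:
    """Return one HiveQL statement that computes basic column indicators.

    Hypothetical helper for illustration; identifiers are inserted as-is,
    so only pass table and column names you trust.
    """
    return (
        "SELECT COUNT(*) AS row_count, "
        # Rows where the column is NULL or contains only whitespace.
        f"SUM(CASE WHEN {column} IS NULL OR TRIM({column}) = '' "
        "THEN 1 ELSE 0 END) AS blank_count, "
        # Distinct non-NULL values; compare with row_count to spot duplicates.
        f"COUNT(DISTINCT {column}) AS distinct_count, "
        f"MIN(LENGTH({column})) AS min_length, "
        f"MAX(LENGTH({column})) AS max_length "
        f"FROM {table}"
    )


if __name__ == "__main__":
    # Hypothetical table and column names.
    print(build_profile_query("customers", "email"))
```

Because the statement is a plain aggregate, Hive evaluates it across the cluster and the raw rows never leave Hadoop.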

And of course, big data profiling provides custom graphical reports with standard tests that apply to all types of data, such as empty or missing values, number of duplicates, length of data, shapes of data, e-mail validation and phone number validation.
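For readers new to these indicators, the sketch below shows roughly what a few of them compute on a small in-memory column. It is a simplified illustration, not Talend's implementation: the names profile_column and shape, the shape notation (digits become 9, letters become A or a) and the deliberately loose e-mail regular expression are all assumptions made for the example.

```python
import re
from collections import Counter

# Loose e-mail check for illustration only: something@something.something
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def shape(value: str) -> str:
    """Map a value to its 'shape': digits -> 9, letters -> A/a, rest kept."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)
    return "".join(out)


def profile_column(values):
    """Compute a few basic indicators for a list of strings (None allowed)."""
    # Treat None and whitespace-only strings as empty/missing.
    present = [v for v in values if v is not None and v.strip() != ""]
    counts = Counter(present)
    lengths = [len(v) for v in present]
    return {
        "rows": len(values),
        "blank": len(values) - len(present),
        # Every occurrence beyond the first counts as a duplicate.
        "duplicates": sum(c - 1 for c in counts.values()),
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
        "shapes": Counter(shape(v) for v in present),
        "valid_emails": sum(1 for v in present if EMAIL_RE.match(v)),
    }


if __name__ == "__main__":
    sample = ["a@b.com", "a@b.com", "", None, "bad-email", "X1@y.org"]
    print(profile_column(sample))
```

A shape frequency table is often the quickest way to spot inconsistent formats, since values that follow the same pattern collapse into a single entry.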

Finally, big data profiling works the same way as traditional data profiling and is fully integrated with it. So you don’t have to do things differently for big data. And you don’t need to let big data become another data silo.