Talend Open Profiler

Data profiling is the process of examining the data available in existing data sources (e.g. databases, applications, files) and collecting statistics and information about this data. It makes it possible to assess the quality of the data contained in the information system against a defined set of metrics and goals.

Talend Open Profiler is a sophisticated, yet simple-to-use open source data profiling tool that describes the content, structure, and quality of highly complex data structures. It allows business users and data management staff to perform a large variety of analyses using a set of indicators, patterns and rules for each data element being analyzed or monitored. It analyzes data on an ongoing basis and tracks changes to source data over time to help improve data quality.

These data quality indicators range from simple or advanced statistics to text string analysis, including summary data and statistical distributions of records. The patterns are preset or customized expressions that define the expected form of the analyzed data, and the data quality rules help define custom business thresholds and value ranges.

Talend Open Profiler produces sophisticated reports and graphs that let users gauge, at a glance, the data quality and the status of the predefined indicators. In addition, an embedded data explorer allows users to drill down directly into the tables of the analyzed databases.

Download Talend Open Profiler now!

Want to learn more about the open source data quality tool Talend Open Profiler? Then watch an online demo or check out our users' testimonials.

Not sure if you need Talend Open Profiler or Talend Data Quality? Check out the features comparison matrix.

Metadata discovery

Talend Open Profiler connects to databases and files to introspect their structures and stores the description of their metadata in its Metadata Repository. The metadata is then used by data analysts to set up data quality metrics and indicators.
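As a rough illustration of what introspecting a database structure involves, the following minimal sketch reads table and column descriptions from an in-memory SQLite database. It is not how Talend Open Profiler is implemented; the function name `discover_metadata`, the `customer` table and the dictionary standing in for the Metadata Repository are all assumptions made for the example.

```python
import sqlite3


def discover_metadata(conn):
    """Return {table_name: [(column_name, declared_type, nullable), ...]}.

    Hypothetical helper: a plain dict stands in for a metadata repository.
    """
    metadata = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        metadata[table] = [
            (name, col_type, not notnull)
            for _cid, name, col_type, notnull, _default, _pk in columns
        ]
    return metadata


# Example database, invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER NOT NULL, email TEXT, zip TEXT)")

repository = discover_metadata(conn)
for table, columns in repository.items():
    print(table, columns)
```

An analyst would then pick columns from such a description when deciding which indicators to compute.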

Custom business rules

A dedicated wizard makes it easy to set up custom data quality business rules. These rules define the expected thresholds for the value of a data quality indicator. The resulting range is then used to measure the data quality of the selected table in the data profiling tool.
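The idea of a threshold on an indicator value can be sketched in a few lines. This is only an illustration under assumed names: `ThresholdRule`, `null_ratio` and the 10% limit are hypothetical and do not come from the product.

```python
from dataclasses import dataclass


@dataclass
class ThresholdRule:
    """Hypothetical rule: an indicator value must fall within [lower, upper]."""
    indicator: str
    lower: float
    upper: float

    def check(self, value):
        return self.lower <= value <= self.upper


def null_ratio(values):
    """Share of null entries in a column (0.0 for an empty column)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


# Invented sample column: one of four e-mail addresses is missing.
emails = ["a@example.com", None, "b@example.com", "c@example.com"]

# Assumed business rule: at most 10% of the e-mail values may be null.
rule = ThresholdRule(indicator="null_ratio", lower=0.0, upper=0.10)
value = null_ratio(emails)
passed = rule.check(value)
print(f"{rule.indicator} = {value:.2f}, within threshold: {passed}")
```

Here a quarter of the values are null, so the assumed rule flags the column as outside its expected range.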

Patterns

Patterns serve as master data against which the analyzed data is checked during data profiling. A library of predefined patterns is available for the most frequent data quality issues.

In addition, fully customized patterns can be built based on regular expressions or SQL statements for better and more detailed inspection of data.
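To make the regular-expression case concrete, here is a minimal sketch that counts how many values of a column match a custom pattern. The US ZIP code expression, the sample values and the function name `pattern_match_counts` are assumptions chosen for the example, not patterns shipped with the tool.

```python
import re

# Assumed custom pattern: five digits, optionally followed by "-" and four digits.
ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")


def pattern_match_counts(values, pattern):
    """Return (matching, not_matching); null values count as not matching."""
    matching = sum(
        1 for v in values if v is not None and pattern.fullmatch(v)
    )
    return matching, len(values) - matching


# Invented sample column.
zips = ["02139", "94105-1234", "9410", None, "ABCDE"]
matching, not_matching = pattern_match_counts(zips, ZIP_PATTERN)
print(f"matching: {matching}, not matching: {not_matching}")
```

Using `fullmatch` rather than `search` means a value only counts when the whole string has the expected form, which is usually what a format check intends.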

Profiling users can also share their home-grown patterns and leverage patterns developed by other users in the open source Talend Community through the Talend Exchange platform, which is directly accessible from the Talend Open Profiler studio.

Indicators

The indicators that can be set with open source Talend Open Profiler include:

  • Simple statistics: provides data profiling statistics on the number of records falling in certain categories, including the number of rows, the number of null values, the number of distinct and unique values, the number of duplicates, and the number of blank fields.
  • Text statistics: analyzes the characteristics of text fields, including minimum, maximum and average length.
  • Summary statistics: performs statistical analysis on numeric data, including the computation of the mean, the median, the interquartile range, and the definition of ranges.
  • Advanced data quality statistics: determines the mode and builds frequency tables.
  • Pattern frequency: computes the number of most and least frequent records for each distinct pattern.
  • Soundex frequency: indexes records based on phonetics and sounds.
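
Several of the indicators listed above can be approximated with the Python standard library. The sketch below is an illustration only: the function names, the sample data and the exact definitions (for instance, "unique" meaning a value that occurs exactly once, and patterns built by mapping letters to `A`/`a` and digits to `9`) are assumptions, and Soundex frequency is left out to keep the example short.

```python
import re
import statistics
from collections import Counter


def simple_statistics(values):
    """Row, null, blank, distinct, unique and duplicate counts (assumed definitions)."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "blank_count": sum(1 for v in non_null if v.strip() == ""),
        "distinct_count": len(counts),                           # different values
        "unique_count": sum(1 for c in counts.values() if c == 1),    # occur once
        "duplicate_count": sum(1 for c in counts.values() if c > 1),  # occur more than once
    }


def text_statistics(values):
    """Minimum, maximum and average length of the non-null text values."""
    lengths = [len(v) for v in values if v is not None]
    return {
        "min_length": min(lengths),
        "max_length": max(lengths),
        "avg_length": sum(lengths) / len(lengths),
    }


def summary_statistics(numbers):
    """Mean, median, interquartile range and range of numeric data."""
    q1, _q2, q3 = statistics.quantiles(numbers, n=4)
    return {
        "mean": statistics.mean(numbers),
        "median": statistics.median(numbers),
        "iqr": q3 - q1,
        "range": max(numbers) - min(numbers),
    }


def value_pattern(text):
    """Map upper-case letters to 'A', lower-case letters to 'a', digits to '9'."""
    text = re.sub(r"[A-Z]", "A", text)
    text = re.sub(r"[a-z]", "a", text)
    return re.sub(r"[0-9]", "9", text)


def pattern_frequency(values):
    """Frequency table of value patterns, most frequent first; skips null/empty."""
    return Counter(value_pattern(v) for v in values if v).most_common()


# Invented sample data.
codes = ["AB-12", "CD-34", "ef-56", None, "", "AB-12"]
amounts = [10, 20, 30, 40, 50]

simple = simple_statistics(codes)
text = text_statistics(codes)
summary = summary_statistics(amounts)
patterns = pattern_frequency(codes)
print(simple, text, summary, patterns, sep="\n")
```

The mode and frequency tables mentioned under advanced statistics follow the same idea: `Counter(values).most_common()` gives a frequency table whose first entry is the mode.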

Rendering

Talend Open Profiler presents a series of tables and graphs that display the data profiling results for each selected data element and indicator.