Talend Open Profiler

Data profiling is the process of examining the data available in existing data sources (e.g. databases, applications, files) and collecting statistics and information about this data. It enables the quality of the data contained in the information system to be assessed against a defined set of metrics and goals.

Talend Open Profiler is a sophisticated, yet simple-to-use open source data profiling tool that describes the content, structure, and quality of highly complex data structures. It allows business users and data management staff to perform a large variety of analyses using a set of indicators, patterns and rules for each data element being analyzed or monitored. It analyzes data on an ongoing basis and tracks changes to source data over time to help improve data quality.

Not sure if you need Talend Open Profiler or Talend Data Quality? Check out the features comparison matrix.

Metadata discovery

Talend Open Profiler connects to databases to introspect their structures and stores the description of their metadata in its Metadata Manager.

A filtering system lets users select only a subset of tables or columns for the analysis. This improves connection performance when the database contains a large number of tables and helps data analysts focus on the most relevant data.
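
The discovery and filtering steps described above can be sketched outside the tool. The following is a minimal illustration, not Talend Open Profiler's implementation: it assumes a SQLite database, and the function name `discover_metadata`, the wildcard filter syntax and the sample tables are all hypothetical.

```python
import sqlite3
from fnmatch import fnmatchcase


def discover_metadata(conn, table_filter="*", column_filter="*"):
    """Return {table: [(column, declared_type), ...]} for names matching the filters.

    Hypothetical helper: the wildcard filters only imitate the idea of
    restricting discovery to a subset of tables and columns.
    """
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    metadata = {}
    for table in tables:
        if not fnmatchcase(table, table_filter):
            continue  # filtered-out tables are never introspected
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        metadata[table] = [
            (name, col_type)
            for _cid, name, col_type, *_rest in columns
            if fnmatchcase(name, column_filter)
        ]
    return metadata


# Sample schema, invented for the example.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER, name TEXT, age INTEGER);
    CREATE TABLE cust_orders (id INTEGER, customer_id INTEGER, total REAL);
    CREATE TABLE audit_log (id INTEGER, message TEXT);
    """
)

# Only tables whose name starts with "cust" are described.
catalog = discover_metadata(conn, table_filter="cust*")
for table, columns in catalog.items():
    print(table, columns)
```

Skipping filtered-out tables before introspecting them is what keeps the number of catalog queries small on a large schema.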

Data analysts then use the metadata to perform database comparisons and analyses, and to set up data quality metrics and indicators. These help users assess the quality of the analyzed data and make decisions about possible data cleansing, data integration or data stewardship measures.

In addition, an embedded data explorer allows users to directly drill down into the tables of the analyzed databases and browse the data using industry-standard SQL queries.

Custom business rules

Business rules are specific criteria, thresholds or ranges of values that are used to identify matching records, illogical records (e.g. an age that is negative or not a whole number) or records that do not match the expected values.

A dedicated wizard makes it easy to set up custom data quality business rules. Rules are defined in industry-standard SQL, and join conditions can be used for more complex needs. Each rule sets the expected threshold for a data quality indicator's value, and the range or statement it defines is used to measure data quality in the selected table.
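
As a rough illustration of a rule expressed as a SQL condition and measured against a threshold, the sketch below checks the age example from above. It assumes a SQLite database; the helper `rule_violation_rate`, the `people` table and the 10% threshold are hypothetical and do not reflect how the wizard stores or runs rules.

```python
import sqlite3


def rule_violation_rate(conn, table, where_clause):
    """Return (violations, total, rate) for rows that do NOT satisfy where_clause.

    The table name and clause are interpolated as-is, so this sketch is only
    suitable for trusted, analyst-written rules. Rows for which the clause
    evaluates to NULL (e.g. a missing age) are counted as violations.
    """
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    matching = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {where_clause}"
    ).fetchone()[0]
    violations = total - matching
    rate = violations / total if total else 0.0
    return violations, total, rate


# Sample data, invented for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age NUMERIC)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Ann", 34), ("Bob", -2), ("Cid", 41.5), ("Dee", 27)],
)

# Rule: an age must be non-negative and a whole number.
rule = "age >= 0 AND age = CAST(age AS INTEGER)"
violations, total, rate = rule_violation_rate(conn, "people", rule)

# Hypothetical threshold: at most 10% of rows may break the rule.
max_rate = 0.10
passed = rate <= max_rate
print(f"{violations}/{total} rows violate the rule; passed={passed}")
```

Writing the rule as the condition that valid rows satisfy keeps it readable as a statement of what is expected; the violations are simply the remaining rows.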

Patterns

Patterns are reference definitions against which the analyzed data is checked during data profiling. A library of predefined patterns is available natively; it covers the most frequent data quality issues and helps define the most commonly expected forms of the analyzed data.

In addition, fully customized patterns can be built based on regular expressions or SQL statements for optimized and more detailed inspection of data.
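
To make the idea of a regular-expression pattern concrete, here is a small sketch that counts how many values in a column conform to a pattern. The function `pattern_match_stats`, the postal-code expression and the sample values are assumptions made for the example; they are not taken from the tool's pattern library.

```python
import re


def pattern_match_stats(values, pattern):
    """Count non-null values that fully match (or fail to match) a regex."""
    regex = re.compile(pattern)
    non_null = [v for v in values if v is not None]
    matching = sum(1 for v in non_null if regex.fullmatch(v))
    return {
        "matching": matching,
        "not_matching": len(non_null) - matching,
        "match_ratio": matching / len(non_null) if non_null else 0.0,
    }


# Hypothetical pattern: five digits, optionally followed by "-" and four digits.
POSTAL_CODE = r"\d{5}(-\d{4})?"

# Sample column values, invented for the example.
codes = ["75011", "9021", "10001-0001", "ABCDE", None]
stats = pattern_match_stats(codes, POSTAL_CODE)
print(stats)
```

Using `fullmatch` rather than `search` means a value conforms only if the whole string has the expected form, which is usually what a format check intends.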

Profiling users can also share their home-grown patterns and leverage patterns developed by other users of the open source Talend Community through the Talend Exchange platform, which is directly accessible in the Talend Open Profiler studio. Regular expressions or SQL patterns can also be imported from a CSV file when the number of patterns to handle is very large.

Indicators

Indicators are the results of the implementation of different patterns. They define the content, structure and quality of the analyzed data and can result from simple to highly complex operations based on data-matching and other data-related operations.

A number of system indicators are available natively in Talend Open Profiler to help users get started with data profiling, including:

  • Simple statistics: provide data profiling statistics on the number of records falling in certain categories, including the number of rows, the number of null values, the number of distinct and unique values, the number of duplicates, and the number of blank fields.
  • Text statistics: analyze the characteristics of text fields, including minimum, maximum and average length.
  • Summary statistics: perform statistical analysis on numeric data, including the computation of the mean, the median, the interquartile range, and the definition of ranges.
  • Advanced statistics: determine the most probable and the most frequent values and build frequency tables based on these values.
  • Pattern frequency statistics: compute the number of most and least frequent records for each distinct pattern.
  • Soundex frequency statistics: index records based on phonetics and sounds.
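
Several of the indicator families listed above can be approximated in a few lines. The sketch below is illustrative only: the function names, the exact definitions of "unique" and "duplicate", and the character-class encoding used for value patterns (A, a, 9) are assumptions, not the tool's documented behavior.

```python
import statistics
from collections import Counter


def simple_statistics(values):
    """Row, null, blank, distinct, unique and duplicate counts for a column."""
    counts = Counter(v for v in values if v is not None)
    return {
        "row_count": len(values),
        "null_count": sum(1 for v in values if v is None),
        "blank_count": sum(1 for v in values if isinstance(v, str) and not v.strip()),
        "distinct_count": len(counts),  # different non-null values
        "unique_count": sum(1 for n in counts.values() if n == 1),  # seen once
        "duplicate_count": sum(1 for n in counts.values() if n > 1),  # seen 2+ times
    }


def text_statistics(values):
    """Minimum, maximum and average length of the non-null text values."""
    lengths = [len(v) for v in values if v is not None]
    return {
        "min_length": min(lengths),
        "max_length": max(lengths),
        "avg_length": sum(lengths) / len(lengths),
    }


def summary_statistics(numbers):
    """Mean, median and interquartile range of numeric data."""
    q1, _q2, q3 = statistics.quantiles(numbers, n=4)
    return {
        "mean": statistics.mean(numbers),
        "median": statistics.median(numbers),
        "iqr": q3 - q1,
    }


def value_shape(value):
    """Encode a value's form: uppercase -> A, lowercase -> a, digit -> 9."""
    shape = []
    for ch in value:
        if ch.isdigit():
            shape.append("9")
        elif ch.isalpha():
            shape.append("A" if ch.isupper() else "a")
        else:
            shape.append(ch)  # punctuation and spaces are kept as-is
    return "".join(shape)


def pattern_frequency(values):
    """Frequency table of value shapes, most frequent first."""
    return Counter(value_shape(v) for v in values if v is not None).most_common()


# Sample columns, invented for the example.
cities = ["Paris", "Lyon", "Paris", "", None, "Nice"]
amounts = [10, 20, 30, 40, 50, 60, 70, 80]
references = ["AB-123", "cd-456", "EF-789", "12345"]

simple = simple_statistics(cities)
text = text_statistics(cities)
summary = summary_statistics(amounts)
shapes = pattern_frequency(references)
print(simple, text, summary, shapes, sep="\n")
```

Note that `statistics.quantiles` uses its default "exclusive" method here; other quartile conventions give slightly different interquartile ranges on small samples.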

Dedicated wizards help users to define their own customized indicators based on industry-standard SQL or Java statements to track new data quality metrics or specific data characteristics.

Rendering

For each table, column, data element and indicator selected, Talend Open Profiler produces sophisticated reports and graphs that let users gauge at a glance the results of the data profiling, directly in the analysis editor.