Performing a Column Analysis

Task outline

Column analysis is a type of custom analysis for which you can manually set up several indicators. Gathered information such as the length of values, patterns that values match, the uniqueness of values, and other statistics can help you identify problem areas and ultimately determine how to resolve data quality issues.

After starting Talend Studio, you will establish a connection to the local database server. Then you will create a basic column analysis and use built-in patterns based on regular expressions to analyze individual column data statistics. Finally, you will interpret the results.

Your training environment uses a single virtual machine (VM) that contains all the software you need to complete this tutorial, including Talend Studio and MySQL.

Starting Talend Studio

  1. To start the virtual machine, open a new tab with the same page and click the following link : START VM!

    The VM is launched in your web browser. Wait for Windows to start.

    A script is automatically launched. You can close it by clicking the X button.

  2. Select Yes on the Networks panel on the right side of your screen.

  3. To start Studio, on your VM desktop, double-click the Talend Studio shortcut.

  4. The Talend Data Fabric window opens.

    A project named DQ Essentials is available.

    Click Finish.

  5. Verify that the splash screen appears.

  6. Click the Start now! button (you may need to scroll down).

    The window opens.

Depending on your preferences and available tools, your screen may vary.

You can see the commonly used areas:

  • The DQ Repository, where data analyses and metadata are stored, is in the upper left corner.
  • The workspace, where you create and modify Jobs, run analyses, and examine results, is in the center.

Creating database connection metadata

    1. Make sure you are in the Profiling perspective. In the upper part of the window, click the Profiling icon:

    2. Create metadata. In DQ Repository, expand Metadata.

      Right-click DB connections and click Create Connection.

      You are creating a connection to a collection of databases already configured in your training environment.

      1. In the Name text box, enter StagingDB.

      2. Click Next.

      3. On the DB Type drop-down list, click MySQL. Do not modify the default value in the Db Version box.

      4. Enter these settings:

        • In the Login and Password text boxes, enter root
        • In the Server text box, enter localhost
        • Leave the DataBase text box empty
      5. To verify your connection information, click the Check button.

      6. In the Check Connection window, click OK or make any necessary corrections and again click Check.

      7. Click Finish.

        Your new database connection, StagingDB, appears in the repository.

        The workspace displays the configuration information for the connection.

        If needed, make changes to the connection settings.

Setting up a basic column analysis

  1. Create a basic column analysis.
    1. Right-click the CRM_Analyses folder.

    2. Click New Analysis, expand Column Analysis, and click Basic Column Analysis.

    3. Read the context-sensitive help for Basic Column Analysis. Click Next.

    4. In the Name text box, enter Email_Column_Analysis.

    5. Click Finish.

    The new analysis appears in the appropriate folder in the Data Profiling repository, and it is automatically open in the Profiling perspective.

  2. Select data.

    When creating a new analysis, the Data Preview section is open by default. On the Connection drop-down menu, StagingDB is selected.

    To specify which columns to analyze, click the Select Columns button.

    1. Expand the crm catalog and Tables subdirectory, and click the customer table.

    2. Select the Email column.

    3. Click OK.

      The Email column is shown in the Analyzed Columns section, and an overview of the data is displayed in the Data Preview section.

  3. Select indicators.

    1. Click the Select Indicators button.

    2. Expand Simple Statistics. To select all types of simple statistics, click the first column to its right. The table expands to show the details included in the results of the analysis.

    3. Click some indicators to display their purpose and description in the lower section.

    4. Click OK.

      The Analyzed Columns list shows the statistics selected for the column.

    5. To run the analysis, click the Run icon.

Analyzing the results

The Analysis Results tab is automatically displayed. This view shows the results as both tables and graphs.

Look closely at the indicators you selected:

  • Row Count is the number of records in the column
  • Null Count is the number of rows without any data
  • Distinct Count is the number of distinct values with one or several occurrences in data
  • Unique Count is the number of distinct values with exactly one occurrence. If Unique Count equals Row Count, the column is a good candidate for primary key
  • Duplicate Count is the number of distinct values with several occurrences. Simply put, Duplicate Count and Unique Count equal Distinct Count
  • Blank Count is the number of non-null records that contain only spaces (or nothing)

Using a built-in regular expression

Beyond basic and text statistics, a data analysis allows you to study the data structure by collecting information about the patterns of values. You can then compare the outcome to specified regular expressions to compute matching statistics.

  1. Examine the built-in pattern.

    1. In the Repository, expand Libraries, Patterns, Regex, and internet.

    2. Double-click Email Address. The pattern settings for email addresses open.

      In the Pattern Definition section, notice that the pattern includes regexes for databases, Java, and the default entry. All of these display the standard format of an email address.

    3. Close the Email Address pattern tab.
  2. Use the built-in pattern.

    To replace the Email Pattern control you previously set up with the built-in Email Address regex from the repository, open the Analysis Settings tab for the Email_Column_Analysis.

    1. In the Pattern column of the Analyzed Columns list, click the Add pattern icon (the only one in that column). The Pattern Selection window opens.

    2. Expand Patterns, Regex, and internet, then select Email_Address.

    3. Click OK.

      The pattern appears at the bottom of the indicator list.

      This process has the same effect as dragging the pattern from the repository.

      Now the analysis compares the values in this column to the built-in pattern for email addresses.

  3. Examine the results.

    Run the analysis.

    1. Click the Analysis Results tab and locate Pattern Matching.

      The pattern matching results indicate that even though the Email column contains a wide variety of text patterns, most of them match the basic structure of an email address.

    2. Right-click the pattern row and click View invalid rows.

      SQL Editor opens, displaying the results of a SQL query.

    3. Close SQL Editor.

    4. You can also use SQL Editor for other indicators.

      In the Pattern Frequency statistics, right-click a pattern, then click View rows. Email addresses that follow this pattern are displayed in SQL Editor.

In this tutorial, you started Talend Studio and learned about perspectives. You created and stored database connection information as metadata in the DQ repository. You created and ran a basic column analysis to examine customer data, and you finished by comparing email formats with a predefined regular expression.

You have finished the tutorial.