Using Talend to Gather Data About Data
What's more exciting than data? Data about data!
Recently I had to assess the impact of data model changes within a transactional system feeding our data warehouse. I pointed our SQL scripts and ETL processes at the pre-production environment, ran the ETL jobs, and did some basic regression analysis. For every table and column, I wanted to know whether there were any significant changes warranting further investigation:
- Number of rows
- Number of nulls per column
- Min and max length of each column
- Distinct values per column
The idea was that any large shift in data volume or in null counts could indicate something worth investigating.
I could have used Oracle dictionary tables like "ALL_TAB_COL_STATISTICS" to get some of this information, but I wanted more control over the final output. The end result needed to be a data set suitable for analysis in Tibco Spotfire.
I built a simple table within the schema to hold the metrics for each table/column, and then created a Talend job. The job gathers stats using dynamic queries, retrieving table and column names from the Oracle system table, "ALL_TAB_COLUMNS". The stats table created to store the final output is below.
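The original DDL isn't reproduced here, but based on the metrics listed above, the DB_STATS table would look something like the following sketch (column names and types are my own illustration; the actual table may differ):

```sql
-- Hypothetical sketch of the DB_STATS table; the actual DDL
-- in the original job may use different names or types.
CREATE TABLE DB_STATS (
    RUN_PHASE      VARCHAR2(10),  -- 'Before' or 'After' the data model change
    TABLE_NAME     VARCHAR2(30),
    COLUMN_NAME    VARCHAR2(30),
    NUM_ROWS       NUMBER,        -- total rows in the table
    NUM_NULLS      NUMBER,        -- nulls in this column
    MIN_LENGTH     NUMBER,        -- shortest value in this column
    MAX_LENGTH     NUMBER,        -- longest value in this column
    NUM_DISTINCT   NUMBER         -- distinct values in this column
);
```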
I ran the job before and after the ETL change, and then used Spotfire to look for outliers in the metrics. Below, for example, there is a big difference in the “number of nulls” metric for the "STATE_PROVINCE" column when comparing "before" vs "after".
This is a manufactured example, using the Oracle Express HR sample schema. The Talend job was straightforward; the most complicated part was creating the dynamic SQL statements.
Step 1: Create the Oracle connection
Step 2: Iterate over the Oracle tables
Use a "tOracleTableList" component to iterate through the tables. I am excluding the STATS table itself (DB_STATS) using a WHERE clause in the component.
This component iterates over the table names. During each iteration I can grab the table name from the globalMap. The table name is used to build the dynamic SQL statements:
Step 3: Get the columns names for the current table iteration
Now that we have the table name, we can create a query against the "ALL_TAB_COLUMNS" system table in Oracle. The query simply returns the column names for the table. We pull the table name out of the globalMap, courtesy of the “tOracleTableList”.
It is not obvious from the screenshot, but there are single quotes and double quotes next to each other. This is a theme for most of the dynamic SQL statements in this article.
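To illustrate the quote nesting, the query expression in the component is a Java string, and the SQL single quotes sit directly inside the Java double quotes. A sketch of what that expression might look like is below; the globalMap key follows Talend's default component naming ("tOracleTableList_1"), so yours may differ:

```java
import java.util.Map;

public class ColumnListQuery {
    // Builds the dynamic query Talend would run for one table iteration.
    // The SQL single quotes live inside the double-quoted Java string
    // literals, which is the "quotes next to each other" effect.
    static String build(Map<String, Object> globalMap) {
        return "SELECT COLUMN_NAME FROM ALL_TAB_COLUMNS"
             + " WHERE TABLE_NAME = '"
             + (String) globalMap.get("tOracleTableList_1_CURRENT_TABLE")
             + "'";
    }
}
```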
Step 4: Iterate over the column names
Later we will need both the table name AND a column name to create SQL statements, so I added a "tFlowToIterate" component. This will let me iterate over the column names. The column name for each iteration will get added to the globalMap as a variable.
At any point I can now pull out both the table name and column name from the globalMap. Note I also have access to other fields I am not using in this example, such as "DATA_TYPE" and "DATA_LENGTH" which are available in the “ALL_TAB_COLUMNS” table.
Step 5: Execute the SQL needed to gather the metrics
This may need some explaining. If, for a particular iteration, the table name is "DEPARTMENTS" and the column name is "DEPARTMENT_NAME", to gather some basic stats and produce one row of output I need to create a query like this:
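The screenshot of the generated query isn't reproduced here, but based on the metrics listed at the start of the article, it would look something like this hypothetical reconstruction (output column aliases are illustrative):

```sql
-- Hypothetical reconstruction of the generated stats query for
-- DEPARTMENTS.DEPARTMENT_NAME; the original may differ in details.
SELECT 'DEPARTMENTS'     AS TABLE_NAME,
       'DEPARTMENT_NAME' AS COLUMN_NAME,
       (SELECT COUNT(*) FROM DEPARTMENTS)                                AS NUM_ROWS,
       (SELECT COUNT(*) FROM DEPARTMENTS WHERE DEPARTMENT_NAME IS NULL) AS NUM_NULLS,
       (SELECT MIN(LENGTH(DEPARTMENT_NAME)) FROM DEPARTMENTS)           AS MIN_LENGTH,
       (SELECT MAX(LENGTH(DEPARTMENT_NAME)) FROM DEPARTMENTS)           AS MAX_LENGTH,
       (SELECT COUNT(DISTINCT DEPARTMENT_NAME) FROM DEPARTMENTS)        AS NUM_DISTINCT
FROM DUAL;
```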
We are selecting from the special Oracle table "DUAL", a dummy table with one row and one column, which is great when you need this kind of query to produce a single row of output. You can see above that we have generated some basic stats for the table and column.
Note: There is some redundancy here: for each column I am recalculating the total number of rows in the table, which really only needs to be done once per table.
To create this SQL, we have to generate a lot of the values dynamically, by extracting the table and column name from the globalMap variables created earlier in the job.
The next component to add is a "tOracleInput" component. Here we construct and execute the dynamic SQL above, for each table and column. Again, we do get into a little bit of "single quote/double quote" hell, but it isn't too bad.
I am also adding a "Before" or "After" value from the Talend context to indicate if this is a view of the stats before or after we implemented the data model changes.
It isn't as bad as it looks; we are basically trying to recreate the SQL above by pulling the table and column names out of the globalMap:
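A sketch of what that tOracleInput query expression might look like as Java string concatenation is below. The globalMap keys ("tOracleTableList_1_CURRENT_TABLE", "row1.COLUMN_NAME") and the "phase" parameter standing in for the Talend context variable are assumptions based on default Talend naming:

```java
import java.util.Map;

public class StatsQueryBuilder {
    // Builds the per-column stats query the way a Talend tOracleInput
    // expression would: Java string concatenation around globalMap lookups.
    // "phase" stands in for the Talend context variable ('Before'/'After').
    static String build(Map<String, Object> globalMap, String phase) {
        String tab = (String) globalMap.get("tOracleTableList_1_CURRENT_TABLE");
        String col = (String) globalMap.get("row1.COLUMN_NAME");
        return "SELECT '" + phase + "' AS RUN_PHASE, "
             + "'" + tab + "' AS TABLE_NAME, "
             + "'" + col + "' AS COLUMN_NAME, "
             + "(SELECT COUNT(*) FROM " + tab + ") AS NUM_ROWS, "
             + "(SELECT COUNT(*) FROM " + tab + " WHERE " + col + " IS NULL) AS NUM_NULLS, "
             + "(SELECT MIN(LENGTH(" + col + ")) FROM " + tab + ") AS MIN_LENGTH, "
             + "(SELECT MAX(LENGTH(" + col + ")) FROM " + tab + ") AS MAX_LENGTH, "
             + "(SELECT COUNT(DISTINCT " + col + ") FROM " + tab + ") AS NUM_DISTINCT "
             + "FROM DUAL";
    }
}
```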
The schema is straightforward:
Step 6: Congratulate Ourselves ;-)
You are now querying a database (Oracle system catalog table), to get data about the database (table and column names), so you can create dynamic SQL on the fly - to get more data about the database (metrics)...
Step 7: Insert the record into the stats table
This is a simple insert of the stats record we just created through dynamic SQL, using a "tOracleOutput" component to write the results into the DB_STATS table:
Step 8: Commit (or rollback)
If the subjob completes without error, we perform an Oracle commit. This is necessary because I didn't choose auto-commit in my tOracleConnection in Step 1:
If you have hundreds of tables, a trillion rows, and lots of columns, then think carefully before pulling the trigger. It took me a few hours to execute the job against approximately 150 tables (about 200 million rows in total), but when I put the data into Spotfire it gave me an incredibly useful view of my data, and it picked up some changes I would have missed otherwise.
The changes in the transactional system had in fact included logic changes to several key PL/SQL procedures used during our ETL extractions.
Disclaimer: All opinions expressed in this article are my own and do not necessarily reflect the position of my employer.