Data Prep 101: Getting Started with Talend Data Preparation
How many hours have you slaved away in MS Excel changing fields like customer name or address because you need to blend the two and your CRM or SFA doesn’t format data fields in the same way? I cannot tell you how many hours have I spent in MS Excel trying to clean up data or shape data to look how I need it to before sending to my ODS or CDW. As a former IT manager, I was always embarrassed to tell my business owners that their simple request to reformat or split fields in our ODS would take 3 weeks to develop, test and deploy to production. It just seemed crazy to me. I once worked with a gentleman, let's say his name is Bob, that got so good with Excel and VBA scripting that his Excel sheets became applications that his business owners loved so much they used his Excel sheets to replace an application that was meant to cleanse, shape, blend and aggregate our target physician data for the Sales team.
It's truly crazy how good people get with Excel macros and VBA scripting or creating full on visualizations of Mario. But, I never wanted to be an Excel wizard, and I am betting you don't either. Last November, I got early access to the pre-beta version of the new Talend Data Preparation tool and I must tell you this is going to save both IT and Business people HOURS of time.
What Can I Do With Data Prep?
With Talend Data preparation I can do and see so many things with my data in a matter of minutes that otherwise would take me forever and about 3 other additional tools to get done.
In the Data Prep tool, I can quickly do some impressive data discovery on the datasets I’m trying to work with. It uses these crazy cool semantic libraries to tell me if a field is of a certain type, and I don't mean just text or integers. I am talking about identifying if the data is a country field, US State, or date field (doesn't matter if it is different formats either!). So, when cleansing data, right off the bat I see if my data conforms exactly how it should as well as how much data within my datasets doesn’t match the expected data types using the cool data quality bars.
To top off the sweet data discovery and profiling, I can also fix most of my data issues in the Talend Data Preparation tool by using a few of the kick ass functions it provides. For example, if one of my fields, like first name, has leading or trailing spaces I can clean those spaces out with one click.
Think about doing this same job Excel; I would have to write a function to trim left and right, about 5-6 clicks and some typing. I can do a lot of other great functions like split a single field into multiple fields based on some delimiter like an underscore or a dash. How many systems put those concatenated fields into one field and it's up to you to split them out? The answer is too many to count. Well, with this simple function it’s no longer an issue. There are hundreds of functions that you can choose from that help you fix your data, and they give great suggestions of possible functions based on the type of data you are looking at in the field.
Did I mention the visualization in the tool? It is packed with helpful charts and graphs of the field's data to help you zero in on data that needs to be resolved.
Go Hands on With Data Prep
In our first Data Prep intro video we show you how one field, which is a marketing lead score, can be cleaned in an instant. The field we’re looking at should be an absolute value between 0 and 100, but the graph in the tool is extremely skewed too the far right indicating an issue with the number. Once we click on the graph to filter my top scores and we can see that some of my data came with 999 as a score.
This is clearly a default in a system, which always wreaks havoc on other systems. With a quick double click and typing in 0 then checking a box to apply to all data with the same value I fix the high end issue. But, when the graph refreshes we see a new problem. We now have negative marketing scores (this data is killing me)!! But no fear, a simple (suggested) function changes those negatives to zero for me! Now, think about how you can apply similar logic to things like inventory levels, system ratings gone wrong, you name it, you can clean it up here.
The Recipe for Clean Data
Here’s the cool part, throughout the process I am taking action on the data, the tool is keeping track of my cleansing in a simple list we call “the recipe”. Yeah, the recipe to fix my data is just listed for my convenience on the left side of the tool and if I want to remove a step because I don't need it any more I can just click on a little trash can and there it goes out of the recipe!
I can also save the recipe so that I can share it with friends and colleagues or reuse it on a new dataset. I can export my clean and shaped data out to a new dataset or into other systems like Tableau to see the results of my work. The great thing about those recipes is that my original data is untouched. The preparation recipe is just showing me what the data would look like if I took those actions. To realize the end result of the recipe's preparation I just need to export the results. (Just in case you can't change the original dataset, just a little compliance issue many people deal with!)
There are endless possibilities in what you can do with the Talend Data Preparation tool and your data. If you spend hours fixing and cleaning or reshaping your data files in spreadsheets or other tools on your desktop then you NEED to check out Talend Data preparation out today! I started out talking about data from a CRM or SFA being mismatch and a mess, well just think about the Data Lakes and Reservoirs sitting out in the Big Data Hadoop systems! Typically, this is raw untransformed data which is going to cause an even bigger headache to cleanse and utilize. The Talend Data Preparation tool is able to connect to all your data repositories and help you speed up that cycle of discovering, cleansing, resolving all your troublesome data.
I invite you to watch this introduction video, and even better try out the free Desktop version for yourself, import your worst data and let us know how you fixed your data with Talend! I dare you.