We live in a world surrounded by data. From our daily grocery shopping to our mobile phone usage, fitness trackers, bank accounts and social media, practically everything we do is either driven by data or contributes to its volume. In this blog I would like to reiterate the importance of data and data preparation in the rapidly growing and demanding data warehousing world. This is only one of the many use cases for data preparation tools in today’s business environments. In later blogs, I’ll also cover some best practices and various potential use cases of the Talend Data Preparation tool.
Journey of Data
Let’s start with the journey of data. Data has evolved significantly in the last decade. It has grown in size, content, value and state. Today data comes in a variety of shapes, sizes and volumes. It may range from a small sample set to millions, billions or even trillions of pieces of data in widely varying forms such as text, voice, video, tapes, etc.
For many years, the data warehouse was believed to be static because data didn’t change that often. However, in today’s world data warehouses are either real-time or near real-time, dealing with rapidly changing data. Today businesses are becoming data-driven, and they are investing heavily in data preparation, either with self-service tools or through their data warehouse.
Importance of Data
Data, at its core, is the raw detail of transactions, events, statistics and recordings, collected primarily for business improvement, competition or product feedback. This raw data is not necessarily transparent, but it is very important because it provides the foundation for reporting the information, business metrics and trends used to make crucial decisions or run operations. Having the right data gives an organization insight into the criteria needed to ensure optimal business performance, uncover areas for improvement and drive other key aspects of the business. For example, for an organization like Talend, it is important to measure the number of active clients, expiring licenses, revenue and upsell/downsell from each client, etc. Having accurate data about the health of your business is important in order to make informed decisions and ultimately keep ahead in today’s data-driven competitive landscape.
From Data To Insight
Now that we know the importance of data, let’s look at how to convert raw data into meaningful insights. Data in a very raw form is not actionable for a business. Raw data is usually not in a readable format, has missing values and may contain errors or invalid information. Hence it becomes extremely important to put raw data into a consumable format.
Preparing Data for Insights
Data preparation is a process in which raw data goes through multiple phases. It needs to be assembled/integrated (if coming from multiple sources), cleansed, formatted, organized, completed and checked for accuracy and consistency so that it can be analyzed using business intelligence or business analytics programs and be a valid input to the decision support system. The data preparation process also focuses on business user requirements: improving data quality and completeness, and transforming data into a format that meets users' needs.
Let’s look at an example use case to get a better understanding. In the diagram given below, the raw data has details pertaining to two movie theaters. It has details like which movie was playing and how many customers bought tickets.
At first glance, this data set doesn’t give us any meaningful information. It is not consistent, has typos and is not complete. However, once it is prepared, it gives us a clean data set that we can use to determine business performance.
Business Intelligence (BI) professionals or business users could take this clean data and derive meaningful information from it, such as which theater has the most customers or which movie had the most ticket sales. When such analytical data is given to the business, it helps them decide whether they might want to stop playing Movie 1, play Movie 2 for another week, open another screen at Theater 2, etc.
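As a rough illustration of what "preparing" such a data set can involve, here is a minimal Python sketch. The sample rows, the correction tables and the function names (`normalize`, `parse_tickets`, `prepare`) are all hypothetical and only stand in for the kind of inconsistencies described above; this is not how Talend Data Preparation works internally.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical raw rows: inconsistent casing, stray spaces, typos, a missing count.
RAW_ROWS = [
    {"theater": "Theater 1", "movie": "Movie 1", "tickets": "120"},
    {"theater": "theater 1 ", "movie": "movie 2", "tickets": "95"},
    {"theater": "Theatre 2", "movie": "Movi 1", "tickets": ""},
    {"theater": "THEATER 2", "movie": "Movie 2", "tickets": "210"},
]

# Assumed lookup tables mapping known misspellings to canonical names.
THEATER_FIXES = {"theatre 2": "Theater 2"}
MOVIE_FIXES = {"movi 1": "Movie 1"}


def normalize(value: str, fixes: Dict[str, str]) -> str:
    """Collapse whitespace, unify case, and apply known spelling corrections."""
    key = " ".join(value.split()).lower()
    return fixes.get(key, key.title())


def parse_tickets(value: str) -> Optional[int]:
    """Return the ticket count as an int, or None when it is missing or invalid."""
    value = value.strip()
    return int(value) if value.isdigit() else None


def prepare(rows: List[Dict[str, str]]) -> Tuple[List[dict], List[dict]]:
    """Return cleaned rows plus the rows set aside for missing ticket counts."""
    clean, rejected = [], []
    for row in rows:
        tickets = parse_tickets(row["tickets"])
        record = {
            "theater": normalize(row["theater"], THEATER_FIXES),
            "movie": normalize(row["movie"], MOVIE_FIXES),
            "tickets": tickets,
        }
        (clean if tickets is not None else rejected).append(record)
    return clean, rejected


clean_rows, rejected_rows = prepare(RAW_ROWS)
```

Incomplete rows are set aside rather than silently dropped or filled in, so that someone who knows the business can decide how to complete them.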
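To make this concrete, here is a small Python sketch of how those two questions could be answered once the data is clean. The rows and ticket counts are invented for illustration and do not come from the example data set.

```python
from collections import Counter

# Hypothetical cleaned rows: (theater, movie, tickets sold). Values are illustrative only.
CLEAN_ROWS = [
    ("Theater 1", "Movie 1", 120),
    ("Theater 1", "Movie 2", 95),
    ("Theater 2", "Movie 1", 60),
    ("Theater 2", "Movie 2", 210),
]

# Sum ticket sales per theater and per movie.
tickets_by_theater = Counter()
tickets_by_movie = Counter()
for theater, movie, tickets in CLEAN_ROWS:
    tickets_by_theater[theater] += tickets
    tickets_by_movie[movie] += tickets

# most_common(1) returns a one-element list holding the (name, total) pair with the highest total.
busiest_theater, theater_total = tickets_by_theater.most_common(1)[0]
top_movie, movie_total = tickets_by_movie.most_common(1)[0]

print(f"Busiest theater: {busiest_theater} ({theater_total} tickets)")
print(f"Best-selling movie: {top_movie} ({movie_total} tickets)")
```

With these made-up numbers, Theater 2 and Movie 2 come out on top. Totals like these are only trustworthy because theater and movie names were made consistent first; otherwise "Theater 2" and "Theatre 2" would be counted separately.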
Data Preparation for Business Intelligence (BI)
Now that we understand how important data preparation is for organizations that make decisions using data, let’s have a look at the various techniques available for data preparation in the BI world.
- Manual Data Preparation: Preparing data manually using Excel or a similar tool is time consuming and error prone, and it does not lend itself to repeated tasks. It might work well for small data sets; however, it is not appropriate for large and complex data such as video. Typically, in such scenarios data preparation and analysis are done by the same person or team, who end up spending more time on preparation and less on the actual analysis. Eventually manual data preparation turns out to be costly, offers no reusability and ultimately ends up creating silos.
- Build a large data warehousing BI team: This team would build a data warehouse, which is time consuming and sometimes expensive. Typically, the team would follow the systems development life cycle (SDLC). End users have to be very thoughtful when giving requirements to the BI team, as any change in the requirements might affect the outcome. This approach typically carries significant overhead and is iterative in nature because of support, maintenance and sometimes changing requirements. However, this method ensures that the data analytics team gets the right input to act on.
- Use a governed, self-service data preparation tool that anyone can use, such as Talend Data Preparation: Such tools are fast, avoid manual errors and give the analytics team a single place to prepare data in a collaborative and controlled way.
All three methods listed above are widely used; however, the choice of data preparation method depends solely on individual or organizational needs and data availability.
Talend – Data Preparation
Talend Data Preparation allows business users and IT to do more with data in less time. A few of the activities the Talend Data Preparation tool supports are:
- Data Discovery
- Data Cleansing
- Data Visualization