It’s no longer possible to think about enterprise data as discrete chunks of information. Data is everywhere now. More than that, it is all interconnected. At any organization, the same body of data passes through many hands and informs decisions across multiple departments. Corporate data has become a cohesive living thing that flows throughout information systems.
To extend our living data analogy, it’s up to all of us (data managers, data workers, and decision makers alike) to manage the health of our data across the data life cycle.
What is the data life cycle?
The data life cycle, also called the information life cycle, refers to the entire period of time that data exists in your system. This life cycle encompasses all the stages that your data goes through, from first capture onward.
In life science, every living thing undergoes a series of phases: infancy, a period of growth and development, productive adulthood, and old age. These phases vary across the tree of life. Salmon die right after they spawn, while whales live to be grandmothers. Even if they live in the same field, a mouse, fox, and butterfly will all have very different life cycles.
In the same way, different data objects will go through different stages of life at their own cadences. That said, here’s one example of a data lifecycle framework:
- Data creation, ingestion, or capture
- Whether you generate data through data entry, acquire existing data from other sources, or receive signals from devices, you get information somehow. This stage describes the point at which data values first enter your system.
- Data processing
- There are many processes involved in cleaning and preparing raw data for later analysis. While the order of operations may vary, data preparation typically includes integrating data from multiple sources, validating data, and applying transformations. Data is often reformatted, summarized, subset, standardized, and enriched as part of the data processing workflow.
- Data analysis
- However you analyze and interpret your data, this is where the magic happens. Exploring and interpreting your data may require a variety of analyses. This could mean statistical analysis and visualization. It can also mean using traditional data modeling or applying artificial intelligence (AI).
- Data sharing or publication
- This stage is where forecasts and insights turn into decisions and direction. When you disseminate the information gained from data analysis, your data delivers its full business value.
- Data storage and archiving
- Once data has been collected, processed, analyzed, and shared, it is typically stored for future reference. For archives to have any future value, it’s important to keep metadata about each item in your records, particularly about data provenance.
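As a rough illustration, the stages above can be modeled as a small loop that a data asset moves through while keeping a provenance trail. This is only a sketch: the `Stage` and `DataAsset` names, the `ARCHIVAL` label for the final stage, and the sample asset are assumptions made for the example, not part of any particular product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Stage(Enum):
    CAPTURE = 1
    PROCESSING = 2
    ANALYSIS = 3
    SHARING = 4
    ARCHIVAL = 5  # assumed label for the final, storage-oriented stage

    def next_stage(self) -> "Stage":
        # Wrap around from the last stage back to the first.
        members = list(Stage)
        return members[(members.index(self) + 1) % len(members)]


@dataclass
class DataAsset:
    name: str
    source: str  # provenance: where the data originally came from
    stage: Stage = Stage.CAPTURE
    history: list = field(default_factory=list)

    def advance(self) -> None:
        # Record the stage being left and when, then move to the next stage.
        self.history.append((self.stage, datetime.now(timezone.utc)))
        self.stage = self.stage.next_stage()


# Hypothetical asset that passes once through every stage.
asset = DataAsset(name="web_orders", source="checkout form")
for _ in range(len(Stage)):
    asset.advance()

print(asset.stage.name, len(asset.history))
```

Keeping the `source` and `history` fields alongside the data is one simple way to preserve the provenance metadata that makes an archive useful later.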
The data life cycle proceeds from the last step back to the first in a never-ending circle. Of course, in the twenty-first century, one factor has seriously complicated the way we work with data.
How do we scale up for the life cycle of big data?
Big data life cycle
It’s not news to anyone that data volumes have grown enormously in recent years and continue to grow. Enterprises are working with more and more SaaS and web applications, and capturing more data from them. At the same time, more of the global population is joining the internet, clicking links, snapping images, and filling in web forms. Above all else, smart devices and the Internet of Things (IoT) are continually finding new ways to measure everything in the known universe.
You don’t necessarily need (or want) to collect all the data in the universe. While it might sound nice to have every scrap of information at your disposal, data management challenges scale with data volume. More data means higher data storage costs. The more data you have, the more resources you’ll need for data preparation and analysis. Companies that simply collect more and more data without the right digital transformation strategy quickly find themselves in possession of a digital landfill. Sure, they have lots of data. But no one can find what they need, what they can find doesn’t make sense, and they can’t trust it to make business decisions.
In order to scale up for big data without going overboard, build some precautions into your big data life cycle. Three activities typically lead to problems: over-collecting data, managing it poorly, and hoarding deprecated data. Here’s what to do instead:
- Refine your data collection process
- To avoid data overcollection, don’t collect all generated data. Instead, create a plan to define and capture only the data that’s relevant to your project.
- Implement effective data management
- Catalog your data so it’s easy to find and use, and build an infrastructure that combines manual and automated monitoring and maintenance to keep that data healthy.
- Dispose of unnecessary data
- Once records have outlived their usefulness, consider deleting or purging them. Keep in mind any legal obligations to either maintain or delete old records, and establish a clear schedule for data deletion.
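A deletion schedule like the one described above could be sketched as a simple check of each record's age against a retention period for its category. Everything here is hypothetical: the `Record` fields, the categories, and the retention periods are illustrative values, and real retention rules depend on your own legal and business requirements.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Record:
    record_id: str
    category: str
    last_used: date
    legal_hold: bool = False  # must be kept regardless of age


# Hypothetical retention periods in days, per category.
RETENTION_DAYS = {"web_logs": 90, "invoices": 365 * 7}


def due_for_deletion(records, today):
    """Return records past their retention period and not under legal hold."""
    due = []
    for rec in records:
        limit = RETENTION_DAYS.get(rec.category)
        if limit is None or rec.legal_hold:
            continue  # no schedule defined, or an obligation to keep the record
        if today - rec.last_used > timedelta(days=limit):
            due.append(rec)
    return due


records = [
    Record("a1", "web_logs", date(2023, 1, 1)),
    Record("a2", "web_logs", date(2023, 1, 1), legal_hold=True),
    Record("a3", "invoices", date(2023, 1, 1)),
]
print([r.record_id for r in due_for_deletion(records, date(2024, 1, 1))])
```

Records with no defined schedule are skipped rather than deleted, so nothing is purged by accident when a category has not been reviewed yet.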
Note that advice to purge old data can be controversial. There are those who maintain a “never delete anything” philosophy. They believe that it’s worthwhile in the long run to save all data for as long as possible. That said, storing data that no longer serves a purpose doesn’t just cost more; it can also create liabilities and leave you open to risk. This is particularly true in the case of sensitive personal data.
At Talend, we believe that the value of data depends on its usefulness to the business. That’s why it’s important to manage your data lifecycle the right way.
Data lifecycle management
The data life cycle is no good to anyone as an abstract concept. Its purpose is to help organizations deliver the data health that end-users need to fuel decisions. To that end, data lifecycle management needs to be transparent and iterative.
To make the life cycle tangible, document the flow of data through your organization with a map of data lineage. That means visually mapping the origin of your data along with each stop it makes and an explanation of why it may or may not have moved at that point. Life cycle documentation helps simplify tracking for everyday data operations. It also makes it easier to study and resolve bottlenecks and failure points.
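Before drawing a visual map, lineage can be captured as plain data. The sketch below is one possible shape, assuming each dataset lists the datasets it was derived from together with a short note about the step. The dataset names and notes are invented for the example, and the traversal assumes the lineage contains no cycles.

```python
# Each dataset maps to (parent dataset, note about the step) pairs.
# All names and notes are hypothetical.
LINEAGE = {
    "sales_dashboard": [("sales_clean", "aggregated by region")],
    "sales_clean": [
        ("crm_export", "deduplicated"),
        ("web_orders", "validated"),
    ],
    "crm_export": [],
    "web_orders": [],
}


def trace_origins(dataset, lineage):
    """Return the root sources that a dataset ultimately derives from."""
    parents = lineage.get(dataset, [])
    if not parents:
        return {dataset}  # nothing upstream: this dataset is an origin
    origins = set()
    for parent, _note in parents:
        origins |= trace_origins(parent, lineage)
    return origins


print(sorted(trace_origins("sales_dashboard", LINEAGE)))
```

The notes attached to each hop are where the explanation for each stop would live, which is what makes the map useful for diagnosing bottlenecks and failure points.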
Any processes that limit the usefulness of the data are counterproductive and should be caught and adjusted in future cycles. Reuse lessons learned throughout the process to inform the next cycle and maximize data health.
The data sharing stage is often a challenge for organizations. A top-down approach with tightly controlled access to data doesn’t scale well. Data infrastructure that relies on gatekeepers creates situations where IT becomes overwhelmed with requests, and end users have trouble getting the data they need in a timely fashion. On the other hand, a bottom-up approach with open access to all data makes it hard to maintain the security and privacy of sensitive data. Good data governance sits between these extremes: it ensures that end users get the data they need when they need it, and nothing more.
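One common way to land between those two extremes is to grant access by role and data sensitivity rather than by individual request. The following is a minimal sketch of that idea only; the roles, sensitivity labels, and dataset names are all hypothetical.

```python
# Hypothetical roles and the sensitivity levels each one may read.
ROLE_CLEARANCE = {
    "analyst": {"public", "internal"},
    "hr_manager": {"public", "internal", "personal"},
}

# Hypothetical datasets, each tagged with a sensitivity level.
DATASET_SENSITIVITY = {
    "product_catalog": "public",
    "sales_forecast": "internal",
    "employee_records": "personal",
}


def can_read(role, dataset):
    """Allow access only when the role is cleared for the dataset's sensitivity."""
    sensitivity = DATASET_SENSITIVITY.get(dataset)
    allowed = ROLE_CLEARANCE.get(role, set())
    # Unknown roles and untagged datasets are denied by default.
    return sensitivity is not None and sensitivity in allowed


print(can_read("analyst", "sales_forecast"))
print(can_read("analyst", "employee_records"))
```

Because the policy is data rather than a queue of tickets, users can get what they are cleared for without waiting on a gatekeeper, while sensitive datasets stay closed by default.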
Data analytics life cycle
If we think of our data as a living thing with its own life cycle, we want to give it a healthy life. That means that it isn’t enough just to think about the technological systems the data flows through. To get business value from your data, you have to build data teams and set up data infrastructure with a top goal of making that data usable. Data-centric processes help people and technologies work together toward that goal.
Research finds that it’s valuable for those who work with data to be involved across the data life cycle. In our 2021 data health survey, 78% of executives reported challenges using company data to make decisions. The research revealed something interesting, though. Executives who primarily either deliver or consume data report low rates of confidence in their data and don’t feel very strongly that they make data-driven decisions. On the other hand, executives who work on both sides of data report understanding the data better and making more data-driven decisions.
Because we don’t forget the human side of the data management equation, Talend helps organizations facilitate data-driven decisions that everyone can trust. As a single platform for data integration, integrity, and governance, Talend Data Fabric makes it easier for decision-makers to work on both sides of their data. When organizations create data infrastructure that supports human expertise across the data life cycle, they truly give their data life.