Data Gravity: What It Means for Your Data
Data is only as valuable as the information it is used to create. The need for valuable, business-specific, data-driven information has created a level of demand that can only be met by maintaining vast amounts of data. As enterprises move forward, data will only continue to grow. This continual expansion has given rise to the phenomenon known as data gravity.
What is data gravity?
Data gravity is the observed characteristic of large datasets that describes their tendency to attract smaller datasets, as well as relevant services and applications. It also speaks to the difficulty of moving a large, “heavy” dataset.
Think of a large body of data, such as a data lake, as a planet, and of services and applications as its moons. The larger the data becomes, the greater its gravity. The greater the gravity, the more satellites (services, applications, and data) the data will pull into its orbit.
Large datasets are attractive because of the diversity of data available. They are also attractive (i.e., have gravity) because the technologies used to store such large datasets, such as cloud services, are available in various configurations that allow more choices in how data is processed and used.
The concept of data gravity is also used to indicate the size of a dataset and discuss its relative permanence. Large datasets are “heavy” and difficult to move. This has implications for how the data can be used and what kind of resources would be required to merge or migrate it.
As business data becomes an ever more valuable commodity, it is essential that data gravity be taken into consideration when designing solutions that will use that data. One must consider not only the current data gravity, but also its potential growth. Data gravity will only increase over time, and in turn will attract more applications and services.
How data gravity affects the enterprise
Data must be managed effectively to ensure that the information it is providing is accurate, up-to-date, and useful. Data gravity comes into play with any body of data, and as a part of data management and governance, the enterprise must take the data’s influence into account.
Without proper policies, procedures, and rules of engagement, the sheer amount of data in a warehouse, lake, or other dataset can become overwhelming. Worse yet, it can become underutilized. Application owners may revert to using only the data they own to make decisions, leading to incongruous decisions made about a single, multi-owned application.
Data integration is greatly affected by the idea of data gravity — especially the drive to unify systems and decrease the resources wasted by errors or the need to rework solutions. Placing data in one central arena means that data gravity will not collect slowly over time, but rather increase significantly in a short time.
Understanding how the new data gravity will affect the enterprise will ensure that contingencies are in place to handle the data’s rapidly increasing influence on the system. For example, consider how data gravity affects data analysis. Moving massive datasets into analytic clusters is an inefficient, not to mention expensive, process. The enterprise will need to develop better storage optimization that allows for greater data maneuverability.
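To make the cost of moving a massive dataset concrete, here is a minimal back-of-the-envelope sketch. The function name, the 80% link efficiency, and the example figures (500 TB over a 10 Gbps link) are illustrative assumptions, not measurements from any real system.

```python
def transfer_time_hours(dataset_tb: float, link_gbps: float,
                        efficiency: float = 0.8) -> float:
    """Estimate the hours needed to copy a dataset over a network link.

    dataset_tb -- dataset size in decimal terabytes (1 TB = 10**12 bytes)
    link_gbps  -- nominal link speed in gigabits per second
    efficiency -- assumed fraction of the nominal speed actually achieved
    """
    if dataset_tb < 0 or link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("size must be >= 0, speed > 0, efficiency in (0, 1]")
    total_bits = dataset_tb * 1e12 * 8             # terabytes -> bits
    effective_bps = link_gbps * 1e9 * efficiency   # usable bits per second
    return total_bits / effective_bps / 3600       # seconds -> hours


# Hypothetical example: a 500 TB data lake over a 10 Gbps link.
hours = transfer_time_hours(500, 10)
print(f"{hours:.1f} hours (about {hours / 24:.1f} days)")
```

Under these assumptions the copy takes roughly 139 hours, or close to six days of sustained transfer, which is why bringing the analysis to the data is often preferable to bringing the data to the analysis.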
The problem with data gravity
Data gravity presents data managers with two issues: latency and non-portability.
- Latency: By its very nature, a large dataset requires the applications that use it to be close, in its orbit, or suffer latency. This is because the closer the applications are to the data, the better the workload performance. Speed is critical to successful business operations, and letting latency increase as the data’s gravity increases is simply not an option. The enterprise will need to ensure that throughput and workload balance grow with the data’s gravity. This means moving applications to the same arena as the data in order to prevent latency and increase throughput. A good example of how to combat the latency issue is Amazon QuickSight; it was developed to rest directly on cloud data to optimize performance.
- Non-portability: Data gravity increases with the size of the dataset, and the larger the dataset, the more difficult it is to move. After all, moving a planet would be quite a feat. Moving vast quantities of data is slow and ties up resources in the process. Data gravity has to be taken into account any time the data needs to be migrated. Because the dataset grows continually, the enterprise needs to base its migration plans on the size the dataset will have reached by the time of the move, rather than on its current size. Data gravity also indicates how many services, applications, and additional datasets are likely to be attracted to the dataset, and should be considered when estimating that future size. Migration will require a specialized, often creative, plan in order to be successful.
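Both issues can be roughed out with simple arithmetic. The sketch below is illustrative only: the function names, the constant annual growth rate, and all example figures are assumptions chosen for the illustration, and real growth or network behavior will rarely be this regular.

```python
def round_trip_overhead_seconds(requests: int, rtt_ms: float) -> float:
    """Time spent purely waiting on the network for sequential requests."""
    if requests < 0 or rtt_ms < 0:
        raise ValueError("requests and rtt_ms must be non-negative")
    return requests * rtt_ms / 1000


def projected_size_tb(current_tb: float, annual_growth_rate: float,
                      years: float) -> float:
    """Project dataset size assuming constant compound annual growth."""
    if current_tb < 0 or annual_growth_rate <= -1 or years < 0:
        raise ValueError("invalid size, growth rate, or horizon")
    return current_tb * (1 + annual_growth_rate) ** years


# Latency: 10,000 sequential queries, application near vs. far from the data.
# The 1 ms and 70 ms round-trip times are hypothetical.
print(round_trip_overhead_seconds(10_000, 1))    # application beside the data
print(round_trip_overhead_seconds(10_000, 70))   # application in a distant region

# Non-portability: size a migration for the dataset as it will be.
# 200 TB today, an assumed 40% annual growth, migration planned in 3 years.
print(f"{projected_size_tb(200, 0.4, 3):.1f} TB")
```

With these hypothetical numbers, the distant application spends 700 seconds waiting on the network against 10 seconds for the nearby one, and a 200 TB dataset becomes roughly 549 TB by migration time, so a plan sized for today's volume would be short by more than half.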
Dealing with data gravity
Data gravity is a reality of the technological times that must be handled with as much finesse as possible to keep things moving smoothly and efficiently. The biggest weapons in the data manager’s arsenal will be data management and governance, as well as masterful data integration.
Data management is a must, regardless of whether the data is stored in the cloud or on-premises. It also lets the enterprise put data gravity to work: knowing how the data will be used, by whom, and for what purpose helps define which applications and services need to run in the cloud alongside the data.
With data gravity bringing in more applications and services over time, it is essential that data integrity be maintained to provide accurate and complete data.
Data governance is a core piece of data management. Data governance is best explained as a role system that defines accountability and responsibility with regard to the data.
Governance is paramount to overcoming data gravity issues because it produces better-quality data and allows for data mapping. Good data governance provides its own benefits and also supports better data management overall.
Data integration is how organizations increase the efficiency of systems and applications while also increasing the ability to leverage data.
While it might seem counterintuitive to use data integration as a means of dealing with data gravity, it boils down to having one data source over many. One central source would be voluminous to be sure, but it would also mean that the data manager is only contending with one data gravity source instead of several.
The future of the cloud and data gravity
The largest drawback to data gravity is a need for proximity between the data and the applications that need that data.
For example, more and more enterprises are seeking to share their data in an effort to produce a more valuable, robust dataset that would be mutually beneficial. To do this effectively, all of the enterprises involved would need close proximity to the data.
Enter the cloud. Enterprises across the country, or even across the globe, can achieve this proximity by leveraging cloud technology.
Cloud technology can be viewed as both a solution and a problem, however. Cloud technology has allowed for massive expansion of data bodies, which has served to increase data gravity rather than diffuse it.
On the other side of the coin, cloud technology serves as a means of defying data gravity by giving enterprises scalable processing power and close proximity to the needed data. This pushes the cloud to the fore, and discourages on-premises data storage.
How to start managing data gravity
Data gravity does not have to be an insurmountable problem. Data gravity is an environmental factor that affects the world of data, but knowing about these effects allows the data manager to take control and deal with the potential fallout. Although the problem has few exact answers, enterprises can take steps to mitigate the negative impact of data gravity through proper data management and data governance.
Data management and governance must evolve as technology and processes become more advanced. Dealing with increased complexity can seem daunting, but having the right tools goes a long way in easing that strain. Talend Data Fabric is a suite of applications that can help tackle data gravity by providing tools proven in the realms of data management, governance, and data integration.
Don’t be left adrift in your data’s orbit. Take the first step in controlling data gravity by seeing how Talend Data Fabric can assist you on your way to speedy, accurate, and superior data management.