Build for tomorrow: How to future-proof your data infrastructure
By Talend Team
As you establish your reputation as your company’s data leader, you’ll find yourself fielding a lot of different questions from around the organization. Some of the trickiest questions are going to be about the technical infrastructure needed to support your data initiatives — your data tech stack.
The specifics of your data stack are going to depend on how your company is organized, how you engage with employees and customers, what problems you’re trying to solve, and what security and regulatory compliance directives govern your data use. But there are a few universal pieces of advice to keep in mind.
First, even though you’re under intense pressure to get today’s problems solved yesterday, you’ll want to make strategic choices that will position you for success once your organization grows. That doesn’t mean you have to jump on to the latest and greatest the second it’s available. Try to avoid situations where you’re locked into a vendor that won’t be able to grow with you or trapped by a solution that you’ll have to rebuild from the ground up if your needs expand by 25%, 50%, or more.
Second, don’t get in your own way. Provide clear documentation on your company’s data policies so that it’s easy for different departments to experiment with the solutions that make sense for them. After all, the surest way to guarantee that your company is riddled with shadow IT is to make it too hard for employees to work with the tools they have.
And, finally, be honest with yourself about where you are and what you need right now. If you’re still growing, the best product on the market might not be the best solution for you — yet. By pulling the trigger on a heavy investment too soon, you could undermine confidence in the entire tech stack. Solving for the problem you have today will help you build runway to solve bigger problems in the future.
With those principles in mind, you are ready to start building a data tech stack that covers three areas: ingestion, quality, and storage.
Ingestion: Data integration
Data is streaming into your company from a million different places — forms and phones, point-of-sale tools, product interactions, customer feedback, maybe even connected IoT items like factory sensors. You want all that data, and you want to use all that data. But how do you bring it all together?
As you build your data tech stack, one of your first considerations should be a data integration solution. This will take the data from all those various sources and transform it to make it more consistent and usable.
A simple but common example is an address field. Let’s say you have a web form that forces a user to select their location from a dropdown list of US states using standard postal abbreviations. But when the user logs in through the mobile app, location is an open field that lets them type whatever they want. You could end up with contacts from “UT,” “Utah,” and the occasional mistyped “Utha.” How could you possibly use that field to segment your database of thousands of users? A data integration solution addresses this by mapping every variant to a single standard value, such as the postal abbreviation, before the data reaches your database.
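To make the address example concrete, here is a minimal Python sketch of the kind of transformation an integration step might apply. The lookup tables and the normalize_state function are hypothetical names chosen for illustration, not part of any particular product, and a real pipeline would cover every state and a broader set of misspellings.

```python
from typing import Optional

# Hypothetical lookup tables for illustration only.
STATE_ABBREVIATIONS = {"utah": "UT", "nevada": "NV", "idaho": "ID"}
KNOWN_TYPOS = {"utha": "UT"}
VALID_CODES = set(STATE_ABBREVIATIONS.values())


def normalize_state(raw: str) -> Optional[str]:
    """Map a free-text state value to a postal abbreviation, or None if unknown."""
    cleaned = raw.strip()
    # Already a valid postal abbreviation (in any letter case).
    if cleaned.upper() in VALID_CODES:
        return cleaned.upper()
    key = cleaned.lower()
    # Full state name.
    if key in STATE_ABBREVIATIONS:
        return STATE_ABBREVIATIONS[key]
    # Known misspelling; anything else is left unresolved for review.
    return KNOWN_TYPOS.get(key)


records = ["UT", "Utah", " utha ", "Atlantis"]
normalized = [normalize_state(value) for value in records]
```

Values that cannot be resolved come back as None rather than being guessed, so they can be routed to a person for review instead of silently polluting a segment.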
Quality: Data integrity and governance
Having data you can use is great. But having data you can trust? That is simply invaluable. As you build your data tech stack, plan to include measures for both data integrity and data governance.
Data integrity products ensure that your data is accurate and free from errors. Ideally, they should run in conjunction with data integration so that bad data never makes it into your system in the first place. If you do use a separate solution for data integrity, though, be sure it’s set to run frequently, which lowers the risk of using flawed data in your applications, analytics, and forecasting.
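As a rough illustration of checking integrity at the point of ingestion, the Python sketch below separates clean records from rejected ones before anything is stored. The field names (customer_id, email), the deliberately loose email pattern, and both function names are assumptions made for this example; real integrity rules depend on your own data.

```python
import re

# Loose, illustrative pattern; not a full email validator.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_record(record: dict) -> list:
    """Return a list of integrity problems found in one record (empty if clean)."""
    errors = []
    if not record.get("customer_id"):
        errors.append("missing customer_id")
    email = record.get("email") or ""
    if not EMAIL_PATTERN.match(email):
        errors.append("invalid email")
    return errors


def partition_records(records: list) -> tuple:
    """Split records into (clean, rejected); rejected items keep their error list."""
    clean, rejected = [], []
    for record in records:
        errors = validate_record(record)
        if errors:
            rejected.append((record, errors))
        else:
            clean.append(record)
    return clean, rejected


incoming = [
    {"customer_id": "c-1", "email": "ana@example.com"},
    {"customer_id": "", "email": "not-an-email"},
]
clean, rejected = partition_records(incoming)
```

Keeping the rejected records together with their reasons, rather than dropping them, makes it easier to fix problems at the source.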
The second component of data quality is data governance — the technology and policies that you use to maintain the health of your company’s data throughout its lifecycle. You will want to establish a set of rules that define who can take what action, upon what data, in what situations, using what methods. Then you need to make sure that you equip your team with the products they need to enforce those standards.
For example, if you are in a highly regulated industry such as healthcare, you will want to put extra protections in place to make sure that your customers’ private, personal data is secure. You’ll also want to document clear rules about who can access that data and under what circumstances. The products you use to manage data should allow you to set up permissions and access protocols to help enforce those documented rules.
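One way to picture documented access rules is as a small policy table that software can check. The Python sketch below is a simplified assumption, not how any specific governance product works: the roles, actions, and dataset names are hypothetical, and it models only who, what action, and what data, leaving out the situations and methods a full policy would also cover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One documented permission: which role may take which action on which dataset."""
    role: str
    action: str
    dataset: str


# Hypothetical policy for a healthcare-style setting.
POLICY = {
    Rule("clinician", "read", "patient_records"),
    Rule("data_steward", "update", "patient_records"),
    Rule("billing", "read", "invoices"),
}


def is_allowed(role: str, action: str, dataset: str) -> bool:
    """Deny by default: access is granted only if an explicit rule exists."""
    return Rule(role, action, dataset) in POLICY


can_clinician_read = is_allowed("clinician", "read", "patient_records")
can_billing_read = is_allowed("billing", "read", "patient_records")
```

The deny-by-default choice means a missing rule results in refused access, which is usually the safer failure mode for private, personal data.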
Storage: Data lakes and data warehouses
If you are very early in your process of developing your data infrastructure, it may be sufficient to pull the data you need directly from the various platforms and applications where it originates. But as your data needs expand, it will make sense to set up a more permanent, central repository for your data.
Your two primary options here are a data warehouse or a data lake. These are similar technologies, both widely used for storing data, but they are not interchangeable. Traditionally, data warehouses are more focused on structured data, built for a specific team or a specific purpose such as reporting. Of course, it is rare for a company to have nothing but structured data — research suggests that, while only 8% of companies have exclusively structured data, nearly two-thirds have a blend of structured and unstructured data. Data lakes are a broader solution that can accommodate both structured and unstructured data for a variety of uses. You may even want to use both, building a company-wide data lake that feeds into a data warehouse for a specific team.
Cloud — and multi-cloud
A decade or two ago, we were still operating in a mostly on-premises world. It just made sense to keep your servers where you could see them — safe, secure, and perfectly tailored to your use case and applications. It still makes a certain amount of sense for massive enterprise companies who can afford to build and maintain an expensive physical infrastructure for certain legacy applications. But the rest of us need something more flexible, scalable, and affordable. That’s why so much of the data industry has moved to the cloud.
Now that we’re living in a world transformed by COVID, the advantages of a cloud data infrastructure are more obvious than ever. How could you support a distributed workforce when the tools they need are located in an office hundreds of miles away? How could you ensure data security when you have remote workers logging into a system that was custom-built for local access only? How could you react to changing times and respond to a turbulent market when you’re chained to your investment in a hardware infrastructure that was built to solve a specific problem? The answer is clear: you couldn’t. That’s why the cloud today is more important than ever.
And for growing businesses, there’s a solution that makes even more sense than the cloud: multi-cloud. Instead of locking yourself into a single cloud vendor, you can deploy your data solution across as many platforms as you like, taking advantage of the best aspects of each. This is particularly important for growth-stage companies, since it provides the flexibility to grow organically instead of making a heavy, up-front investment in a single solution that might not be a perfect fit in the long run. Other advantages of the multi-cloud approach include:
- Expedience. When you don’t have to worry about making the right choice forever, it’s much easier to pick a solution that’s right for right now. You can keep building, innovating, and moving forward without stress.
- Cost savings. This applies not only to actual expenditures but also to opportunity cost. What happens if you make an acquisition? If the acquired company runs on a different cloud, a multi-cloud approach can spare the precious engineering time otherwise spent re-platforming.
- Freedom. It’s wise to avoid getting locked in with one vendor. If your priorities — or their services — change, you can easily move to a new vendor or shift the balance of assets within your existing infrastructure.
- Security. When you aren’t completely reliant on a single vendor, the risk of compromise gets distributed across multiple platforms. Plus, you can take advantage of the best security options available by picking and choosing offerings from different vendors.
- Innovation. Cloud computing is a competitive space — and that’s a great thing for customers. With a multi-cloud strategy in place, you can take advantage of cutting-edge services, no matter which public cloud provides them.
As the saying goes, the only constant is change. As you select the vendors, solutions, and processes that make up your data infrastructure, make sure that you are giving your company a solid foundation from which to grow. Establish common-sense rules and best practices, and, whenever possible, select vendors with a flexible solution that can adapt to your changing needs over time.
Take a deeper dive into data warehouses and lakes with “The Definitive Guide to Cloud Data Warehouses and Cloud Data Lakes”. This helpful white paper explains what you need to look for when starting to create your cloud data warehouse or data lake, a 3-step plan to make sure your data warehouse investment succeeds, and real-world case studies of the tech stacks companies use to achieve their business goals with cloud data warehouse solutions.