3 Common Pitfalls in Building Your Data Lake and How to Overcome Them
Recently I had the chance to talk to an SVP of IT at one of the largest banks in North America about their digital transformation strategy. What struck me as we spoke was that they described their approach to big data and digital transformation as ever evolving. New technologies keep coming to market, and each one requires new pivots and approaches before the business can take advantage of it. That makes it more important than ever to have an agile architecture that can sustain and scale with your data and analytics growth. Here are three common pitfalls we often see when building a data lake, and our thoughts on how to overcome them:
“All I need is an ingestion tool”
Ah yes, the development of a data lake is often seen as the holy grail of everything. After all, now you have a place to dump all of your data. The first issue most people run into is data ingestion: how do you collect and ingest the sheer variety and volume of data coming into a data lake? Any success with data collection feels like a quick win. So you buy a solution for data ingestion, and all the data can now be captured and collected like never before. Well, problem solved, right? Temporarily, maybe, but the real battle has just begun.
Soon enough you will realize that simply getting your data into the lake is just the start. Most data lake projects fail because the lake turns into a big data swamp with no structure, no quality, a lack of talent, and no trace of where the data actually came from. Raw data is rarely useful on its own, since it still needs to be processed, cleansed, and transformed before it can support quality analytics. This often leads to the second pitfall.
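To make the "swamp" problem concrete, here is a minimal sketch of what an ingestion step could record beyond simply landing the data: a basic quality check and a note of where the batch came from. Everything in it is an assumption for illustration, including the `LineageRecord` fields, the `ingest_batch` function, and the `core_banking` source name; it is not how any particular ingestion product works.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class LineageRecord:
    """Hypothetical metadata captured alongside each ingested batch."""
    source_system: str
    ingested_at: str
    row_count: int
    rejected_count: int
    checksum: str


def ingest_batch(rows, source_system, required_fields):
    """Split rows into clean and rejected, and describe where they came from."""
    clean, rejected = [], []
    for row in rows:
        # A row is rejected if any required field is missing or empty.
        if all(row.get(field) not in (None, "") for field in required_fields):
            clean.append(row)
        else:
            rejected.append(row)

    # The checksum lets a later reader verify the clean batch is unchanged.
    payload = json.dumps(clean, sort_keys=True).encode("utf-8")
    record = LineageRecord(
        source_system=source_system,
        ingested_at=datetime.now(timezone.utc).isoformat(),
        row_count=len(clean),
        rejected_count=len(rejected),
        checksum=hashlib.sha256(payload).hexdigest(),
    )
    return clean, rejected, record
```

Calling `ingest_batch(rows, "core_banking", ["account_id", "amount"])` returns the clean rows, the rejected rows, and a record you could store next to the data, so that anyone who finds the batch later can see its origin and how much of it failed the check.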
Hand coding for your data lake
We have had many blogs in the past on this, but you can’t emphasize this topic enough. Hand coding may look promising when you consider only the initial deployment costs, but the maintenance costs can increase by upwards of 200%. The lack of big data skills, on both the engineering and analytics sides, as well as the move to the cloud, adds even more complexity to hand coding. Run the checklist here to help you determine when and where to use custom coding for your data lake project.
With the rising demand for faster analytics, companies today are looking for more self-service capabilities when it comes to integration. But self-service can easily put you in peril without proper governance and metadata management in place. As many basic integration tasks go to citizen integrators, it becomes more important to ask: is there governance in place to track that work? Is access to your data given to the right people at the right time? Is your data lake enabled with proper metadata management so that your self-service data catalog is meaningful?
Don’t look for an avocado slicer.
As the data lake market matures, everyone is looking for more, and yet struggling with each phase as they go through the filling, processing, and managing of data lake projects. To put this in perspective, here is a snapshot of the big data landscape from VC firm FirstMark from 2012:
And this is how it looks in 2017:
The big data market landscape is growing like never before as companies become clearer about what they need. Looking across these three pitfalls, the biggest piece of advice I can offer is to avoid what I like to call “an avocado slicer”. Yes, it might be interesting and fancy, and it might work perfectly for what you are looking for right now, but you will soon realize it’s a purpose-built point solution that might only work for ingestion, might only be compatible with one processing framework, or might only meet one department’s particular needs. Instead, take a holistic approach to your data lake strategy. What you really need is a well-rounded culinary knife! Otherwise, you may end up with an unnecessary number of technologies and vendors to manage in your technology stack.
In my next post, I’ll be sharing some of the best questions to ask for a successful data management strategy.