In my last blog post, I shared some thoughts on the common pitfalls when building a data lake. As the movement to the cloud gets more and more common, I’d like to further discuss some of the best practices when building a cloud data lake strategy. When going beyond the scope of integration tools or platforms for your cloud data lake, here are 5 questions to ask, that can be used as a checklist:
1. Does your Cloud Data Lake strategy include a Cloud Data Warehouse?
As many differences as there are between the two, people often times compare the two types of technology approaches. Data warehouses being the centralization of structured data, and Data Lakes often times being the holy grail of all types of data. (You can read more about the two approaches here.)
Not to confuse the two, as these technology approaches should actually be brought together. You will need a data lake to accommodate all types of data that your business deal with today, make it structured, semi-structured or unstructured, on-premise or in the cloud, or those newer types of data such as IoT data. The data lake often time has a landing zone and staging zone for raw data – data at this stage are not yet consumable, but you may want to keep them for future discovery or data science projects. On the other hand, a cloud data warehouse will be in the picture after data is cleansed, mapped and transformed, so that it is more consumable for business analysts to access and make the use of data for reporting or other analytical use. Data at this stage is often time highly processed to adjust to the data warehouse.
If your approach currently only works with a cloud data warehouse, then often time you are losing raw and some formats of data already, it is not so helpful for any prescriptive or advanced analytics projects, or machine learning and AI initiatives as some meanings within the data is already lost. Vice versa, if you don’t have a data warehouse alongside with your data lake strategy, you will end up with a data swamp where all data is kept with no structure, and not consumable by analysts.
From the integration perspective, make sure your integration tool work with both data lake and data warehouse technologies, which will lead us to the next question.
Data Lakes: Purposes, Practices, Patterns, and Platforms now.
2. Does your integration tool have ETL & ELT?
As much as you may know about ETL in your current on-premises data warehouse, moving it to the cloud is a different story, not to mention in a cloud data lake context. Where and how data is processed really depends on what you need for your business.
Similar to what we described in the first question, sometimes you need to keep more of the raw nature of the data, and other times you need more processing. This would require your integration tool to cope with both ETL and ELT capabilities, where the data transformation can be handled either before the data is loaded to your final target, e.g. a cloud data warehouse, or after data is landed there. ELT is more often leveraged when the speed of data ingestion is key to your project, or when you want to keep more intel about your data. Typically, cloud data lakes have a raw data store, then a refined (or transformed) data store. Data scientists, for example, prefer to access the raw data, whereas business users would like the normalized data for business intelligence.
Another use of ELT refers to the massive parallel processing capabilities coming with big data technologies such as Spark and Flink. If your use case requires such a strong processing power, then ELT is a better choice where the processing has more scalability.
3. Can your cloud data lake handle both simple ETL tasks and complex big data ones?
This may look like an obvious question but when you ask about this question, put yourself in the users’ shoes and really think through if your choice of tool can meet both requirements.
Not all of your data lake usage will be complex ones that require advanced processing and transformation, many of them can be simple activities such as ingesting new data into the data lake. Often times, the tasks go beyond the data engineering or IT team as well. So ideally the tool of your choice should be able to handle simple tasks fast and easy, but also can scale to the complexity to meet the requirements of advanced use cases. Building a data lake strategy that can cope with both can help you make your data lake more consumable and practical for various types of users for different purposes.
4. How about batch and streaming needs?
You may think your current architecture and technology stack is good enough, and your business is not really in the Netflix business where streaming is a necessity. Get it? Well think again.
Streaming data has become a part of our everyday lives whether you realize it or not. The “Me” culture has put everything at the moment of now. If your business is on social media, you are in streaming. If IoT and sensor is the next growth market for your business, you are in streaming. If you have a website for customer interaction, you are in streaming. In IDC’s 2018 Data Integration and Integrity End User Survey, 93% of the respondents indicate the plan to use streaming technology by 2020. Real-time and streaming analytics have become a must for modern businesses today to create that competitive edge. So, this naturally raises the questions: can your data lake handle both your batch and streaming needs? Do you have the technology and people to work with streaming, which is fundamentally different from typical batch needs?
Streaming data is particularly challenging to handle because it is continuously generated by an array of sources and devices as well as being delivered in a wide variety of formats.
One prime example of just how complicated streaming data can be comes from the Internet of Things (IoT). With IoT devices, the data is always on; there is no start and no stop, it just keeps flowing. A typical batch processing approach doesn’t work with IoT data because of the continuous stream and the variety of data types it encompasses.
So make sure your data lake strategy and data integration layer can be agile enough to work with both use cases.
You can find more tips on streaming data, here.
5. Can your data lake strategy help cultivate a collaborative culture?
Last but not least, collaboration.
It may take one person to implement the technology, but it will take a whole village to implement it successfully. The only way to make sure your data lake is a success is to have people use it, improving the workflow one way or another.
In a smaller scope, the workflow in your data lake should be able to be reused and leveraged among data engineers. Less recreation will be needed, and operationalization can be much faster. In a bigger scope, the data lake approach can help improve the collaboration between IT and business teams. For example, your business teams are the experts of their data and they know the meaning and the context of data better than anyone else. Data quality can be much improved if the business team can work on the data for business rule transformations, while IT still governs that activity. Defining such a line with governance in place is a delicate work and no easy task. But you may think through your data lake approach, whether it’s governed but open at the same time to encourage not only final consume /usage of the data, but the improvement of data quality in the process, and be recycled to be available to a broader organization.
To summarize, there we go the 5 questions I would recommend asking when thinking about building a cloud data lake strategy. By no means are these the only questions you should think, but hopefully it initiates some thinking outside of your typical technical checklist.