Introduction to the Agile Data Lake
Let’s be honest, the ‘Data Lake’ is one of the latest buzzwords everyone is talking about. Like many buzzwords, few really know how to explain what it is, what it is supposed to do, or how to design and build one. As pervasive as they appear to be, you may be surprised to learn that Gartner predicts that only 15% of Data Lake projects make it into production. Forrester predicts that 33% of Enterprises will take their attempted Data Lake projects off life-support. That’s scary! Data Lakes are about getting value from enterprise data, and given these statistics, that nirvana appears to be quite elusive. I’d like to change that by sharing my thoughts and, hopefully, some guidance on how to design, build, and use a successful Data Lake: an Agile Data Lake. Why agile? Because to be successful, it needs to be.
Ok, to start, let’s look at the Wikipedia definition for what a Data Lake is:
“A data lake is a storage repository that holds a vast amount of raw data in its native format, incorporated as structured, semi-structured, and unstructured data.”
Not bad. Yet considering we need to get value from a Data Lake, this Wikipedia definition is just not quite sufficient. Why? The reason is simple: you can put any data in the lake, but you need to get data out, and that means some structure must exist. The real idea of a data lake is to have a single place to store all enterprise data, ranging from raw data (which implies an exact copy of source system data) through transformed data, which is then used for various business needs including reporting, visualization, analytics, machine learning, data science, and much more.
I like a ‘revised’ definition from Tamara Dull, Principal Evangelist, Amazon Web Services, who says:
“A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data, where the data structure and requirements are not defined until the data is needed.”
Much better! Even Agile-like. The reason why this is a better definition is that it incorporates both the prerequisite for data structures and the expectation that the stored data will be used in some fashion, at some point in the future. From that we can reasonably expect value, and it follows that an Agile approach is absolutely required. The data lake therefore includes structured data from relational databases (basic rows and columns), semi-structured data (like CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs), and even binary data (typically images, pictures, audio, & video), thus creating a centralized data store accommodating all forms of data. The data lake then provides an information platform upon which to serve many business use cases when needed. It is not enough that data goes into the lake; data must come out too.
And, we want to avoid the ‘Data Swamp’ which is essentially a deteriorated and/or unmanaged data lake that is inaccessible to and/or unusable by its intended users, providing little to no business value to the enterprise. Are we on the same page so far? Good.
Data Lakes – In the Beginning
Before we dive deeper, I’d like to share how we got here. Data Lakes represent an evolution resulting from an explosion of data (volume-variety-velocity), the growth of legacy business applications plus numerous new data sources (IoT, WSL, RSS, Social Media, etc.), and the movement from on-premise to cloud (and hybrid).
Additionally, business processes have become more complex, and new technologies have been introduced that enhance business insights and data mining and explore data in new ways, like machine learning and data science. Over the last 30 years we have seen the pioneering of the Data Warehouse (from the likes of Bill Inmon and Ralph Kimball) for business reporting, all the way through to the Agile Data Lake (adapted by Dan Linstedt, yours truly, and a few other brave souls) supporting a wide variety of business use cases, as we’ll see.
To me, Data Lakes represent the result of this dramatic data evolution and should ultimately provide a common foundational, information warehouse architecture that can be deployed on-premise, in the cloud, or a hybrid ecosystem.
Successful Data Lakes are pattern-based, metadata-driven (for automation) business data repositories, accounting for data governance and data security (à la GDPR & PII) requirements. Data in the lake should present coalesced data and aggregations of the “record of truth”, ensuring information accuracy (which is quite hard to accomplish unless you know how) and timeliness. Following an Agile/Scrum methodology, using metadata management, applying data profiling, master data management, and such, I think a Data Lake must represent a ‘Total Quality Management’ information system. Still with me? Great!
What is a Data Lake for?
Essentially a data lake is used for any data-centric business use case, downstream of System (Enterprise) Applications, that helps drive corporate insights and operational efficiency. Here are some common examples:
- Business Information, Systems Integration, & Real Time data processing
- Reports, Dashboards, & Analytics
- Business Insights, Data Mining, Machine Learning, & Data Science
- Customer, Vendor, Product, & Service 360
How do you build an Agile Data Lake?
As you can see there are many ways to benefit from a successful Data Lake. My question to you is, are you considering any of these? My bet is that you are. My next questions are: Do you know how to get there? Are you able to build a Data Lake the RIGHT way and avoid the swamp? I’ll presume you are reading this to learn more. Let’s continue…
There are three key principles I believe you must first understand and must accept:
- ⇒ A PROPERLY implemented Ecosystem, Data Models, Architecture, & Methodologies
- ⇒ The incorporation of EXCEPTIONAL Data Processing, Governance, & Security
- ⇒ The deliberate use of Job Design PATTERNS and BEST PRACTICES
A successful Data Lake must also be agile; it then becomes a data processing and information delivery mechanism designed to augment business decisions and enhance domain knowledge. A Data Lake, therefore, must have a managed life cycle. This life cycle incorporates 3 key phases:
- INGESTION
- Extracting raw source data and accumulating it (typically written to flat files) in a landing zone or staging area for downstream processing & archival purposes
- ADAPTATION
- Loading & transformation of this data into usable formats for further processing and/or use by business users
- CONSUMPTION
- Data Aggregations (KPIs, Data-points, or Metrics)
- Analytics (actuals, predictive, & trends)
- Machine Learning, Data Mining, & Data Science
- Operational System Feedback & Outbound Data Feeds
- Visualizations & Reporting
The challenge is how to avoid the swamp. I believe you must use the right architecture, data models, and methodology. You really must shift away from your ‘Legacy’ thinking; adapt and adopt a ‘Modern’ approach. This is essential. Don’t fall into the trap of thinking you know what a data lake is and how it works until you consider these critical points.
Ok then, let’s examine these three phases a bit more. Data Ingestion is about capturing data, managing it, and getting it ready for subsequent processing. I think of this like a crate of data, dumped onto the sandy beach of the lake; a landing zone called a ‘Persistent Staging Area’ (PSA). Persistent because once it arrives, it stays there; for all practical purposes, once processed downstream, it becomes an effective archive (and you don’t have to copy it somewhere else). This PSA will contain data, text, voice, video, or whatever it is, which accumulates.
You may notice that I am not talking about technology yet. I will, but let me at least point out that depending upon the technology used for the PSA, you might need to offload this data at some point. My thinking is that an efficient file storage solution is best suited for this 1st phase.
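To make the landing zone idea a bit more concrete, here is a minimal Python sketch of a file-based PSA, assuming a plain directory tree as the storage layer. The function name `land_file`, the `<source system>/<load date>` folder layout, and the checksum sidecar file are my own illustrative assumptions, not a prescribed design. The only point is that raw files arrive unchanged and are never overwritten.

```python
import hashlib
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def land_file(source_file: Path, psa_root: Path, source_system: str) -> Path:
    """Copy a raw extract into the PSA without altering or overwriting anything."""
    loaded_at = datetime.now(timezone.utc)
    # Hypothetical layout: <psa_root>/<source_system>/<YYYY-MM-DD>/<time>_<name>
    target_dir = psa_root / source_system / loaded_at.strftime("%Y-%m-%d")
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{loaded_at.strftime('%H%M%S%f')}_{source_file.name}"
    if target.exists():
        # Persistent means persistent: an existing landed file is never replaced.
        raise FileExistsError(f"refusing to overwrite {target}")
    shutil.copy2(source_file, target)
    # A checksum sidecar lets later phases verify the archived copy is exact.
    digest = hashlib.sha256(target.read_bytes()).hexdigest()
    target.with_name(target.name + ".sha256").write_text(digest)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        extract = Path(tmp) / "orders.csv"
        extract.write_text("order_id,amount\n1,9.99\n")
        print(land_file(extract, Path(tmp) / "psa", "crm"))
```

Because nothing is modified on the way in, the same files can serve as the archive later, which is what lets you skip a separate archival copy.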
Data Adaptation is a comprehensive, intelligent coalescence of the data, which must adapt organically to survive and provide value. These adaptations take several forms (we’ll cover them below) yet essentially reside first in a raw data model at the lowest level of granularity, which can then be further processed, or as I call it, business purposed, for a variety of domain use cases. The data processing requirements here can be quite involved, so I like to automate as much of this as possible. Automation requires metadata. Metadata management presumes governance. And don’t forget security. We’ll talk about these more shortly.
Data Consumption is not just about business users; it is about business information, the knowledge it supports, and hopefully, the wisdom derived from it. You may be familiar with the DIKW Pyramid: Data > Information > Knowledge > Wisdom. I like to insert ‘Understanding’ after ‘Knowledge’ as it leads to wisdom.
Data should be treated as a corporate asset and invested in as such. Data then becomes a commodity and allows us to focus on the information, knowledge, understanding, and wisdom derived from it. Therefore, it is about the data and getting value from it.
Data Storage Systems: Data Stores
Ok, as we continue to formulate the basis for building a Data Lake, let’s look at how we store data. There are many ways we do this. Here’s a review:
- DATABASE ENGINES:
- ROW: traditional Relational Database System (RDBMS) (e.g., Oracle, MS SQL Server, MySQL, etc.)
- COLUMNAR: relatively unknown; feels like an RDBMS but optimized for Columns (e.g., Snowflake, Presto, Redshift, Infobright, & others)
- NoSQL - “Not Only SQL”:
- Non-Relational, eventual consistency storage & retrieval systems (e.g., Cassandra, MongoDB, & more)
- Distributed data processing framework supporting high data Volume, Velocity, & Variety (e.g., Cloudera, Hortonworks, MapR, EMR, & HDInsight)
- GRAPH - “Triple-Store”:
- Subject-Predicate-Object, index-free ‘triples’; based upon Graph theory (e.g., AllegroGraph & Neo4j)
- FILE SYSTEMS:
- Everything else under the sun (e.g., ASCII/EBCDIC, CSV, XML, JSON, HTML, AVRO, Parquet)
There are many ways to store our data, and many considerations to make, so let’s simplify our life a bit and call them all ‘Data Stores’, regardless of them being Source, Intermediate, Archive, or Target data storage. Simply pick the technology for each type of data store as needed.
What is Data Governance? Clearly another industry enigma. Again, Wikipedia to the rescue:
“Data Governance is a defined process that an organization follows to ensure that high quality data exists throughout the complete lifecycle.”
Does that help? Not really? I didn’t think so. The real idea of data governance is to affirm data as a corporate asset, and to invest in & manage it formally throughout the enterprise, so it can be trusted for accountable & reliable decision making. To achieve these lofty goals, it is essential to appreciate Source through Target lineage. Management of this lineage is a key part of Data Governance and should be well defined and deliberately managed. Separated into 3 areas, lineage is defined as:
- ⇒ Schematic Lineage maintains the metadata about the data structures
- ⇒ Semantic Lineage maintains the metadata about the meaning of data
- ⇒ Data Lineage maintains the metadata of where data originates & its auditability as it changes allowing ‘current’ & ‘back-in-time’ queries
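The ‘current’ and ‘back-in-time’ queries behind Data Lineage can be sketched with nothing more than Python’s built-in `sqlite3` module. This is only an illustration under my own assumptions: the `customer_history` table, its `record_source` and `load_ts` columns, and the helper `city_as_of` are hypothetical names, and a real lake would use whichever data store you have chosen. The idea is that rows are only ever inserted, each stamped with where it came from and when it arrived.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customer_history (
        customer_id   TEXT NOT NULL,
        city          TEXT NOT NULL,
        record_source TEXT NOT NULL,  -- where the row originated
        load_ts       TEXT NOT NULL,  -- when the lake received it (ISO 8601)
        PRIMARY KEY (customer_id, load_ts)
    )
    """
)
# Every change is a new row; nothing is updated or deleted.
conn.executemany(
    "INSERT INTO customer_history VALUES (?, ?, ?, ?)",
    [
        ("C1", "Boston", "crm", "2023-01-10T00:00:00"),
        ("C1", "Denver", "crm", "2023-06-01T00:00:00"),
        ("C2", "Austin", "erp", "2023-03-15T00:00:00"),
    ],
)


def city_as_of(customer_id: str, as_of: str):
    """Return (city, record_source, load_ts) as known at `as_of`, or None."""
    # ISO 8601 timestamps in one fixed format sort correctly as plain text.
    return conn.execute(
        """
        SELECT city, record_source, load_ts
        FROM customer_history
        WHERE customer_id = ? AND load_ts <= ?
        ORDER BY load_ts DESC
        LIMIT 1
        """,
        (customer_id, as_of),
    ).fetchone()


back_in_time = city_as_of("C1", "2023-03-01T00:00:00")  # the Boston version
current = city_as_of("C1", "9999-12-31T00:00:00")       # the Denver version
```

A ‘current’ query is then just a ‘back-in-time’ query with a date far in the future, and the `record_source` column keeps the origin of each version auditable.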
It is fair to say that a proper, in-depth discussion on data governance, metadata management, data preparation, data stewardship, and data glossaries is essential, but if I did that here we’d never get to the good stuff. Perhaps another blog? Ok, but later…
Data Lakes must also ensure that personal data (GDPR & PII) is secure and can be removed (disabled) or updated upon request. Securing data requires access policies, policy enforcement, encryption, and record maintenance techniques. In fact, all corporate data assets need these features which should be a cornerstone of any Data Lake implementation. There are three states of data to consider here:
- ⇒ DATA AT REST in some data store, ready for use throughout the data lake life cycle
- ⇒ DATA IN FLIGHT as it moves through the data lake life cycle itself
- ⇒ DATA IN USE perhaps the most critical, at the user-facing elements of the data lake life cycle
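As a very rough illustration of keeping personal data removable, here is a small Python sketch that swaps a PII value for a keyed token and holds the original in one separate place. Everything in it is an assumption for demonstration: `SECRET_KEY`, `pii_vault`, `protect`, and `forget` are hypothetical names, the in-memory dictionary stands in for an access-controlled, encrypted store, and this is not a description of how any particular security product works.

```python
import hashlib
import hmac

# Assumption: a real key would come from a key management service, not source code.
SECRET_KEY = b"demo-only-key"

# Stand-in for a secured store: token -> original value.
pii_vault = {}


def tokenize(value: str) -> str:
    """Derive a stable, keyed token so the raw value never travels downstream."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def protect(value: str) -> str:
    """Store the original once and hand back the token used everywhere else."""
    token = tokenize(value)
    pii_vault[token] = value
    return token


def forget(value: str) -> bool:
    """Honor a removal request by dropping the only copy of the original."""
    return pii_vault.pop(tokenize(value), None) is not None


email_token = protect("jane@example.com")
```

With this shape, the rest of the lake only ever carries `email_token`, so a removal request touches one store rather than every data set that references the person.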
Talend works with several technologies offering data security features. In particular, ‘Protegrity Cloud Security’ provides these capabilities using Talend specific components and integrated features well suited for building an Agile Data Lake. Please feel free to read “BUILDING A SECURE CLOUD DATA LAKE WITH AWS, PROTEGRITY AND TALEND” for more details. We are working together with some of our largest customers using this valuable solution.
Agile Data Lake Technology Options
Processing data into and out of a data lake requires technology (hardware/software) to implement. Grappling with the many, many options can be daunting. It is so easy to take these for granted, picking anything that sounds good. Only after better understanding the data involved, the systems chosen, and the development effort required does one find that the wrong choice has been made. Isn’t this the definition of a data swamp? How do we avoid this?
A successful Data Lake must incorporate a pliable architecture, data model, and methodology. We’ve been talking about that already. But picking the right ‘technology’ is more about the business data requirements and expected use cases. I have some good news here. You can de-couple the data lake designs from the technology stack. To illustrate this, here is a ‘Marketecture’ diagram depicting the many different technology options crossing through the agile data lake architecture.
As shown above, there are many popular technologies available, and you can choose different capabilities to suit each phase in the data lake life cycle. For those who follow my blogs you already know I do have a soft spot for Data Vault. Since I’ve detailed this approach before, let me simply point you to some interesting links:
- My Blogs on Data Vault which have been very popular
- Kent Graziano, Chief Technical Evangelist @Snowflake & I wrote a joint blog
- My Talend DV Tutorial continues to evolve; watch for updates
- Currently covering the Relational Model with and without using PIT tables
- I have completed work on a Snowflake deployment
- I have completed work on a Big Data (Cloudera/Hive) version as well
You should know that Dan Linstedt created this approach and has developed considerable content you may find interesting. I recommend these:
- Brief History of the Data Vault
- A short intro to Data Vault 2.0
- Defining a Data Lake
- Data Lake Part 2: Reference Data Architectures
- Defining a Data Lake Part 3: Landing Zones
- Defining a Data Lake: Part 4 data warehouse vs data lake
- Defining a Data Lake: Part 5 - do we need a Data Lake?
- Defining a Data Lake: Part 6 - CDC & Integration
I hope you find all this content helpful. Yes, it is a lot to ingest, digest, and understand (Hey, that sounds like a data lake), but take the time. If you are serious about building and using a successful data lake you need this information.
The Agile Data Lake Life Cycle
Ok, whew – a lot of information already and we are not quite done. I have mentioned that a data lake has a life cycle. A successful Agile Data Lake Life Cycle incorporates the 3 phases I’ve described above, data stores, data governance, data security, metadata management (lineage), and of course: ‘Business Rules’. Notice that what we want to do is de-couple ‘Hard’ business rules (that transform physical data in some way) from ‘Soft’ business rules (that adjust result sets based upon adapted queries). This separation contributes to the life cycle being agile.
Think about it: if you push physical data transformations upstream, then when the inevitable changes occur, there is less impact on everything downstream. On the flip side, when the dynamics of business impose new criteria, changing SQL ‘where’ clauses downstream will have less impact on the data models they pull from. The Business Vault provides this insulation from the Raw Data Vault as it can be reconstituted when radical changes occur.
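Here is a small sketch of that separation using Python’s built-in `sqlite3` module. The table and view names (`raw_orders`, `clean_orders`, `large_us_orders`) and the rules themselves are hypothetical examples I made up for illustration. The ‘Hard’ rule physically rewrites values once, on the way in; the ‘Soft’ rule lives only in a view’s ‘where’ clause and can be changed without touching stored data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_orders (order_id INTEGER, country TEXT, amount_cents INTEGER)"
)
conn.execute(
    "CREATE TABLE clean_orders (order_id INTEGER, country TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, " us", 1250), (2, "DE ", 9900), (3, "US", 400)],
)

# HARD rule (upstream): changes the physical data as it is loaded.
conn.execute(
    """
    INSERT INTO clean_orders
    SELECT order_id, UPPER(TRIM(country)), amount_cents / 100.0
    FROM raw_orders
    """
)

# SOFT rule (downstream): only shapes the result set. Redefining this view
# changes what consumers see without reloading or altering clean_orders.
conn.execute(
    """
    CREATE VIEW large_us_orders AS
    SELECT order_id, amount
    FROM clean_orders
    WHERE country = 'US' AND amount >= 10.0
    """
)

result = conn.execute(
    "SELECT order_id, amount FROM large_us_orders ORDER BY order_id"
).fetchall()
```

If the business later decides that ‘large’ means 5.00 and up, only the view definition changes; the loaded tables stay exactly as they are.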
Additionally, a Data Lake is not a Data Warehouse but in fact, encapsulates one as a use case. This is a critical takeaway from this blog. Taking this further, we are not creating ‘Data Marts’ anymore, we want ‘Information Marts’. Did you review the DIKW Pyramid link I mentioned above? Data should, of course, be considered and treated as a business asset. Yet simultaneously, data is now a commodity leading us to information, knowledge, and hopefully: wisdom.
This diagram walks through the Agile Data Lake Life Cycle from Source to Target data stores. Study this. Understand this. You may be glad you did. Ok, let me finish by saying that to be agile a data lake must:
- BE ADAPTABLE
- Data Models should be additive, without impact to the existing model when new sources appear
- BE INSERT ONLY
- Especially for Big Data technologies where Updates & Deletes are expensive
- PROVIDE SCALABLE OPTIONS
- Hybrid infrastructures can offer extensive capabilities
- ALLOW FOR AUTOMATION
- Metadata, in many aspects, can drive the automation of data movement
- PROVIDE AUDITABLE, HISTORICAL DATA
- A key aspect of Data Lineage
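To show how metadata can drive automation, here is a minimal Python sketch in which a mapping record, rather than hand-written code, produces an insert-only load statement. The `Mapping` class, the `build_insert` helper, the `load_ts` column, and the table names are all illustrative assumptions on my part; a real implementation would read this metadata from a governed repository.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """One metadata record describing how a source feeds a target."""
    source_table: str
    target_table: str
    columns: dict  # target column -> source column


def build_insert(mapping: Mapping) -> str:
    """Generate an insert-only load statement from metadata.

    Assumes the metadata is trusted, since identifiers are interpolated directly.
    """
    targets = ", ".join(mapping.columns)
    sources = ", ".join(mapping.columns.values())
    return (
        f"INSERT INTO {mapping.target_table} ({targets}, load_ts) "
        f"SELECT {sources}, CURRENT_TIMESTAMP FROM {mapping.source_table}"
    )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stg_customer (cust_no TEXT, cust_name TEXT)")
    conn.execute("CREATE TABLE customer (customer_id TEXT, name TEXT, load_ts TEXT)")
    conn.execute("INSERT INTO stg_customer VALUES ('C1', 'Ada')")
    mapping = Mapping(
        "stg_customer", "customer", {"customer_id": "cust_no", "name": "cust_name"}
    )
    conn.execute(build_insert(mapping))
    print(conn.execute("SELECT customer_id, name FROM customer").fetchall())
```

Onboarding a new source then means adding another `Mapping` record, not writing another job, which is what keeps the approach additive.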
And finally, consider that STAR Schemas are, and always were, designed to be ‘Information Delivery Mechanisms’; treating them as anything more is a misunderstanding some in the industry have fostered for many years. For many years we have all built Data Warehouses using STAR schemas to deliver reporting and business insights. These efforts all too often resulted in the data warehouse storing raw data in rigid data structures, requiring heavy data cleansing and, frankly, causing high impact when upstream systems are changed or added.
The cost in resources and budget has been a root cause of many delays, failed projects, and inaccurate results. This is a legacy mentality and I believe it is time to shift our thinking to a more modern approach. The Agile Data Lake is that new way of thinking. STAR schemas do not go away, but their role has shifted downstream, where they belong and were always intended to be.
This is just the beginning, yet I hope this blog post gets you thinking about all the possibilities now.
As a versatile technology, coupled with a sound architecture, pliable data models, strong methodologies, thoughtful job design patterns, and best practices, Talend can deliver cost-effective, process-efficient, and highly productive data management solutions. Incorporate all of this as I’ve shown above and not only will you create an Agile Data Lake, but you will avoid the SWAMP!
Till next time…