Successful Methodologies with Talend
If you are familiar with my previous Talend blogs, perhaps you’ve noticed that I like to talk about building better solutions through design patterns and best practices. My blogs also tend to be a bit long. Yet you read them; many thanks for that! This blog is going to focus on methodologies as they apply to Talend solutions. My hope is that by the end of it we all agree that any successful Talend project should use proper methodologies, or face the proverbial project crash and burn.
A stool needs 3 legs upon which it can stand, right? I believe that any software development project is like that stool. It too needs 3 legs, which are defined as:
- USE CASE - Well-defined business data/workflow requirements of a solution
- TECHNOLOGY - The tools with which we craft, deploy, and run the solution
- METHODOLOGY - A given way to do things that everyone agrees with
Software projects tend to focus only on the use case and technology stack, often ignoring methodologies. This, in my opinion, is a mistake and leads to painful results, or no results at all. When planning for successful Talend projects, incorporating best practices for Job Design Patterns (JDP&BP) and Data Model Designs (DMD&BP) is a fundamental objective. These disciplines fall into the methodology leg of our proverbial stool. The foundational precepts discussed in my JDP&BP blogs suggest that consistency is perhaps the most important precept to apply. The reason is simple: even if you are doing something the wrong way, or let’s be nicer, not the best way, it is much easier to correct when it has been done the same way everywhere. I like that. Successful methodologies should, therefore, encapsulate consistency.
3 Pillars of Success
Let’s take a closer look. Each leg of the stool is a pillar for success. Each pillar holds up its share of the load and has a purpose to serve in the bigger picture. Software is not magic, and understanding its evolution and life cycle from inception to deprecation is fundamental to any successful implementation. Hardly any serious developer ignores this, yet all too often they succumb to business pressures, and in most cases it is methodology that suffers the most. As methodologies are ignored, hidden long-term costs skyrocket. The worst-case scenario is a complete project failure; the best case is lower quality, loss of competitive edge, and/or extended delivery dates.
So let’s get serious about this. Here is a starter outline to consider:
Modify these as needed; we all have different yet often similar requirements. Let’s continue.
What is a Successful Methodology?
Any successful methodology is one that is adopted and put into practice; nothing more, nothing less. When a software project team applies a methodology it then knows how it plans to achieve the desired results. Without a properly defined and adopted methodology, what we get is analogous to the ‘wild-wild west’, or ‘shoot-then-aim’ scenarios. Neither fosters success in any software project, IMHO! We must have that 3rd leg of the stool.
When we talk about methodologies we can easily cover a lot of different topics. Essentially, however, this discussion should focus initially on planning for the design, development, release, deployment, and maintenance of a software project. Commonly known as the ‘Software Development Life Cycle’ or SDLC, this should be the first step of many in setting up, managing, coordinating, and measuring the required resources, significant milestones, and actual deliverables. Many of you readers already know the two most common SDLC methodologies, Waterfall and Agile; to these I add a third, a hybrid:
- WATERFALL - Strict, prescribed steps (phases) that scope an entire software result
- AGILE - Flexible, iterative approach (sprints) that optimizes early results
- JEP+ - A hybrid, ‘best of’ approach encapsulating all and only what matters
Without an SDLC methodology in place, I submit that any perceived savings from not doing this work (and it is work) are all too often dwarfed by the cost of misguided, misconfigured, mistaken results that miss the mark. How much do software blunders cost? I wonder… trust me, it’s a lot! So, choose an SDLC; choose one that is suitable. Implement it. This is a no-brainer! Agile appears to have become the de facto standard across the software industry, so start there. While the essential tasks within Waterfall and Agile are similar, it’s their approach that is different. This blog is not going to dive in deep on these, but let’s cover two indispensable areas.
Originating from the manufacturing and construction industries, the Waterfall SDLC method prescribes a sequential, non-iterative process which flows downstream like a waterfall. Starting with an idea or concept, all requirements are gathered, analyzed, and clearly specified; a design is conceived, diagrammed, and extensively vetted; code development ensues (sometimes for what seems like eternity); and then come testing/debugging, production rollout, and follow-on maintenance of the code base.
The real benefit of this method is that everyone knows exactly what they will get at the end of the project (supposedly). The downside is that it can take a long time before any results can be seen or used. This can become a costly problem if any of the upstream steps failed to capture the true intent of the initial concept. The further upstream the misstep occurs, the costlier it is to correct. Well, at least we can often get beautiful documentation!
Principles in Agile SDLC focus on short development tasks and prompt delivery of their functionality to users. Promoting a highly-iterative process, this adaptive method reacts to and is flexible to an evolutionary event sequence: A continuous improvement and addition of features, delivering value incrementally over time.
The clear advantage of Agile over Waterfall is that it keeps cross-functional teams involved in the process end-to-end, fostering better communication and the ability to make changes along the way. The letdown is that feature creep can become a real irritation, even though best practices should be in place to avoid it.
Oh, and don’t count on pretty documentation either.
Don’t get me wrong, Agile is definitely, in most cases, the right way to go. In fact, don’t stop there. Agile does not envelop all aspects of broader SDLC best practices. Take, for instance, data, or better yet, big data. For our purposes, we are building software that pushes data around in one form or another. If you’ve followed my previous blogs, then you know I am talking about DDLC, or the ‘Database Development Life Cycle’. Read more in my two-part series on Data Model Design & Best Practices.
Introducing JEP+: Just Enough Process
My years in the software industry have resulted in many successful projects and, yes, painful failures. What I learned was that the SDLC/DDLC process is critical and necessary, as it allows you to measure quality and so supports better business decisions for project/product release. I also learned that too much process gets in the way and can bog down the creative deliverables, reducing their value and increasing their cost.
For the past 20 years or so, my brother Drew Anderson and I have fashioned a hybrid SDLC process we call JEP+, or Just-Enough-Process-Plus, which means having the right amount of process in place ‘plus’ a little bit more to ensure that there is enough. It is essentially an Agile/Scrum based method into which we incorporated the Capability Maturity Model, Waterfall, and ISO 9000 methodologies where beneficial. JEP+ is where DDLC has its roots.
When we presented our methods to customers, they almost always asked: how can we buy this? In the end, it was our competitive advantage, which led to us getting hired for many consulting projects. Drew still does! Many have recommended we write a book. Scott Ambler, Mr. Agile, suggested the same. We may do it one day.
Job Design Patterns in Talend
There are many elements, or processes, in any successful methodology. For Talend developers, a key element we should discuss is Job Design Patterns. This allows me to expand upon some hints I gave in my JDP&BP series. Presuming we have a clearly defined use case, and we have nailed down the technology reference architecture, we should have a solid design pattern in mind before we really dive in to writing code. Design Patterns provide Talend developers a prescribed approach to building a successful job. Let’s look at some useful design patterns:
Extraction and direct loading of data, 1:1 mapping from source data to target storage; Rarely include transformations unless datatype conversions are required; Common in moving data from one system to another quickly; Not generally a long term solution but it gets things done
Extraction of any specified source data, writing to an intermediate storage (usually a flat file: CSV/XML/JSON) and then loaded into a target data store; Transformations can occur in either step; These are usually 2 separate, decoupled jobs; Common in populating a Data Warehouse or Data Lake
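To make the two-step idea concrete, here is a minimal Python sketch. It is not Talend-generated code; the function names `dump` and `load`, the file name, and the sample rows are all made up for illustration. It assumes the intermediate storage is a CSV file and that the transformation happens in the second step.

```python
import csv
import os
import tempfile


def dump(source_rows, path):
    """Job 1 (hypothetical): extract source rows into an intermediate CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "name"])
        writer.writeheader()
        writer.writerows(source_rows)


def load(path, target):
    """Job 2 (hypothetical): read the intermediate file, transform, load the target."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # In this sketch the transformation lives in the load step:
            # cast the id to int and normalize the name.
            target[int(row["id"])] = row["name"].strip().upper()
    return target


# Made-up source data; a dict stands in for the target data store.
source = [{"id": 1, "name": " alice "}, {"id": 2, "name": "bob"}]
staging_file = os.path.join(tempfile.mkdtemp(), "customers.csv")

dump(source, staging_file)           # the two steps share only the file,
warehouse = load(staging_file, {})   # so they can run as separate jobs
print(warehouse)
```

Because the two functions communicate only through the file, either one can be rerun or rescheduled without touching the other, which is the point of keeping the jobs decoupled.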
Seamless and/or continuous extraction of new and changed data from either source or target, posting the data (either inserting, updating, or deleting as needed) back to the target or source data store; Transformations usually do not occur in this exchange; Common in CDC processes and/or MDM systems where all data stores require all current information all the time; Also used as a one-way population of information marts for analytics
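As a rough illustration of posting inserts, updates, and deletes to a target, here is a small Python sketch. The change-record shape `(operation, key, value)` and the function name `apply_changes` are my own assumptions for this example, not a Talend or CDC product format; a dict stands in for the target data store.

```python
def apply_changes(target, changes):
    """Apply (operation, key, value) change records to a target store in order."""
    for op, key, value in changes:
        if op in ("insert", "update"):
            # No transformation: the value is posted exactly as captured.
            target[key] = value
        elif op == "delete":
            target.pop(key, None)  # deleting a missing key is a no-op
        else:
            raise ValueError(f"unknown operation: {op}")
    return target


# Made-up target state and captured changes.
target = {1: "alice", 2: "bob"}
changes = [
    ("update", 2, "robert"),
    ("insert", 3, "carol"),
    ("delete", 1, None),
]

apply_changes(target, changes)
print(target)
```

Applying the changes in captured order matters here: an update followed by a delete of the same key must not end with the row still present.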
Processing of large data sets in a prescribed ‘chunk’ usually based upon parameterized ‘start/stop’ values where the job can then be executed multiple times concurrently (in parallel) without impact to the downstream process; Often works well with the Dump/Load pattern for extractions that require significant transformations that slow the overall execution process; Common in migrations from disparate systems
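Here is one way the parameterized ‘start/stop’ idea might look, sketched in Python. The helper names `make_chunks` and `process_chunk` are hypothetical, the summing is only a stand-in for a real job run, and a thread pool stands in for whatever actually launches the concurrent executions.

```python
from concurrent.futures import ThreadPoolExecutor


def make_chunks(start, stop, size):
    """Split the half-open range [start, stop) into (start, stop) pairs."""
    return [(lo, min(lo + size, stop)) for lo in range(start, stop, size)]


def process_chunk(bounds):
    """Stand-in for one job execution, parameterized by its start/stop values."""
    lo, hi = bounds
    return sum(range(lo, hi))  # placeholder work on keys lo..hi-1


chunks = make_chunks(0, 10, 4)  # [(0, 4), (4, 8), (8, 10)]

# Each chunk touches a disjoint key range, so the runs can execute in parallel.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(process_chunk, chunks))

total = sum(results)
print(chunks, results, total)
```

The half-open ranges are a deliberate choice in this sketch: no key falls into two chunks, so concurrent runs never process the same row twice.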
Decoupling of data processing into smaller, independent services that can operate on data sets without impact from either calling or called jobs; Micro-Services can do just about anything needed, however they should be concise and avoid all dependencies on outside influences except for parameterized pass-in and/or pass-through arguments; Common in data systems integration and e-commerce applications; Can also be designed as a Micro-Batch process to support parallelization of the ‘Chunking’ pattern
Datasets that require many additional elements added to their flow using lookups from multiple sources require special consideration; In-Memory and Row-by-Row Lookup models must be carefully selected and balanced; Often used in coordination with Micro-Services; Common in e-commerce shopping carts and machine learning applications
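The trade-off between the two lookup models can be sketched in a few lines of Python. Everything named here (`enrich_in_memory`, `enrich_row_by_row`, `fetch_country`, the sample orders) is invented for illustration; a dict plays the in-memory lookup and a plain function call plays the per-row round trip to a lookup source.

```python
def enrich_in_memory(rows, lookup_table):
    """Load the lookup once into memory, then join every row against it."""
    return [{**row, "country": lookup_table.get(row["country_code"])} for row in rows]


def enrich_row_by_row(rows, fetch_one):
    """Query the lookup source once per row: less memory, more round trips."""
    return [{**row, "country": fetch_one(row["country_code"])} for row in rows]


# Made-up lookup data and a fake lookup source that records each call.
countries = {"US": "United States", "FR": "France"}
calls = []


def fetch_country(code):
    calls.append(code)
    return countries.get(code)


orders = [
    {"order": 1, "country_code": "US"},
    {"order": 2, "country_code": "FR"},
    {"order": 3, "country_code": "US"},
]

in_memory = enrich_in_memory(orders, countries)
row_by_row = enrich_row_by_row(orders, fetch_country)
print(in_memory == row_by_row, len(calls))
```

Both approaches produce the same enriched rows; what differs is the cost. The in-memory model pays once to hold the whole lookup, while the row-by-row model pays a round trip per row, which is why the choice depends on lookup size and row volume.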
STREAM 4 PERFORMANCE
Big Data platforms provide for high volume, variety, and velocity, supporting fast batch data ingestion processes; however, when the V-V-V is massive, data streaming in-memory can dramatically improve performance; Common in parsing large data files in Hadoop using Spark
Internet-Of-Things sensor data whose benefit must be available virtually instantly, or in ‘near-real-time’, can utilize data streaming to bypass intermediate processing; Common in collecting continuous data feeds like video systems, alarms, or other sensor data where monitoring its resulting transformation is essential; Usually implemented with Big Data systems, Kafka topics and Spark Streaming
This is not an exhaustive list, yet I hope it gets you thinking about the possibilities.
Talend is a versatile technology and, coupled with sound methodologies, can deliver cost-effective, process-efficient, and highly productive business data solutions. SDLC processes and Job Design Patterns present important segments of successful methodologies. In my next blog I plan to augment these with additional segments you may find helpful.
Till next time…