How to Develop a Data Processing Job Using Apache Beam


This blog post is part 1 of a series of blog posts on Apache Beam.

Are you familiar with Apache Beam? If not, don’t be ashamed: as one of the more recent projects developed by the Apache Software Foundation, first released in June 2016, Apache Beam is still relatively new in the data processing world. As a matter of fact, it wasn’t until recently, when I started to work closely with Apache Beam, that I loved to learn and learned to love everything about it.

Apache Beam is a unified programming model that provides an easy way to implement batch and streaming data processing jobs and run them on any execution engine using a set of different IOs. Sounds promising but still confusing? This is why I decided to launch a series of blog posts on Apache Beam. In this post, and in the following ones, I’ll show concrete examples and highlight several use cases of data processing jobs using Apache Beam.

Our topic for today is batch processing. Let’s take the following example: You work for a car dealership and want to analyze car sales over a given period of time (e.g. how many cars of each brand were sold?). This means that our data set is bounded (finite amount of data) and it won’t be updated (the sales happened in the past). In this case, we can rely on a batch process to analyze our data.

As input data, we have text logs of sold cars in the following format:


For example:
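(The format specification and sample lines were not preserved in this copy. A hypothetical comma-separated layout, consistent with the aggregation performed later in the post, could be:)

```
id,brand_name,model_name,quantity
1,Toyota,Prius,3
2,Nissan,Sentra,2
3,Ford,Fusion,5
```

The field names and ordering here are an assumption for illustration, not the original specification.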

Before starting the implementation of our first Beam application, we need to become aware of some core ideas that will be used throughout. There are three main concepts in Beam: Pipeline, PCollection, and PTransform.

  • Pipeline encapsulates the workflow of your entire data processing tasks from start to finish. 
  • PCollection is a distributed dataset abstraction that Beam uses to transfer data between PTransforms.
  • PTransform is a process that operates on input data (an input PCollection) and produces output data (an output PCollection). Usually, the first and the last PTransforms represent a way to input/output data, which can be bounded (batch processing) or unbounded (stream processing).

To simplify things, we can consider a Pipeline as a DAG (directed acyclic graph) which represents your whole workflow, PTransforms as nodes (that transform the data), and PCollections as edges of this graph. More information can be found in the Beam Programming Guide.

Now, let’s get back to our example and try to implement the first pipeline, which will process the provided data set.

Creating a pipeline

First, just create a new pipeline:

Pipeline pipeline = Pipeline.create();

Then, let’s create a new PTransform using the pipeline.apply() method, which will read data from the text file and create a new PCollection of strings. To do this, we use one of the IOs already implemented in Beam – TextIO. TextIO allows reading from and writing to text file(s) line by line. It has many other features, like working with different file systems, supporting file patterns, and streaming of files. For more information, see the Apache Beam documentation.
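The code for this read step was elided from this copy; with the Beam Java SDK it would look something like the sketch below (the input path is a hypothetical example):

```java
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.values.PCollection;

// Read the sales log line by line; each line becomes one element.
PCollection<String> lines = pipeline.apply(
        TextIO.read().from("/tmp/beam/cars_sales_log"));
```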


The output of this PTransform is a new instance of PCollection<String> where every entry of the collection is a text line of the input file.

Since we want the total number of sales per brand as a result, we must group them accordingly. Therefore, the next step will be to parse every line and create a key/value pair where the key is a brand name and the value is a number of sales. It’s worth mentioning that the output PCollection from the previous PTransform will be the input PCollection for this one.

At this step, we use a Beam built-in PTransform called MapElements to create a new key/value pair for every input entry using the provided implementation of the SimpleFunction interface.
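The transform itself is missing from this copy; a sketch, assuming the hypothetical comma-separated layout `id,brand,model,quantity`, might look like:

```java
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Turn every log line into a (brand, quantity) key/value pair.
// The field indices depend on the assumed log layout above.
PCollection<KV<String, Integer>> brandSales = lines.apply(
        MapElements.via(new SimpleFunction<String, KV<String, Integer>>() {
            @Override
            public KV<String, Integer> apply(String line) {
                String[] fields = line.split(",");
                return KV.of(fields[1], Integer.valueOf(fields[3]));
            }
        }));
```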

We then group the number of sales by brand using another Beam transform – GroupByKey. As an output, we have a PCollection of key/values where the key is a brand name and the value is an iterable collection of sales for that brand.

.apply(GroupByKey.<String, Integer>create())

Now we are ready to sum up all the numbers of car sales per brand using our own implementation of a ParDo transform:
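The DoFn implementation was elided from this copy; a sketch of what it could look like (variable names are illustrative):

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

// Sum the grouped sales numbers per brand and format one report line each.
PCollection<String> report = grouped.apply(
        ParDo.of(new DoFn<KV<String, Iterable<Integer>>, String>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                int total = 0;
                for (Integer sales : c.element().getValue()) {
                    total += sales;
                }
                c.output(c.element().getKey() + ": " + total);
            }
        }));
```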

To finalize the pipeline, we apply another IO transform to take the PCollection of strings and write them in a text file:
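The write step, elided here, would be along these lines (the output path is a hypothetical example):

```java
import org.apache.beam.sdk.io.TextIO;

// Write each report line into a text file.
report.apply(TextIO.write().to("/tmp/beam/cars_sales_report"));
```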


The last thing we need to do is run our newly created pipeline:
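In the Beam Java SDK this is a one-liner; waitUntilFinish() optionally blocks until the job completes:

```java
pipeline.run().waitUntilFinish();
```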

Looks quite easy, doesn’t it? This is the power of Apache Beam, which allows you to create complicated data processing pipelines with a minimum amount of code.

For those of you familiar with Hadoop, you may have noticed that this pipeline resembles a familiar pattern:

  • It reads and parses text data line by line creating new key/value pairs (Map)
  • Then groups these key/values by key (GroupBy)
  • Finally, it iterates over all values of one key applying some user function (Reduce)

Yes, that’s true – this simple pipeline could be performed with a classic MapReduce job! But just compare how much simpler and clearer it looks in Beam (despite being in Java!), and if we decide to extend our pipeline by adding another transform, it won’t become much more complicated.
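For comparison, the same Map/GroupBy/Reduce shape can be sketched with plain JDK streams over in-memory data. The class name, sample records, and field layout below are hypothetical, used only to show the structure of the computation:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class SalesPerBrand {

    // Parse "id,brand,model,quantity" lines (a hypothetical layout)
    // and total the quantity per brand: Map -> GroupBy -> Reduce.
    public static Map<String, Integer> totals(List<String> lines) {
        return lines.stream()
                .map(line -> line.split(","))                 // Map
                .collect(Collectors.groupingBy(               // GroupBy
                        fields -> fields[1],
                        Collectors.summingInt(                // Reduce
                                fields -> Integer.parseInt(fields[3]))));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "1,Toyota,Prius,3",
                "2,Nissan,Sentra,2",
                "3,Toyota,Corolla,4");
        System.out.println(totals(lines));
    }
}
```

The Beam version expresses exactly this shape, but over a distributed, potentially unbounded PCollection instead of an in-memory list.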

Building and running a pipeline

As I mentioned before, a Beam pipeline can be run on different runners (processing engines):

  • Direct Runner
  • Apache Apex
  • Apache Flink
  • Apache Gearpump
  • Apache Spark
  • Google Cloud Dataflow

To do this, we just need to add the corresponding dependency to our Maven or Gradle project configuration. The good thing is that we don’t have to adjust or rewrite pipeline code to run it on each runner. Even better, we don’t have to recompile our jars if all required runner dependencies were included before – we just need to choose which runner to use and that’s it!

Direct Runner is a local runner which is usually used to test your pipeline. When using Java, you must specify your dependency on the Direct Runner in your pom.xml.
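The dependency declaration is not shown in this copy; it would look roughly like this (the version number is an illustrative assumption – use the Beam release you build against):

```xml
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-runners-direct-java</artifactId>
  <version>2.4.0</version>
  <scope>runtime</scope>
</dependency>
```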


Afterward, compile your project:
# mvn clean package

And run your pipeline on direct runner:
# mvn exec:java -Dexec.mainClass=org.apache.beam.tutorial.analytic.SalesPerCarsBrand -Pdirect-runner -Dexec.args="--runner=DirectRunner"

For example, if our input file contains the following data:
# cat /tmp/beam/cars_sales_log
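(The sample file itself was elided from this copy; hypothetical contents, consistent with the report shown next, could be:)

```
1,Toyota,Prius,3
2,Nissan,Sentra,2
3,Ford,Fusion,5
4,Toyota,Corolla,4
5,Ford,Focus,3
```

Toyota: 3 + 4 = 7, Nissan: 2, Ford: 5 + 3 = 8.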

Then the final result will be like this:
# cat /tmp/beam/cars_sales_report
Toyota: 7
Nissan: 2
Ford: 8

The list of all supported runners and instructions on how to use them can be found on this page.

Finally, all code of this example is published on this GitHub repository:

In the next part of this blog post series, I will talk about streaming data processing in Beam. I’ll take another example of data analytics task with an unbounded data source and we will see what Beam provides us in this case.

Successful Methodologies with Talend – Part 2

Let’s continue our conversation about Job Design Patterns with Talend.  In my last blog, Successful Methodologies with Talend, I discussed the importance of a well-balanced approach:

  • Having a clearly defined ‘Use Case’
  • Incorporating the proper ‘Technology’
  • Defining the way everyone works together with a ‘Methodology’ of choice (Waterfall, Agile, JEP+, or maybe some hybrid of these)

Your Talend projects should be no exception.  Also, when you follow and adopt Best Practices (many are discussed in my other previous blogs), you dramatically increase the opportunity for successful Talend Job Designs.  This leads to successful Talend Projects; and joyful cheers!

With all these fundamentals in place, it seems like a good time for me to elaborate on Job Design Patterns themselves.  Brief descriptions of several common patterns are listed in Part 1 of this series.  Now, let’s take a deeper dive.  First, however, to augment our discussion on the ‘Just-Enough-Process’ methodology, I want to reinforce the importance of team communication.  An Agile team (called a Scrum team) is comprised of several players:

  • A Scrum-Master
  • Stakeholders
  • Business Analysts
  • Developers
  • Testers

A successful ‘Sprint’ follows the agile process with defined milestones and communicates, at the Scrum level, through well-defined tasks using tracking tools like Jira from Atlassian.  I’ll assume you know some basics about the Agile Methodology; in case you want to learn more, here are some good links:

Agile Modeling

The Agile Unified Process

Disciplined Agile Delivery


JEP+ Communication Channels

To understand how effective communication can propel the software development life cycle, JEP+ defines an integrated approach. Let’s walk through the ideas behind the diagram on the left. The Software Quality Assurance (SQA) team provides the agnostic hub of any project sprint conversation.  This is effective because the SQA team’s main purpose is to test and measure results.  As the communication hub, the SQA team can effectively, and objectively, become the epicenter of all Scrum communications.  This has worked very effectively for me on small and large projects.

As shown, all key milestones and deliverables are managed effectively.  I like this stratagem, not because I defined it, but because it makes sense.  Adopt this style of communication across your Scrum team, using tools of your choice, and it will likely increase your team’s knowledge and understanding across any software development project, Talend or otherwise.  Let me know if you want to learn more about JEP+; maybe another blog?

Talend Job Design Patterns

Ok, so let’s get into Talend Job Design Patterns.  My previous blog suggested that of the many elements in a successful approach or methodology, for Talend developers, one key element is Job Design Patterns.  What do I mean by that?  Is it a template-based approach to creating jobs?  Well, yes, sort of!  Is it a skeleton, or jumpstart job?  Yeah, that too!  Yet, for me, it is more about the business use case that defines the significant criteria.

Ask yourself: what is the job’s purpose?  What does it need to do?  From there, you can devise a proper approach (or pattern) for the job’s design.  Since there are many common use cases, several patterns have emerged for me, where the job depends greatly upon what result I seek.  Unlike some other ETL tools available, Talend integrates both the process and data flow into a single job.  This allows us to take advantage of building reusable code, resulting in sophisticated and pliable jobs.  Creating reusable code is therefore about the orchestration of intelligent code modules.

It is entirely possible, of course, that job design patterns will vary greatly from one use case to another.  This reality should force us to think carefully about job designs and how we should build them.  It should also promote consideration of what can be built as common code modules, reusable across different design patterns.  These can get a bit involved, so let’s examine them individually.  We’ll start with some modest ones:

LIFT-N-SHIFT: Job Design Pattern #1

This job design pattern is perhaps the easiest to understand.  Extract the data from one system and then directly place a copy of the data into another.  Few (if any) transformations are involved. It’s simply a 1:1 mapping of source to target data stores.  Examples of possible transformations may include a data type change, a column length variation, or perhaps adding or removing an operative column or two.  Still, the idea of a ‘Lift-n-Shift’ is to copy data from one storage system to another, quickly.  Best practices assert that using tPreJob and tPostJob components, appropriate use of tWarn and tDie components, and a common exception handler ‘Joblet’ are highly desirable.

Here is what a simple ‘Lift-n-Shift’ job design pattern may look like:


Let’s go through some significant elements of this job design pattern:

  • The layout follows a Top-to-Bottom / Left-to-Right flow: Best Practice #1
  • Use of a common ‘Joblet’ for exception handling: Best Practice #3
  • Entry/Exit Points are well defined: Best Practice #4
    • tPreJob and tPostJob components are in place to initialize and wrap up job design
    • tDie and tWarn components are used effectively (not excessively)
    • Also, notice the green highlighted components; these entry points must be understood
  • Exception handling is incorporated: Best Practice #5
  • The process completes all selected data and captures rejected rows: Best Practice #6
  • Finally, the Main-Loop is identified as the main sub-job: Best Practice #7

If you haven’t read my blog series on Job Design Patterns & Best Practices, you should.  These call outs will make more sense!

It is also fair to say that this job design pattern may become more complex depending upon the business use case and the technology stack in place.  Consider using a Parent/Child orchestration job design (Best Practice #2).  In most cases, your job design patterns will likely keep this orchestration to a minimum, instead using the technique I describe for the tSetDynamicSchema component (Best Practice #29).  This approach, with the source information schema details, may even address some of the limited transformations (i.e.: data type & size) required.  Keep it simple; make it smart!

MIGRATION: Job Design Pattern #2

A ‘Migration’ job design pattern is a lot like the ‘Lift-n-Shift’ pattern with one important difference: Transformations!  Essentially, when you are copying and/or moving data from source systems to target data stores, significant changes to that data may be required, thus becoming a migration process.  The source and target systems may be completely different (i.e.: migrating an Oracle database to MS SQL Server).  Therefore, the job design pattern expands from the ‘Lift-n-Shift’ and must accommodate some or many critical steps involved in converting the data.  This may include splitting or combining tables, and accounting for differences between the systems in data types and/or features (new or obsolete).  Plus, you may need to apply specific business rules to the process and data flow in order to achieve the full migration effect.

Here is what a simple ‘Migration’ job design pattern may look like:

Take notice of the same elements and best practices from the ‘Lift-n-Shift’ job design plus some important additions:

  • I’ve created an additional, separate DB connection for the lookup!
    • This is NOT optional; you can’t share the SELECT or INSERT connections simultaneously
    • You may, optionally, define the connection directly in the lookup component instead
    • When multiple lookups are involved I prefer the direct connection strategy
  • The tMap component opts for the correct Lookup Model: Best Practice #25
    • Use the default ‘Load Once’ model for best performance and when there is enough memory on the Job Server to execute the lookup of all rows in the lookup table
    • Use the ‘Reload at Each Row’ model to eliminate any memory limitations, and/or when the lookup table is very large, and/or not all records are expected to be looked-up

It is reasonable to expect that ‘Migration’ job designs may become quite complex.  Again the Parent/Child orchestration approach may prove invaluable.  The tSetDynamicSchema approach may dramatically reduce the overall number of jobs required.  Remember also that parallelism techniques may be needed for a migration job design.  Review Best Practice #16 for more on those options.

COMMAND LINE: Job Design Pattern #3

The ‘Command Line’ job design pattern is very different.  The idea here is that the job works like a command line executable having parameters which control the job behavior.  This can be very helpful in many ways, most of which will be highlighted in subsequent blogs from this series.  Think of the parent job as being a command.  This parent job validates argument values and determines what the next steps are. 

In the job design pattern below we can see that the tPreJob component parses the arguments for required values and exits when they are missing.  That’s all.  The main body of the job then checks for specific argument values and determines the process flow.  Using the ‘RunIF’ trigger we can control the execution of a particular child job.  Clearly, you might need to pass down these arguments into the child jobs where they can incorporate additional validation and execution controls (see Best Practice #2).

Here is what a simple ‘Command Line’ job design pattern may look like:

There are several critical elements of this job design pattern:

  • There are no database connections in this top-level orchestration job
    • Instead, the tPreJob checks to see if any arguments have been passed in
    • You may validate values here, but I think that should be handled in the job body
  • The job body checks the arguments individually using the ‘RunIF’ trigger, branching the job flow
  • The ‘RunIF’ check in the tPreJob triggers a tDie component and exits the job directly
    • Why continue if required argument values are missing?
  • The ‘RunIF’ check on the tJava component triggers a tDie component but does not exit the job
    • This allows the tPostJob component to wrap things up properly
  • The ‘RunIF’ checks on the tRunJob components trigger only if the return code is greater than 5000 (see Best Practice #5: Return Codes) but do not exit the job either

In a real world ‘Command Line’ use case, a considerable intelligence factor can be incorporated into the overall job design, Parent/Child orchestration, and exception handling.  A powerful approach!

DUMP-LOAD: Job Design Pattern #4

The ‘Dump-Load’ job design pattern is a two-step strategy. It’s not too different from a ‘Lift-n-Shift’ and/or ‘Migration’ job design. This approach is focused upon the efficient transfer of data from a source to a target data store.  It works quite well on large data sets, where replacing SELECT/INSERT queries with write/read of flat files using a ‘BULK/INSERT’ process is likely a faster option.


Take notice of several critical elements for the 1st part of this job design pattern:

  • A single database connection is used for reading a CONTROL table
    • This is a very effective design allowing for the execution of the job based upon externally managed ‘metadata’ records
    • A CONTROL record would contain a ‘Source’, ‘RunDate’, ‘Status’, and other useful process/data state values
      • The ‘Status’ values might be: ‘ready to dump’ and ‘dumped’
      • It might even include the actual SQL query needed for the extract
      • It may also incorporate a range condition for limiting the number of records
      • This allows external modification of the extraction code without modifying the job directly
    • Key variables are initialized to craft a meaningful, unique ‘dump’ file name
      • I like the format {drv:}/{path}/YYYYMMDD_{tablename}.dmp
    • With this job design pattern, it is possible to control multiple SOURCE data stores in the same execution
      • The main body of the job design will read from the CONTROL table for any source ready to process
      • Then using a tMap, separated data flows can handle different output formats
    • A common ‘Joblet’ updates the CONTROL record values of the current data process
      • The ‘Joblet’ will perform an appropriate UPDATE and manage its own database connection
        • Setting the ‘Run Date’, current ‘Status’, and ‘Dump File Name’
      • I have also used an ‘in process’ status to help with exceptions that may occur
        • If you choose to set the 1st state to ‘in process’ an additional use of the ‘Joblet’ after the SELECT query has processed successfully is required to update the status to ‘dumped’ for that particular CONTROL record
        • In this case, external queries of the CONTROL table will show which SOURCE failed as the status will remain ‘in process’ after the job completes its execution
      • Whatever works best for your use case: it’s a choice of requisite sophistication
      • Note that this job design allows ‘at-will re-execution’
    • The actual ‘READ’ or extract then occurs and the resulting data is written to a flat file
      • The extraction component creates its own DB connection directly
        • You can choose to create the connection 1st and reuse it if preferable
      • This output file is likely a delimited CSV file
        • You have many options
      • Once all the CONTROL records are processed the job completes using the ‘tPostJob’, closing the database connection and logging its successful completion
      • As the ‘Dump’ process is decoupled from the ‘Load’ process it can be run multiple times before loading any dumped files
        • I like this as anytime you can decouple the data process you introduce flexibility

Let’s take notice of several critical elements for the 2nd part of this job design pattern:

  • Two database connections are used
    • One for reading the CONTROL table and one for writing to the TARGET data store
  • The same key variables are initialized to set up the unique ‘dump’ file name to be read
    • This may actually be stored in the CONTROL record, yet you still may need to initialize some things
  • This step of the job design pattern controls multiple TARGET data stores within the same execution
    • The main body of the job design will read the CONTROL table for any dump files ready to process
    • Their status will be ‘dumped’ and other values can be retrieved for processing like the actual data file name
    • Then using a tMap, separated data flows can handle the different output formats
  • The same ‘Joblet’ is reused to update the CONTROL record values of the current data process
    • This time the ‘Joblet’ will again UPDATE the current record in process
      • Setting the ‘Run Date’ and current ‘Status’: ‘loaded’
    • Note that this job design also allows ‘at-will re-execution’
  • The actual ‘BULK/INSERT’ then occurs and the data file is written to the TARGET table
    • The insertion component can create its own DB connection directly
      • I’ve created it in the ‘tPreJob’ flow
      • The trade-off is how you want the ‘Auto Commit’ setting
    • The data file being processed may also require further splitting based upon some business rules
      • These rules can be incorporated into the CONTROL record
      • A ‘tMap’ would handle the expression to split the output flows
      • And as you may guess, you might need to incorporate a lookup before writing the final data
      • Beware, these additional features may determine if you can actually use the host db Bulk/Insert
    • Finally, the process either saves processed data or captures rejected rows
    • Again, once all the CONTROL records are processed the job completes using the ‘tPostJob’, closing the database connections and logging its successful completion


This is just the beginning.  There will be more to follow, yet I hope this blog post gets you thinking about all the possibilities now.

Talend is a versatile technology and, coupled with sound methodologies, job design patterns, and best practices, can deliver cost-effective, process-efficient and highly productive data management solutions.  These SDLC practices and Job Design Patterns present important segments for the implementation of successful methodologies.  In my next blog, I will augment these methodologies with additional segments you may find helpful, PLUS I will share more Talend Job Design Patterns!

Till next time…

The Six Biggest GDPR Pitfalls Everyone Must Avoid

There’s a little over a month left before May 25th, the date on which businesses that handle personal data of EU citizens will have to comply with the terms of the General Data Protection Regulation (GDPR).

Numerous reports suggest that up to half of businesses are still not ready to comply with GDPR – and are therefore at risk of incurring significant fines, as well as failing to meet customers’ and employees’ new rights to access subject data – and time is running out.

The GDPR itself is a complex piece of legislation. Thus, if your organization needs to be compliant, there’s no time to waste. So here is a look at some of the common misconceptions that could bring significant financial, customer loyalty and brand equity penalties to your business.

1) Believing GDPR only applies to companies based in the EU.

Any business that supplies products or services in the EU or deals with personal data of EU citizens will be bound by the terms of the GDPR (yes – including the UK, where Brexit will not come into effect until well after the deadline for GDPR compliance). ExtaCloud CEO Seb Matthews says in this interview that many organizations based outside of the EU are acting like “Ostriches burying their head in the ground and hoping that this regulation doesn’t apply to them.” Not a great business strategy, considering:

  1. It does apply to them, and
  2. They face fines of up to 20 million Euros—or 4% of global turnover (whichever is greater)—if they get it wrong.

“Not only is the GDPR compliance deadline quickly approaching, but—along the way—its very existence has raised the awareness level of all citizens in the digital age when it comes to their rights to data privacy,” said Jean-Michel Franco, Senior Product Marketing Director, Talend. “Additionally, as data privacy scandals continue to make headlines, citizens across EMEA, APAC, and NORAM are putting pressure on governments and organizations of all sizes, across all industries, to improve personal data privacy practices and controls.”

2) Failing to understand what personal data you are storing and processing

The EU’s data protection directive – the basis for the GDPR—defines personal data as “any information relating to an identified or identifiable natural person.” This definition is much broader than what we used to consider in the past as Personally Identifiable Information, or PII. In the big data and machine learning age, personal data includes far more than just e-mails, International Bank Account Numbers, phone numbers or customer IDs. This type of data can also include social media details, clickstreams, geo-localization, biometric data, voice recording, customer service logs, etc.

To get an accurate and comprehensive view of personal data across an enterprise and reclaim control over this information, it’s wise to implement a data lake where the entire organization can capture and reconcile personal data across systems, map lineage to original and/or related data sources and targets, apply controls and audit trails on top, and share that data internally with appropriate stakeholders, or externally with data subjects.


3) Not knowing where personal data is stored, and/or how to delete its master record upon request

Customer and employee data are usually widely dispersed across an organization, both in the cloud and on-premises, on local file systems or distributed file systems like Hadoop HDFS, as structured or unstructured data, etc. You might not have a 360° view, or even an up-to-date technical index recording of where every data record is kept. GDPR imposes restrictions on the transfer of personal data outside the European Union. Thus, not only do IT leaders need to know where personal data resides because of location regulations, but GDPR also gives EU residents the right to request that their personal information records be deleted for any number of reasons. In the case of such a request, IT leaders need to make sure they can remotely trigger the deletion of personal data, even if it is stored in file systems, proprietary cloud apps, or archive systems where record-level deletion might not be as straightforward.


4) Companies can use data for reasons other than stated when it was collected

One of the principles that will come into effect with GDPR, called purpose limitation, is that personal data can only be used for the express purposes for which permission has been given by constituents. This means that if you’ve collected data about your customers who have placed complaints to better understand their grievances, you can’t subsequently use that information to offer them special deals or hit them with targeted advertising.

As a result, businesses must take extra measures to ensure they fully and accurately explain what they intend to do with personal data at the time of collection. How wide the scope for interpretation and ambiguity of this aspect of GDPR will be won’t, of course, be understood for some time. For example, will simply stating that data is being collected “for customer service purposes” cover everything you might need to convey? Possibly, but if not, your company could end up in court, and that is an expensive gamble with not only your job, but company money, brand value and reputation as well.

GDPR also forces companies to reassess their applications – new or legacy – through the lens of data privacy. A typical example is a data warehouse, or a data lake. When it contains personal data, many companies are considering anonymizing it with data masking unless they fully understand who can access this data and for what purpose.

5) Companies can bypass checking consent agreements for legacy data

The standards that must be met under GDPR to show that any personal data processing you are doing is in line with the consent granted by the data subjects (people) are high. Most pertinently, consent is not considered to have been given unless it was collected under an “opt-in” framework rather than an “opt-out” one. That is, EU residents must have clearly given consent for their data to be stored and processed, rather than simply failed to have withheld consent.

This applies to all personal data being processed – whether it was collected before or after GDPR comes into force. This means that consent agreements must be scrupulously checked wherever possible, and if there is any doubt, err on the side of caution and don’t use the data. If you can’t prove to yourself – as well as to the regulator and the data subject referenced by the data you process – that you have permission to use something, then you are at risk.

6) It is just about data protection

A very common misconception—perhaps due to the name of the regulation itself—is that GDPR is just about data protection. One fundamental principle of GDPR is that a company should allow a Data Subject – your customer, prospect, or employee – to take control of their data.  For example, your organization must respond without delay, and at the latest within one month, to your customers’ or employees’ requests for data accessibility, data portability, data rectification, or the right to be forgotten. Many companies, such as Apple or Facebook, have released or announced a privacy control center where any data subject can see their personal data, specify their consent preferences by opting in or out of personalized services, ask to be forgotten, etc. Is your organization ready to deliver this kind of customer service to secure and reinforce your customer and employee relationships while complying with the regulation?

In Summary:

While most citizens see GDPR as significant progress with regards to human rights, many businesses perceive it as a constraint. But what if they considered that data privacy could give them a competitive edge, allowing them to increase the impact of personalization on their bottom line, backed by a system of trust that not only complies with the regulation but also encourages customers to share their personal data for better service?

Read More: The Countdown to GDPR Compliance Begins, Are You Ready? – MapR


Apache Spark and Talend: Performance and Tuning

I first want to start by thanking all the readers of my previous 2 blogs on the topic of Talend and Apache Spark.

If you are new to this blog series and haven’t read my previous posts, you can start reading them here “Talend & Apache Spark: A Technical Primer” and part two here “Talend vs. Spark Submit Configuration: What’s the Difference?”.

The first two posts in my series about Apache Spark provided an overview of how Talend works with Spark, where the similarities lie between Talend and Spark Submit, and the configuration options available for Spark jobs in Talend. 

In this blog, we are going to take a look at Apache Spark performance and tuning. This is a common discussion among almost everyone that uses Apache Spark, even outside of Talend. When developing and running your first Spark jobs, the following questions always come to mind.

  • How many executors should I allocate for my Spark job?
  • How much memory does each executor need?
  • How many cores should I be using?
  • Why do some Spark jobs take hours to process 10GB worth of data and how do I fix that problem?

In this blog, I’m going to go through each of these questions and provide some answers and insight. Before we proceed any further with this topic, let’s introduce some key concepts that are going to be used throughout this blog:

Partition: A partition is a portion of a distributed dataset. By default, its size is determined by the HDFS block size. Spark uses partitions to process datasets in parallel.

Tasks: Tasks are the units of work that can be run within an executor.

Core: A core is the processing unit within a CPU that determines the number of parallel tasks in Spark that can be run within an executor.

Executor: A process started on a worker node that runs your submitted job’s tasks, holding data in memory or spilling it to disk.

Application Master: Each YARN application spins up an application master process that has the responsibility to request resources from the resource manager. Once the resources are allocated the process then works with node managers to start the required containers within them.

Spark Tuning

To begin, let’s go over how you can tune your Apache Spark jobs inside Talend. As mentioned previously, in your Talend Spark Job you’ll find the Spark Configuration tab where you can set tuning properties. These tuning options are unchecked by default in Talend.

In this section, you are given the option to set the memory and cores that your application master and executors will use, and how many executors your job will request. Now, the main question that comes up once you start filling in the values in this section is: “How do I determine the number of cores or the amount of memory my application master or executors need for good performance?” Let’s tackle that question.

How to Choose the Number of Cores for Your Spark Job

At this point, there are a few factors that we need to consider before proceeding any further. These are:

  1. The size of our datasets
  2. The time frame in which our job needs to be completed
  3. The operations and actions that our job is doing

Now with those factors in mind, we can start configuring our job to maximize performance. Let’s first start with tuning our application master. For the application master, we can leave the default values since it only does orchestration of resources and no processing which means there is no need for high values of memory or cores.

Our next step is to configure the memory and cores for our executors. The main question here is how many executors, how much memory, and how many cores should be used. To find the answer, let’s imagine we have a Hadoop cluster with 6 worker nodes, each with 32 cores and 120GB of memory. The first thought that probably comes to mind is that the more concurrent tasks we can run per executor, the better our performance will be. However, performance tuning guides from Hadoop distributions such as Cloudera have shown that using more than 5 cores per executor leads to poor HDFS I/O. As a result, the optimal number of cores per executor for performance is 5.

Next, let’s look at how many executors we want to launch. Based on the number of cores and nodes, we can easily determine this number. As mentioned, 5 cores is the optimal number per executor. Now, from each node’s 32 cores we must subtract the ones we cannot use for our jobs, as they are needed by the operating system and the Hadoop daemons running on the node. The Hadoop cluster management tool already does that for us, making it easy to determine how many cores per node are available for our Spark jobs.

After making that calculation, let’s assume we are left with 30 cores per node that can be used. Since we already determined that 5 cores is the optimal number per executor that means that we can run up to 6 executors per node. Easy!

Finally, let’s finish up by calculating the amount of memory that we can use. Based on the hardware specs above, we see that there is 120GB of memory per node, but as I mentioned when discussing cores, we cannot use all of that memory for jobs, as the operating system needs some of it. Again, our Hadoop cluster management tool can determine for us how much of that memory can be used for jobs. If the operating system and Hadoop daemons require 2GB of memory, that leaves us with 118GB for our Spark jobs. Since we have already determined that we can have 6 executors per node, the math shows that we can use up to roughly 20GB of memory per executor. This is not quite the whole story, though, as we also need to account for the memory overhead that each executor carries. In my previous blog, I mentioned that the default for this overhead is 384MB. Subtracting that from the 20GB, roughly 19GB is the most I can give to an executor.
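The sizing arithmetic above can be captured in a small helper. This is a plain-Python sketch, not Talend or Spark code; the defaults (2 reserved cores, 2GB reserved memory, 5 cores per executor, 384MB overhead) are the example values used in this post.

```python
def executor_sizing(cores_per_node, mem_per_node_gb, reserved_cores=2,
                    reserved_mem_gb=2, cores_per_executor=5,
                    overhead_mb=384):
    """Rough executor sizing for one worker node in a YARN/Spark cluster."""
    # Cores left after the OS and Hadoop daemons take their share
    usable_cores = cores_per_node - reserved_cores
    executors_per_node = usable_cores // cores_per_executor
    # Memory left after the OS and Hadoop daemons take their share
    usable_mem_gb = mem_per_node_gb - reserved_mem_gb
    mem_per_executor_gb = usable_mem_gb / executors_per_node
    # Subtract the per-executor memory overhead (default 384 MB)
    heap_gb = mem_per_executor_gb - overhead_mb / 1024.0
    return executors_per_node, heap_gb

# The example from this post: nodes with 32 cores and 120 GB of memory
execs, heap = executor_sizing(32, 120)
print(execs, round(heap))  # 6 executors per node, ~19 GB heap each
```

Plugging in the post’s example numbers reproduces the conclusion above: 6 executors per node with roughly 19GB of usable heap each.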

Dynamic vs. Fixed Allocation of Cluster Resources

The numbers above will work for either fixed or dynamic allocation of cluster resources in a Spark job. The difference is that with dynamic allocation you can specify the initial number of executors, a minimum number the job can fall back to when the workload is light, and a maximum number for when more processing power is needed. Although it would be nice to use all the resources of the cluster for our job, we need to share that processing power with the other jobs running on the cluster. As a result, the requirements we identified when going over the tuning factors defined earlier will determine what percentage of those maximum values we can use.
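For reference, the dynamic-allocation behavior described above corresponds to a handful of standard Spark properties, which Talend sets on your behalf from the Spark Configuration tab. The values below are illustrative, not recommendations:

```
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.initialExecutors=6
spark.dynamicAllocation.minExecutors=2
spark.dynamicAllocation.maxExecutors=36
# Dynamic allocation on YARN also requires the external shuffle service
spark.shuffle.service.enabled=true
```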

With our job configured, we can now proceed with actually running it! Let’s say that we still notice, though, that our Spark job takes a long time to complete, even when it is set to the maximum settings defined above. We need to go back to our job and look at a few more settings to ensure they are tuned for maximum performance.

Spark Performance

First, let’s assume that we are joining two tables in our Spark job. One of the factors we considered before starting to optimize our Spark jobs was the size of our datasets. Now, when we look at the size of the tables and we determine that one of them is 50GB and the other one is 100MB, we need to look and see if we are taking advantage within the Talend components of replicated joins.

Replicated Join

A replicated join (also called a map-side join) is widely used when joining a large table with a small table, by broadcasting the data from the smaller table to all executors. In this case, since the smaller dataset can fit in memory, we can use a replicated join to broadcast it to every executor and optimize the performance of our Spark job.

Since the table data needs to be combined at the executor level with the side data, broadcasting the smaller dataset to all executors avoids sending the data of the larger table over the network. Many of the performance issues in Spark occur because of the shuffling of large amounts of data over the network. This is something that we can check easily within the Talend job – by enabling the “Use replicated join” option in the tMap component as shown below. This will broadcast the data of the lookup table to all the executors.
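To illustrate the mechanics outside of Talend, here is a minimal plain-Python sketch of a map-side join: the small lookup table is broadcast (copied) to every “executor”, so each one joins only its local partition of the large table and no large-table rows cross the network. The datasets and names here are hypothetical.

```python
# Large table, pre-partitioned across two "executors": (sale_id, brand_id)
partitions = [
    [(1, "b1"), (2, "b2")],
    [(3, "b1"), (4, "b3")],
]

# Small lookup table, broadcast as a dict to every executor
brands = {"b1": "Toyota", "b2": "Ford", "b3": "BMW"}

def map_side_join(partition, broadcast):
    # Each executor joins its local partition against its broadcast copy,
    # so the large table's rows never need to be shuffled
    return [(sale, broadcast[brand]) for sale, brand in partition]

joined = [row for part in partitions for row in map_side_join(part, brands)]
print(joined)  # [(1, 'Toyota'), (2, 'Ford'), (3, 'Toyota'), (4, 'BMW')]
```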

The next step is to check whether our job has operations that perform expensive re-computations.

Spark Cache

To walk through re-computations, let’s consider a simple example of loading a file with customer purchase data. From this data we want to capture a couple of metrics:

  • the total number of customers
  • the number of products purchased

In this case, if we don’t use a Spark cache, each operation above will load the data again. This hurts performance, as it causes an expensive re-computation. Since we know that this dataset will be needed later in the job, it is best to use the Spark cache to keep it in memory for later use so that we don’t have to keep reloading it.
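The effect can be sketched in plain Python by counting how often the source is read. The purchase data and the loader function are hypothetical stand-ins for the file read in the example above.

```python
load_count = 0

def load_purchases():
    """Simulates reading the purchase file; each call is an expensive read."""
    global load_count
    load_count += 1
    return [("alice", "bike"), ("bob", "car"), ("alice", "car")]

# Without caching: each metric triggers its own load (two reads)
total_customers = len({customer for customer, _ in load_purchases()})
total_products = len({product for _, product in load_purchases()})
assert load_count == 2

# With caching: load once and reuse the in-memory dataset (one read)
load_count = 0
cached = load_purchases()
total_customers = len({customer for customer, _ in cached})
total_products = len({product for _, product in cached})
assert load_count == 1
```

The tCacheIn/tCacheOut components described below play the role of the `cached` variable here: the dataset is materialized once and reused.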

Within our Talend Spark jobs, this is done with the tCacheIn and tCacheOut components, which are available in the Apache Spark palette in Talend and let you use the Spark caching mechanism with its different options.

You can choose to cache the data on disk only, and you can have the cached data serialized for memory, disk, or both. Finally, you can have the cached data replicated on 2 other nodes. The most used option is memory without serialization, as it is the fastest. However, when we know that the cached RDD cannot fit in memory and we don’t want it to spill to disk, serialization is selected: it reduces the space consumed by the dataset, but at an additional cost of overhead that affects performance. As a result, you should evaluate your options and select the one that best fits your needs.

If performance issues still remain after all that, we need to start looking at the Spark History Web Interface to see what is happening. As mentioned in my previous blog, in the Spark History section of the Spark Configuration in Talend we can enable Spark logging. Spark logging helps with troubleshooting issues with Spark jobs by keeping the logs after the job has finished and making them available through the Spark History Web Interface. Having Spark event logging enabled for our Spark jobs is a best practice and makes it easier to troubleshoot performance issues.

With Spark event logging enabled, you can go to the Spark History Web Interface where we see that we have the following tabs when looking at the application number for our job:

In our Spark UI above we want to look at the stages tab, identify the one that is affecting the performance of our job, go to the details of it, and check to see if we are seeing something like the behavior below:

What we see is that only one executor is processing most of the data while the rest are idle, even though we have allocated 10 executors. Now, why does this happen? To answer that, we need to identify the stage of the job where the problem exists. As an example, we notice that it is happening at the part of the Spark job where we read the data from a compressed file. Since archive files aren’t partitioned at read by default, an RDD with a single partition is created for each archive file we read, which causes this behavior. If the compressed file is in a splittable archive format like BZIP2, which can be partitioned at read, then in the advanced settings of tFileInputDelimited we can enable the “Set Minimum Partitions” property and, as a starting point, set at least as many partitions as we have executors.

But in the case of an archive file like GZIP that cannot be repartitioned at read, we can explicitly repartition it using our tPartition component. This component, as shown below, allows us to repartition the file so that we can distribute the load equally among the executors.

Partitioning at read can also be used when reading from a database using our tJDBC components, using the following properties:

The re-partitioning above can only be applied in certain situations. If we determine that our dataset is skewed on the keys we are using for the join, different methods need to be used. So how can we identify data skew? Start by looking at the dataset by partition and see how the data is grouped among the keys we are using for the join. Here is an example of how a skewed dataset would look by partition:

In this case, if we cannot repartition by a different key, we should look at other methods to improve our Talend Spark job. One widely used technique is called “salting”. With salting, you add a fake key to your actual key to even out the distribution of data across partitions. This can be done through our tMap component in a Spark job, for example like this:


As we can see above, at the tMap level we add the fake key as a random number, and we join on our actual key plus the fake key against the lookup dataset, to which we have also added the fake key. Since the join happens on our actual key plus the fake key we generated for distribution, this helps avoid the skewed partitions that can hurt performance when joining datasets in Spark.
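Outside of Talend, the salting idea can be sketched in plain Python. The datasets, the brand names, and the `NUM_SALTS` value below are all hypothetical; the point is that one hot key is spread across several salted keys, while the replicated lookup keeps the join correct.

```python
import random
from collections import Counter

NUM_SALTS = 4  # how many fake-key values to spread each hot key across

# Skewed main dataset: almost every row carries the same join key
rows = [("hot_key", i) for i in range(100)] + [("rare_key", 0)]

# Salt the main dataset: append a random suffix to each actual key
salted = [((key, random.randrange(NUM_SALTS)), val) for key, val in rows]

# Replicate each lookup row once per salt so the salted keys still match
lookup = {"hot_key": "Toyota", "rare_key": "BMW"}
salted_lookup = {(key, s): v for key, v in lookup.items()
                 for s in range(NUM_SALTS)}

# Join on (actual key, salt); the hot key's rows now land in NUM_SALTS
# different partitions instead of one
joined = [(k, v, salted_lookup[k]) for k, v in salted]
per_partition = Counter(salt for (_, salt), _ in salted)
print(dict(per_partition))  # roughly 100 / NUM_SALTS rows per salt value
```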


There are many different techniques that we can use to improve the performance and tune our Talend Spark Jobs. I hope the few that we went through in this blog were helpful. Now go forth and build more Spark jobs on Talend!




Why Paddy Power Betfair Bet on a Cloud Architecture for Big Data


Sports betting and e-gaming companies operate in a fast-paced, highly competitive and regulated market. They’re open 24/7/365, and are constantly striving to provide customers with the best online sports betting experience or risk seeing those customers go to a competitor.

Formed in 2016, Paddy Power Betfair (PPB) is the world’s largest publicly quoted sports betting and gaming company, bringing “excitement to life” for five million customers worldwide. The merger of Paddy Power and Betfair in 2016 created an additional data challenge for an already highly data-driven organization. The merged company had to bring together 70TB of data, from dozens of sources, into an integrated platform.  

A high volume/low latency environment

To track betting and games results down to the second and deliver the best customer experience, PPB needs to manage massive volumes of data and then access and serve it with very low latency levels. But PPB’s data volumes are not only massive, they vary greatly over time.

In order to deal with the huge spikes in usage on days with popular sporting events, like the United Kingdom’s Grand National horse race, Paddy Power Betfair is leveraging the cloud (a virtual private cloud environment to be more specific) with a data platform on AWS using S3, Redshift and Aurora. Talend fits perfectly with PPB’s cloud strategy.


Providing a better customer experience while cutting data delivery times in half

Using Talend Real-Time Big Data and the cloud, PPB gets a 360-degree customer view, drives customer engagement and optimizes the betting experience. The goal is to make sure its gamers keep coming back and that they enjoy a quick, reliable, and glitch-free experience, even during large sporting events.

To better understand their customer’s preferences, PPB went back to data. Big data analysis outlined two customer segmentation types across both brands, with little overlap. Betfair customers tend to be “money-centric,” focusing on the best odds and the site’s reliability. They’re also more interested in football and racing bets. Paddy Power customers are “thrill-seekers” and “social,” seeking the best entertainment and ease-of-use, and with more emotional attachment to the brand. The business also gets timely and consistent measurements of business performance across products, channels and brands.

In the world of online betting, data agility is critical. 

From GDPR to Customer Trust: Is Your Data Ready to Protect Customer Privacy?

Five Pillars that will Help You Chart the Best Course for Success using Talend & MapR

Is it just me, or does there somehow seem to be an eerie correlation between the quickly approaching May 25th deadline for compliance with the General Data Protection Regulation (GDPR) and the increasing number of reported privacy violations, leaks, and complete system failures that are capturing headlines? Coincidence…or not?

All ‘conspiracy theory’ aside, over the last few weeks we’ve heard about Chief Security Officers, Chief Information Officers, and even CEOs losing their jobs following data breaches that exposed their customers’ sensitive data to external parties. But the repercussions aren’t solely limited to an individual or department. A breach of this magnitude can cost a company not only up to billions of dollars in fines, but also a loss of public trust, brand deterioration, and significant loss of business. Take, for example, the recent Uber incident, in which the self-proclaimed ‘digital native’ taxi-alternative company failed to alert regulators across the world to a mass data breach that potentially put the personal details of 57 million customers and drivers into the hands of cyber criminals. In the UK, the cost for Uber could also include the non-renewal of its license to operate in the capital by Transport for London—something that will likely have a significant impact on its revenues.

Each day we’re seeing concrete cases illustrating the rising costs of penalties for capturing data without customer consent, or the fact that a loss of control over personal data could have a billion dollar impact on a company’s market valuation.

The impact of GDPR is huge, not only as a regulation that ‘punishes’ companies that fail to comply with severe penalties, but also because data subjects – i.e. any European-based citizen who is an employee, customer, visitor, or user of your company’s products or services – now understand their new rights in the digital age and are starting to ask the right questions, take the right steps, and put up safeguards against companies in order to protect themselves. At the same time, the voice of non-European citizens is getting louder when it comes to similar privacy rights and issues.

In fact, a recent survey by Pega Systems shows that data subjects (i.e. citizens) may be more prepared for GDPR than the companies with which they do business: 82% of European consumers plan to exercise their new rights to view, limit, or erase the information businesses collect about them. To the same extent that they leveraged their new right to be forgotten on Google after the European courts ordered the company to allow it in May 2014, data subjects are feeling empowered by their new rights and will undoubtedly be more mindful of the personal data they share with any vendor at any time.

So, what does this mean for IT Leaders? We think there are two main things to consider:

  1. GDPR is much more than just another compliance regulation. It’s also a customer engagement issue, a call to action for establishing a system of trust when engaging with consumers—your customers—now that digital transformation has turned them into data experts. You’re no longer dealing with a naïve generation of constituents.
  2. GDPR should be perceived as a data management project. While most companies are still mistakenly asking themselves, “Are we ready for GDPR?”, with a focus on internal processes, policies, and organization, what they REALLY should be asking is: “Is our data GDPR-tested, consumer and government approved?” 

Benchmarking surveys (like those from IAPP/Ernst & Young, or Deloitte) show that the toughest challenges relate to the second question. Most GDPR initiatives get stuck in paperwork and fail to enable companies to get hands-on with the intimate details of protecting the personal data they possess. As a result, topics like consent management, data subject access rights, data portability, or the right to be forgotten are not addressed. I would call this a ‘band aid’ approach to GDPR—it may be a satisfactory first step to show regulatory authorities that work is underway to assess the risks and address the legal issues. However, this ‘band aid’ approach will fall far short of winning customer trust, which would result in a far more costly business impact than the fines you’re likely to incur from government entities.

Organizations should rather get hands on with their data and make sure they address the five pillars to get their data ready for GDPR:

  • Know their personal data, by continuously maintaining a map of the personal data that flows across the organization
  • Create a data subject 360° view where they can collect, connect, and protect all the personal information that they intend to maintain
  • Protect their data against leakage and misuse, and ensure data is anonymized when it is processed outside the scope of what legitimate interest or consent allows
  • Foster accountability by delegating responsibility for personal data to the stakeholders who contribute to the related data processing activities
  • Know where the data is and when the data moves across borders, while opening personal data to the rights of the data subject. This is crucial to enact the rights to data access, data portability, rectification, or the right to be forgotten.

In part II of this series, we’ll see how MapR and Talend can accelerate your data privacy compliance across those five pillars.

Introducing the Talend Architecture Center – Your One Stop Resource for Best Practices, Architectures and More

Today, Talend is pleased to announce the release of the Talend Architecture Center. This site provides you – our customers and partners – reference architectures and guidance for building scalable and secure solutions using Talend products.

The goal of this Architecture Center is to publish logical and physical architectures in order to help you understand the various components in our products, the relationship and interactions between them, and best practices around their deployment. We also provide guidelines on scalability, availability, security and sizing. This is published on a GitHub repository – anyone is welcome to review the content, try it out in your environment and provide us your feedback.

The content starts off with version 6.4, and our goal is to keep it up to date and to generate new content for every subsequent major version of the product.

Where it All Began

In the past, whenever a customer or partner would request our assistance in understanding our products or architecting a solution, one of our Talend team members would provide them the required assistance. The guidance can come from any of our client-facing teams – Customer Success Architecture, Professional Services, Sales Engineering, Enablement etc.

Late last year, we decided to create an internal forum to bring all these great ideas together and create a standard reference architecture that we could share with our customers/partners. As our footprint in the enterprise scaled up quickly, we realized that for us to scale and for our customers to scale, it would be a great idea to capture and publish all architectural artifacts in a public repository that our customers could easily access. A few of us at Talend got together to brainstorm and standardize our content. The Talend Architecture Center then came about as a result of that collaborative effort.

The architectural artifacts that you’ll find are based on years of experience working in Talend deployments of all sizes and forms. We also leveraged existing architectures that we have come across and mined those architectures for best practices and patterns that work and eliminated things that don’t.

What Kinds of Content Can I Find?

Talend has several products that help organizations manage their data – Data Integration, Big Data, Cloud, Real Time, MDM, Data Quality etc. Many of our customers also leverage Talend Data Fabric that includes all the products mentioned above in a single, unified platform. We have organized our content based on our products. If you use one of our platform products (more than one product), we provide architectural guidance on the platform.

For each of the products/platforms, we provide you details on:

  • Logical Reference Architecture – an overall architecture describing the various Talend components grouped into functional blocks based on their role in their architecture and the interoperability between themselves. The various functional blocks include:
    • Clients (for building and monitoring Talend jobs),
    • Servers (for administration, management, and monitoring),
    • Databases (to store metadata and configuration information),
    • Repositories (to host project metadata and binaries), and
    • Execution Servers (to deploy and launch jobs)
  • Physical Reference Architecture – provides a template for physical deployment of the logical architecture. We cover guiding principles based on our best practices, and deployment scenarios and strategies for various environments – Development, QA, Pre-Production and Production. This includes high availability and scalability options for your Production deployment.
  • Recommended Sizing – we provide sizing guidance for various components in the architecture. This is based on recommendations from our Product team as well as our real-world experience deploying this architecture for our customers.

Big Data Architecture Example

How Do I Use It In My Own Talend Deployment?

Our main goal is to trigger your thought process as customers/partners. You can use this content to map the reference architecture to your current and future business requirements and subsequently build an architecture for your enterprise. You can browse through and study the relevant portions of the website based on the products you have been licensed for. We also encourage you to browse through the other products so that you are ready to scale up when the need arises. If you are planning to migrate to a later version of the product, this is a good starting point to understand the architectural changes of the new version and its impact.

If you have scenarios that are not covered here, please feel free to share them with us. This will help us improve our reference architecture based on feedback from real-world implementations of our platform.

Ready, Set, Build

This blog introduced you to the Talend Architecture Center and the content available there. The content captures several years of knowledge and subject matter expertise within various organizations in Talend. It includes detailed information about Logical and Physical architecture for all Talend products and platforms. In the future, we would like to include other areas of architecture e.g. architectural styles and design patterns (Batch, Real Time, Streaming, Event Driven), Solution Architecture (Data Lakes, Data Warehouse, ODS, Data Migration) and other topics that could serve as a valuable input for our customers building solutions using Talend. We welcome your feedback on this work. You can either leave comments on this blog or at the Talend Community discussion forums.

Everything You Need to Know About IoT – Hardware


Hello! After joining Talend 4 months ago, I’m now getting around to publishing my first blog. Since this is my first entry, I decided to dive into the details on one of my favorite topics, the Internet of Things. This is the first entry in a series I’ll be writing around the Internet of Things where we’ll go over all the parts of an IoT project, starting from the edge device continuing to data ingestion and its integration to business applications. If you scroll around the internet on any tech site, you’ll find a lot of blogs and articles about IoT out there. We’ve even done some pretty heavy writing on Talend’s website so I’ll try not to make this one “Yet another IoT article!”

NOTE: I really recommend that you have a look at a few of the articles: “How to Simplify your IoT Platform with Talend”, “Sensors, Environment and Internet of Things”, “Applying Machine Learning to IoT Sensors”.

The Internet of Things – Hardware Considerations

The purpose of this first article is to introduce the fundamental concepts around the Internet of Things but from a device point of view. More precisely the kind of hardware you will be dealing with rather than simply looking at the data collection or software aspects.

I’m not going to sum up everything that you’d be able to find on your own; rather, I’ll try to explain the basic things you need to take into consideration when you are just getting started, because it’s the wild wild west out there. There are a lot of devices, different types of microcontrollers, and programming paradigms to choose from. Before we really get into this, we should ask ourselves a couple of questions:

  • Why all the buzz around the Internet of Things?
  • Why now?

Why all the buzz around the Internet of Things?

When you look at predictions around the growth in the IoT market, you see a correlation between the growth of connected devices and the growth of data volumes.

The mass adoption of this technology means it could potentially change the society we live in. For those who think that it’s “utopian”, I really recommend reading Jeremy Rifkin’s book “The Zero Marginal Cost Society”, and you’ll see that we are not that far from becoming a “collaborative commons society”.

All types of companies, from startups to major enterprises like GE, are seizing this new market opportunity and have developed what they call “self-service solutions” and platforms ready for you to tackle IoT head on. However, these tend to fall short because they are most likely proprietary tech, built for specific needs and requiring intensive development.

Ok alright, but why now?

I believe we are currently in the era where 3 technology forces are converging. 

DATA: It’s the “Big Bang” for data and analytics and we have the technology to ingest and store the massive amount of data using open source technologies such as Hadoop and its ecosystem as well as running batch, interactive or real-time analytics also using open source frameworks such as MapReduce and Apache Spark.

CLOUD: What started as SaaS has fully evolved and matured into full infrastructures and platforms. We are now only a few clicks and minutes away from creating an entire IT infrastructure and business solution in the cloud.

ELECTRONICS/ROBOTICS: We see a lot of innovation in sensors capturing new types of data about our movement and our environment. You can easily start a robotics project with your kid using a cheap and accessible development board. Moreover, anyone can prototype an innovative idea and get it crowdfunded.

Ubiquitous Computing – The Start of IoT

Ok, so let’s start at the beginning now that we understand the market opportunity. The Internet of Things is not new. Originally, Ubiquitous Computing, a term coined by Mark Weiser in 1991 in “The Computer for the 21st Century”, referred to a paradigm shift in which the general-purpose machine (the PC) would be replaced by a large number of specialized computers embedded into everyday objects. A typical application of this is the smart home. So, Ubiquitous Computing as a vision was much more than technology; it dealt with the question of how users would interact with an environment that is physical but enriched with computing at the same time.

Simply put, ubiquitous computing is everywhere. Pervasive computing is somewhere (like your desktop). In practice, I don’t think there is much difference and they are just different terms to describe the same computing models.

Ubiquitous or pervasive computing is a concept in computer science and engineering where computing is made to appear everywhere and anywhere. In contrast to desktop computing, ubiquitous computing can occur using any device, in any location, and in any format. A user interacts with the computer, which can exist in many different forms, including laptop computers, tablets, and terminals embedded in everyday objects such as a fridge or a TV.

The Internet of Things

The term “Internet of Things” was coined by Kevin Ashton in 1999. The Internet of Things is really more of a marketing buzzword; I would consider it an application of pervasive or ubiquitous computing. In this paradigm, common everyday devices and objects are assumed to be smart computing devices connected to the Internet and capable of communicating with users and with other similar devices. The Internet of Things was originally thought of as extending the principles of the Internet as a network organization concept to physical things.

Whereas ubiquitous computing was designed to make objects intelligent and create richer interaction, the Internet of Things was much more focused on virtual representations of automatically identifiable objects.

When Kevin Ashton stated, “We’re physical, and so is our environment. Our economy, society and survival aren’t based on ideas or information—they’re based on things. You can’t eat bits, burn them to stay warm or put them in your gas tank.”, I believe he was trying to tell us not to forget that today’s computers—and, therefore, the Internet—are almost wholly dependent on human beings for information.

Almost all of the data available on the Internet was first captured and created by human beings—by typing, pressing a record button, or taking a digital picture. The problem is, people have limited time, attention, and accuracy—all of which means they are not very good at capturing data about things in the real world, and that is why we use electronics to do it.

Now that we better understand the concept – let’s dive into the types of hardware that make the IoT concept actually work in the real world. 

What are the “Things” in IoT?

System On Chip (SOC)

SOCs are electronic systems that integrate a Microcontroller Unit (MCU) as one of their components. An SOC typically runs on a single MCU, which can be 8-, 16-, or 32-bit. It also exposes General Purpose Inputs/Outputs (GPIO) and is programmable using a toolchain that compiles the code (the firmware) and loads it into the MCU’s permanent memory (typically flash).

The most popular board over the past 7 years has been the Arduino, an Atmel-based (and later ARM-based) development board that comes with an integrated development environment (the Arduino IDE), a simple programming paradigm, and the most common and widely spread form factor.

Another SOC became very popular a few years later: the NodeMCU, also known as the ESP-01 or ESP8266. It’s a kind of smaller, cheaper, lower-consumption Arduino Uno on steroids: it has more memory, a faster processor, and embedded Wi-Fi, but fewer GPIO pins. It is a bit trickier to start programming, but it integrates nicely with the Arduino IDE.

We’ve recently seen some market consolidation in this space, as Atmel was acquired by Microchip and ARM by SoftBank.

Industrial Microcontroller (PLC & RTU)

In the industrial world, digital computers have to be rugged and adapted for the control of manufacturing processes, such as assembly lines, robotic devices, or any activity that requires high-reliability control, ease of programming, and process fault diagnosis. There are two well-known types of industrial controllers you should know: the Programmable Logic Controller and the Remote Terminal Unit.

A Programmable Logic Controller (PLC) is basically a gigantic microcontroller. It does the same things a microcontroller can do, but with higher speed, performance, and reliability where a microcontroller is really just a tiny low power CPU or computer with some output registers wired to pins.

A Remote Terminal Unit (RTU) is a microprocessor-controlled electronic device that interfaces objects in the physical world to a distributed control system by transmitting telemetry data to the system and/or altering the state of connected objects based on control messages received from the system.

The functions of RTUs and PLCs largely overlap, but in terms of usage, RTUs tend to be a better fit for wide geographic telemetry, while PLCs are best suited for local area control.

Single Board Computer (SBC)

A single-board computer (SBC) is a complete computer built on a single circuit board, with a microprocessor(s), memory, input/output (I/O) and other features required of a functional computer. Single-board computers were made as demonstration or development systems, for educational systems, or for use as embedded computer controllers. Many types of home computers or portable computers integrate all their functions onto a single printed circuit board. Unlike a desktop personal computer, single board computers often do not rely on expansion slots for peripheral functions.

The first true single-board computer, the “dyna-micro,” was born in 1976. It was based on the Intel C8080A and also used Intel’s first EPROM, the C1702A. SBCs also figured heavily in the early history of home computers, for example in the Acorn Electron and the BBC Micro. Other typical early single-board computers like the KIM-1 were often shipped without an enclosure, which had to be added by the owner.

Nowadays we see a lot of these SBCs on the market, but in my opinion two families stand out from the crowd: the Raspberry Pi and the Onion Omega.

The Raspberry Pi is an SBC developed in the United Kingdom by the Raspberry Pi Foundation to promote the teaching of basic computer science in schools and in developing countries. The original model became far more popular than anticipated, selling outside its target market for uses such as robotics and the Internet of Things.

Several generations of Raspberry Pi have been released. All models feature a Broadcom system on a chip (SoC) with an integrated ARM compatible central processing unit (CPU) and on-chip graphics processing unit (GPU).

The Omega is a personal single-board computer created by a startup company called Onion and released on Kickstarter. It is advertised as “the world’s smallest Linux server”. The system combines the tiny form factor and power efficiency of the Arduino with the power and flexibility of the Raspberry Pi. You can find this SBC in a lot of devices around you; if you own a D-Link router, you may find that it is powered by an Onion Omega.


Let’s face it: microcontrollers, computers, smartphones, wearables, and the Internet of Things are already playing a big part in our day-to-day life, whether we interact with them or “they’re acting as an invisible servant”. Is this going to change the world? It kind of already has, and it’s not going to stop here.

Now that we know a bit more about the concepts and the history of IoT from a device perspective, stay tuned for the next episode of the series, which will be about the wild world of Internet of Things networks and messaging protocols.

How to Go Serverless with Talend & AWS Lambda

Recently I found myself in a predicament that many of you can relate to: trying to update an aging application that has become too difficult to manage and too costly to continue operating. As we started to talk about what to do, we concluded it was time to start decomposing that application into smaller, more manageable pieces.

We spent the rest of the afternoon whiteboarding what we were going to do. That whole time I kept imagining a car engine that had been completely stripped, with all the parts lying on the table; somehow, we needed to get it to run that way, and do so without costing more than it did when it was put together. After reviewing what we had, it became clear that, to support the goals we set, v2 was moving to the cloud, and for several parts that could be run on demand, v2 was going serverless.

For the rest of this post I’ll walk you through how to set up a Function as a Service running on AWS Lambda and built using Talend. This example will demonstrate how you could take input from a Kinesis stream and send the results of your job to another Kinesis Stream.

You’ll need a couple of things before we start in order to work along with me:

Talend Studio – Download your open source Talend studio here.

Eclipse – In the example below I used Eclipse Neon (Version 3) with the AWS Toolkit installed.

Now that you have everything installed, let’s create our Talend job.

Step-by-Step: Creating the Talend Job

Below, I have created a simple job which takes the value of a context variable at execution time, converts it to UPPERCASE, and stores the result in a buffer. Here are the Talend components I used:




Start by creating the context variable:

Next let’s take a look at our tFixedFlowInput settings:

Now that we have our settings in for tFixedFlowInput and our context variables, it’s time to set up our tmap component:

Finally, let’s finish up by setting up the tBufferOutput component. Its schema is filled in when you click “Sync Columns”:

The final output of our Talend job looks like this:

Save and close your job. Simple, right? After creating the job in Talend, you can now export it as a standalone job as shown below. Right-click your job in the Repository and click the “Build Job” option.

Next, choose your folder and click on Finish.
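Before wiring this into Lambda, it helps to see what the job’s logic amounts to in plain Java. This is only an illustrative sketch of the data flow, not Talend-generated code; the class and method names are mine:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative plain-Java equivalent of the Talend flow:
// context variable -> tFixedFlowInput -> tMap (uppercase) -> tBufferOutput
public class UppercaseFlow {
    private final List<String> buffer = new ArrayList<>();

    public void run(String contextValue) {
        String row = contextValue;          // tFixedFlowInput emits the context value
        String mapped = row.toUpperCase();  // tMap converts it to UPPERCASE
        buffer.add(mapped);                 // tBufferOutput collects the result
    }

    public List<String> getBuffer() {
        return buffer;
    }
}
```

Keeping the logic this small is deliberate: a function that does one transformation on one input is exactly the kind of workload that fits the Lambda pricing and execution model.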

Creating the Lambda Function in AWS

Now that we have our Talend job built and component settings in proper order, it’s time to work with AWS to create the Lambda functionality of the job. First, open your Eclipse with the AWS toolkit installed. Then, create your new AWS lambda project. Here’s how:

Click File -> New -> Project. When the New Project wizard opens, select AWS Lambda Java Project. The Create a New AWS Lambda Java Project wizard opens; change the settings as per the image below and click Finish.

It might take some time, so wait until Eclipse creates the skeleton of your project. At this point you are almost 50% done, so pat yourself on the back; we are almost there. The next window opens your class file.

Now, if you encounter an error near the @Override annotation, simply right-click on your project in the Project Explorer -> choose Properties -> Java Compiler -> adjust the compliance settings as shown below and click OK. This should resolve your issue.

Preparing the Lambda Code

Now let’s prepare our Lambda code. Unzip the standalone Talend job you exported earlier. You will see a folder called “lib” and a folder with the same name as your job.

Go to Eclipse, right-click the project name in the Project Explorer -> select Properties -> go to Java Build Path -> choose Add External JARs. Navigate to the “lib” folder, select all the JAR files, and click OK. Follow the same procedure to add the JAR file under the “test_job” folder as shown. Press Apply before you click OK to finish up.

Now let’s download the other external JARs our Lambda function needs:


Gson Jar

Import all the required JAR files as shown above. Now let’s begin writing the Lambda function. Let’s look at the code below:
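The idea of the function is simple: take the incoming value, hand it to the exported Talend job, and return the result. Here is a minimal sketch of what such a handler could look like; the class name, the handler shape, and the commented-out Talend calls are assumptions on my part, not the original code:

```java
// Minimal sketch of a Lambda entry point (names are assumptions).
// In the real project this class would implement
// com.amazonaws.services.lambda.runtime.RequestHandler<String, String>
// from the aws-lambda-java-core library.
public class TalendJobHandler {

    public String handleRequest(String input) {
        // In the deployed function we would delegate to the exported job,
        // e.g. (hypothetical, depends on your job/context names):
        //   test_job job = new test_job();
        //   job.runJobInTOS(new String[] { "--context_param inputValue=" + input });
        // and then read the result back from the job's buffer.
        // Here the job's logic (uppercase) is inlined so the sketch is self-contained:
        return input == null ? "" : input.toUpperCase();
    }
}
```

To go from here to the Kinesis scenario described earlier, you would change the handler’s input type to a Kinesis event and write the result to the output stream with the AWS SDK.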


Upload Your Code to AWS Lambda

Right click on your code -> select AWS Lambda -> Upload function to AWS Lambda 

Configure AWS Lambda

  1. Choose the AWS console region (remember: Lambda, the S3 bucket, and Kinesis should stay in the same region)
  2. Provide the name of the function -> click OK
  3. Choose the IAM role access (full access for Lambda is recommended)
  4. Select the S3 bucket where you want to store your Lambda function zip file
  5. Click Finish

Test run your job:

Right click on your code -> Select AWS Lambda -> Run function on Lambda

The output will be:

Alright, your code got executed in Lambda! As you can see, with a few Talend components and some simple code, we can set up a Talend job to run serverless on AWS Lambda. Did you find this helpful? Let me know in the comments below, or reach out if you have questions.

What’s Outcome-Based Data Management?

Companies don’t gain business value just by gathering lots of data. They don’t even necessarily gain value from analyzing the data. The true keys to success are choosing the right data to focus on, knowing what to do with it, and determining the best ways to apply analytics to solve business problems or address market opportunities.

To achieve all of this and get the most out of their information resources, enterprises need to create a coherent data management strategy that is designed to deliver data in ways that will actually impact business outcomes. Here’s why. 

Finding the Needle in the Haystack

These days, data is coming into organizations from so many sources and channels, and in such enormous volumes, that there needs to be a strategy in place to make the best use of the information. Without a plan, there’s no efficient way to figure out what kinds of data you’re going to use and how it will benefit your business. A key part of that is understanding what business outcomes the data can potentially deliver.

For example, let’s say a software company is looking to build the best applications it possibly can—a sensible goal for any software company. When that company gathers usage data from current and prospective customers, it will want data that can help it decide which features to build into the next release and which features to retire. Once the product is released, the company will want the data to reveal whether it is building product features correctly or needs to tweak something. Are people using the new features, and are they delivering the expected value?

Ideally, by gathering and analyzing its incoming data the software company will receive definitive answers to these questions. For this company, and any type of business, it’s about identifying the opportunity and then making sure to actually capture the right kinds of data.

“Without a data management plan, there’s no efficient way to figure out what kinds of data you’re going to use and how it will benefit your business.”

Armed with this data, this company can now actually know whether they are building good features into their products or whether they need to go back and rebuild. They might also find out that most users don’t even care about particular features—which can also be a useful insight.

A Lot of Data vs. The Right Data

I’d venture to guess that if they did some self-analysis, most organizations would find that they are either not collecting the right data or not collecting it in the right format. Maybe they’re not digging deep enough to get to the insights they need from the data in order to make better decisions.

The problem is not that companies don’t have the ability to collect data. In fact, many are overwhelmed by the amount of data they have. But this data and the analytics applied to it, unfortunately, do not help people make better decisions; it’s not information that people can act on to bring value to the organization or its customers.

The good news is that this can all be fixed via a solid, outcome-based data management strategy: one that takes into account what kinds of data and analytics particular users need, and how those users will act differently once they have the data, so that they can deliver more value.

Even though this is about data, organizations should not make the mistake of assuming that the IT or analytics teams should take the lead in developing and maintaining the data strategy. They need to involve individual business users, whether they are in marketing, product development, customer service, or some other area of the business.

After all, these are the people who know the data best and will be enabled by it. They’re the ones who are at the point of decision, and they need to be able to use the data within their world in order to make those decisions. So, one of the most important things enterprises need to do is identify the decision points within the organization and enable the people at those points to use the data they need to effect the desired change.

A lot of business users might not be analytically minded by nature. For that reason, investing in training so those people can actually use the data is a key piece of the data strategy.

By determining what kinds of data need to go to which users and in what format—and by preparing those users to best leverage the information—organizations can truly gain the business value that their data is meant to deliver.