Open Source: 20 years of Innovation and the Best is Yet to Come


In 1998, Netscape decided to release their source code in an effort to attract new users to their product and new developers who could easily integrate applications with the browser. 

At the same time, there seemed to be a groundswell around a culture of open and collaborative development, with legacy software companies beginning to acknowledge Linux and open source software (OSS) as a legitimate option for enterprise solutions. Spurred by the changing software landscape and the release of Netscape’s code, a group of influencers in the internet software community, including Tim O’Reilly (CEO of O’Reilly & Associates) and Linus Torvalds (creator of Linux), gathered to strategize the best way to start evangelizing the benefits of OSS and the innovative power of being part of a global community working to advance and optimize source code. During this meeting of the minds, Christine Peterson, an American nanotechnologist, futurist, and co-founder of the Foresight Institute, coined the term ‘Open Source’ for this special breed of ‘community developed software’. The term was quickly adopted, and the movement took off from there.

Open Source: From Disbelief to Dominance

Open source was initially met with skepticism by companies who questioned whether the quality of this software could be trusted for enterprise-scale functions. However, as time went on, it became evident that the collaborative nature of open source development led to more flexible and innovative software that was often more secure than many of its proprietary counterparts. Why was this the case, you might ask? When comparing a software package created by a handful of developers to a software package created by thousands of developers, it’s fairly obvious that the latter gets closest to what the larger community of users want because those users can have a hand in making it. Additionally, when it comes to security, bugs in open source software tend to get fixed immediately, in contrast to proprietary software, where it can often take a vendor several weeks, if not months, to even identify a security defect, let alone develop a fix or patch for it.

Twenty years after the birth of open source, we have seen the power of a global developer community coming together to make open source the norm. Linux—which was once a primary competitor to Microsoft—has become so predominant that Microsoft has even started using Linux in its cloud offering, Azure. Similarly, Git has become a leading solution for version control, and big data processing frameworks like Spark and Hadoop have permeated the world’s top enterprises.

Talend’s Commitment to the Community

As an enterprise company with an open core history, Talend has been a proponent and active participant in the open source movement since 2006, when it launched its first open source project: Talend Open Studio for Data Integration. At the time, Talend had taken a bet on open source and—according to company co-founder Bertrand Diard—was one of the only companies that “[aligned themselves] with open source” and leveraged a “distributed rather than centralized architecture.” Over the years, Talend became more involved in the open source community by releasing additional open source projects and, in 2010, dedicating a team of engineers to drive development in the Apache community.

The benefits of open source are not limited to technology companies; they extend to end-user, mainstream enterprises and non-profit organizations as well. For example, the International Consortium of Investigative Journalists (ICIJ) utilized Talend’s open source technologies to move data from more than 3 million documents into the databases and tools that gave journalists the insights needed to publish the Panama Papers. Big companies like Spotify have also been active in the open source community, using (and contributing to) projects like Beam to build the data architecture needed for “music recommendation, ads targeting, AB testing, behavioral analysis, and business metrics.”

What’s Next: Top 4 Open Source Projects to Watch in 2018

Open source projects continue to deliver disruptive technology capabilities across every sector. First on the list is Apache Beam, whose name is a combination of Batch and strEAM. Beam is a framework that abstracts a developer from the processing frameworks they may want to utilize. This allows a developer to build a single pipeline that can easily switch between different streaming and batch processing engines. As new processing technologies become available, a Beam developer does not need to learn new languages. Instead, they just need to choose the new technology as the preferred processing engine for their pipeline.
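To make that concrete, here is a minimal, hypothetical word-count pipeline written with Beam’s Python SDK. The file names and the counting logic are invented for illustration; the point is that the execution engine is chosen through a pipeline option rather than by rewriting the pipeline.

```python
# Minimal, illustrative Apache Beam pipeline (Python SDK).
# Swapping "DirectRunner" for DataflowRunner, FlinkRunner, or SparkRunner
# changes the execution engine without touching the pipeline code itself.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(runner="DirectRunner")  # local test runner

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("input.txt")        # hypothetical input file
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "Pair" >> beam.Map(lambda word: (word, 1))
        | "Count" >> beam.CombinePerKey(sum)
        | "Write" >> beam.io.WriteToText("word_counts")      # hypothetical output prefix
    )
```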

TensorFlow, an open source library for machine intelligence, is also making waves in the data world. Originally developed for deep learning, TensorFlow has grown into a flexible interface that can run training and inference algorithms on devices as small as mobile phones or across distributed computing environments with hundreds of machines. Its versatility is also evident in the variety of use cases TensorFlow has addressed: companies have used it to implement machine learning in verticals ranging from robotics and speech recognition to computational drug discovery.
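As a quick illustration of that flexibility, the sketch below trains a tiny Keras model on random toy data and then runs inference. The data and model are made up for demonstration only; the same code can run on a laptop CPU or a GPU and, with a distribution strategy or a TensorFlow Lite export, scale up to a cluster or down to a mobile device.

```python
# Toy TensorFlow/Keras sketch: train a tiny classifier, then run inference.
# The random data and the model shape are purely illustrative.
import numpy as np
import tensorflow as tf

x = np.random.rand(256, 4).astype("float32")    # 256 samples, 4 features
y = (x.sum(axis=1) > 2.0).astype("float32")     # toy binary labels

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x, y, epochs=5, batch_size=32, verbose=0)   # training

print(model.predict(x[:3]))                           # inference on a few samples
```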

Last, Docker and Kubernetes are changing the way applications are built and deployed. Docker allows engineers to isolate different applications into separate containers that run independently of each other, without having to worry about other applications’ libraries, configurations, and lifecycles. Kubernetes is used to deploy, manage, and orchestrate the containers. As a result, DevOps teams do not need to worry about app incompatibilities, which allows for faster deployment and faster innovation.

In summary, one of the best things about open source is that it democratizes innovation. It allows companies both big and small to utilize and contribute to the most cutting-edge technologies. Seeing these benefits first-hand, we are especially happy to celebrate the 20th anniversary of open source, and we can’t wait to see what great innovations the open source community (ourselves included) creates next.

Salesforce Acquires Mulesoft – The War for Customer Data Rages On

It’s been fascinating to see the customer data market evolve.  Initially, Google and Facebook were leaders in using customer data to create better products and cloud services.  More recently, Amazon, Microsoft, and Google have been in an all-out war to win the cloud platform market, especially through machine learning.  Most recently, we are hearing from our customers how their data processing platforms have been simply transformational for their business across every industry, even pizza delivery and insurance quotes.  Companies at our customer advisory board talked about delivering hundreds of millions of dollars and even more than a billion in increased revenue and improved operational efficiency.  This battle for data is occurring at every level of the stack, including cloud platforms, cloud applications, search, and social media, and across every industry.

Why do this deal? Salesforce is buying Mulesoft for an impressive $6.5 billion, roughly double what the company was trading at on February 8th.  That is a massive premium, so why did Salesforce invest so much?

It’s because there is an “Innovation Multiplier Effect” happening in IT today that many can see, but few can quantify. Salesforce is one of the companies that can quantify the impact.  What is this “Innovation Multiplier Effect”? Everyone knows how much innovation is going on in cloud applications, cloud computing, big data, machine learning, IoT, microservices, and mobile devices, but what is harder to see is the multiplier effect that is occurring because all this innovation is happening at the same time.

Few companies in the world can see this impact firsthand like Salesforce.  Even a juggernaut like Salesforce can only manage a tiny portion of their customers’ customer data, so technologies that could potentially help them connect to any cloud and any device to bring more data to their platform increase the value of their platform many times over.  Offerings like IoT Cloud and Einstein become so much more valuable when they can be applied to much larger data volumes.

What does this mean for the integration market?  

Talend works closely with Salesforce, Amazon, Microsoft, Google, and many more cloud players.  Every one of them is investing in database technologies, machine learning, real-time processing, and more to help companies turn customer data into new insights.  Just like these industries, the integration market is being reinvented with new cloud and big data technologies.

When Talend went public in 2016, we discussed the market dynamics in the integration market.  We pointed out how the integration market broke out into three categories:

  • $1B+ legacy players like Oracle, IBM, Informatica and Ab Initio – struggling to keep up with innovation in the cloud and big data market
  • Many new start-ups with less than $10M in revenue
  • Two growth-stage companies, Mulesoft and Talend, each with more than $100M in revenue, focusing on different parts of the market

Talend and Mulesoft are addressing two very different, but equally important parts of the market, data integration and API integration respectively.  An acquisition like this reinforces the strategic importance of data, and it opens new opportunities for Talend in adjacent markets.

At Talend, we believe that customers want a platform-independent data integration solution that allows them to connect to data on any cloud.  We believe that customers will demand the flexibility to take advantage of the latest innovation, regardless of where their data sits.  Therefore, we have invested a great deal in cloud integration with a multi-cloud approach for every industry in the world.

What does this mean for Talend customers?

The overall data and analytics market continues to innovate at a staggering rate, and we will continue to work hard to unlock all that innovation regardless of what cloud it works on. Salesforce will remain a major Talend partner, and we will continue to work with Salesforce to make it easy to get data in or out of the platform. You’ll also see us expand the API design and testing capabilities that came from our Restlet acquisition with Mulesoft’s API integration capabilities.  This dramatically simplifies the effort required to build and deploy new APIs with an API-first design methodology.

On top of this multi-cloud approach, you’ll see Talend invest in delivering data applications for many more roles, such as data analysts, data scientists, data stewards and operational employees, and in tools for ensuring data governance and quality.  These are fundamental components of our strategy to make all this innovation and data trustworthy and available to ANY employee within the company.  This will further unlock the true potential of new cloud innovations like machine learning and real-time data processing so companies may fully realize the value of their data.

The Countdown to GDPR Compliance Begins – Are You Ready?

This blog was originally published by Ronak Chokshi on the MapR Blog


The General Data Protection Regulation (GDPR) is a new regulation enacted in the European Union (EU) as of April 14, 2016, and enforced starting May 25, 2018. It impacts any company with 250+ employees that controls or processes EU citizen data (i.e., data that pertains to residents in any of the 28 member states of the EU). GDPR offers EU residents additional control over their personal data, with rights to modify, restrict, or withdraw consent to its access and use, and it enables data portability.

Many companies in the United States have been ignoring this legislation, assuming that it doesn’t affect them. A Forbes article summarizes why US-based companies need to get serious about GDPR. Moreover, a Forrester report, published in January of this year, suggests that merely 25 percent of organizations across Europe are GDPR-compliant today, while another 22 percent expect to be GDPR-compliant in the next 12 months. But, despite GDPR enforcement beginning in less than four months, Forrester found that 11 percent of organizations are still considering what to do about it, while 8 percent of organizations aren’t familiar with GDPR at all. Forrester’s research also found that it is typically media and retail organizations–companies that handle some of the largest amounts of customers’ personal data–that are currently the least prepared for GDPR, with only 27 percent reported to be fully GDPR-compliant.

With those market numbers out of the way, the next step is to determine whether your organization needs to be ready for GDPR. It is fair to say that there are some misunderstandings in this area, and many organizations tend to conclude that GDPR doesn’t really apply to them. To that, I suggest you think about all the data your company collects from mobile devices, sensors, marketing campaigns, chat logs, in-store shoppers, home automation solutions, transportation, logistics, and social media channels that can be qualified as personal data pertaining to EU residents. Any of these datasets, and any others that can be used to identify a person in the EU, would come under GDPR scrutiny. Moreover, regardless of whether your company is a data controller or a data processor, based inside or outside of the EU, GDPR applies to you.

So, if you are part of a business, data engineering, IT, or legal team within your organization that collects or processes personal data from any EU citizen, you need to pay close attention to this regulation; you can’t afford to ignore it. Companies that fail to comply with GDPR face hefty fines, which can range up to 4% of a company’s annual revenue. Determine your GDPR readiness with a free assessment tool.

So what does GDPR mean for you?

If you are a data protection officer or data governance professional, you need to have a clear plan around data governance, accountability, location, and portability. The following points summarize the key tenets of this new legislation:

  • Easier access to personal data. Citizens in the EU will be given greater visibility into how their data is being processed in a clear and understandable way.
  • A right to data portability. It will be easier for people to transfer personal data between service providers.
  • A right to be forgotten. When an individual no longer wants his/her data to be processed and provided there are no legitimate grounds for retaining it, the data will be deleted.
  • A high standard for consent. It needs to be freely given, specific, informed, unambiguous, provable, and easy to withdraw.
  • A right to know when your data has been hacked. Companies must notify the national supervisory authority of serious data breaches as soon as possible (within 72 hours in many cases), so that users can take appropriate measures.

Making your personal data compliant 5 times faster with a GDPR data lake

At MapR, we believe the key underpinning for being GDPR-ready is starting with a data lake that can address data storage, retention, portability, lineage, and governance, using a single, unified platform, instead of a mix of point solutions.

This is precisely why MapR has chosen to partner with Talend on creating an offering that helps companies accelerate the deployment of a GDPR-ready/compliant data lake. Our joint solution is based on the MapR Converged Data Platform and Talend Data Fabric, which together help customers address the key challenges of GDPR readiness.

The MapR Converged Data Platform provides several features to comply with GDPR, including MapR Volumes, which logically groups PII data (EU vs. non-EU) and immediately applies policies and permissions to this data; high-performance MapR auditing to log data access; and MapR mirroring and replication to easily control the movement of ‘portable’ data.

The table describes how our joint solution addresses the seven key principles stated in the GDPR guidelines for compliance—specifically, Chapter 2, Article 5.


Talend’s data integration platform, Talend Data Fabric, combines data quality, metadata management, data stewardship, data lineage, data services, and big data integration to collect, standardize, reconcile, certify, protect, and propagate personal data. Their unified suite of components, namely the Talend Big Data Platform, Master Data Management, and Metadata Manager, along with their readiness assessment questionnaire are instrumental in this process.

Watch this online webinar to learn more about MapR and Talend’s GDPR Data Fabric.

To summarize, complying with GDPR requires going beyond establishing a rules-based control mechanism, business intelligence reporting, and basic data management tools. You need to get started with a modern data lake solution that allows robust data governance, data lineage, data anonymization, and more–a solution that allows for the convergence of all data, in one platform, across every on-premises, cloud, multi-cloud, or hybrid cloud environment.

The good news: It’s easy to get started. Check out our joint MapR/Talend GDPR Data Lake solution and our joint solution brief to see how it can help you attain better data security and GDPR compliance.

About the Author – Ronak Chokshi

Ronak is Product Marketing & Solutions Strategy Lead for MapR Technologies. He is a product management leader with 13+ years of experience in cross-functional roles. Ronak’s specializations are in advanced data analytics, machine learning, IoT, sensor & connectivity domains.

It’s Time to End Bad Data

Bad data has never been such a big deal.  Why? Well, according to IDC’s latest report, “Data Age 2025”, the projected size of the global data sphere in 2025 would be the equivalent of watching the entire Netflix catalog 489 million times (or 163 ZB of data). In a nutshell, the global data sphere is expected to be 10 times the 2016 data sphere volume by the year 2025. As the total volume of data continues to increase, we can also infer that the volume of bad data will increase as well unless something is done about it.

No doubt, every data professional will incessantly chase bad data, as it’s the bane of every digital transformation. Bad data leads to bad insight and ultimately biased decisions. That’s why it’s crucial to spot bad data in your organization. But it’s also hard to do.

How to Spot Bad Data

Bad data can come from every area of your organization and in diverse forms, whether from business departments, sales, marketing, or engineering. Let’s take a look at a few common categories of bad data (a quick sketch of how some of them can be flagged follows the list):

  • Inaccurate: data that contains misspellings, wrong numbers, missing information, or blank fields
  • Non-compliant: data not meeting regulatory standards
  • Uncontrolled: data left without continuous monitoring that becomes polluted over time
  • Unsecured: data left without control and vulnerable to access by hackers
  • Static: data that is not updated and becomes obsolete and useless
  • Dormant: data that is left inactive and unused in a repository and loses its value as it’s neither updated nor shared
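As a quick sketch of what flagging some of these categories can look like in practice, the hypothetical pandas snippet below checks a toy customer table for blank fields, malformed values, and stale records. The column names and thresholds are invented for illustration; real rules would come from your own data quality standards.

```python
# Hypothetical sketch: flag a few of the bad-data categories above with pandas.
# Column names ("email", "revenue", "last_updated") are made up for illustration.
import pandas as pd

df = pd.DataFrame({
    "email": ["alice@example.com", "", "bob@example"],
    "revenue": [1200.0, None, -50.0],
    "last_updated": pd.to_datetime(["2018-01-10", "2016-03-01", "2017-11-20"]),
})

issues = pd.DataFrame(index=df.index)
issues["blank_or_missing"] = df["email"].eq("") | df["revenue"].isna()      # inaccurate: blank fields
issues["bad_format"] = ~df["email"].str.match(r"^[^@]+@[^@]+\.[^@]+$")      # inaccurate: malformed values
issues["negative_revenue"] = df["revenue"] < 0                              # inaccurate: wrong numbers
issues["stale"] = df["last_updated"] < pd.Timestamp("2017-01-01")           # static/dormant: not updated

print(df[issues.any(axis=1)])   # rows that need attention
```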

If Data Fuels Your Business Strategy, Bad Data Could Kill It

If data is the gasoline that fuels your business strategy, bad data can be compared to poor-quality oil in a car engine. Frankly, there is no chance you’ll go far or fast if you run the engine on poor-quality oil. This same logic applies to your organization. With poor data, results can be disastrous and cost millions.

Let’s take a look at a recent “bad data” example from the news. A group of vacationers in the United States followed their GPS application to go sight-seeing. Because there was some bad data present, they wound up driving directly into a lake rather than to the destination they intended. Now let’s visualize a future where your car will be powered by machine learning capabilities. It will be fully autonomous and will choose directions and optimize routes on its own. If the car drives you into the lake because of poor geo-positioning data, this will end up costing the carmaker quite a bit in repairs and even more in brand reputation. According to Gartner, the cost of poor data quality rose by 50% in 2017, reaching an average of $15 million per year per organization. You can imagine this cost will explode in the upcoming years if nothing is done.

Time for a wake-up call:

Results from the 2017 Third Gartner Chief Data Officer (CDO) survey show that the data quality role is again ranked as the top full-time role staffed in the office of the CDO. But the truth is that little has been done to solve the issue. Data quality has always been perceived by organizations as a difficult play. In the past, the general opinion was that achieving better data quality is “too lengthy” and “complicated.” Fortunately, things have changed. Over the last two years, data quality tooling and procedures have evolved dramatically. And it’s time for you to take the data bull by the horns.

[ Don’t know where to start? Find out how easy it can be to master data quality across your data infrastructure with the right self-service tools and automated processes.]

Let’s take a closer look at a few common data quality misconceptions:

“Data Quality is Just for Traditional Data Warehouses.”

Today, data is coming from everywhere, and data quality tools are evolving. They are now expanding to cover any data, whatever its type, nature, and source. And it’s not only data warehouses: it can be on-premises data or cloud data, data coming from traditional systems and data coming from IoT systems. Faced with data complexity and growing data volumes, modern data quality tooling uses machine learning and natural language processing capabilities to ease your work and separate the wheat from the chaff. My advice is to start early. Solving data quality downstream, at the edge of the information chain, is difficult and expensive. It’s 10x cheaper to fix data quality issues at the beginning of the chain than at the end.

“Once You Solve Your Data Quality, You’re Done.”

Data management is not a one-time operation. To illustrate, let’s look at the example of social networks. The number of social media posts, videos, tweets, and pictures added per day is in excess of several billion entries, and this rate only continues to increase at lightning speed. It’s also true for business operations: data is becoming more and more real time. You then need “in-flight data quality”. Data quality is becoming an always-on operation, a continuous and iterative process where you constantly control, validate, and enrich your data, smooth your data flows, and get better insights. You also simplify your work if you link all your data operations together on a single managed data platform.

Let’s take Travis Perkins as an example. Rather than trying to fix inaccuracies in their product data for multi-channel retailers, they built a data quality firewall into their suppliers’ portals. When suppliers enter their products’ characteristics, they have no choice but to enter data that meets Travis Perkins’ data quality standards.

“Data Quality Falls Under IT Responsibility”

Gone is the time when data was simply an IT function. As a matter of fact, data is now a major business priority across all lines of business. A security breach, data loss, or data mismanagement may lead your company to bankruptcy. Data is the whole company’s priority as well as a shared responsibility. No central organization, whether it’s IT, compliance, or the office of the CDO, can magically cleanse and qualify all the data. The top-down approach is showing its limits. This is all about accountability. Like the cleanliness of public spaces, it all starts with citizenship.

Let’s look at a recent example of the Alteryx Leak. A cloud-based data repository containing data from Alteryx, a California-based data analytics firm, was left publicly exposed, revealing massive amounts of sensitive personal information for 123 million American households. This is what happens when you fail to establish a company-wide data governance approach where data has to run across data quality and security controls and processes before it can be published widely.

Bad data management has immediate negative business consequences. Today, good data management requires company-wide accountability. Otherwise, it leads to penalties, bad reputation and negative brand impact. 

“Data Quality Software is Complicated.”

Talend developed Talend Data Prep Cloud in order to enable anyone in an organization to combat bad data. With an interface familiar to users who spend their time in well-known data programs like Excel, non-technical users can easily manipulate big datasets without any destructive effect on the raw data. Line-of-business users can enrich and cleanse data without requiring any help from IT. Connected with apps like Marketo and Salesforce, Talend Data Prep will dramatically improve your daily productivity and improve your data flows.

Experience the industry-leading cloud integration solution and sign up for Talend Cloud today.

“It’s Hard to Control Data Quality.”

Data management isn’t just a matter of control anymore, but a matter of governance. IT should understand that it’s better to delegate some data quality operations to the business, because business users are the data owners. Business users then become data stewards. They feel engaged and play an active role in the whole data management process. It’s only by moving from an authoritative mode to a more collaborative one that you will succeed in your modern data strategy.

“But It’s Still Hard to Make all Data Operations Work Together.”

IT and business may have their own separate tools to manage data operations. But having the right tools for relevant roles is not enough. You will still need a control center to manage your data flows. You need a unified data platform where all the data operations are linked and operationalized together. Otherwise, you will risk breaking your data chains and ultimately fail to optimize your data quality.

Building a solid data quality strategy with the right platform is not complicated anymore. However, it still requires all data professionals in your organization to react and establish a clear, transparent and governed data strategy.

Data is your most valuable asset. It’s time to look at all data with a “Data Quality Lens” and combat any existing data myopia in your organization.

To go further into data quality, I recommend taking a look at a recent Gartner report that reflects eight changing trends shaping data quality tooling.

How Big Data is Growing Agriculture

For this episode of Craft Beer & Data, Mark and I hung out with Eric Matelski, tap-room manager at the Epic Brewing Company, and talked about how big data is transforming the agricultural industry. But first, Eric told us about one of his favorite beers (lately), Falling Monk:

“This is a beer we made for Falling Rock Tap House here in Denver that celebrates their 20th anniversary. It’s barrel aged in bourbon barrels with two different types of cherries and fresh almonds.”

For more about the ales and Epic Brewing, watch the video, above, or come visit Eric in the River North District of Denver!

Big Data is Creating Digital Transformation

Everyone understands that big data is a technology that can be used and that there are new frameworks for processing it. But what are the benefits to a particular industry?

So we thought we would take a section of our video series and really focus on how data transforms, digitizes, and improves certain industries.

Ready For More? Download How Leading Enterprises Achieve Business Transformation with Talend and AWS User Guide now.

Download Now

Does Agriculture Need Big Data?

By 2050, it’s been estimated that the population will grow to over 9 billion people. What that means is we need to find ways to produce and distribute food at double the current rate. This is especially challenging when you consider that 40% of the earth’s surface is already used for agriculture.

However, over a third of all food produced is either lost or wasted through the entire production process. Even if you’re just worried about the wasted money, it’s estimated that the wasted food has about a $940 billion impact on the global economy.

There’s clearly a need for more use of data in agriculture and the food industries.

The Four Vs of Big Data in Agriculture

There are definitely some ways that the agricultural industry is already using data. Much of the farming industry uses it throughout its processes.

Let’s look at the big Vs of big data at work in agriculture, through the production of crops and the data it brings in:

  1. Volume
  2. Variety
  3. Velocity
  4. Veracity

For years, John Deere has had sensors on their farming equipment to capture information and data. They, along with seed companies, even use satellite imagery to help farmers determine where they do and don’t need to spray certain pesticides.

This also speaks to variety and velocity, because the data is coming from everywhere. John Deere is really only one of the major farm and tractor companies out there—there are plenty of companies doing exactly the same thing, bringing in data that looks different.

Then, there’s a huge discussion in the agricultural sciences about how you should collect and store this data. It is coming in from every angle.

Big Data Applications in Agriculture

Big data and machine learning are huge in predicting things like when you might want to use certain pesticides. The US has regulations around not using certain pesticides 24 hours before a “predicted rain storm” of more than an inch of rain. So how do you get that information out to the farmers who are about to spray their fields?

One Talend customer is trying to help farmers better understand their soil. They have found this really interesting way of providing a mobile lab that these farmers can use, because it’s currently very expensive to get a soil analysis. They’ve also built a hand-held x-ray machine that takes a soil analysis to figure out things like nitrogen and potassium levels, and which types of fertilizers or crops can be used to make adjustments in the soil.

            Learn more about how Talend is helping to bring big data to agribusiness →

Agribusiness Big Data in the Cloud

The cloud is a huge topic in agribusiness right now. The scope of available data is pushing most industries to cloud environments, but there’s an added challenge in the agriculture industry. Check out the full episode, above, for more on this conversation.

Ready For More? Watch Getting Started with Data Integration now.

Watch Now

Watch the whole episode of Craft Beer & Data for more on big data in agriculture, including a story about a coder in Japan who owned a cucumber farm, concerns about how all this data might be used against farmers, and more.

And catch up on all of Season 1 on YouTube.


How to Migrate Your Data From On-premises to the Cloud: Amazon S3

Migrating to the Cloud

2018 is the year of the cloud, and as more and more companies move to cloud technologies, it is important to realize how your business can best utilize the cloud. One of the biggest issues enterprises are having today is moving their data from their on-premise databases to their cloud data storage. This can be a long and tedious process if you don’t have the correct tools. Luckily, Talend is here to help!

At Talend, I needed to take our on-premise database, MySQL, and migrate it to our cloud storage, Amazon S3. Rather than deal with the complexities of Apache Sqoop, I decided to create a job within Talend that would run whenever we needed to migrate new data to the cloud. Using this method saved me precious time that I can use to analyze my newly migrated data. In this blog, I will be reviewing how I built said job. Without further ado, let’s jump right in!

Creating a Connection

As with any Talend job, the first thing we want to do is create the connections. I have a MySQL database so I am going to use the tMysqlConnection component. I also need to create a connection to my S3 cloud storage using tS3Connection. Because connecting to both MySQL and S3 are the first steps every time this job is run, we also need to add tPrejob in front of both components.

Remember, Talend is a code generation tool; by using tPrejob, I can control what will always execute first, ensuring I always connect to my databases. After I configure both connection components, I can connect tPrejob, tMysqlConnection, and tS3Connection together like the screenshot shown below.


Getting your Tables and Setting the Dynamic Schema

Now that I am connected to both of my storage platforms, I can start my cloud migration process from MySQL to Amazon S3. To start, I need to get a list of all the tables I want to move from the database. Using tMysqlTableList, I can specify which tables I want to list through the “WHERE clause”. In this case, I only want to pull from the customer tables.


Now that I have the list of all the tables I want to transfer, my next step is to get a list of the columns within that table.

Using “tMysql” global variables is a fantastic way to pull values from components. These global variables can pull data from the “tMysql” components for other components to use. In this case, “((String)globalMap.get(“tMysqlTableList_1_CURRENT_TABLE”))” will make the component pull columns from the tables being gathered by the tMysqlTableList component. Talend makes it easy to retrieve global variables without having to memorize them. All I have to do is type “tMysql”, press Ctrl + Space, and all the “tMysql” global variables will appear in a list where I can choose the one I want.
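For readers who prefer to see the idea outside the Talend Studio canvas, here is a rough Python equivalent of what tMysqlTableList and the column lookup accomplish, querying information_schema directly. The connection parameters, schema name, and the “customer” filter are hypothetical; this is a sketch of the concept, not the code Talend generates.

```python
# Rough Python sketch of listing tables and their columns from MySQL.
import pymysql

# Hypothetical connection details.
conn = pymysql.connect(host="localhost", user="talend", password="secret", database="crm")

with conn.cursor() as cur:
    # Equivalent of tMysqlTableList with a WHERE clause: list the customer tables.
    cur.execute(
        "SELECT table_name FROM information_schema.tables "
        "WHERE table_schema = %s AND table_name LIKE %s",
        ("crm", "customer%"),
    )
    tables = [row[0] for row in cur.fetchall()]

    # Equivalent of iterating the CURRENT_TABLE global variable: list each table's columns.
    for table in tables:
        cur.execute(
            "SELECT column_name FROM information_schema.columns "
            "WHERE table_schema = %s AND table_name = %s",
            ("crm", table),
        )
        print(table, [row[0] for row in cur.fetchall()])
```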

Next, I need to add a tFixedFlowInput to generate the “tableName” and “columnName” columns. The values will only appear within the tFixedFlowInput component if I configure the schema for these columns first. Once I set the schema, I can set the values for these columns, which will be ((String)globalMap.get(“tMysqlTableList_1_CURRENT_TABLE”)) for “tableName” and ((String)globalMap.get(“tMysqlTableList_1_COLUMN_NAME”)) for “columnName”.

Adding a tLogRow after the fixed flow will allow me to see the names of the tables and columns that my job is pulling from by displaying the information on the run console. Below is an updated screenshot of my job thus far.

Now it’s time to set the dynamic schema that the data will use when being pulled from my on-premise database. Like the name suggests, a dynamic schema is a schema type that will change depending on the column that is being read at the time, making it essential to the job.

To set a dynamic schema, I will be using a fancy component called tSetDynamicSchema. Other than having a great name, tSetDynamicSchema will allow me to dynamically set the schema based on the value of “columnName”. Now that the schema is dynamic, I don’t need to move each table individually; I can move multiple, different tables with ease.

Reading the Data and Writing the Tables

With my dynamic schema set, I’m ready to start reading the table data using the dynamic type that was created from the tSetDynamicSchema component. Because I am reading data from my on-premise database, I need to use an input component that will read from a MySQL database, tMysqlInput. First, I need to edit the schema of the tMysqlInput component to use the dynamic DB type. I named the column for this schema “dynamic_row” with type “Dynamic” (of course) and DB Type, “VARCHAR”.

After the schema is set I can move onto configuring the tMysqlInput component, making sure the data is being pulled from the current table being listed by tMysqlTableList.

The data in the tables is now being read from the current table listed; however, the data still needs to be written out to a CSV file. To accomplish this, I am going to use tFileOutputDelimited. I need to make sure the “File Name” follows the correct file path.
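Again as a plain-Python sketch of this step (not the generated Talend job): selecting with SELECT * and reading cursor.description plays the role of the dynamic schema, so no column names are hard-coded. The credentials, output directory, and table name are hypothetical.

```python
# Sketch: read an arbitrary table and write it to a CSV without hard-coding columns.
import csv
import pymysql

# Hypothetical connection and output directory.
conn = pymysql.connect(host="localhost", user="talend", password="secret", database="crm")

def dump_table_to_csv(table_name, out_dir="/tmp/export"):
    """Read a whole table and write it to <out_dir>/<table_name>.csv."""
    with conn.cursor() as cur:
        cur.execute(f"SELECT * FROM `{table_name}`")
        columns = [col[0] for col in cur.description]   # schema discovered at runtime
        with open(f"{out_dir}/{table_name}.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(columns)          # header row
            writer.writerows(cur.fetchall())  # all table rows

dump_table_to_csv("customer_addresses")       # hypothetical table name
```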

Phew! Don’t worry folks, we’re almost done. This is an updated look at the job that I have created up to this point.

Putting Files on Amazon S3

So far, this job reads all the tables with the name customer and writes them to CSV files in a specified folder. Now that I can pull data from tables located in my on-premise database, I need to finish the job by moving these files to Amazon S3. 

tFileList will allow me to get a list of all the files in a specified folder, or in this case, it will allow me to get a list of all the tables that I have pulled from my on-premise database. All I need to do is specify the directory where the files are located.

Once I get a list of all the files, I can start to move them into one of my Amazon S3 buckets. The tS3Put component will allow me to do this. All I need to do is specify the “Bucket”, “Key”, and “File”: the “Key” is the name of the object within S3, and the “File” is the local file being uploaded to S3.
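For comparison, here is roughly what the tFileList + tS3Put step looks like with boto3 in plain Python. The bucket name, key prefix, and export folder are hypothetical, and credentials are assumed to come from the standard AWS configuration.

```python
# Sketch: walk the export folder and upload each CSV to an S3 bucket.
import os
import boto3

s3 = boto3.client("s3")          # credentials come from the usual AWS config/environment
export_dir = "/tmp/export"       # hypothetical folder holding the exported CSVs
bucket = "my-data-lake-bucket"   # hypothetical bucket name

for file_name in os.listdir(export_dir):
    if file_name.endswith(".csv"):
        local_path = os.path.join(export_dir, file_name)
        # The S3 "Key" is the object name in the bucket; the "File" is the local path.
        s3.upload_file(local_path, bucket, f"mysql-export/{file_name}")
```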

Now that the configuration for the tFileList and tS3Put is completed, all that’s left to do is put the finishing touches on the cloud migration job. Remember those connections that I opened at the very beginning of the job? With the help of tPostjob, tMysqlClose, and tS3Close, I can close those connections every single time the job is run. Just like before, I want to be able to control what happens after the main part of the job has run, thus the reason for the tPostjob component. Easy-Peasy! The finished job should resemble something like this.

Running the Job

If the job is run and everything is in tip-top shape, then the run console should coincide with the screenshot below. As you can see, the console shows the table that is being read and written, as well as the corresponding column names.

Now that this job is complete, I can move any tables I want from my on-premise database to my cloud storage without having to build a separate job for each table or messing with pesky hand coding. It feels good to be cloud ready.

Watch this Demo LIVE

Want to catch this demo live? Join us on Thursday, March 22nd on Talend’s Facebook page for #TalendDevLive, where I’ll be building out this job step-by-step and taking your questions along the way. Don’t miss it!

“Moving to the Cloud”: Going Cloud First at University of Pennsylvania

Moving an 18th Century University to the Cloud

UPenn is an Ivy League university. Founded in 1740, UPenn was the first university in the United States with both undergraduate and graduate studies. Today, the university has grown into a vast and complex organization with 25,000 students and 12 different schools. UPenn has recently undertaken a massive program to transform and grow the capabilities of its IT infrastructure, betting on a cloud-first strategy.

UPenn’s 2020 vision for inclusion

Each of UPenn’s 12 schools operates with a different budget, different teams, and different objectives. Yet the entire University of Pennsylvania has come together to lay out its 2020 vision, which encompasses three main priorities: Inclusion, Innovation, and Impact.

The focus of “Inclusion” is to make premier-quality higher education —which costs around $60,000 per year at UPenn— available and accessible to all admitted applicants regardless of financial need. UPenn accomplishes this by an “all grant, no-loan” financial aid policy.

Over the past decade, dramatic advances have been made in increasing the diversity and excellence of UPenn’s student body. Presently, 53% of graduate and professional students are female, and 32% of undergraduates and 20% of graduate and professional students are U.S. minorities. This makes UPenn one of the most diverse universities among the top 10 universities in the US.

Making donations as easy as an online purchase

Achieving these numbers requires a disciplined and creative fundraising strategy to finance the “all grant, no-loan” financial aid policy. Private donations are indeed a key component of the budget for most higher education institutions in the US, and UPenn is no exception.

According to the Council for Aid to Education’s annual Voluntary Support of Education survey, in the fiscal year ending in June 2016, U.S. colleges and universities drew $41 billion in giving. At UPenn, there are 300,000 active donors in the database and the university processes 160,000 transactions every year. It is their donations that give students from all economic backgrounds the opportunity to go to such a prestigious university almost for free. This is why UPenn’s goal was to ensure that making a donation would be as easy as buying something on Amazon.

In 2017, UPenn updated its Online Giving application to make it easy to navigate, mobile-friendly, and integrated with other university systems. In addition to being modern and simple, the new site had to be reliable and flexible enough to support the usual spike in usage that occurs at the close of the U.S. tax year in December.

Using Talend to leverage cloud capabilities and bring together UPenn’s IT ecosystem

UPenn knew the new Online Giving application would be cloud-native to take full advantage of the availability and scalability benefits of the AWS platform.

Moving to the cloud enabled UPenn to be ready for the seasonal spikes in donations; AWS made it possible to easily ramp its infrastructure up or down to meet demand. The scalability of the AWS platform translates into reliability and cost savings for UPenn.

At the same time, UPenn wanted to make online donations seamless and to personalize the whole process, so they chose Talend Cloud to easily manage the data integrations between the cloud-native applications and the legacy on-premises apps. By collecting large volumes of financial transaction data precisely as it arrives, UPenn was able to develop more efficient fundraising campaigns.

The Online Giving app went live in the autumn of 2017. During the last peak in December 2017, UPenn recorded a 7% increase in the number of gifts and an overall increase in revenue of 18%.

The Cloud of Yesterday, Today, and Tomorrow

Cloud computing in the form we understand today started around 10 years ago, with the launch of Amazon Web Services (AWS). This was the first commercially viable option for businesses to store data in the cloud rather than on-premise and acted as a shared service for anyone connecting to the platform.

Early-stage cloud computing was certainly more technical than it is now, though no more so than managing a data center, something that IT departments at the time were well used to. Anyone who could successfully spin up virtual machines could set up a cloud environment.

Cloud started off by offering much more basic services – network, storage and compute. Over the last 10 years, we have seen the volume of data stored in the cloud grow exponentially with businesses frequently dealing with petabytes of data, compared to the gigabytes of yesteryear. With data exploding around us, cloud services have had to learn to be much more efficient.

The Cloud of Today

As data volumes and user numbers grew, cloud providers began to offer more cloud-based, value-added services. Analytics and machine learning capabilities housed in the cloud are frequently offered alongside other business-objective-oriented services. As a result, the range of people purchasing cloud services has expanded and is no longer confined to the IT department.

This is because one of the great advantages of cloud and SaaS services is how easy they are to set up and how flexible and scalable they are. Business users without extensive IT knowledge became able to utilize services previously only accessible to the IT department. Of course, with this came a rise in “shadow IT”, and while access to cloud-based business tools was very helpful for achieving business objectives and reaching new services that could not run on legacy IT systems, it did open up businesses to greater data privacy risks.

Today, the IT and business functions increasingly collaborate to implement the most up-to-date, data-driven technologies to solve real business needs, making “shadow IT” less of a problem. By working together, IT and business are able to offer smarter responses to business needs. The business function is better able to define the requirements of their cloud solutions, while the IT department has more flexibility to implement and test different technologies to find the best solution.

An important upside of this is that it allows the IT department to keep a better overview of where data is stored with third-party services like Salesforce. To ensure this is implemented throughout the business, the IT department needs to take on an educational role within the organization, teaching business users about the implications of new data privacy laws and safe, compliant ways to use data within the business. Considering the rapidly approaching GDPR deadline, this is good news for organizations, as the regulation will mandate much stricter approaches to data protection.

This strategy is being led from the top down, with new Data Protection Officers and C-suite positions like the CDO (Chief Digital Officer) and CTO (Chief Technology Officer) straddling the IT and business function. This is helping redefine the IT department as a department for technological creativity and innovation, rather than simply focusing on solution implementation.

The Cloud of Tomorrow

With GDPR coming into effect in May 2018, the cloud will need to evolve and adapt. Increasingly, there will be more security services attached to the cloud, as well as greater oversight around what data is stored and where. This will be essential for changes like the right to be forgotten, where a person can ask a business to delete all personal data relating to them.

For businesses, this will require significant re-architecting in the cloud environment to increase the ability to analyze and discover data across the storage landscape. The cloud is ready to implement these changes, but it will require a change of mindset within organizations. Cloud usage cannot just be determined on the basis of features and costs, but also on ensuring that the data is stored in a controlled, managed, compliant environment.

In the next 5-10 years, as the volume of data continues to expand exponentially, there will be an increasing need to align this data in terms of format and quality. Organizations will need to be able to pull together multiple data streams across multiple cloud environments into combined, high-quality insights. This is where an open source, vendor-neutral management layer will become crucial to help organizations bridge the gap between their vast data reserves and the insights offered by machine learning and AI technologies. All of this will contribute to a future where businesses can use data stored in the cloud to provide predictive analytics for the business, such as predicting load requirements for peak shopping days, or market fluctuations to prepare investors.

Building the Best Enterprise Data Strategy in 2018: How Our Customers Are Getting There

It’s an exciting time to be working in the Cloud, Big Data, and Machine Learning industry, but it’s even more exciting to hear how Talend customers are building their data strategy to drive business results.  Every year we invite representatives from some of our most strategic customers to join us for two days to share their experiences with Talend’s products and provide input into our roadmap.

This year we had an amazing group representing some of the best-known brands in the world, including leaders in financial services, payments, automobile manufacturing, laptops and servers, a restaurant chain, health insurance, pharmaceutical data and sports betting and gaming.  The group ranged from mid-sized companies to some of the largest in the world, and together they have a combined market value of over $200 billion.  Despite how varied the group is, they share at least one thing in common – data is a game changer for all of them and building a data strategy is imperative.  Most of them are in the data business directly or indirectly, but all of them have a core competency in data. 

4 Common Themes Across our Customer Base:

  • Using the Right Data to Drive Business Results: Every one of our customers is seeing a staggering growth in their data volumes.  The customers earlier in their data strategy journeys are still learning how to scale their data platforms.  The more mature customers tend to have more significant challenges with the variety of data and finding the “right data” to improve business results.
  • Increasing Customer Centricity: Regardless of the size of the business or their data maturity, every customer has a customer-centricity program designed to create a better customer experience.
  • Unlocking the power of Machine Learning: It’s simply amazing to hear how quickly our customers have increased their machine learning maturity.  Last year, most were trying to understand the working relationships between IT and data scientists.  Today, these same customers have a clear model for creating machine learning models and later deploying them at scale within their enterprise data strategy.
  • “We’re a data company that just happens to be in the X business”. It’s clear just how powerful data can be when you hear almost the same statement from a health insurance company and a restaurant chain.

Delivering Big-time Results to the Bottom Line

Our Customer Advisory Board is always a great way to hear about how companies are achieving bottom-line results by leaning on their data.

“Data is a game changer…and building a data strategy is imperative.”

One automotive company showcased how it has delivered hundreds of millions to the bottom line by using data and machine learning to support everything from autonomous driving to supply chain optimization programs.  They have even gone so far as to track the screws that go into every car, so that they can pinpoint cars that need to be recalled, reducing recalls by as much as 50x.

Another company, a restaurant chain, is tracking almost 16 million active customers, tying together the preferences of their families to truly personalize the customer experience and optimize a multi-billion-dollar supply chain of ingredients.

Wrestling with Machine Learning

Overall the advisory board has become far more mature in their use of machine learning in the last year, with much deeper experience around how and where IT and data scientists should work together. 

They’ve also found machine learning projects are the most data-hungry of them all, as the models require combing through millions of data points to identify the variables that have the biggest impact on their business.  In the case of our financial services customer, it’s not uncommon for their data scientists to stretch the limits of their on-premise storage and compute capabilities.  But the benefits are worth it.  This customer estimates they’ve delivered over $1B in operational improvements from data projects.

Betting on your Data in Real-time

One of the largest sports betting companies in the world employs over 5,000 people to help run their business in real-time 24x7x365.  Their data team has over 50 developers who have built a data platform on AWS using S3, Redshift, and Aurora, allowing them to track betting and game results down to the second, creating new betting opportunities for their customers.  Using Talend and the cloud, they have built their data strategy around an advanced analytics platform with the governed delivery of distributed machine learning models.  It is truly one of the largest deployments of real-time data tracking in the world.

Growing Through Acquisitions

Several of the customers at the Customer Advisory Board have grown through acquisitions.  These acquisitions have created opportunities for synergies and cost efficiencies.  A key theme across these companies was the need to link together distributed business units so they could act as a single company with joint customers and improve the overall customer experience.

Validating our Roadmap

Every one of our customers is deeply committed to using Talend to help expand their core competency in data management.  Much of our time was spent talking about opportunities to invest in Talend’s platform so that customers could continue building a unified and modern data management capability that spans all types of workloads, clouds, and on-premise locations.

Customers were especially excited about the new self-service data management capabilities delivered last year and several new products we plan to introduce this year, including self-service capabilities planned for data analysts and data scientists, the fastest-growing segments of their teams.

The board’s needs for data quality and data governance are also expanding rapidly as their data needs expand.  These areas have been top priorities for Talend over the last few years and will continue to be in the future, as reflected in our 2018 roadmap, which includes many exciting new capabilities to help customers increase collaboration and make it easier to find and share data sets. However, based on how quickly our customers’ needs are evolving to meet market shifts, our greatest strength will remain our ability to rapidly adapt to put them at the forefront of data innovation.


A Simple Architecture for Building a Big Data Lake on Azure with Talend Cloud

Big data has emerged as the most important tool businesses use to help shape their future. Major companies like Amazon, Uber, and Netflix are using big data to fuel a breakneck speed of innovation in everything from customer engagement to new product development to business optimization strategy. And the rise of big data technologies such as Hadoop, Spark, Kubernetes, and Kafka, combined with the promise of the cloud, has empowered countless enterprises to execute their big data initiatives effortlessly. By moving towards the cloud, companies are already reaping benefits such as speed of provisioning, time to market, flexibility and agility, instant scalability, and reduced overall IT and business costs, to name a few.

Getting Started with Azure and Talend Cloud

Among the leading cloud platforms, one of the most widely adopted is Microsoft Azure, a secure, flexible, enterprise-grade cloud platform that offers IaaS, PaaS, SaaS, and many other development tools and frameworks that can help create a data lake to deliver enterprise big data analytics.

Meanwhile, Talend Cloud is an open, highly scalable cloud integration (iPaaS) solution that simplifies your data and app integrations. Talend Cloud brings:

  • Broad connectivity where you can connect to any on-premises databases, SaaS apps, cloud apps, Azure Blob Storage, Azure Data Lake Store, Azure HDInsight, Azure SQL Data Warehouse, Azure CosmosDB, and more
  • Native Spark and Hadoop support
  • Built-in data quality
  • Self-service capabilities such as data prep, data stewardship, and data governance
  • Enterprise capabilities like SDLC and multi-cloud support

Creating a Big Data Lake on Azure for Accurate and Reliable Data

Talend and Azure have been working together to provide our joint customers a hyper-scale cloud data lake solution that can deliver actionable insights. But first, what is a data lake? A data lake is an architecture that allows organizations to store massive amounts of data in a central repository. Typically, this includes data of various types and from multiple sources, readily available to be categorized, processed, analyzed, and consumed by diverse groups within the organization. Data lakes help eliminate data silos and capture 360-degree views of organization, customer, and partner data. Compared to traditional data storage and analytics, data lakes deliver more agility and flexibility, especially when built in a cloud environment. A data lake architecture is not limited by response time when rapid changes are needed, such as adopting new IT solutions, connecting to new data types and sources, and performing new types of analytics.

The following diagram shows how a typical customer implements a data lake solution using Azure and Talend Cloud:

In this simplified use case, you ingest your structured or unstructured data from the web, social media, machine sensors, devices, or on-premises applications into Azure Data Lake Store (ADL Store), a hyper-scale file system for big data analytics workloads. It’s compatible with the Hadoop Distributed File System (HDFS) and works with the Hadoop ecosystem.

Talend Cloud then helps profile your data stored on ADL Store, adding requirements for data governance, business rules, and regulatory compliance. Then you use Talend’s built-in data quality natively on Azure HDInsight to prepare data for analysis. Finally, you move the transformed and cleansed data to Azure SQL Data Warehouse; from there, business analysts can directly access that data for BI reports.
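To give a feel for what the cleanse-and-load step can look like on the HDInsight side, here is a hypothetical PySpark sketch: read raw data from ADL Store, apply some basic cleansing, and load the result into Azure SQL Data Warehouse over JDBC. The paths, column names, JDBC URL, and credentials are placeholders; in practice, a Talend Cloud job would generate and manage the equivalent Spark processing for you.

```python
# Hypothetical PySpark sketch of the cleanse-and-load step on HDInsight.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("azure-data-lake-demo").getOrCreate()

# Read raw events from ADL Store (placeholder account and path).
raw = spark.read.json("adl://mydatalake.azuredatalakestore.net/raw/events/")

# Basic cleansing: drop duplicates and rows missing a key field (placeholder columns).
cleansed = (
    raw.dropDuplicates(["event_id"])
       .na.drop(subset=["customer_id"])
)

# Load the curated data into Azure SQL Data Warehouse over JDBC (placeholder URL/credentials).
(cleansed.write
    .format("jdbc")
    .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=dw")
    .option("dbtable", "curated.events")
    .option("user", "loader")
    .option("password", "***")
    .mode("append")
    .save())
```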

Using Talend, many companies accelerate ingestion into their Microsoft Azure Data Lake by 50%. Watch the video below to learn how Talend Cloud is helping customers move to the cloud, or start experiencing Talend Cloud firsthand by signing up for a 30-day free trial today.