Full Resource Library

What is Apache Spark?

Apache Spark is one of the most powerful tools available for high-speed big data operations and management. Spark’s in-memory processing power and Talend’s single-source, GUI management tools are bringing unparalleled data agility to business intelligence.

View Now

5 Ways to Optimize Your Big Data

Big data is only getting bigger, which means now is the time to optimize. Optimizing big data means (1) removing latency in processing, (2) exploiting data in real time, (3) analyzing data prior to acting, and more. Learn how to get started today.

View Now

What is a Data Lake?

A data lake is a central storage repository that holds big data from many sources in a raw format. The benefits of the data lake format are enticing many organizations to ditch their data warehouses. Discover what sets data lakes apart, why they are becoming more popular, and how to start building one.

View Now

What is MapReduce?

MapReduce is a programming model or pattern within the Hadoop framework that is used to access big data stored in the Hadoop File System (HDFS). The map function takes input data as key/value pairs, processes them, and produces another set of intermediate key/value pairs as output.
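The pattern can be sketched in a few lines of plain Python. This word-count example illustrates the map/reduce contract itself, not Hadoop's Java API: the map phase emits intermediate (word, 1) pairs, and the reduce phase groups them by key and sums the counts.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit an intermediate (word, 1) pair for every word."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: group intermediate pairs by key and sum the counts."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

lines = ["big data big insight", "big data"]
print(reduce_phase(map_phase(lines)))  # {'big': 3, 'data': 2, 'insight': 1}
```

In a real Hadoop job, the framework shuffles the intermediate pairs across the cluster between the two phases, so each reducer sees all values for a given key.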

View Now

Data Lake vs Data Warehouse

Data lakes and data warehouses are both widely used for storing big data, but they are not interchangeable terms. A data lake is a vast pool of raw data whose purpose is not yet defined. A data warehouse is a repository for structured, filtered data that has already been processed for a specific purpose.

View Now

ETL Testing: An Overview

ETL testing refers to tests applied throughout the ETL process to validate, verify, and ensure the accuracy of data while preventing duplicate records and data loss. Learn the 8 stages of ETL testing, 9 types of tests, common challenges, how to find the best tool, and more.
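As a flavor of what such tests look like in practice, here is a minimal Python sketch of two common checks, completeness (no records lost between source and target) and uniqueness (no duplicates loaded), with in-memory lists standing in for the source and target tables (the `id` field is a hypothetical example):

```python
def check_row_counts(source_rows, target_rows):
    """Completeness test: no records lost between source and target."""
    return len(source_rows) == len(target_rows)

def find_duplicates(rows, key):
    """Uniqueness test: return any key values that appear more than once."""
    seen, dupes = set(), set()
    for row in rows:
        k = row[key]
        if k in seen:
            dupes.add(k)
        seen.add(k)
    return dupes

source = [{"id": 1}, {"id": 2}, {"id": 3}]
target = [{"id": 1}, {"id": 2}, {"id": 2}]
print(check_row_counts(source, target))  # True (counts match)
print(find_duplicates(target, "id"))     # {2} -- record loaded twice
```

Note that the two tests catch different failures: the row counts match here even though a record was dropped and another duplicated, which is why ETL test suites layer several kinds of checks.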

View Now

What is Data Preparation?

Data preparation is the process of cleaning and transforming raw data prior to processing and analysis. It is a time-consuming process, but the business intelligence benefits demand it. And today, savvy self-service data preparation tools are making it easier and more efficient than ever.
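For illustration, here is a minimal, library-free Python sketch of typical preparation steps: trimming whitespace, normalizing case, dropping incomplete records, and de-duplicating. The `name` and `email` fields are hypothetical examples, not from any particular dataset:

```python
def prepare(rows):
    """Clean raw records: trim whitespace, normalize case,
    drop incomplete rows, and remove duplicates (keyed on email)."""
    seen = set()
    cleaned = []
    for row in rows:
        name = (row.get("name") or "").strip().title()
        email = (row.get("email") or "").strip().lower()
        if not name or not email:      # drop incomplete records
            continue
        if email in seen:              # de-duplicate on email
            continue
        seen.add(email)
        cleaned.append({"name": name, "email": email})
    return cleaned

raw = [
    {"name": "  ada lovelace ", "email": "ADA@example.com"},
    {"name": "Ada Lovelace", "email": "ada@example.com "},  # duplicate
    {"name": "", "email": "ghost@example.com"},             # incomplete
]
print(prepare(raw))  # [{'name': 'Ada Lovelace', 'email': 'ada@example.com'}]
```

Self-service preparation tools automate exactly these kinds of transformations, but through a visual interface rather than hand-written code.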

View Now

What is Hadoop?

Hadoop is an open-source, Java-based framework used for storing and processing big data. The data is stored on inexpensive commodity servers that run as clusters. Its distributed file system enables concurrent processing and fault tolerance.

View Now

Defining Big Data Analytics for the Cloud

Big data analytics is the process of translating massive amounts of digital information into useful business intelligence. Utilizing this data, companies can provide actionable information that can be used in real-time to improve business operations, optimize applications for the cloud, and more.

View Now

Big Data Quality

With the advent of big data, data quality management is both more important and more challenging than ever. Fortunately, the combination of Hadoop open source distributed processing technologies and Talend open source data management solutions brings big data quality operations within the reach of any organization.

View Now

The Future of Big Data

Big data is the catch-all term used to describe gathering, analyzing, and using massive amounts of digital information to improve operations. It is rapidly changing the way we live, shop, and approach daily life. Understand what big data is and how you can put it to work for you.

View Now

ETL vs ELT: Defining the Difference

The difference between ETL and ELT lies in where data is transformed into business intelligence and how much data is retained in working data warehouses. Discover what those differences mean for business intelligence, which approach is best for your organization, and why the cloud is changing everything.
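The ordering difference can be made concrete in a few lines. In this hedged sketch, plain Python lists stand in for the source system and the warehouse, and the record fields are illustrative only; the point is where the transform step runs relative to the load:

```python
def transform(record):
    """A toy transformation: normalize names to upper case."""
    return {"name": record["name"].upper()}

source = [{"name": "alice"}, {"name": "bob"}]

# ETL: transform happens *before* the data reaches the warehouse.
etl_warehouse = [transform(r) for r in source]

# ELT: raw data is loaded first; transformation runs later,
# inside the warehouse, using its own compute.
elt_warehouse = list(source)                           # load raw
elt_warehouse = [transform(r) for r in elt_warehouse]  # transform in place

print(etl_warehouse)  # [{'name': 'ALICE'}, {'name': 'BOB'}]
print(elt_warehouse)  # same end state, different timing and location
```

Both approaches converge on the same final tables; the trade-off is that ELT retains the raw data in the warehouse and defers transformation, which cloud warehouses with cheap storage and elastic compute make increasingly attractive.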

View Now

Building a Governed Data Lake in the Cloud

The main purpose of a Data Lake is to provide full and direct access to raw (unfiltered) organizational data as an alternative to storing varying and sometimes limited datasets in scattered, disparate data silos.

View Now

Running a Job on YARN

In this tutorial, create a Big Data batch Job running on YARN that reads data from HDFS, sorts it, and displays it in the Console.

Watch Now

Running a Job on Spark

Learn how to create a Big Data batch Job using the Spark framework that reads data from HDFS, sorts it, and displays it in the Console.
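The tutorial builds the Job graphically in Talend Studio, but the equivalent logic can be sketched in PySpark. The HDFS path and column name below are illustrative assumptions, and the snippet requires a running Spark installation with access to a Hadoop cluster:

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session for this batch job.
spark = SparkSession.builder.appName("SortAndDisplay").getOrCreate()

# Read data from HDFS (the path is a placeholder for your own file).
df = spark.read.option("header", True).csv("hdfs:///user/demo/customers.csv")

# Sort the rows and display them in the console.
df.orderBy("CustomerID").show()

spark.stop()
```

The graphical Job performs the same three steps with dedicated components for the HDFS input, the sort, and the console output.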

Watch Now

Creating Cluster Connection Metadata from Configuration Files

In this tutorial, create Hadoop Cluster metadata by importing the configuration from the Hadoop configuration files.
This tutorial uses Talend Data Fabric Studio version 6 and a Hadoop cluster: Cloudera CDH version 5.4.
1. Create a new Hadoop cluster metadata definition
Ensure that the Integration perspective is selected.
In the Project Repository, expand Metadata, right-click Hadoop Cluster, and click Create Hadoop Cluster to open the wizard.
In the Name field of the Hadoop Cluster Connection wizard, type MyHadoopCluster_files. In the Purpose field, type Cluster connection metadata. In the Description field, type Metadata to connect to a Cloudera CDH 5.4 cluster, and click Next.

Watch Now

YARN in Hadoop

The recent introduction of YARN in Hadoop provides organizations that are managing big data with even greater processing speed and scalability. An acronym for Yet Another Resource Negotiator, YARN in Hadoop solves a bottleneck in the first version of Hadoop MapReduce and reduces the strict dependency of Hadoop environments on MapReduce.

View Now
