Full Resource Library

Running a Job on Spark

Learn how to create a Big Data batch Job using the Spark framework, read data from HDFS, sort it, and display it in the Console.

Watch Now

Data Quality Everywhere

This on-demand webinar shows how to turn data into trusted assets by standardizing, monitoring, and establishing gatekeepers and rule-based controls.

Watch Now

Best Practices Report: Multiplatform Data Architectures

This Multiplatform Data Architectures report explains in detail what MDAs are and do, with a focus on helping data professionals and their business counterparts worldwide architect, govern, and grow their MDAs for better business outcomes via well-integrated and unified distributed data from many sources.

Download Now

ETL in the Cloud

Since the dawn of big data, the ETL (extract, transform, and load) process has been the heart that pumps information through modern business networks. Today, cloud-based ETL is a critical tool for managing massive data sets, and one that companies will increasingly rely on in the future.

View Now

What is ELT?

ELT is the process by which raw data is extracted from its sources, loaded into a data lake or warehouse, and then transformed there. In contrast to ETL, ELT provides faster loading, because the data is not transformed before it is loaded.
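
To make the extract, load, transform order concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for a warehouse. The table names (raw_orders, order_totals), the column names, and the run_elt function are hypothetical and chosen only for illustration; a real ELT pipeline would target an actual data lake or warehouse.

```python
import sqlite3


def run_elt(rows):
    """Load raw rows untouched, then transform them inside the database."""
    # Assumption: an in-memory SQLite database stands in for the warehouse.
    conn = sqlite3.connect(":memory:")

    # Load: the raw values go in exactly as extracted (amounts are still text).
    conn.execute("CREATE TABLE raw_orders (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", rows)

    # Transform: cleaning and aggregation happen after loading, in SQL.
    conn.execute(
        """
        CREATE TABLE order_totals AS
        SELECT TRIM(customer) AS customer,
               SUM(CAST(amount AS REAL)) AS total
        FROM raw_orders
        GROUP BY TRIM(customer)
        """
    )
    result = conn.execute(
        "SELECT customer, total FROM order_totals ORDER BY customer"
    ).fetchall()
    conn.close()
    return result
```

Because the load step does no cleaning, it stays fast; the cost of trimming and casting is deferred to the transform query that runs where the data already lives.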

View Now

What is MapReduce?

MapReduce is a programming model or pattern within the Hadoop framework that is used to access big data stored in the Hadoop Distributed File System (HDFS). The map function takes input pairs, processes them, and produces another set of intermediate pairs as output.
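
As a rough illustration of the map step described above, the following sketch runs a word count entirely in memory in plain Python. It does not use Hadoop or any Hadoop API; the function names (map_words, shuffle, reduce_counts, run_word_count) are hypothetical, and the shuffle and reduce steps are included only to show what typically happens to the intermediate pairs.

```python
from collections import defaultdict


def map_words(offset, line):
    """Map: take one (offset, line) input pair and emit (word, 1) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group intermediate values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce: collapse all values for one key into a single output pair."""
    return word, sum(counts)


def run_word_count(lines):
    """Run map, shuffle, and reduce sequentially over a list of lines."""
    intermediate = []
    for offset, line in enumerate(lines):
        intermediate.extend(map_words(offset, line))
    grouped = shuffle(intermediate)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())
```

On a real cluster, many map tasks would run in parallel on separate blocks of the input, which is the part this single-process sketch leaves out.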

View Now

ETL Testing: An Overview

ETL testing refers to tests applied throughout the ETL process to validate, verify, and ensure the accuracy of data while preventing duplicate records and data loss. Learn the 8 stages of ETL testing, 9 types of tests, common challenges, how to find the best tool, and more.

View Now

What is Hadoop?

Hadoop is an open-source, Java-based framework used for storing and processing big data. The data is stored on inexpensive commodity servers that run as clusters. Its distributed file system enables concurrent processing and fault tolerance.

View Now

Big Data Quality

With the advent of big data, data quality management is both more important and more challenging than ever. Fortunately, the combination of Hadoop open-source distributed processing technologies and Talend open-source data management solutions brings big data quality operations within the reach of any organization.

View Now

ETL vs ELT: Defining the Difference

The difference between ETL and ELT lies in where data is transformed into business intelligence and how much data is retained in working data warehouses. Discover what those differences mean for business intelligence, which approach is best for your organization, and why the cloud is changing everything.

View Now

Running a Job on YARN

In this tutorial, create a Big Data batch Job running on YARN, read data from HDFS, sort it, and display it in the Console.

Watch Now

Creating Cluster Connection Metadata from Configuration Files

In this tutorial, create Hadoop Cluster metadata by importing the configuration from the Hadoop configuration files.
This tutorial uses Talend Data Fabric Studio version 6 and a Hadoop cluster: Cloudera CDH version 5.4.
1. Create a new Hadoop cluster metadata definition
Ensure that the Integration perspective is selected.
In the Project Repository, expand Metadata, right-click Hadoop Cluster, and click Create Hadoop Cluster to open the wizard.
In the Name field of the Hadoop Cluster Connection wizard, type MyHadoopCluster_files. In the Purpose field, type Cluster connection metadata. In the Description field, type Metadata to connect to a Cloudera CDH 5.4 cluster. Then click Next.

Watch Now
