How to Query a Redshift Table With Talend Cloud
Talend Cloud enables the use of several prebuilt connectors and components for different services running on cloud platforms like Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform.
This article explores a use case of querying an Amazon Redshift table as part of a Talend Job developed in Talend Studio and executed as a task on Talend Cloud. The data used is a sample of orders placed by customers. The requirement is to (1) fetch the order quantity for each customer from a table on Redshift, and (2) use a lookup of state codes, a CSV file consisting of only state ID and state name columns, to identify the state name for each customer order. A Talend Cloud task is then executed to accomplish this.
Preparation for the use case:
For the implementation of the use case, a demo environment of Amazon Redshift has been prepared. As part of this preparation the steps followed are:
1. Creating a Redshift cluster (a single-node cluster is used here)
2. Creating a table ‘dbinfo’ with columns for: customer ID, customer first name, customer last name, state ID (as part of the customer address), order number, and order quantity. Refer to the image below of the ‘Query editor’ for Redshift on the AWS console.
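The exact DDL is shown in the screen capture; a sketch of such a ‘create table’ statement might look like the following (the column names and types here are assumed for illustration, not copied from the demo):

```sql
-- Hypothetical DDL matching the columns described above;
-- the actual names/types appear in the Redshift Query editor capture.
CREATE TABLE public.dbinfo (
    customer_id INTEGER,
    firstname   VARCHAR(50),
    lastname    VARCHAR(50),
    stateid     INTEGER,
    ordernum    INTEGER,
    quantity    INTEGER
);
```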
The table is created in the public schema. When the above ‘create table’ statement succeeds, the table appears in the list; refer to the screen capture below. Note that there are no state names available as part of the data on Redshift, only state IDs.
3. To populate the table with sample data, a sample CSV available in S3 is used. Run the COPY command shown in the screen capture below.
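For reference, a Redshift COPY from S3 generally follows this shape. The bucket path and IAM role below are placeholders, not the values used in the demo:

```sql
-- Placeholder bucket and role; substitute your own values.
COPY public.dbinfo
FROM 's3://your-bucket/sample-orders.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/your-redshift-role'
FORMAT AS CSV
IGNOREHEADER 1;
```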
4. Verify the sample data populated.
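A quick verification can be done with simple queries such as:

```sql
-- Confirm the row count and inspect a few sample rows.
SELECT COUNT(*) FROM public.dbinfo;
SELECT * FROM public.dbinfo LIMIT 10;
```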
These preparation steps complete the demo environment for this article.
In a real-world scenario, the use case could be a larger extension of this demo that requires you to do further complex analysis/querying on one or multiple tables populated in Redshift.
Talend job for querying Redshift table:
A Talend standard Job has prebuilt components to connect to Amazon Redshift and to fetch data from Redshift. The sample job created to demonstrate the use case here looks like the image below.
Talend Studio, available with Talend Cloud Real-Time Big Data Platform version 7.1.1, is used to develop this sample job.
Steps to create this job include:
- Create a DB connection in the Talend Studio metadata repository.
- Either drag the connection definition from the repository into the designer and select the tRedshiftConnection component when prompted, or use tRedshiftConnection from the Palette and enter the Redshift cluster, database, and table information manually. The element named ‘blog_redshift’ in the image above is the tRedshiftConnection component.
- Create a new subjob starting with the tRedshiftInput component. The element named ‘dbinfo’ is the tRedshiftInput component. The configuration for this component looks like the image below.
Using ‘Guess Query’ populates the ‘Query’ property with a SELECT statement for the selected table, as displayed in the image. The demo uses this default query as generated, but it can be edited as needed. For example, the query could be edited here to fetch only the necessary columns: ‘quantity’ and ‘stateid’.
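As an illustration, the generated query selects every column; a trimmed-down version fetching only the columns the Job needs (assuming the demo column names) would be:

```sql
-- Edited query: fetch only the columns used downstream.
SELECT dbinfo.quantity, dbinfo.stateid
FROM public.dbinfo;
```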
The tMap component combines the Redshift table data with the CSV data and filters out the necessary columns: in this use case, ‘quantity’ from the Redshift table and ‘statename’ from the lookup CSV file.
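Conceptually, the join performed in tMap is equivalent to a SQL join like the one below. The ‘states’ table name is hypothetical and stands in for the CSV lookup, which is not actually a Redshift table:

```sql
-- Equivalent of the tMap join on state ID: keep 'quantity'
-- from Redshift and 'statename' from the lookup.
SELECT d.quantity, s.statename
FROM public.dbinfo AS d
JOIN states AS s
  ON d.stateid = s.stateid;
```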
- tLogRow outputs the two columns, one from Redshift and one from the CSV, after joining both inputs on the ‘state ID’ column in each. Again, in a real-world scenario, this part of the Talend Job could include complex logic for whatever analysis is required.
- It is good practice to close any connections created as part of a Talend Job; tRedshiftClose closes the connection created by tRedshiftConnection.
With this, the Job that implements the requirements of the use case is complete and ready to be deployed to Talend Cloud for execution as a task.
Create a task in Talend Cloud to run the job with Redshift query:
To deploy the Talend Job to Talend Cloud, right-click the Job in the repository and click the ‘Publish to Cloud’ option. Refer to the image below.
The Publish to Cloud option requires selecting a workspace, where the Job will be deployed as an artifact. A version number is associated with the artifact, and every publish automatically increments the version.
The Talend Job is deployed as an artifact, and a task is automatically created. Both the artifact and the task are available under the ‘Management’ left menu of Talend Management Console (TMC). The image below displays the workspace ‘Personal’ tile under the ‘Default’ environment, which contains links to the artifacts list and the tasks list.
Tasks and plans list:
Note: A plan is the step-by-step execution of multiple tasks depending on specified conditions. Each step in a plan is associated with one task.
Click the task to edit it, and use the pencil icon within the Configuration section, highlighted with a green box in the image below.
Editing a task includes selecting the artifact for the task (pre-populated here) and specifying go-live attributes:
Clicking the Go Live button executes the task based on the run type.
The runtime here can be either a Cloud Engine or a Remote Engine. In this demo, a pre-defined Remote Engine called ‘demo-csm-re’ is used. The article on architecture in the Talend Community provides details regarding Cloud Engines and Remote Engines.
Remote Engines are listed under the ‘Engines’ left menu of the Talend Management Console in Talend Cloud. Refer to the image below.
Clicking on View Logs displays the execution logs for the task:
Talend Cloud makes it easier to integrate data from different kinds of sources like other cloud platforms, SaaS applications, or data in on-premises systems; it empowers users to perform complex transformations and/or analysis on the integrated data. Give Talend Cloud a try today.