How to Optimize Data Pipelines Using Talend and intermix.io on Amazon Redshift
Cloud Data Warehouse Trends
In the past few years, we’ve seen rapid adoption of a new data analytics stack, based on cloud warehouses like Amazon Redshift, Google BigQuery and Snowflake. This stack has three layers:
- Data Integration. Ingests raw data from the source into a staging area in the warehouse.
- Modeling Layer. Run algorithms on the raw data (usually with batch processing) and materialize the outputs into prepared tables and views.
- Application Layer. Data discovery and visualization tools. Other tools that serve data to other systems (ie for machine learning applications).
On AWS, the stack looks like this:
Companies using this new stack deal with three growth challenges:
- Growing Data Volumes. Data volume grows by a factor ~10x over 3-5 years, with more and new types of data sources emerging along the way.
- Model Explosion. Companies are always asking new questions of data, so the amount of models is always increasing.
- Users and Apps. Data is becoming useful for more and more people and departments within the organization.
Trusted and Timely Data Becomes Mission Critical
Companies aim to make every department data driven. Data use cases typically fall into one of three categories:
- Basic reporting. This is traditional Business Intelligence (BI). An analyst writes a query and then inspects the results. With BI you have to know what questions you’re seeking an answer to, before your run the query.
- Advanced analytics. Autonomous or semi-autonomous data exploration using sophisticated techniques. Goes beyond those of traditional business intelligence (BI), to discover deeper insights, make predictions, or generate recommendations.
- Machine Learning. This is when the data modeling function is autonomous and “learns” based on example inputs. It’s based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention. Most companies are still in the very early phases of ML development.
For each of the above uses cases, a company will need different applications, tools and people to interface with the data.
It’s critical that the raw data is consistent across this growing number of people and tools. If the data is not consistent, it would be impossible to determine whether an error in the data is caused by a modeling problem, or a raw data problem.
In response, companies invest in doing modeling in the data warehouse and delivering prepared data to analysts and data science teams. These prepared data sets are built by continuously running algorithms on the data, and ensuring that the outputs are correct. It’s all about getting all your data into a single place and with the ability to run fast queries and store large volumes of data.
Modeling in the data warehouse requires building fast and robust data pipelines.
Building Fast Data Pipelines on Amazon Redshift
Amazon Redshift is a great place to do build data pipelines because of fast performance and low cost of storage.
Performance management is critical, because slow queries impact data pipelines in the following ways:
- When data pipeline jobs take longer, overall data recency “freshness” goes down.
- Query authors spend more time waiting for queries to finish, leads to slower velocity for data science projects.
- Expensive, wasteful queries require more CPU and Storage, which leads to higher costs.
Building data pipelines on Amazon Redshift is a matter of doing some thoughtful up-front planning and ongoing monitoring as your data volume, users and cluster grow.
Here are some tips we’ve learned from our experience helping customers with Amazon Redshift. For a more comprehensive list, please check out our “Top 14” blog post at this link.
Amazon Redshift Tip 1: Use the Workload Manager (WLM)
The Amazon Redshift Workload Manager (WLM) is critical to managing query performance. Amazon Redshift run queries in a queueing model. The default WLM configuration has a single queue with five slots. 99% of the time, this default configuration will not work for you and you will need to tweak it.
Configuring the WLM for your workloads is critical to:
- Prioritizing key workloads - making sure the important queries are fast
- Scaling workloads - ensuring that you scale performance as you add more users (concurrency scaling)
For more information on setting up Amazon Redshift WLM, see our recent blog post on the Automatic WLM feature.
Amazon Redshift Tip 2: Use Amazon Spectrum for infrequently used data
Amazon Redshift prices are based on the size of your cluster, i.e. it couples compute and storage. You’ll have to keep adding nodes for storage, even though you may not need the additional computing power of the additional vCPUs.
With Redshift Spectrum, you can leave data as-is in your S3 data lake, and query it via Amazon Redshift. This approach makes sense when you have data that doesn’t require frequent access. Leave your “hot” data in Amazon Redshift, and your “cold” data in S3.
With Redshift Spectrum, you get the best of both worlds: a) keep all your historical data, along with the performance of Amazon Redshift and b) benefit from the cost savings of using S3. See a more detailed write-up on Amazon Spectrum at this link.
Amazon Redshift Tip 3: Security Best Practices
Security and access controls should be first-class considerations for any data project. A common pattern is to separate the ‘prepared’ data sets from the ‘raw’ data sets This can be done easily by creating one schema for each.
Here are some rules of thumb to implement separation of concerns for the various actors in the data warehouse:
- Analysts and Data Scientists have read-only access to ‘prepared’ data schemas.
- Modeling tools have read-only access to ‘raw’ data sets, and write access to the prepared data sets.
- ETL tools have ‘write-only’ access to the ‘raw’ data schemas.
Setting up your access controls in this way will prevent problems down the road.
Talend and intermix.io: Better Together
intermix.io helps data teams spot problems in data pipelines before they turn into fire drills. It’s a SaaS product that provides a single view of every tool, user, and query that touches the data in your Amazon Redshift cloud warehouse.
Intermix.io works by ingesting metadata & other logs from Amazon Redshift. Intermix.io can see all the SQL queries executed on the data warehouse, as well as metadata about storage and tables.
The intermix.io Talend integration works by embedding a comment block in the Talend SQL query before the SQL query is submitted to Amazon Redshift. The comment block contains metadata about the Talend job like job name, user name, and other attributes. intermix.io then introspects the SQL query to parse the comment block, and stores the query attributes.
The result is the ability to group query latency metrics by Talend job name and user - providing an easy way to spot issues in individual Talend jobs.
Troubleshoot Talend Pipelines
With intermix.io Query Insights, find the individual queries that cause your Talend data pipelines to suffer. See latency and query throughput for all Talend jobs, and drill-down into the individual jobs and SQL queries.
Once you see your problematic queries, easily drill into them to get details recommendations on how to optimize the queries to run faster.
Query Optimization Recommendations
Once you find a query that is taking a long time, drill into the query to get actionable advice on how to re-write it to make it run faster.
Measure & Improve SLAs
The intermix.io dashboard makes it easy to see the P95 SLA you are providing to users, and quickly drill into the users who are running expensive queries.
Improving SLA metrics requires a two-pronged approach:
- Optimize the Amazon Redshift cluster resources
- Find and optimize the slowest & most resource-intensive queries
By following the advice in this article, you will be able to respond to and fix problems fast. The result will be fewer support requests, and peace of mind that when you return to your desk on Monday morning, you won’t have to deal with any surprises and complaints.
We’ve assembled a number of useful resources for Amazon Redshift practitioner:
- Join hundreds of Amazon Redshift practitioners on our free Slack channel. Sign up here.
- Download technical content for Amazon Redshift on our Blog.
About the author
Paul Lappas is the Co-Founder & CEO at intermix.io where he helps customers optimize data pipeline & analytical queries on Amazon Redshift.