A Serverless Architecture for Big Data

This post is co-authored by Jorge Villamariona and Anselmo Barrero at Qubole.

A popular term emerging from the software industry over the last few years is serverless computing, more commonly referred to as just “serverless”. So what does it mean? In its simplest form, a serverless architecture is a computing model where a service provider dynamically manages the allocation of computing resources based on a Service Level Agreement (SLA), provisioning and running resources only for the time needed and without requiring end-user involvement. 

With a serverless architecture, the server provider would automatically increase computing capacity when demand for resources is high and would intelligently downscale when demand for resources goes down.  In this architecture, the end users only care about the tasks they want to execute (get a report, execute a query, execute a data pipeline, etc.) without the hassle of procuring, provisioning and managing the underlying infrastructure.

Traditional vs. Serverless Architectures

So, what are some major advantages for going serverless? Cost, scale and, environment options to start. Traditional architectures rely on the infrastructure administrator’s ability to estimate workloads and size hardware and software accordingly.  Moving to the cloud represents an improvement over on-premises architectures because it allows the infrastructure to scale on-demand. 

However, administrators still need to be involved to define the conditions and rules to scale and manage the cloud infrastructure. The next step forward is to leverage a serverless architecture and allow the infrastructure to automatically decide behind the scenes when to provision, scale and decommission resources as workloads change.  Qubole is a great example of a serverless architecture.

The Qubole platform automatically determines the infrastructure needed and scales it intelligently based on the workloads and SLAs.  As a result, Qubole’s serverless architecture saves customers over 50% in annual infrastructure costs compared to traditional and other managed cloud big data architectures.

This intelligent automation allows Qubole to process over an exabyte of data per month for customers deploying AI, machine learning, and analytics without requiring customers to provision and manage any infrastructure

Value of adopting a serverless architecture for Big Data

Big data deals with large volumes of data arriving at high speed which makes it difficult and inefficient to estimate the infrastructure required for processing it ahead of time.   On-premises infrastructures impose limits in processing power, are expensive, and complex to manage and maintain. Deploying Big Data in the cloud on your own or as a managed service from cloud providers (Amazon AWS, Microsoft Azure, Google Cloud, etc) improves the processing limitations and eases capex, but it creates overhead managing and attempting to optimize the infrastructure.  Improper utilization, underutilization, or overutilization on certain time periods can lead to cloud costs that are much higher than on-premises processing. This, combined with scarce skilled resources results in a very low success rate of only 15% for all big data projects according to Gartner 

To successfully leverage a serverless platform for big data you need to look for a solution that addresses the following questions:

  • Will it reduce big data infrastructure costs?
  • Does it provide automation and resources to execute data pipelines and provide analytics at any scale?
  • Will it reduce operational costs?
  • Will it help my data team scale and not be overrun by business demands for data? 

A serverless platform like Qubole is very appealing to teams deploying big data because it addresses the factors that cause big data projects to fail since it reduces infrastructure complexity and costs, as well as reliance on scarce experts.   

Qubole reduces the administration overhead by providing a simple interface to define the run-time characteristics of big data engines. Users only need to specify the minimum and maximum clusters size, whether to leverage spot instances (in the case of AWS) and the cluster composition to meet their price performance objectives. Qubole then takes over and automatically manages the infrastructure based on the business  requirements  and the workloads’ SLA without the need for further manual intervention

Qubole’s serverless architecture auto-scales to avoid latencies when dealing with large bursty incoming loads and it also down-scales to avoid idle wasted resources.  Qubole can scale from 5 nodes up to 200 nodes in less than 5 minutes. For reference, Qubole also manages the largest Spark cluster in the cloud (500+ nodes).

TCO of a Serverless Big Data Architecture 

When it comes to pricing, Qubole’s serverless architecture offers the best performance by adding computing capacity only when needed and orderly downscaling it as soon as resources become idle. 

With Qubole there are no infrastructure administration overheads or cloud resources overspent. Additionally, as we can see in the chart above, data teams leveraging Qubole don’t suffer from delays in provisioning computing resources when workloads suddenly increase.

 The combination of Talend Cloud and Qubole not only lowers infrastructure costs, but also increases the productivity of the data team, since they don’t need to worry about cluster procurement, configuration, and management. Data teams build their data pipelines in Talend Cloud and push their execution to the Qubole serverless platform, all without having to write complex code or managing infrastructure.

This partnership allows these teams to focus on building highly functioning end-to-end data pipelines, allowing data scientists to deploy faster IOT, machine learning and advanced analytics applications that have high impact on the business. With Talend and Qubole data teams build scalable serverless data pipelines, that work at low operating costs while often being engineered and maintained by single developers.  This cost reduction makes the benefits of big data more accessible to a wider audience.

To learn more about Qubole and test-drive the serverless platform visit https://www.qubole.com/lp/testdrive/

About the Authors

Jorge Villamariona works for the Product Marketing team at Qubole.  Over the years Mr. Villamariona has acquired extensive experience in relational databases, business intelligence, big data engines, ETL,  and CRM systems. Mr. Villamariona enjoys complex data challenges and helping customers gain greater insight and value from their existing data.

Anselmo Barrero is a Director of Business Development at Qubole with more than 25 years of experience in IT and three patents granted. Mr. Barrero is passionate about building products and strategic partnerships to address market opportunities. He has created products that yield more than 50% YoY growth and established strategic partnerships in areas such as Data Warehouse that resulted in more than 100% consecutive YoY growth.
In his current role Mr. Barrero is responsible for establishing strategic partnerships in big data and the cloud to allow customers reduce the cost and time of getting value out of their data

Join The Conversation


Leave a Reply