Data Integration with Google BigQuery

Research firm IDC made a remarkable prediction last year. According to its report, the amount of data produced globally will reach 163 zettabytes (1 ZB = 1 trillion GB) by 2025, a ten-fold increase from 2016. While the gargantuan increase in the amount of available data might seem like a great thing for business, many companies lack the tools needed to query and process it efficiently. As a result, troves of data have gone unprocessed and unused.

Enter Google BigQuery.

To address this data dilemma, Google introduced the MapReduce algorithm, which was able to split and batch process massive datasets in the Hadoop ecosystem. Google BigQuery takes this concept even further: BigQuery gives companies the power to process petabytes of data in a matter of minutes or even seconds. In this article, we take a closer look at BigQuery, its capabilities, and offer some insight on how to get started with this powerful data processing tool.

What is BigQuery?

is a cloud-based data warehouse from Google that lets users query and analyze large amounts of read-only data. Using a SQL-like syntax, BigQuery runs queries on billions of rows of data in a matter of seconds.

Google BigQuery is an IaaS (infrastructure as a platform) which offers serverless, scalable infrastructure along with an elastic pay-as-you-go pricing model. It eliminates the effort and expense involved in procuring and managing on-premise hardware. Moreover, it’s an affordable service as users pay only for the hardware used and the queries processed.

BigQuery successfully democratizes big data analysis. Until its launch, only enterprises, with enormous financial and human resources were able to afford the infrastructure needed to produce such massive data analysis. BigQuery changes the equation by essentially renting the infrastructure and compute resources needed to mine vast amounts of data for business intelligence.

The key to BigQuery — Dremel

BigQuery is the public interface of Google’s Dremel query engine. In tandem with Google’s data centers, it’s Dremel that enables BigQuery to run big data jobs quickly and efficiently.

While it had been possible to run interactive queries on traditional database systems for decades, it was a challenge to replicate the process in the big data world. This was due to the presence of huge amounts of unstructured data such as images, videos, log files, and books. All of this data needed to be queried, and Google needed a solution.

At first, MapReduce was designed to tackle this challenge. However, its batch-processing approach made it less than ideal for instant querying. Dremel, on the other hand, enabled Google to perform interactive querying on billions of records in seconds.

Behind the BigQuery curtain

To understand what makes BigQuery so powerful, it’s helpful to dig deeper into Dremel’s features and characteristics. In particular, we want to look at Dremel’s data architecture, columnar databases, and nested data storage components.

Tree architecture

Dremel uses tree architecture, which means that it treats a query as an execution tree. Execution trees break an SQL query into pieces and then reassemble the results for faster performance. Slots (or leaves) read billions of rows of data and perform computations on them while the mixers (or branches) aggregate the results.

Columnar databases

Another reason for Dremel’s incredibly fast performance is its use of a columnar data storage format instead of the traditional row-based storage. Columnar databases allow for better compression due to the homogenous nature of data stored within columns. In this design, only the required columns are pulled out, making it an ideal choice for huge databases with billions of rows.

Data sorting and aggregation operations are also easier with columnar databases when compared to relational databases. This makes columnar databases more suitable for intensive data analysis and the parallel processing approach employed in Dremel’s tree architecture.

Nested data storage

Join-based queries can be time-consuming in normalized databases, and this challenge only gets worse in large databases. Dremel opts for a different approach and permits the storage of nested or repeated data using the data type — RECORD. This feature gives Dremel the capability to maintain relationships between data inside a table. Nested data can be loaded from JSON files or other source formats into tables.

Columnar and nested data storage are ideal for querying semi-structured and unstructured data, which constitute an important part of the big data universe.

Hands on with BigQuery

If Google BigQuery is beginning to sound like the right choice for your company, you should know that there are three ways to access BigQuery — through a web UI, through the REST API, or through a command line tool.

Accessing BigQuery through Web UI

If you’re new to BigQuery, the web UI may be the best starting point. Both the classic and newer versions of the web UI are easy to understand and work with. When you launch the UI console, a window will appear which guides you through the process of starting a new project.

After the creation of a new project, three steps must be taken before you can start using BigQuery to run jobs:

Step 1: Enable BigQuery API for the project.
Step 2: Enter billing details, even it’s a free trial. The free package comes with 10 GB of active storage and 1 TB of processed query data per month.
Step 3: Prepare your dataset.

Preparing data for analysis

Since BigQuery works with Google Cloud Platform, you’ll need to have your data loaded into Google Cloud Storage before you can execute queries. Input data is usually structured in formats such as CSV, JSON, or Avro before importing into Cloud Storage.

You can use the Data Transfer Service to schedule and manage future data imports. In addition, Google allows integration between its other products such as Google Drive, Google BigTable, and Google Sheets. If the data resides in one of these sources, there is no need to import it again.

Once your data is located in the right source, you can import it into BigQuery and begin running queries.

Integrating data with BigQuery

If you’re ready to start using BigQuery to analyze your data, you’ll need to identify each application, platform, or service from which you want to move your data. For many companies, this is the primary challenge to using BigQuery.

For data that’s stored in sources other than Google, a data integration platform can automate and speed up the extraction, cleansing, and migration processes. With over 900 connectors, Talend Cloud Integration Platform combines and unifies multiple types of data from a variety of sources for use with Google Cloud Storage and BigQuery. Start your free trial of Talend Cloud and see how easy it can be to go from zero to BigQuery in no time.

Ready to get started with Talend?