What is Apache Hive?

You need Hadoop for reliable, scalable, distributed computing, but the learning curve for extracting data from it is just too steep to be time effective and cost efficient. Your answer? Apache Hive.

What is Apache Hive?

Apache Hive is popular data warehouse software that enables you to quickly and easily write SQL-like queries that efficiently extract data from Apache Hadoop.

Hadoop is an open-source framework for storing and processing massive amounts of data. While Hadoop offers many advantages over traditional relational databases, the task of learning and using it is daunting, since queries must be implemented in the MapReduce Java API rather than in SQL.

To remove this barrier, Facebook developed the Apache Hive data warehouse so its engineers could bypass writing Java and access data through simple SQL-like queries.

Today, Apache Hive’s SQL-like interface has become the gold standard for ad-hoc querying, summarizing, and analyzing Hadoop data. The approach is particularly cost effective and scalable when integrated into cloud computing networks, which is why companies such as Netflix and Amazon continue to develop and improve Apache Hive.

The predominant use cases for Apache Hive are batch SQL queries over sizable data sets and batch processing of large ETL and ELT jobs.

How Does Apache Hive Work?

In short, Apache Hive translates a program written in the HiveQL (SQL-like) language into one or more Java MapReduce, Tez, or Spark jobs. (All three execution engines can run in Hadoop YARN.) Apache Hive then organizes the data into tables for the Hadoop Distributed File System (HDFS) and runs the jobs on a cluster to produce an answer.
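
Because these engines are interchangeable at the query level, switching between them is a configuration change rather than a rewrite. As a minimal sketch, the standard hive.execution.engine property selects the engine for a session (the table name here is illustrative):

[style-codebox]-- Choose the execution engine for this session (mr, tez, or spark)
SET hive.execution.engine=tez;

-- The same HiveQL now compiles to Tez jobs instead of MapReduce
SELECT count(1) FROM acme_sales;
[/style-codebox]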

Apache Hive Data

The Apache Hive tables are similar to tables in a relational database, and data units are organized from larger to more granular units. Databases consist of tables that are made up of partitions, which can further be broken down into buckets. The data is accessed through HiveQL (Hive Query Language) and can be overwritten or appended. Within each database, table data is serialized, and each table has a corresponding HDFS directory.
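
To make this hierarchy concrete, here is a minimal sketch of a partitioned, bucketed table; the database, table, columns, and bucket count are all hypothetical:

[style-codebox]-- A database, a table, partitions (one per sale_date), and 32 buckets
CREATE DATABASE IF NOT EXISTS retail;

CREATE TABLE retail.sales (
  id        INT,
  product   STRING,
  unitprice DOUBLE
)
PARTITIONED BY (sale_date STRING)
CLUSTERED BY (id) INTO 32 BUCKETS;
[/style-codebox]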

Apache Hive Architecture

Multiple interfaces are available, from a web browser UI to a CLI to external clients. The Apache Hive Thrift server enables remote clients to submit commands and requests to Apache Hive using a variety of programming languages. The central repository for Apache Hive is the metastore, which contains all system information, such as table definitions.

The engine that makes Apache Hive work is the driver, which consists of a compiler, an optimizer that determines the best execution plan, and an executor. Optionally, Apache Hive can run with LLAP (Low-Latency Analytical Processing). Note that for high availability, you can configure a backup of the metastore.
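
You can inspect the plan the driver produces with the standard EXPLAIN statement, which prints the stages and operator tree without running the query (the query itself is illustrative):

[style-codebox]-- Show the execution plan without running the query
EXPLAIN
SELECT category, count(1)
FROM products
GROUP BY category;
[/style-codebox]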

Apache Hive Security

Apache Hive is integrated with Hadoop security, which uses Kerberos for mutual authentication between client and server. Permissions for newly created files in Apache Hive are dictated by HDFS, which enables you to authorize by user, group, and others.

Benefits of Apache Hive

Apache Hive is ideal for running end-of-day reports, reviewing daily transactions, making ad-hoc queries, and performing data analysis. Such deep insights made available by Apache Hive render significant competitive advantages and make it easier for you to react to market demands.

Following are a few of the benefits that make such insights readily available:

  • Ease of use — Querying data is easy to learn with its SQL-like language.
  • Accelerated initial insertion of data — Data does not have to be read, parsed, and serialized to disk in the database’s internal format, because Apache Hive applies the schema when data is read (schema-on-read) rather than verifying every record against the table definition at insert time, as a traditional database must (see the external-table sketch after this list).
  • Superior scalability, flexibility, and cost efficiency — Because it stores its data in HDFS, Apache Hive can hold hundreds of petabytes, making it a much more scalable solution than a traditional database. As a cloud-based Hadoop service, Apache Hive enables users to rapidly spin virtual servers up or down to accommodate fluctuating workloads.
  • Streamlined security — Critical workloads can be replicated for disaster recovery.
  • Low overhead — Insert-only tables have near-zero overhead. Since there is no renaming required, the solution is cloud friendly.
  • Exceptional working capacity — Apache Hive can sustain workloads of up to 100,000 queries per hour against huge datasets.
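
As a minimal sketch of the schema-on-read behavior noted above, an external table simply overlays a schema on files already sitting in HDFS; nothing is parsed, verified, or rewritten at creation time (the path and columns are hypothetical):

[style-codebox]-- Overlay a schema on existing tab-delimited files; no data is copied or checked
CREATE EXTERNAL TABLE web_logs (
  ip      STRING,
  request STRING,
  status  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/web_logs/';
[/style-codebox]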

Apache Hive vs. Apache Pig

Apache Hive and Apache Pig are key components of the Hadoop ecosystem, and are sometimes confused because they serve similar purposes.

Both simplify the writing of complex Java MapReduce programs, and both free users from learning MapReduce and HDFS. Both support join, order, and sort operations through a high-level language.

However, Apache Hive leverages SQL more directly and thus is easier for database experts to learn. Additionally, while each system supports the creation of UDFs, UDFs are much easier to troubleshoot in Pig.

Pig is mainly used for programming and is used most often by researchers and programmers, while Apache Hive is used more for creating reports and is used most often by data analysts.

HiveQL

HiveQL is the language you use to query Apache Hive once the structure of your data has been defined. HiveQL statements are very similar to standard SQL, although they do not strictly adhere to the SQL standard. Even without Java or MapReduce knowledge, if you are familiar with SQL you can write customized, sophisticated MapReduce analyses.

The following simple example demonstrates just how similar HiveQL queries are to SQL queries.

[style-codebox]SELECT upper(name), unitprice
FROM acme_sales;

SELECT category, count(1)
FROM products
GROUP BY category;
[/style-codebox]

Additionally, HiveQL supports extensions that are not in standard SQL, including CREATE TABLE AS SELECT and multi-table inserts.
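
Here is a minimal sketch of both extensions; the table names are hypothetical, and the multi-table insert assumes the two target tables already exist with a matching schema:

[style-codebox]-- CREATE TABLE AS SELECT: build a new table directly from a query
CREATE TABLE category_counts AS
SELECT category, count(1) AS product_count
FROM products
GROUP BY category;

-- Multi-table insert: scan the source once, populate two tables
FROM products
INSERT OVERWRITE TABLE cheap_products SELECT * WHERE unitprice < 10
INSERT OVERWRITE TABLE costly_products SELECT * WHERE unitprice >= 10;
[/style-codebox]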

The Apache Hive compiler translates HiveQL statements into DAGs of MapReduce, Tez, or Spark jobs, which are then submitted to Hadoop for execution. Following are a few of the basic tasks that HiveQL can easily handle (a sketch of the last one appears after the list):

  • Create and manage tables and partitions
  • Support various relational, arithmetic, and logical operators
  • Evaluate functions
  • Download table contents to a local directory
  • Download the result of queries to an HDFS directory
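
For example, exporting query results to an HDFS directory is a single statement; a minimal sketch with a hypothetical output path:

[style-codebox]-- Write the result of a query to an HDFS directory
INSERT OVERWRITE DIRECTORY '/tmp/category_report'
SELECT category, count(1)
FROM products
GROUP BY category;
[/style-codebox]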

For details, see the HiveQL Language Manual.

Apache Hive Integration: The Key to Big Data Success

Apache Hive integration is imperative for any big-data operation that requires summarization, analysis, and ad-hoc querying of massive datasets distributed across a cluster. It provides an easy-to-learn, highly scalable, and fault-tolerant way to move and convert data between Hadoop and any major file format, database, or packaged enterprise application.

With big data integrated and easily accessible, your business is primed for tackling new and innovative ways of learning the needs of potential customers. You can also run your internal operations faster with less expense.

Here are a few example business use cases for achieving these goals:

  • Clickstream analysis to segment user communities and understand their preferences
  • Data tracking, for example to track ad usage
  • Reporting and analytics for both internal and customer-facing research
  • Internal log analysis for web, mobile, and cloud applications
  • Data mining to uncover patterns
  • Parsing and learning from data to make predictions
  • Machine learning to reduce internal operational overhead

To truly gain business value from Apache Hive, it must be integrated into your broader data flows and data management strategy.

The open-source Talend Open Studio for Big Data platform is ideal for seamless integration, delivering more comprehensive connectivity than any other data integration solution. This means you can move and convert data between Hadoop and any major file format, database, or packaged enterprise application.

With the Talend Open Studio for Big Data platform, you can run on-premises, in the cloud, or both. As the first purely open-source big data management solution, Talend Open Studio for Big Data helps you develop faster, with less ramp-up time.

Using an Eclipse-based IDE, you can design and build big data integration jobs in hours, rather than days or weeks. By dragging graphical components from a palette onto a central workspace, arranging components, and configuring their properties, you can quickly and easily engineer Apache Hive processes.

Getting Started with Apache Hive

Apache Hive is a powerful companion to Hadoop, making your processes easier and more efficient. Seamless integration is the key to making the most of what Apache Hive has to offer.

If you haven’t started with Hive yet, be prepared for a smooth integration. If you have, don’t worry: it’s not too late to get set up for better operations and greater efficiency. Simply go to the Talend Downloads page for a free trial of the Talend Open Studio for Big Data solution.

Ready to get started with Talend?