Hadoop is an open source, Java based framework used for storing and processing big data. The data is stored on inexpensive commodity servers that run as clusters. Its distributed file system enables concurrent processing and fault tolerance. Developed by Doug Cutting and Michael J. Cafarella, Hadoop uses the MapReduce programming model for faster storage and retrieval of data from its nodes. The framework is managed by Apache Software Foundation and is licensed under the Apache License 2.0.
For years, while the processing power of application servers has been increasing manifold, databases have lagged behind due to their limited capacity and speed. However, today, as many applications are generating big data to be processed, Hadoop plays a significant role in providing a much-needed makeover to the database world.
Hadoop and Data Lakes now.
From a business point of view, too, there are direct and indirect benefits. By using open-source technology on inexpensive servers that are mostly in the cloud (and sometimes on-premises), organizations achieve significant cost savings.
Additionally, the ability to collect massive data, and the insights derived from crunching this data, results in better business decisions in the real-world—such as the ability to focus on the right consumer segment, weed out or fix erroneous processes, optimize floor operations, provide relevant search results, perform predictive analytics, and so on.
How Hadoop Improves on Traditional Databases
Hadoop solves two key challenges with traditional databases:
1. Capacity: Hadoop stores large volumes of data.
By using a distributed file system called an HDFS (Hadoop Distributed File System), the data is split into chunks and saved across clusters of commodity servers. As these commodity servers are built with simple hardware configurations, these are economical and easily scalable as the data grows.
2. Speed: Hadoop stores and retrieves data faster.
Hadoop uses the MapReduce functional programming model to perform parallel processing across data sets. So, when a query is sent to the database, instead of handling data sequentially, tasks are split and concurrently run across distributed servers. Finally, the output of all tasks is collated and sent back to the application, drastically improving the processing speed.
5 Benefits of Hadoop for Big Data
For big data and analytics, Hadoop is a life saver. Data gathered about people, processes, objects, tools, etc. is useful only when meaningful patterns emerge that, in-turn, result in better decisions. Hadoop helps overcome the challenge of the vastness of big data:
- Resilience — Data stored in any node is also replicated in other nodes of the cluster. This ensures fault tolerance. If one node goes down, there is always a backup of the data available in the cluster.
- Scalability — Unlike traditional systems that have a limitation on data storage, Hadoop is scalable because it operates in a distributed environment. As the need arises, the setup can be easily expanded to include more servers that can store up to multiple petabytes of data.
- Low cost — As Hadoop is an open-source framework, with no license to be procured, the costs are significantly lower compared to relational database systems. The use of inexpensive commodity hardware also works in its favor to keep the solution economical.
- Speed — Hadoop’s distributed file system, concurrent processing, and the MapReduce model enable running complex queries in a matter of seconds.
- Data diversity — HDFS has the capability to store different data formats such as unstructured (e.g. videos), semi-structured (e.g. XML files), and structured. While storing data, it is not required to validate against a predefined schema. Rather, the data can be dumped in any format. Later, when retrieved, data is parsed and fitted into any schema as needed. This gives the flexibility to derive different insights using the same data.
O’Reilly Report: Moving Hadoop to the Cloud now.
The Hadoop Ecosystem: Core Components
Hadoop is not just one application, rather it is a platform with various integral components that enable distributed data storage and processing. These components together form the Hadoop ecosystem.
Some of these are core components, which form the foundation of the framework, while some are supplementary components that bring add-on functionalities into the Hadoop world.
The core components of Hadoop are:
HDFS: Maintaining the Distributed File System
HDFS is the pillar of Hadoop that maintains the distributed file system. It makes it possible to store and replicate data across multiple servers.
HDFS has a NameNode and DataNode. DataNodes are the commodity servers where the data is actually stored. The NameNode, on the other hand, contains metadata with information on the data stored in the different nodes. The application only interacts with the NameNode, which communicates with data nodes as required.
YARN: Yet Another Resource Negotiator
YARN stands for Yet Another Resource Negotiator. It manages and schedules the resources, and decides what should happen in each data node. The central master node that manages all processing requests is called the Resource Manager. The Resource Manager interacts with Node Managers; every slave datanode has its own Node Manager to execute tasks.
MapReduce is a programming model that was first used by Google for indexing its search operations. It is the logic used to split data into smaller sets. It works on the basis of two functions — Map() and Reduce() — that parse the data in a quick and efficient manner.
First, the Map function groups, filters, and sorts multiple data sets in parallel to produce tuples (key, value pairs). Then, the Reduce function aggregates the data from these tuples to produce the desired output.
The Hadoop Ecosystem: Supplementary Components
The following are a few supplementary components that are extensively used in the Hadoop ecosystem.
Hive: Data Warehousing
Hive is a data warehousing system that helps to query large datasets in the HDFS. Before Hive, developers were faced with the challenge of creating complex MapReduce jobs to query the Hadoop data. Hive uses HQL (Hive Query Language), which resembles the syntax of SQL. Since most developers come from a SQL background, Hive is easier to get on-board.
The advantage of Hive is that a JDBC/ODBC driver acts as an interface between the application and the HDFS. It exposes the Hadoop file system as tables, converts HQL into MapReduce jobs, and vice-versa. So while the developers and database administrators gain the benefit of batch processing large datasets, they can use simple, familiar queries to achieve that. Originally developed by the Facebook team, Hive is now an open source technology.
Pig: Reduce MapReduce Functions
Pig, initially developed by Yahoo!, is similar to Hive in that it eliminates the need to create MapReduce functions to query the HDFS. Similar to HQL, the language used — here, called “Pig Latin” — is closer to SQL. “Pig Latin” is a high-level data flow language layer on top of MapReduce.
Pig also has a runtime environment that interfaces with HDFS. Scripts in languages such as Java or Python can also be embedded inside Pig.
Hive Versus Pig
Although Pig and Hive have similar functions, one can be more effective than the other in different scenarios.
Pig is useful in the data preparation stage, as it can perform complex joins and queries easily. It also works well with different data formats, including semi-structured and unstructured. Pig Latin is closer to SQL but also varies from SQL enough for it to have a learning curve.
Hive, however, works well with structured data and is therefore more effective during data warehousing. It’s used in the server side of the cluster.
Researchers and programmers tend to use Pig on the client side of a cluster, whereas business intelligence users such as data analysts find Hive as the right fit.
Flume: Big Data Ingestion
Flume is a big data ingestion tool that acts as a courier service between multiple data sources and the HDFS. It collects, aggregates, and sends huge amounts of streaming data (e.g. log files, events) generated by applications such as social media sites, IoT apps, and ecommerce portals into the HDFS.
Flume is feature-rich, it:
- Has a distributed architecture.
- Ensures reliable data transfer.
- Is fault-tolerant.
- Has the flexibility to collect data in batches or real-time.
- Can be scaled horizontally to handle more traffic, as needed.
Data sources communicate with Flume agents — every agent has a source, channel, and a sink. The source collects data from the sender, the channel temporarily stores the data, and finally, the sink transfers data to the destination, which is a Hadoop server.
Sqoop: Data Ingestion for Relational Databases
Sqoop (“SQL,” to Hadoop) is another data ingestion tool like Flume. While Flume works on unstructured or semi-structured data, Sqoop is used to export data from and import data into relational databases. As most enterprise data is stored in relational databases, Sqoop is used to import that data into Hadoop for analysts to examine.
Database admins and developers can use a simple command line interface to export and import data. Sqoop converts these commands to MapReduce format and sends them to the HDFS using YARN. Sqoop is also fault-tolerant and performs concurrent operations like Flume.
Zookeeper: Coordination of Distributed Applications
Zookeeper is a service that coordinates distributed applications. In the Hadoop framework, it acts as an admin tool with a centralized registry that has information about the cluster of distributed servers it manages. Some of its key functions are:
- Maintaining configuration information (shared state of configuration data)
- Naming service (assignment of name to each server)
- Synchronization service (handles deadlocks, race condition, and data inconsistency)
- Leader election (elects a leader among the servers through consensus)
The cluster of servers that the Zookeeper service runs on is called an “ensemble.” The ensemble elects a leader among the group, with the rest behaving as followers. All write-operations from clients need to be routed through the leader, whereas read operations can go directly to any server.
Zookeeper provides high reliability and resilience through fail-safe synchronization, atomicity, and serialization of messages.
Kafka: Faster Data Transfers
Kafka is a distributed publish-subscribe messaging system that is often used with Hadoop for faster data transfers. A Kafka cluster consists of a group of servers that act as an intermediary between producers and consumers.
In the context of big data, an example of a producer could be a sensor gathering temperature data to relay back to the server. Consumers are the Hadoop servers. The producers publish message on a topic and the consumers pull messages by listening to the topic.
A single topic can be split further into partitions. All messages with the same key arrive to a specific partition. A consumer can listen to one or more partitions.
By grouping messages under one key and getting a consumer to cater to specific partitions, many consumers can listen on the same topic at the same time. Thus, a topic is parallelized, increasing the throughput of the system. Kafka is widely adopted for its speed, scalability, and robust replication.
HBase: Non-Relational Database
HBase is a column-oriented, non-relational database that sits on top of HDFS. One of the challenges with HDFS is that it can only do batch processing. So for simple interactive queries, data still has to be processed in batches, leading to high latency.
HBase solves this challenge by allowing queries for single rows across huge tables with low latency. It achieves this by internally using hash tables. It is modelled along the lines of Google BigTable that helps access the Google File System (GFS).
HBase is scalable, has failure support when a node goes down, and is good with unstructured as well as semi-structured data. Hence, it is ideal for querying big data stores for analytical purposes.
O’Reilly Report: Moving Hadoop to the Cloud now.
Challenges of Hadoop
Though Hadoop has widely been seen as a key enabler of big data, there are still some challenges to consider. These challenges stem from the nature of its complex ecosystem and the need for advanced technical knowledge to perform Hadoop functions. However, with the right integration platform and tools, the complexity is reduced significantly and hence, makes working with it easier as well.
1. Steep Learning Curve
To query the Hadoop file system, programmers have to write MapReduce functions in Java. This is not straightforward, and involves a steep learning curve. Also, there are too many components that make up the ecosystem, and it takes time to get familiar with them.
2. Different Datasets Require Different Approaches
There is no ‘one size fits all’ solution in Hadoop. Most of the supplementary components discussed above have been built in response to a gap that needed to be addressed.
For example, Hive and Pig provide a simpler way to query the data sets. Additionally, data ingestion tools such as Flume and Sqoop help gather data from multiple sources. There are numerous other components as well and it takes experience to make the right choice.
3. Limitations of MapReduce
MapReduce is an excellent programming model to batch process big data sets. However, it has its limitations.
Its file-intensive approach, with multiple reads and writes, isn’t well-suited for real-time, interactive data analytics or iterative tasks. For such operations, MapReduce isn’t efficient enough, and leads to high latencies. (There are workarounds to this problem. Apache is an alternative that is filling the gap of MapReduce.)
4. Data Security
As big data gets moved to the cloud, sensitive data is dumped into Hadoop servers, creating the need to ensure data security. The vast ecosystem has so many tools that it’s important to ensure that each tool has the correct access rights to the data. There needs to be appropriate authentication, provisioning, data encryption, and frequent auditing. Hadoop has the capability to address this challenge, but it’s a matter of having the expertise and being meticulous in execution.
Although many tech giants have been using the components of Hadoop discussed here, it is still relatively new in the industry. Most challenges stem from this nascence, but a robust big data integration platform can solve or ease all of them.
Hadoop vs Apache Spark
The MapReduce model, despite its many advantages, is not efficient for interactive queries and real-time data processing, as it relies on disk writes between each stage of processing.
Spark is a data processing engine that solves this challenge by using in-memory data storage. Although it started as a sub-project of Hadoop, it has its own cluster technology.
Often, Spark is used on top of HDFS to leverage just the storage aspect of Hadoop. For the processing algorithm, it uses its own libraries that support SQL queries, streaming, machine learning, and graphs.
Data scientists use Spark extensively for its lightning speed and elegant, feature-rich APIs that make working with large data sets easy.
While Spark may seem to have an edge over Hadoop, both can work in tandem. Depending on the requirement and the type of data sets, Hadoop and Spark complement each other. Spark does not have a file system of its own, so it has to depend on HDFS, or other such solutions, for its storage.
The real comparison is actually between the processing logic of Spark and the MapReduce model. When RAM is a constraint, and for overnight jobs, MapReduce is a good fit. However, to stream data, access machine learning libraries, and for quick real-time operations, Spark is the ideal choice.
A Future with Many Possibilities
In just a decade, Hadoop has made its presence felt in a big way in the computing industry. This is because it has finally made the possibility of data analytics real. From analyzing site visits to fraud detection to banking applications, its applications are diverse.
With Talend Open Studio for Big Data it’s easy to integrate your Hadoop setup into any data architecture. Talend provides more built-in data connectors than any other data management solution, enabling you to build seamless data flows between Hadoop and any major file format (CSV, XML, Excel, etc.), database system (Oracle, SQL Server, MySQL, etc.), packaged enterprise application (SAP, SugarCRM, etc.), and even cloud data services like Salesforce and Force.com.