Hadoop Is Not Just a Data Dump

July 16, 2013 --

When they hear Hadoop, many people think HDFS – the Hadoop Distributed File System – and how it enables efficient, safe storage of all that big data that would be too expensive to keep in a data warehouse. And since tape libraries with robotic loaders are long gone, has Hadoop become the perfect data dump?

Of course, Hadoop is a great place to store data. But not because it’s cheaper than a data warehouse, or more modern than a tape library. It’s a great place to store data because that’s where the data gets processed.

The other part of Hadoop (I know, there are more than two parts; I am oversimplifying here) is the processing framework. Today, in Hadoop 1.0, it’s called MapReduce. Hadoop 2.0 brings (or rather, will bring) YARN to the table, a much more powerful and versatile framework that opens the door to interactive and near-real-time workloads. On top of these frameworks sit scripting and processing tools such as Hive and Pig, as well as vendor-built extensions that enable, for example, SQL on Hadoop (Impala, Stinger, Drill, and HAWQ, to name only a few).
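To make the processing side concrete, the MapReduce model behind Hadoop 1.0 can be sketched in plain Python. This is a simplified, in-memory word count, not Hadoop code: real jobs are written against the Hadoop Java API or run as streaming scripts, and the shuffle is done by the framework across the cluster, not by a local sort.

```python
from itertools import groupby
from operator import itemgetter

# Map phase: turn each input line into (word, 1) pairs.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

# Shuffle phase: the framework sorts and groups pairs by key
# (here simulated with an in-memory sort).
def shuffle_phase(pairs):
    for key, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (key, [count for _, count in group])

# Reduce phase: sum the counts for each key into a final tally.
def reduce_phase(grouped):
    for word, counts in grouped:
        yield (word, sum(counts))

if __name__ == "__main__":
    lines = ["big data big storage", "big processing"]
    result = dict(reduce_phase(shuffle_phase(map_phase(lines))))
    print(result)  # {'big': 3, 'data': 1, 'processing': 1, 'storage': 1}
```

The point of the model is that the map and reduce functions run where the data blocks already live, which is exactly why storing data in Hadoop is about more than storage.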

If you are only using Hadoop as a data dump, you are missing half the value. No – you are missing 90% of the value.




- by Brandwein Matt on July 19, 2013
Great post, Yves. Hadoop is definitely moving beyond batch to deliver multiple workloads and applications atop a single strategic data platform. This has been Cloudera's vision for some time now (see http://bit.ly/12PB94u and http://bit.ly/1dKkFvY). Interactive SQL is one example. Impala is generally available today (Apache-licensed open source and free) so anyone can try it out and compare to the alternatives. Seeing is believing. www.cloudera.com/impala