When they hear Hadoop, many people think HDFS – the Hadoop Distributed File System – and how it enables efficient, safe storage of all that big data that would be too expensive to keep in a data warehouse. And since those tape libraries with robotic loaders are long gone… has Hadoop become the perfect data dump?
Of course, Hadoop is a great place to dump (or rather, store) data. But not because it’s cheaper than a data warehouse, or more modern than a tape library. It’s a great place to store data because that’s where the data gets processed.
The other part of Hadoop (I know, there are more than two parts, I am oversimplifying here) is the processing framework. Today, in Hadoop 1.0, it’s called MapReduce. Hadoop 2.0 brings (or rather, will bring) YARN to the table, a much more powerful, versatile and real-time framework. On top of these frameworks sit scripting and processing utilities such as Hive and Pig, as well as vendor-built extensions that enable, for example, SQL on Hadoop (Impala, Stinger, Drill, HAWQ, to name only a few).
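To make the MapReduce model concrete, here is a minimal, single-process sketch of the classic word-count job. Real Hadoop jobs implement Mapper and Reducer classes in Java and run distributed across the cluster, but the data flow – map emits key/value pairs, reduce aggregates per key – is the same:

```python
from collections import defaultdict

def map_phase(documents):
    """Mapper analogue: emit a (word, 1) pair for every word."""
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Reducer analogue: sum the counts for each key."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

# Toy input standing in for files on HDFS
docs = ["Hadoop stores data", "Hadoop processes data"]
result = reduce_phase(map_phase(docs))
print(result)  # e.g. {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

The point of the real framework is that the map and reduce phases run in parallel on the nodes that already hold the data blocks, which is exactly why storage and processing belong together.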
If you are only using Hadoop as a data dump, you are missing half the value. No – you are missing 90% of the value.