Talend, a global open source software leader, today identified five key areas companies should monitor to avoid pitfalls that can derail big data projects and ensure those projects generate value as they move past the early pilot stages. As a sponsor of Hadoop Summit 2013, Talend will be available at the event (Booth #46) to speak about these and other tips for successful enterprise big data deployments.
Most organizations are still in the early stages of big data adoption, and few have thought beyond the technology angle to how big data will profoundly impact their processes and their information architecture. Whether big data projects are past the pilot stage and being deployed in production, or still on the horizon, they require strategic thinking and adequate planning to avoid some now-typical pitfalls that tend to get in the way of success.
Here are five key points from the experts at Talend to help guide your strategy:
- Forget volume (or rather, don’t focus on it). Big data is large – and small. It’s extremely diverse in origin, in style, in consistency and in quality. Some organizations in certain industries are dealing with massive data volumes, while others have much smaller data sets to exploit, but might have a broader variety of sources and formats. Make sure you go after the “right” data: identify all the sources that are relevant, and don’t be embarrassed if you don’t need to scale your data computing cluster to hundreds of nodes right away!
- Don’t leave data behind – be comprehensive. Some of the data you need for your big data projects is clearly identified, such as transactional data used or generated by business applications. However, far more data is hidden in log files, manufacturing systems, desktops or various servers; this is what we call “Dark Data.” Some of it is even going to waste in the exhaust fumes of IT. This “Exhaust Data” from sensors and logs is purged after a certain amount of time, or never stored in the first place. All of it is potentially relevant. Don’t restrict your project to the first category: inventory Dark Data, and deploy collection mechanisms for Exhaust Data, so that they become value contributors as well.
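A collection mechanism for Exhaust Data can start very small: parse raw log lines into structured records before they are purged. The sketch below is purely illustrative, assuming a hypothetical timestamp/level/message log format; it is not a Talend API.

```python
import re

# Hypothetical log format: "YYYY-MM-DD HH:MM:SS LEVEL message..."
LOG_PATTERN = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>\w+) (?P<message>.*)"
)

def collect_exhaust(lines):
    """Turn raw log lines into structured records; skip non-matching lines."""
    records = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            records.append(match.groupdict())
    return records

sample = [
    "2013-06-26 10:15:00 INFO sensor-7 reading=42",
    "malformed line",
    "2013-06-26 10:15:05 WARN sensor-7 reading=97",
]
print(collect_exhaust(sample))
```

Once log lines are structured records, they can be inventoried and stored alongside the clearly identified transactional data rather than discarded.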
- Don’t move everything – distribute data “logically.” Too many organizations looking to break down data silos bring all their data together in one central place. Hadoop is an excellent storage resource for large amounts of data (and is itself distributed across clusters), but you need to think “distribution” beyond Hadoop. It’s not always necessary to duplicate and replicate everything. Some data is already readily available in the enterprise data warehouse, with fast, random access. Some of it might be better off residing where it was produced. The “Logical Data Warehouse” concept applies well in the “non big data” world. Leverage it for big data.
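The Logical Data Warehouse idea can be pictured as a thin routing layer that sends each query to the store where the data already lives, instead of replicating everything into one central cluster. The class and dataset names below are illustrative assumptions, not a specific product API.

```python
# Minimal sketch: route queries to the source system instead of copying data.
class LogicalWarehouse:
    def __init__(self):
        self._sources = {}

    def register(self, dataset, source):
        # e.g. "orders" stays in the enterprise data warehouse,
        #      "clickstream" stays in the Hadoop cluster
        self._sources[dataset] = source

    def query(self, dataset, predicate):
        # Route the query to where the data resides; do not replicate it.
        source = self._sources[dataset]
        return [row for row in source if predicate(row)]

ldw = LogicalWarehouse()
ldw.register("orders", [{"id": 1, "amount": 250}, {"id": 2, "amount": 40}])
ldw.register("clickstream", [{"user": "a", "page": "/home"}])
big_orders = ldw.query("orders", lambda r: r["amount"] > 100)
```

In a real deployment the registered sources would be live connections to the warehouse, Hadoop, or an operational system, but the routing principle is the same.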
- It’s not only about storage – think processing platform. Hadoop is not only a receptacle for big data with its distributed file system, but also an engine that brings incredible potential to process data and extract meaningful information. A broad ecosystem of tools and programming paradigms exists that covers all use cases of data manipulation. From MapReduce to YARN, from Pig to HiveQL complemented by Impala, Stinger or Drill, or through the merging of Hadoop and SQL engines like HAWQ, there are processing resources available that make it unnecessary to get data out of the platform. All the resources are here, at your fingertips.
- Lastly, don’t treat big data as an isolated island. Sandboxes are fine for proofs of concept, but when big data projects go live, they need to be an integral part of the overall IT infrastructure and information architecture. You need to connect big data applications to other systems, upstream and downstream. Big data must also become part of your IT and information governance policy.
“Interest in and the roll-out of big data strategies have increased significantly, but many organizations are still stuck in the starting blocks,” said Bertrand Diard, co-founder and chief strategy officer, Talend. “Because of the novelty of the platforms and their applications, big data projects typically come under the spotlight and expectations are extremely high. Leading-edge technologies, such as the ones Talend provides, make it really easy to get started by lowering the adoption barrier. However, moving pilot projects into mainstream enterprise IT requires more than technology. By sharing these common pitfalls, we hope to help organizations learn from the experience of others and steer clear of obstacles on their big data journey.”