Big data represents a significant paradigm shift in enterprise technology and stands to transform much of what the modern enterprise is today. Firms capture trillions of bytes of information about their customers, suppliers, and company operations, and millions of networked sensors embedded in devices such as mobile phones, energy meters and automobiles are sensing, creating, and communicating data. There is an increasing desire to collect call detail records, web logs, data from sensor networks, financial transactions, social media and Internet text, and then analyze them alongside existing data sources. By collecting and analyzing all this information, companies gain insight into new business opportunities and threats.
Obstacles: Big Data Integration Challenges
A big data initiative faces challenges in three areas: technology, people, and quality processes.
- Technology. A successful big data initiative requires acquiring, integrating and managing several big data technologies such as Hadoop, MapReduce, NoSQL databases, Pig, Sqoop, Hive, Oozie and others. Integrating large sets of diverse structured and unstructured data can become an expensive custom-coding nightmare that is also difficult to maintain and manage. Conventional data management tools fail when trying to integrate, search and analyze big datasets, which (for now) range from terabytes to multiple petabytes of information.
- People. As with any new technology, staff need to be trained in big data technologies to learn proper skills and best practices. A recent Talend survey, “How Big is Big Data Adoption”, found that the two biggest big data implementation challenges are finding in-house expertise and allocating sufficient budget, time and resources.
- Quality Processes. The survey also found that many big data projects lack an explicit project management structure, data governance, and the big data quality procedures needed when processing unstructured sets of data.
Solution: Talend Big Data
Talend’s open source approach and flexible integration platform for big data enable users to easily connect and analyze data from disparate systems to help drive and improve business performance. Talend’s big data capabilities integrate with today’s big data market leaders such as Cassandra, Cloudera, Hortonworks, Google, Greenplum, MapR, MongoDB, Teradata and Vertica, positioning Talend as a leader in the management of big data.
Big Data Integration
Landing big data (large volumes of log files, data from operational systems, social media, sensors, or other sources) into a big data platform such as Apache Hadoop, Google Cloud Platform, Netezza, Teradata or Vertica is a cinch with the breadth of big data components provided by Talend. A full set of Talend data integration components (application, database, service and even a master data hub) is available, so that data movement can be orchestrated from any source into almost any target.
NoSQL connectivity to MongoDB and Cassandra is simplified through pre-built graphical connector components.
Big Data Quality
Talend provides data quality functions that take advantage of the massively parallel environment of Hadoop, allowing you to understand the completeness, accuracy and integrity of data as well as to remove duplicates. Hadoop data profiling allows you to collect information and statistics about big data in order to assess its quality, its suitability for repurposing, and its metadata. Additional functions include standardization, parsing, enrichment, matching, survivorship and monitoring of ongoing data quality.
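As a rough illustration of what completeness profiling and duplicate removal mean in practice, here is a minimal single-machine sketch in plain Python. It is not Talend's API and does not run on Hadoop; the function names `profile` and `deduplicate` and the sample records are hypothetical, chosen only for this example.

```python
from collections import Counter

def profile(records, fields):
    """Return row count, per-field completeness, and the number of duplicate rows.

    A value counts as "filled" when it is neither None nor an empty string.
    Two records are duplicates when they agree on every listed field.
    """
    total = len(records)
    completeness = {}
    for field in fields:
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        completeness[field] = filled / total if total else 0.0
    keys = Counter(tuple(r.get(f) for f in fields) for r in records)
    duplicates = sum(count - 1 for count in keys.values() if count > 1)
    return {"rows": total, "completeness": completeness, "duplicates": duplicates}

def deduplicate(records, fields):
    """Keep the first occurrence of each record, comparing on the listed fields."""
    seen = set()
    unique = []
    for r in records:
        key = tuple(r.get(f) for f in fields)
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

# Hypothetical sample data: one exact duplicate and two missing e-mail values.
SAMPLE = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 1, "email": "a@example.com"},
    {"id": 3, "email": None},
]

if __name__ == "__main__":
    print(profile(SAMPLE, ["id", "email"]))
    print(deduplicate(SAMPLE, ["id", "email"]))
```

In a Hadoop setting the same counts would be computed in parallel across the cluster; the sketch only shows the kind of statistics such a profiling step produces.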
Big Data Manipulation
Talend supports Apache Pig and HBase so you can perform basic transformations and analysis on massive amounts of data in little time. With Pig scripts you can compare, filter, evaluate and group data within an HDFS cluster. Google BigQuery support allows you to interactively analyze very large datasets. Talend speeds development and collaboration by providing a set of components that allow these scripts to be defined in a graphical environment and as part of a data flow.
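The filter-and-group pattern mentioned above can be sketched in a few lines of plain Python. This is an in-memory illustration of the idea only, not Pig Latin and not a Talend component; the helper `filter_and_group` and the sample web-log rows are hypothetical.

```python
from collections import defaultdict

def filter_and_group(rows, predicate, key, value):
    """Keep rows that satisfy `predicate`, then sum `value` for each distinct `key`."""
    totals = defaultdict(int)
    for row in rows:
        if predicate(row):
            totals[row[key]] += row[value]
    return dict(totals)

# Hypothetical web-log rows used only for this example.
LOG = [
    {"page": "/home", "status": 200, "bytes": 512},
    {"page": "/home", "status": 500, "bytes": 128},
    {"page": "/cart", "status": 200, "bytes": 256},
    {"page": "/home", "status": 200, "bytes": 488},
]

def bytes_per_page(rows):
    """Total bytes served per page, counting successful (status 200) requests only."""
    return filter_and_group(rows, lambda r: r["status"] == 200, "page", "bytes")

if __name__ == "__main__":
    print(bytes_per_page(LOG))
```

On a cluster, the filter would run where the data lives and the grouping would be distributed by key; the sketch keeps everything in one process so the logic is easy to follow.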
Big Data Project Governance and Administration
Governance of a big data project is very similar to that of any integration project; however, big data projects sometimes lack the necessary management functions. Talend presents a simple, intuitive environment to implement and deploy a big data program with the ability to schedule, monitor and deploy any big data job. Also included is a common repository so developers can collaborate and share project metadata and artifacts.
Talend provides the following products to streamline big data tasks.
Talend Platform for Big Data
Talend Platform for Big Data is a powerful and versatile big data integration and data quality solution that simplifies the loading, extraction and processing of large and diverse data sets so you can make more informed and timely decisions.
Unlike other solutions where you need to integrate several products to build a solution, Talend’s products improve your productivity through a unified platform: a common code repository and tooling for scheduling, metadata management, data processing and service enablement.