Getting Started with Big Data


Big data is here to stay

After social media, the Internet of Things is the next big driving force behind the increase in data worldwide, which is doubling in size every two years. [1] At the same time, data processing speeds and capabilities are becoming increasingly important because, much like food, data loses relevance after a certain date. Additionally, the increasing variety of structured, unstructured and semi-structured data (such as pictures, text, videos and sound) is now becoming easier to capture and analyze. The three main factors defining big data are volume, velocity and variety. [2] Companies that are in command of these three dimensions of data can derive great value from them and will be more successful than their less digitally adept counterparts. [3]

Digital leadership today is best demonstrated by companies such as Amazon, Netflix and Uber, which have successfully implemented a comprehensive data acquisition strategy that helps differentiate them from competitors. Having the right big data technology platform also needs to be part of that strategy. According to O’Reilly’s report on “The Big Data Market 2016” [4], “larger enterprises (those with more than 5,000 employees) are adopting big data technologies [such as Hadoop and Spark] much faster than smaller companies.”

However, there are lots of opportunities for smaller organizations to become ‘digital leaders’ using today’s modern data platforms. According to Cloudera, maturing Hadoop ecosystems can not only help achieve cost savings, but can also open up new business opportunities by making it possible to use data more strategically. The main applications of big data are better understanding customers, improving products and services, achieving more effective processes, and reducing risks with improved quality assurance and better problem detection. [5]

How to get off to a good start

At the outset, it can seem difficult to get started with big data projects. In our day-to-day work we see many medium-sized companies (in the German-speaking market) that are thinking about big data technologies in principle, but not managing to get specific projects off the ground. So what’s really the best way to get started?

In our experience, the most successful approach is usually to start small, with a clearly defined project plan that is relevant to your business. Many of our customers are currently facing the challenge of having to connect to novel data sources and store ever larger quantities of data. This is often machine data, and occasionally social media data. In principle, it would be possible to accommodate this data in a relational database, or possibly in an existing data warehouse. But this is usually expensive in the context of substantial projects, so it is well worth considering alternatives. To put it simply, it just doesn’t feel quite right.

Typically, smaller projects are an excellent way to gain experience with big data technologies. A relatively small, manageable and isolated project can often provide a low-risk way to get started with a new technology. It doesn’t matter whether all three Vs (volume, velocity, variety) are completely fulfilled. But it is important to find a controllable, appropriate and relevant big data use case with measurable success factors, so that it can be transferred quickly from a pilot to a production environment. [6]

Even if you could address a big data project with the tried and tested technologies you currently own, you should not miss the chance to get started with big data technologies. If you haven’t already experimented with new technologies on a more controllable scale, you might otherwise find yourself unable to deal with the complexity of a larger project.

The architecture of a big data project is usually quite manageable, as the technologies are now mature and much more accessible. A Hadoop distribution is used for data storage. First, data has to be collected at the source, potentially transformed (although for big data it is advisable to store raw data without transforming it) and then loaded into Hadoop. The Talend Big Data platform provides everything you need to implement such a link based on a model, generating high-performance native code that helps your team get up and running quickly with Apache Hadoop, Apache Spark, Spark Streaming and NoSQL technologies.
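To make the "collect, keep raw, then load" idea more concrete, here is a minimal sketch in Python. It is an illustration only and not how Talend or Hadoop implement loading: a local directory stands in for the Hadoop file system, and the function name land_raw, the dt=YYYY-MM-DD folder layout and the file name part-0000.log are assumptions made for this example.

```python
import tempfile
from datetime import date
from pathlib import Path


def land_raw(lines, landing_root, source, day):
    """Append raw lines, unchanged, to a date-partitioned landing file.

    Hypothetical helper: a local directory stands in for Hadoop storage.
    """
    target_dir = Path(landing_root) / source / f"dt={day.isoformat()}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "part-0000.log"
    with target.open("a", encoding="utf-8") as out:
        for line in lines:
            # Keep the content as received; only normalize the line ending.
            out.write(line.rstrip("\n") + "\n")
    return target


if __name__ == "__main__":
    # Example machine data in a made-up "timestamp;machine;temperature" format.
    sample = ["2016-05-01T10:00;m1;20.0", "2016-05-01T10:01;m1;22.0"]
    with tempfile.TemporaryDirectory() as root:
        path = land_raw(sample, root, "machines", date(2016, 5, 1))
        print(path.read_text(encoding="utf-8"))
```

Storing the lines as they arrive, partitioned by source and date, keeps the pilot simple and leaves every later transformation reversible, because the raw data is still there.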

In the end, the data is usually evaluated, either directly at the raw data level or via a detour to a data mart with preprocessed data. The data mart can in turn be filled with Talend. Evaluations can then be carried out with suitable tools already in use, although this is also a good opportunity to introduce new tools, typically from the fields of data visualization, discovery and advanced analytics.
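The detour via a data mart can be pictured with a small sketch, again in Python and again under stated assumptions: the raw lines use a made-up "timestamp;machine;temperature" format, an in-memory SQLite database stands in for the data mart, and the names build_mart and machine_summary are hypothetical. In a real project this step would typically be modeled in an integration tool rather than hand-written.

```python
import sqlite3
from collections import defaultdict


def build_mart(raw_lines):
    """Aggregate raw readings per machine into a small summary table.

    Hypothetical helper: an in-memory SQLite database plays the data mart.
    """
    totals = defaultdict(lambda: [0, 0.0])  # machine -> [count, sum]
    for line in raw_lines:
        parts = line.strip().split(";")
        if len(parts) != 3:
            continue  # skip malformed lines; the raw data itself is untouched
        _, machine, temperature = parts
        try:
            value = float(temperature)
        except ValueError:
            continue  # skip readings that are not numeric
        totals[machine][0] += 1
        totals[machine][1] += value

    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE machine_summary ("
        "machine_id TEXT PRIMARY KEY, readings INTEGER, avg_temp REAL)"
    )
    conn.executemany(
        "INSERT INTO machine_summary VALUES (?, ?, ?)",
        [(m, n, s / n) for m, (n, s) in totals.items()],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    raw = ["2016-05-01T10:00;m1;20.0", "2016-05-01T10:01;m1;22.0"]
    mart = build_mart(raw)
    print(mart.execute("SELECT * FROM machine_summary").fetchall())
```

Reporting tools then query the small summary table instead of scanning the raw data each time, which is the main reason for preprocessing into a data mart.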


Start, grow and create opportunities

Big data and traditional data warehouses are growing closer together. Theoretically, a complete data warehouse can be modernized with the help of big data technologies such as Hadoop, something that frequently leads to significant cost savings while simultaneously opening up new opportunities. But it is also possible for the world of big data to merge with traditional data warehouses at a more leisurely pace. Once the big data infrastructure is there, it is simple to link Hadoop with the data warehouse, potentially in both directions. The data warehouse can serve as a source of data that is stored in Hadoop. By the same token, data from Hadoop can be read, transformed and finally stored in the data warehouse. The two worlds don’t have to be isolated from one another; instead, they merge and, in the end, you have a data warehouse based on big data technologies.

Great journeys always begin with a first small step. Big data technologies are more mature and accessible now, but, as so often in life, you can only progress after you get started on something specific. That is why we recommend you actively look for such projects. Once you have set out on your journey and have investigated the technologies and set up the infrastructure, you will quickly benefit from the new opportunities it presents. All this means you can continue to stay competitive in this age of data-based companies.









About the author Dr. Gero Presser

Dr. Gero Presser is a co-founder and managing partner of Quinscape GmbH in Dortmund. Quinscape has positioned itself on the German market as a leading system integrator for the Talend, Jaspersoft/Spotfire, Kony and Intrexx platforms and, with its 100 members of staff, takes care of renowned customers including SMEs, large corporations and the public sector.

Gero Presser earned his doctorate in decision-making theory in the field of artificial intelligence. At Quinscape he is responsible for building up the Business Intelligence business area, with a focus on analytics and integration.
