Big Data: Why You Must Consider Open Source
Guest Blog by Bernard Marr, Founder and CEO of The Advanced Performance Institute
A quiet revolution has been taking place in the technology world in recent years. The popularity of open source software has soared as more and more businesses have realized the value of moving away from walled-in, proprietary technologies of old.
And it’s no coincidence that this transformation has taken place in parallel with the explosion of interest in big data and analytics. The modular, fluid and constantly evolving nature of open source is in sync with the needs of cutting-edge analytics projects for faster, more flexible and, vitally, more secure systems and platforms with which to implement them.
Open Source and Big Data
So what exactly is open source, and what is it that makes it such a good fit for big data projects? Well, like big data, open source is really nothing new – it’s a concept which has existed since the early days of computing. However, it’s only more recently, with the huge growth in the number of people and the amount of data online, that its full potential is starting to be explored.
The lazy description of open source is often that it is “free” software. Certainly that’s how you will hear the more popular open source consumer and business products (such as the Microsoft Office alternative LibreOffice, or the web browser Firefox) described. But there’s much more to it than that. Generally, truly open source products are distributed under one of many different open source licenses, such as the GNU General Public License or the Apache License. As well as granting the user the right to freely download and use the software, these licenses allow it to be modified and redistributed. Software developers can even take useful parts from one open source project to use in their own products – which, depending on the license, could either be open source themselves or proprietary. The exact conditions vary: permissive licenses such as Apache chiefly require developers to acknowledge where open source material has been used and to include the relevant licensing documentation in their distribution, while “copyleft” licenses such as the GPL also require derived works to be distributed under the same terms.
Advantages of Open Source
Open source development has many advantages over its alternative – proprietary development. Because anyone can contribute to the projects, the most popular have huge teams of enthusiastic volunteers constantly working to refine and improve the end product.
In fact, Justin Kestelyn, senior director of technical evangelism and developer relations at leading open source vendor Cloudera, tells me that proprietary solutions are no longer the default choice for data management platforms.
He says, “Emerging data management platforms are just never proprietary any more. Most customers would simply see them as too risky for new applications.
“There are multiple – and at this point in history, thoroughly validated – business benefits to using open source software.”
Among those benefits, he says, are the absence of license fees, which lets customers evaluate and test products and technologies at no expense; the enthusiasm of the global development community; the appeal to developers of working in an open source environment; and the freedom from “lock-in”.
This last benefit comes with a caveat, Kestelyn explains: “Be careful, though, of open source software that leaves you on an architectural island, with commercial support only available from a single vendor. This can make the principle moot.”
The literal meaning of open source is that the raw source code behind the project is available for anyone to inspect, scrutinize and improve. This brings big security benefits – flaws which could lead to the loss of valuable or personal data are more likely to be spotted when hundreds or thousands of people are examining the code in its raw form. In contrast, in the world of proprietary development, only the handful of people whose job it is to write and then test the code will ever see the exact nature of the nuts and bolts holding it all together.
It also makes it far more difficult for software developers to hide or obfuscate exactly what their programs are doing while they are running on a user’s computer. Consumers are growing ever more aware of the importance of knowing what their computers are doing with their personal data “behind the scenes”. This was illustrated by the recent outcry over what many saw as excessive snooping built into the latest upgrade to Microsoft’s Windows. Increasingly, customers are aware that running open source gives them the confidence of knowing that their software has been heavily scrutinized by a large community of non-affiliated developers. Anything that the software attempts to do with data which could be seen as unethical or deceptive is far more likely to be spotted, and is unlikely to be tolerated by the open source community. Even if the open source software you are using is not one of the larger, widely reviewed packages, in theory you can still examine the source code yourself to find out exactly what it does (or pay an independent expert to audit it for you).
Who Uses Open Source?
Don’t make the mistake of thinking that because it is free, open source software is amateur software. As well as the armies of volunteers who work on the projects in their spare time, large numbers of employed professionals are paid to do so, too. Tech giants such as IBM, Microsoft and Google are now some of the keenest contributors, in terms of man-hours, to the biggest open source projects such as Apache Hadoop and Spark.
Of the involvement of these “internet scale” businesses in open source, Ciaran Dynes, vice president of products at vendor Talend, says, “What’s interesting is that their business models are not dependent on ‘owning’ the software. The open sourcing of the software is a by-product of their need to innovate to address a market gap they’ve identified – for example Google Search.
“Open sourcing is a part of their branding and being recognized as a good company to join. This is quite different from vendors, such as Talend or Red Hat, where the use of open source has been to seed the market with our technology to upset the status quo of proprietary vendors.”
Many popular big data-related open source projects actually started out as in-house initiatives at tech companies – for example, the Presto query engine, which was developed at Facebook before being released into the wild and adopted by, among others, Netflix and Airbnb to handle back-end analytics tasks.
Open source can often be more flexible than proprietary software, too. Because the code has been pored over and optimized by thousands of contributors, it is frequently highly efficient, and so less demanding on computing resources and power than proprietary software which does the same job. This means there is less need to constantly update hardware and operating systems in order to make sure you can run your software.
The Internet is built on open source – and at the same time, it enabled open source to begin to reach its potential by bringing together programmers from around the world and enabling them to collaborate with each other. An entire industry has sprung up around some of the most popular open source products – in the case of big data, that would include Hadoop and Spark – aimed at helping businesses get the most from them. These businesses typically produce enterprise distributions of open source products which, for a fee, come adapted for specific markets, or with packaged consulting services to help their customers get the most from them.
All this means that it is easier than ever to get involved with open source, and in many cases it is becoming the mainstream rather than the alternative choice. Last year a survey of 13,000 professionals across all industries found that 78% relied on open source technology to run their companies. This represents a 100% increase since 2010. As Cloudera’s Justin Kestelyn puts it, “We are quite literally living in a Golden Age, right now.” I couldn’t agree more.