7 Emerging Open Source Big Data Projects that will Revolutionize Your Business
Open source software (OSS) just celebrated its 20th anniversary, and the community has not only plenty of milestones to celebrate but also a lot to look forward to! OSS continues to disrupt the status quo in groundbreaking ways, but it’s also becoming increasingly mainstream. Thus, if you’re an IT leader at an organization of any size, you should be thinking about and planning for how to incorporate OSS into your infrastructure. Among the hundreds of popular open source projects currently underway, here are seven we recommend you watch!
Twenty years ago, the "open source" label was coined, launching what would become the most significant trend in software development since that time. The Open Source Initiative, a non-profit organization that advocates for open source development and non-proprietary software, pegs the date of inception at February 3, 1998. Whether you want to call it "free software" or "open source", ultimately, it’s all about making application and system source code widely available and putting the software under a license that favors user autonomy.
According to Ovum, open source is already the default option across several big data categories ranging from storage, analytics and applications to machine learning. In the latest Black Duck Software and North Bridge's survey, 90% of respondents reported they rely on open source “for improved efficiency, innovation and interoperability,” most commonly because of “freedom from vendor lock-in; competitive features and technical capabilities; ability to customize; and overall quality.” There are now thousands of successful open source projects that companies must strategically choose from to stay competitive.
“When the first survey launched ten years ago, hardly anyone would have predicted that open source use would be ubiquitous worldwide just a decade later, but for many good reasons, that’s what has happened. Its value in reducing development costs, in freeing internal developers to work on higher-order tasks, and in accelerating time to market is undeniable. Simply put, open source is the way applications are developed today,” said Black Duck CEO Lou Shipley in a statement to the press. “The future of open source is full of possibilities.”
It doesn’t matter what your project is, or which technologies you’re working with, open source is your ticket to success. But that doesn’t mean that all open source projects are created equal, or that just any open source project will propel your company to the head of the pack.
While every company must develop its own strategy and choose the open source projects it feels will fuel its desired business outcomes, there are some projects that we feel are worth strong consideration.
How open source can be your path to business agility
Following are a few of the big data open source projects that have the greatest potential to give companies extreme agility and lightning-fast responses to customers, business needs and market challenges. If you’re an IT leader, we recommend you check out these projects, keep an eye on them and start considering the potential impact they may have on your IT infrastructure and overall business:
- Apache Beam is a unified programming model whose name combines the two main modes of big data processing, batch and streaming, because it is a single model for both cases: Beam = Batch + strEAM. Under the Beam model, you design a data pipeline once and choose among multiple processing frameworks later. Your data pipeline is portable and flexible, so you can run it as batch or streaming. You don’t need to redesign it every time you switch to a different processing engine or move between batch and streaming data. As a result, your team gains the agility to reuse data pipelines and choose the right processing engine for each use case.
- Apache Airflow is well suited to automated, smart scheduling of Beam pipelines to optimize processes and organize projects. Among other capabilities, pipelines are configured as code, which makes them dynamic, and metrics are visualized graphically for DAG (directed acyclic graph) and task instances. If a failure occurs, Airflow can rerun a DAG instance.
- Apache Cassandra is a scalable and nimble multi-master NoSQL database that supports replacing failed nodes without shutting anything down and automatically replicates data across multiple nodes. It differs from traditional RDBMSs, and from some other NoSQL databases, in that it has no master-slave structure: all nodes are peers, and the cluster is fault tolerant. This makes it extremely easy to scale out for more computing power without any application downtime. For example, your transactional applications can run in production at extreme scale, such as the volumes and speeds typical of Black Friday sales events, without the worry that they will go offline because a single node is down.
- Apache CarbonData is an indexed columnar data format for incredibly fast analytics on big data platforms such as Hadoop and Spark. This file format addresses the problem of serving different query patterns from the same data: OLAP-style queries, detailed queries, big scans, small scans and so on. With CarbonData, the data format is unified, so you can serve these queries from a single copy of the data and use only the computing power needed, which makes your queries run much faster.
- Apache Spark is one of the most widely used Apache projects and a popular choice for incredibly fast big data processing (cluster computing), with built-in capabilities for real-time data streaming, SQL, machine learning, and graph processing. Spark is optimized to run in-memory and enables interactive streaming analytics where, unlike batch processing, you can analyze vast amounts of historical data together with live data to make real-time decisions in use cases such as fraud detection, predictive analytics, sentiment analysis and next-best offer.
- TensorFlow is an extremely popular open source library for machine intelligence that enables far more advanced analytics at scale. TensorFlow is designed for large-scale distributed training and inference, but it is also flexible enough to support experimentation with new machine learning models and system-level optimizations. People love TensorFlow for a reason! Before TensorFlow, there was no single library that so deftly captured the breadth and depth of machine learning or held such huge potential. It is very readable and well documented, and its community is expected to continue to grow and become more vibrant.
- Docker and Kubernetes are, respectively, a container technology and an automated container management technology that speed up application deployments. Using technologies like containers makes your architecture extremely flexible and more portable. Your DevOps process will benefit from increased efficiency in continuous deployment.
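To make the Beam idea of "design once, run as batch or stream" a little more concrete, here is a minimal sketch in plain Python. It does not use the Apache Beam SDK; the names build_pipeline, run_batch and run_streaming, as well as the sample event strings, are hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical stand-in for a pipeline step. This is NOT the Apache Beam API;
# it only illustrates defining the logic once and choosing the runner later.
Step = Callable[[Iterable[str]], Iterator[str]]


def build_pipeline() -> List[Step]:
    """Define the processing logic once, independent of how it will be run."""

    def normalize(lines: Iterable[str]) -> Iterator[str]:
        # Trim whitespace and lowercase each record.
        for line in lines:
            yield line.strip().lower()

    def keep_orders(events: Iterable[str]) -> Iterator[str]:
        # Keep only records that describe an order (assumed naming scheme).
        for event in events:
            if event.startswith("order"):
                yield event

    return [normalize, keep_orders]


def run_batch(pipeline: List[Step], records: List[str]) -> List[str]:
    """Batch-style runner: consume a bounded list and return every result."""
    data: Iterable[str] = records
    for step in pipeline:
        data = step(data)
    return list(data)


def run_streaming(pipeline: List[Step], source: Iterator[str]) -> Iterator[str]:
    """Streaming-style runner: results are produced lazily, one record at a time."""
    data: Iterable[str] = source
    for step in pipeline:
        data = step(data)
    return iter(data)


if __name__ == "__main__":
    pipeline = build_pipeline()

    # The same pipeline definition is handed to two different runners.
    print(run_batch(pipeline, [" ORDER:1 ", "click:2", "Order:3"]))

    live_events = iter(["Order:9", "view:1"])
    for result in run_streaming(pipeline, live_events):
        print(result)
```

The point of the sketch is the separation of concerns: the pipeline definition never mentions whether its input is bounded or unbounded, so swapping the runner does not require touching the business logic. Real Beam pipelines add much more (windowing, triggers, distributed execution), which this toy version leaves out.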
As impressive as each of these open source projects is individually, it is the collective advances that best illustrate the huge impact the open source community has had on the enterprise and the monumental shift from legacy and proprietary software to open source-based systems, enabling companies of all sizes, across all industries, to increase speed, agility, and data-driven insights at all levels of their organizations.
How companies can prepare for the OSS changes ahead
While the changes that have already occurred are quite breathtaking, this is not the end of the story for these and other market-shaping forces. Companies can take several steps to leverage the sea change that has already occurred and to adapt to the innovations yet to come from the mashup of open source, cloud and big data.
Become an open source champion in your business. Join the open source communities relevant to your projects and interests. Educate yourself, your team and management on the benefits of open source. Determine what you can leverage instead of “reinventing the wheel”.
Contribute to open source projects. “There are a lot of companies that use open source today, but unfortunately many of them do not contribute,” says Jean-Baptiste Onofré, a Technical Fellow and software architect on the Apache Team at Talend. Onofré also served as a mentor during Apache Beam's incubation, contributed most of the connectors, and is now a Project Management Committee (PMC) member for Beam.
“It’s a win-win. You contribute upstream to the project so that others benefit from your work, but your company also benefits from their work. It means more feedback, more new features, more potentially fixed issues.”
Become an influencer in open source projects key to your company. By contributing to the open source community, your company develops influence on the projects that matter to its progress. That influence helps you steer changes to those projects in directions that particularly benefit your company.
Change the business culture to open source. The open source culture is open-minded, innovative and collaborative. “It means everything you do is transparent and you have to accept the different feedbacks with grace,” says Onofré. “You have to improve your code again and again, not because you are less or more skilled than another guy it's just because by itself open source forces you to be very open-minded and accepting of change.”
Change has always been the only constant in human existence and business. But change is happening faster now than at any other time in history. By staying open-minded, attuned to open source, and aware of the many ways to use data and analytics, you’ll be well prepared for whatever pops up next on the horizon.