Stop Chasing Perfection in Analytics. Here’s Why
A while back I wrote a blog post on another favorite topic of mine, DevOps, and in it I discussed the notion of perfection being the enemy of ‘good enough’. After some conversations these last few weeks, I have reaffirmed my stance and broadened it to include everything, especially analytics.
What I hear time and again, both from people just starting out on their analytics journey and from those a ways down the road with it, is this: “We are looking for the perfect use case...”, “We want to get the perfect tool...” or “We want the perfect platform...”. The simple fact is that none of these things exist. Rather than seeking perfect tools, we should be rolling up our sleeves and doing some work.
Data science, analytics, the absolutely OVERUSED phrase “big data” all start with someone having an idea and trying it out, but more than 90% of these attempts end up failing. The key to analytics success is the model of “fail fast, fail cheap”. This model is built around the idea that not only will you fail, but you SHOULD fail. The key is failing as quickly as possible, and therein lies the rub. Analytics projects are an investment, and it IS important to have tools that will help you overcome the three big pitfalls:
- Managing the environment (tools for cloud, hybrid, and on-prem are numerous and rapidly maturing).
- Having access to data and breaking down the silos within the organization - this is largely a process problem, my friends, and as I have said thousands of times, “you can’t fix a process problem with technology”.
- Good data integration and management tools - this is an area with only a few players, but that will mature really quickly when more organizations realize what a huge issue it is.
In the bigger picture, we have largely handled the first issue. AWS/Azure/Google make this supremely simple in the cloud, and solutions like BlueData make it simple in a hybrid or on-prem implementation. They make it simpler for IT to handle cluster deployment, rapid prototyping, and the overall management of analytics engines like Hadoop and Spark.
Issue number two is still a work in progress and I am intentionally going to skim over it, because it is an incredibly dense set of tasks and requires a lot of buy-in and cross-functional work. That kind of work has not been the strong suit of any of the groups that need to be involved, and it can also lead to a lot of security and data governance issues that fall outside what technology can address. Stay tuned for some future blogs that will hash out this issue in detail.
Issue number three is the next big thing in analytics software. There are already some players in the market that can get data to the people who need it most, while at the same time greatly simplifying the processes of management, discovery, cataloging, and the overall “plumbing” of data pipelines to enable rapid configuration and deployment of complex analytics jobs. This is going to be HUGELY important in the small to medium business/enterprise (SMB/SME) space, which is by far the largest market that remains for analytics. The big guys (think Fortune 500) may not have great solutions for this issue based on what I have heard from them, but they do have the advantage of the two most powerful resources at the disposal of any technology team: money and warm bodies. Throw enough of both at a problem and it usually goes away.
This is, of course, the root of the problem for the SMB/SME space, where “unlimited” resources are not available. If you are in this situation you may be asking, “How do I get the best data, to the best spot, for the most value and the most work, without an army and the budget of most solid western countries?” That is where data management and data integration software is going to make the biggest mark. By releasing an organization from the need to hire data engineers and custodians whose sole job is wrangling data, these tools will enable smaller organizations to take advantage of the analytics trend. When this happens, those Fortune 500 companies had better watch out, because a whole new wave of disruption will emerge and those SMBs/SMEs may not remain “small” for long!
If you are going to look at data integration tools, here is my quick checklist for making sure they are worth the time:
- Open source version (read FREE) as well as a paid version for when you mature into needing enterprise-grade features and support.
- Integration of data, analytics, and most importantly Data Governance/MDM functionality for ALL data types (not just data in HDFS).
- Support for not only SQL/NoSQL stacks OR Hadoop OR Spark, but an ability to support them all - this will matter more and more as your strategy matures.
- Templates to get you started with analytics jobs that include the ability to customize extensively.
- Support for multiple users simultaneously - for when you have more than one person wanting to run a report!
- A roadmap and vision that includes investment and growth in the big 3:
  - Platform support (minimum SQL and Hadoop, should support Spark)
  - Tools support (does it include or integrate with visualization tools like Tableau or Qlik?)
  - Data Management and Governance (metadata, lineage and audit, security, and the ability to supply data pipelines on-demand to end users without data engineering skills)
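If you want to compare a few candidates side by side, the checklist above can be turned into a tiny scorecard. What follows is only a rough sketch in Python: the item labels are my own shorthand for the bullets above, the `score_tool` helper is hypothetical, and the answers for the example tool are invented purely for illustration.

```python
# Shorthand labels for the checklist bullets above (made-up names, not a standard).
CHECKLIST = [
    "free_open_source_tier",
    "governance_for_all_data_types",
    "supports_sql_nosql_hadoop_spark",
    "customizable_templates",
    "multi_user",
    "roadmap_platform_support",
    "roadmap_tools_support",
    "roadmap_data_governance",
]


def score_tool(answers: dict[str, bool]) -> tuple[int, list[str]]:
    """Return (number of checklist items met, list of items missed)."""
    # Any item that is absent from the answers counts as missed.
    missed = [item for item in CHECKLIST if not answers.get(item, False)]
    return len(CHECKLIST) - len(missed), missed


# A hypothetical tool with invented answers, just to show the shape of the output.
example_tool = {item: True for item in CHECKLIST}
example_tool["multi_user"] = False
example_tool["roadmap_tools_support"] = False

met, missed = score_tool(example_tool)
print(f"{met}/{len(CHECKLIST)} met; missing: {', '.join(missed)}")
```

A plain yes/no tally like this is deliberately crude. It will not pick a perfect tool for you (there isn't one), but it does make the gaps visible quickly so you can move on to actually trying things.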
So get out there, download some tools, and start trying them out. There are always new things to discover and you should start failing now so you can succeed the one time in the future that will really matter!