Recently, I was fortunate enough to find myself in Munich, Germany on a trip to visit family, and discovered that just north of town is the city of Ingolstadt, home to the Audi factory. Being somewhat of a gear-head and very much an Audi fan, I decided to take the factory tour and check out the museum (I essentially got a private tour, since I took the English version and it was just my wife and me on it; HIGHLY recommend it!).
In the beginning of ETL….
When I started my IT career over 15 years ago, I was nothing more than a “fresh-out” with a college degree and an interest in computers and programming. At that time, I knew the theories behind the Software Development Life Cycle (SDLC) and had put them into some practice in a classroom setting, but I was still left wondering how it all related to the big, bad corporate world. And by the way, what the heck is ETL?
With the release of Apache Spark version 2.0 in preview, there has been a lot of buzz recently about the implications of this advanced technology. Nowhere was that more apparent than in San Francisco this week, where Spark Summit West drew a sold-out crowd of 2,500 software developers and data scientists, according to host and Spark cloud service provider Databricks.
Self-service data preparation, which we define as empowering business workers and analysts to prepare data for themselves prior to analysis, is often cited as the next big thing. In fact, Gartner predicted last year that “by 2018 most business users and analysts in organisations will have access to self-service tools to prepare data for analysis”.
In our last installment, we looked at how to easily configure your data warehouse or datamart to spin up and spin down automatically so that you don’t need to waste valuable compute resources (or money!) running databases when they’re not in use. Now let’s take a look at how you can automate your AWS Redshift cluster environments.
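The full post walks through the details, but the core idea can be sketched in a few lines: a small scheduled job decides whether the cluster should be running based on a daily working-hours window, then calls the Redshift API to pause or resume it. This is a minimal sketch, not the exact solution from the post; the cluster name and the 8am–6pm window are assumptions for illustration.

```python
from datetime import time

def desired_state(now, start=time(8, 0), stop=time(18, 0)):
    """Decide whether the cluster should be running right now.

    The cluster is 'available' only inside the daily window (an assumed
    8am-6pm schedule here); outside of it, it should be 'paused' so you
    stop paying for idle compute.
    """
    return "available" if start <= now < stop else "paused"

def apply_state(cluster_id, state):
    """Pause or resume a Redshift cluster via boto3's PauseCluster /
    ResumeCluster API calls. 'cluster_id' is a hypothetical name."""
    import boto3  # imported lazily so the scheduling logic is testable offline
    client = boto3.client("redshift")
    if state == "paused":
        client.pause_cluster(ClusterIdentifier=cluster_id)
    else:
        client.resume_cluster(ClusterIdentifier=cluster_id)
```

Run `apply_state("reporting-datamart", desired_state(datetime.now().time()))` from a cron job or a scheduled Lambda, and the cluster spins down every evening on its own.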
As part of a POC of Talend v6.1 Big Data capabilities, I was asked by one of our long-time customers, a major e-commerce company, to present a solution for aggregating huge files of clickstream data on Hadoop.
Have you ever stood up a datamart to build a handful of analytical reports, only to watch that repository sit idle until the next time those reports need refreshing (which may be a week, a month or several months away)? At many points in my career I have built data warehouses and datamarts for that exact scenario and have been frustrated by the length of time that database sits idle…it seemed like such a waste of energy and technical resources.
With the world becoming more connected and data savvy with every passing year, there’s a rising need for businesses to efficiently manage the trillions of bytes of data that they capture, and gain insights from them. Talend helps businesses do exactly this while boosting developer productivity and reducing time-to-value for ETL data warehouse projects. Talend for Big Data is seeing rapid growth, emerging as a must-have tool for quickly and effectively cleansing and analyzing Big Data.
In my previous blog, “Beyond ‘The Data Vault’”, I examined various data storage options and a practical architecture/design for an Enterprise Data Vault Warehouse. As you may have realized by now, I am quite smitten with this innovative data modeling methodology, and I recommend that anyone developing a ‘Data Lake’ or Data Warehouse on Big Data platforms consider it a critical design paradigm.
I wrote a blog a while back about another favorite topic of mine, DevOps, in which I discussed the notion of perfection being the enemy of ‘good enough’. After some conversations these last few weeks, I have reaffirmed my stance and broadened it to include everything, especially analytics.
Our Puzzled Customers
In this era of Big Data, many of the IT people I talk with have a number of questions about the technology and trends associated with this new paradigm.
For example, many of them are feeling somewhat overwhelmed with the amount of data they now have to deal with – data that seems to be growing exponentially.
Many of the comments I hear go something like this: “I never thought we would be swamped with so much information. Are there Big Data solutions available now that can help me deal with this deluge of data?”