A leading academic medical research center loads complex medical data

Talend helps the Research Data Management Group map data visually and execute high-performance loading.

There were actually three strategic considerations that drove us to Talend. Besides being an open source solution, vendor-independence was very important, since we'd had problems with support in the past and didn't want to get tied in. The third driver is that Talend is proficient at talking to Sybase.

IT Leader, Research Data Management Group

A leader in health sciences

With a workforce of 18,600 people, this leading medical sciences institution is dedicated to defining health worldwide through advanced biomedical research, graduate-level education in the life sciences and health professions, and excellence in patient care.

The newly-formed Information Technology Services (ITS) department provides a significant additional focus on academic education, research, and administrative systems. The organization is key in leveraging investments in information technology architecture and infrastructure. A sub-unit serves the needs of the research community by providing an integrated repository of clinical and life sciences data and by providing a centralized, secure, professionally managed infrastructure for the storage and management of research data.

Field mapping for a database conversion project

The research center needed to process and load large volumes of medical data into a data warehouse used for advanced statistical analysis via the i2b2 scientific analysis platform. The problem seemed simple enough: how to map the fields from the first database to the second. Initially, the developers had begun by exporting the field and table names to Excel, manually reviewing the sample data, and then copying and pasting the field and table names that appeared to match up, along with a column for notes on the conversions that would be necessary. This time-consuming manual process clearly called for a specialized tool.
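To make the field-mapping problem concrete, here is a minimal sketch in Python of what the spreadsheet was capturing: a source field, a target field, and a conversion note expressed as a function. It does not represent Talend's generated code or the center's actual schema. The column names (`PAT_ID`, `patient_id`, and so on) and the helper names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Any, Callable, Dict, Iterable, List


def identity(value: Any) -> Any:
    """Default conversion: pass the value through unchanged."""
    return value


@dataclass(frozen=True)
class FieldMapping:
    source: str                                  # field name in the source database
    target: str                                  # field name in the warehouse
    convert: Callable[[Any], Any] = identity     # the "conversion note", as code


# Hypothetical mappings; real field names would come from the two schemas.
MAPPINGS: List[FieldMapping] = [
    FieldMapping("PAT_ID", "patient_id", int),
    FieldMapping("DOB", "birth_date",
                 lambda s: datetime.strptime(s, "%m/%d/%Y").date()),
    FieldMapping("SEX", "sex_code", lambda s: s.strip().upper()),
]


def map_row(row: Dict[str, Any], mappings: Iterable[FieldMapping]) -> Dict[str, Any]:
    """Apply every mapping to one source row and return the target row."""
    return {m.target: m.convert(row[m.source]) for m in mappings}


def unmapped_fields(source_fields: Iterable[str],
                    mappings: Iterable[FieldMapping]) -> List[str]:
    """List source fields that no mapping covers yet, for manual review."""
    mapped = {m.source for m in mappings}
    return sorted(set(source_fields) - mapped)
```

Keeping the mappings as data rather than as hand-copied spreadsheet cells means that the same list can drive both the conversion itself and a report of fields that still need a decision, which is roughly the kind of bookkeeping a visual mapping tool automates.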

Working with a column-oriented database

The center had committed to using the Sybase IQ database for hosting the data warehouse. “Sybase IQ is a database engine specifically designed for retrieval of data as opposed to transactions,” explained the IT Leader of the Research Data Management Group. “The challenge for us was to be able to load it efficiently. We also wanted a product that was well-supported and that was not subject to the whims of any particular vendor.”

Choosing an open source solution

The research center was interested in open source solutions because its own work is also open source (among other things, the i2b2 platform is an open source project that resulted from a grant from the National Institutes of Health). In the course of its search for the right tool, the center came across Nautilus Consulting, which recommended looking at Talend before making a decision. “Talend is a really slick open source product that is available in an enhanced, subscription-supported version as well,” said Bill Grant, Founder and Principal of Nautilus Consulting. “It has a great GUI development environment and generates fully portable Java programs rather than requiring a run-time or engine to make it work. It is very mature and flexible and, best of all, can be had at the right price: you can download the fully functional open source version for free.”

“There were actually three strategic considerations that drove us to Talend. Besides being an open source solution, vendor-independence was very important, since we'd had problems with support in the past and didn't want to get tied in. The third driver is that Talend has native connectors, optimized to leverage the power of Sybase IQ.”

“The fundamental drivers were more strategic than financial, but I have to admit that the fact that I could download Talend Open Studio for free, without any commitment, did have some impact on our willingness to try it out. However, we do realize the importance of vendor support, and we have subscribed to Talend's Gold Technical Support offering.”

Today, the Research Data Management Group is successfully using Talend to load data into Sybase IQ. “The team learned the product through a Quick Start expert consulting engagement, which was really helpful in getting up to speed. We were in production within two months.”

A second open source initiative

As the application server layer, the center is using a popular open source scientific programming environment called i2b2.org. “When the people that wrote i2b2 first got the grant to create that open source platform from the NIH, it didn't include ETL,” added the IT Leader. “That was a major gap in their platform. Talend filled this gap and allowed us to import the data into i2b2. We are considering posting the Talend scripts to the Talend Exchange, so that people doing the same kind of ETL job wouldn't have to reinvent the wheel.”

“What we're working on is a fairly comprehensive data warehouse of medical data, with a very different approach in that we're storing the actual data from the medical systems and relying on the combination of Sybase IQ's database engine and the i2b2 interface to make that data accessible in a way that nobody has seen before.”

Going forward

The research center plans to go deeper with Talend. “It is clearly a great environment. We are already working on creating complete stand-alone data integration processes, developing major ETL jobs on a continual basis. From my perspective, it's a completely unique product. Quite simply, before Talend there was really no other solution that I succeeded in making work in our complex environment. It's just a perfect match for our needs.”