UMIT Fights Cancer with Data Integration, Mining, and Analysis

Talend Open Studio for Data Integration helps UMIT process data and perform statistical analysis.
UMIT/biomed relies entirely on Talend's solutions for all data integration needs. We have high hopes that the IMGuS project will contribute to the reduction of prostate cancer mortality rates, and data integration is a critical part of this project. Talend is helping us save lives!
Dr. Bernhard Pfeifer, Associate Professor

A university specializing in cancer treatment

The University for Health Sciences, Medical Informatics and Technology (UMIT), based in Hall, Austria, is a key participant in the IMGuS project. A life science data warehouse system supporting systems biology in prostate cancer is managed by UMIT/biomed. In coordination with five other research groups located in Germany and Austria, UMIT manages the technical infrastructure and the data warehouse part of the project.

The IMGuS project

Prostate cancer is the most frequent tumor type in men and the second most frequent cause of male death. The IMGuS project applies high-throughput data processing to identify molecular signatures that indicate which patients are good candidates for prostate cancer treatment. Patient samples come from a bank at the University of Innsbruck's Clinic of Urology. The established technology platforms of the different partners are used to generate complementary genomic, proteomic, and metabolomic data from samples of healthy controls and of low-risk and high-risk prostate cancer patients. The results for these groups are analyzed using statistical and data mining methods to determine molecular signatures for new therapy and prediction approaches. The generated data are integrated and stored in a clinical data warehouse, which is managed by the Institute of Biomedical Engineering at UMIT.

Data processing is key to cancer research

"A large part of cancer research today consists of data processing and statistical analysis," explains Dr. Bernhard Tilg, Professor and Board Member at UMIT Institute of Biomedical Engineering. "The goal of these projects is to identify molecular signatures associated with certain types of tumors, so that efficient and non-intrusive diagnostic mechanisms can be designed. Some cancer treatments have high success rates when the disease is diagnosed in time, but the key problem remains the diagnostic."

"We use data integration to combine several different data sources to perform advanced analysis and statistics on the whole set," clarifies Dr. Bernhard Pfeifer, Associate Professor at UMIT Institute of Biomedical Engineering. "And, because of the amount of data the high throughput sources create, an automated approach is mandatory. We looked at a number of data integration solutions, both proprietary and open source, and settled on Talend's solutions because of their flexibility, openness, and high performance."

It was critical that the chosen data integration solution not only work with all data sources, but also be able to integrate specific data processing approaches. For example, since various medical devices deliver data in different formats, pre-processing this data is necessary. Talend's open architecture allowed UMIT to develop specific components to access and process this data.

The PostgreSQL-based LINDA data warehouse - the basis for the statistical analysis of the IMGuS project data - is loaded in two stages. The first stage, dubbed Electronic Data Capture or EDC, centralizes data from all the different sources - patient samples, reference medical data, genome cartography, etc. "The Electronic Data Capture stage is very complex," explains Bernhard Pfeifer. "Not only are the data providers very diverse (five different universities and research centers) but the formats vary widely - very large CSV files, high-resolution images, RDBMS, XML data, etc."

Administrative data is also loaded at this stage, including patient demographics, information about the biological source a certain sample comes from (tissue, serum, etc.), or information on the data source where the information is stored.

The second loading stage reconciles, transforms, cleanses, and enriches the data contained in the EDC and loads the LINDA data warehouse. "At this stage, we need to bring in reference data from external providers - medical publications, legacy systems, reference medical databases, and the like," explains Bernhard Pfeifer. "Talend's native support of Web Services and XML brings tremendous value to the project. It allows us to parse and cross-reference external data sources easily, greatly reducing the time it would otherwise take to enrich the data warehouse."
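To make the two loading stages concrete, here is a minimal sketch in plain Python. It is not UMIT's Talend job and does not reflect the actual LINDA schema. The sample inputs, the field names (sample_id, patient_id, measurement, risk_group), and the functions capture and load are all hypothetical, chosen only to show a capture stage that centralizes differently formatted sources and a second stage that cleanses, reconciles, and enriches them.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical inputs standing in for two differently formatted sources.
CSV_SOURCE = "sample_id,patient_id,measurement\nS1,P1, 4.2\nS2,P2,\nS1,P1, 4.2\n"
XML_SOURCE = "<samples><sample id='S3' patient='P3' measurement='7.9'/></samples>"

# Hypothetical reference data standing in for an external provider.
REFERENCE = {"P1": "low-risk", "P3": "high-risk"}


def capture(csv_text, xml_text):
    """Stage 1: centralize records from every source in one common shape."""
    staged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        staged.append({
            "sample_id": row["sample_id"],
            "patient_id": row["patient_id"],
            "measurement": row["measurement"],
            "source": "csv",
        })
    for node in ET.fromstring(xml_text).iter("sample"):
        staged.append({
            "sample_id": node.get("id"),
            "patient_id": node.get("patient"),
            "measurement": node.get("measurement"),
            "source": "xml",
        })
    return staged


def load(staged, reference):
    """Stage 2: cleanse, reconcile duplicates, and enrich staged records."""
    warehouse = {}
    for rec in staged:
        value = (rec["measurement"] or "").strip()
        if not value:
            continue  # cleanse: skip records that carry no measurement
        # reconcile: keying on sample_id keeps one row per sample
        warehouse[rec["sample_id"]] = {
            "patient_id": rec["patient_id"],
            "measurement": float(value),  # transform: text to number
            # enrich: cross-reference the external reference data
            "risk_group": reference.get(rec["patient_id"], "unknown"),
        }
    return warehouse


staged = capture(CSV_SOURCE, XML_SOURCE)
warehouse = load(staged, REFERENCE)
```

In this sketch the staging list plays the role of the EDC and the dictionary plays the role of the warehouse. A real pipeline would write both to database tables and would handle far larger and more varied inputs.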

The data warehouse is refreshed nightly, so researchers can use ad-hoc query and data mining tools, and apply advanced statistical models, to extract up-to-date data relevant to their research.

"UMIT/biomed relies entirely on Talend's solutions for all data integration needs," concludes Bernhard Pfeifer. "We have high hopes that the IMGuS project will contribute to the reduction of prostate cancer mortality rates, and data integration is a critical part of this project. Talend is helping us save lives!"