Talend Data Integration at LIFE Biosystems

Cancer Research Leveraging Data Centers.
Currently, the drugs used to fight cancer work directly in approximately 25% of all cases. To increase this percentage to 40%, we have to be able to integrate and analyze more data sources. Talend is here as an important part of our infrastructure.
Dr. Stephan Brock, CEO

These days, modern research no longer takes place just in laboratories. The percentage of research work done "in silico," with the help of information technology, is constantly increasing. Computer-based processes make it possible to simulate biochemical processes, and the calculated results are primarily used to confirm traditional lab experiments that are done "in vivo," on living organisms, or "in vitro," in a test tube. Computers are used in particular for complex projects in genetic research, or for consolidating and analyzing numerous large data sources. LIFE Biosystems is actively involved in "in silico" cancer research. This privately held company in Basel, Switzerland, conducts research in Heidelberg and Houston and combines molecular and clinical data to facilitate cancer treatments that are customized for the individual patient. Its main tool is the TheranoSys Discovery Platform (TDP), an innovative infrastructure that can analyze bodies of existing knowledge and data. With the help of TDP, it is possible to detect molecular mechanisms that are responsible for resistance to certain cancer treatments.

The main strength of this approach lies in the synergies between the proprietary identification technology, the multi-disciplinary team of experts, and, last but not least, the access to huge amounts of clinical and molecular patient data from leading cancer centers. Since 2009, LIFE Biosystems has been using Talend Data Integration, from Talend, the leading open source provider of data management solutions, to regularly align all this data.

The Challenge

For an effective cancer treatment, it is important to find out the exact type of disease the patient has and which drugs might bring the best results. Individual active ingredients affect specific proteins, and combinations of active ingredients can trigger complex mechanisms. There are tremendous quantities of lab results and studies on this issue, both in biotechnology studies and traditional medical and pharmaceutical research studies. One of the main tasks in bioinformatics is to merge the different sources and to find meaningful links. "We are basically the link between the different research institutes," explained Dr. Stephan Brock, CEO of LIFE Biosystems AG. "The multitude of sources and the volume of the available, constantly updated data require a high degree of computing power and intelligence for the development of the algorithms. Data management is therefore extremely important to us."

The business model of LIFE Biosystems is based on the effective processing of information, which is why the information technology it uses is of particular importance. Open source technologies are very popular in bioinformatics; Linux can almost be considered the standard. In the Heidelberg computer center, the approximately 200 Intel-based servers run on this free operating system. Five additional servers of the same type are located in another computer center in Houston, Texas. The software developers' workstations also run on Linux.

The merging, orchestration, and analysis of data are some of the most important tasks at LIFE Biosystems. Data from various public and private sources are integrated into a data warehouse that currently holds approximately 1.6 terabytes and is used in various projects to generate, analyze, and confirm therapeutic system models. With the data from the data warehouse, approximately 80% of all queries can be answered directly with predefined algorithms. Sometimes this involves simple text mining; at other times it involves complex queries. Some internal applications also access the database in order to generate reports or carry out analyses.

In total, LIFE Biosystems is currently using more than 40 different data sources, such as the European Molecular Biology Laboratory (EMBL), the National Institutes of Health (NIH), which is part of the US Department of Health and Human Services, the human protein database "UNIPROT", the database for human biology "Reactome", and Thomson Reuters, where detailed information on drugs and active ingredients can be obtained. The source formats range from spreadsheets to large, complex, structured XML files. In order to handle this large number of resources and to remain current in spite of the very different publication methods, it was decided to consolidate and standardize all data integration processes.

The Talend Solution

"In the past, most ETL jobs were solved on a script basis. Every individual process was manually programmed," Guillaume Taglang, the Lead System Architect at LIFE Biosystems, recalls. "But as we reached a certain number of data sources, this manual programming became very complicated and often created problems due to the lack of interface uniformity." By the beginning of 2011, the number of data sources is expected to double from today's figures. In order to keep up with this development, the company started looking for a standardized solution approach.

"One special requirement was that we needed to be able to track all changes based on the job documentation," explained Guillaume Taglang. "When it comes to experiments, you have to be able to find out at any time which software version or which algorithm was used. This is the only way to properly reproduce results. This integration with a comprehensive version control is part of the functionality of the Talend Data Integration we use."

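The reproducibility requirement described above can be illustrated with a minimal sketch. It is not Talend functionality and not the company's implementation; the function name, the field names, and the sample job are all hypothetical. The idea is simply that every result carries a record of which job version and algorithm produced it, plus a fingerprint of the input data.

```python
import hashlib
import json

def provenance_record(job_name, job_version, algorithm, input_bytes):
    """Describe how a result was produced (all field names are hypothetical)."""
    return {
        "job": job_name,
        "job_version": job_version,
        "algorithm": algorithm,
        # A SHA-256 fingerprint identifies the exact input that was processed.
        "input_sha256": hashlib.sha256(input_bytes).hexdigest(),
    }

# Hypothetical job and input, for illustration only.
first = provenance_record("load_example_source", "1.4", "example-algo-v2", b"input data")
again = provenance_record("load_example_source", "1.4", "example-algo-v2", b"input data")
changed = provenance_record("load_example_source", "1.5", "example-algo-v2", b"input data")

# Sorted keys give a stable text form that can be stored next to the result.
serialized = json.dumps(first, sort_keys=True)
```

Two runs with identical records used the same job version, algorithm, and input, so their results should be comparable; a differing record, as with the changed job version here, signals that a result may not be reproducible as-is.
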
Within the context of a thorough market evaluation, the company took a close look both at the traditional data integration platform from Informatica and at various open source data integration solutions, including specialized tools such as Talend, CloverETL, and Jitterbit, as well as Pentaho, a comprehensive BI suite with an integrated ETL component. As part of a comprehensive proof of concept, the potential solutions had to perform some typical jobs to see what they could do. Talend passed all tests very successfully and ultimately emerged as the winner in the selection process. During the evaluation phase, developers implemented a few smaller jobs with the software's community version, Talend Open Studio for Data Integration, so that the first results from smaller projects could be carried over directly into the professional version. For company-wide use, LIFE Biosystems chose the enterprise solution, Talend Data Integration, which offers additional functions and professional support.

The Talend software is currently used throughout the entire data processing area and helps analyze new data sources and make decisions on whether and how these can be integrated into the overall data model. With Talend, LIFE Biosystems can develop and execute the tasks required for the extraction and loading of data, which are needed to populate the data warehouse and aggregate data as quickly as possible. A typical request could, for example, come from a researcher at a hospital, who sends patient-specific data in an Excel file. For such a scenario, ETL jobs were developed together with Talend that make it possible to read the data and automatically align it with the database.
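
The spreadsheet scenario above follows the usual extract, transform, and load pattern, which the following minimal sketch illustrates. It is not the Talend job itself: the table layout, column names, and sample rows are hypothetical, a CSV export stands in for the Excel file so that only the Python standard library is needed, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a spreadsheet from a hospital.
SAMPLE_EXPORT = """patient_id,marker,value
P-001,her2,Positive
P-002,kras,Mutated
P-001,HER2,negative
"""

def load_measurements(conn, text):
    """Extract rows, normalize them, and align them with the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS measurement ("
        " patient_id TEXT NOT NULL,"
        " marker TEXT NOT NULL,"
        " value TEXT NOT NULL,"
        " PRIMARY KEY (patient_id, marker))"
    )
    # Transform: trim whitespace and unify the spelling of markers and values.
    rows = [
        (r["patient_id"].strip(), r["marker"].strip().upper(), r["value"].strip().lower())
        for r in csv.DictReader(io.StringIO(text))
    ]
    # Load: INSERT OR REPLACE keeps one row per (patient_id, marker), so a
    # later row overwrites an earlier one instead of creating a duplicate.
    conn.executemany(
        "INSERT OR REPLACE INTO measurement (patient_id, marker, value) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
processed = load_measurements(conn, SAMPLE_EXPORT)
stored = conn.execute(
    "SELECT patient_id, marker, value FROM measurement ORDER BY patient_id, marker"
).fetchall()
```

Running the load twice with the same file leaves the table unchanged, which is the property that lets such a job be repeated whenever an updated spreadsheet arrives.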

Advantages

Within a short period of time, Talend Data Integration enabled us to convert the scripts that had been used up to that point into ETL processes. As a result, today's know-how on integration processes is no longer just in the heads of individual programmers, but can be found in a central repository to which all developers have access. Additional data sources can be integrated into the data model of LIFE Biosystems faster and with a lot less effort. Talend is also a code generator: ETL processes can be modeled through a graphical user interface, and the resulting code can be executed within Talend or on the server. The code is either Java or Perl, and can also be directly accessed and manipulated by experienced programmers.

"Typical ETL jobs are very easy to develop with Talend. The deeper you get into the subject, the more complicated it gets, but that is just the nature of the matter. Based on my experience, the tool is much easier to use and understand than Informatica," stated Guillaume Taglang. "We were especially impressed by the support of Talend and the community. With regard to simple questions, there is almost always someone online who can help. With regard to specific questions about our infrastructure, the support team has always provided us with fast and usable answers."

Another advantage is the job scheduling functionality Talend offers. To keep data pools as current as possible, external and internal databases have to be aligned at regular intervals. The databases used, however, sometimes have very different update cycles: some are updated on a weekly basis, others monthly or even just once a year. With Talend, these processes can be scheduled and then run automatically on the correct date. Even calculation-intensive processes can be run in a cluster.
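
The idea of aligning sources with different update cycles can be sketched in a few lines. This is an illustration of the scheduling logic only, not of Talend's scheduler; the source names, dates, and the three cycle labels are hypothetical.

```python
import calendar
from datetime import date, timedelta

def next_run(last_run, cycle):
    """Return the date on which a source is due again after last_run."""
    if cycle == "weekly":
        return last_run + timedelta(weeks=1)
    if cycle == "monthly":
        year = last_run.year + last_run.month // 12
        month = last_run.month % 12 + 1
        # Clamp the day so that, for example, January 31 maps to the end of February.
        day = min(last_run.day, calendar.monthrange(year, month)[1])
        return date(year, month, day)
    if cycle == "yearly":
        year = last_run.year + 1
        # Clamp February 29 to February 28 in non-leap years.
        day = min(last_run.day, calendar.monthrange(year, last_run.month)[1])
        return date(year, last_run.month, day)
    raise ValueError(f"unknown cycle: {cycle}")

def due_sources(sources, today):
    """List the sources whose next alignment date has been reached."""
    return sorted(
        name for name, (last_run, cycle) in sources.items()
        if next_run(last_run, cycle) <= today
    )

# Hypothetical sources with their last alignment date and update cycle.
sources = {
    "source_weekly": (date(2010, 6, 1), "weekly"),
    "source_monthly": (date(2010, 6, 1), "monthly"),
    "source_yearly": (date(2010, 1, 1), "yearly"),
}
due = due_sources(sources, date(2010, 6, 10))
```

In this example only the weekly source is due on June 10, 2010, because the monthly and yearly sources are not expected to change before July 1, 2010 and January 1, 2011 respectively.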

Since LIFE Biosystems has earned a reputation as an IT expert in its industry, some customers have also asked whether they could be provided with ETL jobs for data transfer purposes. This is possible due to the open source approach, since, in contrast to traditional solutions, no software license is required to run an ETL process. The free Talend Open Studio for Data Integration version is entirely sufficient for these purposes.

"Currently, the drugs used to fight cancer work directly in approximately 25% of all cases. To increase this percentage to 40%, we have to be able to integrate and analyze more data sources. Talend is here as an important part of our infrastructure," summarizes Dr. Stephan Brock. "Our dream would be to be able to just push a button one day to determine the right cancer treatment for a specific patient and to thus be able to improve the prognosis significantly."