"Since we don't have the ability to impose a standard format for the provided source data, Talend Open Studio for Data Integration has allowed us to save a significant amount of time compared to our former system with improved reliability." - François-Xavier Thoorens, Research Officer
A good all-around research institution
The Joint Research Center is one of the Directorates-General of the European Commission. It consists of seven research institutes based in five Member States of the European Union (Belgium, Germany, Italy, the Netherlands, and Spain). With a workforce of about 2,700 persons, the JRC plays an active role in the creation of a safer, cleaner, healthier, and more competitive Europe by supplying scientific and technical support for the conception, development, implementation, and enforcement of community regulations.
Scientific support for enforcing community regulations
There is no formal community regulation in maritime matters, and community research work falls within the scope of diverse regulations dedicated to other topics - energy and transport, fishing, natural resources, environment, climate, etc. The JRC coordinates research in various domains - data harmonization and accessibility, collecting and analyzing statistics on fishing activities in Europe (e.g., control and evaluation of fish populations), maritime surveillance, control of the maritime ecosystem, pollution watch, etc. One of the challenges that the JRC has to address is the disparity among its data suppliers, which makes it essential to consolidate and reconcile data before it can be used.
Information sharing between Member States
The JRC handles a large amount of data originating from diverse entities in each member country, and cross-references it with geospatial data from satellites, radars, etc. For this reason, the IBC Institute's department of Maritime Affairs - the JRC Directorate in Italy charged with the protection and safety of its citizens - recently introduced two large-scale data integration projects.
"The first project, called Data Collection, was initiated by European Directives to organize information sharing among member countries in the field of fisheries," reports François-Xavier Thoorens, Research Officer at the JRC in Ispra, Italy. "Its objective is two-fold: to collect several types of data (scientific, social, etc.), and then to consolidate them within a unified database for analysis, using Google Earth for visualization. The biggest problem with this project is the heterogeneity of the sources; both model and data differ from one country to another."
Initially, the JRC used Talend Open Studio for Data Integration to convert the data of every member country to a common format, which could then be loaded into a unified database. However, this was only a temporary solution, since European Directives later established a standard XML schema for the source data, making upstream conversions unnecessary. "We did not use Talend Open Studio for Data Integration for very long on this project," explains François-Xavier Thoorens. "But this period allowed us to get trained and to discover all the advantages of the solution. The open source model is particularly well suited to this type of temporary use, because there is no cost for the license and the tool's learning curve is very short."
Real-time ocean watching
The second project organizes campaigns to watch the oceans and seas bordering the European Community using satellite and radar imagery. "Once again, there is no predefined data model and we receive a mix of Excel and CSV files, and even screenshots or faxes," continues François-Xavier Thoorens. "To perform real-time tracking, we have to integrate all this data - for example, GPS locations (provided in a VMS format for fishing boats or AIS format for merchant ships). This information lets us identify anomalies (for example, ships in distress, deliberate pollution, etc.). The processed data volumes are huge."
During the first stage of the project, the JRC developed a custom system based on Linux, Java, and manual scripting (AWK). "But the consolidation turned out to be too complex, and very hard to maintain," says François-Xavier Thoorens. "For example, the VMS format contains the code name of the ship, its position, its length, etc. But each data point can be expressed in various formats - US or European style for the date formats; lengths stated in feet or meters; numbers with decimal commas or points, etc. Besides that, each file contains columns in a different order and we even receive PDF files containing scanned faxes or screenshots."
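The kind of normalization described above can be illustrated with a minimal Python sketch. It is not the JRC's code, a Talend job, or the real VMS layout: the field names (`ship_code`, `date`, `length`), the per-supplier "profile" dictionaries, and the sample values are all hypothetical. The sketch also assumes that each supplier's date style and length unit are declared in advance, because a date such as 03/04/2009 cannot be disambiguated from the value alone.

```python
from datetime import datetime

FEET_TO_METERS = 0.3048  # exact, by definition of the international foot


def parse_number(text):
    """Parse a number written with a decimal point or a decimal comma.

    Assumes the value contains no thousands separators.
    """
    return float(text.strip().replace(",", "."))


def parse_date(text, style):
    """Parse a date using the supplier's declared style: 'US' or 'EU'."""
    fmt = "%m/%d/%Y" if style == "US" else "%d/%m/%Y"
    return datetime.strptime(text.strip(), fmt).date()


def length_in_meters(text, unit):
    """Convert a length given in feet ('ft') or meters ('m') to meters."""
    value = parse_number(text)
    return value * FEET_TO_METERS if unit == "ft" else value


def normalize(row, profile):
    """Map one raw row onto a common record, using a hypothetical profile
    that declares the supplier's column order, date style, and length unit."""
    raw = dict(zip(profile["columns"], row))
    return {
        "ship_code": raw["ship_code"].strip(),
        "date": parse_date(raw["date"], profile["date_style"]),
        "length_m": round(
            length_in_meters(raw["length"], profile["length_unit"]), 2
        ),
    }


# Two hypothetical suppliers describing the same ship in different conventions.
us_profile = {
    "columns": ["ship_code", "date", "length"],
    "date_style": "US",
    "length_unit": "ft",
}
eu_profile = {
    "columns": ["date", "length", "ship_code"],
    "date_style": "EU",
    "length_unit": "m",
}

record_a = normalize(["ABC123", "03/04/2009", "100"], us_profile)
record_b = normalize(["04/03/2009", "30,48", "ABC123"], eu_profile)
```

Both rows reduce to the same common record once the column order, date style, unit, and decimal separator are accounted for. Keeping those conventions in a per-supplier profile, rather than in separate scripts, is one way to avoid the maintenance burden the quote describes.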
Industrializing integration with Talend Open Studio for Data Integration
The JRC started to search for a simpler solution, which would be easier to implement and would reduce the number of manual interventions. "We looked at various alternatives on the market," explains François-Xavier Thoorens. "Talend Open Studio for Data Integration offered several advantages. First of all, the free license was attractive - budgetary savings, and no license management. Moreover, we were already familiar with the Eclipse framework that Talend is based on, so we could save some time on ramp-up. We ran some tests on limited data sets and on specific jobs and we were very satisfied with the results. The scripts turned out to be very effective with well-structured and consistent data. Since we don't have the ability to impose a standard format for the provided source data, Talend Open Studio for Data Integration has allowed us to save a significant amount of time compared to our former system with improved reliability."
Today, thanks to Talend Open Studio for Data Integration, the JRC no longer needs to manage a large number of scripts and complex programs, but only a single, well-documented, and much more reliable process. "Thanks to the numerous connectors, we can chain the conversions without going through intermediate transformations. This lets us save a lot of development time and industrialize our integration process," says François-Xavier Thoorens.
In light of the success of this project, the JRC is evaluating Talend Open Studio for Data Integration for handling scientific data dealing with DNA tracing. The data conversion needs would be similar, but the data volumes would be much larger. Tests are currently underway to validate this solution. Talend Open Studio for Data Integration has become the JRC's data conversion engine of choice for any ad-hoc conversion need.