With the increased availability of data through sensors, inter-connected mobile devices, social media and private or public spatial data sets, the demand for a seamless integration of spatial information into data-driven decision-making processes has reached a new high. We consider spatial data as any kind of data supplemented with additional information about the location and shape of objects on earth. One simple example would be the general information of companies, buildings, persons and/or vehicles such as name, type, and color, with supplementary X, Y- coordinate values defining their current or permanent position on the earth respectively. But real world objects often have much more complex forms. For example, using line-geometries with several nodes, we can describe, for instance, a street, a river or even movements of a car or a person. With polygon-geometries we can additionally describe settlements, shops, forest areas etc. Spatial data with their diversity of formats and spatial reference systems and their specific geometrical operations have been historically managed separately while being only occasionally if at all, considered in traditional data integration processes.
In this article, we would like to show a concrete example of how to easily enrich your ETL-processes with spatial data. If you wish to try it out on your own, you can use a current version of Talend Open Studio and additionally download and install the Plugin “GeoSpatial Integration for Talend” for free. The plugin extends Talend’s data sources by providing access to spatial databases and formats, such as Oracle Locator/Spatial, PostGIS, SpatiaLite, Shape Files and Well-known text (WKT). In addition, the plugin extends the palette of Talend’s components with a full set of geometrical operators such as area or length calculators, buffers, centroid, envelopes, intersections and much more to automate spatial tasks and workflows. The following image depicts the overall spatial ETL-process with the newly added components.
To demonstrate the functionality, let’s have a look at the following small scenario with wind farms: The individual turbines of the wind farms are available as single location objects (points) and we would like to calculate the full area of the wind farm (or to speak in geometrical terms: to calculate the convex hull of all points belonging to one wind farm location and hence grouped by an attribute).
This exercise includes the following steps:
- Define the metadata for our new data source (shape file)
- Define the Talend job to calculate the wind farm area by applying a geometrical operation
- Execute the job and write the resulting areas in a file or database.
This simple process allows users to define and visualize the wind farms’ expansion area, where the group of wind turbines is located, and thus create the basis for further planning steps in the fields of road construction or nature conservation.
Step 1: Define the Metadata to load the Geometries from a file or database.
First, we have to register the new data source we want to read. In our case, it is the shape file with the wind turbine location information. If you have spatial data in another format or spatial database like Oracle Spatial or PostgreSQL with PostGIS, the process is very similar.
This is how the sample file with the wind turbines looks like if visualized as individual point objects. For the purpose of demonstration, our GIS & Reporting platform Cadenza Desktop is used. The background map is based on Open Street Map.
Next, we have to define the metadata for our new data source in Talend as follows. This way, we can use this data source for the example-job and any other future jobs.
- Specify the name and purpose.
- Select a path to the source file with the wind turbines. If the shape file has a project-file (.prj) the coordinate reference system is automatically identified and assigned. In other cases, you can enter it manually.
- Edit the schema.
Schematic diagram data such as column names and data types are imported automatically. Adjustments can be made if required (e.g. setting the primary key, if this has not been specified in the data records).
The shapefile has now been included in the Talend metadata and can be used for the next step or for any other job you want to start with this data. After this step metadata-tree should look like this in Talend Open Studio:
Step (2): Create a Talend Job including the spatial operators.
In this step, our aim is to generate a geometry that covers all wind turbines and to create a convex hull around all wind turbines using the geometry functions as illustrated below:
To do so, you have to create a new Job in Talend and enter a name, purpose, and description for it. Then drag the new data source, “windfarm” (created in the previous step), and drop it to the job’s workspace. A dialog window will open automatically. You can now choose a component you wish to use. Since we are working with a shape file, select “dShapeFileInput”.
The shape file with our wind turbines is now ready to be further processed. Since we want to calculate the convex hull (wind park area) of our points, select the new dAggregateGeometryRow component, move it to the job’s workspace as well and choose the “main-row” connection between the dShapeInput and the dAggregateGeometryRow components.
Finally, we have to define the output. In our case, the output is a simple shape-file containing just the name of the wind park and the new geometries created. Hence, select the dAggregateGeometryRow component and navigate to the “Component” tab. Click the “EditSchema” button in order to define the component’s output schema. By clicking the “Add” button (“+” symbol) we create two new columns „Name“ ( Type: STRING, Length: 80) and “ConvexHull“ (Type: DGeometry ).
As our source file contains points from different wind parks and we want to generate a convex hull from wind turbines pertaining to each wind park, the wind turbine points need to be grouped by the name of each Wind park. This can be accomplished by setting the “Group by” property of the component “dAggregateGeometryRow_1” to “NAME” column (refer to the illustration below).
Add a tMap component to the workspace and create a main-row link between the “dAggregateGeometryRow“ and the “tMap” component as follows:
Open the tMap component by double-clicking it, create a new output data stream and call it “Out1“, for instance.
Now we can map the input data stream via drag & drop to the output data stream as shown below. It is recommended that you create a unique identifier for all output data sets. In our example, we hence create a new “ID” column for the unique identifier as an “Integer” data type by using the Numeric.sequence(“s1”,1,1) function:
The newly created spatial and attribute data shall be saved as a shapefile. For this, we can use the new dShapefileOutput component as follows.
To customize the Output-Stream, navigate to Component properties and select the storage location of the shapefile. Also select the same coordinate reference system as defined in the data source. In our example it is EPSG:3857 (a.k.a WGS84 or Web Mercator).
Step (3): Execute the job and write the resulting areas in a file or database.
That’s it! Now you can run the job and the output files will be saved to the file you specified for further use. After the execution of your job, your ETL process should now look like this:
Congratulations! You just processed your first Spatial ETL job without leaving Talend Open Studio.
For the purpose of demonstration of the result, the screenshot below shows the newly created areas added in our software Cadenza:
As you can see, the new areas (the wind parks) are created including all single wind turbines of an area.
More Steps with Spatial Data
If you start thinking about it, you are sure to find many other examples where spatial information and operations will be valuable for your business decision. Maybe you would like
… to calculate the best location of you next shop based on demographic data and driving distances to your current shops
…or you want to analyze location of mobile website visitors to trigger special location-based offers
…or to identify new areas where offline marketing would have best impact on your target audience
… Or you need to extend your regular reports by additional spatial information, provide deeper spatial analytics and location intelligence to you processes etc.
After downloading the Plugin for GeoSpatial Integration, you will find many additional samples and we will send you further click-through tutorials to experiment with Spatial-ETL. Do try it out! You can use the Plugin for free with Talend Open Studio. For professional support in a production-ready environment, team work and in conjunction with Talend Data Integration or Talend Data Management Platform, we also offer a yearly subscription.
Let us know what you think in the comments below and become spatial data driven.
About the Author –
Dr. Wassilios Kazakos
Head of Marketing and Business Development at Disy.
Wassili studied computer science and received a PH.D. at the University of Karlsruhe (KIT), in Germany. During his research carrier, he held several positions, from researcher and department manager at the database division of the FZI Research Center for Information Technology. With his technical skills, love for spatial data and market understanding, he helped Disy Informationssysteme GmbH becoming Germany’s leading provider for spatial and environmental IT-solutions. His deep understanding of customer’s needs in combination with his computer science background make him a highly-esteemed expert for spatial data driven applications. He has published several books and has written numerous articles in magazines and scientific journals.