Complex Generation and Distribution of Documents with Talend
In this post, I would like to cover the options we have for building complex document generation systems. This was formerly the domain of expensive software like Adobe Publisher. With Talend Open Studio and JasperReports, you can build such a system yourself.
Introduction / Conceptual Formulation
One of Germany’s biggest online vehicle dealers gets access to detailed offer and demand data which reflect the current offers and pricing in the market. The division Automotive Intelligence provides the car dealers with these data. Thus, data-driven car dealers can estimate under which conditions they should accept a trade-in, how their prices compare with those of their competitors, etc. The indexes and a concrete recommendation for each vehicle help the dealers to sell cars faster and more successfully.
Picture 1: Overview of the processing and the data flow
The whole process of calculating the index figures, assembling the necessary resources and generating the documents is carried out by Talend jobs.
In the Talend Administration Center (TAC), the staff can monitor and control the data processing at all times.
The challenges/demands of this project:
1. Short processing time and linear scaling of the capacity
2. Control of the whole process in the Talend Administration Center
3. Generation of Excel and PDF files based on templates with interactive
4. Tolerance towards resources which at times might not be fully available (automatic repetition)
Parallelization: The Key to High Performance
Parallelization reduces the amount of time needed per customer. The individual process steps can run in parallel. The control job checks whether there are further work units, that is, customer analyses which have reached the start status of the corresponding step.
Within one processing step, any number of customer analyses can be processed.
Picture 2: Parallel Processing (main flow)
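The check for further work units could be sketched roughly as follows. This is only an illustration, not the actual Talend job: the class ControlJobSketch, the method pendingFor and the status name used in the usage note are hypothetical, and the in-memory map stands in for whatever status store the project really uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class ControlJobSketch {

    // Returns the ids of all customer analyses whose current status equals
    // the start status of a processing step (names are assumptions).
    static List<Integer> pendingFor(Map<Integer, String> statusByAnalysis, String startStatus) {
        List<Integer> ids = new ArrayList<>();
        for (Map.Entry<Integer, String> entry : statusByAnalysis.entrySet()) {
            if (startStatus.equals(entry.getValue())) {
                ids.add(entry.getKey());
            }
        }
        return ids;
    }
}
```

A control job would call something like pendingFor(statuses, "DATA_LOADED") for each step and start the step only if the returned list is not empty.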
To distribute the workload, a virtual job server is set up in the TAC, which automatically distributes the workload across the individual job servers.
The parallelization within the task (starting the job instances) is carried out with parallel iteration:
Picture 3: Parallel starts of job instances; the number of job instances is kept steady if possible
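In plain Java, the idea of keeping a steady number of instances busy can be approximated with a fixed-size thread pool. The following is a minimal sketch under that assumption and does not reproduce Talend's parallel iteration itself; processAnalysis is a hypothetical stand-in for one job instance, and the peak counter exists only to show that the limit is respected.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

final class ParallelIterationSketch {

    // Highest number of work units that were in progress at the same moment.
    static final AtomicInteger peak = new AtomicInteger();
    private static final AtomicInteger running = new AtomicInteger();

    // Hypothetical stand-in for one job instance handling one customer analysis.
    static String processAnalysis(int analysisId) {
        int now = running.incrementAndGet();
        peak.accumulateAndGet(now, Math::max);
        try {
            Thread.sleep(20); // simulated work
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            running.decrementAndGet();
        }
        return "analysis-" + analysisId + ":done";
    }

    // Runs all work units with at most maxParallel of them active at once.
    // As soon as one finishes, the pool picks up the next queued unit.
    static List<String> runAll(List<Integer> analysisIds, int maxParallel) {
        ExecutorService pool = Executors.newFixedThreadPool(maxParallel);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int id : analysisIds) {
                futures.add(pool.submit(() -> processAnalysis(id)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get()); // results come back in submission order
            }
            return results;
        } catch (Exception e) {
            throw new IllegalStateException("processing failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The fixed pool size plays the role of the configured number of parallel job instances: raising it shortens the total run time as long as the job servers have capacity left.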
Control of the Whole Process in the TAC
The process control makes use of the TAC's ability to start tasks via RESTful web services (aka MetaServlet). This function was encapsulated in the custom component tRunTask.
Picture 4: Cooperation of Control Task and Task for Step-by-Step Processing.
Running each step as a separate task is an advantage for the processing as a whole. With this architecture, a single step can be updated in the TAC without having to update the whole system. It also improves stability and transparency, as the individual steps run in separate processes and each log output refers to only one step.
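To give an idea of what a MetaServlet call looks like, here is a minimal sketch that only builds the request URL. It assumes the commonly documented form of the interface, in which a JSON command is Base64-encoded and appended to the metaServlet URL as the query string; the host, credentials and task id are placeholders, and this is not the source code of tRunTask.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class MetaServletRequestSketch {

    // Builds the JSON command for the action "runTask".
    // Assumption: user and password contain no characters that need JSON escaping.
    static String runTaskJson(String user, String password, int taskId) {
        return "{\"actionName\":\"runTask\","
                + "\"authUser\":\"" + user + "\","
                + "\"authPass\":\"" + password + "\","
                + "\"mode\":\"synchronous\","
                + "\"taskId\":" + taskId + "}";
    }

    // Appends the Base64-encoded JSON command to the MetaServlet URL.
    static String requestUrl(String tacBaseUrl, String json) {
        String encoded = Base64.getEncoder()
                .encodeToString(json.getBytes(StandardCharsets.UTF_8));
        return tacBaseUrl + "/metaServlet?" + encoded;
    }
}
```

With a base URL such as http://localhost:8080/org.talend.administrator (a placeholder), the resulting URL can be requested with any HTTP client. The control task then evaluates the JSON response to decide whether the step succeeded.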
Template-Based Report Compilation / Creation
In order to create the PDF files, we also use the design tool JasperStudio. The underlying JasperLibrary offers an excellent facility to integrate report generation into the customer’s application.
In this case, we extended Talend's range of functions with custom components that integrate the JasperLibrary.
Picture 5: Creating a PDF Report based on JasperReport Design and tJasperReportExec
We used the custom component tJasperReportExec to create the PDF because it compiles the JasperReport design templates automatically and allows the parameters needed for the report to be taken from the ETL process.
This component also provides other output formats, but in this project PDF was the required format.
Picture 6: Creation of Excel Report
The Excel templates contain formatting information:
- Conditional formatting
- Alternating colours (odd/even row)
- Row height
- Data validation
Processing writes the data and applies the formatting to the filled cells.
Additionally, formulas are filled in.
For this task the following Custom Components are used:
These Excel-related components are based on the great work of the Apache POI project, which aims to provide a Java interface to the Microsoft Office formats, and they do a great job.
Tolerance towards external resources which at times might not be fully available
In this project, interfaces were used which are subject to substantial variations in load.
As a result, requests were temporarily refused. Usually this problem resolved itself after a short waiting period.
The repetition is started selectively, based on a status recorded per job instance. This reduces manual maintenance work to a minimum. Normal breakdowns, which the system handles itself, are not reported to the administrators. Only if a repetition fails does the system generate an alert.
These short-term breakdowns can be traced in the TAC log.
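The retry behaviour can be illustrated with a small sketch. It is an assumption about how such logic might look, not the project's actual implementation: the class RetrySketch, the status names and the alert counter are all hypothetical, and the request is modelled as a simple function that returns true when the external resource accepted the call.

```java
import java.util.function.BooleanSupplier;

final class RetrySketch {

    // Outcome recorded per job instance; the names are assumptions.
    enum Status { OK, OK_AFTER_RETRY, FAILED }

    // Counts alerts that would be sent to the administrators.
    static int alertsRaised = 0;

    // Calls the request up to maxAttempts times and waits between attempts.
    // Refused attempts are only logged; an alert is raised once all attempts failed.
    static Status callWithRetry(BooleanSupplier request, int maxAttempts, long waitMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (request.getAsBoolean()) {
                return attempt == 1 ? Status.OK : Status.OK_AFTER_RETRY;
            }
            System.out.println("attempt " + attempt + " refused (logged only)");
            if (attempt < maxAttempts) {
                pause(waitMillis);
            }
        }
        alertsRaised++;
        System.err.println("ALERT: request failed after " + maxAttempts + " attempts");
        return Status.FAILED;
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Keeping the outcome as a status per job instance makes it possible to repeat only the instances that failed, instead of restarting the whole step.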
Key figures and resources of this project
The components used were partly extended/upgraded for this project and are available for free on Talend Exchange.
For every component used, PDF documentation is linked on the component's detail page on the Exchange web site.
A community edition of JasperStudio is available for free download.