Stripping Websites and Translating Text using Talend and Google Translate API
Recently I developed a proof of concept ETL job to strip product information from a website, translate it, and create a data set comprising the original text and the translated text.
For my example, I chose to strip drug product names out of a French health authority website and translate them to English. This is not a detailed instructional article, just a high-level overview of solving this problem in Talend using third-party Java libraries and external translation services.
A sample URL to perform a search of this website is shown below; the zero value is an offset. The search returns 20 products at a time in a fairly simple HTML table. For this example we will do a keyword search for "Glucose", but we could instead search by marketing authorisation holder or a number of other attributes.
This returns the first 20 products (offset of 0):
This returns the next 20 products (offset now 20):
This is a good fit for Talend, because you can drop in JAR files to add functionality to a workflow. For HTML parsing I am using the excellent jsoup Java library. I added the jsoup JAR file as a code routine in the Talend Routines library and used it in a tJavaFlex component to parse the products in the HTML table, creating a flow row for each product in the list.
For language translation, I am making a REST API call to Google Translate using the Talend tRest component. I needed to set up a Google Cloud account with billing enabled to use the Google Translate API, but for this volume of translation the costs will be in the pennies range.
The rest of the workflow is processed using out-of-the-box Talend components:
- tSetProxy component (x2) to set the corporate proxies for the job (one for HTTP and one for HTTPS, both required to make the Google API calls).
- tForEach component to fabricate a set of URLs using offset values of 0, 20, 40, 60, 80, etc. (see the sample URL above). This offset value is used to construct the URL for a tHttpRequest component.
- tHttpRequest component to make the calls to the health authority website, using URLs fabricated with the different offset values sent from the tForEach component. Each HTTP request writes the HTML response to a numbered file for later parsing by jsoup.
- tJavaFlex component, which is where the jsoup HTML parser is used to split the product list in the HTML into separate rows for passing down the flow. jsoup has a powerful jQuery-style selector syntax that makes it easy to target the HTML nodes you need to process.
- tRest component to make a call to the Google Translate API using my application key. We pass, for translation, the product name which jsoup parsed out of the HTML.
- tExtractJSONFields component to extract the translated product name from the Google Translate API JSON response from the tRest call.
- A few other components, such as tLogRow and tFileOutputExcel, to capture the output, which is effectively just two columns: the product name (in French) and the translated product name courtesy of Google Translate.
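The tSetProxy step has a straightforward plain-Java equivalent: the JVM reads a standard set of system properties for HTTP and HTTPS connections. A minimal sketch is below; the proxy host and port are placeholders, not my real corporate proxy settings.

```java
// Plain-Java equivalent of the two tSetProxy components: the JVM honours
// these standard system properties for HTTP and HTTPS connections.
public class ProxySetup {
    public static void configure(String host, int port) {
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", String.valueOf(port));
        System.setProperty("https.proxyHost", host);
        System.setProperty("https.proxyPort", String.valueOf(port));
    }

    public static void main(String[] args) {
        // "proxy.example.com" and 8080 are placeholder values.
        configure("proxy.example.com", 8080);
        System.out.println(System.getProperty("https.proxyHost"));
    }
}
```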
The end result is scraping all these French product names from the website:
...and we end up with the French product name and an English translated product name in a nice spreadsheet format which we can work with.
And the job isn't that complicated to look at; in fact, there is some redundancy in here (a tJavaRow I probably don't actually need).
The first step is to harvest the HTML out into a set of files. We need to fabricate URLs with different offsets to get the first 20 products, then the next 20 products etc. An offset of 180 is enough to retrieve ~200 product names as a test, so we need to fabricate these URL calls:
I used a tForEach component to loop, generating the offsets 0, 20, 40, etc.
Then the tHttpRequest uses this variable from tForEach to generate a URL with the offset, and writes the response to a file we can process later.
Note the tHttpRequest component uses context variables where the base URL and file locations are maintained:
You can see the result is a set of text files which contain the HTML responses for parsing. The numeric suffix matches the tForEach variable: _0.txt is the response for a zero offset, _20.txt the response for a 20 offset, and so on.
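The tForEach/tHttpRequest combination can be sketched in plain Java. The base URL below is a placeholder (the real one lives in the job's context variables), but the offset arithmetic and the numbered file naming mirror what the job does: offsets 0 through 180 in steps of 20 yield 10 pages of 20 products, or roughly 200 names.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the harvest step: fabricate one search URL per offset and
// derive the numbered file name each response is written to.
public class UrlFabricator {
    // Placeholder for the real base URL held in a Talend context variable.
    static final String BASE_URL =
        "https://example-agency.example/search?keyword=Glucose&offset=";

    public static List<String> buildUrls(int maxOffset, int pageSize) {
        List<String> urls = new ArrayList<>();
        for (int offset = 0; offset <= maxOffset; offset += pageSize) {
            urls.add(BASE_URL + offset);
        }
        return urls;
    }

    // Matches the _0.txt, _20.txt, ... naming of the harvested HTML files.
    public static String fileNameFor(int offset) {
        return "response_" + offset + ".txt";
    }

    public static void main(String[] args) {
        List<String> urls = buildUrls(180, 20);
        System.out.println(urls.size()); // 10 pages of 20 products each
        System.out.println(fileNameFor(20));
    }
}
```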
The bulk of the clever stuff is handled in tJavaFlex which uses the jsoup HTML parsing library to target the table cells which contain the product names:
An HTML text file created earlier in the workflow by tHttpRequest is opened, and then a jQuery-style selector is used to home in on the table cells required.
Elements products = doc.select("td.first > a > strong");
This selector says: return, in a collection, all the <strong> tags that are direct children of an <a> tag, which in turn is a direct child of a table cell <td> with the CSS class 'first' (the > combinator means direct child rather than any descendant).
We can see from the HTML of the website, how the product names are structured, and why this selector works. Each page presents 20 results, so the tJavaFlex generates 20 rows of data for the flow, one for each product.
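A self-contained jsoup sketch of this parsing step is below. The inline HTML string mimics the structure of the result table (the product names are illustrative); the real tJavaFlex reads the harvested _0.txt, _20.txt, ... files instead.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Minimal jsoup demo of the selector used in the tJavaFlex component.
public class ProductParser {
    public static void main(String[] args) {
        // Illustrative HTML matching the shape of the result table.
        String html = "<table>"
            + "<tr><td class='first'><a href='#'><strong>GLUCOSE 5% EXEMPLE</strong></a></td></tr>"
            + "<tr><td class='first'><a href='#'><strong>GLUCOSE 10% EXEMPLE</strong></a></td></tr>"
            + "</table>";
        Document doc = Jsoup.parse(html);
        // Target <strong> inside <a> inside <td class="first">.
        Elements products = doc.select("td.first > a > strong");
        for (Element product : products) {
            System.out.println(product.text()); // one flow row per product
        }
    }
}
```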
We then make a call to the Google Translate API using tRest (note that this is not my real API key in the screenshot). We specify a source language (French) and a target language (English), and we pass in the product name.
Note that I am reading the product name from the globalMap, as I had to turn the flow into an iterate so that I could feed the product name into the tRest component. I am also URL encoding the product name, as unencoded text caused me problems when calling the Google Translate API.
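The request URL the tRest component assembles can be sketched as follows. The endpoint is the v2 REST URL from the Google Translate documentation, and "MY_API_KEY" is a placeholder; the important detail is the URL encoding, which keeps spaces and accented French characters from breaking the call.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Sketch of assembling the Google Translate v2 query string, including
// the URL encoding of the product name.
public class TranslateUrlBuilder {
    public static String buildUrl(String apiKey, String productName)
            throws UnsupportedEncodingException {
        String encoded = URLEncoder.encode(productName, "UTF-8");
        return "https://translation.googleapis.com/language/translate/v2"
            + "?key=" + apiKey      // MY_API_KEY placeholder goes here
            + "&source=fr&target=en"
            + "&q=" + encoded;
    }

    public static void main(String[] args) throws Exception {
        // Accents and spaces are escaped, e.g. "à" becomes "%C3%A0".
        System.out.println(buildUrl("MY_API_KEY", "GLUCOSE 5% solution à diluer"));
    }
}
```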
It is then a case of reading the response from Google Translate and parsing the translated text out of the JSON. For this, we use a tExtractJSONFields component with an XPath query. A link to the Google Translate API documentation is provided here, followed by the configuration of the Talend JSON component.
You can see how this works by looking at the JSON response sample from the Google Translate documentation:
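In the documented v2 response shape, the result is nested under data.translations[0].translatedText, which is what the tExtractJSONFields query pulls out. The dependency-free sketch below does the same extraction on a sample response of that shape; a real job would of course use a proper JSON parser rather than string searching.

```java
// Illustrative extraction of "translatedText" from a Google Translate v2
// response. Assumes the simple documented shape with no escaped quotes.
public class TranslationExtractor {
    public static String extractTranslatedText(String json) {
        String marker = "\"translatedText\":";
        int start = json.indexOf(marker);
        if (start < 0) return null;
        int firstQuote = json.indexOf('"', start + marker.length());
        int lastQuote = json.indexOf('"', firstQuote + 1);
        return json.substring(firstQuote + 1, lastQuote);
    }

    public static void main(String[] args) {
        // Sample response mirroring the documented structure.
        String response = "{\"data\":{\"translations\":"
            + "[{\"translatedText\":\"GLUCOSE 5% solution for infusion\"}]}}";
        System.out.println(extractTranslatedText(response));
    }
}
```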
And that is, at a very high level, how Talend enables you to parse HTML, translate text from one language to another, and produce a nicely formatted data set, despite supporting neither natively.
Disclaimer: All opinions expressed in this article are my own and do not necessarily reflect the position of my employer.