Manipulating HTML using Talend Data Integration

By Richard Hall

Recently, I have been working on migrating the Talend Community from our previous platform to Salesforce. This has been an interesting migration, since it hasn’t simply been a “lift and shift” between environments: as with many engagements I’ve taken part in, there have been a number of transformations that needed to take place to ensure the data fit the new system.

Transformations

One of the transformations required here has been very different from any problem I have dealt with before: how do you change internal links between hundreds of thousands of HTML documents? With many organizations holding semi-structured data in content management systems, blogs, forums, and the like, it occurred to me that others will have experienced this problem, and many will experience it in the future.

This might sound like a trivial question, but when you really analyze it there are quite a few blockers to solving this problem.


Identifying Links

First, you have the issue of identifying links within your HTML. You can use regular expressions, but as soon as you want to change the links you will end up in the minefield of String manipulation inside a document whose structure must be preserved for it to be interpreted correctly. I tried this. I failed. There is a better solution, which I will demonstrate later in this blog.
 

Identifying URL structure and loading order

Once you have solved the problem above, there is also the issue of identifying what your future URLs will be. With some systems, you can predict your URL structure before loading data. Unfortunately, with Salesforce you cannot. So, in order to solve the problem of not knowing what the future URLs would be, I needed to identify a loading order.

  1. The first posts to be loaded were posts that had no internal links.
  2. The next posts loaded were posts that linked only to posts I had previously loaded.
  3. This process went on until all posts were loaded, apart from those that had self-links. This was a bit of a blocker for a while until I realised that I could use a variation on the process I had been using. I ended up loading the self-linked posts, obtaining the new URL, then editing those posts to update them to accommodate the self-links.

This is a massively simplified explanation of the process, but it gives a reasonable abstraction of the considerations required.
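To make the wave-by-wave ordering concrete, here is a rough sketch in Java. The class, method, and data shapes are my own inventions for illustration, not the migration code itself; self-linked posts are simply allowed through (their self-reference is ignored) on the assumption that they are patched up in a later edit pass, as described above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LoadOrderDemo {

    // links: post id -> set of post ids it links to internally (hypothetical shape).
    // Returns the posts grouped into loading "waves": each wave only links to
    // posts loaded in earlier waves (self-links are ignored here and fixed later).
    public static List<List<String>> loadWaves(Map<String, Set<String>> links) {
        List<List<String>> waves = new ArrayList<>();
        Set<String> loaded = new HashSet<>();
        boolean progress = true;
        while (loaded.size() < links.size() && progress) {
            List<String> wave = new ArrayList<>();
            for (Map.Entry<String, Set<String>> e : links.entrySet()) {
                if (loaded.contains(e.getKey())) continue;
                Set<String> targets = new HashSet<>(e.getValue());
                targets.remove(e.getKey()); // self-links handled by a later edit pass
                if (loaded.containsAll(targets)) wave.add(e.getKey());
            }
            progress = !wave.isEmpty();
            loaded.addAll(wave);
            if (progress) waves.add(wave);
        }
        return waves;
    }
}
```

Posts with no internal links come out in the first wave, posts linking only to those come out in the second, and so on.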


Before I could consider attempting the process of identifying the order of posts to load, I needed to obtain important pieces of data that only existed within the semi-structured HTML data. I had no idea if links would exist in a particular document and if so, how many there would be. All I had was the HTML. As a human reading the documents one at a time, this is relatively easy. But how was I going to process this data thoroughly and accurately for hundreds of thousands of posts? By using Talend and a third-party HTML parser called jsoup.
 

Parsing HTML using Talend and third-party Java libraries

One of the massive benefits of Talend (and I know I labour this point in many of my blogs) is that you can make use of third-party Java libraries. This has opened so many doors to my usage of Talend.

In this case, I needed to be able to parse HTML, find any and all links, then potentially change many of them, while migrating hundreds of thousands of HTML documents. I could have used String parsing and manipulation with regular expressions, but it would have been particularly complicated to ensure that I was not going to break the structure of the HTML. While not impossible, it would have been a lot of work. Being able to make use of existing HTML parsing functionality and build it into my data integration job is a far more favourable approach from my point of view. I’m lazy and don’t want to have to reinvent the wheel.

I have explained the problem, so all that is left for me to do (and the reason I imagine most of you are reading this) is to demonstrate what I did, with an example Talend Job.
 

A Talend Job to modify HTML links

The purpose of this Job is to demonstrate the process of finding and changing links within an HTML document. It is essentially what I have done with the migration code I described above, but without the complexity of dealing with the ordering and manipulation of multiple HTML documents.

In this example, we will use a single HTML document and a basic CSV file listing the links to be changed. The output of the Job will be the altered HTML. I will demonstrate each step of the process of building the Job and using the jsoup HTML parser, and I will give you the simple code I’ve built using the parser.
 

The HTML File

The HTML file that will be processed by this Job could literally contain anything. But as an example, I am using a file that contains the following HTML:

<<<Image 1>>

You can copy and paste this into an HTML file or use your own; just keep in mind that the links in the lookup file relate to this HTML. I chose a file containing links to UK TV stations and links to Talend websites, while the lookup file maps the UK TV stations to US TV stations. When the Job runs, the UK TV station links will be changed to US TV station links, leaving the Talend links untouched.
 

The Lookup File

The Lookup File is simply a CSV file mapping each UK TV Station to the US TV Station its link should be converted to. The contents are shown below:

<<<Image 2>>

The jsoup JAR

Go to the jsoup download page and download the latest version of the jsoup JAR.

Note: For this blog I am using version 1.13.1, but it should work with newer versions. If for any reason it does not work, go back to v1.13.1 and the code will work.
 

The HTMLManipulationTools Routine

Before we can build the Job, we will need to put together a routine to make use of the jsoup HTML parser. Below you can see the routine that I have put together:

<<<Image 3>>
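The screenshot above holds the actual routine. As a rough guide to what it does, the two methods the Job relies on (findLinks and replaceLinks) can be sketched along the following lines. Treat this as an approximation built from how the methods are used later in the Job, not as the exact code; in particular, the real routine may return the full document rather than just the body fragment.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.ArrayList;
import java.util.HashMap;

public class HTMLManipulationTools {

    // Return one {href, title} map per anchor found in the HTML
    public static ArrayList<HashMap<String, String>> findLinks(String html) {
        ArrayList<HashMap<String, String>> links = new ArrayList<>();
        for (Element a : Jsoup.parse(html).select("a[href]")) {
            HashMap<String, String> link = new HashMap<>();
            link.put("href", a.attr("href"));
            link.put("title", a.text());
            links.add(link);
        }
        return links;
    }

    // Swap every anchor pointing at oldLink to point at newLink, relabelling it,
    // and return the updated HTML (body fragment only, in this simplified sketch)
    public static String replaceLinks(String html, String oldLink,
                                      String newLink, String label) {
        Document doc = Jsoup.parse(html);
        for (Element a : doc.select("a[href]")) {
            if (a.attr("href").equals(oldLink)) {
                a.attr("href", newLink);
                a.text(label);
            }
        }
        return doc.body().html();
    }
}
```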

1. To use the code above, create a routine with the same name, then copy and paste the code into the routine. Once you have done this, you will need to close the routine, then right-click it and select Edit Routine Libraries as shown below:

<<<Image 4>>

2. Once this has been selected, the following window will appear. Click New to add your jsoup JAR:

<<<Image 5>>

3. Click on the Artifact repository radio button, the Install a new module radio button, then click on the ellipsis button to select your JAR (surrounded by a green box):

<<<Image 6>>

4. This will bring up an explorer window. Find your JAR file and click Open. You will see a screen like the one below:

<<<Image 7>>

5. You can see the path to the JAR is surrounded with a green box in the screenshot above. You should see similar. Click the OK button.

6. You will be taken back to the Import External Library window. Click Finish and you are done. Your routine is ready to use:

<<<Image 8>>

The ReplaceHTMLLinks Job

Below is a screenshot of the ReplaceHTMLLinks Job. I will describe how to put this together below.

<<<Image 9>>

1. HTML Input: This component is a tFileInputRaw component. I am using this to consume an HTML file in one go as a single “row”. This is configured as shown below:

<<<Image 10>>

Make sure the file is read as a String by selecting the Read the file as a string radio button.


2. Find links: This component is a tJavaFlex component. This allows us to use Java code, and this is where we will be extracting each of the links from within the HTML document. I will paste the code from each of the sections of this component below, then explain what is happening.  

Start Code

<<<Image 11>>

  

In the code above, we are creating an ArrayList of HashMaps that will contain details on all the links that are found (title and URL). We are using the findLinks method that has been built in the HTMLManipulationTools routine. The HTML from the input file is analysed by the findLinks method. It is supplied via the tMap component (Find Link Map) using the globalMap with a key of body.

You will see that the beginning of a while loop has been coded. This forces the Main Code to be fired for every iteration of the loop; in other words, once for every link that is returned.


 Main Code

<<<Image 12>>

In the Main Code, we are extracting the HashMap for each element of the ArrayList, and from that we are extracting the href and title values (or the URL and title) for each link. These are being sent to the next component (Find Link Map) via columns called link and link_title. These need to be created for this component.

End Code

 <<<Image 13>>

The code above closes the while loop opened in the Start Code.
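Since the three sections only make sense together, here is a self-contained approximation of what the component does end to end. The simulation scaffolding (the class, the run method, the local findLinks copy) is mine; inside Talend the loop lives directly across the Start, Main, and End sections, and findLinks comes from the HTMLManipulationTools routine.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FindLinksFlexDemo {

    // Stand-in for the routine call made in the tJavaFlex Start Code
    static ArrayList<HashMap<String, String>> findLinks(String html) {
        ArrayList<HashMap<String, String>> links = new ArrayList<>();
        for (Element a : Jsoup.parse(html).select("a[href]")) {
            HashMap<String, String> link = new HashMap<>();
            link.put("href", a.attr("href"));
            link.put("title", a.text());
            links.add(link);
        }
        return links;
    }

    // Emits one "row" of {link, link_title} per link found in the body HTML
    public static List<String[]> run(Map<String, Object> globalMap) {
        List<String[]> rows = new ArrayList<>();
        // Start Code: fetch the HTML stored under "body" and open the loop
        ArrayList<HashMap<String, String>> links =
                findLinks((String) globalMap.get("body"));
        int i = 0;
        while (i < links.size()) {
            // Main Code: extract href and title for the current link
            HashMap<String, String> link = links.get(i);
            rows.add(new String[] { link.get("href"), link.get("title") });
            i++;
        } // End Code: close the loop
        return rows;
    }
}
```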

3. Find Link Map: This component is a tMap component. Its configuration can be seen below:

 <<<Image 14>>

The key things to pay attention to here are in the coloured boxes. In the red box you can see that the Lookup Model is set to Reload at each row. What this does is load all of the records from the lookup for every row passed from the main row. In this case there will be just one row. When that row is supplied by the main input, its content is set to be the value of the body key in the globalMap. You can see this being set in the blue box. This value is then used by the lookup input (the Find Links component) and the links are identified within the body of the HTML supplied by the globalMap. Notice also that the Join Model has been set to Left Outer Join just in case there are no links. All links found are then sent with the HTML (content) to the output, to be handled by the next component.

4. TV Station Changes: This component is a tFileInputDelimited component. It is used to read a CSV file that holds information about links to change within the HTML document being processed. The configuration of this component can be seen below:

 <<<Image 15>>

This component has a schema of three columns (Old, New, and URL) that hold the Old Title, New Title, and New URL. These columns are all String columns and will need to be created for this component using the Edit schema functionality (see the button in the green box). The location of the file being used can be seen in the red box.

5. Lookup Changes: This component is a tMap component. It is used to link the rows representing the links found in the HTML document with the rows from the TV Station Changes component. The configuration of this component can be seen below:

 <<<Image 16>>

The main input comes from the previous tMap (Find Link Map) and the lookup is from the tFileInputDelimited (TV Station Changes). The tables are joined using an Inner Join which means that only rows where the join is fulfilled will be returned. The tables are joined using the link titles found in the HTML and the old link titles in the CSV file. Where a join is found, the link needs to be replaced. The records that are found are sent to the next component.

6. Change Links: This component is a tJavaFlex component. This is where the link replacement takes place. Only links that need to be altered will be passed to this component. Each link change will be passed along with the HTML body. Since we know that the same HTML body will be passed to this component for every row, we have a bit of logic that ensures that the first HTML body record received is altered for every link that needs changing. This is done so that the changes are carried out on the same HTML incrementally across all of the rows. If this were to be carried out across multiple different HTML documents, we would need to make sure that the HTML body that is modified incrementally is updated using some sort of key when all of the links for a particular HTML document have been completed.

Below I show the code that has been used and explain what is happening:

Start Code

<<<Image 17>>

 In the code above, we are simply setting the content variable to an empty String. This will hold the HTML code.

Main Code

<<<Image 18>>

The code above checks to see if the content variable is an empty String. If it is, it sets the value to be the value of the incoming content column (this holds the HTML). The next command calls the replaceLinks method and supplies the HTML, the old link, the new link, and the label. It returns the HTML to the content variable (therefore updating it) with the link passed to it updated.

There is no End Code section populated for this component.
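Stripped of the Talend plumbing, the essential pattern is the one below. Note this is a sketch: the real component calls the replaceLinks method from the routine, and plain String.replace stands in here purely to keep the example self-contained.

```java
public class ChangeLinksDemo {

    private String content = ""; // Start Code: empty String that will hold the HTML

    // Main Code pattern: the first row seeds content with the incoming HTML,
    // then every row (including the first) updates content in place, so all
    // link changes accumulate on the same document
    public String processRow(String incomingContent, String oldLink, String newLink) {
        if (content.isEmpty()) {
            content = incomingContent;
        }
        // Stand-in for HTMLManipulationTools.replaceLinks(...), which does this
        // with jsoup rather than naive String replacement
        content = content.replace(oldLink, newLink);
        return content;
    }
}
```

Because content is only seeded once, the second and subsequent rows ignore their own copy of the HTML body and instead keep modifying the accumulated version.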

7. Return Last Record: This component is a tAggregateRow component. The purpose of this component is to group all records returned by the previous component into one record, and to use the very last row’s HTML body as the HTML body to be returned as the fully altered document. The configuration of this component can be seen below:

<<<Image 19>>

Notice that the Function selected for the content column is last. This ensures that only the last row’s HTML body is returned.

8. Display Results: This component is a tLogRow component. It is used to print the altered HTML to the output window. There is no configuration required for this component.

Running the Job

When this Job is run, it will read the HTML in the source file, find all of the links in the file, compare all of the links that are found with the links to be transformed (found in the lookup file), then it will alter those links and output the converted HTML.

The source HTML for the example I provided can be seen above. If everything has gone to plan, the output window of Talend Studio will contain the following altered HTML:

<<<Image 20>>