Using Talend and MapR to Create a Real-time Recommendation Model
Talend was recently recognized as a certified partner on the MapR Converged Data Platform. This is exciting news not only for Talend and MapR, but also for current and future customers who are looking at Talend and MapR as the solution to their big data challenges. Today we are going to look at how you can implement a real-time recommendation model using Talend’s native integration capabilities with the MapR Converged Data Platform – specifically MapR-DB and MapR Streams.
MapR-DB and MapR Streams + Talend: Getting Started
First let’s look at MapR-DB. As an enterprise-grade, high performance NoSQL database management system, MapR-DB is MapR’s native solution to other NoSQL database systems like Cassandra and MongoDB. But it's so much more. Being native to the MapR Converged Data Platform, MapR-DB can run in the same cluster as Apache Hadoop and Apache Spark. This allows you immediate access to analyze live, interactive data; eliminating data silos and speeding up the data-to-action cycle. Overall, this creates a much more efficient data architecture.
MapR Streams, on the other hand, is a global publish-subscribe event streaming system for big data. In fact, it is the first big data-scale streaming system built into a Converged Data Platform. By providing a Kafka API for real-time producers and consumers as well as out-of-the-box integration with popular stream processing frameworks (such as Spark Streaming, Storm, Flink and Apex), MapR Streams makes data available instantly to stream processing and other applications.
By being a certified partner on the MapR Converged Data Platform, Talend has simplified the interaction with MapR-DB and MapR Streams by providing out-of-the-box components specific to the MapR Converged Data Platform. This allows for native connectivity, precise configuration, and more efficient processing. Let’s see how this looks in a real-life scenario.
Building Smart Recommendation Pipelines with MapR and Talend
As consumers, all of us are very familiar with online shopping and the ability of Internet retailers to make purchase suggestions as we navigate through their digital product catalog. This is a critical real-world application for online retailers and I even wrote about it at a higher level in a prior blog series. Now let’s take a deeper dive into this recommendation pipeline and see how Talend along with MapR-DB and MapR Streams can make it possible for you. The process I will be walking through is available in the new Talend Big Data Sandbox featuring MapR.
In the demo environment, there are a couple of setup jobs that need to be executed to initialize the datasets. In the interest of time, I am not going to go into detail about them now. The first job I do want to review, however, is one that is very simple yet very critical in this process. It creates the MapR Stream that will be used throughout the rest of the demo.
The tMapRStreamsCreateStream component requires a tMapRStreamsConnection component configured to the MapR Cluster. The tMapRStreamsCreateStream component will reference the connection information as the job is executed. As you can see in the configuration of the tMapRStreamsCreateStream, the first parameter to address is the Action parameter.
Here you have the options to create/alter a stream or create/alter a topic. In the logical order of progression, I need to create a stream before I can create a topic.
So, I have set it to “Create stream if not exists”. Once I select the action, I can identify the Stream path and set the Stream permissions.
As you can see, the permissions allow you to tightly manage the use of the stream for added security. Finally, you will notice under the Stream Settings, I have checked the box next to “Enable automatic topic creation”. This is a nice feature to the MapR Streams functionality as it will later allow me to create a topic on-the-fly with the first dataset I write to the stream.
Building Your Recommendation Model
Before I can go any further with my streaming recommendation pipeline, I need to take a quick step back and create the actual recommendation model.
The recommendation model I am using in this demo is the Alternating Least Squares Model. Talend has a specific component for this recommendation model called the tALSModel component. It does all the hard calculations to create my recommendations but requires 3 key pieces of information – a User Id, a Product Id and a Rating. To compile the dataset necessary for the ALS Model, I have a sampling of Clickstream data that I am going to narrow down to just the User Id and Product Id from each clickstream record.
Using just those two pieces of information, I will aggregate based on Product Id to get a count of how many times a specific user clicked on the same product page (Fig. 1).
I will use that count as the "Rating". However, since the tALSModel component only accepts integer inputs and my User Id and Product Id from the Clickstream data are Alpha-numeric, I need to reference MapR-DB lookup tables containing user community data and product catalog data to get the numeric equivalent of each User Id and Product Id respectively (Fig. 2). In this demo environment, these lookup tables were populated by the dataset initialization jobs. In a real-world scenario, they could be the actual tables that store your user community accounts and demographics and your product catalog or PIM respectively.
Having a solid Recommendation Model is the most critical step in the entire process, simply because this one component is responsible for generating the recommendations of additional products for which the user may be interested. And really is the sole reason for the pipeline to begin with. This brings to point a few things:
- First, the importance of quality and reliability in your source data (i.e. Clickstream data, User Demographic data and Product catalogue data) used to generate your Recommendation Model
- Second, the frequency with which the model is updated. As data is ever-changing and evolving, finding an effective schedule to update the Recommendation Model is just as important as having the ability to generate recommendations in the first place.
Now that the Recommendation Model is created and our MapR Stream path is configured we can start accepting streaming data into a topic. In a live environment, you would be able to stream live web clicks from your Clickstream data into a MapR Streams topic and then feed it directly into your recommendation pipeline. However, in this demo environment, we are going to simulate the live clicks using the same subset of Clickstream data used to generate the Recommendation Model.
The key thing to point out in this job is simply the tMapRStreamsOutput component which allows us to not only write to the MapR Stream, but also create the Topic on-the-fly. Notice that in the Topic Name parameter, I have specified the path to the MapR Stream (which was created in an earlier job) and contextualized the Topic name. Because each MapR Stream can host multiple Topics, being able to create a new Topic without having to run a host of other jobs to create a new stream and/or topic allows the flexibility to contextualize a Topic name and just update the context value; thereby repointing the clickstream data to a new Topic within the same stream.
The new Topic could be fed to a separately configured Recommendation Model during promotional periods for example. When the promotional period ends, switch the Topic back and restart the original pipeline.
Finally, with the recommendation model created and Clickstream data flowing into our MapR Streams Topic, we can start the recommendation pipeline. This job will consume the (simulated) Clickstream data flowing into our configured MapR Streams Topic and first parse the streaming payload to generate a schema-based record for further processing. Then, by using a MapR-DB lookup, we can enrich our dataset with user information.
Configuring the MapR-DB lookup in a Spark Streaming job can be a bit tricky. Since we are looking for a specific record based on the user id that is coming from the Clickstream data through our MapR Stream dataset, you will need to configure the tMapRDBLookupInput component to look up against the “user” table containing user demographic information, making sure the columns are mapped to the correct column family. Then, to ensure the specific user is returned in the lookup, Check the “Is By Filter” Checkbox and configure the Filter expression. Choose a Filter Type of “Row Filter” and identify the Filter Column, Filter Family and Filter Operation.
In this case it is the double equal sign (==). Finally, in the Filter Value field, list the incoming value to filter on. Finally, choose “Substring comparator” as the Filter Comparator Type. Once the tMapRDBLookupInput component is configured, open up the tMap used to join the stream dataset to the MapR-DB lookup table and ensure the user id from the input stream is joined to the user id on the lookup table and that the Lookup Model specifies “Reload at each row".
The tWindow component allows the flexibility to break up the streaming data into micro batches by defining a Window Duration and a Slide Duration. These values are based on the Batch Size that is defined in the Spark Configuration of the job and must be multiples of the Batch Size. If you want to learn more about Window Operations in Spark Streaming, click here.
The tRecommend component simply allows us to reference an already generated Recommendation Model and specify the number of recommendations to generate for each identifiable user. What is returned is the list of Product ID’s and Recommendation Scores of each recommended product. The Score indicates a sort of confidence level of the recommended product. This value could be used later in the flow to qualify the recommendation before it would be shown to a customer.
If the score didn’t reach a defined threshold, the recommendation could be discarded (or even written to a separate file/database for further analysis to better tweak the recommendation model). In this flow, however, we are simply removing any null recommendations using a tFilter component.Finally, the job utilizes a lookup to a MapR-DB table to enrich the recommended Product Id’s with their respective detailed product information. As outlined above when the user lookup was configured, a tMapRDBLookupInput component must be configured for the product data to be retrieved.
The configuration is essentially the same with a few noted exceptions. Because the lookup is based on an integer comparison rather than a string comparison, the Filter Type should be “Single Column Value Filter” and the Filter Comparator Type will be “Binary Comparator” to indicate the values being compared are integers. These detailed recommendations can then be written to a separate MapR-DB table for fast-data retrieval and displayed to a user as they navigate through the online retail site (handled by a separate job flow).
What I have shown you through this demo will uncover a literal goldmine (in the data sense of the word). Utilizing MapR and Talend in this scenario, you now have the ability to understand the needs and wants of your customers as they navigate through your online space, and even before they even enter your digital doors. This process used to be the secret to the success of the top online retailers in the world. But with Talend Data Fabric and the MapR Converged Data Platform, you are just clicks away from building an online shopping experience that rivals the best in the business. To see this and other examples first-hand, download the new Talend Big Data Sandbox featuring the MapR Converged Data Platform.