Talend & Couchbase: Jumping into the NoSQL Database World
While the whole world is shifting towards big data, NoSQL has become a crucial technology in the data management industry. Moving and transforming data between traditional and modern systems has likewise become mission critical for data-driven businesses. This data movement could be part of a new data warehouse project, a migration of existing data from a traditional RDBMS to a new NoSQL platform, or the addition of new transformations to existing jobs.
Talend offers a diverse range of components for utilizing big data to suit each data integration purpose. It also provides NoSQL connectivity to leading NoSQL databases like Couchbase, Cassandra, MongoDB, HBase, Neo4J, Apache CouchDB and Riak. Using Talend to manage unstructured data in a NoSQL scenario doesn't require any specialized knowledge of NoSQL databases. In short, Talend is a big umbrella providing many connectors for all kinds of data movement/transformations.
Before we jump into how to use Talend to read from and write to Couchbase Server, let's look at a few basic concepts about NoSQL databases.
What is NoSQL?
NoSQL stands for Not Only SQL. It is a movement towards data stores, such as document stores, that do not use the relational model. The fundamental shift is in the way NoSQL stores data. For example, to store customer details in an RDBMS, you would need to split the information across tables and then use a server-side or reporting language to reassemble it into its original form. In NoSQL, on the other hand, you just store the customer details. NoSQL is schema free, which means you don't need to design your tables and structure up front; you can simply start storing values. In a document store such as Couchbase, the values are stored in documents, and queries across documents can be expressed with MapReduce. MapReduce is used to create a 'view' (similar to a result set), which consists of a subset of the overall data.
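To make the contrast concrete, here is a minimal Python sketch. The field names (customer_id, name, orders) and the sample values are invented for illustration and do not come from any particular system. It stores the same customer once as relational-style rows that have to be joined back together, and once as a single document read with one key lookup.

```python
import json

# Relational style: customer details are split across two "tables"
# and have to be joined back together when read.
customers = [{"customer_id": 1, "name": "John"}]
orders = [
    {"order_id": 10, "customer_id": 1, "item": "laptop"},
    {"order_id": 11, "customer_id": 1, "item": "mouse"},
]


def load_customer_relational(customer_id):
    """Rebuild one customer by joining the two tables on customer_id."""
    customer = next(c for c in customers if c["customer_id"] == customer_id)
    customer_orders = [
        {"order_id": o["order_id"], "item": o["item"]}
        for o in orders
        if o["customer_id"] == customer_id
    ]
    return {**customer, "orders": customer_orders}


# Document style: the whole customer is one JSON value stored under one key.
document_store = {
    "customer::1": json.dumps(
        {
            "customer_id": 1,
            "name": "John",
            "orders": [
                {"order_id": 10, "item": "laptop"},
                {"order_id": 11, "item": "mouse"},
            ],
        }
    )
}


def load_customer_document(key):
    """Read one customer back with a single key lookup; no join is needed."""
    return json.loads(document_store[key])


if __name__ == "__main__":
    # Both paths produce the same customer structure.
    print(load_customer_relational(1) == load_customer_document("customer::1"))
```

The point of the sketch is the read path: the relational version needs a join to rebuild the customer, while the document version returns it as stored.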
What is Couchbase?
Couchbase Server is a NoSQL database. It is designed with a distributed architecture for performance, scalability, and availability. It enables developers to build applications more easily and quickly by combining the power of SQL with the flexibility of JSON.
Talend & Couchbase Server
Talend enables you to manage and transform data between Couchbase Server, a NoSQL document database, and any other relational or big data system. This integration also allows you to efficiently build richer reports and analytics on the data stored in Couchbase, utilizing the power of Couchbase’s pre-computed indexes and aggregates.
Talend's connectors use a drag-and-drop interface, which makes it very easy to work with Couchbase. They allow data to be transformed into a schemaless JSON document format, which means that you can move all your data from an RDBMS to JSON format seamlessly. That's good news for all your data migration projects!
What Components Can You Use?
Talend offers the following components to work with Couchbase Server.
- tCouchbaseConnection: This component opens a connection to a Couchbase bucket so that transactions can be made, and lets other components reuse that connection.
- tCouchbaseInput: This component queries documents from the Couchbase database. It fetches documents either by their unique key or through Views.
- tCouchbaseOutput: This component acts on the JSON or binary documents stored in the Couchbase database, based on incoming flat data from a file, a database table, etc. It inserts, updates, upserts or deletes documents, which are stored as Key/Value pairs where the Value can be JSON or binary data.
- tCouchbaseClose: This component closes the connection to the Couchbase bucket once all transactions are done, in order to guarantee the integrity of the transactions.
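The four write actions differ only in how they treat a key that already exists. The sketch below models them with a plain in-memory dictionary. `InMemoryBucket` is a hypothetical stand-in, not the Couchbase SDK or Talend's generated code, and the error behaviour of the real components may differ.

```python
class InMemoryBucket:
    """Hypothetical stand-in for a bucket: maps string keys to values."""

    def __init__(self):
        self._docs = {}

    def insert(self, key, value):
        # Insert only succeeds when the key is new.
        if key in self._docs:
            raise KeyError(f"document {key!r} already exists")
        self._docs[key] = value

    def update(self, key, value):
        # Update only succeeds when the key already exists.
        if key not in self._docs:
            raise KeyError(f"document {key!r} not found")
        self._docs[key] = value

    def upsert(self, key, value):
        # Upsert writes the value whether or not the key exists.
        self._docs[key] = value

    def delete(self, key):
        # Delete removes the document; a missing key raises KeyError.
        del self._docs[key]

    def get(self, key):
        # Returns None when the key is absent.
        return self._docs.get(key)


if __name__ == "__main__":
    bucket = InMemoryBucket()
    bucket.insert("john", {"feedback": "Good"})
    bucket.upsert("john", {"feedback": "Great"})  # overwrites the existing document
    bucket.upsert("123", {"feedback": "Okay"})    # creates a new document
    bucket.delete("123")
    print(bucket.get("john"), bucket.get("123"))
```

Upsert is usually the most forgiving choice for repeatable loads, since rerunning the same job neither fails on existing keys nor skips new ones.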
How Does it Work?
Talend's input and output Couchbase connectors allow you to manage and transform your data. To bring data from other data sources into Couchbase, the tCouchbaseOutput connector takes incoming data streams and transforms them into JSON documents before they are stored in Couchbase. When importing, you can define which data fields are transformed into JSON attributes. Similarly, to export data from Couchbase to other data sources, the tCouchbaseInput connector uses the schema mapping specified by the user to read JSON documents and transform them into the target data formats. You have the flexibility to define which attributes in your JSON document are exported and transformed. For this blog, I have created two simple jobs; more complex scenarios can be tackled with Talend as well.
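The mapping in both directions can be sketched in a few lines of Python. This only illustrates the idea and is not Talend's generated code; the helper names `row_to_document` and `document_to_row`, along with the sample fields and values, are hypothetical.

```python
import json


def row_to_document(row, json_fields):
    """Import direction: keep only the chosen fields and serialize them as JSON."""
    return json.dumps({field: row[field] for field in json_fields})


def document_to_row(document, columns):
    """Export direction: pick the mapped attributes out of a JSON document.

    Attributes missing from the document come back as None.
    """
    data = json.loads(document)
    return tuple(data.get(column) for column in columns)


if __name__ == "__main__":
    # A flat incoming row; internal_flag is deliberately left out of the document.
    row = {
        "customer_id": "john",
        "customer_name": "John",
        "feedback": "Great product",
        "internal_flag": 1,
    }
    doc = row_to_document(row, ["customer_id", "customer_name", "feedback"])
    print(doc)
    # Export only two of the stored attributes to a flat target row.
    print(document_to_row(doc, ["customer_id", "feedback"]))
```

Keeping the field list explicit in both directions mirrors the schema mapping described above: nothing is imported or exported unless it is named.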
Couchbase and Talend Jobs: Create a Document
The first job reads a .txt file which has unstructured data. The job creates a document with the data read. The input file consists of feedback from the customer and the customer_id is not of a single data type. It consists of characters, numbers and special characters as shown below. In the traditional approach, we would have started with creating a surrogate key (which will be a primary key) for the customer_id. However, with Couchbase, we could store this as-is.
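As a rough illustration of why no surrogate key is needed, the sketch below treats every customer_id as an opaque string and uses it directly as the document key. The pipe-delimited layout, the sample IDs and the feedback text are all assumptions made for the example; the actual input file is not reproduced here.

```python
import json

# Assumed layout: customer_id|customer_name|feedback (hypothetical sample data).
SAMPLE_LINES = [
    "1001|Alice|Delivery was quick",
    "john|John|Great product",
    "c#42|Carol|Support could be better",
]


def build_documents(lines):
    """Turn delimited lines into a {document_key: json_string} mapping."""
    documents = {}
    for line in lines:
        # Split on the first two delimiters only, so feedback may contain "|".
        customer_id, customer_name, feedback = line.split("|", 2)
        # The ID is used as-is as the key, whatever characters it contains.
        documents[customer_id] = json.dumps(
            {
                "customer_id": customer_id,
                "customer_name": customer_name,
                "feedback": feedback,
            }
        )
    return documents


if __name__ == "__main__":
    for key, value in build_documents(SAMPLE_LINES).items():
        print(key, "->", value)
```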
The overall job would look like the image given below. tCouchbaseConnection opens a connection to the Couchbase server. Once the connection is established, the input file is read and a few transformations are applied in the tMap component, after which the data is written to a document in Couchbase Server.
For this example job, the default bucket is used and the tCouchbaseOutput settings look like this:
Note that the JSON configuration is very important as this would define the way your document would be stored. In the example, the JSON configuration is similar to what is shown below.
Once the job runs successfully, you can log in to Couchbase and check that the document has been created.
Couchbase and Talend Jobs: Read a Document
This job reads the document created by our previous job. There are two ways of reading documents.
- Using the key: the ID of a document stored in the Couchbase database. In our example, it could be either 123,6534672 or john.
- Using the views: Select this check box to retrieve document information according to the Map/Reduce functions and other settings. The schema here has three predefined fields, Id, Key, and Value: Id holds the document ID, Key holds the information specified by the key of the Map function, and Value holds the information specified by the value of the Map function.
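The two read paths can be mimicked in plain Python. In Couchbase itself, a view's Map function is written in JavaScript and runs on the server; the sketch below only imitates the shape of the result (rows with Id, Key and Value). The `build_view` helper, the example map function and the sample documents are hypothetical.

```python
def get_by_key(bucket, doc_id):
    """Key lookup: fetch one document directly by its ID (None if absent)."""
    return bucket.get(doc_id)


def build_view(bucket, map_fn):
    """Apply a map function to every document and collect Id/Key/Value rows.

    map_fn receives (doc_id, doc) and yields zero or more (key, value) pairs,
    loosely mirroring emit(key, value) in a view's Map function.
    """
    rows = []
    for doc_id, doc in bucket.items():
        for key, value in map_fn(doc_id, doc):
            rows.append({"Id": doc_id, "Key": key, "Value": value})
    # Return the rows ordered by the emitted key.
    return sorted(rows, key=lambda row: row["Key"])


def feedback_by_customer(doc_id, doc):
    """Example map function: emit the customer name as key, feedback as value."""
    if "feedback" in doc:
        yield doc["customer_name"], doc["feedback"]


if __name__ == "__main__":
    # A plain dict stands in for the bucket in this sketch.
    bucket = {
        "john": {"customer_name": "John", "feedback": "Great product"},
        "1001": {"customer_name": "Alice", "feedback": "Delivery was quick"},
        "misc": {"note": "no feedback field, so the map function skips it"},
    }
    print(get_by_key(bucket, "john"))
    for row in build_view(bucket, feedback_by_customer):
        print(row)
```

A key lookup returns exactly one document, while a view returns one row per emitted pair, so documents the map function skips never appear in the result.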
The job given below reads the document using the key. The key must be specified in the settings.
The tCouchbaseInput settings in the job are given below.
Once the job runs successfully, the output according to the filter given in the settings would be displayed in the console.
The next job reads the document using the views.
Change the settings in tCouchbaseInput as shown below. Here I am creating a view 'customer_view' with the customer_id, customer_name and feedback columns.
Save and run the job. Go to the Couchbase console and check that the view and the result set have been created. This view can then be used in other jobs or for ad-hoc queries.
I hope this blog is useful when working with Talend and Couchbase. As always, feedback, questions, and comments are welcome below! Happy connecting.