SaaS Data Migration & Data Integration
With the continued growth in Cloud computing, more and more organizations are moving their data to Software as a Service (SaaS) providers such as Salesforce.
If you're about to embark on a Data Migration or Data Integration project, and you're used to working with a traditional relational database that sits in your own data center, you may be in for a few surprises.
I have worked with Salesforce and several other SaaS providers over a number of years, on both Data Migration and Data Integration projects, and I thought I'd share some of the lessons that I have learned.
When you're working with SaaS, you'll usually be talking through an API that is provided by the vendor. This API typically uses REST or SOAP.
Talend provides excellent support for Salesforce and many other SaaS vendors with out-of-the-box components. If Talend does not directly support your vendor, then you can do it yourself using components such as tREST and tSOAP. If you're feeling really adventurous, you could write your own custom components and then make them available on the Talend Exchange for the benefit of others.
The Fragility of the Internet
Whilst the technology underlying the Internet may be robust, the reality is that your connection may sometimes be less reliable than you would hope for, and bandwidth is always under pressure. You may feel less confident pushing 100M rows to your SaaS provider than you would to an Oracle database that is hosted in your organization’s own data center. With this in mind, you will want to give careful consideration to the techniques that you use and how you would recover should you encounter a mid-point failure.
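One common way of coping with a connection that drops now and again is to retry the failed call after a growing delay. The following is a minimal sketch in Python rather than a Talend job; the function names, the number of attempts and the delays are all assumptions made for illustration, and `flaky_send` simply stands in for a call to a vendor API.

```python
import time


def call_with_retry(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation(); on ConnectionError, wait and then try again.

    The delay doubles after each failed attempt (1s, 2s, 4s, ...).
    The final failure is re-raised so the caller can decide how to recover.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Hypothetical stand-in for a vendor API call: fails twice, then succeeds.
attempts = []


def flaky_send():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("connection dropped")
    return "accepted"


# Record the delays instead of really sleeping, so the demo runs instantly.
delays = []
result = call_with_retry(flaky_send, sleep=delays.append)
```

Retrying only helps with short-lived failures. If the final attempt fails as well, you still need a plan for picking up where you left off.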
If you’re using Salesforce, for example, they provide both Standard and Bulk APIs, so try to make sure that you use the correct API, depending on your data volume and your requirement. Get a good understanding of the APIs that your own vendor provides before investing in the wrong approach.
Try to ensure that unnecessary obstacles do not get in your way. If your organization uses Proxy Servers to access the Internet, try to avoid these as they can often interfere with your activity. There should, hopefully, not be a strong case for your interfaces to be forced to use these servers as they’re really intended for managing employee interactive Internet access rather than the mission-critical interface between your CRM and Finance systems.
You will, of course, always want to make sure that you are sending your data across the Internet in a secure manner. If FTP plays a part in your interface, make sure that you are using Secure FTP and not sending your highly valued data in plain text for all to see.
A Step Back in Time
Sometimes, working with SaaS may seem like a step back in time.
You are not your SaaS provider's only customer. They have other organizations to whom they need to provide a good service, and they will want to make sure that they can always deliver what they have promised. This may mean that they place limitations on your own activity.
Salesforce, for example, imposes limits on their Bulk API.
If you're bulk-loading data into Salesforce, you are typically limited to 5,000 batches in a rolling 24-hour period. Each batch is limited to a maximum of 10,000 rows and 10,000,000 characters. This may seem a lot, but if you’re working on a large Data Migration, you will soon use it up. Also remember that you may not be the only person within your organization using the Bulk API.
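To see how quickly the allowance goes, it helps to do the arithmetic. The sketch below uses the limits quoted above; the function names are my own, and the splitter is a simplified illustration rather than anything that Salesforce or Talend provides.

```python
MAX_ROWS_PER_BATCH = 10_000        # rows per batch, as quoted above
MAX_CHARS_PER_BATCH = 10_000_000   # characters per batch, as quoted above
MAX_BATCHES_PER_24H = 5_000        # batches per rolling 24 hours, as quoted above


def batches_needed(total_rows, rows_per_batch=MAX_ROWS_PER_BATCH):
    """Smallest number of batches that can hold total_rows (ceiling division)."""
    return (total_rows + rows_per_batch - 1) // rows_per_batch


def allowances_needed(total_rows, rows_per_batch=MAX_ROWS_PER_BATCH):
    """How many full 24-hour batch allowances the load would consume."""
    batches = batches_needed(total_rows, rows_per_batch)
    return (batches + MAX_BATCHES_PER_24H - 1) // MAX_BATCHES_PER_24H


def split_into_batches(lines, max_rows=MAX_ROWS_PER_BATCH,
                       max_chars=MAX_CHARS_PER_BATCH):
    """Group CSV lines into batches that respect both limits.

    Only the characters in each line are counted; a header row and line
    terminators would need to be allowed for in a real load. A single line
    longer than max_chars still ends up in a batch of its own.
    """
    batch, chars = [], 0
    for line in lines:
        if batch and (len(batch) >= max_rows or chars + len(line) > max_chars):
            yield batch
            batch, chars = [], 0
        batch.append(line)
        chars += len(line)
    if batch:
        yield batch


print(batches_needed(100_000_000))     # 10000
print(allowances_needed(100_000_000))  # 2
```

So a 100M row load needs at least 10,000 full batches, which is two complete 24-hour allowances before anybody else in your organization has submitted a single batch. Wide rows that hit the character limit first will push the batch count higher still.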
It is unlikely that you are going to be able to stage and manipulate your data with your SaaS provider. You will usually need to prepare your data and get it right before sending it up for insert, update or delete.
You may not have complete freedom over how your data is indexed.
You may not have low-level transaction control. You may not be able to wrap your own control data within the same transaction as your user data. This will affect how you recover from failure and (manually) roll back if needed.
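If your control data cannot share a transaction with your user data, one option is to keep it on your own side and update it only after the vendor has accepted each chunk. The following Python sketch shows the idea; `send_chunk` and the checkpoint file are hypothetical, and the approach assumes that sending the same chunk twice is harmless (an upsert, for example), because a crash between the send and the checkpoint write will cause that one chunk to be sent again.

```python
import json
import os


def load_with_checkpoint(chunks, send_chunk, checkpoint_path):
    """Send chunks in order, recording progress after each success.

    On a rerun, chunks already recorded in the checkpoint file are skipped.
    """
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)["chunks_done"]

    for index, chunk in enumerate(chunks):
        if index < done:
            continue  # confirmed on a previous run
        send_chunk(chunk)  # raises on failure, leaving the checkpoint untouched
        with open(checkpoint_path, "w") as f:
            json.dump({"chunks_done": index + 1}, f)
```

Rerunning the same job with the same chunks, in the same order, then resumes at the first chunk that was not confirmed. The checkpoint tells you where to restart, but it does not undo anything; rolling back what has already been accepted remains a manual exercise.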
- Create your own Developer account at https://developer.salesforce.com/ so that you can try things out. You'll have access to all of the Salesforce features, a luxury that you will be unlikely to have within the environments of your own organization.
- Make use of Salesforce External Ids. These provide source-to-target traceability and will help you manage your Inserts, Updates and Deletes more effectively.
- Use the Bulk API when reading or writing high-volume data and ensure that you optimize your batch sizes. Make use of the parallel load option where appropriate.
- Consider how you will recover from failure.
- Salesforce objects generally have the Created By Id (CreatedById) as well as the Create, Modification and System timestamps (CreatedDate, LastModifiedDate and SystemModstamp) that you would usually expect to see. Use them to manage your loading, extraction and recovery.
- As with any Talend parameters, externalize your Salesforce connection parameters. Remember that you need to use a Security Token as well as your Salesforce password. Use tSalesforceConnection to establish your connections and remember that you need separate connections for Standard and Bulk API requests.
- Make sure that you understand the character encoding of both Salesforce and the system that you are interfacing with; make the correct conversions as needed.
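As an illustration of using the system timestamps for extraction, the sketch below builds a SOQL query that picks up only the rows changed since a previous run. It is plain Python rather than a Talend job; the function name, the Account example and the watermark value are assumptions, and SystemModstamp is the Salesforce field behind the System timestamp mentioned above.

```python
from datetime import datetime, timezone


def build_incremental_query(sobject, fields, since):
    """Build a SOQL query for rows changed after the `since` watermark.

    `since` should be a timezone-aware datetime. SOQL datetime literals
    are written unquoted in ISO 8601 form, here normalized to UTC.
    """
    watermark = since.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    columns = ", ".join(list(fields) + ["SystemModstamp"])
    return (
        f"SELECT {columns} FROM {sobject} "
        f"WHERE SystemModstamp > {watermark} "
        f"ORDER BY SystemModstamp"
    )


# Hypothetical watermark saved at the end of the previous extraction.
last_run = datetime(2015, 6, 1, 12, 0, 0, tzinfo=timezone.utc)
query = build_incremental_query("Account", ["Id", "Name"], last_run)
print(query)
```

After each successful run, you would store the highest SystemModstamp that you extracted and use it as the watermark for the next run. The same value also gives you a natural restart point if an extraction fails part of the way through.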
About Alan Black
Alan has been working with data integration tools for over 15 years, usually as part of a data warehousing project, data migration, or integrating with cloud services. Alan was first introduced to Talend when he was working for a client who had purchased the Enterprise version of Talend and was looking for a consultant with domain knowledge and substantial data integration experience.
In his day-to-day work, Alan mostly uses Talend Enterprise Data Integration. He also uses Talend Open Studio for Data Integration on other projects that he is involved with. Alan also has experience in writing custom components.
Find out more and contact Alan here: www.yellowpelican.co.uk/