An Informatica PowerCenter Developers’ Guide to Talend – III

 

Thanks to all the dedicated and avid readers who have been following my journey of transitioning from Informatica PowerCenter to Talend. If you are new here and haven’t read my previous posts, you can start with “Part – 1” and “Part – 2”. The first two parts of this series provided an overview of the architectural differences between PowerCenter and Talend and what goes on under the hood when it comes to processing the data. They also provided a primer on Talend Studio – our unified development environment – and a mapping between some of the most popular PowerCenter transformations and their equivalent Talend components.

Based on the feedback I have received so far, the third part of this series will focus on SDLC; scalability and high availability; and parallelization. Each of these is a big topic in its own right, and my goal is not to drill down into the details. But it is my sincere hope that after reading this blog, you will be able to appreciate how Talend’s architecture allows you to automate, deploy, scale and parallelize your data integration platform.

Software Development Life Cycle a.k.a. SDLC

SDLC is a software engineering process that divides the software development work into distinct phases to improve design, product management and project management. There are various SDLC methodologies that you will be familiar with as a software developer – waterfall, iterative, prototyping, and most recently agile. Once you pick your SDLC process as an organization, the next step is to automate as much of it as possible. The goal is to get the software into the hands of QA testers, or of end users in the production environment, as soon as possible after the developer has completed his or her work. Related concepts like build automation, continuous integration (the ability for multiple developers to integrate their code with others’ more frequently) and version control systems help with the management of this automation. The aim is to design a regular, continuous build and deployment process followed by automated end-to-end testing to verify the current code base.

PowerCenter provides a proprietary version control system, but it falls short in the areas of automated testing and continuous integration and deployment. Talend provides native integration with industry-standard version control systems like Git and SVN. If your organization has already invested in Git/SVN, you can use that single repository to store the code for your data integration jobs as well as other artifacts. Your developers also don’t need to learn another version control system just for the data integration jobs they build.

Talend also provides complete support for continuous integration and deployment, which makes the entire SDLC process much more efficient. You can read all about it here – Talend Software Development Life Cycle Best Practices Guide.

Scalability

One of the key considerations in selecting a data integration platform is its ability to scale. As the number of sources or targets increases, or the volume of data being processed grows exponentially, the architecture of the platform should accommodate that growth without much overhead in the administration of these systems.

If you look at the simplified versions of the PowerCenter and Talend architectures (Figure 1), you will notice the similarities. PowerCenter’s Integration Service performs the heavy-lifting integration operations that move data from sources to targets. The Job Server in the Talend architecture performs this same function. It is this component that needs to scale with an increasing number of sources, targets and data volumes.

Scalability can be achieved in multiple ways (Figure 2):

  • Vertical Scaling – replacing the current server on which the Job Server runs with a bigger and more robust server with more processors and more RAM, or
  • Horizontal Scaling – adding more similar servers that each run one or more Job Servers, or
  • Some combination of the two above

The pros and cons of the above-mentioned scaling approaches are discussed extensively on the Internet, and you should spend time researching and understanding them. You need to pick the right option based on your requirements and budget.

Talend also provides the option of grouping a set of physical servers into a “virtual server” (Figure 2). A virtual server is a group of physical servers from which the best-rated server is automatically preferred at job execution time. Once you assign an execution task to a virtual server, Talend determines the best physical server to execute the task and sends the request there. The decision on which server to pick is based on a rating system that leverages information on the CPU/RAM/disk usage of each of these physical servers.
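To make the idea concrete, here is a minimal plain-Java sketch of picking the least-loaded server from a pool based on CPU, RAM and disk usage. The class names, metrics and weights are illustrative assumptions for this blog, not Talend’s actual rating algorithm.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of "pick the best-rated server" in a virtual server pool.
// The metrics, weights and names are assumptions, not Talend's internal implementation.
public class VirtualServerExample {

    static class ServerStats {
        final String host;
        final double cpuUsedPct;   // 0-100
        final double ramUsedPct;   // 0-100
        final double diskUsedPct;  // 0-100

        ServerStats(String host, double cpu, double ram, double disk) {
            this.host = host;
            this.cpuUsedPct = cpu;
            this.ramUsedPct = ram;
            this.diskUsedPct = disk;
        }

        // Lower score = less loaded = better candidate for the next execution task.
        double loadScore() {
            return 0.5 * cpuUsedPct + 0.3 * ramUsedPct + 0.2 * diskUsedPct;
        }
    }

    static ServerStats pickBestServer(List<ServerStats> pool) {
        return pool.stream()
                   .min(Comparator.comparingDouble(ServerStats::loadScore))
                   .orElseThrow(() -> new IllegalStateException("empty pool"));
    }

    public static void main(String[] args) {
        List<ServerStats> pool = List.of(
                new ServerStats("jobserver-1", 80, 60, 40),
                new ServerStats("jobserver-2", 30, 50, 70),
                new ServerStats("jobserver-3", 55, 45, 20));
        System.out.println("Run task on: " + pickBestServer(pool).host);
    }
}
```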

If you have additional grid computing requirements, you should consider using our Big Data Platform that leverages Hadoop for grid, clustering and high availability.

Parallelization

Parallelization can be implemented to (i) meet high throughput requirements, and/or (ii) troubleshoot processing bottlenecks. There are several mechanisms to enable code parallelization in Talend and they fall into one of two categories:

1. Process Flow Parallelization – the ability to run multiple jobs/subjobs in parallel

2. Data Flow Parallelization – the ability to break down a set of rows within a subjob into smaller sets – each of which can be processed by a separate thread or process.

Let’s look at the options to parallelize process flows –

1. Execution Plan – multiple jobs/tasks can be configured to run in parallel from the TAC. Each task in the plan can run on its own Job Server.

2. Multiple Job Flows – once you enable “Multi-Thread Execution” in your job, all subjobs run in parallel (conceptually, each subjob runs on its own thread – see the sketch after this list).

3. tParallelize component – this is very similar to the previous option, but the difference is that the parallel execution is orchestrated by a component rather than a job-wide setting. The key advantage of this approach is that it gives you control over which parts of your job execute in parallel.

4. Parent/Child Jobs – you can use the tRunJob component to call a child job. This child job can run on its own JVM if the “Use an independent process to run subjob” option is selected.
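To give a feel for what process-flow parallelization means under the hood, here is a hedged plain-Java sketch that runs several independent “subjobs” on a thread pool and waits for them all to finish. This is a conceptual illustration only; in Talend you get this behavior by enabling Multi-Thread Execution or by using tParallelize, not by writing this code yourself, and the subjob names here are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Conceptual illustration only: running independent "subjobs" in parallel threads,
// similar in spirit to Talend's Multi-Thread Execution / tParallelize orchestration.
public class ParallelSubjobs {

    static void runSubjob(String name) {
        // Placeholder for real work (e.g. load one dimension table).
        System.out.println(name + " running on " + Thread.currentThread().getName());
    }

    public static void main(String[] args) throws Exception {
        List<String> subjobs = List.of("load_customers", "load_products", "load_orders");
        ExecutorService pool = Executors.newFixedThreadPool(subjobs.size());

        // Submit every subjob; they start immediately and run concurrently.
        List<Future<?>> futures = new ArrayList<>();
        for (String s : subjobs) {
            futures.add(pool.submit(() -> runSubjob(s)));
        }

        // Wait for all subjobs to finish before the "job" ends.
        for (Future<?> f : futures) {
            f.get();
        }
        pool.shutdown();
        System.out.println("All subjobs finished");
    }
}
```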

You can achieve parallelization of data flows in a couple of different ways –

1. Auto Parallel – you can right-click on a component and choose “Set Parallelization”. This prompts you to select the number of threads and (optionally) a key hash for partitioning the data.

2. Manual Parallel – you can use the Talend components tPartitioner, tCollector, tDepartitioner and tRecollector to achieve the same thing as Auto Parallel above (a conceptual sketch of this partition/process/collect pattern follows this list).
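As a rough mental model of partitioned data flows, the sketch below hash-partitions a set of rows by key, processes each partition on its own thread, and then merges the results – roughly the roles played by tPartitioner/tCollector and tDepartitioner/tRecollector. The row format and partitioning logic are assumptions for illustration, not Talend’s generated code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Conceptual illustration of data-flow parallelization: hash-partition rows by key,
// process each partition on its own thread, then gather the results.
public class PartitionedFlow {

    public static void main(String[] args) throws Exception {
        List<String> rows = List.of("alice|FR", "bob|DE", "carol|US", "dave|FR", "erin|DE");
        int partitions = 3;

        // 1. Partition: assign each row to a bucket based on a key hash (here, the country code).
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < partitions; i++) buckets.add(new ArrayList<>());
        for (String row : rows) {
            String key = row.split("\\|")[1];
            buckets.get(Math.floorMod(key.hashCode(), partitions)).add(row);
        }

        // 2. Process each partition in parallel (the "worker" here just uppercases the row).
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        List<Future<List<String>>> results = new ArrayList<>();
        for (List<String> bucket : buckets) {
            results.add(pool.submit(() -> {
                List<String> out = new ArrayList<>();
                for (String row : bucket) out.add(row.toUpperCase());
                return out;
            }));
        }

        // 3. Collect: merge the partition outputs back into a single flow.
        List<String> merged = new ArrayList<>();
        for (Future<List<String>> f : results) merged.addAll(f.get());
        pool.shutdown();
        System.out.println(merged);
    }
}
```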

If you want to read up on the details of all these options, you can find them in Talend Help or ask your CSM for a link to an expert session recording that takes a deep dive into parallelization.

For PowerCenter developers moving to Talend, here’s a guide that maps PowerCenter parallelization options to Talend.

Conclusion

That concludes this blog series on my journey from PowerCenter to Talend. It has been a very positive experience, and I have come to appreciate the benefits of embracing a modern data integration paradigm. I hope you had as much fun reading it as I had putting it together. I would love to hear from you – so feel free to leave your thoughts and comments below.


Successful Methodologies with Talend

 

If you are familiar with my previous Talend blogs, perhaps you’ve noticed that I like to talk about building better solutions through design patterns and best practices. My blogs also tend to be a bit long. Yet you read them; many thanks for that! This blog is going to focus on methodologies as they apply to Talend solutions. My hope is that by the end of it we all agree that any successful Talend project should use proper methodologies, or face the proverbial project crash and burn.

A stool needs 3 legs upon which it can stand, right?  I believe that any software development project is like that stool.  It too needs 3 legs, which are defined as:

  • USE CASE – Well-defined business data/workflow requirements of a solution
  • TECHNOLOGY – The tools with which we craft, deploy, and run the solution
  • METHODOLOGY – A given way to do things that everyone agrees with

Software projects tend to focus only on the use case and the technology stack, often ignoring methodologies. This, in my opinion, is a mistake and leads to painful results, or a lack of results altogether. When planning for successful Talend projects, incorporating best practices for Job Design Patterns (JDP&BP) and Data Model Designs (DMD&BP) is a fundamental objective. These disciplines fall into the methodology leg of our proverbial stool. The foundational precepts discussed in my JDP&BP blogs suggest that consistency is perhaps the most important precept to apply. The reason is simple: even if you are doing something the wrong way, or let’s be nicer, not the best way, doing it consistently makes it much easier to correct. I like that. Successful methodologies should, therefore, encapsulate consistency.

3 Pillars of Success

Let’s take a closer look. Each leg of the stool is a pillar for success. Each pillar holds up its share of the load and has a purpose to serve in the bigger picture. Software is not magic, and understanding its evolution and life cycle from inception to deprecation is fundamental to any successful implementation. Hardly any serious developer ignores this, yet all too often they succumb to business pressures, and in most cases it is the methodology that suffers the most. As methodologies are ignored, hidden long-term costs skyrocket. The worst-case scenario is a complete project failure; the best case is lower quality, loss of competitive edge, and/or extended delivery dates.

So let’s get serious about this.  Here is a starter outline to consider:

Modify these as needed; we all have different yet often similar requirements.  Let’s continue.

What is a Successful Methodology?

Any successful methodology is one that is adopted and put into practice; nothing more, nothing less.  When a software project team applies a methodology it then knows how it plans to achieve the desired results.  Without a properly defined and adopted methodology, what we get is analogous to the ‘wild-wild west’, or ‘shoot-then-aim’ scenarios.  Neither fosters success in any software project, IMHO!  We must have that 3rd leg of the stool. 

When we talk about methodologies we can easily cover a lot of different topics. Essentially, however, this discussion should focus initially on planning for the design, development, release, deployment, and maintenance of a software project. Commonly known as the ‘Software Development Life Cycle’ or SDLC, this should be the first step of many in setting up, managing, coordinating, and measuring the required resources, significant milestones, and actual deliverables. Many of you readers already know the two most common SDLC methodologies; I’ve listed them below, along with a third, hybrid approach that I’ll introduce shortly:

  • WATERFALL – Strict, prescribed steps (phases) that scope an entire software result
  • AGILE – Flexible, iterative approach (sprints) that optimizes early results
  • JEP+ – A hybrid, ‘best of’ approach encapsulating all and only what matters

Without an SDLC methodology in place, I submit that any perceived savings from not doing this work (and it is work) are all too often dwarfed by the cost of misguided, misconfigured, mistaken results that miss the mark. How much do software blunders cost? I wonder…  trust me, it’s a lot! So, choose an SDLC; choose one that is suitable. Implement it. This is a no-brainer! Agile appears to have become the de facto standard across the software industry, so start there. While the essential tasks within both methods are similar, it’s their approach that is different. This blog is not going to dive in deep on these, but let’s cover two indispensable areas.

Waterfall SDLC:

Originating in the manufacturing and construction industries, the Waterfall SDLC method prescribes a sequential, non-iterative process which flows downstream like a waterfall. Starting with an idea or concept, all requirements are gathered, analyzed, and clearly specified; a design is conceived, diagramed, and extensively vetted; code development ensues (sometimes for what seems like an eternity), followed by testing/debugging, production rollout, and follow-on maintenance of the code base.

The real benefit of this method is that everyone knows exactly what they will get at the end of the project (supposedly). The downside is that it can take a long time before any results can be seen or used. This can become a costly problem if any of the upstream steps failed to capture the true intent of the initial concept. The further upstream the misstep occurs, the costlier it is to correct. Well, at least we can often get beautiful documentation!

 

Agile SDLC:

Principles in Agile SDLC focus on short development tasks and prompt delivery of their functionality to users. Promoting a highly iterative process, this adaptive method reacts flexibly to an evolving sequence of events: continuous improvement and addition of features, delivering value incrementally over time.

The clear advantage of Agile over Waterfall is that it supports cross-functional teams being involved in the process end-to-end, fostering better communication and the ability to make changes along the way. The letdown is that, even with best practices in place to guard against it, feature creep can become a real irritation.

Oh, and don’t count on pretty documentation either.

Don’t get me wrong, Agile is definitely, in most cases, the right way to go. In fact, don’t stop there. Agile does not envelop all aspects of broader SDLC best practices. Take, for instance, data – or better yet, big data. For our purposes, we are building software that pushes data around in one form or another. If you’ve followed my previous blogs, then you know I am talking about DDLC, the ‘Database Development Life Cycle’. Read more in my two-part series (link above) on Data Model Design & Best Practices.

Introducing JEP+ – Just Enough Process

My years in the software industry have resulted in many successful projects and, yes, painful failures. What I learned was that the SDLC/DDLC process is critical and necessary, as it allows you to measure quality and thus support better business decisions for project/product releases. I also learned that too much process gets in the way and can bog down creative deliverables, reducing their value and increasing their cost.

Over the past 20 years or so, my brother Drew Anderson and I have fashioned a hybrid SDLC process we call JEP+, or Just-Enough-Process-Plus, which means having the right amount of process in place ‘plus’ a little bit more to ensure that there is enough. Essentially an Agile/Scrum-based method, it incorporates the Capability Maturity Model, Waterfall, and ISO 9000 methodologies where beneficial. JEP+ is where DDLC has its roots.

When we presented our methods to customers, they almost always asked: how can we buy this? In the end, it was our competitive advantage, and it led to us getting hired for many consulting projects. Drew still does this work today! Many have recommended we write a book. Scott Ambler – Mr. Agile – suggested the same. We may do it one day.

Job Design Patterns in Talend

There are many elements, or processes, in any successful methodology. For Talend developers, a key element we should discuss is Job Design Patterns. This allows me to expand upon some hints I gave in my JDP&BP series. Presuming we have a clearly defined use case and have nailed down the technology reference architecture, before we really dive into writing code we should have a solid design pattern in mind. Design patterns provide Talend developers with a prescribed approach to building a successful job. Let’s look at some useful design patterns:

 

This is not an exhaustive list, yet I hope it gets you thinking about the possibilities.

Conclusion

Talend is a versatile technology that, coupled with sound methodologies, can deliver cost-effective, process-efficient and highly productive business data solutions. SDLC processes and Job Design Patterns represent important segments of successful methodologies. In my next blog I plan to augment these with additional segments you may find helpful.

 

Till next time…


Six Top Technology Trends to Watch in 2018

 

Usually, predictions for a new year are brimming with optimism for a future yet unfurled. But 2018 is a roaring exception, given it’s the grand finale of decisions made back in 2016. Yes, perhaps we should have seen much of this coming, but few did. Hold onto your hats, because 2018 is almost upon us and it’s going to be a rough ride. The good news is that we’ll likely all be better off on the other side for having gone through these ‘data growing pains.’

So, what lies ahead that snuck up on us from behind?

At least one global company will be fined millions due to GDPR non-compliance.

In today’s global, digital economy, companies are collecting more data than ever on their customers. That data is becoming more diverse and complex, stemming from different sources and in different formats. The creation and exchange of data has also increased significantly as Bring-your-own-data (the new BYOD) and enterprise collaboration software have grown to become a mainstay in the modern workplace. 

But while we were all busy collecting that data and figuring out the best way to use it to our company’s advantage, the countdown to the EU’s General Data Protection Regulation (GDPR) effective date has been quickly ticking by. It’s a heavyweight data privacy law that takes effect on May 25, 2018, and the fines for non-compliance are hefty. Yes, 2018 is the year we all realize something huge happened in 2016 – April 14, 2016, to be exact. That was the day the GDPR was finally approved by the EU Parliament.

Compliance is as much a data management issue as it is a regulation and security issue. Companies are well advised to complete deployments on all three fronts as soon as possible. Even so, one or more giant companies will stumble big this year and get hit with mega-million-dollar fines.

In 2018, ethical lines will be drawn detailing data morality (a.k.a. data virtue).

Data privacy grows up next year and becomes a much bigger and more pressing issue around the world. Consumers are giving companies more information than they’re even aware of with every purchase and search. In 2018, data and the ‘morality/virtue’ of using that data will come to a crossroads. Organizations collect massive amounts of information on their customers, and while the EU is aggressively moving forward with privacy regulations like GDPR, there is still ‘a lot of grey area’ when it comes to the ethical implications of third parties gaining this amount of information on their clients.

GDPR has already demonstrated that governments can regulate companies’ possession of data, but can they also put controls on how companies use that data? Regulators and companies alike will be grappling with this question. Legally defining the line between helpful and hurtful use of consumer data will become the quest of the year.

Most companies will seek to address these issues proactively, rather than be subjected to perhaps excessive and even contradicting regulations among the various governments around the world. As a result, expect data morality to become a hot topic.

Machine Learning (ML) explodes and then backlashes.

2017 saw AI hype explode, with companies like Amazon, Apple, Google, Microsoft and more pledging to embed AI in everything. However, intelligence is only as good as its data. While 2018 will usher in organizations refining their AI, machine learning and deep learning algorithms to leverage company and third-party data to improve the broader customer experience, only three percent will be working with reasonably accurate data. Unless companies get a handle on their data to ensure 100 percent accuracy, ML and AI algorithms could be learning from flawed data, resulting in inaccurate analytics and erroneous predictions, leading to poor business decisions. That in turn will fuel a consumer backlash against companies and their machines, as anger grows over incorrect outcomes and inappropriate results, in addition to lost jobs and wages resulting from general automation.

Companies will be busted for promotional fake news.

By now we’ve all heard of fake news as it applies to news outlets, but this next wave is a bit different. This is fake information distributed to promote goods and services beyond what they actually deliver, in an effort to increase sales. It’s marketing gone bad. This is not a new or emerging trend, as it has been around for a while, but 2018 will be the year companies finally get punished for the practice. Expect brand loyalties to plummet and consumer suspicions to escalate. Tech companies will face serious financial penalties for not removing fake news or banned content. It will be imperative for companies to implement a data strategy for maintaining credibility with the public and with the companies leveraging their sites.

Social media data companies will be increasingly regulated.

Social platforms are now viewed as the “new media” and therefore have a social responsibility to manage their public output. First, regulators will try to combat nation-state election manipulation and other fraudulent and fake news by imposing the same regulation on social media as is currently applied to news media. But that will be replaced with laws that require responsibility and accountability for clean and accurate data from all social platforms. The more data these platforms have, the harder it is for them to discern what’s real and what’s fake. In 2018, it will be imperative for social media companies to implement a data strategy to maintain credibility among the public and the companies leveraging their sites.

Companies will need an audio/visual data strategy to survive in 2018 and beyond.

Gartner predicted that “by 2021, early adopter brands that redesign their websites to support visual and voice search will increase digital commerce revenue by 30 percent.” As such, organizations will need to develop strategies to collect, cleanse and analyze audio/visual data. There will no longer be a distinction between structured and unstructured data. Companies will have to be able to ingest all types of data, no matter the file format, clean it, qualify it and use it responsibly.

If you detected a theme of data responsibility and accountability in these 2018 predictions, you’re right! Along with this push in all things data related, there will also be an upsurge in new data veracity products and services that incorporate entire communities and/or data supply chains. However, as 2018 unfolds, the odds are good that everyone will be better off – organizations and consumers alike – for finally having established data privacy and morality standards that will keep us all honest and accountable.


Disaster Recovery 101: 3 Strategies to Consider

 

When a disaster strikes and takes down the IT systems that are essential to operations, the IT team is often called on to enable a quick recovery. A disaster recovery plan (DRP) can help get systems back online quickly and efficiently. This plan documents the procedures required to recover critical data and IT infrastructure after an outage.

But is a “disaster” plan different from a normal recovery plan? Surprisingly, many IT teams rely on a normal backup plan and treat it as their “disaster recovery plan,” which is the wrong approach. In this blog, I want to explain the differences and the types of disaster recovery approaches your team can take to prepare for everything from power failures to broader-scale issues.

Disaster Recovery Planning vs. Backup Planning

A “disaster” is categorized when the facility where your infrastructure is hosted is no longer operational. The reasons for failure could be small local issues like fires or power and utility failures. However, disasters can also include even broader scale problems like floods, tornadoes, storms, hurricanes, and civil disturbances which can have an impact on a regional level.  A disaster recovery plan (DRP) includes procedures required to recover data, system functionality and IT infrastructure after an outage with minimum resources.

A disaster completely shuts down operations in an area, rendering any backup plan associated with that area non-functional. The limitations caused by the events mentioned above could result in buildings, equipment and IT systems being unusable. Without public utilities like water, electricity, heating and cooling, life halts, and no system can work at its full capacity. Communication channels, particularly in the IT sector, are the backbone of day-to-day work. Disasters often cause widespread outages in communications, either because of direct damage to infrastructure or sudden spikes in usage related to the disaster.

In some cases, it’s actually mandatory to have a disaster recovery plan due to compliance regulations, although it is good to have a disaster recovery plan in place for every system where possible. When starting to create your plan, begin by building a “Disaster Recovery Project Team”. This team should consist of experts from both the business and technical sides. They should decide which business processes have a critical impact on the organization and what losses may happen if they go down. In addition to resource and critical-failure planning, an additional plan should be put in place defining how teams will communicate during a disaster in the absence of infrastructure.

Now that you’ve got your team and resources planned, let’s consider some recovery strategies.

3 Recovery Strategies to Consider

There are different strategies that can be adopted in any disaster recovery plan. In this blog, I won’t go into full detail on all of them. However, I want to touch on a few strategies for IT infrastructure recovery in disaster situations.

Cold Backup

“Cold Backup” is an IT infrastructure recovery strategy where all necessary data is kept safe at different locations according to the disaster recovery plan. The whole system is not in a state where it can be simply started with the flip of a switch, but needs to be recovered piece-by-piece. Everything from installation to data recovery will need to be done to bring services in an operating state. Normally, this doesn’t require any license as there are no working pieces that are in operation. Cold backup is the least expensive recovery strategy but requires the most time to get systems up, running and serving.

Warm Backup

“Warm Backup” represents a setup where a reasonable hardware infrastructure and software installation are already available. The environment will simply need data from the latest backup to start serving. This setup does require a license (mostly non-production license) as the system is ready to serve but it’s not actively participating. It’s a more expensive setup than a cold backup setup, but requires far less time to get up and start serving.

Hot Backup

This is the most expensive disaster recovery setup. It consists of matching the same hardware and software modules as your original system, and it remains as up-to-date as the original setup. It may have access to the same data, which is replicated to disaster recovery sites, or it may receive it on a regular basis. At some organizations, this setup is also used as a geographic load balancer. It does require a production license if it is serving, but that varies from vendor to vendor.

Conclusion

Anyone and everyone can reasonably say that “there is no chance of an earthquake in my area”. I hope and wish that this never happens, but IT teams need to be ready. Major disaster events may not happen at all, but you may encounter small-scale events quite frequently, like fires, power outages and bad weather. If you do not have the right plan in place, believe me, that will cost a fortune in terms of business losses due to downtime. At Talend, we recommend taking disaster recovery seriously. Talend products support different disaster recovery strategies like the ones mentioned earlier (hot, warm and cold). My recommendation is to act before it’s too late. The right plan will improve business processes, minimize disruption and give you an edge over competitors.

Resources: IT Disaster Recovery Planning For Dummies® by Peter Gregory and Philip Jan Rothstein


NetSuite and Talend: Integrating with Cloud ERP Systems

 

NetSuite is a provider of cloud-based Financials/ERP and omnichannel commerce solutions. Integration platform providers like Talend help with data integration, data migration and automation for NetSuite. NetSuite’s web services enable integration using Java, .NET or any other development language that supports SOAP-based web services. The framework provides comprehensive error handling and security functions, with support for authentication, authorization, access control, session management and encryption.

Talend has built special connectors for NetSuite which automate a lot of the backend work needed to integrate with NetSuite. People with language skills in Java and .NET can often start building applications that work with NetSuite with relative ease. However, teams without these skills need not worry if they have an integration platform like Talend in their toolset. Talend provides 3 basic methods for working with NetSuite:

  • Using NetSuite components
  • Using NetSuite OpenAir SOAP API calls & the tXMLMap component
  • Using JDBC connection adapters

In this blog, I’d like to go in-depth on the NetSuite components’ functionality and how to automate operations like establishing connections, reading data, insertions, modifications, and deletions. Talend Studio helps perform these functions using 3 components, namely tNetsuiteConnection, tNetsuiteInput and tNetsuiteOutput.

Talend & NetSuite Connectors

Let’s look at the configurations of these components and a few sample data integration job designs for inserts and updates below.

NetSuite Components

tNetsuiteConnection: This creates a connection to the NetSuite SOAP server so that other NetSuite components in the job can reuse the connection. You can select your preferred version of the API, along with the other information necessary for authentication.

tNetsuiteInput: Invokes the NetSuite SOAP service and retrieves data according to the conditions you specify. The nice thing about the Input component is its search functionality, which enables you to filter records from NetSuite instead of listing everything for a record type. The record type browse functionality gives a listing of all entities, like Customer and Purchase Order, including customizations, in real time.

The data retrieved from tNetsuiteInput is a pipe-separated list of columns. It’s important to note that the sublist or picklist items from NetSuite are retained as JSON. Thus, you’ll need to use the tExtractJSONFields component to parse the content and process it accordingly in Talend.
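To illustrate the kind of parsing that tExtractJSONFields handles for you, here is a hedged plain-Java sketch using the Jackson library to flatten an address sublist. The JSON structure and field names (addressbook, addressbookAddress, addr1, city, zip) are assumptions based on typical NetSuite address records, so verify them against your own record metadata.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Hedged sketch: parsing a JSON sublist returned by tNetsuiteInput using Jackson.
// The field names below are illustrative assumptions; check your NetSuite record metadata.
public class ParseAddressSublist {

    public static void main(String[] args) throws Exception {
        String addressbookListJson =
            "{\"addressbook\":[{\"defaultShipping\":true,"
          + "\"addressbookAddress\":{\"addr1\":\"1 Main St\",\"city\":\"Paris\",\"zip\":\"75001\"}}]}";

        ObjectMapper mapper = new ObjectMapper();
        JsonNode root = mapper.readTree(addressbookListJson);

        // Walk the sublist and flatten each address into columns, much like tExtractJSONFields would.
        for (JsonNode entry : root.path("addressbook")) {
            JsonNode addr = entry.path("addressbookAddress");
            System.out.printf("%s | %s | %s%n",
                    addr.path("addr1").asText(),
                    addr.path("city").asText(),
                    addr.path("zip").asText());
        }
    }
}
```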

tNetsuiteOutput: Invokes the NetSuite SOAP service and inserts, updates, upserts, or removes data on the NetSuite SOAP server.

Before creating a new entity in NetSuite, you’ll need to understand the concepts of ExternalId and InternalId.

The ExternalId is used to store a unique key from source systems, or a self-generated key, which is later required when updating records in NetSuite.

The InternalId is generated by NetSuite; hence, when you create a new entry in NetSuite using the tNetsuiteOutput component, it returns the newly created InternalId (and the ExternalId, if one was passed earlier).

The output from tNetsuiteOutput can be stored in a repository of files for processing and can be useful for performing updates based on InternalId or ExternalId. One important thing to note here is that the ExternalId is only available for parent entities like “Customer” and not in the sublist addressbookList, so if you want to update the sublist at a later point in time, the InternalId needs to be used. A best practice, therefore, is to store the InternalId of each new sublist item after creation so it can be passed into tNetsuiteOutput later during update or upsert operations.

For creating the JSON sublist or picklist payload string fields, you can use the Talend component tWriteJSONField. The tWriteJSONField component provides many options for mapping a flat file structure, or similar data rows from databases, to the JSON format. When configuring a JSON tree, the default element type is string. If an element is not a string, you need to add an attribute to the element to set its type; the type can be set to integer, double, array or object. Please refer to the help documents for all the options available.
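For orientation, here is a hedged plain-Java sketch (again using Jackson) of building the kind of addressbookList JSON string that gets mapped into the column sent to tNetsuiteOutput. In a real job, tWriteJSONField generates this from your mapped columns, and the exact element names depend on your NetSuite record and customizations, so treat the names below as illustrative assumptions.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ArrayNode;
import com.fasterxml.jackson.databind.node.ObjectNode;

// Hedged sketch of constructing an addressbookList payload as a JSON string.
// Element names are assumptions for illustration; tWriteJSONField normally builds this for you.
public class BuildAddressSublist {

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        ObjectNode address = mapper.createObjectNode();
        address.put("addr1", "1 Main St");
        address.put("city", "Paris");
        address.put("zip", "75001");

        ObjectNode entry = mapper.createObjectNode();
        entry.put("defaultShipping", true);        // non-string values must be typed explicitly
        entry.set("addressbookAddress", address);

        ArrayNode addressbook = mapper.createArrayNode();
        addressbook.add(entry);

        ObjectNode addressbookList = mapper.createObjectNode();
        addressbookList.set("addressbook", addressbook);

        // This string would be mapped into the AddressbookList column sent to tNetsuiteOutput.
        System.out.println(mapper.writeValueAsString(addressbookList));
    }
}
```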

Examples of insert and update operations using the NetSuite components are shown below.

Insert Operations:

In the Job design below, we can see how to construct the fields needed for a submission to NetSuite. The Subsidiary and AddressbookList fields are string data types containing JSON payloads. The ExternalId for a customer record is sent during the web service call and the output returned contains the newly created InternalId. This InternalId or ExternalId can be used for subsequent updates if needed.

Update & Upserts:

For updates or upserts, the payload sent to the web service needs to have the ExternalId or InternalId populated for parent entities like Customers. For sublist items like AddressBookList, the InternalId field needs to be populated with the respective InternalId for each address item created during the add operation. If no InternalId is provided during an upsert, as shown below, the row is treated as a new record and a new entry gets created in NetSuite. You can also see below that a picklist value can be modified with a simple change of value, for example, from “_india” to “_germany”.

Before Updating:

After Updating:

Conclusion:

With just three NetSuite components and a few JSON parsing components in Talend, you can easily start working with a NetSuite WSDL-based web service. Using Talend, you can be up and running quickly for new implementations and during migrations with NetSuite.


What is the Future for SQL Developers in a Machine Learning World?

 

When I graduated from college in the late 1990s, it was just in time to enjoy the Y2K crisis. If you remember those fun times, then you are old enough to enjoy this blog.  I graduated with a Management Information Systems (MIS) degree, which is a cross between Computer Science (CS) and Business Management, and although I was stronger in CS than Business Management, I survived.  There was a class spanning both disciplines that I partially excelled in called Database Theory, which taught the basics of Relational Database Management Systems (RDBMS).  We learned everything from proper table structures, Primary Keys and Foreign Keys, to basic modeling techniques. It is also where we first heard of the term SQL “sequel” (or Squirrel as some people think it is pronounced). 

SQL stands for Structured Query Language and is supported by a set of standards, although they seem to be implemented slightly differently by every database vendor. Even though SQL is always a little different depending on whether you are using MySQL, Oracle, DB2 or whatever vendor tool you have, if you are good at writing SQL and know the database model, you can adapt quickly to get whatever data you need.

In my career, I have spent about 14 years in various integration roles, almost always using some type of RDBMS system as my source and targets. I excelled at building different data models to support reporting, data marts and operational data stores (ODSs). All these data models were supporting operations, financial consolidation, and other diverse business needs. I became VERY, VERY good at writing complex and efficient SQL throughout my career in IT.  

Today, I still enjoy trying out different systems and databases that all claim SQL support in some form or another. For example, I recently gained my Data Vault 2.0 Certification, for which I built a Data Vault for our corporate needs using Snowflake, a cloud data warehouse with full ANSI SQL support. Happily, I have not lost my skills.

But the question that all this is leading up to is: Can someone like myself still find a place in this world of new platforms and processing?   

To SQL or Not to SQL

The database paradigm has changed. There is now NoSQL, Document Databases, Columnar Databases, Graph Databases, Hadoop, Spark, and many other Massively Parallel Processing (MPP) platforms popping up daily. They all provide great benefits for many different use cases that just don’t work well with traditional RDBMSs.  

Big data platforms provide a way to process more diverse data faster than we could have thought of in 1999, when most IT professionals had to know SQL to meet business needs. Today, you need to know many more platforms and environments to take advantage of all the capabilities and benefits that Big Data vendors are promising.  Can those of us who have depended on SQL compliant systems survive, or do we need to learn Scala, Python, R, Java, or whatever the next cool language and platform needs?  

There are the saving graces of tools like Hive and Impala that allow you to use your SQL skills to find and access data on Hadoop platforms and data lakes, but tools like Hive come with restrictions. You can only apply so many functions to the data – the defined functions that SQL has always supported. Of course, you can use User Defined Functions, but then you get into programming quickly.
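For those curious what “getting into programming” looks like, here is a minimal example of a classic Hive user-defined function written in Java. It assumes the hive-exec library is on the classpath and uses the older UDF base class – the simplest, if somewhat dated, way to extend HiveQL with your own function; the masking logic is just an illustration.

```java
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Minimal classic Hive UDF: masks all but the last four characters of a string.
// After packaging it into a jar and adding the jar to the Hive session, register it with:
//   CREATE TEMPORARY FUNCTION mask_tail AS 'MaskTailUDF';
public class MaskTailUDF extends UDF {

    public Text evaluate(Text input) {
        if (input == null) {
            return null;
        }
        String s = input.toString();
        if (s.length() <= 4) {
            return new Text(s);
        }
        StringBuilder masked = new StringBuilder();
        for (int i = 0; i < s.length() - 4; i++) {
            masked.append('*');
        }
        masked.append(s.substring(s.length() - 4));
        return new Text(masked.toString());
    }
}
```

Once registered, you would call it like any built-in function, e.g. SELECT mask_tail(account_number) FROM accounts.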

Is Python in Your Future?

Where SQL-supported systems fall short is when you start applying the latest machine learning methods on your data or when you want to take advantage of huge volumes of streaming data and query data in motion. Yes, for those of us who love RDBMSs, data is not always at rest.

Times have changed and so must our skills.  I personally have started to learn Python as it is an easier language to use than Java, and many machine learning methods are supported in Python. In 5 to 10 years, every information worker or IT support person will have to know how to use machine learning, or at least how to support it.  You will have to support MPP systems like Hadoop and Spark in some form for data processing. Machine learning will be key to support data-driven decision making and to get the competitive insights required to win your market.

SQL is Dead, Long Live SQL

There is still a very strong need for SQL, as I see methodologies such as data vaults evolve and become widely popular in the NoSQL and HDFS/storage spaces. There will always be structured systems, e.g. ERP and CRM systems, that will need structured data warehouses. You can count on that not going away. But when your CxO comes to you to predict the future and its business implications, or to understand highly automated optimization solutions that get smarter over time, you may want to stop looking in the same old usual places. You may have to start looking at all the data available in its most natural forms (unstructured or semi-structured) and find ways to become more predictive, prescriptive and even cognitive. So, while SQL and RDBMSs, like the mainframe, will exist for many, many years, the tide is shifting towards tools for real-time analytics, where SQL currently falls short!


8 Key Takeaways from the MDM & Data Governance Summit

 

A few weeks ago, I had the great opportunity to attend the MDM & Data Governance Summit held in NYC. The summit was packed with information, trends, best practices and research from the MDM Gartner Institute. Much of this information is something you don’t find in webinars or white papers on the internet. Speakers came from many different industries and brought different perspectives to data management concepts.

I also got to meet IT and business executives who are at different stages of their MDM and Data Governance journeys. As I reflect on the conference, I wanted to share some of my key highlights and takeaways from the event as we all start to prepare our IT and data strategies for 2018:

MDM & Data Governance Summit: 8 Key Takeaways:

MDM Main Drivers

The top 5 drivers of MDM are: achieving synergies for cross-sell, compliance, customer satisfaction, system integration, and economies of scale for M&A. A new driver has also emerged recently: digital transformation – and MDM is at the core of digital transformation.

IT & Business Partnerships are More Important Than Ever

If there was one thing that everyone at the summit agreed upon, it was that the partnership between business and IT directly impacts the success of MDM and Data Governance programs more than any other factor. For one banking company, this happened naturally, as the business also understood the power of data and the issues related to it. But for most, the experience points to this partnership being an uphill battle to get buy-in from the business. Tying the project to a specific business objective is critical in these scenarios. The bottom line is that a solid partnership between business and IT will provide the right foundation for an MDM program.

Managing Change is Critical

It’s widely accepted that any MDM journey is long, and it takes energy and perseverance. One company mitigated this by starting with the domain they thought would have the most impact.

Data Governance Council

Only about 20% of the audience had some form of data governance council, but all the case studies presented had a data governance council in place. The council was made up of both business and IT teams. There is no real pattern from the organizational structure perspective: an insurance company with a hugely successful MDM implementation has its Enterprise Information Management team as part of the Compliance team, while another financial company had the team reporting to the COO. So it depends on how your company is organized and does business.

GDPR

This topic was everywhere. 50% of the audience, when polled, said they were impacted by this regulation. But it looks like a lot of companies still lag far behind in preparing their enterprise data for compliance. This is a serious issue, as there are fewer than 150 days left to get ready. One of the speakers said that MDM is the heart of a GDPR implementation.

Next Generation MDM

‘Data as a service’ is something every company should aim for in the next 2-3 years. Also, bringing in social media and unstructured data will be key to gaining actionable insights from MDM initiatives. Large enterprises have moved beyond CDI & PIM to focus on relationships and hierarchies. Cloud MDM will be in demand, but there is potential for creating more data silos as integration becomes a challenge.

Big Data Lakes

There are just too many technologies in the Big Data space, so solution architecture becomes key when building a data lake. A common implementation was to load the data from legacy systems into Hadoop without any transformation. But without metadata, the lake quickly becomes a swamp. So, to get true value from Big Data analytics, MDM and Data Governance have to be effective and sustainable. From a technology perspective, there also needs to be sound integration with big data systems. My company, Talend, has been at the forefront of Big Data integration, providing a unified platform for MDM, DQ, ESB and Data Integration.

Quotes

Finally, I want to end this blog with some great quotes from the speakers:

“Digital Transformation requires Information Excellence.”

“If you don’t know where you are, a map won’t help.”

“Big Data + Data Governance = Big Opportunity”

“Data is a precious thing and will last longer than the systems themselves.”

“There is no operational excellence without Data excellence.”

“A shared solution is the best solution.”

“People and processes are more critical than technology.”

“Rules before tools.”

“Master data is the heart of applications & architecture.”

 “There is no AI without IA (Information Agenda).”

As you prepare for MDM and Data Governance initiatives in 2018, I hope some of my takeaways will spark new ideas for you on how to have a successful journey to MDM.

 


5 Predictions About the Future of Machine Learning

 

Machine Learning is currently one of the hottest topics in IT. The reason stems from the seemingly unlimited use cases where machine learning can play a role, from fraud detection to self-driving cars, and from identifying your ‘gold card’ customers to price prediction.

But what is the future for this fascinating field? Where is it going? What will be the next big thing? Where will we be in ten years’ time? The truth is, whatever the next great leap is will likely be a surprise to us all, but having experience helping customers all over the world in this area, I want to make five predictions about areas and use cases where I believe machine learning will play a role:

  1. Quantum Computing

Machine-learning tasks involve problems such as manipulating and classifying large numbers of vectors in high-dimensional spaces. The classical algorithms we currently use for solving such problems take time. Quantum computers will likely be very good at manipulating high-dimensional vectors in large tensor product spaces. It is likely that the development of both supervised and unsupervised quantum machine learning algorithms will make it possible to handle exponentially more vectors and dimensions than classical algorithms can. This will likely result in a massive increase in the speed at which machine learning algorithms run.

  2. Better Unsupervised Algorithms

Unsupervised learning occurs when no labels are given to the learning algorithm; it is left on its own to find structure in the input data. Unsupervised learning can be a goal in itself, such as discovering hidden patterns in data, or a means towards an end, often called feature learning. It is likely that advances in building smarter unsupervised learning algorithms will lead to faster and more accurate outcomes.

  3. Collaborative Learning

Collaborative learning is about utilizing different computational entities so that they collaborate in order to produce better learning results than they would have achieved on their own. An example of this would be utilizing the nodes of an IoT sensor network, or what is called edge analytics. With the growth of the IoT, it is likely that large numbers of separate entities will be utilized to learn collaboratively in many ways.

  4. Deeper Personalization

Personalization can be great, but it can also be equally annoying. We have all experienced recommendations that seem to bear no actual relation to anything that we may actually be interested in. In the future, users will likely receive more precise recommendations and adverts will become both more effective and less inaccurate. The user experience will vastly improve for all.

  5. Cognitive Services

This technology includes toolkits such as APIs and services, through which developers can create more discoverable and intelligent applications. Machine learning APIs will allow developers to introduce intelligent features such as emotion detection; speech, facial and vision recognition; and language and speech understanding into their applications. The future of this field will be the introduction of deeply personalized computing experiences for all.

These are things I think can and should happen in machine learning’s bright future, but it is equally likely that the introduction of some new, unknown disruptive technology will result in a future none of us would ever have predicted.

For more information on Machine Learning, including an introduction to the topic and overview of the Talend Machine Learning components, check out my Talend Expert Session recordings here.


Getting Ready For GDPR: 5 Key Takeaways from Data 2020 EMEA

 

Over the past year, I’ve had dozens of discussions with customers, partners, and thought leaders on the challenges and opportunities they face in achieving GDPR compliance.

Since September, I’ve seen an undeniable uptick in the number of companies focused on the “how” rather than the “why” or the “what” of GDPR. Interest has spread across countries, from the UK to Germany, the Nordics to Spain, Italy to France and the Benelux. But the most insightful discussions I’ve heard on GDPR so far took place in Stockholm, at the #data2020 event in September.

The good news is that the interactive panel session dedicated to this topic has been recorded and is now publicly available. I had the privilege to participate in this session moderated by Patrick Eckemo, a Swedish IT strategist, together with Johan Wisenborn, who heads Data Privacy Country Operations at Novartis, and Richard Hogg, Global GDPR evangelist at IBM.

You can access this session below. And here are my five takeaways.

  1. GDPR is very broad, but it is just the beginning of a bigger focus on data governance

Johan Wisenborn of Novartis highlighted the fact that the headcount dedicated to data privacy in his legal department grew from 3 to 40 people in only two years. He also noted that, although GDPR clearly sets the highest standard in terms of regulations, he was confronted with a growing number of regulations for data privacy and sovereignty around the world, from Japan to Australia, China to Canada, and India to South Africa.

The panel also made it clear that the stakes go well beyond regulatory compliance. In this data-driven world, trust has become the new currency. Now that insights and innovations depend on big data, there’s no option but to have total control of your data; otherwise, your customers won’t buy in. Only organizations that have nurtured a trusted relationship with their employees and customers will be able to reap the benefits of personal data and drive the latest innovations, such as precision medicine.

  2. It all starts with accountability

As the panelists noted in the video, most of the privacy rules that come with GDPR were already expressed in earlier regulations, but the principle of accountability makes it a game-changer. GDPR is much more explicit about the requirements for an organization to define internal responsibilities, implement the measures and platforms for enforcing privacy rules, and demonstrate compliance with the GDPR principles. As a result, defining responsibilities should be considered a prerequisite to kicking off a GDPR project.

Once a Data Protection Officer has been named, the organization can assess the risks. A recent Data IQ survey on GDPR shows that more than half of organizations, including those that rate themselves as being at a very early stage of GDPR compliance, have now nominated their Data Protection Officer. Then the focus can shift from the “why” to the “what” and the “how”, while accountability becomes widespread across the organization by getting C-level attention and educating the workforce to understand GDPR and the engagement needed for compliance.

  3. Start with the foundations

The panelists highlighted Article 30 of GDPR as a top priority. This article states that “each controller and, where applicable, the controller’s representative, shall maintain a record of processing activities under its responsibility”. Bringing clarity to how you process Privacy Impact Assessments, together with mapping your personal data across your organization, should also be considered an early step in your project. It won’t make you fully compliant, but it sets the foundations for your GDPR program and drives it onto the right tracks. You might not need tools at first to achieve this, but keep in mind that you are setting the foundations and that your personal data landscape will constantly evolve over time.

  4. Get the resources, make the case, define your priorities

GDPR is regularly compared to Y2K or the Euro changeover, as it puts new requirements on legacy systems. But contrary to those one-shot exercises, data privacy is a journey that won’t be over by May 25th, 2018. It requires a staged approach that starts with a forensic gap analysis and data assessment, followed by a management plan and role assignments. Then the real project starts: establishing and operationalizing the needed controls and stewardship activities, measuring results, and tracking gaps and potential improvements. This is much more than a one-time tick-box exercise that fades away as time goes by. Privacy is a big thing in the digital era, so be prepared to see the bar get higher over time as customer expectations increase and regulations related to data sovereignty burgeon across countries.

  5. Invest in content and the rights of the data subject

GDPR is not just another regulation. It is about putting individuals in control of their data, and thereby reinforcing customer trust and engagement, as well as growing a brand’s reputation. Starting on May 25th, your customers, prospects, visitors and stakeholders will be empowered to challenge your privacy practices through simple actions, such as giving or withdrawing consent, or exercising their rights: the right to be informed, to restrict processing or to object; the right of access, to data portability, or to rectification; or the right to be forgotten and not to be subject to automated profiling and decision-making.

This part of the regulation constitutes the most visible, customer-facing side of GDPR. The interactive sessions during Data 2020 showed that very few organizations have yet considered who will be responsible for this topic and how it will ultimately be delivered to their customers.

So, there you have it: my top takeaways from what I perceived to be some of the most insightful conversations on GDPR. What are your thoughts? Leave a comment below or tweet me here. I’d love to hear your take.


How to Create a Smart City with IoT and Big Data

 

These days, it seems like every city is trying to implement “smart city” initiatives. Take Singapore for example, now known for the most extensive effort to collect data on citizens’ daily living habits/routines ever attempted by a municipality. Even Bill Gates has pumped millions of dollars into helping Phoenix in their smart city efforts.

But what do we mean when we talk about a “smart city”? Is it the better use of resources within the city, or the eventual elimination of the resources being wasted? Ultimately, we can agree to disagree, as most people have their own perception of the functionality and the phenomenon we know as the smart city.

Since development has been rapid, it is difficult to keep an eye on all that is happening. Industries have started adopting the Internet of Things (IoT) and wish to implement it at a broader level to increase efficiency. But regardless of what has transpired to date across industries, one thing that does not escape the eye is M2ocity’s rollout in France.

To talk over this matter, I got in touch with the man behind these technicalities: Xavier Diab, IT director at M2ocity. Readers who are not yet familiar with M2ocity should know that the company is France’s biggest telecom operator for the Internet of Things (IoT). A result of a merger between water supplier Veolia and operator Orange, M2ocity has emerged as a leading protagonist in the market for IoT in France.

Drivers for Merger

Veolia Water and Orange gave birth to M2ocity in 2011, bringing both companies’ services together in the form of a smart metering entity. M2ocity has since expanded its range from France to other areas of the world. Veolia Water, which had a strong presence across France, initiated the union with Orange in the hope that both companies could improve their quality of customer service by safeguarding resources and optimizing performance at scale.

The merger between Veolia water and Orange has come a long way from where it started, as M2ocity now stands at a crucial position in the drive towards a smart city.

One of my first questions to M2ocity’s Xavier Diab was about the drivers behind the merger into M2ocity, France’s biggest telecom operator for IoT and applications. Xavier responded that water was and is a vital resource for inhabitants living in all major cities and metropolitan centers around the world. Since water resources are very important, it is imperative that all stakeholders involved take steps to preserve and manage it in a “smart” way.

Although we know about the importance of water as a commodity, we haven’t taken concrete steps to address its preservation. In France alone, around 343 billion gallons (1,300 billion liters) of potable water goes to waste due to leaks in hydraulic systems across French cities. That amounts to roughly 25% of the country’s total water being lost to minor leaks across the region. These leaks come at a cost to both consumers and operators.

M2ocity, being the biggest player in the French telecom market, hopes to implement a three-point initiative for better waste management. The initiative includes smart city networks, smart objects, and energy efficiency, which end up forming the smart city itself.

Challenges

M2ocity’s aim to create a smart city was met by numerous challenges, including the following:

  • Water Leakages: As mentioned above, water leakage was a big hindrance to the fulfillment of the smart city project. Water leaks usually comprise endpoint leaks, underground leaks (the most complicated ones), and above-ground leaks.
  • Data Collection: Along the way, M2ocity realized that data collection for IoT is not as easy as it sounds. Since there are no common standards in the market, it is difficult to collect data in a seamless way. The availability of different formats means that collectors often face a conundrum of what to collect, how to monitor, and what to use.
  • Too Much Data: While collecting the data proved problematic, the sheer scale of the data presents another challenge. As I have highlighted in previous articles, AI or IoT can only be successful if the data is cleansed. The responsibility to deliver, rationalize and cleanse data makes for an interesting challenge that can only be addressed through seamless methods that keep the process reliable. And while the size of the data complicates collection and the metrics for compiling it, there are also numerous implications when it comes to visualizing and displaying the data. Visualizing such high quantities of data requires both performance and architectural capabilities. For M2ocity to deliver these proficiencies, it was imperative to work with a data integration firm that could handle better-standardized devices and formats.

The Solution

When asked about the solution to the problem, Xavier responded:

For M2ocity, that data integration entity was Talend, which offered the flexibility and reliability they were looking for. M2ocity’s data needs were diverse and required the best of Talend to process the information in a way that would be suitable for both M2ocity and its customers. Talend’s format, collection intervals, and scalability fit perfectly with what M2ocity required.

Xavier elaborated on the company’s selection criteria, noting it had numerous vendors from which to choose. Despite other vendor bids, the offer from Talend beat out competing solutions for its ability to keep pace with rapidly evolving business needs, an easy-to-use interface, and a reasonable pricing model that was in line with their budget.

Results

It has been over 7 years since the two firms entered into an agreement, and the ride has been nothing short of incredibly successful. Some of the results Xavier pointed to as metrics of the project’s success include:

  • More than 2.3 million smart sensors have been installed across 3,000 cities.
  • The system manages around 20 million messages of data from clients per day.
  • There are around 140 million messages collected and displayed every week.

 “We collect all types of data—water, temperature, electricity, pollution, noise data, etc.—and analyze them to develop innovative public services in smart cities.”

Xavier Diab

The future for the IoT and better data integration is indeed bright, and M2ocity wants to expand its services into lighting, safety, security, and temperature.

If you would like to read more from Ronald van Loon on the possibilities of Big Data and Artificial Intelligence, please click “Follow” and connect on LinkedIn and Twitter.