Continuous Integration Best Practices – Part 1
In this blog, I want to highlight some of the best practices that I’ve come across as I've implemented continuous integration with Talend. For those of you who are new to CI/CD please go through the part 1 and part 2 of my previous blogs on ‘Continuous Integration and workflows with Talend and Jenkins’. This blog would also introduce you to some basic guidance on how to implement and maintain a CI/CD system. These recommendations will help in improving the effectiveness of CI/CD.
Without any further delay - let's jump right in!
For every product Talend offers, there is also a recommended architecture. For details on the recommended architecture for our different products please refer to our Git repo: https://talendpnp.github.io/ . This repository has details on every product Talend offers. It’s truly a great resource, however, for this blog I am focusing only on the CI/CD aspect with Data Integration platform.
In the architecture for Talend Data Integration, it’s recommended to have a separate Software Development Life Cycle (SDLC) environment. This SDLC environment should typically consist of an artifact repository like Nexus, a version control, server tool like Git, Talend Commandline, Maven, and the Talend CI Builder Plugin. The SDLC server would typically look as follows (the optional components are marked in yellow):
As the picture shows, the recommendation suggests having separate servers for Nexus, Version control system and CI server. One Nexus is shared across all environments. All environments access the binaries stored in Nexus for job execution. The version control system is needed only in the development environment. This server is also not accessed from other environments.
Continuous integration requires working with a version control system and hence it is important to have a healthy, clean system for the CI work fast. Talend recommends a few best practices while working with Git. Please go through the links below on best practices while working with Git.
- Using Git with Talend: https://help.talend.com/reader/NZd~_ZsO9MIgBDlYKkw5AQ/PaEUi6e6uvIJtQQd0ApuTg
- Branching Merging and Tagging in Talend: https://help.talend.com/reader/Zpzt_P_L6RrGuFyEnxwwqw/N5i7DnkE3P2l2QY9pym_xg
- Managing SVN/git branches and tags: https://help.talend.com/reader/WEEgRoxIy_iUMxRK3bopPQ/vlsuqVZXW0kMHMiwyGbvTA
Best Practice 3 – Continuous Integration Environment
Talend recommends 4 environments with a continuous integration set up (see below). The best practice is illustrated with Git, Jenkins and Nexus as an example. The GIT in the SDLC server is only accessible from the Development Environment. Other environments cannot access the GIT.
All the coding activities take place in the development environment and are pushed to the version control system Git. The CI/CD process takes the code from Git, converts it into binaries and publishes the code to Nexus Snapshot repository. All non-prod environments have access to Nexus Release and Snapshot repository, however, the production environment has access only to the Release repository.
It is important to note that One nexus is shared among all environments.
Best Practice 4 – Maintaining Nexus
The artifact repository Nexus plays a very vital role in continuous integration. Nexus is used by Talend not only for providing software updates/patches but is also used as an artifact repository to hold the job binaries.
These binaries are then called via the Talend Administrator Center or Talend Cloud to execute the jobs. If your Talend Studio is connected with the Talend Administration Centre, all the Nexus artifact repository settings are automatically retrieved from the Talend Administration Center. You can choose to use the retrieved settings to publish your Jobs or configure your own artifact repositories.
If your Studio is working on a local connection, all the fields are pre-filled with the locally-stored default settings.
Now, with respect to CI/CD, as a best practice, it is recommended to upload the CI builder plugin and all the external jar files used by the project to the third-party repository in Nexus. The third-party folder will look as given below once Talend ci builder is uploaded.
To implement a CI/CD pipeline it is important to understand the difference between release and snapshot Artifacts/Repositories. Release artifacts are stable and everlasting in order to guarantee that builds which depend on them are repeatable over time. By default, the Release repositories do not allow redeployment. Snapshots capture a work in progress and are used during development. By default, snapshot repositories allow redeployment.
As a best practice, it is recommended that development teams learn the difference between the repositories and to implement the pipeline in such a way that the development artifacts refer to the Snapshot repository and rest of the environments refer only to the Release Repository.
The Talend Recommended Architecture talks about “isolating” the CI/CD environment. The CI/CD system typically has some of the most critical information, complete access to your source/target, has all credentials and hence it is critically important to secure and safeguard the CI/CD system.
As a best practice, the CI/CD system should be deployed to internal and protected networks. Setting up VPNs or other network access control technology is recommended to ensure that only authenticated operators are able to access the system. Depending on the complexity of your network topology, your CI/CD system may need to access several different networks to deploy code to different environments. The important point to keep in mind is that your CI/CD systems are highly valuable targets, and in many cases, they have a broad degree of access to your other vital systems. Protecting all external access to the servers and tightly controlling the types of internal access allowed will help reduce the risk of your CI/CD system being compromised.
The image given below shows Talend’s recommended architecture where the CI server, SDLC and the Version control system (git) are isolated and secured.
It is a best practice to have one environment (QA environment) as close to that of the production environment. This includes infrastructure, operating system, databases, patches, network topology, firewalls and configuration.
Having an environment close to the production environment and validating code changes in this environment helps in ensuring that the integration accurately reflects how the change would behave in production. It would identify mismatches and last-minute surprises which can be effectively eliminated, and code can be released safely to production at any time. Note that the more differences between your production environment and the QA environment, the less chances are that your tests will measure how the code will perform when released. Some differences between QA and production are expected but keeping them manageable and making sure they are well-understood is essential.
CI/CD pipelines help in driving the changes through automated testing cycles to different environments like test, stage and finally to production. Making the CI/CD pipeline fast is very important for the team to be productive.
If a team has to wait long for the build to finish, then it defeats the whole purpose. Since all the changes must follow this CI/CD process, keeping the pipeline fast and dependable is very important or else it would diminish the purpose. There are multiple aspects to keep the CD/CD process fast. To being with the CI/CD infrastructure must be good enough not only to suit the current need but also to scale out if needed. Also, it is important to visit the test cases in a timely manner to ensure that no test cases are adding any overhead to the system
Promoting code through the CI/CD pipelines validates each change so that it fits the organization's standards and doesn’t introduce any bugs to the existing code. Any failures in a CI/CD pipeline are immediately visible and it should stop the further code integration/deployment. This is a gatekeeping mechanism that safeguards the important environments from untrusted code.
To utilize these advantages, it is important to ensure that every change in the production environment goes through only the CI/CD pipeline. The CI/CD pipeline should be the only mechanism by which code enters the production environment. This CI/CD pipeline could be automated or via a manual trigger.
Generally, a team follows all this until a production fix or a show stopper error occurs. When the error is critical, there is a pressure to resolve them quickly. It is recommended that even in such scenarios the fix should be introduced to other environments via the CI/CD pipeline. Putting your fix through the pipeline (or just using the CI/CD system to rollback) will also prevent the next deployment from erasing an ad hoc hotfix that was applied directly to production. The pipeline protects the validity of your deployments regardless of whether this was a regular, planned release, or a fast fix to resolve an ongoing issue. This use of the CI/CD system is yet another reason to work to keep your pipeline fast.
As an example, the image below shows the option to publish the job via studio. It is not recommended to use this approach. The recommended approach is to use the pipeline. The example here shows Jenkins.
Best Practice 10 – Build binaries only once
A primary goal of a CI/CD pipeline is to build confidence in your changes and minimize the chance of unexpected impact. If your pipeline requires building, packaging, or bundling step, that step should be executed only once, and the resulting output should be reused throughout the entire pipeline. This practice helps prevent problems that arise while the code is being compiled or packaged. Also, if you have test cases written and you are building the same code for the different environments this ensures you are replicating the testing effort and time in each environment.
To avoid this problem, CI systems should include a build process as the first step in the pipeline that creates and packages the software in a clean environment/temporary workspace. The resulting artifact should be versioned and uploaded to an artifact storage system to be pulled down by subsequent stages of the pipeline, ensuring that the build does not change as it progresses through the system.
In this blog, we’ve started with CI/CD best practices. Hopefully, it has been useful. My next blog in this series will focus on some more best practice in CI/CD world so keep watching and happy reading. Until next time!