Mature organisations can afford a lot of sophistication in their tooling. For new hires it’s often just “the CI”, but that name spans a number of different responsibilities and software.
The most basic need is often to do stuff whenever someone pushes a new version of the code (called a commit in most version control systems).
Depending on the project, programming language and so on, we may want to compile our code, test it, and, if it works, deploy it for the world to see. Those steps form the actual “Continuous Integration” system.
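As a rough illustration of that compile / test / deploy chain, here is a minimal sketch in Python. The `run_pipeline` helper and the three commands are hypothetical placeholders, not the behaviour of any particular CI product; the only point is that a failing step stops everything after it.

```python
import subprocess
import sys
from typing import Sequence


def run_pipeline(steps: Sequence[tuple[str, list[str]]]) -> list[tuple[str, int]]:
    """Run each (name, command) step in order and stop at the first failure."""
    results = []
    for name, command in steps:
        completed = subprocess.run(command)
        results.append((name, completed.returncode))
        if completed.returncode != 0:
            break  # later steps (e.g. deploy) never run on broken code
    return results


# Placeholder commands: each one only prints or exits, standing in for
# real build / test / deploy commands.
PY = sys.executable
steps = [
    ("compile", [PY, "-c", "print('compiling')"]),
    ("test", [PY, "-c", "import sys; sys.exit(1)"]),  # simulate failing tests
    ("deploy", [PY, "-c", "print('deploying')"]),
]
results = run_pipeline(steps)
print(results)  # [('compile', 0), ('test', 1)]
```

The deploy step is absent from the result because the simulated test step returned a non-zero exit code.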
Different products have different approaches: some focus on defining straightforward pipeline steps, others let you define sophisticated orchestration with scripting, run nightly or manual tests, or even let you build a whole DAG of tasks to run. The flow is usually specified in a domain-specific language: Groovy for Jenkins (the incumbent/aging CI solution) and specifications written in YAML for most other solutions (GitlabCI, CircleCI, DroneCI, TravisCI, Azure Pipelines, etc).
In the end, they all have a pretty UI and let you see logs and your commit’s status. Sadly, you can be sure complex projects will need workarounds on top of their API to get the workflow they actually want.
Keep things as simple as possible, under source control, and everything will be easier.
When your CI says `make`, in what working directory does it actually happen? What version of `make` is used? What compiler, or more generally what `CXX`/… environment variables? Even more generally, what is the OS, and how do you know if it’s even Linux or Windows?
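One way to get answers is to have the job print them before doing anything else. The sketch below is an assumption about how one might do that in Python; `describe_environment` is a hypothetical helper, and it assumes the tools it inspects accept a `--version` flag (GNU make does, other tools may not).

```python
import os
import platform
import shutil
import subprocess


def describe_environment(tools=("make",), variables=("CC", "CXX", "PATH")):
    """Collect working directory, OS, selected variables and tool versions."""
    report = {
        "cwd": os.getcwd(),
        "os": platform.system(),  # e.g. "Linux" or "Windows"
        "os_release": platform.release(),
    }
    for var in variables:
        report[f"env.{var}"] = os.environ.get(var, "<unset>")
    for tool in tools:
        path = shutil.which(tool)
        if path is None:
            report[f"tool.{tool}"] = "<not found>"
            continue
        # Assumption: the tool understands --version.
        proc = subprocess.run([path, "--version"], capture_output=True, text=True)
        output = proc.stdout or proc.stderr
        first_line = output.splitlines()[0] if output.strip() else "<no output>"
        report[f"tool.{tool}"] = f"{path}: {first_line}"
    return report


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

Keeping this output in the job log makes it much easier to compare a failing run with a passing one.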
The answer to those questions is called the environment. To specify how your code is executed you have a spectrum of options, for instance:
- Make it implicit: have your jobs request specific “agents”, “tags”… in the CI configuration, and register matching CI job runners with the correct environment. One issue is that the runners’ configuration is not saved as code, and you could run into issues trying to reproduce results…
- Have generic test runners, and ask your CI to use a specific Docker container as part of the task’s requirements. In many cases, you will still have to worry about what volumes should be mounted where, what the user should be, how you update and cache said container… But it’s easily solved and you get a lot of flexibility.
- When you have a big cluster of machines, you can also just run a setup script at the beginning of each job. To be honest it’s often the simplest option if you have a solid HPC infrastructure with task queues and shared storage. My preference goes to `source`-ing `.envrc` files, in `bash` format, full of `export VAR=X`. How everything was compiled is described in a separate Dockerfile. On the terminal use `direnv` and everything will work like magic.
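To make the last option concrete, here is a minimal sketch of a job wrapper that reads `export VAR=X` lines and runs a command with them. It is only an illustration: `parse_exports` is a hypothetical helper, the `.envrc` content is made up, and only literal assignments are handled. A real setup would let `bash` or `direnv` evaluate the file.

```python
import os
import shlex
import subprocess
import sys


def parse_exports(text: str) -> dict[str, str]:
    """Collect VAR=X pairs from simple `export VAR=X` lines.

    Only literal assignments are handled: nothing that bash would expand
    ($VAR, $(...), direnv helpers) is evaluated.
    """
    env = {}
    for line in text.splitlines():
        tokens = shlex.split(line, comments=True)  # drops "# ..." comments
        if len(tokens) < 2 or tokens[0] != "export":
            continue
        for token in tokens[1:]:
            name, sep, value = token.partition("=")
            if sep and name.isidentifier():
                env[name] = value
    return env


# Hypothetical .envrc content, inlined to keep the example self-contained.
envrc = """
# toolchain for this project
export CXX=clang++
export BUILD_TYPE="Release"
"""

job_env = {**os.environ, **parse_exports(envrc)}
out = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['CXX'])"],
    env=job_env, capture_output=True, text=True, check=True,
)
print(out.stdout.strip())  # clang++
```

The child process sees exactly the variables declared in the file, on top of whatever the runner already provides.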
Use solutions where you define the infrastructure you’re using as code. It gives you certainty and upgrade paths.
Tests are your safety net: they let you contribute to the code without fear of breaking things, help you maintain, refactor and even serve as API documentation.
Your CI tool will trigger those tests on each commit, to make sure everything works as intended.
What matters is that you test both small components (classes, functions…) for correctness - unit tests - and how things fit together on a larger scale - integration tests.
For us working on hardware design, we need consistent, stable results on hardware IP, so we also define bit-accuracy tests. Those don’t check correctness, only that your code behaves like the one you asked the hardware design team to etch into a chip.
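A bit-accuracy test can be sketched as an exact comparison against frozen "golden" outputs. Everything below is hypothetical: the fixed-point filter stands in for a hardware model, and the stimulus and golden values were chosen for the example, not taken from a real design.

```python
def reference_filter(samples):
    """Hypothetical fixed-point filter standing in for the frozen hardware model."""
    out, prev = [], 0
    for s in samples:
        prev = (3 * prev + s) >> 2  # integer arithmetic only: results are exact
        out.append(prev)
    return out


def candidate_filter(samples):
    """A 'refactored' floating-point version: close, but rounds differently."""
    out, prev = [], 0
    for s in samples:
        prev = round(0.75 * prev + 0.25 * s)
        out.append(prev)
    return out


def first_mismatch(expected, actual):
    """Return the index of the first differing value, or None if bit-exact."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    if len(expected) != len(actual):
        return min(len(expected), len(actual))
    return None


STIMULUS = [0, 16, 64, 255, 128, 0, 7]
GOLDEN = [0, 4, 19, 78, 90, 67, 52]  # outputs recorded when the design was frozen

print(first_mismatch(GOLDEN, reference_filter(STIMULUS)))  # None
print(first_mismatch(GOLDEN, candidate_filter(STIMULUS)))  # 5
```

The candidate is arguably just as "correct" numerically, yet it fails the test at index 5, which is exactly the kind of drift this test exists to catch.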
The actual test framework depends a lot on the language. For instance, in C++, `gtest` is popular. Its macros let you define tests that run your code and check that you obtain the expected results. Other frameworks in other languages go a long way to try to make tests human-readable, so that anyone can write tests.
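The same idea, sketched in Python with the standard `unittest` module rather than `gtest`: each test calls the code and asserts on the result. The `clamp` function is a made-up example of a small unit under test.

```python
import unittest


def clamp(value, low, high):
    """Function under test: restrict value to the range [low, high]."""
    return max(low, min(value, high))


class ClampTest(unittest.TestCase):
    def test_value_inside_range_is_unchanged(self):
        self.assertEqual(clamp(5, 0, 10), 5)

    def test_value_is_limited_to_bounds(self):
        self.assertEqual(clamp(-3, 0, 10), 0)
        self.assertEqual(clamp(42, 0, 10), 10)


# Run the tests programmatically; a CI job would more typically call
# `python -m unittest` and rely on the exit code.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ClampTest)
result = unittest.TextTestRunner(verbosity=2).run(suite)
print(result.wasSuccessful())  # True
```

Descriptive test names double as documentation: the list of passing tests tells a newcomer what the function promises.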
While tests are required, invest in training and be kind, to make sure engineers view them as helpful tools, not dead weight imposed upon them.
When working on algorithms, there is no such thing as “correctness”. Your tests are usually basic and only check that your processing/training is set up correctly. What you want to monitor is quality.
What you want is a process that can compute both:
- Objective figures of merit, aka metrics, aka Key Performance Indicators (KPIs).
- Qualitative reports, plots, or interactive demos that let you subjectively investigate how good your results are, and where you fail.
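As a toy illustration of both points, the sketch below computes two objective metrics per version and also reports the worst sample, which is where a qualitative look would start. The data, the metric names and the `kpis` helper are all invented for the example.

```python
def kpis(outputs, targets):
    """Compute simple figures of merit plus the index of the worst sample."""
    errors = [abs(o - t) for o, t in zip(outputs, targets)]
    return {
        "mean_abs_error": sum(errors) / len(errors),
        "max_abs_error": max(errors),
        "worst_index": errors.index(max(errors)),  # where to look first
    }


# Made-up results of two versions of an algorithm on the same inputs.
targets = [10, 20, 30, 40]
outputs = {
    "v1": [12, 18, 33, 40],
    "v2": [11, 20, 31, 42],
}

report = {version: kpis(values, targets) for version, values in outputs.items()}

for metric in ("mean_abs_error", "max_abs_error"):
    before, after = report["v1"][metric], report["v2"][metric]
    print(f"{metric}: v1={before} v2={after} delta={after - before}")

for version, values in report.items():
    print(f"{version}: worst sample is #{values['worst_index']}")
```

Here the second version improves both metrics, but its worst case moved to a different sample, which a single averaged number would hide.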
You want a tool that lets you visualize results, compare different versions of your code, track results over time, and share results easily.
For machine learning there are some great options to track runs during development: tensorboard, mlflow, nni, polyaxon, cometML…
At Samsung none of those fit our team’s use cases. We developed our own and hopefully will open-source it!
Many products have a separate QA team that does its own testing. Having all build artifacts, (automated) tuning tools and algorithm results integrated helps everyone collaborate and stay on the same page.
At some point we leave the realm of CI and venture into operations. When exactly? Today the frontier is less clear, since managed cloud offerings try to replace operations, and developers are expected to own part of operations via “infrastructure as code”…
Still, in many organisations the “CI” will also encompass functions such as:
- Managing build artifacts and published versions
- Deployment: orchestrating infrastructure, services, provisioning, upgrades, rollbacks…
The benefits of unbundling the CI
There is a lot of value in building integrated systems. However, to keep them easy to work on, keep the different functions above separated.
Is it obvious? Maybe, but in the wild you hear about interesting setups.