Automated software testing

At Next DLP, quality is the responsibility of everyone, from the graduate to the CEO. Quality matters because we are a security company, and our focus is to protect you and your organization from potential threats. We want to ensure that our software always performs exactly as designed. A change is not good enough unless it comes with sufficient test coverage and benchmarking, no matter who made it.

Our testing process is easily reproducible by any Next DLP developer, so fixes can be implemented quickly. We exceed the Joel Test as a measure of our process. We are pragmatic programmers. Where we can, we use, test, and contribute to thoroughly tried and tested open source projects, but we understand that our customers expect us to be experts in those projects and fully responsible for them. Where no third-party tools exist, we build them in-house using industry best practices.

Continuous integration

In seeking quality, we have invested heavily in continuous integration (CI) at Next DLP. It is well known that the later in the development process an issue is found, the more it costs to fix, but there are no magic bullets. Unit testing alone can leave issues in the integration and usability of the system, while testing at higher layers costs more in compute time and pipeline latency. At Next DLP we have struck a balance between the two. We continuously refine the process as we develop new internal tools and external features, keeping our quality high whilst minimizing the latency of a change so that fixes can be merged quickly when needed.

We have run over a million jobs through our pipeline, with every commit tested and approved before merging. This ensures that our master branch is always ready to have release branches made from it. We have broken our process down into a few different steps.

Code review

Every merge request at Next DLP is approved by another developer. No developer, whatever their seniority, can merge a change without going through the approval process. It is important to us that any developer can block a change request by asking whether it has enough testing in place, or by pointing out that something is unclear or unnecessary. Code review also spreads knowledge across the team, as people can observe and contribute to parts of the codebase they are unfamiliar with before jumping in. All comments must be resolved before any code can merge.

Static analysis

Static analysis finds common developer mistakes by examining the source code itself, without running it. Developers can run most of these tools live as part of their own process (some as they are typing!), in addition to CI running them. We use industry standard tools from Synopsys (Coverity) and Google (gofmt, go vet, golint), as well as a range of open source tools for every language we use at Next DLP (gometalinter, eslint, elm-analyse, shellcheck and many others).

Much has been written on the subject of software quality metrics, such as cyclomatic complexity. Whilst we do not believe that good software quality metrics are equivalent to good software, we do believe that bad metrics are indicative of a problem. As a result, we measure and limit them for every commit into our repository.

Unit testing

Unit testing is vital for finding simple issues early on. A change is not considered finished until it has good test coverage. We have found, like many others, that test-driven development can find errors earlier and lead to better design choices. We run all unit tests in our CI environment to catch regressions, and add to them as issues are found in our code. Our CI environment also runs microbenchmarks that check the performance of small components of the system.
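To make the pattern concrete, here is a minimal sketch of a unit test, a regression test added after a bug, and a microbenchmark for the same small component. Everything in it is hypothetical: `normalize_path` is an invented unit under test, and the bug the regression test refers to is made up for the example.

```python
import timeit
import unittest


def normalize_path(path: str) -> str:
    """Hypothetical unit under test: collapse repeated '/' separators."""
    parts = [part for part in path.split("/") if part]
    prefix = "/" if path.startswith("/") else ""
    return prefix + "/".join(parts)


class NormalizePathTest(unittest.TestCase):
    def test_collapses_repeated_separators(self):
        self.assertEqual(normalize_path("/usr//local///bin"), "/usr/local/bin")

    def test_regression_trailing_separator(self):
        # Added after a (hypothetical) bug report: a trailing '/' must be dropped.
        self.assertEqual(normalize_path("logs/"), "logs")


def benchmark_seconds_per_call(runs: int = 10_000) -> float:
    """Microbenchmark: mean wall-clock seconds per call over `runs` calls."""
    total = timeit.timeit(lambda: normalize_path("/usr//local///bin"), number=runs)
    return total / runs
```

In a pipeline, the tests would run on every commit, while the benchmark result would be compared against a stored baseline so that a slowdown in a small component is noticed early.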

Kernel testing framework

The Reveal Agent is split into two components, the driver or module (“bottom half”) and the application layer (“top half”).

The agent’s kernel modules contain a platform-specific layer which we test in isolation from the main agent software. This architecture helps us find issues within the bottom half of the agent easily, and allows us to reuse testing logic between Linux, Windows and Mac agents. We run these tests on all our supported platforms in CI, and provide automation for developers to run them locally in virtual machines. The tests exercise the event pipeline from the operating system through our kernel module and into the agent, and check for issues between our kernel module and operating system APIs. They must run natively in virtual machines because they require access to operating system resources that cannot be shared. As a result, they are among our more expensive tests in terms of test resources. We also monitor performance here, to check for any impact we have on the performance of the system as a whole.

Remote agent testing

Complementary to the kernel module testing, we have created an agent in which the module layer is replaced with a mock or test driver. With this, we can test the top half of the agent cost-efficiently by running many instances of these tests on a single CI node. At Next DLP we use containers for the vast majority of our testing in order to reproduce environments, and the remote agent testing is no exception. Given the platform-independent nature of the top half of the agent, we can build coverage across platforms. Additionally, we can use this framework to benchmark Reveal itself without having to instantiate thousands of real, virtualized agents, and it allows us to inject faults or erroneous data into the platform to measure any impact.
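The general shape of this approach can be sketched as follows. This is not the Reveal Agent's real interface: `Event`, `Driver`, `FakeDriver` and `TopHalf` are names invented for the example, and the "fault" is simply one malformed event slipped into a scripted stream.

```python
from dataclasses import dataclass, field
from typing import Iterator, Optional, Protocol


@dataclass(frozen=True)
class Event:
    kind: str
    path: str


class Driver(Protocol):
    """What the top half needs from the bottom half (hypothetical interface)."""

    def events(self) -> Iterator[Event]: ...


@dataclass
class FakeDriver:
    """Test driver: replays scripted events, optionally injecting a fault."""

    script: list[Event]
    inject_fault_at: Optional[int] = None

    def events(self) -> Iterator[Event]:
        for index, event in enumerate(self.script):
            if index == self.inject_fault_at:
                # Deliberately malformed event, emitted before the real one.
                yield Event(kind="", path="")
            yield event


@dataclass
class TopHalf:
    """Stand-in for the application layer: validates and records events."""

    driver: Driver
    accepted: list[Event] = field(default_factory=list)
    rejected: int = 0

    def run(self) -> None:
        for event in self.driver.events():
            if event.kind and event.path:
                self.accepted.append(event)
            else:
                self.rejected += 1
```

Because the fake driver needs no kernel access, many such agents can run side by side in containers on one node, and a test can assert that injected faults are rejected without losing the valid events around them.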

End-to-end testing

As the final backstop before release, we deploy agents for all of our platforms and run a variety of tests against our internal test deployments. This checks agent enrollment and performs smoke tests for various agent features. The end-to-end testing leans heavily on the agent crash reporting system and performance statistics. We gather statistics in the same way as we do for customer installs, which helps us validate that we can diagnose issues in the field.

Long-term testing

We run several agents as long-term testing endpoints to check for performance degradation and cumulative errors, as well as to test the upgrade process. This is where we monitor CPU usage and memory utilization, with alerts that demand developer attention when either exceeds a set threshold.
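A threshold check of this kind might look like the sketch below. The sample fields, the limits and the alert wording are all assumptions made for illustration; the memory growth check is one simple way to flag a cumulative problem such as a slow leak, not a description of our monitoring stack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    cpu_percent: float
    rss_mib: float


def check_thresholds(
    samples: list[Sample],
    max_cpu_percent: float,
    max_rss_mib: float,
    max_rss_growth_mib: float,
) -> list[str]:
    """Return one alert message per exceeded limit, in a fixed order."""
    alerts: list[str] = []
    if not samples:
        return alerts

    peak_cpu = max(sample.cpu_percent for sample in samples)
    if peak_cpu > max_cpu_percent:
        alerts.append(f"cpu peak {peak_cpu:.1f}% exceeds {max_cpu_percent:.1f}%")

    peak_rss = max(sample.rss_mib for sample in samples)
    if peak_rss > max_rss_mib:
        alerts.append(f"memory peak {peak_rss:.1f} MiB exceeds {max_rss_mib:.1f} MiB")

    # Growth between the first and last sample hints at a cumulative problem.
    growth = samples[-1].rss_mib - samples[0].rss_mib
    if growth > max_rss_growth_mib:
        alerts.append(
            f"memory grew {growth:.1f} MiB, limit {max_rss_growth_mib:.1f} MiB"
        )
    return alerts
```

Checking growth as well as peaks matters for long-running endpoints, since a slow leak can stay under the absolute limit for days while still trending in the wrong direction.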

Dogfooding and usability

At Next DLP we actively run the agent on every machine we can, including our CI machines, laptops, desktops and servers. Dogfooding helps us track the performance and usability of the agent when running on real workloads. Running the Reveal Agent also lets us gather real data, which serves as an input to our machine learning. It keeps us honest, too: we would not ship software to our customers that we would not be comfortable relying on ourselves.

Our testing process is not done yet, nor will it ever be. As issues arise, our policy is to add regression tests for them, and we will keep refining our process as we develop the product further. More layers will be introduced as we gain access to more resources (in fact this blog post is already out of date at the time of going to press, and doesn’t cover penetration testing, fuzzing, or how we test infrastructure), but I can comfortably say that the Reveal Agent has the most comprehensive testing system I have ever worked with.

Sources

Atlassian, “Dogfooding and Frequent Internal Releases”, July 9, 2009
IBM, “Monitoring cyclomatic complexity”, March 28, 2006
T. J. McCabe, “A Complexity Measure”, IEEE Transactions on Software Engineering, Volume SE-2, No. 4, 1976
Synopsys, “How much do bugs cost to fix during each phase of the SDLC?”, January 12, 2017

This post was originally published in October 2018 and has been updated for comprehensiveness.