Most approaches to evaluating service certification center on improving monitoring, log management, and incident response times. In our opinion, this is a fundamentally flawed way to approach both service compatibility and service certification as a whole. How services will work together should be tested before they reach production, so that potential problems with interconnected services can be addressed before they cause downtime or production bugs.
As an industry, we tend to focus heavily on code. Most systems are built to be code-centric: writing code, testing code, collaborating on code. But this code-first approach has limitations. Applications are becoming increasingly polyglot, making direct code collaboration more challenging. At the same time, services are becoming both more granular and more tightly interrelated.
Code is still important, but we need to start treating services as first-class citizens, too. Not every problem can be solved by shipping perfect code. Testing and verifying code will always matter, but ultimately we need to know not whether the code is good but whether the service works. Treating service certification as just as important as running code tests and doing code reviews is the only way organizations will end up with applications that work as expected in production, which is, after all, the end goal.
Developers are responsible for creating services, and they need access to information about how well their service works, not just whether the code is clean. This information should be available as automatically as possible, without requiring manual feedback from colleagues working on other services. We expect a short feedback loop on code quality tests; the feedback loop for service certification should be just as short.
Relying on a canary deployment to certify services is an inherently backwards approach, but for many organizations it is the only way to see whether a service works. A canary deployment is not the right place to check for service fit. First, canaries are released to a subset of users, which means intentionally accepting that you could break things for those users. Second, it lengthens the feedback loop for developers. Instead of immediately (and privately) seeing that a service isn't working correctly with its upstream or downstream dependencies, they don't find out there's a problem until the very end of the deployment process, and it's a public failure instead of a private debugging experience.
One of the biggest challenges in managing service certification is the ever-changing nature of modern software environments. The production environment is constantly shifting: dozens of teams are simultaneously updating and improving existing services while creating new ones. This makes it incredibly difficult for a local machine to exactly match production, and it makes frictionless collaboration between developers and teams essential.
There has to be a way for developers to certify their services against an environment that also contains everyone else's most up-to-date services. This is the only way to be confident not only that the service works in the specific production environment, but also that no upstream dependency has changed in an incompatible way.
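One lightweight way to catch an incompatible upstream change before deployment is to compare the dependency versions a service was certified against with the versions actually running in the shared environment. The sketch below assumes each team declares certified version ranges and the environment exposes a manifest of deployed versions; all service names, versions, and data structures here are illustrative, not part of any particular tool.

```python
# Minimal sketch of an upstream-compatibility check. Assumes (hypothetically)
# that each service declares the dependency version ranges it was certified
# against, and that the shared environment publishes a manifest of currently
# deployed versions.

def parse_version(v):
    """Turn a version string like '2.3.1' into a comparable tuple (2, 3, 1)."""
    return tuple(int(part) for part in v.split("."))

def in_range(version, minimum, maximum):
    """True if minimum <= version < maximum (simple semver-style ordering)."""
    return parse_version(minimum) <= parse_version(version) < parse_version(maximum)

def check_compatibility(declared_deps, environment_manifest):
    """Return a list of upstream services whose deployed version falls
    outside the range this service was certified against."""
    problems = []
    for service, (lo, hi) in declared_deps.items():
        deployed = environment_manifest.get(service)
        if deployed is None:
            problems.append(f"{service}: not deployed")
        elif not in_range(deployed, lo, hi):
            problems.append(f"{service}: deployed {deployed}, certified for >={lo},<{hi}")
    return problems

# Illustrative data: the shared environment runs payments 3.0.0, but this
# service was only certified against payments >=2.0.0,<3.0.0.
declared = {"payments": ("2.0.0", "3.0.0"), "inventory": ("1.4.0", "2.0.0")}
manifest = {"payments": "3.0.0", "inventory": "1.5.2"}
print(check_compatibility(declared, manifest))
```

A check like this can run automatically whenever either the service or the shared environment changes, so an incompatible upstream bump surfaces as a failed check rather than a production incident.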
The reality for most organizations is that service certification isn't done in any systematic way pre-production, so things like service compatibility and interactions between dependencies aren't fully evaluated until the service is running in production. As a result, problems with service compatibility are discovered only when they cause downtime or user experience issues, and teams end up trying to address them by improving their monitoring and debugging tools.
Service certification should be handled at the same place in the development workflow as code testing: as close to the developer as possible. Not only does this shorten the feedback loop and improve developer productivity, it also reduces the risk of downtime or bugs appearing in production because of service-level problems.
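In practice, putting certification next to code testing can look like a consumer-driven contract check that runs with the unit test suite: the service declares which fields it consumes from an upstream response, and the check fails pre-merge if the upstream's response no longer satisfies them. The endpoint, field names, and sample response below are purely illustrative assumptions, a sketch rather than any specific tool's API.

```python
# Minimal sketch of a consumer-driven contract check run alongside unit
# tests. The consumed fields and the staged response are hypothetical
# examples for illustration.

CONSUMED_FIELDS = {          # what this service reads from an upstream order lookup
    "order_id": str,
    "status": str,
    "total_cents": int,
}

def certify_response(response):
    """Compare an upstream response against the consumer contract and
    return human-readable violations (an empty list means certified)."""
    violations = []
    for field, expected_type in CONSUMED_FIELDS.items():
        if field not in response:
            violations.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            violations.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(response[field]).__name__}"
            )
    return violations

# Illustrative staged response where 'total_cents' became a string:
# the developer sees the break pre-merge instead of in a canary.
staged = {"order_id": "A-100", "status": "shipped", "total_cents": "1299"}
print(certify_response(staged))
```

Because the check lives in the same test run as code quality checks, a breaking change in a dependency shows up as a private, immediate failure on the developer's machine rather than a public one at the end of the deployment pipeline.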
Roost lets organizations evaluate service fit and certify services before updates are put into the CI/CD pipeline, keeping the process as close to developers as possible. See how it works here or request a demo.