Safer software releases with Feature Switches

Abstract

This article shows how feature switches and canary releases can reduce the risks associated with deployment and the time spent on QA, and thus increase the overall productivity of a development team.

Introduction

A failed release to production is something that every programmer has faced at least once in their career. Once the failure has been discovered, developers need to make a crucial decision: whether to revert the whole release or whether there is enough time for an ad hoc hotfix. This is, of course, a stressful situation, especially if several features are bundled into the same release or the hotfix turns out to be more complicated than originally estimated. Furthermore, if failed releases occur repeatedly, the business might lose confidence in its developers. Nobody wants that, but the typical remedy - a more detailed and rigorous testing phase - has significant downsides: it slows down the development process by making it more rigid and bureaucratic. We therefore introduced a different approach to tackle this problem - feature switches.

Feature switch

So what is a feature switch? According to Wikipedia, it's a way to enable and disable features at runtime. Feature switches can be implemented in many different ways, from database entries through property files up to sophisticated web-based servers, but they all behave in a similar way: they introduce a branch in the code that allows a feature to be toggled on and off. Originally introduced to reduce the number of branches in version control, they also make it possible to release new features to a few designated users, or even to ship unfinished features for test purposes.
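To make the idea concrete, here is a minimal sketch of the simplest kind of switch. It assumes the flags live in a plain mapping, for example one loaded from a property file or a database table. The names DictFeatureSwitch, checkout and new_checkout are hypothetical and exist only for this illustration.

```python
class DictFeatureSwitch:
    """Hypothetical switch bound to one feature name, backed by a mapping."""

    def __init__(self, flags, feature_name):
        # `flags` can be any mapping: parsed property file, cached DB rows, ...
        self._flags = flags
        self._feature_name = feature_name

    def feature_enabled(self):
        # Unknown features count as switched off.
        return bool(self._flags.get(self._feature_name, False))


def checkout(switch):
    # The branch in code that the switch introduces.
    if switch.feature_enabled():
        return "new checkout"
    return "old checkout"


flags = {"new_checkout": False}
switch = DictFeatureSwitch(flags, "new_checkout")
```

Because the switch looks the flag up on every call, changing the value in the mapping changes the behavior of checkout at runtime, without a new deployment.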

How feature switches helped us

Deployment vs. Release

Our first gain from adopting feature switches was a clear distinction between release and deployment. These two words are often used interchangeably, but they represent closely related yet different concepts. A deployment happens when a new version of the software is installed on production servers. A release is the moment when new features become available to end users. Additionally, these two processes have different owners: developers are responsible for deployment, whereas a release is solely a business decision. Thanks to feature switches we began to see this difference clearly and to use it to our benefit. We can now deploy new software versions as soon as features are developed and tested. Later on, our product owner (PO) makes the features available to end users.

Canary releases and production tests

Another thing that sped up our development process is testing in production. Thanks to feature switches and Unleash (our feature switch system) we can release a new feature to a limited, carefully chosen set of users. This allows us to test on production without affecting actual users. Additionally, we can release new features gradually and check whether our solution scales.
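The two ideas above, a hand-picked set of users and a gradual rollout, can be sketched as follows. This is an illustration under stated assumptions, not how Unleash is implemented and not its API: the class CanarySwitch, its parameters and the hashing scheme are all hypothetical.

```python
import hashlib


class CanarySwitch:
    """Hypothetical switch: an allow-list plus a percentage rollout."""

    def __init__(self, feature_name, allowed_user_ids=(), rollout_percent=0):
        self._feature_name = feature_name
        self._allowed = set(allowed_user_ids)
        self._rollout_percent = rollout_percent

    def feature_enabled(self, user_id):
        # Designated users (testers, the PO, ...) always get the feature.
        if user_id in self._allowed:
            return True
        # Derive a stable bucket in 0..99 from the feature name and user id,
        # so the same user gets the same answer on every call.
        key = f"{self._feature_name}:{user_id}".encode("utf-8")
        bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
        return bucket < self._rollout_percent
```

Since the bucket of a user never changes, raising rollout_percent only adds users: everyone who had the feature at 10% still has it at 50%.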

Recovering from failed releases

Correctly implemented feature switches virtually eliminate the situation described in the introduction. If a bug is spotted after a release, the feature switch can be turned off to fall back to the previous, working version of the feature.

Implementation

Background

Before we began using feature switches we already had several good practices in place. First of all, we had started applying TDD to our software development. The code is always reviewed, and if a feature is crucial we do pair programming. We have continuous integration that runs all the tests after each push. But what made the adoption of feature switches easier and more reliable was the flexibility that dependency injection gave us.

A good feature switch

Before jumping into the implementation details let's take a closer look at the characteristics of a good feature switch. After several months of experience we can say that a feature switch should:

  • be toggled in one place - checking whether a feature is enabled in several places makes the system obscure, especially in the case of microservices,
  • have a short lifespan - a feature switch should be removed as soon as the feature is fully released and should never be treated as a way of configuring your system. Be strict about this, otherwise you will end up with a messy, highly coupled and unpredictable system,
  • be designed in a way that allows an easy clean-up - removing a feature switch should be a purely mechanical task. Ideally the removing commit should consist only of code removals.

Design

Let's consider a class:

class Foo:
    def do_something(self):
        # I'm doing something
        pass

If we'd like to change the behavior of the method do_something using a feature switch we'd probably write something like:

class Foo:
    def do_something(self):
        if self.feature_is_toggled():
            self.do_something_in_new_way()
        else:
            self.do_something_in_old_way()

From the behavioral point of view, there is nothing wrong with this approach. However, cleaning up this switch fails to comply with the third characteristic mentioned above. Removing it would require removing the if-else statement and inlining the do_something_in_new_way method. This is a significant change that introduces risk and thus should be tested. For us this was unacceptable. We wanted the clean-up of a feature switch to be an effortless task that does not even need to be taken into consideration when planning our sprints.

Therefore we switched to another way of implementing feature switches, one that relies on dependency injection. In this approach the first thing we do is create a copy of Foo - OldFoo. This class is used whenever the feature is switched off. Then we alter the implementation of the original class Foo. Afterward, we create a switch class, which toggles between the two implementations:

The original class with default behavior:

class OldFoo:
    def do_something(self, *args):
        # I'm doing something in the old way
        pass

The class with the new behavior:

class Foo:
    def do_something(self, *args):
        # I'm doing something in a new way
        pass

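The switch class that toggles between them:

class SwitchFoo:
    def __init__(self, old_foo, new_foo, switch_mechanism):
        self._old_foo = old_foo
        self._new_foo = new_foo
        self._switch_mechanism = switch_mechanism

    def do_something(self, *args):
        # Delegate to the new implementation only while the feature is on
        if self._switch_mechanism.feature_enabled():
            return self._new_foo.do_something(*args)
        return self._old_foo.do_something(*args)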

The SwitchFoo class can be injected into all the client classes:

class Bar:
    def __init__(self, foo):
        self._foo = foo

    def do_something(self, *args):
        return self._foo.do_something(*args)

Implementing feature switches in this way makes the clean-up straightforward. The only thing that needs to be done is to remove the OldFoo and SwitchFoo classes and to inject Foo directly wherever SwitchFoo was wired in, and thus the third characteristic - easy clean-up - is fulfilled.
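The following self-contained sketch puts the pieces together and shows what the wiring could look like before and after the clean-up. It assumes the objects are assembled in one place; the names StaticSwitch, build_bar and build_bar_after_cleanup are hypothetical, and the returned strings are placeholders for real behavior.

```python
class OldFoo:
    def do_something(self, *args):
        return "old behaviour"


class Foo:
    def do_something(self, *args):
        return "new behaviour"


class SwitchFoo:
    def __init__(self, old_foo, new_foo, switch_mechanism):
        self._old_foo = old_foo
        self._new_foo = new_foo
        self._switch_mechanism = switch_mechanism

    def do_something(self, *args):
        if self._switch_mechanism.feature_enabled():
            return self._new_foo.do_something(*args)
        return self._old_foo.do_something(*args)


class Bar:
    def __init__(self, foo):
        self._foo = foo

    def do_something(self, *args):
        return self._foo.do_something(*args)


class StaticSwitch:
    """Hypothetical stand-in for a real switch mechanism."""

    def __init__(self, enabled):
        self.enabled = enabled

    def feature_enabled(self):
        return self.enabled


def build_bar(switch_mechanism):
    # Wiring while the feature switch is alive.
    return Bar(SwitchFoo(OldFoo(), Foo(), switch_mechanism))


def build_bar_after_cleanup():
    # Wiring after the clean-up: OldFoo and SwitchFoo are simply gone.
    return Bar(Foo())
```

Bar and Foo are identical in both versions of the wiring, so the clean-up touches neither of them.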

More complicated cases

Canary Release

A canary release is a release in which a feature is toggled on only for designated users. This requires ids of some sort to be passed to the feature switch mechanism in order to check if the feature is toggled. In this case, if the user id was a part of the interface of the do_something method, then the SwitchFoo class could look like:

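class SwitchFoo:
    def __init__(self, old_foo, new_foo, switch_mechanism):
        self._old_foo = old_foo
        self._new_foo = new_foo
        self._switch_mechanism = switch_mechanism

    def do_something(self, user_id, *args):
        # The decision is made per user, so canary users get the new path
        if self._switch_mechanism.feature_enabled(user_id):
            return self._new_foo.do_something(user_id, *args)
        return self._old_foo.do_something(user_id, *args)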

But what if the id is not an argument of the do_something method? In this case, we need to implement the switch at a higher level, where the id is available. If that is the level of the Bar class, we need to instantiate two versions of Bar: one with the new Foo and one with OldFoo.

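class SwitchBar:
    def __init__(
        self,
        bar_using_old_foo,
        bar_using_new_foo,
        switch_mechanism,
    ):
        self._bar_using_old_foo = bar_using_old_foo
        self._bar_using_new_foo = bar_using_new_foo
        self._switch_mechanism = switch_mechanism

    def do_something(self, user_id, *args):
        # Enabled users go to the Bar wired with the new Foo
        if self._switch_mechanism.feature_enabled(user_id):
            return self._bar_using_new_foo.do_something(user_id, *args)
        return self._bar_using_old_foo.do_something(user_id, *args)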
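Since it is easy to mix up the two branches in a switch class, a pair of small tests is a cheap safeguard. The sketch below uses unittest.mock from the standard library; the helper make_switch_bar and the test names are hypothetical, and the class is repeated so that the example runs on its own.

```python
from unittest.mock import Mock


class SwitchBar:
    def __init__(self, bar_using_old_foo, bar_using_new_foo, switch_mechanism):
        self._bar_using_old_foo = bar_using_old_foo
        self._bar_using_new_foo = bar_using_new_foo
        self._switch_mechanism = switch_mechanism

    def do_something(self, user_id, *args):
        if self._switch_mechanism.feature_enabled(user_id):
            return self._bar_using_new_foo.do_something(user_id, *args)
        return self._bar_using_old_foo.do_something(user_id, *args)


def make_switch_bar(canary_user_ids):
    """Build a SwitchBar around mocks; returns (switch_bar, old_bar, new_bar)."""
    old_bar = Mock()
    old_bar.do_something.return_value = "old"
    new_bar = Mock()
    new_bar.do_something.return_value = "new"
    switch_mechanism = Mock()
    # The feature is enabled only for the given canary users.
    switch_mechanism.feature_enabled.side_effect = (
        lambda user_id: user_id in canary_user_ids
    )
    return SwitchBar(old_bar, new_bar, switch_mechanism), old_bar, new_bar


def test_canary_user_gets_new_path():
    switch_bar, old_bar, new_bar = make_switch_bar({"canary"})
    assert switch_bar.do_something("canary", 1) == "new"
    new_bar.do_something.assert_called_once_with("canary", 1)
    old_bar.do_something.assert_not_called()


def test_regular_user_gets_old_path():
    switch_bar, old_bar, new_bar = make_switch_bar({"canary"})
    assert switch_bar.do_something("regular", 1) == "old"
    old_bar.do_something.assert_called_once_with("regular", 1)
    new_bar.do_something.assert_not_called()
```

These tests are removed together with the switch class, so they do not make the clean-up any harder.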

DB Migration

Things always get more complicated when a database migration comes into play, and feature switches are no exception to this rule. If you'd like to introduce a switch for a feature that requires a database migration, you need to support two versions of your database for as long as the feature switch is in use. This can be a tremendous task, but if the feature is very risky it might be worth considering.
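One way to keep both versions working, sketched here under assumptions and not necessarily the approach our team took, is a purely additive migration: the new column is added next to the old one, the old code path ignores it, and the new code path writes both. The table, the column names and the repository classes are hypothetical, and an in-memory SQLite database stands in for a real one.

```python
import sqlite3


def create_schema(conn):
    # Original schema, the only one the old code path knows about.
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL)"
    )


def migrate(conn):
    # Additive migration: nothing is dropped or renamed, so old code still works.
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


class OldUserRepo:
    """Used while the feature is switched off; unaware of display_name."""

    def __init__(self, conn):
        self._conn = conn

    def save(self, user_id, name):
        self._conn.execute(
            "INSERT INTO users (id, full_name) VALUES (?, ?)", (user_id, name)
        )

    def load(self, user_id):
        row = self._conn.execute(
            "SELECT full_name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        return row[0]


class UserRepo:
    """New code path: writes both columns, falls back to the old one on read."""

    def __init__(self, conn):
        self._conn = conn

    def save(self, user_id, name):
        self._conn.execute(
            "INSERT INTO users (id, full_name, display_name) VALUES (?, ?, ?)",
            (user_id, name, name.title()),
        )

    def load(self, user_id):
        row = self._conn.execute(
            "SELECT COALESCE(display_name, full_name) FROM users WHERE id = ?",
            (user_id,),
        ).fetchone()
        return row[0]
```

Rows written by either repository stay readable by the other, so the switch can be flipped back and forth. The old column can only be dropped after the switch and OldUserRepo have been removed.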

Conclusions

A release is the most uncertain and risky part of the whole software development process. In my team, the introduction of feature switches has reduced that risk significantly. Since the change, we have been deploying more often and with greater confidence. As an additional benefit, we have a clearer distinction between the business and engineering parts of our projects. However, there are a few things to keep in mind when you write your first feature switch:

  • at least in the beginning avoid features that require destructive database migrations,
  • use feature switches only for releasing,
  • clean up switches as soon as a full release has been confirmed by business,
  • never use feature switches as a way of configuring your system.