How to roll out dangerous refactoring to prod with a million users?


A still from the film "Airplane!", 1980.

That's how I felt every time I shipped another refactoring to production. Even if you cover the entire codebase with metrics and logs and test the functionality in every environment, it won't protect you 100% from failures after a deployment.

The first failure


At some point we refactored the processing behind our Google Sheets integration. For users this is a very valuable feature, because they work with many tools at the same time that need to be tied together: sending contacts to a spreadsheet, uploading answers to questions, exporting users, and so on.

The integration code had never been refactored since its first version, and it was becoming harder and harder to maintain. This started to affect our users: old bugs surfaced that we were afraid to fix because of the code's complexity. It was time to do something about it. No logic changes were planned, just writing tests, moving classes around, and tidying up names. Of course, we tested the functionality in the dev environment and went to deploy.

Twenty minutes later, users wrote that the integration wasn't working. Sending data to Google Sheets had broken: it turned out that, for debugging purposes, we send data in different formats in the production and local environments, and during the refactoring we broke the production format.

We fixed the integration, but the bad aftertaste of that merry Friday evening (of course it was a Friday!) remained. At the retrospective (the team meeting that closes out the sprint) we started thinking about how to prevent such situations in the future: we needed to improve our manual testing, automated testing, and work with metrics and alerts, and on top of that we got the idea of using feature flags to test refactoring in production. That is what this article is about.

Implementation


The scheme is simple: if the user has the flag enabled, take the new code path; if not, take the old one:

if ($user->hasFeature(UserFeatures::FEATURE_1)) {
  // new version
} else {
  // old version
}

With this approach we can test the refactoring in production on ourselves first, and only then roll it out to users.

Almost from the very start of the project we had a primitive implementation of feature flags. In the database, the two core entities, user and account, got a features field that was a bit mask. In the code we registered new constants for features, which we then added to the mask when a specific feature became available to the user.

public const ALLOW_FEATURE_1 = 0b0000001;
public const ALLOW_FEATURE_2 = 0b0000010;
public const ALLOW_FEATURE_3 = 0b0000100;

Usage in the code looked like this:

if ($user->hasFeature(UserFeatures::ALLOW_FEATURE_1)) {
  // feature 1 logic
}
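
Under the hood such a check is just a bitwise AND against the features mask. As a rough illustration (a hypothetical sketch, not our actual classes), hasFeature and granting a feature could look like this:

// hypothetical sketch of a bit-mask backed feature check
class User
{
  // the "features" column from the database, stored as an integer bit mask
  public int $features = 0;

  public function hasFeature(int $feature): bool
  {
    // a feature is enabled if its bit is set in the mask
    return ($this->features & $feature) === $feature;
  }

  public function grantFeature(int $feature): void
  {
    // enabling a feature sets its bit; the mask is then persisted back to the DB
    $this->features |= $feature;
  }
}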

When refactoring, we usually first open the flag to the team for testing, then to several users who actively use the feature, and finally to everyone. Sometimes, though, more complex schemes come up; more on them below.

Refactoring an overloaded spot


One of our systems accepts Facebook webhooks and processes them through a queue. Queue processing stopped keeping up, and users began receiving some messages with a delay, which could critically affect the experience of bot subscribers. We began refactoring this area by moving processing to a more elaborate queue scheme. The area is critical, so rolling the new logic out to all servers at once was dangerous; instead we put it behind a flag and were able to test it in production. But what happens when we open this flag for everyone? How will our infrastructure behave? This time we rolled the flag out server by server and watched the metrics.

We have divided all critical data processing into clusters, and each cluster has an id. We decided to simplify testing of such complex refactoring by enabling the feature flag only on certain servers; the check in the code looks like this:

if ($user->hasFeature(UserFeatures::CGT_REFACTORING) ||
    \in_array($cluster, Configurator::get('cgt_refactoring_cluster_ids'))) {
  // new version
} else {
  // old version
}

First we deployed the refactoring and enabled the flag for the team. Then we found several users who actively used the CGT feature, enabled the flag for them, and checked that everything worked for them. And finally, we began enabling the flag on the servers and watching the metrics.

The cgt_refactoring_cluster_ids value can be changed through the admin panel. Initially we set cgt_refactoring_cluster_ids to an empty array, then add one cluster at a time - [1] - watch the metrics for a while, add another cluster - [1, 2] - and so on until the entire system has been tested.

Configurator implementation


A few words about what Configurator is and how it is implemented. It was written so that logic could be changed without a deployment, for example, as in the case above, when we need to roll logic back instantly. We also use it for dynamic configuration: when you need to try out different cache lifetimes, you can pull the value out into Configurator for quick experimentation. For the developer it looks like a list of fields in the admin panel whose values can be changed. We store all of this in the database, cache it in Redis, and keep a static cache inside our workers.
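
A minimal sketch of how such a Configurator could be put together, assuming a Redis client and a simple DB helper (the names RedisPool, Db and the configurator table are illustrative, not our real code):

class Configurator
{
  private const TTL = 60; // how long a value lives in Redis, in seconds

  /** @var array<string, mixed> static cache that lives as long as the worker process */
  private static array $local = [];

  public static function get(string $key)
  {
    // 1) static cache inside the worker
    if (\array_key_exists($key, self::$local)) {
      return self::$local[$key];
    }

    // 2) shared Redis cache (RedisPool is an assumed wrapper around phpredis)
    $cached = RedisPool::connection()->get('configurator:' . $key);
    if (false !== $cached) {
      return self::$local[$key] = \json_decode($cached, true);
    }

    // 3) the source of truth: the database, values stored as JSON (Db is an assumed helper)
    $raw = Db::fetchValue('SELECT value FROM configurator WHERE name = ?', [$key]);
    $value = null !== $raw ? \json_decode($raw, true) : null;

    RedisPool::connection()->setex('configurator:' . $key, self::TTL, \json_encode($value));

    return self::$local[$key] = $value;
  }
}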

Refactoring legacy code


The next quarter we refactored the registration logic, preparing for a move to registration through several services. In our case it is impossible to shard the registration logic so that a given user is always tied to a given code path, so we did not come up with anything better than testing the new logic on a percentage of all registration requests. This is easy to do with flags in a similar way:

if (Configurator::get('auth_refactoring_percentage') > \random_int(0, 99)) {
  // new version
} else {
  // old version
}

Accordingly, we set the value of auth_refactoring_percentage in the admin panel anywhere from 0 to 100. Of course, we covered the entire authorization logic with metrics to make sure that, in the end, we had not hurt conversion.

Metrics


To show how we watch metrics while opening up flags, let's look at one more case in detail. ManyChat receives webhooks from Facebook when a subscriber sends a message in Facebook Messenger, and we must process each message according to the business logic. For the CGT feature, we need to determine whether the subscriber started the conversation through a comment on a Facebook post, so that we can send them a relevant reply. In the code this looks like determining the current subscriber's context: if we can determine the widgetId, we use it to pick the reply message.


Previously we determined the context in three ways; it looked roughly like this:

function getWidgetIdContext(User $user, WatchService $watcher): ?int
{
  // the context was saved on the user earlier
  if (null !== $user->gt_widget_id_context) {
    $watcher->logTick('cgt_match_processor_matched_via_context');

    return $user->gt_widget_id_context;
  }

  // otherwise try to match by thread or by conversation
  if (null !== $user->name) {
    $widgetId = $this->cgtMatchByThread($user);
    if (null !== $widgetId) {
      $watcher->logTick('cgt_match_processor_matched_via_thread');

      return $widgetId;
    }

    $widgetId = $this->cgtMatchByConversation($user);
    if (null !== $widgetId) {
      $watcher->logTick('cgt_match_processor_matched_via_conversation');

      return $widgetId;
    }
  }

  return null;
}

The watcher service sends an analytics event at the moment of a match, so we had metrics for all three cases:

The number of times the context was found by each of the matching methods, over time.

Next, we found a new matching method that should replace all the old ones. To test it, we added one more metric:

function getWidgetIdContext(User $user, WatchService $watcher): ?int
{
  // the new matching method
  $widgetId = $this->cgtMatchByEcho($user);

  if (null !== $widgetId) {
    $watcher->logTick('cgt_match_processor_matched_via_echo_message');
  }

  // the old matching logic stays in place for now
  // ...
}

At this stage we want to make sure that the number of new matches equals the sum of the old matches, so we only log the metric without returning $widgetId:

The number of contexts found by the new method fully covers the sum of matches by the old methods.

But this still does not guarantee that the matching logic is correct in every case. The next step is gradual testing by opening up the flags:

function getWidgetIdContext(User $user, WatchService $watcher): ?int
{
  // the new matching method
  $widgetId = $this->cgtMatchByEcho($user);

  if (null !== $widgetId) {
    $watcher->logTick('cgt_match_processor_matched_by_echo_message');

    // return the new result only if the flag is enabled
    if ($this->allowMatchingByEcho($user)) {
      return $widgetId;
    }
  }

  // ...
}

function allowMatchingByEcho(User $user): bool
{
  // the flag is enabled for this user
  if ($user->hasFeature(UserFeatures::ALLOW_CGT_MATCHING_BY_ECHO)) {
    return true;
  }
  // the flag is enabled for this server's cluster
  if (\in_array($this->clusterId, Configurator::get('cgt_matching_by_echo_cluster_ids'))) {
    return true;
  }

  return false;
}

Then the testing process began: at first we tested the new functionality on ourselves in all environments and on random users who use matching often, by enabling the UserFeatures::ALLOW_CGT_MATCHING_BY_ECHO flag. At this stage we caught a few cases where matching worked incorrectly and fixed them. Then we started rolling out to the servers: on average one server per day over the course of a week. Before such testing we warn support to keep a close eye on tickets related to the functionality and tell us about any oddities. Thanks to support and users, several corner cases were fixed. And finally, the last step was enabling it for everyone unconditionally:

function getWidgetIdContext(User $user, WatchService $watcher): ?int
{
  $widgetId = $this->cgtMatchByEcho($user);

  if (null !== $widgetId) {
    $watcher->logTick('cgt_match_processor_matched_by_echo_message');

    return $widgetId;
  }

  return null;
}

The new feature flag implementation


The feature flag implementation described at the beginning of the article served us for about three years, but as the team grew it became inconvenient: we had to deploy when creating each flag and remember to clear flag values (we reused constant values for different features). Recently the component was rewritten, and now we can manage flags flexibly through the admin panel. Flags were decoupled from the bit mask and are stored in a separate table, which makes it easy to create new flags. Each entry also has a description and an owner, so flag management has become more transparent.
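
For illustration, a table-backed check could look roughly like this (the table layout and the class below are a hypothetical sketch; the real schema is not shown in the article):

// hypothetical sketch; assumed tables: feature_flag(id, name, description, owner)
// and user_feature_flag(user_id, flag_id); Db is an assumed DB helper
class FeatureFlags
{
  public function isEnabledForUser(string $flagName, int $userId): bool
  {
    // one lookup instead of a bit-mask check; creating a new flag is just an INSERT
    // from the admin panel, with no deployment and no reuse of constant values
    return (bool) Db::fetchValue(
      'SELECT 1
         FROM feature_flag f
         JOIN user_feature_flag uf ON uf.flag_id = f.id
        WHERE f.name = ? AND uf.user_id = ?
        LIMIT 1',
      [$flagName, $userId]
    );
  }
}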

Downsides of this approach


This approach has one big downside: there are two versions of the code, and both have to be supported at the same time. When testing, you have to remember that there are two branches of logic and check them both, and that is painful. During development there were situations when we made a fix in one branch but forgot about the other, and at some point it blew up. So we apply this approach only in critical places and try to get rid of the old version of the code as quickly as possible. The rest of our refactoring we try to do in small iterations.

Summary


The current process looks like this: first we put the new logic behind a flag condition, then we deploy and begin to gradually open the flag. While the flag is being rolled out we closely monitor errors and metrics; as soon as something goes wrong, we immediately roll the flag back and deal with the problem. The upside is that opening or closing a flag is very fast: it is just a change of a value in the admin panel. After some time we cut out the old version of the code; this window should be as short as possible so we don't have to maintain changes in both versions. It is important to warn colleagues about such refactoring. We do reviews through GitHub and use code owners during such refactoring so that changes do not get into the code without the knowledge of the refactoring's author.

Just recently I rolled out a new version of the Facebook Graph API. We make more than 3,000 requests per second to the API, and any error is expensive for us. So I rolled the change out under a flag with minimal impact: we managed to catch one unpleasant bug, test the new version, and eventually switch over to it completely without any worries.
