During a conference last December, it was validating to see other developers presenting on testing enterprise applications in, well…production. We’ve all done it, and let’s be honest, sometimes production is the best environment to see how our code performs in the wild. Whether it’s leveraging nearly zero downtime or seeing how your code stands up to realistic resource loads, it can’t be denied: testing in production has its advantages.

We’ll explore some scenarios where this unorthodox testing helps us accomplish our goals. But first, let’s understand two popular methods of production testing.

Feature Flagging

Feature flagging (sometimes referred to as feature toggling) is a software development practice where the execution path is modified based on an externally controlled setting, or flag. The primary advantage is that flipping the flag requires no code changes, so no deployment is needed to change the application’s behavior. Let’s look at an example…

An online storefront implemented a feature that lets users register with their Google account. While efforts were made to ensure the backend work (e.g., database, APIs) and the consuming channels’ work (e.g., mobile, Web) were done concurrently AND released simultaneously, guess what? The consuming channels’ work was completed and released before the backend. To avoid showing the new Google registration option to users prematurely, the team added conditional logic to show the old registration form based on a feature flag setting. Once the backend work was completed and released, showing guests the new registration form would only require changing the flag.

    if (featureFlagManager.displayGoogleRegistration()) {
        view.displayGoogleRegistration();
    }
    else {
        view.displayRegistration();
    }

The call featureFlagManager.displayGoogleRegistration() dictates which registration form is displayed. There’s no need to review the underlying logic behind this method; just know that the actual setting comes from an external source, typically managed by designated individuals. That source can be an on-prem or cloud-hosted configuration file, or even a third-party tool specializing in feature flag management, like LaunchDarkly or Adobe Target.
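
To make this concrete, here’s a minimal sketch of what a flag manager like this could look like, assuming the external source is a java.util.Properties object loaded from a configuration file. The class and property names are hypothetical, not a specific product’s API.

```java
import java.util.Properties;

// Hypothetical sketch of a feature flag manager backed by an external
// Properties source. In practice the Properties would be loaded from an
// on-prem/cloud-hosted config file or a flag management service.
public class FeatureFlagManager {
    private final Properties flags;

    public FeatureFlagManager(Properties flags) {
        this.flags = flags;
    }

    // A missing or malformed flag defaults to false, so a problem with the
    // flag source fails closed (old behavior) instead of exposing the feature.
    public boolean displayGoogleRegistration() {
        return Boolean.parseBoolean(
                flags.getProperty("displayGoogleRegistration", "false"));
    }
}
```

Note the fail-closed default: if the flag source is unreachable or the key is missing, users see the old registration form rather than a half-wired new one.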

Feature flagging has some drawbacks that must be considered before employing it in your code.

Maintenance Impact

Introducing feature flags carries a potentially high maintenance cost. Using the earlier example, what happens after all the backend and consuming channels’ code is live? We could leave the conditional logic as is, but it’s no longer needed; ideally, it should be removed. Over time, long-lived feature flags become burdensome to manage and introduce unnecessary complexity. To mitigate this, consider time-boxing feature flags by planning future tasks/stories to remove them from the code.

Requires Mature Architecture

Leveraging feature flags is more effective in mature architectures. The application code should be robust and employ good fault tolerance practices. Some questions to consider include:

  • Does the application intelligently handle database or response payload changes in a mature design pattern like Repo/Unit of Work? This is important if adding feature flags that reference response or database data.
  • Are asynchronous API calls chained appropriately, and only when applicable? Think of a scenario where the UI thread’s execution path includes a feature flag to display some element. If the content inside that element comes from an API response that hasn’t been retrieved yet, we could attempt to render content that isn’t available. In short, ensure that race conditions are avoided between different application threads – including feature flag API calls.
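
One way to avoid that race is to resolve both the flag lookup and the content fetch before touching the view. Here’s a sketch using CompletableFuture; the method and future names are illustrative, not from the storefront example’s actual codebase.

```java
import java.util.concurrent.CompletableFuture;

// Sketch: combine the async flag lookup and the async content fetch,
// so rendering only happens once BOTH results exist. This removes the
// window where the flag says "show it" but the content isn't loaded yet.
public class RegistrationLoader {
    public static String render(CompletableFuture<Boolean> flagCall,
                                CompletableFuture<String> contentCall) {
        // thenCombine waits for both futures before applying the function.
        return flagCall.thenCombine(contentCall, (showGoogle, content) ->
                showGoogle ? "google:" + content : "classic")
            .join();
    }
}
```

The key design choice is that the flag value is never consulted in isolation: the rendering decision is a function of the flag *and* the data it gates.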

Accidental Exposure

Feature flagging can be a light-weight, quick-to-implement solution for A/B testing and, if done with care, testing in production. Be mindful, however, that introducing ‘dormant’ features, protected only by a single boolean inside a configuration file, must be carefully managed. The cost of accidentally (or maliciously) setting a flag incorrectly could translate to sensitive content being exposed.

The extra work involved in minimizing these concerns fosters clean code, but it also strengthens the argument for finding alternatives to feature flagging for production testing. For larger, more complex applications and services, there are better, less risky approaches.

Traffic Routing

Another, more robust technique for production testing involves traffic routing using Azure Traffic Manager (ATM). In a nutshell, ATM is a DNS-based load balancer that provides routing methods to determine how network traffic is distributed across different service endpoints. The most relevant method for production testing is the weighted approach. With the weighted traffic-routing method, a weight between 1 and 1000 is assigned to each service endpoint – the higher the weight, the larger the share of traffic that endpoint receives. The following diagram illustrates the weighted method in action.


Figure 1: Weighted traffic-routing method
Source: Microsoft

In this example, there are three weighted endpoints. An available endpoint is selected at random, with probability proportional to its assigned weight. Since Region A is unavailable, it will never be selected. Since Region B is weighted 50 while Test A is weighted 5, Region B will be selected 10 times more often than Test A.
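
To build intuition for the weighted method, here’s a small simulation of that selection logic. This is an illustration of weighted random selection over available endpoints, not ATM’s actual implementation.

```java
import java.util.List;
import java.util.Random;

// Illustrative simulation: filter out unavailable endpoints, then draw a
// random number in [0, totalWeight) and walk the list, so each endpoint
// is chosen with probability weight / totalWeight.
public class WeightedPicker {
    public record Endpoint(String name, int weight, boolean available) {}

    public static String pick(List<Endpoint> endpoints, Random rng) {
        List<Endpoint> live = endpoints.stream()
                .filter(Endpoint::available)
                .toList();
        int total = live.stream().mapToInt(Endpoint::weight).sum();
        int draw = rng.nextInt(total); // 0 .. total-1
        for (Endpoint e : live) {
            draw -= e.weight();
            if (draw < 0) return e.name();
        }
        throw new IllegalStateException("unreachable");
    }
}
```

Running this with the diagram’s values (Region A unavailable, Region B at 50, Test A at 5) never selects Region A and picks Region B roughly ten times as often as Test A.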

Microsoft cautions, though, that because clients cache DNS responses, caching can affect how traffic actually gets routed. With a large number of clients, traffic routing works quite well; with a small number, caching can skew the expected distribution. So in a large, enterprise production environment, client-side caching likely won’t impact production testing. Let’s examine a scenario that would benefit from ATM’s weighted traffic routing.

An online retailer is rolling out a major campaign to eliminate excess inventory. The campaign includes substantial updates to a microservice that is responsible for setting product prices based on remaining supply. The pricing microservice has passed functional and regression testing but has not been load tested.

Time is of the essence and they don’t have time to set up formal load testing before the planned release, so they have chosen to test the latest microservice updates in production with ATM.

Similar to blue-green deployments, the team stands up a new endpoint for the microservice updates. If all goes well, this test becomes the release. The existing endpoint is weighted at 1000, and initially the new endpoint (serving the microservice update) starts at the minimum weight of 1. As these weights are gradually shifted, eventually the old endpoint drops to the minimum (or is disabled) and the new one reaches 1000. Once the campaign is over, the weights can be reverted to their original values.
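
At any step of the shift, the expected split follows directly from the weights: with the old endpoint at wOld and the new one at wNew, the new endpoint receives wNew / (wOld + wNew) of (uncached) requests. A one-line helper makes the ramp-up easy to sanity-check; this is back-of-the-envelope arithmetic, not an ATM API.

```java
// Expected fraction of traffic the new endpoint receives, given the
// current weights of the old and new endpoints.
public class TrafficSplit {
    public static double newEndpointShare(int wOld, int wNew) {
        return (double) wNew / (wOld + wNew);
    }
}
```

For example, shifting from 1000/1 toward 500/500 moves the new endpoint’s share from roughly 0.1% to an even 50/50 split, so each planned weight change maps to a predictable slice of production load.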

Conclusion

Despite the obvious reasons not to test your code in production, there are tools available to help you do it effectively. Feature flagging is great for simple, temporary toggles: it requires minimal effort, and flipping a flag doesn’t require a deployment. If you’re testing basic UI components, for example, adding a temporary feature flag is probably fine.

Azure’s traffic management tool is a more robust option for testing backend logic, and it doesn’t require us to write conditional application code since the routing is handled inside our DevOps pipeline. It does, however, require an established deployment pipeline.