Product Experimentation — The Why And The How

What is Product Experimentation?
So, how do you decide whether to call it the ‘Register’ or the ‘Sign up’ button? Your product probably raises many such questions that leave your mind divided. The answer is experimentation (or A/B testing). Experimentation is a scientific method for proving that a change caused the observed impact. The name “A/B testing” refers to the two groups (A and B) that are compared. It is widely used in product management to safeguard key business metrics and avoid regressions when a feature is about to ship.
Product Experimentation Goals
Experimentation is an essential part of being an agile team. You make bets, build software, track results and real-world feedback, and then feed what you learn into your next sprint. Experimentation can serve several goals in your product, including avoiding bias, proving value, and finding the fun.
Avoiding Bias
Rely on data instead of opinion. There are times when you are pulled in different directions by opinions from many stakeholders. You may feel uncertain, yet find it hard to challenge an opinion coming from, say, someone senior in leadership. Experimentation is a handy tool in such situations.
Proving value
Promote a culture that celebrates moving the needle, not just shipping. You can experiment with several ideas, measure which one has the most impact on your goal, and then confidently roll out the high-impact feature to your customers.
Finding the fun
What sounds good in meetings doesn’t always work, and users often surprise us. Make small initial investments in several ideas, see which ones resonate with users, and then invest big in those. You will be surprised by the results: many good ideas you were rooting for simply don’t work for end customers. You can carry these learnings into future decisions, even ones you make without running an experiment.
Experimentation OR Staged Rollout?
Just to call it out clearly, there is a difference between experimentation and a staged rollout. A staged rollout gradually exposes a feature to increasing traffic, while experimentation is about testing a hypothesis and deciding whether to launch a feature based on statistically significant results.
Staged Rollout
- A controlled way to propagate a feature across rings
- Increase traffic allocation gradually across rings.
- If nothing breaks, the feature will be shipped to 100%
- Feature Request > Build a Feature > Ring Phase Rollout > Ship Decision
Experimentation
- Make a data-based decision before shipping a feature.
- Iterate, identify regressions, and determine the value of the new feature
- Review scorecard and make ship decision
- Feature Request > Establish a Hypothesis > Build a Feature > Run an Experiment > Analyze Results > Make a Ship Decision
What Makes A Good Product Experiment?
The good (and bad) news is that you can experiment with pretty much anything. All you need is a hunch and a way to test it. However, if you want your tests to lead to actionable insights, they need to include a few specific components:
- Problem: A good experiment solves a real user problem. These problems are discovered through data analysis, customer insights, market research, and experience. Even better is if these problems connect to a central product theme to allow for compounding returns.
- Users: Good experiments have a clear ‘audience size’: the fraction of people who will see the change. You need to know who will be impacted by this change and their typical behaviours so you can look out for knock-on effects.
- Benefit: Good experiments know the benefit they’re giving to users and consider trade-offs. If you increase pricing, will you decrease user retention?
- Feature: Good experiments provide one or more solutions that you believe will create value for users. These are ‘informed bets’: you know there’s a chance they’ll flop, but you’ll still learn even if they do.
- Data: Good experiments are built around a ‘North Star metric’: some ultimate goal you want to move. Most experiments collect hundreds of data points and can change user behaviour in all sorts of ways, so you need to go in knowing which metric you’re trying to change and by how much.
Above all else, good experimentation connects to your mission, values, and goals. As Fareed Mosavat, former Director of Product at Slack, says:
“Good experiments advance product strategy. Bad experiments only advance metrics.”
Product Experimentation Process
Experiment Design
An experiment has several components that must be taken care of during design. It is critical to start with a hypothesis and then define the success metric, guardrail metrics, target audience, and target platform.
- Hypothesis: A hypothesis is the assumption you are making about the impact of shipping this feature, and the metrics you track must be able to prove it true or false. Formulate it clearly: if “this change is made”, then “the metric will increase/decrease”, because “the user experience will have changed in this way”. For example: if “I move my primary CTA button from the left to the right side of the page”, then “conversion from this page will increase by x%”, because “users will find it more convenient to click the CTA button”.
- Success Metrics: What metric defines success for your feature? Be careful when choosing it; you cannot look at this metric in a silo. It might happen that this metric goes up while a correlated metric goes down, so be mindful of your guardrail metrics when defining your success metric.
- Guardrail Metrics: Guardrail metrics (also called counter metrics) are business metrics that indirectly measure business value and flag potentially misleading or erroneous results and analysis. These are the metrics you must not harm while trying to improve the metrics in your success criteria. Page load time, app crashes, and unsubscribe rate are examples of counter metrics.
- Target Audience: Who will be affected by your experiment? Choose between consumer, enterprise, or whatever other customer segments your product targets.
- Target Platform: Select your target surface across Web/Android/iOS/Windows/Mac or any other platform depending on where your targeted customer base is present.
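To make these components concrete, here is a minimal sketch in Python of how a design could be captured before any build work starts. The class and its field values are illustrative (they reuse the CTA-button example from the hypothesis bullet), not part of any specific experimentation tool:

```python
from dataclasses import dataclass

@dataclass
class ExperimentDesign:
    """Container for the design components described above."""
    hypothesis: str          # the if/then/because statement
    success_metric: str      # the metric that defines success
    guardrail_metrics: list  # metrics that must not regress
    target_audience: str     # e.g. "Consumer" or "Enterprise"
    target_platform: str     # e.g. "Web", "Android", "iOS"

# Hypothetical example: the CTA-button experiment from the hypothesis bullet.
design = ExperimentDesign(
    hypothesis=("If the primary CTA button moves from left to right, "
                "then page conversion will increase by x%, "
                "because users will find it more convenient to click."),
    success_metric="page_conversion_rate",
    guardrail_metrics=["page_load_time", "unsubscribe_rate"],
    target_audience="Consumer",
    target_platform="Web",
)
```

Writing the design down in one place like this forces the team to agree on the hypothesis and guardrails before anyone sees results.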
Experiment Creation
Product managers need firm decisions and agreement before creating an experiment. Discuss the intention and the goal of the experiment with your team before looking at results. What is the goal? Is it to increase conversion, engagement, retention, acquisition, or something else? Once that is clear, you can create a variant version designed to drive this goal, keeping the existing feature as the control version. You need not worry much about experiment parameters like sample size, experiment length, traffic allocation, etc. Tools like Optimizely, VWO, and Apptimize can easily help you set these parameters.
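Even though the tools set sample size for you, it helps to understand the arithmetic behind it. Below is a hedged sketch using the standard two-proportion normal approximation; the defaults of 5% significance and 80% power are common conventions, not requirements from any particular tool:

```python
import math
from statistics import NormalDist

def sample_size_per_variant(p_baseline, mde_abs, alpha=0.05, power=0.8):
    """Approximate users needed per variant to detect an absolute lift
    of `mde_abs` over a baseline conversion rate `p_baseline`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # power requirement
    p_variant = p_baseline + mde_abs
    variance = (p_baseline * (1 - p_baseline)
                + p_variant * (1 - p_variant))
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)

# Detecting a 2-point lift on a 10% baseline needs a few thousand users per variant.
n = sample_size_per_variant(0.10, 0.02)
```

Notice how sensitive the result is to the minimum detectable effect: halving it roughly quadruples the required sample, which is why small experiments often come back inconclusive.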
Experiment Analysis
Once your experiment reaches statistical significance, the tool will send you a notification showing the experiment results, the winning version, and the confidence level. Based on the results, you can decide to either roll out to 100% of traffic or roll back the experiment and iterate. You will take unbiased learnings from each experiment whether you prove or disprove your hypothesis. For example, sometimes your favourite variant does not win, but the result still shows you that you had the wrong bias to begin with. Experimentation is really helpful in exposing such biases.
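For readers curious what the tool is doing under the hood, the significance check for a conversion-rate experiment is typically a two-proportion z-test. This is a minimal sketch of that test with made-up numbers, not any vendor’s actual implementation:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_pvalue(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-sided p-value for the difference between two conversion rates,
    using a pooled z-test. p < 0.05 is the usual significance cutoff."""
    rate_a = conversions_a / visitors_a
    rate_b = conversions_b / visitors_b
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical numbers: control converts 100/1000, variant 130/1000.
p = two_proportion_pvalue(100, 1000, 130, 1000)  # below 0.05, so significant
```

If the same variant had converted 101/1000, the p-value would be far above 0.05 and the scorecard would come back inconclusive, which is exactly the situation the sample-size math above tries to avoid.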
Iterate/Ship
If you are satisfied with the results of your variant version, you can decide to roll it out to a bigger audience. If not, iterate on the variant, learn from the results, and develop new variations or hypotheses. It is critical to fine-tune your experiment based on the results achieved. You might fail in the first or even the second iteration, but if you keep refining the experiment and adjusting the user experience based on what the previous iteration taught you, you will eventually land on a successful result.
Concluding the Experiment
Once your analysis is done, it's time to decide whether to ship the feature or continue iterating.
Check if the scorecard is inconclusive due to low metric power
- If results are inconclusive during the first or second week, it could be because your feature coverage is low, or because the feature did not yield the results you were expecting.
- If your feature has low coverage, remember to create a triggered scorecard; you may choose to keep the experiment running longer to improve your metric power.
- If your feature has high coverage and results are still inconclusive in the 14-day scorecard, stop the experiment and decide whether to iterate or abandon.
Ship if the scorecard is flat or neutral and no impact is expected
- If results are neutral and this is expected given the nature of the feature, take the ship call.
- Likewise, if results are not strictly neutral but the negative movements are offset by positive movements, take the ship call.
Ship if positive movements
- If results are positive, or are net positive because the negative movements are offset by important positive movements, ship.
- If multiple variants are positive, pick the variant with the clearest net positive movement of statistically significant metrics.
- If there is no clear difference, pick based on best fit with product strategy or on considerations other than metric impact (e.g. cost, compliance).
Iterate/Abandon if Undesirable/Unexpected Results
- If highly statistically significant regressions appear unexpectedly, stop the experiment and decide whether to iterate or abandon.
- If the scorecard is flat but you expected positive movements, there is no benefit in shipping as is; iterate or abandon.
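The decision tree above can be summarised in a short sketch. The argument names and return labels here are hypothetical, chosen only to show the branching logic:

```python
def ship_decision(conclusive, net_movement, neutral_expected=False):
    """Map a scorecard summary to a decision, following the rules above.

    conclusive: whether the scorecard reached statistical significance
    net_movement: "positive", "neutral", or "negative" after offsetting
                  negative movements against positive ones
    neutral_expected: whether a flat result was anticipated for this feature
    """
    if not conclusive:
        return "extend or iterate"   # low power: run longer or rethink coverage
    if net_movement == "positive":
        return "ship"                # net positive movement
    if net_movement == "neutral" and neutral_expected:
        return "ship"                # flat, but no impact was expected
    return "iterate or abandon"      # unexpected flat result or regressions
```

Encoding the ship criteria like this, even informally, keeps the team honest: the decision is agreed before the scorecard arrives, not negotiated after.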