In this article, we will take a look at experience goals, the part goals play in evaluating the performance of your experiences, and how to use power to gauge success.
Understanding the value driven by an experience is an important validation for your personalization efforts, but that value must also be considered alongside the performance of an experience against its goals.
Being able to analyze the performance of an experience against its goals, understand what importance to attach to goal metrics, and see what is likely to change over time are critical to making timely decisions as you look to maximize the value from your personalization efforts.
Your Experience list page is your first port of call and a great place to get a top-level view of how each of your experiences is performing against its goals. Let's take a look at an example:
By hovering on the uplift, we can see how the experience is performing against each of the defined goals:
Each Qubit experience will have a primary goal and secondary goals. You can add a maximum of five goals for each experience.
Performance against goals is also presented in your Stats card, shown when you open an experience. Let's look at an example:
INFO: When referring specifically to the metric reported for an experience, RPV and RPC refer to revenue from the moment a visitor enters the experience until the moment they leave or the experience ends.
In this next example, we see that the user has added a custom goal:
INFO: Custom goals are a great option where you are looking to evaluate the success of an experience in triggering a QP event or getting the user to interact with a button or a similar UI element.
Performance can also be framed in terms of success and failure. Of course, what success means depends on what you are trying to achieve and goals help us by providing the criteria we will use to evaluate whether an experience has been successful or not.
Using goals and the principles of A/B testing, we can answer focused questions such as:
We provide clear visual cues to help you evaluate your experience against each goal:
For each goal, we report the results of experience variations compared to the experience control. Variations are always compared to the experience control, so, for example, A v B, A v C, A v D, etc.
WARNING: We do not perform direct comparisons between variations, such as B v C or C v D.
By comparing the variation to the control, you always have a solid basis to determine which variation, if any, is having the biggest impact on each of your goals, whether that be conversions, Revenue Per Visitor, or the firing of specific QP events.
Stated simply, the variation that is most successful at achieving the primary goal is the winner.
A goal is considered complete when it has reached statistical significance and achieved an acceptable statistical power.
INFO: Remember that an experience is considered complete only when the primary goal has reached statistical significance.
INFO: Qubit declares a goal to be a winner when the probability of uplift is greater than 95% and the power is greater than 70%.
When your primary goal has reached an acceptable power, we will present the outcome of the experience. There are a number of possible outcomes, each shown as a New finding:
If the probability of uplift is between 80% and 95%¹ for the primary goal, we will report that a variation is performing better than the control.
This means we are getting more confident about the change in uplift being attributable to the experience. However, because the result is not yet statistically significant (>95%), we can't yet declare it a winner:
We have a winner! - In short, your experiment has been a success. We are more than 95%¹ sure that the observed change in uplift for the primary goal is a result of the experience and not some random factor.
In the following example we have observed a 2.95% uplift in conversions for those visitors that saw the experience variation:
¹ 95% is the default winning threshold for all Qubit experiences but you have the option of changing this. See Setting custom statistical thresholds.
Power is the probability of a significant result given random sampling fluctuations around your actual data. In other words, it shows you the importance you can attach to a result. When power is low, there's still a lot that could change in the experiment and it's therefore too early to attach any significance to the result.
If we look at the following example, we would conclude that it's far too early to attach any significance to either uplift or revenue; both could change significantly as more data is gathered:
When you hover over the power meter, we'll provide a breakdown by goal:
INFO: You can alter the power by editing an experience's winning threshold. See Setting custom statistical thresholds.
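To build some intuition for how power responds to the volume of data and the size of the observed effect, the sketch below uses a classical one-sided two-proportion z-test. This is a textbook frequentist approximation that we have swapped in for illustration; it is not the calculation Qubit performs, and the function name `approximate_power` and its parameters are our own assumptions.

```python
from math import sqrt
from statistics import NormalDist


def approximate_power(control_rate: float, variation_rate: float,
                      n_control: int, n_variation: int,
                      alpha: float = 0.05) -> float:
    """Approximate power of a one-sided two-proportion z-test.

    Treats the observed conversion rates as if they were the true rates.
    Rates must be strictly between 0 and 1.
    """
    standard_normal = NormalDist()
    # Standard error of the difference between the two conversion rates.
    standard_error = sqrt(
        control_rate * (1 - control_rate) / n_control
        + variation_rate * (1 - variation_rate) / n_variation
    )
    # Critical value for the chosen significance level (95% by default).
    z_critical = standard_normal.inv_cdf(1 - alpha)
    effect = abs(variation_rate - control_rate)
    # Probability that the test statistic clears the critical value.
    return 1 - standard_normal.cdf(z_critical - effect / standard_error)


# Same observed rates, increasing traffic: power grows with the volume of data.
for visitors_per_group in (2_000, 20_000, 200_000):
    power = approximate_power(0.050, 0.055, visitors_per_group, visitors_per_group)
    print(f"{visitors_per_group:>7} visitors per group -> power {power:.2f}")
```

With a fixed observed difference, the reported power rises as each group collects more visitors, and with a fixed volume of data it rises as the observed difference grows.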
This section provides several case studies illustrating how you can use power forecasts to gauge the present and future success of an experiment. This allows for more nuanced and better-justified decisions than simply terminating the experiment or continuing to wait.
The experiment has collected many visitors and converters and shows a statistically significant probability of an uplift (~97%). The power forecast is increasing as we forecast into the next few weeks.
It's looking good, it's probably going to be a winner! Since the power forecast is increasing, we can decide whether to funnel more traffic into the likely winner by switching to a supervised 95/5 allocation.
The experiment has a decent number of visitors and converters and a not-yet-significant probability of a downturn (~14%). Its power is low today but is increasing quickly as we forecast into the future.
It's probably a loser. It might be worth ending early, but if you're willing to wait a little bit to be sure, you can use the forecast to weigh the opportunity cost of this information alongside the raw cost of the likely-losing metric.
The experiment has a very broad audience, is relatively new, has few visitors and converters, a ~50% probability of an uplift, and a low power (both today and far into the future).
The Bayesian prior is the main contribution to the results, and the forecasts mainly reflect this prior, rather than the data. Since the prior is conservative, the power forecast is also conservative. The experiment is still new, so we should let it run some more and collect more data.
The experiment is very targeted, so it has few visitors and converters despite being very old, and a low power (both today and far into the future). The uplift is reported as (0.5±1)%, with a statistically insignificant ~70% probability of an uplift.
The experience is unlikely to drive a large uplift, and even if it drives a small uplift we won't have enough traffic to prove it. We should give up on rigorously testing this experience, and push the variation of the A/B test to 100% of the small audience's traffic, secure in the knowledge that at least the added personalization is not worse than the control.
The experiment has many visitors and converters, a ~50% probability of an uplift, and a low power (both today and far into the future).
If the power is low today and in the future, then there is no reason to think the experiment is "moving the needle" or will do so in the future. The large amount of gathered data means that, to your users, the variation is effectively indistinguishable from the control. It may be worth declaring the experiment futile and moving on to something else.
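The five case studies above can be summarized as a rough decision aid. The sketch below is our own simplification and not a Qubit feature: the function `suggest_action`, its parameters, and the 20% "likely loser" cut-off are assumptions, and we read the ~14% in the second case study as the probability of an uplift. The `many_visitors` and `is_new` flags stand in for judgments you would make yourself.

```python
def suggest_action(prob_uplift: float, power_now: float, power_forecast: float,
                   many_visitors: bool, is_new: bool,
                   winning_threshold: float = 0.95,
                   losing_threshold: float = 0.20,    # assumed cut-off, not from the article
                   acceptable_power: float = 0.70) -> str:
    """Suggest a next step based on today's power and the forecast power."""
    stays_low = power_now < acceptable_power and power_forecast < acceptable_power
    if stays_low:
        if many_visitors:
            # Plenty of data but no movement: effectively the same as the control.
            return "futile: consider moving on"
        if is_new:
            # The prior still dominates the results: collect more data.
            return "keep running"
        # Old, highly targeted experience: traffic will never be enough to prove a small uplift.
        return "stop testing: consider pushing the variation to 100%"
    if prob_uplift >= winning_threshold:
        return "likely winner: consider a supervised 95/5 allocation"
    if prob_uplift <= losing_threshold:
        return "likely loser: consider ending early"
    return "keep running"


print(suggest_action(0.97, 0.60, 0.85, many_visitors=True, is_new=False))
print(suggest_action(0.50, 0.10, 0.15, many_visitors=True, is_new=False))
```

The thresholds are keyword arguments so that you can adjust them to your own appetite for risk; the suggestions are prompts for a decision, not a substitute for one.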
We have replaced sample size with something called statistical power. It shows the likelihood that a significant result was achieved and accounts for both the volume of data collected and the observed results.
Power is an improvement on sample size. Whereas sample size uses a pre-determined effect size to determine if we have reached 'sample', based on the observed number of visitors, power takes into account the volume of data collected AND the observed results to give a more accurate representation of the current reliability of the test results.
We were initially looking at a method of making it easier to "call" an experiment early, including faster stats, less precision, etc. Statistical power allows us to look at the volume of data collected and the currently observed results and make a prediction about whether it "is worth waiting for more data".
No, sample size is still a very effective way of determining the results of an experiment but is less effective at helping us to make decisions earlier.
If the power meter isn't increasing over time, you will not learn any more if the current results and volume of data stay the same. In this scenario, you might consider "calling" the experiment early or changing the traffic allocation to 100%.
See Using power forecasts to gauge success for more information about the decisions you can take in our illustrative case studies.
Since the release of Statistical Power, we no longer use the default or custom effect size to determine if we have reached 'sample' based on the observed number of visitors. Power takes into account the volume of data collected AND the observed results to give a more accurate representation of the current reliability of the test results.
These metrics will not display if you have selected the All traffic allocation mode. See Traffic allocation for more information.
Experiences in pilot mode are usually "risky" experiences. They are typically run in pilot mode to validate that they aren't driving massive downturns, rather than to detect small uplifts.
Power is the probability of detecting an uplift of the measured size, assuming that the measured uplift is real (not just a statistical fluctuation). The higher the power, the more certain you can be that you haven't "missed out" on a significant effect in your data.
If a risky experience running in pilot mode is driving a massive negative effect, the power will become large quickly, and you should end the test once it reaches whichever power you're comfortable with.
If this same experience is not driving a massive negative effect, and instead has a negligible or positive effect, the power will grow slowly, much more slowly than with a 50/50 split. Once you've de-risked the variation, you should republish the experiment to 50/50 in order to benefit from this optimal traffic allocation.
For both of these reasons, you'll find yourself typically ending your "pilot" tests a little earlier than your 50/50 tests.
Most tests require a large number of visitors in each variation to reach ~70% power. If a customer has only a 5% chance of being put in the control, or a 20% chance of being put in the variation, it will take a much longer time to complete that test. 50/50 is the fastest possible A/B test.
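A quick calculation shows why an even split is the fastest. Under the simplifying assumption that every visitor contributes roughly the same variance in both groups, the variance of the measured difference is proportional to 1/(control share) + 1/(variation share), which is smallest at 50/50. The helper `relative_traffic_needed` is a hypothetical name used only for this illustration.

```python
def relative_traffic_needed(control_share: float) -> float:
    """Total traffic needed for a given precision, relative to a 50/50 split.

    Assumes equal per-visitor variance in the control and the variation.
    """
    if not 0 < control_share < 1:
        raise ValueError("control_share must be strictly between 0 and 1")
    variation_share = 1 - control_share
    # Variance of the difference scales with 1/control_share + 1/variation_share.
    # At 50/50 that sum is 4, so dividing by 4 normalizes the even split to 1.
    return (1 / control_share + 1 / variation_share) / 4


for share in (0.50, 0.20, 0.05):
    print(f"{share:.0%} control: {relative_traffic_needed(share):.2f}x the traffic of 50/50")
```

Under these assumptions an 80/20 split needs roughly 1.56 times the traffic of a 50/50 test to reach the same precision, and a 95/5 split needs roughly 5.26 times as much.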
Although this can differ between clients and depends on the configuration of your property, typically the QProtocol events used to report conversions and revenue are identified in the following table:
Goals are attributed for two weeks after an experience is paused to cater for an experience influencing a visitor who purchases slightly later. We believe this is the most accurate way to handle the changes induced by iterations.
A visitor's conversions and other goals are counted towards experience results only if achieved during the same iteration in which the visitor entered the experience.
Indeed, the statistics must not carry conversions or goals across iterations, because both the experience itself and the visitor's allocation to control or variation may well have changed.
Multiple iterations can therefore delay the achievement of statistical significance.
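The attribution rules above can be sketched as a simple filter. This is an illustration only and does not reflect how Qubit stores or processes events: the `GoalEvent` class, its fields, and the `counts_towards_results` function are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Goals are attributed for two weeks after an experience is paused.
ATTRIBUTION_WINDOW = timedelta(weeks=2)


@dataclass
class GoalEvent:
    """Hypothetical record of a goal achieved by a visitor."""
    entry_iteration: int   # iteration in which the visitor entered the experience
    event_iteration: int   # iteration that was live when the goal was achieved
    occurred_at: datetime


def counts_towards_results(event: GoalEvent, paused_at: Optional[datetime] = None) -> bool:
    """Return True if the goal should be counted towards the experience results."""
    # A goal counts only within the iteration in which the visitor entered.
    if event.event_iteration != event.entry_iteration:
        return False
    # After a pause, goals are still attributed for two weeks, then ignored.
    if paused_at is not None and event.occurred_at > paused_at + ATTRIBUTION_WINDOW:
        return False
    return True


paused = datetime(2024, 3, 1)
late_purchase = GoalEvent(entry_iteration=3, event_iteration=3,
                          occurred_at=datetime(2024, 3, 10))
print(counts_towards_results(late_purchase, paused_at=paused))
```

Because each new iteration resets which goals can be counted, publishing many iterations spreads the data across them, which is why significance can take longer to reach.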