A crucial assumption behind an A/B test is that the only difference between test and control is the feature we are testing. This implies that the test and control user distributions are comparable. If this holds, we can attribute any change in the metric we are testing to the feature itself.

Comparable test and control user distribution means that, for each relevant segment, the relative proportion of users in test and control is similar. That is, if US users are 10% of users in the test group, we expect to also have ~10% of US users in control. If we have 50% of repeat users in test, we should have a similar percentage in control, and so on.
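To make "comparable" concrete, here is a tiny Python sketch (made-up toy data, not the article's dataset) that computes each segment's share within test and within control using pandas.crosstab:

```python
import pandas as pd

# toy assignment data, made up purely for illustration
toy = pd.DataFrame({
    "country": ["US", "US", "UK", "UK", "US", "UK", "US", "UK"],
    "test":    [1,    0,    1,    0,    1,    0,    0,    1],
})

# share of each country within test (column 1) and within control (column 0)
shares = pd.crosstab(toy["country"], toy["test"], normalize="columns")
print(shares)
```

If randomization worked, each row of `shares` should be roughly equal across the two columns.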

From a purely statistical standpoint, the above should hold given a large enough number of users. And since in A/B testing we are looking for very small gains, sample sizes are large, so test and control distributions should end up the same.
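As a quick sanity check of that claim, we can simulate a fair coin-flip assignment over a couple hundred thousand users and confirm that a segment covering 10% of users shows up at roughly 10% in both groups (numpy sketch; the numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# a segment covering ~10% of users, assigned to test/control by a fair coin
is_us = rng.random(n) < 0.10
test = rng.integers(0, 2, n).astype(bool)

share_test = is_us[test].mean()
share_control = is_us[~test].mean()
print(share_test, share_control)  # both close to 0.10
```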

In practice, it is pretty common for test and control distributions to differ, invalidating the test results. The number one reason is a bug or bias in the randomization algorithm that assigns users to test and control, leading to over- or under-representation of certain segments. For example, we might have more US users in control; if those users have a higher conversion rate, the difference we see in the metric is not driven solely by the feature change we are testing.

It is therefore extremely important to check that test and control distributions are similar before doing the statistical test. Let’s see how.

## Data

Below we have a standard A/B test dataset. You can also download it from here.

### R

#read the data (the file name is a placeholder; point it to wherever you saved the download)
data = read.csv("ab_test_data.csv")
head(data)
  user_id source device browser_language     browser sex age   country test conversion
1       1    SEO    Web               EN      Chrome   M  38     Chile    0          0
2       2    SEO Mobile               ES Android_App   M  27  Colombia    0          0
3       3    SEO Mobile               ES  Iphone_App   M  18 Guatemala    1          0
4       5    Ads    Web               ES      Chrome   M  22 Argentina    1          0
5       8    Ads Mobile               ES Android_App   M  19 Venezuela    1          0
6      11    Ads    Web               ES      Chrome   F  28  Colombia    1          0

Column description:

• user_id: the id of the user. Unique by user

• source: marketing channel: Ads, SEO, Direct. Direct means everything except for Ads and SEO, such as typing the site URL directly into the browser, downloading the app without coming from SEO or Ads, a friend referral, etc.

• device: device used by the user. It can be mobile or web

• browser_language: in browser or app settings, the language chosen by the user. It can be EN, ES, Other (Other means any language except for English and Spanish)

• browser: user browser. It can be: IE, Chrome, Android_App, FireFox, Iphone_App, Safari, Opera

• sex: user sex: Male or Female

• age: user age (self-reported)

• country: user country based on ip address

• conversion: whether the user converted (1) or not (0). This is our label. A test is considered successful if it increases the proportion of users who convert

• test: users are randomly split into test (1) and control (0). Test users see the new translation and control users the old one

### Python

import pandas
pandas.set_option('display.max_columns', 20)
pandas.set_option('display.width', 350)
#read the data (the file name is a placeholder; point it to wherever you saved the download)
data = pandas.read_csv("ab_test_data.csv")
data.head()
  user_id source device browser_language     browser sex age   country test conversion
1       1    SEO    Web               EN      Chrome   M  38     Chile    0          0
2       2    SEO Mobile               ES Android_App   M  27  Colombia    0          0
3       3    SEO Mobile               ES  Iphone_App   M  18 Guatemala    1          0
4       5    Ads    Web               ES      Chrome   M  22 Argentina    1          0
5       8    Ads Mobile               ES Android_App   M  19 Venezuela    1          0

Column description:

• user_id: the id of the user. Unique by user

• source: marketing channel: Ads, SEO, Direct. Direct means everything except for Ads and SEO, such as typing the site URL directly into the browser, downloading the app without coming from SEO or Ads, a friend referral, etc.

• device: device used by the user. It can be mobile or web

• browser_language: in browser or app settings, the language chosen by the user. It can be EN, ES, Other (Other means any language except for English and Spanish)

• browser: user browser. It can be: IE, Chrome, Android_App, FireFox, Iphone_App, Safari, Opera

• sex: user sex: Male or Female

• age: user age (self-reported)

• country: user country based on ip address

• conversion: whether the user converted (1) or not (0). This is our label. A test is considered successful if it increases the proportion of users who convert

• test: users are randomly split into test (1) and control (0). Test users see the new translation and control users the old one

## Check A/B Test Randomization

### R

Checking that randomization worked well simply means making sure that all variables have the same distribution in test and control. So, taking for instance the first variable, source, it would mean checking that the proportions of users coming from Ads, SEO, and Direct are the same in both groups.

This can easily be done the following way:

require(dplyr)

data %>%
group_by(source) %>% #variable we care about
summarize(
test_relative_frequency = length(source[test==1])/nrow(data[data$test==1,]),
control_relative_frequency = length(source[test==0])/nrow(data[data$test==0,])
)
# A tibble: 3 x 3
  source test_relative_frequency control_relative_frequency
  <fctr>                   <dbl>                      <dbl>
1    Ads               0.4006414                  0.4012282
2 Direct               0.1995004                  0.2009487
3    SEO               0.3998582                  0.3978231

As we can see, the relative frequencies of the source segments are the same across the two groups. That is, we have basically the same proportion of users coming from Ads, Direct, and SEO in both test and control.

We could potentially keep checking all the variables like this. But it would be extremely time consuming (and boring), especially when you start considering numerical variables and categorical variables with many levels.

So we turn this into a machine learning problem and let an algorithm do the boring work for us. The approach is:

1. Get rid of the conversion variable for now. We don’t care about it here. We are just checking whether the two user distributions are the same; comparing conversion rates comes later

2. Use the variable test as our label. Try to build a model that manages to separate the users whose test value is 0 vs those whose test value is 1. If randomization worked well, this will be impossible because the two groups are exactly the same. If all variable relative frequencies were the same as for source, no model would be able to separate test == 1 vs test == 0. If randomization did not work well, the model will manage to use a given variable to separate the two groups.

3. As a model, pick a decision tree. This will allow you to clearly see which variable (if any) is used for the split. That’s where randomization failed.

library(rpart)
library(rpart.plot)

#prepare the data set to check if randomization worked well
#we don't need this variable
data$user_id = NULL

#make it into a classification problem
data$test = as.factor(data$test)

#build the decision tree
tree = rpart(test ~ .,
             #get rid of the 9th column, conversion, not needed here
             data[,-9],
             #make classes balanced. Easier to understand the output
             parms = list(prior = c(0.5, 0.5))
             )
tree

n= 401085

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 401085 200542.50 1 (0.5000000 0.5000000)
  2) country=Bolivia,Chile,Colombia,Costa Rica,Ecuador,El Salvador,Guatemala,Honduras,Mexico,Nicaragua,Panama,Paraguay,Peru,Venezuela 350218 162347.50 0 (0.5391991 0.4608009) *
  3) country=Argentina,Uruguay 50867 10574.12 1 (0.2168199 0.7831801) *

So we can see that test and control are not the same! Users from Argentina and Uruguay are way more likely to be in test than in control (~78% vs ~22%, respectively). All other countries are OK, with approximately a 50/50 split.

Let’s double check this by looking at the proportion of Argentinian and Uruguayan users in control vs test.

#dummy variables just for Argentina and Uruguay users
data$is_argentina = ifelse(data$country == "Argentina", 1, 0)
data$is_uruguay   = ifelse(data$country == "Uruguay", 1, 0)

data %>%
group_by(test) %>% #let's check for both test and control
summarize(
Argentina_relative_frequency = mean(is_argentina),
Uruguay_relative_frequency = mean(is_uruguay)
)
# A tibble: 2 x 3
    test Argentina_relative_frequency Uruguay_relative_frequency
  <fctr>                        <dbl>                      <dbl>
1      0                    0.0504881                0.002239478
2      1                    0.1732229                0.017235626

Our tree was right! In test, 17% of users are from Argentina, but in control, only 5% of users are from Argentina. Uruguay is even more extreme: test has 1.7% of users from Uruguay and control has basically no Uruguayan users (0.2%).

And this is a big problem because it means we are no longer comparing apples to apples in our A/B test. The difference we might see in conversion rate might very well come from the fact that the users in the two groups are different.

Let’s check it in practice.

#this is the test result using the original dataset
original_data = t.test(data$conversion[data$test==1],
                       data$conversion[data$test==0])

#this is after removing Argentina and Uruguay
data_no_AR_UR = t.test(data$conversion[data$test==1 & !data$country%in%c("Argentina", "Uruguay")],
                       data$conversion[data$test==0 & !data$country%in%c("Argentina", "Uruguay")]
                       )

data.frame(data_type = c("Full", "Removed_Argentina_Uruguay"),
           p_value = c(original_data$p.value, data_no_AR_UR$p.value),
           t_statistic = c(original_data$statistic, data_no_AR_UR$statistic)
)
                  data_type      p_value t_statistic
1                      Full 1.928918e-13  -7.3538952
2 Removed_Argentina_Uruguay 7.200849e-01   0.3583456

Huge difference! The biased test, where some countries are over/under-represented, is statistically significant with a negative t statistic: test appears worse than control! After removing those two countries, we get a non-significant result.

At this point, you have two options:

1. Acknowledge that there was a bug, go talk to the software engineer in charge of randomization, figure out what went wrong, fix it and re-run the test. Note that finding one bug might be a sign that more things are broken, not just the one you found. So when you find a bug, always try to get to the bottom of it

2. If you do find out that everything was fine, but for some reason there was only a problem with those two countries, you can potentially adjust the weights for those two segments so that relative frequencies become the same and then re-run the test

Full course in Product Data Science

### Python

Checking that randomization worked well simply means making sure that all variables have the same distribution in test and control. So, taking for instance the first variable, source, it would mean checking that the proportions of users coming from Ads, SEO, and Direct are the same in both groups.

This can easily be done the following way:

#let's group by source and estimate relative frequencies
data_grouped_source = data.groupby("source")["test"].agg(
frequency_test_0=lambda x: (x == 0).sum(),
frequency_test_1=lambda x: (x == 1).sum())

#get relative frequencies
print(data_grouped_source/data_grouped_source.sum())
        frequency_test_0  frequency_test_1
source
Ads             0.401228          0.400642
Direct          0.200949          0.199500
SEO             0.397823          0.399858

As we can see, the relative frequencies of the source segments are the same across the two groups. That is, we have basically the same proportion of users coming from Ads, Direct, and SEO in both test and control.

We could potentially keep checking all the variables like this. But it would be extremely time consuming (and boring), especially when you start considering numerical variables and categorical variables with many levels.

So we turn this into a machine learning problem and let an algorithm do the boring work for us. The approach is:

1. Get rid of the conversion variable for now. We don’t care about it here. We are just checking whether the two user distributions are the same; comparing conversion rates comes later

2. Use the variable test as our label. Try to build a model that manages to separate the users whose test value is 0 vs those whose test value is 1. If randomization worked well, this will be impossible because the two groups are exactly the same. If all variable relative frequencies were the same as for source, no model would be able to separate test == 1 vs test == 0. If randomization did not work well, the model will manage to use a given variable to separate the two groups.

3. As a model, pick a decision tree. This will allow you to clearly see which variable (if any) is used for the split. That’s where randomization failed.

import graphviz
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from graphviz import Source

#drop user_id, not needed
data = data.drop(['user_id'], axis=1)
#make dummy vars. Don't drop one level here, keep them all. You don't want to risk dropping the one level that actually creates problems with the randomization
data_dummy = pandas.get_dummies(data)
#model features, test is the label and conversion is not needed here
train_cols = data_dummy.drop(['test', 'conversion'], axis=1)

tree=DecisionTreeClassifier(
#reweight classes so the data set is perfectly balanced. This makes the tree output easier to read
class_weight="balanced",
#only split if it's worthwhile. The default of 0 allows any split that improves impurity at all, which creates tons of noisy and irrelevant splits
min_impurity_decrease = 0.001
)
tree.fit(train_cols,data_dummy['test'])

export_graphviz(tree, out_file="tree_test.dot", feature_names=train_cols.columns, proportion=True, rotate=True)
#render the saved dot file
s = Source.from_file("tree_test.dot")
s.view()

So we can see that test and control are not the same! Users from Argentina and Uruguay are way more likely to be in test than control. When country_Argentina is 1, the tree shows ~23% of those users in control and ~77% in test instead of 50/50. For Uruguay, the proportions are even more extreme: 11% in control and 89% in test! Not good!

Let’s double check this manually in our dataset.

print(data_dummy.groupby("test")[["country_Argentina", "country_Uruguay"]].mean())
      country_Argentina  country_Uruguay
test
0              0.050488         0.002239
1              0.173223         0.017236

Our tree was right! In test, 17% of users are from Argentina, but in control only 5% of users are from Argentina. Uruguay is even more extreme: test has 1.7% of users from Uruguay and control has just 0.2% of Uruguayan users.
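To quantify how unlikely such an imbalance is under correct randomization, a chi-square test of independence between group and country works well. The counts below are made-up, back-of-the-envelope numbers consistent with the ~5% vs ~17% Argentina shares above, not the actual dataset:

```python
from scipy.stats import chi2_contingency

# rows: control, test; columns: Argentina, everyone else
# hypothetical counts roughly matching the 5% vs 17% shares above
table = [[10_000, 190_000],
         [34_000, 166_000]]

chi2, p_value, dof, expected = chi2_contingency(table)
print(p_value)  # effectively zero: group and country are not independent
```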

And this is a big problem because it means we are no longer comparing apples to apples in our A/B test. The difference we might see in conversion rate might very well come from the fact that the users in the two groups are different.

Let’s check it in practice:

from scipy import stats

#this is the test result using the original dataset
original_data = stats.ttest_ind(data_dummy.loc[data['test'] == 1]['conversion'],
data_dummy.loc[data['test'] == 0]['conversion'],
equal_var=False)

#this is after removing Argentina and Uruguay
data_no_AR_UR = stats.ttest_ind(data_dummy.loc[(data['test'] == 1) & (data_dummy['country_Argentina'] ==  0) & (data_dummy['country_Uruguay'] ==  0)]['conversion'],
data_dummy.loc[(data['test'] == 0) & (data_dummy['country_Argentina'] ==  0) & (data_dummy['country_Uruguay'] ==  0)]['conversion'],
equal_var=False)

print(pandas.DataFrame( {"data_type" : ["Full", "Removed_Argentina_Uruguay"],
"p_value" : [original_data.pvalue, data_no_AR_UR.pvalue],
"t_statistic" : [original_data.statistic, data_no_AR_UR.statistic]
}))
                   data_type       p_value  t_statistic
0                       Full  1.928918e-13    -7.353895
1  Removed_Argentina_Uruguay  7.200849e-01     0.358346

Huge difference! The biased test, where some countries are over/under-represented, is statistically significant with a negative t statistic: test appears worse than control! After removing those two countries, we get a non-significant result.

At this point, you have two options:

1. Acknowledge that there was a bug, go talk to the software engineer in charge of randomization, figure out what went wrong, fix it and re-run the test. Note that finding one bug might be a sign that more things are broken, not just the one you found. So when you find a bug, always try to get to the bottom of it

2. If you do find out that everything was fine, but for some reason there was only a problem with those two countries, you can potentially adjust the weights for those two segments so that relative frequencies become the same and then re-run the test
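To sketch what option 2 could look like: weight each test user by (control share / test share) of their country, so the two groups end up with matching country mixes, then compare conversion rates. This is an illustrative post-stratification sketch on made-up data, not the article's method; the function name is hypothetical:

```python
import pandas as pd

def weighted_conversion(df):
    """Weight test users so their country mix matches control,
    then compare conversion rates. Illustrative sketch only."""
    # per-country share within control (column 0) and test (column 1)
    shares = pd.crosstab(df["country"], df["test"], normalize="columns")
    # weight for each test user: control share / test share of their country
    w = df.loc[df["test"] == 1, "country"].map(shares[0] / shares[1])
    conv_test = (df.loc[df["test"] == 1, "conversion"] * w).sum() / w.sum()
    conv_control = df.loc[df["test"] == 0, "conversion"].mean()
    return conv_test, conv_control

# toy data: country "A" always converts, "B" never does; test over-samples "A"
toy = pd.DataFrame({
    "country":    ["A"] * 50 + ["B"] * 50 + ["A"] * 80 + ["B"] * 20,
    "test":       [0] * 100 + [1] * 100,
    "conversion": [1] * 50 + [0] * 50 + [1] * 80 + [0] * 20,
})

conv_test, conv_control = weighted_conversion(toy)
print(conv_test, conv_control)  # 0.5 0.5 once the country mix is matched
```

On the toy data, the naive test conversion rate would be 0.8 purely because test over-samples the high-converting country; after reweighting, both groups show 0.5, so the country confound is removed.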
