We focus here on logistic regression because the label we are trying to predict (“clicked”) is binary. The overall approach would be similar if you were dealing with a linear regression. After all, a logistic regression can be seen as a linear model with a particular link function (logit), which constrains the predicted output to lie between 0 and 1 so that it can be used for binary classification problems.
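To make the role of the link function concrete, here is a minimal sketch in plain Python. The function names `sigmoid` and `logit` are chosen for illustration and are not part of the code used in this section. It shows how any value of the linear predictor is mapped into the (0, 1) range, and that the logit undoes that mapping.

```python
import math

def sigmoid(z):
    """Inverse of the logit link: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Logit link: maps a probability in (0, 1) to its log odds."""
    return math.log(p / (1.0 - p))

# The linear predictor can be any real number; the output is always a probability
for z in (-5.0, 0.0, 5.0):
    print(z, round(sigmoid(z), 4))
```

The regression is linear on the log odds scale, which is why its coefficients are expressed in log odds rather than in probabilities.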



#Read from google drive. This is the same dataset described in the previous section

#Before building the regression, we need to know which ones are the reference levels for the categorical variables
#only keep categorical variables
dt = data[,sapply(data, is.factor)] 
#find first level. These are the reference levels
sapply(sapply(dt, levels), "[[", 1)
   email_text email_version       weekday  user_country 
 "long_email"     "generic"      "Friday"          "ES" 
#build logistic regression
log.reg = glm(clicked ~ ., data = data, family = binomial) 
#print coefficients and their pvalues
summary(log.reg)$coefficients
                               Estimate   Std. Error     z value      Pr(>|z|)
(Intercept)               -6.880922e+00 1.560551e-01 -44.0928996  0.000000e+00
email_id                  -3.848609e-08 7.780270e-08  -0.4946626  6.208383e-01
email_textshort_email      2.793085e-01 4.530413e-02   6.1651878  7.039953e-10
email_versionpersonalized  6.387251e-01 4.691389e-02  13.6148404  3.268591e-42
hour                       1.670684e-02 5.005810e-03   3.3374906  8.453859e-04
weekdayMonday              5.410326e-01 9.340848e-02   5.7921141  6.950589e-09
weekdaySaturday            2.828638e-01 9.777452e-02   2.8930214  3.815553e-03
weekdaySunday              1.836278e-01 1.001176e-01   1.8341213  6.663599e-02
weekdayThursday            6.254040e-01 9.233836e-02   6.7729595  1.261743e-11
weekdayTuesday             6.162222e-01 9.237057e-02   6.6711960  2.537271e-11
weekdayWednesday           7.554637e-01 9.084352e-02   8.3160993  9.090564e-17
user_countryFR            -7.864558e-02 1.625708e-01  -0.4837621  6.285547e-01
user_countryUK             1.155254e+00 1.220474e-01   9.4656215  2.918203e-21
user_countryUS             1.141360e+00 1.159490e-01   9.8436412  7.301922e-23
user_past_purchases        1.878107e-01 5.725710e-03  32.8012980 5.642459e-236
#clean the output a bit
output = data.frame(summary(log.reg)$coefficients)
#fix column names
colnames(output) = c("Coefficient_Value", "SE", "z_value", "p_value") 
#only keep significant variables
output = subset(output, p_value < 0.05)
#get final results ordered by coefficient value
output[order(output$Coefficient_Value, decreasing=TRUE),]
                          Coefficient_Value         SE    z_value       p_value
user_countryUK                   1.15525449 0.12204740   9.465622  2.918203e-21
user_countryUS                   1.14136025 0.11594899   9.843641  7.301922e-23
weekdayWednesday                 0.75546370 0.09084352   8.316099  9.090564e-17
email_versionpersonalized        0.63872512 0.04691389  13.614840  3.268591e-42
weekdayThursday                  0.62540395 0.09233836   6.772960  1.261743e-11
weekdayTuesday                   0.61622219 0.09237057   6.671196  2.537271e-11
weekdayMonday                    0.54103257 0.09340848   5.792114  6.950589e-09
weekdaySaturday                  0.28286377 0.09777452   2.893021  3.815553e-03
email_textshort_email            0.27930848 0.04530413   6.165188  7.039953e-10
user_past_purchases              0.18781071 0.00572571  32.801298 5.642459e-236
hour                             0.01670684 0.00500581   3.337491  8.453859e-04
(Intercept)                     -6.88092187 0.15605510 -44.092900  0.000000e+00


import pandas
import statsmodels.api as sm
pandas.set_option('display.max_columns', 10)
pandas.set_option('display.width', 350)
#Read from google drive. This is the same dataset described in the previous section
data = pandas.read_csv('XXXXXXXXX')
#Before building the regression, we need to know which ones are the reference levels for the categorical variables
#only keep categorical variables
data_categorical = data.select_dtypes(['object']).astype("category") 
#find reference level, i.e. the first level
print(data_categorical.apply(lambda x: x.cat.categories[0]))
email_text       long_email
email_version       generic
weekday              Friday
user_country             ES
dtype: object
#make dummy variables from categorical ones. Using one-hot encoding and drop_first=True, so the first level of each variable is the reference level. The latest stable version of statsmodels (0.14) requires float conversion
data = pandas.get_dummies(data, drop_first=True).astype('float')
#add intercept
data['intercept'] = 1
#drop the label
train_cols = data.drop('clicked', axis=1)
#Build Logistic Regression
logit = sm.Logit(data['clicked'], train_cols)
output = logit.fit()
Optimization terminated successfully.
         Current function value: 0.092770
         Iterations 9
#get coefficients and pvalues
output_table = pandas.DataFrame(dict(coefficients = output.params, SE = output.bse, z = output.tvalues, p_values = output.pvalues))
print(output_table)
                            coefficients            SE          z       p_values
email_id                   -3.848609e-08  7.780379e-08  -0.494656   6.208432e-01
hour                        1.670684e-02  5.005879e-03   3.337445   8.455247e-04
user_past_purchases         1.878107e-01  5.725787e-03  32.800855  5.725039e-236
email_text_short_email      2.793085e-01  4.530477e-02   6.165101   7.043829e-10
email_version_personalized  6.387251e-01  4.691461e-02  13.614631   3.277989e-42
weekday_Monday              5.410326e-01  9.341014e-02   5.792011   6.954864e-09
weekday_Saturday            2.828638e-01  9.777629e-02   2.892969   3.816190e-03
weekday_Sunday              1.836278e-01  1.001194e-01   1.834088   6.664099e-02
weekday_Thursday            6.254040e-01  9.233999e-02   6.772839   1.262790e-11
weekday_Tuesday             6.162222e-01  9.237223e-02   6.671077   2.539336e-11
weekday_Wednesday           7.554637e-01  9.084515e-02   8.315950   9.102053e-17
user_country_FR            -7.864563e-02  1.625969e-01  -0.483685   6.286097e-01
user_country_UK             1.155255e+00  1.220603e-01   9.464618   2.946372e-21
user_country_US             1.141360e+00  1.159626e-01   9.842487   7.386228e-23
intercept                  -6.880922e+00  1.560666e-01 -44.089646   0.000000e+00
#only keep significant variables and order results by coefficient value
print(output_table.loc[output_table['p_values'] < 0.05].sort_values("coefficients", ascending=False))
                            coefficients        SE          z       p_values
user_country_UK                 1.155255  0.122060   9.464618   2.946372e-21
user_country_US                 1.141360  0.115963   9.842487   7.386228e-23
weekday_Wednesday               0.755464  0.090845   8.315950   9.102053e-17
email_version_personalized      0.638725  0.046915  13.614631   3.277989e-42
weekday_Thursday                0.625404  0.092340   6.772839   1.262790e-11
weekday_Tuesday                 0.616222  0.092372   6.671077   2.539336e-11
weekday_Monday                  0.541033  0.093410   5.792011   6.954864e-09
weekday_Saturday                0.282864  0.097776   2.892969   3.816190e-03
email_text_short_email          0.279308  0.045305   6.165101   7.043829e-10
user_past_purchases             0.187811  0.005726  32.800855  5.725039e-236
hour                            0.016707  0.005006   3.337445   8.455247e-04
intercept                      -6.880922  0.156067 -44.089646   0.000000e+00
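The coefficients above are on the log odds scale. As a rough sketch of how they can be read, the snippet below exponentiates a coefficient to get an odds ratio and pushes a profile through the inverse logit to get a click probability. The dictionary, the helper names (`odds_ratio`, `click_probability`) and the example profile are assumptions made for illustration. The values are rounded copies of the printed output, and `email_id` is ignored, so the probabilities are approximations rather than the model's exact predictions.

```python
import math

# Rounded coefficients copied from the statsmodels output above.
# email_id is left out (tiny, non-significant coefficient), so results are approximate.
COEFFICIENTS = {
    "intercept": -6.880922,
    "hour": 0.016707,
    "user_past_purchases": 0.187811,
    "email_text_short_email": 0.279308,
    "email_version_personalized": 0.638725,
    "weekday_Wednesday": 0.755464,
    "user_country_UK": 1.155255,
}

def odds_ratio(name):
    """Multiplicative change in the odds of clicking for a one-unit increase."""
    return math.exp(COEFFICIENTS[name])

def click_probability(profile):
    """Approximate click probability for a hypothetical profile.

    `profile` maps variable names to values. Variables that are missing
    count as 0, i.e. the reference level for dummy variables.
    """
    log_odds = COEFFICIENTS["intercept"]
    for name, value in profile.items():
        log_odds += COEFFICIENTS[name] * value
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical profile: UK user, 5 past purchases, email sent at hour 10 on a Wednesday
base = {"user_country_UK": 1, "user_past_purchases": 5, "hour": 10, "weekday_Wednesday": 1}
personalized = dict(base, email_version_personalized=1)

print("odds ratio, personalized vs generic:", round(odds_ratio("email_version_personalized"), 3))
print("P(click), generic email:     ", round(click_probability(base), 4))
print("P(click), personalized email:", round(click_probability(personalized), 4))
```

The odds ratio is the same for every profile, while the change in probability depends on where the profile starts. This is one reason raw coefficients are hard to translate into business terms.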


Understanding the output

Pros and Cons

Pros of using logistic regression coefficients to extract insights from data

✓ Pretty much anyone in a technical or product management role in a tech company is familiar with logistic regressions (if this is not true at your company, you are probably working in the wrong place). It is so much easier to present data science work if the audience is already familiar with the techniques used

✓ Logistic regression is by far the most used model in production. Despite all the blog posts, conference talks, etc. about deep learning, it is almost guaranteed that a consumer tech company's most important model in production will be a logistic regression. Therefore, it will be easy to collaborate with engineers (e.g. leveraging prior work done by them, helping them improve their model, etc.)

✓ It is simple, fast, and generally reliable. Indeed, building the model is straightforward. The model works well in the majority of cases and all you have to do is look at the coefficient values and their p-values


Cons of using logistic regression coefficients to extract insights from data

✓ Coefficients give an idea of the impact of each variable on the output. But it is actually pretty hard to visualize exactly what that means. I.e., increasing a given variable by one unit changes the log odds of the outcome by \({\beta}\), where \({\beta}\) is the coefficient of that variable. Mmh…

✓ Coefficients do not allow you to segment a variable. For instance, a positive coefficient in front of the variable age means that as age increases, the output keeps increasing as well. Always. This is unlikely to be true for most numerical variables. You often need to create segments before building the regression (btw RuleFit solves exactly this problem)

✓ The meaning of the coefficients in front of a categorical variable with several levels can be confusing. Each one is measured relative to the reference level, so if you change the reference level of a given variable, the coefficients of all its other levels change

✓ The absolute value of a coefficient is often used to quickly estimate variable importance. However, that depends on the variable scale more than anything else. You could normalize variables, so they are all on the same scale. But that’s rarely a good idea if your goal is presenting to product people. It is hard to get a product manager excited by saying: “If we increase variable X by one standard deviation, we could achieve this and that”
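The point about reference levels can be illustrated with simple arithmetic on the weekday coefficients reported earlier. This is a sketch under stated assumptions, not a refit of the model: the dictionary and the helper `change_reference` are hypothetical names. The sketch relies on the fact that switching the reference level only shifts every level's coefficient by the same constant (the intercept absorbs the shift). Standard errors and p-values would still have to come from an actual refit.

```python
# Weekday coefficients from the output above, measured against Friday
# (the reference level, whose coefficient is 0 by construction).
WEEKDAY_VS_FRIDAY = {
    "Friday": 0.0,
    "Monday": 0.541033,
    "Saturday": 0.282864,
    "Sunday": 0.183628,
    "Thursday": 0.625404,
    "Tuesday": 0.616222,
    "Wednesday": 0.755464,
}

def change_reference(coefficients, new_reference):
    """Re-express level coefficients relative to a different reference level."""
    shift = coefficients[new_reference]
    return {level: value - shift for level, value in coefficients.items()}

# Same model, but now every weekday is compared against Wednesday
vs_wednesday = change_reference(WEEKDAY_VS_FRIDAY, "Wednesday")
for level, value in sorted(vs_wednesday.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{level:<10} {value:+.6f}")
```

With Wednesday as the reference, every other weekday gets a negative coefficient even though nothing about the data or the predictions has changed. Only the differences between levels carry meaning.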

