Challenge and data description can be found here.


Data


Let’s read the dataset and look at the variables we have.

R


#load the packages; library() stops with an error if a package is missing, require() only warns
library(dplyr)
library(ggplot2)

#read the file
data = read.csv("https://drive.google.com/uc?XXXXXXXXXXXXXXX")
dim(data)
[1] 2115    7
head(data)
        date shown clicked converted avg_cost_per_click total_revenue         ad
1 2015-10-01 65877    2339        43               0.90        641.62 ad_group_1
2 2015-10-02 65100    2498        38               0.94        756.37 ad_group_1
3 2015-10-03 70658    2313        49               0.86        970.90 ad_group_1
4 2015-10-04 69809    2833        51               1.01        907.39 ad_group_1
5 2015-10-05 68186    2696        41               1.00        879.45 ad_group_1
6 2015-10-06 66864    2617        46               0.98        746.48 ad_group_1
#make it a date
data$date=as.Date(data$date)

summary(data)
      date                shown           clicked        converted      avg_cost_per_click total_revenue               ad      
 Min.   :2015-10-01   Min.   :     0   Min.   :    0   Min.   :   0.0   Min.   :0.000      Min.   : -200.2   ad_group_1 :  53  
 1st Qu.:2015-10-14   1st Qu.: 28030   1st Qu.:  744   1st Qu.:  18.0   1st Qu.:0.760      1st Qu.:  235.5   ad_group_11:  53  
 Median :2015-10-27   Median : 54029   Median : 1392   Median :  41.0   Median :1.400      Median :  553.3   ad_group_12:  53  
 Mean   :2015-10-27   Mean   : 68300   Mean   : 3056   Mean   : 126.5   Mean   :1.374      Mean   : 1966.5   ad_group_13:  53  
 3rd Qu.:2015-11-09   3rd Qu.: 97314   3rd Qu.: 3366   3rd Qu.: 103.0   3rd Qu.:1.920      3rd Qu.: 1611.5   ad_group_15:  53  
 Max.   :2015-11-22   Max.   :192507   Max.   :20848   Max.   :1578.0   Max.   :4.190      Max.   :39623.7   ad_group_16:  53  
                                                                                                             (Other)    :1797  
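
Beyond `summary()`, two quick structural checks can be useful at this stage: how many daily rows each ad group has, and whether any column has missing values. The snippet below is only a sketch on a small made-up data frame (`toy` and its values are assumptions for illustration, not the challenge data); the same two calls could be applied to `data`.

```r
library(dplyr)

# Toy stand-in for the real data (values are made up for illustration)
toy <- data.frame(
  date  = as.Date("2015-10-01") + c(0, 1, 2, 0, 1),
  shown = c(100, 120, NA, 90, 95),
  ad    = c("ad_group_1", "ad_group_1", "ad_group_1", "ad_group_2", "ad_group_2")
)

# Number of daily rows per ad group; unequal counts hint at missing days
rows_per_ad <- count(toy, ad, name = "n_days")
print(rows_per_ad)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)
```

If some ad groups have fewer rows than others, that is worth keeping in mind before any time-series work on them.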


The data look odd in places. For instance, the minimum total_revenue is negative (-200.2), which doesn’t make sense. Let’s clean the data a bit by removing impossible values. In a real-world situation, we would investigate further to figure out where the bad data are coming from.

#Revenue cannot be negative
paste("There are", nrow(subset(data, total_revenue<0)), "events with negative revenue")
[1] "There are 4 events with negative revenue"
#Remove those
data = subset(data, total_revenue >= 0)

#Also, shown should be >= clicked and clicked should be >= converted. Let's see:
paste("There are", nrow(subset(data, shown<clicked | clicked<converted)), "events where the funnel doesn't make any sense")
[1] "There are 0 events where the funnel doesn't make any sense"
#Finally, a few values are zero, which seems odd given how high the averages are. Let's plot and see:
ggplot(data, aes(y=shown, x=date, colour=ad, group=ad)) + 
  geom_line(show.legend = FALSE) + 
  ggtitle("Ad impressions")
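
As a numeric complement to the plot, one could also count, per ad group, the days with zero impressions and compare them with that group's typical volume. This is a minimal sketch on made-up numbers (`toy`, `suspicious_ads`, and all values are hypothetical); it is one possible way to flag the zeros, not necessarily how they are handled later on.

```r
library(dplyr)

# Toy stand-in for the real data (values are made up for illustration)
toy <- data.frame(
  date  = rep(as.Date("2015-10-01") + 0:3, times = 2),
  shown = c(65000, 0, 70000, 68000, 30000, 31000, 29000, 30500),
  ad    = rep(c("ad_group_1", "ad_group_2"), each = 4)
)

# For each ad group: number of zero-impression days and the median daily impressions
suspicious_ads <- toy %>%
  group_by(ad) %>%
  summarise(n_zero_days  = sum(shown == 0),
            median_shown = median(shown)) %>%
  filter(n_zero_days > 0)

print(suspicious_ads)
```

An ad group with a high median and an isolated zero day is more likely a data issue than a genuine drop in traffic.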