The world is currently craving explainable artificial intelligence. As companies and data scientists mature, they realise that in many use cases, predictive performance is irrelevant if the algorithm is not trusted. We’ve seen startups and junior data scientists fall into the trap of building overly complex models in a business context. As a consequence, such models are not trusted, not used, and remain without value. That is exactly what inspired us to share our approach. When launching Cobra last month – our prediction library written in Python – we claimed that the approach offers explainability by design. But what exactly does that mean? And how do we make it real?
Explainability by design
While it is indeed possible to interpret the workings of any algorithm to some degree, few algorithms are explainable enough that the whole algorithm could be written out on a one-pager. With Cobra, it can – we’ll explain how and why.
A crucial ingredient of Cobra: PIGs
Hello? What is a PIG? The acronym stands for Predictor Insights Graph – a graph that illustrates the relationship between a single predictor and the target variable. Suppose we try to predict the probability of burnout in a company. We could, for example, divide the employees into age bins – from youngest to oldest – and calculate the probability of burnout for each bin. This simple graph then represents the probability of burnout for different age groups.
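As an illustration, the data behind such a PIG can be computed in a few lines of pandas. The employee data below is randomly generated, purely to show the mechanics – it is not Cobra's actual API:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical employee data: age and a binary burnout flag.
df = pd.DataFrame({
    "age": rng.integers(20, 65, size=1000),
    "burnout": rng.integers(0, 2, size=1000),
})

# Divide ages into equal-frequency bins and compute, per bin, the
# burnout incidence (mean of the binary target) and the bin size --
# exactly the two quantities a PIG plots as line and bars.
df["age_bin"] = pd.qcut(df["age"], q=5)
pig = df.groupby("age_bin", observed=True)["burnout"].agg(
    incidence="mean", size="count"
)
print(pig)
```

Plotting `incidence` as a line and `size` as bars over the bins reproduces the graph described above.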
The PIG is a basic yet powerful concept that allows fruitful discussion with the client on the interpretation of the most important predictors. However, it serves for much more than a nice visual slide. (In the graph above, the blue line represents the probability of burnout in each age group, while the grey bars represent the relative size of each age group – the grey bars are merely there to show whether the probability was calculated on a sufficiently large bin.)
PIGs are crucial components of Cobra
While the PIG certainly has its merit in the interpretation of an algorithm, it can also serve to solve all common data preprocessing issues in building algorithms. Typical preprocessing steps include treatment of missing values, treatment of outliers, treatment of nominal values and experimentation with variable transformations (normalizing, log-transformations, etc.).
Binning and incidence replacement
There are two basic steps that resolve all preprocessing issues:
- convert every predictor into a discrete (and limited) number of bins
- replace the content of the predictor by the incidence of the target variable, computed on the training set
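The two steps above can be sketched as follows with pandas, on a hypothetical toy dataset (the column names and data are made up; Cobra's actual API may differ):

```python
import pandas as pd

# Toy training data: a predictor already discretized into bins
# (step 1) and a binary target.
train = pd.DataFrame({
    "age_bin": ["young", "young", "mid", "mid", "old", "old"],
    "target":  [0, 1, 0, 0, 1, 1],
})

# Step 2: replace each bin label by the incidence of the target
# variable in that bin, computed on the training set only.
incidences = train.groupby("age_bin")["target"].mean()
train["age_enc"] = train["age_bin"].map(incidences)
print(train)
```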
The good news? The concept of the PIGs solves all issues in an elegant way:
- missing values can be treated as a separate bin, whereby the missing value is replaced by the incidence of all missing cases. Indeed, the fact that the information is missing is potentially very relevant information (consider e.g. the case where a client does not provide an email address).
- outliers are always grouped into the outermost bins, effectively reducing their impact on the model.
- for nominal variables, each class can be represented by a separate bin, where the value is replaced by the incidence of the class. When the class is small, it can be combined in an ‘other’ category. Note that ‘large’ and ‘small’ are relative concepts, yet they can be covered by evaluating the statistical significance of a single bin: a significant deviation from the average incidence implies by definition that the deviation is large enough, and that there is sufficient evidence.
- experiments with (non-linear) variable transformations are no longer needed, as the non-linear relationship is in fact represented by the incidences of the different bins – as represented in the PIGs graph above.
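For instance, the missing-value treatment from the first bullet can be sketched like this (toy data and hypothetical column names, following the email-address example above):

```python
import numpy as np
import pandas as pd

# Toy data where the predictor has missing values; the missingness
# itself becomes a separate bin with its own incidence.
df = pd.DataFrame({
    "email": ["a@x.be", None, "b@x.be", None, "c@x.be", None],
    "churn": [0, 1, 0, 1, 0, 1],
})
df["email_bin"] = np.where(df["email"].isna(), "Missing", "Present")

# Incidence per bin: the "Missing" bin gets its own estimate instead
# of being imputed away.
inc = df.groupby("email_bin")["churn"].mean()
print(inc)
```

In this toy example, not providing an email address is perfectly predictive of churn, which is exactly the kind of signal imputation would destroy.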
No technique is without risks. There are two major and concrete risks in applying this technique:
- constructing many small bins – when a bin contains only a small number of observations, the incidence estimate is inaccurate. For example, if our database contains only one French customer, we cannot represent all French customers by that single observation. Failing to create sufficiently large bins results in overfitting – meaning that the model will work on seen data (‘training’) but will not generalise to other data (‘test’ & ‘validation’). By default, creating 10 bins per predictor is often sufficient to cover the non-linearity while preventing overfitting. In small datasets (e.g. under 5K observations), reducing the number of bins typically reduces overfitting.
- using data of the validation or test set to compute bin incidences – the whole point of validation and test data is to verify how the algorithm generalises to unseen data. This data therefore cannot be used to compute bin incidences without compromising objective validation, and doing so is to be avoided in all cases.
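To illustrate the second risk, here is a leakage-free sketch: incidences are fitted on the training set only and merely mapped onto the test set. Bins unseen in training fall back to the overall training incidence (a fallback we assume here for illustration, not necessarily Cobra's exact behaviour):

```python
import pandas as pd

train = pd.DataFrame({"bin": ["A", "A", "B", "B"], "y": [1, 0, 1, 1]})
test = pd.DataFrame({"bin": ["A", "B", "C"]})  # "C" never seen in training

# Fit: incidences come from the training set only; the test target
# (even if we had it) is never used.
inc = train.groupby("bin")["y"].mean()
overall = train["y"].mean()

# Apply: map the training incidences onto the test set, with unseen
# bins falling back to the overall training incidence.
test["bin_enc"] = test["bin"].map(inc).fillna(overall)
print(test)
```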
Such a strange approach?
When junior data scientists are confronted with the approach, they are often sceptical. At first sight it seems quite different from, and perhaps more basic than, currently popular approaches like Random Forests, Gradient Boosting or Deep Learning. Using information from the target variable inside the predictors seems dangerous, and combining preprocessed, discretized variables in a regression-based approach does not seem modern enough. Indeed, our approach shows similarities to best practices in credit scoring, a domain with a long legacy of building highly performant and interpretable models. However, binning variables and representing each bin by its group incidence is also a very common concept in the most common and basic machine learning algorithm – the decision tree. As such, we benefit from the same logic as decision trees do: they reduce the need to treat missing values, outliers, nominal variables and transformations separately, and enable us to automate data preprocessing effectively and efficiently. In fact, we have often experimented with univariate decision trees to discretize variables efficiently – only to learn by experience that simply creating 10 bins results in equivalent performance.
So what about interactions?
True – in its basic form, the approach does not handle interactions the way tree-based approaches do. However, across our many experiments, we have consistently found that adding interactions only rarely improves performance significantly. As a basic rule of thumb, whenever tree-based approaches outperform Cobra, the gap can often be closed by adding the most significant interactions as new predictors to the model.
Coming back to explainability by design
Using Cobra implies fitting a regression-based model on top of discretized predictors. Since all variables are binned, we can even rewrite the resulting model as a set of IF… THEN… statements, ready to be implemented in virtually any environment. Fortunately, you often don’t have to, since you can run through all the needed steps in a very user-friendly way using our library. With our methodology, we can also decompose any individual score to understand exactly why a case was scored high or low. In short, we have gathered evidence that the approach offers a great solution for making an algorithm more transparent and understandable, while maintaining great predictive power.
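As an illustration of that score decomposition: a linear score over incidence-encoded predictors splits into one additive contribution per predictor. The coefficients and values below are made up, and the IF… THEN… rules simply spell out one binned predictor:

```python
# Hypothetical fitted coefficients of a regression-based model on
# incidence-encoded predictors, plus one scored case.
coefs = {"age_enc": 2.0, "tenure_enc": -1.5}
intercept = -0.3
case = {"age_enc": 0.40, "tenure_enc": 0.10}

# Each predictor's contribution to the score is simply coefficient
# times encoded value -- this is the per-case decomposition.
contributions = {p: coefs[p] * case[p] for p in coefs}
linear_score = intercept + sum(contributions.values())
print(contributions, linear_score)

# The encoding of a binned predictor itself reads as IF/THEN rules,
# e.g. for the (made-up) age bins:
#   IF age < 30       THEN age_enc = 0.40
#   ELIF age < 45     THEN age_enc = 0.25
#   ELSE              THEN age_enc = 0.15
```

Here one can read off directly that `age_enc` pushes the score up by 0.8 while `tenure_enc` pulls it down by 0.15 – exactly the kind of case-level explanation described above.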
Want more examples?
For additional examples of PIGs in action – have a look at our slides of predicting employee burnout (note that the same concept is used there with a continuous target variable, being the number of days absent).
Want to learn more?
Then join the Data Science Leuven Meetup of February 9, when Jan Beníšek will present Cobra to the world.