An Alternative View on Machine Learning Interpretability

As a data scientist or machine learning engineer, you want to know why (or why not) your machine learning model works properly. Specifically, you want interpretability in your model: to understand it, recognize its weaknesses, improve it, determine whether it is biased, and conform to regulations.

Typically, the words that contribute most to a prediction are highlighted; sentiment analysis, at a high level, would focus on words such as “good”, “happy”, “sad”, or “bad”. At Applica we came up with a different take on this process, more of a complementary idea than an alternative to existing methods such as LIME: we decided to look for words that make the results of a machine learning system (as measured by some evaluation score) worse, e.g. words which make it harder for a given system to classify a text or extract information from it.

This idea is designed for the machine learning practitioner (to debug the system, find problems in the “intelligent” parts, or just find trivial mistakes in pre- or post-processing or in training data), but it might also be of interest to other users of a machine learning system. For instance, it can help to understand what the 2% of cases in which the system makes a mistake (e.g. classifies a negative text as a positive one) have in common.

Example: Twitter sentiment classification task

  • The task is to classify tweets as positive or negative
  • We are going to list the most troublesome words and visualize them
  • Here is the list:
Feature       | # of occurrences | Average score | How likely by chance
in<1>:but     | 1685             | 0.65762310    | 0.00000000000000000000
in<1>:I       | 6252             | 0.70204738    | 0.00000000000000000000
in<1>:to      | 7349             | 0.70904468    | 0.00000000000000000000
exp:0         | 13025            | 0.72846193    | 0.00000000000000000000
in<1>:the     | 6565             | 0.70966775    | 0.00000000000000000000
in<1>:in      | 3101             | 0.69674823    | 0.00000000000000000000
in<1>:it      | 2608             | 0.69408473    | 0.00000000000000000000
in<1>:a       | 5344             | 0.71092959    | 0.00000000000000000000
in<1>:was     | 1431             | 0.68240316    | 0.00000000000000000000
in<1>:that    | 1858             | 0.68940902    | 0.00000000000000000000
  • The top word in this example is “but”, which is not completely surprising (but perhaps not so easy to come up with off the top of your head)
  • Why does the word “I” make it harder to identify the sentiment? Maybe the tweets containing “I” are more subjective, whimsical or ironic?
  • Words such as “the”, “to”, and “was” probably made the Top 10 because they are used to build full sentences (rather than very short tweets), where the sentiment may be expressed in a more subtle way
  • Apart from words occurring in the input tweet (marked with “in<1>:”), you’ll also find “exp:0”, i.e. the expected label 0 (negative), which means that tweets expressing negative sentiment are much harder to identify
  • With words that link two clauses, the system must learn to look at the second clause (and mostly discard the first clause)
  • The results of this specific classifier are not terrible for tweets containing “but”, though they are significantly worse than for tweets without that word

Technically, our idea is to calculate the probability that the difference in the evaluation score between (1) texts containing a given word and (2) texts without the word could be due to chance; the words for which the probability is very low are “suspicious”.
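This is straightforward to prototype. Below is a minimal Python sketch of the procedure; the function name, the use of the Mann-Whitney U test, and the min_count cut-off are illustrative choices made for this sketch, not necessarily what GEval does internally:

```python
from collections import defaultdict
from scipy.stats import mannwhitneyu

def suspicious_words(texts, scores, min_count=10):
    """For each word, compare the evaluation scores of texts containing it
    with the scores of texts without it, and sort the words by the
    probability that the observed difference is due to chance."""
    occurrences = defaultdict(set)  # word -> indices of the texts containing it
    for i, text in enumerate(texts):
        for word in set(text.lower().split()):
            occurrences[word].add(i)

    results = []
    for word, idx in occurrences.items():
        if len(idx) < min_count or len(idx) == len(texts):
            continue  # skip very rare (or ubiquitous) words
        with_word = [scores[i] for i in idx]
        without_word = [scores[i] for i in range(len(texts)) if i not in idx]
        # One-sided test: are scores of texts with this word systematically lower?
        _, p_value = mannwhitneyu(with_word, without_word, alternative='less')
        results.append((word, len(idx), sum(with_word) / len(with_word), p_value))

    # The lowest probabilities point to the most "suspicious" words.
    return sorted(results, key=lambda r: r[-1])
```

Each returned tuple corresponds to one row of the tables in this post: the feature, its number of occurrences, the average score of the texts containing it, and how likely the difference is by chance.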

For instance, consider a simple example with ten sentences, each with some score (it does not matter which specific score):

A specific feature is marked with a blue square. Does this feature make the evaluation score much lower? It might seem so. But you can actually calculate the probability of such a distribution of the given feature arising purely by chance. In our simple example the probability is around 0.06. Is that high or low? It does not matter much, as you don’t need any threshold: you can simply sort the words by this probability.
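To make the “purely by chance” part concrete, here is a tiny permutation-style check for a hypothetical ten-sentence example (the scores and the feature positions below are invented, so the estimated probability will differ from the 0.06 quoted above):

```python
import random

# Hypothetical evaluation scores of ten sentences (invented numbers).
scores = [0.9, 0.4, 0.8, 0.3, 0.7, 0.2, 0.95, 0.85, 0.35, 0.75]
# Indices of the sentences that (hypothetically) contain the feature in question.
with_feature = {1, 3, 5, 8}

def mean(values):
    return sum(values) / len(values)

observed = mean([scores[i] for i in with_feature])

# How often would a random subset of the same size score at least as low?
random.seed(0)
trials = 100_000
hits = 0
for _ in range(trials):
    subset = random.sample(range(len(scores)), len(with_feature))
    if mean([scores[i] for i in subset]) <= observed:
        hits += 1

print(hits / trials)  # estimated "how likely by chance" for this feature
```

For a toy example like this, the probability could also be computed exactly by enumerating all subsets; over a full test set, a statistical test (as in the earlier sketch) is more practical.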

This idea can also be generalized to other “features”, not just words. For example, you might ask whether longer texts are more prone to mistakes (longer in characters or longer in words?), or whether texts containing a proper name are.
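The same machinery works for any yes/no property of a text, not only for the presence of a word. A hypothetical sketch of such feature extractors follows (the thresholds and the capitalization heuristic are made up for illustration):

```python
def text_features(text, char_threshold=100, word_threshold=20):
    """Map a text to a set of coarse-grained features; any such feature can be
    fed into the same chance-probability test as individual words."""
    words = text.split()
    features = set()
    if len(text) > char_threshold:
        features.add('long-in-characters')
    if len(words) > word_threshold:
        features.add('long-in-words')
    # Crude proper-name heuristic: a capitalized token that is not sentence-initial.
    if any(w[0].isupper() for w in words[1:]):
        features.add('contains-proper-name')
    return features
```

Each of these feature labels can then be grouped and tested exactly like the “in<1>:” word features above.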

With this method, you can also find the “easiest” words, i.e. the words that make the sentiment classification task really easy:

Feature        | # of occurrences | Average score | How likely by chance
in<1>:welcome  | 59               | 0.89542571    | 0.99999999999997910000
in<1>:great    | 330              | 0.80177446    | 0.99999999999999610000
in<1>:Thank    | 150              | 0.83924890    | 0.99999999999999690000
in<1>:vip      | 23               | 0.99524204    | 0.99999999999999940000
exp:1          | 12979            | 0.73805494    | 1.00000000000000000000
in<1>:Happy    | 110              | 0.88673913    | 1.00000000000000000000
in<1>:Thanks   | 214              | 0.84596921    | 1.00000000000000000000
in<1>:sad      | 295              | 0.84423010    | 1.00000000000000000000
in<1>:thank    | 150              | 0.87038143    | 1.00000000000000000000
in<1>:thanks   | 288              | 0.82948818    | 1.00000000000000000000

Some words are not that surprising (e.g. “happy” or “sad”), but some are quite interesting (e.g. “vip”).
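Assuming the suspicious_words sketch from earlier (which returns features sorted by the probability of the difference being due to chance), the “easiest” features are simply the other end of the same ranking, so no separate computation is needed:

```python
# texts and scores are placeholders for your own evaluation data.
results = suspicious_words(texts, scores)
hardest = results[:10]                    # lowest probabilities: the most troublesome features
easiest = list(reversed(results[-10:]))   # probabilities close to 1: the "easiest" features
```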

To summarize, this method is (unlike LIME) both fast and general: it can be used with any type of evaluation metric. And since this is an “exploratory” idea, it may not give you explicit answers, but it can provide clues, regardless of whether you’re a machine learning engineer, a domain expert, or simply an end user.

For more detailed information on this topic, please refer to our full paper, “GEval: A Tool for Debugging NLP Datasets and Models”.

Should you have additional questions or comments, I look forward to your feedback.