ML Evaluation Metrics

Introduction to Evaluation Metrics

There are many metrics to evaluate a Machine Learning (ML) model. This post focuses on the most common binary classification metrics. Which metrics are better, and when should you use them? The answer to these questions usually comes with the trite and slightly annoying "it depends." We need to understand the trade-offs between the different metrics before we can make a decision.

We will start by defining each classification metric and when we should use it.

Accuracy

  • This is a measure of how many observations (positive and negative) were classified correctly, i.e., the number of correctly classified observations divided by the total number of observations. Accuracy calculations are based on the predicted classes.
  • The Accuracy score is dependent on your threshold choice. A threshold of 0.5 may be the obvious default, but it could be suboptimal. Plot the accuracy score vs. threshold to find an optimal threshold (see the sketch after this list).
  • Note: "Tuning" a threshold for logistic regression is different from tuning hyperparameters such as the learning rate. Part of choosing a threshold is assessing how much you'll suffer for making a mistake. For example, mistakenly labelling a non-spam message as spam is bad because the user could miss a very important message. On the other hand, mistakenly labelling a spam message as non-spam is unpleasant but inconsequential (the user just has to mark it as spam and delete it).
  • When should you use it? The problem/dataset should be balanced. Put another way, use this metric when every class is equally important. It is a good metric to use because it can easily be explained and understood by most stakeholders.
  • When should you NOT use it? If the problem/dataset is imbalanced, it is easy to get a high accuracy score by classifying all observations as the majority class. For example, if 90% of a dataset is positive, the model could simply label every observation as positive and achieve 90% accuracy. Use Cohen Kappa or the Matthews Correlation Coefficient (MCC) for imbalanced problems.
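
A minimal sketch of tuning the threshold against accuracy, assuming scikit-learn and NumPy are available; y_true and y_scores are hypothetical placeholders for your labels and predicted probabilities:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical ground-truth labels and predicted probabilities.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.9, 0.55])

# Sweep thresholds and compute accuracy on the resulting class predictions.
thresholds = np.linspace(0.0, 1.0, 101)
accuracies = [accuracy_score(y_true, (y_scores >= t).astype(int)) for t in thresholds]

best_t = thresholds[int(np.argmax(accuracies))]
print(f"Best threshold by accuracy: {best_t:.2f} (accuracy = {max(accuracies):.2f})")
```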

Cohen Kappa Metric

  • Cohen Kappa tells you how much better your model is than a random classifier that predicts according to class frequencies. You need to calculate the "observed agreement," which is how much the classifier's predictions agree with the ground truth (the model's Accuracy), and the "expected agreement," which is how much the predictions of a random classifier that samples according to class frequencies agree with the ground truth (the Accuracy of that random classifier). Essentially, this introduces a baseline classifier.
  • Like Accuracy, plot the metric against the threshold to tune it (Cohen Kappa on the y-axis and threshold on the x-axis); see the sketch after this list.
  • When should you use it? For imbalanced problems, this is a great alternative metric to using Accuracy.
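
A minimal sketch, assuming scikit-learn; y_true and y_pred are hypothetical placeholders:

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical imbalanced labels and predictions from a classifier.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

# High accuracy can be misleading on imbalanced data; kappa corrects
# for the agreement expected from a frequency-based random classifier.
print("Accuracy:", accuracy_score(y_true, y_pred))        # 0.9
print("Cohen Kappa:", cohen_kappa_score(y_true, y_pred))  # ~0.62
```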

Matthews Correlation Coefficient (MCC)

  • MCC is the correlation between the predicted classes and the ground truth. The equation is based on the confusion matrix, or equivalently you can calculate it as the correlation between the true labels and the predictions. Some consider this the most informative single metric for classification problems.
  • Like Accuracy, plot the metric against the threshold to tune it (MCC on the y-axis and threshold on the x-axis); see the sketch after this list.
  • When should you use it? For imbalanced problems, this is a great alternative metric to using Accuracy.
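
A minimal sketch, assuming scikit-learn and NumPy; the arrays are hypothetical placeholders:

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

# Hypothetical labels and predictions.
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])

# MCC computed from the confusion matrix...
print("MCC:", matthews_corrcoef(y_true, y_pred))

# ...equals the Pearson correlation between the two binary label vectors.
print("Pearson r:", np.corrcoef(y_true, y_pred)[0, 1])
```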

F1 Score

  • The F1 score conveys the balance between precision and recall. Like Accuracy, the closer the F1 score is to 1, the better.
  • Precision, true positives (TP) divided by the sum of TP and false positives (FP), aka "Positive Predictive Value," attempts to answer: what proportion of positive identifications was actually correct? For example, a precision of 0.5 means that when the model predicts a positive, it is correct 50% of the time. A model that produces no false positives (Type-I errors) has a precision of 1.0, which can be seen directly in the equation.
  • Recall, TP divided by the sum of TP and false negatives (FN), aka "True Positive Rate" or "Sensitivity," attempts to answer: what proportion of actual positives was identified correctly? For example, a recall of 0.25 means the model correctly identifies 25% of all actual positives. A model that produces no false negatives (Type-II errors) has a recall of 1.0, which can be seen directly in the equation.
  • Unfortunately, precision and recall are often in tension: improving precision typically reduces recall and vice versa.
  • The context of the problem can make false negatives and false positives confusing. One way to remember it: if a positive is bad, a false negative is worse; if a positive is good, a false positive is worse (PBFN, PGFP).
  • The F1 score is a special case of the more general F-beta function. When choosing beta: if you care more about precision, choose a beta between 0 and 1; the more you care about recall over precision, the higher the beta you should choose. The choice of beta also affects your optimal threshold, so plot the F-beta score against the threshold to see how beta changes the curve, and plot the F1 score vs. threshold to optimize it (see the sketch after this list).
  • When should you use it? When you care more about the positive class.
  • When should you NOT use it? Note that it can show overoptimistic results, especially on imbalanced datasets.
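
A minimal sketch of F1/F-beta threshold tuning, assuming scikit-learn and NumPy; y_true and y_scores are hypothetical placeholders:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, fbeta_score

# Hypothetical labels and predicted probabilities.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 0])
y_scores = np.array([0.2, 0.6, 0.3, 0.8, 0.7, 0.1, 0.9, 0.4, 0.55, 0.35])

# F1 and F-beta are computed on predicted classes, so a threshold must be chosen.
thresholds = np.linspace(0.05, 0.90, 18)
f1_by_t = [f1_score(y_true, (y_scores >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(f1_by_t))]
y_pred = (y_scores >= best_t).astype(int)

print(f"Best threshold by F1: {best_t:.2f}")
print("Precision:", precision_score(y_true, y_pred))
print("Recall:", recall_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
# beta < 1 weights precision more heavily; beta > 1 weights recall more heavily.
print("F0.5:", fbeta_score(y_true, y_pred, beta=0.5))
print("F2:", fbeta_score(y_true, y_pred, beta=2.0))
```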

ROC AUC

  • ROC stands for "Receiver Operating Characteristic" and is displayed as a probability curve. The curve visualizes the trade-off between the true positive rate and the false positive rate (TPR on the y-axis and FPR on the x-axis). For every threshold, we calculate these rates and display them as a curve. AUC is the "Area Under the Curve" and is a measure of separability: it tells us how well the model is able to distinguish between the classes. The higher the AUC, the better the model is at predicting positives as positives and negatives as negatives. ROC AUC is equivalent to calculating the rank correlation between predictions and targets, so it shows how well the model ranks predictions: it tells us the probability that a randomly chosen positive observation is ranked higher than a randomly chosen negative observation (see the sketch after this list).
  • Note that ROC AUC is calculated based on the predicted scores, not on classes like Accuracy.
  • When should you use it? When you care about ranking the predictions and outputting well-calibrated probabilities is not the priority, and when you care equally about the positive and negative classes.
  • When should you NOT use it? Like other metrics, do not use it when the data is heavily imbalanced. For a dataset with a large number of true negatives, the false positive rate is pulled down. An alternative is to use the Precision/Recall (PR) curve.
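
A minimal sketch, assuming scikit-learn; y_true and y_scores are hypothetical placeholders:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical labels and predicted scores (probabilities of the positive class).
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.9, 0.55]

# ROC AUC is computed from the scores themselves, not from thresholded classes.
print("ROC AUC:", roc_auc_score(y_true, y_scores))

# The full curve: one (FPR, TPR) point per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
```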

PR AUC

  • Precision/Recall (PR), as the name suggests, combines precision and recall in a probability curve (precision on the y-axis and recall on the x-axis). Similar to ROC, the precision and recall are calculated for every threshold. The recall at which your precision starts to fall fast may point to a good threshold to choose. The AUC for PR can be thought of as the average of the precision scores calculated for each recall threshold.
  • When should you use it? When your data is heavily imbalanced. For datasets dominated by negatives, the PR AUC mainly focuses on the positive class and less on the negative class. Hence, this metric is useful for scenarios where you care more about the positive class than the negative class. It is also useful if you want to communicate precision and recall decisions (see the sketch after this list).
  • When should you NOT use it? This metric may be harder to interpret than other metrics.
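
A minimal sketch, assuming scikit-learn; the arrays are hypothetical placeholders:

```python
from sklearn.metrics import average_precision_score, precision_recall_curve

# Hypothetical imbalanced labels and predicted scores.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 0, 1]
y_scores = [0.1, 0.2, 0.15, 0.3, 0.25, 0.4, 0.7, 0.6, 0.45, 0.85]

# One (recall, precision) point per candidate threshold.
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# Average precision approximates the area under the PR curve as a
# weighted mean of precisions at each recall step.
print("PR AUC (average precision):", average_precision_score(y_true, y_scores))
```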

Log Loss

  • Log loss is usually used as the objective function that is optimized during training. For each observation it takes the negative log of the probability the model assigned to the true class, and then averages these values over all observations. The more certain the model is that an observation is positive when it is actually positive, the lower the error. However, the error is heavily penalized when the model is confident that an observation is positive when it is actually negative. This relationship is not linear, so it is a good idea to clip extreme predictions to decrease the log loss (see the sketch after this list).
  • When should you use it? This "metric" should be used together with another metric to evaluate performance. It is risky to rely on log loss alone, as it does not always identify the best model.
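
A minimal sketch, assuming scikit-learn and NumPy; the arrays are hypothetical placeholders:

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical labels and predicted probabilities, including one
# confidently wrong prediction (true label 1, predicted 0.01).
y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.10, 0.90, 0.01, 0.20, 0.80])

print("Log loss:", log_loss(y_true, y_prob))

# Clipping extreme predictions limits the penalty for confident mistakes.
y_clipped = np.clip(y_prob, 0.05, 0.95)
print("Log loss after clipping:", log_loss(y_true, y_clipped))
```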
