Machine Learning Metrics Cheatsheet

A comprehensive reference for evaluation metrics across classification, regression, clustering, ranking, and recommendation systems.

Confusion Matrix

A confusion matrix summarizes the prediction results of a classification model.

                  Predicted
                  Positive    Negative
              +------------+------------+
   Actual     |            |            |
   Positive   |     TP     |     FN     |
              |            |            |
              +------------+------------+
   Actual     |            |            |
   Negative   |     FP     |     TN     |
              |            |            |
              +------------+------------+
  • TP (True Positive): correctly predicted as positive
  • TN (True Negative): correctly predicted as negative
  • FP (False Positive): incorrectly predicted as positive (Type I error)
  • FN (False Negative): incorrectly predicted as negative (Type II error)
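For concreteness, the four counts can be tallied directly from label lists. This is an illustrative sketch; the toy vectors are made up, not taken from the text:

```python
# Toy binary labels: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

# Tally each confusion-matrix cell by comparing actual vs. predicted.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

print(tp, fp, fn, tn)  # 3 1 1 3
```

All of the classification metrics below can be computed from these four counts.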

Classification Metrics

Accuracy

  • Formula: \(\frac{TP + TN}{TP + TN + FP + FN}\)
  • Description: Proportion of correctly classified samples out of all samples.
  • Usage: Overall model performance for balanced datasets. Misleading when classes are imbalanced.

Precision

  • Formula: \(\frac{TP}{TP + FP}\)
  • Description: Of all samples predicted positive, how many are actually positive.
  • Usage: Emphasizes minimizing false positives. Use when the cost of a false positive is high (e.g., spam filtering).

Recall (Sensitivity / True Positive Rate)

  • Formula: \(\frac{TP}{TP + FN}\)
  • Description: Of all actual positives, how many were correctly predicted.
  • Usage: Emphasizes minimizing false negatives. Use when the cost of missing a positive is high (e.g., disease detection).

Specificity (True Negative Rate)

  • Formula: \(\frac{TN}{TN + FP}\)
  • Description: Of all actual negatives, how many were correctly predicted.
  • Usage: Important when false positives are costly. Complementary to recall.

F1 Score

  • Formula: \(2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\)
  • Description: Harmonic mean of precision and recall.
  • Usage: Balanced metric when you need a single number that accounts for both false positives and false negatives.

F-beta Score

  • Formula: \((1 + \beta^2) \times \frac{\text{Precision} \times \text{Recall}}{(\beta^2 \times \text{Precision}) + \text{Recall}}\)
  • Description: Weighted harmonic mean. Beta controls the trade-off: beta < 1 favors precision, beta > 1 favors recall.
  • Usage: F0.5 when false positives are more costly; F2 when false negatives are more costly.
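Precision, recall, and the F-beta family can be sketched in a few lines from the confusion counts. The counts below are hypothetical; note that with beta = 1 the function reduces to the ordinary F1 score:

```python
def fbeta(tp, fp, fn, beta=1.0):
    # Precision and recall from confusion counts,
    # then their weighted harmonic mean.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Example counts where precision (0.75) exceeds recall (0.6):
# F2 weights recall more, so it comes out lowest; F0.5 highest.
print(fbeta(6, 2, 4))            # F1
print(fbeta(6, 2, 4, beta=2.0))  # F2
print(fbeta(6, 2, 4, beta=0.5))  # F0.5
```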

ROC-AUC (Receiver Operating Characteristic - Area Under Curve)

  • Description: Plots True Positive Rate (Recall) vs. False Positive Rate (1 - Specificity) at various classification thresholds. AUC measures the entire area under this curve.
  • Range: 0.0 to 1.0. A value of 0.5 indicates random guessing; 1.0 indicates perfect classification.
  • Usage: Evaluates model’s ability to distinguish between classes across all thresholds. Works well for balanced datasets but can be overly optimistic on imbalanced data.
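One way to see what AUC measures: it equals the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one (ties counted as half). A minimal pairwise sketch of that equivalence, on made-up scores:

```python
def roc_auc(y_true, scores):
    # AUC = P(score of a random positive > score of a random negative),
    # counting ties as 0.5. O(P*N) pairwise version for clarity, not speed.
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```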

PR-AUC (Precision-Recall Area Under Curve)

  • Description: Plots Precision vs. Recall at various thresholds. AUC measures the area under this curve.
  • Usage: Preferred over ROC-AUC for imbalanced datasets because it focuses on the performance of the positive (minority) class.

Log Loss (Binary Cross-Entropy)

  • Formula: \(-\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right]\)
  • Description: Measures the performance of a model whose output is a probability between 0 and 1. Penalizes confident wrong predictions heavily.
  • Usage: Evaluates probabilistic classifiers. Lower is better; 0 is perfect.
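The formula translates directly to code. One practical detail (an implementation convention, not part of the formula): probabilities are clipped away from 0 and 1 so the logarithm stays finite:

```python
import math

def log_loss(y_true, p_hat, eps=1e-15):
    # Binary cross-entropy, averaged over samples.
    total = 0.0
    for y, p in zip(y_true, p_hat):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

print(log_loss([1, 0], [0.9, 0.1]))  # both predictions confident and correct
```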

Matthews Correlation Coefficient (MCC)

  • Formula: \(\frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}\)
  • Range: -1 to +1. +1 is perfect, 0 is random, -1 is total disagreement.
  • Description: A balanced measure that accounts for all four quadrants of the confusion matrix.
  • Usage: Considered one of the most reliable single-number metrics, especially for imbalanced datasets.
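A direct sketch of the formula; returning 0 when any marginal is empty is a common convention for the otherwise-undefined case, not part of the formula itself:

```python
import math

def mcc(tp, tn, fp, fn):
    # Matthews correlation coefficient from the four confusion counts.
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0  # convention when a marginal is zero

print(mcc(3, 3, 1, 1))  # 0.5
```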

Cohen’s Kappa

  • Formula: \(\kappa = \frac{p_o - p_e}{1 - p_e}\) where \(p_o\) is observed agreement (accuracy) and \(p_e\) is expected agreement by chance.
  • Range: -1 to +1. Values above 0.8 generally indicate strong agreement.
  • Description: Measures agreement between predicted and actual labels, adjusted for chance.
  • Usage: Useful when classes are imbalanced, as it accounts for the possibility of correct predictions by random chance.
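For the binary case, \(p_e\) is the chance that model and ground truth agree by luck: both say positive plus both say negative, using the marginal rates. A sketch from the confusion counts:

```python
def cohens_kappa(tp, tn, fp, fn):
    n = tp + tn + fp + fn
    p_o = (tp + tn) / n  # observed agreement (accuracy)
    # Expected agreement by chance, from the marginal rates:
    # both predict positive, plus both predict negative.
    p_pos = ((tp + fp) / n) * ((tp + fn) / n)
    p_neg = ((tn + fn) / n) * ((tn + fp) / n)
    p_e = p_pos + p_neg
    return (p_o - p_e) / (1 - p_e)

print(cohens_kappa(3, 3, 1, 1))  # 0.5
```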

Regression Metrics

Mean Absolute Error (MAE)

  • Formula: \(\frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert\)
  • Description: Average of absolute differences between predicted and actual values.
  • Usage: Measures average prediction error in original units. More robust to outliers than MSE.

Mean Squared Error (MSE)

  • Formula: \(\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\)
  • Description: Average of squared differences between predicted and actual values.
  • Usage: Penalizes larger errors more heavily than MAE. Sensitive to outliers.

Root Mean Squared Error (RMSE)

  • Formula: \(\sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}\)
  • Description: Square root of MSE, providing error in original units.
  • Usage: More interpretable than MSE while still penalizing large errors.
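The three error metrics above differ only in how residuals are aggregated, which a short sketch on toy values makes concrete. Note that RMSE is always at least as large as MAE; a wide gap signals a few large errors:

```python
import math

y     = [3.0, 5.0, 2.5, 7.0]  # actual values (toy data)
y_hat = [2.5, 5.0, 4.0, 8.0]  # predictions

errors = [yi - pi for yi, pi in zip(y, y_hat)]

mae  = sum(abs(e) for e in errors) / len(errors)   # average absolute error
mse  = sum(e * e for e in errors) / len(errors)    # average squared error
rmse = math.sqrt(mse)                              # back to original units

print(mae, mse, rmse)
```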

Mean Absolute Percentage Error (MAPE)

  • Formula: \(\frac{100\%}{n} \sum_{i=1}^{n} \left\lvert \frac{y_i - \hat{y}_i}{y_i} \right\rvert\)
  • Description: Average absolute percentage difference between predicted and actual values.
  • Usage: Scale-independent metric, easy to interpret as a percentage. Undefined when actual values are zero.

R-squared (Coefficient of Determination)

  • Formula: \(R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}\)
  • Range: Can be negative (worse than mean), 0 (same as predicting the mean), or up to 1 (perfect fit).
  • Description: Proportion of variance in the dependent variable explained by the model.
  • Usage: Standard goodness-of-fit measure. Does not penalize for model complexity.

Adjusted R-squared

  • Formula: \(R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}\) where \(n\) is the number of samples and \(p\) is the number of predictors.
  • Description: Modified R-squared that adjusts for the number of predictors in the model.
  • Usage: Preferred over R-squared when comparing models with different numbers of features. Penalizes adding features that do not improve the model.
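Both formulas are a few lines each. The example arguments (r2 = 0.9, n = 20, p = 3) are arbitrary; the key behavior is that the adjusted value is always at most the raw value for p >= 1:

```python
def r_squared(y, y_hat):
    # 1 - (residual sum of squares) / (total sum of squares)
    mean = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_hat))
    ss_tot = sum((yi - mean) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    # Penalize for the number of predictors p given n samples.
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(r_squared([1, 2, 3, 4], [1, 2, 3, 4]))  # perfect fit: 1.0
print(adjusted_r_squared(0.9, 20, 3))         # slightly below 0.9
```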

Clustering Metrics

Silhouette Score

  • Range: -1 to +1. Higher is better.
  • Description: Measures how similar an object is to its own cluster compared to other clusters.
  • Usage: Commonly used to select the number of clusters. Values near 0 indicate overlapping clusters.

Davies-Bouldin Index

  • Range: 0 to infinity. Lower is better.
  • Description: Evaluates cluster separation and compactness by computing the average similarity between each cluster and its most similar cluster.
  • Usage: Lower values indicate better clustering with more distinct, compact clusters.

Adjusted Rand Index (ARI)

  • Range: up to +1, and can be negative. 1 is perfect agreement; 0 is the expected value for random labelings; negative values mean worse than chance.
  • Description: Measures similarity between two clusterings, adjusted for chance.
  • Usage: Comparing clustering results against ground truth labels.
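A sketch of the pair-counting form of ARI, (Index − Expected Index) / (Max Index − Expected Index), using only the standard library. Note it is invariant to relabeling the clusters:

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_a, labels_b):
    # Pair-counting ARI: (index - expected) / (max - expected).
    n = len(labels_a)
    contingency = Counter(zip(labels_a, labels_b))
    sum_ij = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

# Identical clusterings under a label permutation still score 1.0.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```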

Ranking & Recommendation Metrics

Precision at K

  • Description: Fraction of recommended items in top-K that are relevant.
  • Usage: Measures relevancy of recommendations at a specific cutoff.

Recall at K

  • Description: Fraction of all relevant items that appear in the top-K recommendations.
  • Usage: Measures coverage of relevant items at a specific cutoff.
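The two cutoff metrics share the same hit count and differ only in the denominator: K for precision, the total number of relevant items for recall. A sketch on a made-up ranked list:

```python
def precision_at_k(recommended, relevant, k):
    # Fraction of the top-K recommendations that are relevant.
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k):
    # Fraction of all relevant items captured in the top-K.
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)

recommended = ["a", "b", "c", "d", "e"]
relevant = {"b", "d", "f"}
print(precision_at_k(recommended, relevant, 3))  # 1 hit in top 3
print(recall_at_k(recommended, relevant, 5))     # 2 of 3 relevant found
```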

NDCG (Normalized Discounted Cumulative Gain)

  • Formula: \(\text{NDCG@K} = \frac{\text{DCG@K}}{\text{IDCG@K}} \quad \text{where} \quad \text{DCG@K} = \sum_{i=1}^{K} \frac{2^{rel_i} - 1}{\log_2(i + 1)}\)
  • Range: 0 to 1. Higher is better.
  • Description: Measures ranking quality by considering both the relevance of items and their position in the ranked list. Items ranked higher contribute more to the score.
  • Usage: Standard metric for search engines and recommendation systems where ranking order matters.
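A sketch of the formula above, taking a list of graded relevances in ranked order. The ideal DCG is obtained by sorting the relevances in descending order; a list already in ideal order scores exactly 1.0:

```python
import math

def dcg_at_k(relevances, k):
    # Gain (2^rel - 1) discounted by log2 of the 1-based position + 1.
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k):
    ideal = sorted(relevances, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(relevances, k) / idcg if idcg else 0.0

print(ndcg_at_k([3, 2, 1, 0], 4))  # already ideally ordered: 1.0
print(ndcg_at_k([0, 1, 2, 3], 4))  # worst ordering of the same items: < 1.0
```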

Mean Reciprocal Rank (MRR)

  • Formula: \(\text{MRR} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\text{rank}_i}\)
  • Description: Average of the reciprocal of the rank of the first relevant item across queries.
  • Usage: Evaluates how early the first relevant result appears.
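A sketch over multiple queries; the convention of contributing 0 for a query with no relevant result is a common implementation choice, not part of the formula:

```python
def mean_reciprocal_rank(ranked_lists, relevant_sets):
    # rank_i = 1-based position of the first relevant item for query i.
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for pos, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1 / pos
                break  # only the first relevant item counts
    return total / len(ranked_lists)

# Query 1: first hit at rank 2 (1/2). Query 2: first hit at rank 1 (1/1).
print(mean_reciprocal_rank([["a", "b", "c"], ["x", "y"]], [{"b"}, {"x"}]))  # 0.75
```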

MAP (Mean Average Precision)

  • Description: Mean of the average precision scores across all queries. Average precision is the area under the precision-recall curve for a single query.
  • Usage: Standard metric for information retrieval and object detection.
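A sketch using the common discrete form of average precision: the mean of precision@k taken at each position k where a relevant item appears, divided by the total number of relevant items:

```python
def average_precision(ranked, relevant):
    # Sum precision@k at each relevant position, normalize by |relevant|.
    hits, total = 0, 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant)

def mean_average_precision(ranked_lists, relevant_sets):
    aps = [average_precision(r, s) for r, s in zip(ranked_lists, relevant_sets)]
    return sum(aps) / len(aps)

# Hits at ranks 1 and 3: AP = (1/1 + 2/3) / 2 = 5/6.
print(average_precision(["a", "b", "c", "d"], {"a", "c"}))
```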

Which Metric to Use

A decision guide for choosing the right metric based on your problem.

Classification

  • Balanced classes: Accuracy, F1 Score (both classes equally important)
  • Imbalanced classes: F1 Score, PR-AUC, MCC (accuracy is misleading; these focus on the minority class)
  • Cost of false positives is high: Precision, F0.5 (minimizes false alarms, e.g., spam filter)
  • Cost of false negatives is high: Recall, F2 (minimizes missed positives, e.g., cancer detection)
  • Need probabilistic output evaluation: Log Loss, Brier Score (evaluates calibration of predicted probabilities)
  • Comparing models across thresholds: ROC-AUC for balanced data, PR-AUC for imbalanced (threshold-independent evaluation)
  • Multi-class problems: Macro/Weighted F1, Cohen's Kappa (accounts for performance across all classes)

Regression

  • General purpose: RMSE (penalizes large errors, in original units)
  • Robust to outliers: MAE (equal weight to all errors)
  • Comparing across scales: MAPE, R-squared (scale-independent interpretation)
  • Feature selection / model comparison: Adjusted R-squared (penalizes unnecessary complexity)
  • MAE vs. RMSE trade-off: use both (MAE for median-like error, RMSE for mean-like error; if RMSE is much larger than MAE, large errors are present)

Ranking & Retrieval

  • Order of results matters: NDCG (accounts for position and relevance)
  • Finding the first relevant result: MRR (rewards early relevant results)
  • Overall retrieval quality: MAP (comprehensive precision-recall summary)
  • Fixed recommendation list size: Precision@K, Recall@K (evaluates a specific cutoff)