Metrics for Evaluating AI/ML Algorithms


Metrics for Evaluation (Performance Measures) of AI/ML Algorithms

A metric is any number that provides measured information. The performance of learning models is evaluated with various types of metrics. Evaluation of machine learning models can be viewed as similar to hypothesis testing in statistics: in statistics the value of a population parameter has to be inferred from sample statistics, and similarly an AI/ML model is evaluated using a finite sampled data set. The available data set is split into train and test sets. Trained models are never evaluated on the training data but on the test set. Evaluation can be done by holding out a test set, by cross-validation, or by bootstrapping.

Classification Accuracy
Accuracy is the simplest metric for measuring the performance of a trained ML model. It is the number of correct predictions made divided by the total number of predictions made for a given set of observed data.


Figure-1:

All predictions on target. 

Classification Rate/Accuracy:

Classification Rate or Accuracy is given by the relation:

Accuracy = (TP + TN) / (TP + TN + FP + FN)


Accuracy is a simplistic measure that is misleading on many real-world problems. Consider a two-class domain classifier. The classifier outputs are one of two possible judgments: Positive or Negative. Given a test set and a specific classifier there are 4 possible classifications as follows: 

A positive example classified as positive. This is a true positive.
A positive example misclassified as negative. This is a false negative.
A negative example classified as negative. This is a true negative.
A negative example misclassified as positive. This is a false positive.


The problem with class imbalance

The accuracy measure implicitly assumes that the dataset is balanced, or approximately balanced, with a roughly 50:50 split of positive and negative classes. In the real world, imbalanced data sets are the rule and not the exception. On a dataset with a 99:1 split of negatives to positives, the accuracy measure can lead to a wrong evaluation of the classifier.
Examples of unbalanced data sets: 1% fraudulent finance transactions and 99% genuine, 95% healthy and 5% diseased, 10% customer churn and 90% continuing to stay, 99.5% of factory production defect-free and 0.5% defective, 99.999% of the human population not being terrorists, and so on.
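
As a quick illustration, the short Python sketch below (using scikit-learn, with made-up counts) shows how a classifier that always predicts the majority class on a 99:1 dataset still reports 99% accuracy:

import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced test set: 990 negatives (0) and 10 positives (1)
y_true = np.array([0] * 990 + [1] * 10)

# A "dumb" classifier that always predicts the majority (negative) class
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))  # 0.99 -- looks excellent, yet no positive is ever found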

Confusion Matrix: 
A confusion matrix is a summary of prediction results on a classification problem. The number of correct and incorrect predictions are summarized with count values and broken down by each class. The confusion matrix shows the ways in which your classification model is confused when it makes predictions. It is also called classification matrix.  We can create a 2×2 matrix with the columns as true classes and the rows as the hypothesized classes. It looks like this:



Figure-2:

Consider class 1 as positive; correspondingly, class 0 is negative. The row-column arrangement may be interchanged.

Definition of the Terms:

• Positive (P) : True observation is positive (for example: is an apple).
• Negative (N) : True observation is not positive (for example: is not an apple).
• True Positive (TP) : Observation is positive, and is predicted to be positive.
• False Negative (FN) : Observation is positive, but is predicted negative.
• True Negative (TN) : Observation is negative, and is predicted to be negative.
• False Positive (FP) : Observation is negative, but is predicted positive.


The following table summarizes the above definitions.

                      Predicted Positive    Predicted Negative
Actual Positive       TP                    FN
Actual Negative       FP                    TN

Table-1
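
The four counts can be read off with scikit-learn as in the small sketch below (the labels are hypothetical); note that scikit-learn's confusion_matrix places true classes in rows and predicted classes in columns:

from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # hypothetical actual labels
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]   # hypothetical predicted labels

# With labels=[0, 1], ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print("TP =", tp, "FN =", fn, "TN =", tn, "FP =", fp)
print("Accuracy =", (tp + tn) / (tp + tn + fp + fn))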

Expected cost

The confusion matrix contains the frequencies of the four different outcomes. The most precise way to deal with the class imbalance problem is to use these four numbers to calculate an expected cost (or equivalently, an expected benefit). The expected value of a random variable is obtained by taking the probability of each outcome, multiplying it by the value of that outcome, and summing.

We use the same method to compute the expected cost of the confusion matrix outcomes. We then have to evaluate the costs and benefits of the four outcomes. The final calculation is the sum:

Expected cost = p(P) × cost(P) + p(N) × cost(N)
In the above equation p(P) and p(N) are the prior probabilities of the positive and negative classes respectively, also called the class priors. cost(P) is the cost of dealing with a positive class and cost(N) is the cost of dealing with a negative class. The cost of the positive class is calculated as

cost(P) = p(TP) × benefit(TP) + p(FN) × cost(FN)
The cost of the negative class is calculated similarly, giving the expanded form:

Expected cost = p(P) × [p(TP) × benefit(TP) + p(FN) × cost(FN)] + p(N) × [p(TN) × benefit(TN) + p(FP) × cost(FP)]

The prior probabilities p(P) and p(N) can be estimated directly from the data. The four probabilities p(TP), p(FN), p(TN) and p(FP) are computed from the classifier's confusion matrix. The cost() and benefit() values are extrinsic values that cannot be derived from the data; they are estimated based on expert knowledge of the domain.
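
The sketch below illustrates the expected-cost calculation described above; the confusion-matrix counts and the cost/benefit values are purely illustrative assumptions, not values from any real domain:

# Hypothetical confusion matrix counts
tp, fn, fp, tn = 60, 40, 20, 880

pos = tp + fn
neg = fp + tn
total = pos + neg

# Class priors estimated from the data
p_pos, p_neg = pos / total, neg / total

# Assumed (domain-expert) values: benefits for correct outcomes, costs for errors
benefit_tp, cost_fn = 50.0, -10.0
benefit_tn, cost_fp = 0.0, -5.0

# Outcome probabilities conditioned on the true class, taken from the confusion matrix
cost_of_pos = (tp / pos) * benefit_tp + (fn / pos) * cost_fn
cost_of_neg = (tn / neg) * benefit_tn + (fp / neg) * cost_fp

expected_value = p_pos * cost_of_pos + p_neg * cost_of_neg
print(expected_value)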


Precision: 
To get the value of precision we divide the total number of correctly classified positive examples by the total number of examples predicted as positive. High precision indicates that a large percentage of the samples predicted as positive are indeed positive, i.e., the number of false positives (FP) is small. Precision tells us: when the model predicts yes, how often is it correct?

Precision = TP / (TP + FP)
Consider precision of label “1” and denote it by P1
P1 = TP1 / (TP1 + FP1)

Consider precision of label “0” and denote it by P0

P0 = TP0 / (TP0 + FP0)


Precision vs Accuracy

The difference between the Precision and Accuracy metrics is illustrated in the figures below.



Figure-3:

Inaccurate but predictions repeat within a region



Figure-4:

Accuracy vs Precision

Recall: 
Recall (sensitivity) is defined as the ratio of the total number of correctly classified positive examples to the total number of positive examples. High recall indicates that a large percentage of the positive class is correctly recognized, i.e., the number of false negatives (FN) is small.

Recall tells us: when the actual class is yes, how often does the model predict yes?
Recall is given by the relation:
Recall=TP / (TP + FN)
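
A minimal scikit-learn sketch computing precision and recall on hypothetical labels (pos_label selects which class is treated as positive):

from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

print(precision_score(y_true, y_pred, pos_label=1))  # TP / (TP + FP) = 2 / 3
print(recall_score(y_true, y_pred, pos_label=1))     # TP / (TP + FN) = 2 / 4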


The figure below illustrates how precision and recall can be used for object detection in images.



Figure-5:

Precision is the fraction of the instances predicted as true that are actually true. Recall is the fraction of the actually true instances that are predicted true. (Also refer to the Dice Similarity Index.)


Precision vs. Recall for Imbalanced Classification
You may decide to use precision or recall on a problem with imbalanced classes. Maximizing precision will minimize the number of false positives, whereas maximizing recall will minimize the number of false negatives. To summarize:

  • Precision: Appropriate when minimizing false positives is the focus.
  • Recall: Appropriate when minimizing false negatives is the focus.


Macro, Micro and Weighted Average methods

Macro-average Precision
Calculate metrics for each label, and find their unweighted mean. This does not take class label imbalance into account.
The method is straightforward: just take the average of the model's precision for each class label. For example, the macro-average precision for class labels 1 and 0 is computed as
Macro-average precision=(P1+P0)/2

Micro-average precision

Calculate metrics globally by counting the total true positives, false negatives and false positives.

Micro-average precision = (TP1 + TP0) / (TP1 + TP0 + FP1 + FP0)

Weighted-average precision

Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters 'macro' to account for label imbalance.

Weighted-average precision = (N1 × P1 + N0 × P0) / (N1 + N0)

where N1 and N0 are the supports (numbers of true instances) of labels 1 and 0.

(The above macro, micro and weighted averaging methods also apply to Recall.)
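
These three averaging schemes map directly onto scikit-learn's average parameter, as in the small sketch below (the labels are hypothetical):

from sklearn.metrics import precision_score

y_true = [1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 1]

print(precision_score(y_true, y_pred, average='macro'))     # unweighted mean of P1 and P0
print(precision_score(y_true, y_pred, average='micro'))     # global TP / (TP + FP) over both labels
print(precision_score(y_true, y_pred, average='weighted'))  # mean of P1 and P0 weighted by support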

F1-measure:
Since we have two measures (Precision and Recall), it helps to have a single measurement that represents both of them. Consider a binary classifier for which we need the best precision and recall at the same time. Computing the arithmetic mean of precision and recall to get the best of both is not a good solution.

The F1-measure uses the harmonic mean in place of the arithmetic mean. The harmonic mean is pulled toward the smaller of the two values, so both precision and recall must be high for F1 to be high. The F1 score is calculated as

F1 = 2 × (Precision × Recall) / (Precision + Recall)
Assume Recall = 0.95 and Precision = 0.91

F1-measure = (2 × 0.95 × 0.91) / (0.91 + 0.95) ≈ 0.93

The F1-measure will always be nearer to the smaller value of Precision or Recall. 
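
A small sketch of the calculation, plus the equivalent scikit-learn call on raw labels (the labels are hypothetical):

from sklearn.metrics import f1_score

# Direct harmonic-mean computation for the values used above
precision, recall = 0.91, 0.95
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.93 -- pulled toward the smaller of the two values

# On raw (hypothetical) labels, scikit-learn computes the same quantity directly
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
print(f1_score(y_true, y_pred))  # harmonic mean of precision (2/3) and recall (1/2)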

Advantage of F1 measure (a simple use case)
Assume that an e-Commerce company implements a simple Recommender System. A customer has sent a query about a product. The recommender should display a list of possible recommendations. The dataset contains 1000 products with 100 relevant (positive) products and 900 non-relevant (negative) products. It is required to classify them as relevant (1) or non-relevant (0).
The company decides to drive false negatives to 0 and simply treats all 1000 products as relevant, i.e., non-relevant products are also marked relevant. There are then 900 false positives and 100 true positives. Since all truly positive products are predicted positive there are no false negatives, and since every product is predicted positive there are no predicted negatives at all. Then
Precision = 100/ (100 + 900) = 0.1
Recall = 100 / (100 + 0) = 1
Taking the arithmetic mean of Precision and Recall to represent the quality of our classifier as a single number gives (1 + 0.1) / 2 = 0.55. Even though 90% of the predictions are wrong, the arithmetic-mean performance measure is 0.55.
However, the F1 score would be F1 = 2 × (0.1 × 1) / (0.1 + 1) ≈ 0.18, which exposes the poor precision.

Now consider a data set with a practically unlimited number of negative-class elements and a single positive element, and a dumb model that predicts positive for every instance in the data.


This means
Precision : 0.0
Recall : 1.0

Now:

Arithmetic mean: 0.5
Harmonic mean: 0.0

With the arithmetic mean the model scores 0.5, i.e., "50% good", despite being the worst possible classifier: it ignores the input and merely happens to predict the single positive element correctly. With the harmonic mean, the F-measure is 0, which accurately reflects that this model is useless for all practical purposes.



Fbeta measure:
There are situations, however, in which a data scientist would like to give more importance/weight to either precision or recall. Altering the above expression to include an adjustable parameter beta for this purpose, we get:

Fβ = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall)

Fβ measures the effectiveness of a model with respect to a user who attaches β times as much importance to recall as to precision. The best value of the Fβ score is 1 and the worst is 0. If β > 1, the measure favors recall; if β < 1, it favors precision.
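
A short sketch using scikit-learn's fbeta_score to show how beta shifts the balance between precision and recall (labels are hypothetical):

from sklearn.metrics import fbeta_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

print(fbeta_score(y_true, y_pred, beta=1))    # identical to the F1 score
print(fbeta_score(y_true, y_pred, beta=2))    # beta > 1: recall weighted more heavily
print(fbeta_score(y_true, y_pred, beta=0.5))  # beta < 1: precision weighted more heavily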

ROC and AUC: 
Let us first try to understand what an ROC (Receiver Operating Characteristic) curve is. For a probabilistic model, each choice of decision threshold yields a different confusion matrix (Table-1) and hence a different pair of metric values: for each value of sensitivity, we get a different specificity. The two vary as follows:


Figure-6:


The ROC curve is the plot of sensitivity against (1 − specificity). (1 − specificity) is also known as the false positive rate and sensitivity is also known as the true positive rate. Following is the ROC curve for a certain model.



Figure-7:


Consider a threshold for which we get the following confusion matrix.

                Predicted 1    Predicted 0    Total    Ratio
Actual 1        3834           16             3850     TPR = 99.6%
Actual 0        639            951            1590     FPR = 40.19%

Table-2

As you can see, the sensitivity at this threshold is 99.6% and (1 − specificity) is about 40.2%. This coordinate becomes one point on our ROC curve.

AUC:
The area under the ROC curve (AUC) quantifies the ROC with a single numeric measure.
Note that the area of the entire unit square equals 1. Following are a few rules of thumb for evaluating the AUC:

  • 0.9 - 1 = excellent (A)
  • 0.8 - 0.9 = good (B)
  • 0.7 - 0.8 = fair (C)
  • 0.6 - 0.7 = poor (D)
  • 0.5 - 0.6 = fail (F)

A large value of AUC above 0.9 is excellent, but this might simply be over-fitting. In such cases it becomes very important to do in-time and out-of-time validations.
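
A minimal sketch of computing the ROC curve and its AUC with scikit-learn; y_score must be a probability or decision score rather than a hard class label, and the data below is hypothetical:

from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9]  # predicted probabilities of class 1

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one (FPR, TPR) point per threshold
print(list(zip(fpr, tpr)))
print(roc_auc_score(y_true, y_score))              # area under the ROC curve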


Precision – Recall curve:

A precision-recall curve is a plot of the precision (y-axis) against the recall (x-axis) for different thresholds, much like the ROC curve. A naïve way to construct a precision-recall curve is to connect precision-recall points, where a precision-recall point is a point in precision-recall space whose x value is recall and whose y value is precision.

The precision-recall curve shows how the precision-recall relationship changes as we vary the threshold for identifying a positive in our model. The threshold is the value above which a data point is assigned to the positive class.

Receiver Operating Characteristic (ROC) curves are generally used to present results for binary decision problems in machine learning, but with imbalanced and skewed data sets Precision-Recall (PR) curves give a more informative picture of an algorithm's performance. An algorithm that optimizes the area under the ROC curve is not guaranteed to optimize the area under the PR curve.

Figure-8:



Figure-9:


The precision-recall curve shows the tradeoff between precision and recall for different thresholds. A high area under the curve represents both high recall and high precision, where high precision relates to a low false positive rate, and high recall relates to a low false negative rate.
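
A corresponding sketch for the precision-recall curve; average_precision_score summarizes the curve with a single number (the data is hypothetical):

from sklearn.metrics import precision_recall_curve, average_precision_score

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9]

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(list(zip(recall, precision)))               # one (recall, precision) point per threshold
print(average_precision_score(y_true, y_score))   # single-number summary of the PR curve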

Hamming loss: 

Hamming loss measures the loss generated in the bit string of class labels during prediction. It does this by taking the exclusive-or (XOR) between the actual and predicted labels and then averaging over the data set. The following cases illustrate the calculation of Hamming loss.

Case 1: Actual Same as Predicted
Actual = [[0 1]         Predicted= [[0 1]
          [1 1]]                    [1 1]]
 
     
XOR Output = [[0 0]
              [0 0]]
 
HL  = 0.0

Case 2: Actual same as inverted Predictions

Actual = [[0 1]         Predicted= [[1 0]
          [1 1]]                    [0 0]]
 
     
XOR Output = [[1 1]
              [1 1]]

HL  = 1.0

Case 3: Actual partially same as Predicted
Actual = [[0 1]         Predicted= [[0 0]
          [1 1]]                    [0 1]]
 
     
XOR Output = [[0 1]
              [1 0]]

HL  = 0.5



Hamming loss is computed as

Hamming Loss = (1 / (N × L)) × Σ(i=1..N) Σ(j=1..L) XOR(y_ij, ŷ_ij)

where N is the number of data samples and L is the number of labels.

For binary single-label classes, Hamming loss equals (1 − accuracy), so using HL does not add much in the binary case since it is directly related to accuracy. Accuracy, however, is ambiguous in the multi-label case.
For the multi-label case, HL computes the Hamming loss between the actual and the predicted label sets: it is the fraction of labels that are incorrectly predicted. HL thus provides one clear single performance value for the multi-label case, in contrast to Precision/Recall/F1, which can only be evaluated through independent binary classifiers for each label.
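
The three cases above can be reproduced with scikit-learn's hamming_loss on multilabel indicator arrays; a minimal sketch:

import numpy as np
from sklearn.metrics import hamming_loss

actual = np.array([[0, 1],
                   [1, 1]])

print(hamming_loss(actual, np.array([[0, 1], [1, 1]])))  # Case 1: identical          -> 0.0
print(hamming_loss(actual, np.array([[1, 0], [0, 0]])))  # Case 2: fully inverted     -> 1.0
print(hamming_loss(actual, np.array([[0, 0], [0, 1]])))  # Case 3: half the bits off  -> 0.5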


Cohen’s Kappa Coefficient and Jaccard Score 
Cohen's kappa coefficient:
Cohen’s Kappa is a measure of agreement that takes into account how much agreement could be expected by chance. It does this by taking into account the class imbalance and the classifier’s tendency to vote Yes or No. This is a statistic which measures inter-rater agreement for qualitative (categorical) items. 
Cohen's Kappa is generally thought to be a more robust measure than simple percentage agreement calculation, since it takes into account the agreement occurring by chance. Cohen's kappa measures the agreement between two raters who each classify N items into C mutually exclusive categories. 
The Cohen kappa coefficient is given by

k = (po − pe) / (1 − pe)

where
  • po  = relative observed agreement among raters. 
  • pe = the hypothetical probability of chance agreement.

po is computed directly from the observed agreements, while pe is computed from the observed data by estimating the probability of each rater randomly assigning each category.

If the raters are in complete agreement then k = 1.

If there is no agreement among the raters other than what would be expected by chance (as given by pe), then k ≤ 0.

Problem
There are 50 applications for grants. Each grant proposal was read by two readers, A and B, and each reader either said "Yes" or "No" to the proposal. The agreement-disagreement counts between A and B are shown in Table-3: agreements lie on the main diagonal and disagreements on the off-diagonal. Calculate Cohen's kappa coefficient.


               B: Yes    B: No
A: Yes         20        5
A: No          10        15

Table-3

Solution:
Note that there were 20 proposals that were granted by both reader A and reader B and 15 proposals that were rejected by both readers.
To calculate p0 , the observed proportionate agreement is
p0 = (20+15)/50 = 0.7
To calculate pe (the probability of random agreement) we note that:
  • Reader A said "Yes" to 25 applicants and "No" to 25 applicants. Thus reader A said "Yes" 50% of the time.
  • Reader B said "Yes" to 30 applicants and "No" to 20 applicants. Thus reader B said "Yes" 60% of the time.
Using joint probability formula P(A and B) = P(A) x P(B) where P(A) and P(B)  are independent probabilities of A and B rating Yes.

The probability that both of them would say "Yes" randomly is 0.50 x 0.60 = 0.30. Similarly the probability that both of them would say "No" is 0.50 x 0.40 = 0.20. Thus the overall probability of chance (random) agreement is the sum (0.30+0.20) i.e., pe = 0.5.
Therefore Cohen's kappa coefficient is k = (0.7 − 0.5) / (1 − 0.5) = 0.4.
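
The worked example can be checked with scikit-learn's cohen_kappa_score, which takes the two raters' label sequences rather than the counts; the sketch below simply reconstructs the 50 paired ratings of Table-3:

from sklearn.metrics import cohen_kappa_score

# Rebuild the 50 paired ratings from Table-3 (20 Yes/Yes, 5 Yes/No, 10 No/Yes, 15 No/No)
rater_a = ["Yes"] * 20 + ["Yes"] * 5 + ["No"] * 10 + ["No"] * 15
rater_b = ["Yes"] * 20 + ["No"]  * 5 + ["Yes"] * 10 + ["No"] * 15

print(cohen_kappa_score(rater_a, rater_b))  # 0.4, matching the hand calculation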

Jaccard Score:

The Jaccard index, also known as Intersection over Union and as the Jaccard similarity coefficient (originally given the French name coefficient de communauté by Paul Jaccard), is a statistic used for gauging the similarity and diversity of sample sets. The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:

J(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| − |A ∩ B|)
Figure-10:


Figure-11:


This measure is similar to the Dice coefficient D, such that

J = D / (2 − D)   and equivalently   D = 2J / (1 + J)
Jaccard Score for Multiclass


A sample confusion matrix for multiclass problem is shown below:

               Predicted A    Predicted B    Predicted C
True A         AA             AB             AC
True B         BA             BB             BC
True C         CA             CB             CC

Table-4
Confusion matrix for sample classification

Comparing the computation of accuracy with the Jaccard score:

Accuracy = (AA + BB + CC) / (AA + AB + AC + BA + BB + BC + CA + CB + CC)

The average Jaccard score (a.k.a. average Jaccard coefficient) uses one Jaccard coefficient per class,

J_A = AA / (AA + AB + AC + BA + CA),  J_B = BB / (AB + BA + BB + BC + CB),  J_C = CC / (AC + BC + CA + CB + CC)

Average Jaccard score = (J_A + J_B + J_C) / 3
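
A minimal scikit-learn sketch of the Jaccard score on a small, hypothetical multiclass problem, using the average parameter to obtain per-class or averaged values:

from sklearn.metrics import jaccard_score

y_true = ["A", "A", "A", "B", "B", "C", "C", "C"]
y_pred = ["A", "A", "B", "B", "C", "C", "C", "A"]

print(jaccard_score(y_true, y_pred, average=None))      # one Jaccard coefficient per class
print(jaccard_score(y_true, y_pred, average="macro"))   # unweighted mean over the classes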


Hinge loss: 

In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs). SVM has a notion of a margin. 
The table below illustrates hinge loss for a hypothetical SVM. Consider a binary classification problem. Items can be class -1 or +1 (for example, male / female)
Suppose the margin is 0.2 and a set of actual and computed values is as shown in the table.
margin = 0.2

     actual  computed  hinge loss
==================================
[0]   +1        0.55        0
[1]   +1        0.25        0     
[2]   +1        0.15       0.05      
[3]   +1       -0.25       0.45

[4]   -1       -0.35        0      
[5]   -1       -0.98        0
[6]   -1       -0.05       0.15
[7]   -1       +0.25       0.45

If the computed output value is any positive value, the prediction is class +1 and vice versa.
For item [0], the actual is +1 and the computed is +0.55 so this is a correct prediction and because the computed value is greater than the margin of 0.2 there is no hinge loss error.
For item [1], the actual is +1 and the computed is +0.25 so the same situation occurs.
For item [2], the actual is +1 and the computed is +0.15 so the classification is correct, but the computed is too close (less than the margin of 0.2) to zero so there’s a small hinge loss even though the classification is correct.
For item [3], the actual is +1 and the computed is -0.25, so the classification is wrong and there's a large hinge loss.
For item [4], the actual is -1 and the computed is -0.35 so the classification is correct, and there is no hinge loss because the computed is far away enough (0.2) from the boundary of 0.
For item [5], the actual is -1 and the computed is -0.98 so this is the same situation as item [4] and so no hinge loss.
For item [6], the actual is -1 and the computed is -0.05 the classification is correct but there is a moderate hinge loss because the computed is too close to zero.
For item [7], the actual is -1 and the computed is +0.25 so the classification is wrong and there’s a large hinge loss. Notice the symmetry with item [3].
To summarize, when working with an SVM, if a computed value gives a correct classification and is larger than the margin, there is no hinge loss. If a computed value gives a correct classification but is too close to zero (where too close is defined by a margin) there is a small hinge loss. If a computed value gives an incorrect classification there will always be a hinge loss.
This is the conceptual idea of hinge loss for a margin-based classifier.
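
The table above can be reproduced with a few lines of Python; this is a direct sketch of the per-item rule max(0, margin − actual × computed) using the table's margin of 0.2 (scikit-learn's hinge_loss implements the standard formulation with a margin of 1):

margin = 0.2
actual   = [+1, +1, +1, +1, -1, -1, -1, -1]
computed = [0.55, 0.25, 0.15, -0.25, -0.35, -0.98, -0.05, 0.25]

for i, (t, y) in enumerate(zip(actual, computed)):
    loss = max(0.0, margin - t * y)   # zero when the score is on the right side and beyond the margin
    print(f"[{i}] hinge loss = {loss:.2f}")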
For an intended output t = ±1 and a classifier score y, the hinge loss of the prediction y is defined as

L(y) = max(0, 1 − t × y)

Note that y should be the raw output of the classifier's decision function (the score value), not the predicted class label.
When t and y have the same sign (meaning y predicts the right class) and |y| ≥ 1 (the score is beyond the margin), the hinge loss L(y) is 0. When they have opposite signs, the loss L(y) increases linearly with |y|.

Figure-12:
Hinge loss for when the true class is +1



Figure-13:
Hinge loss for when the true class is -1

Log Loss (Binary Cross Entropy): 
AUC-ROC considers the predicted probabilities when determining model performance, but there is an issue: it uses only the ordering of the probabilities, so it does not reflect the model's capability to predict a higher probability for samples that are more likely to be positive.
In that case we can use the log loss, which is simply the negative average of the log of the corrected predicted probabilities for each instance. The cross-entropy loss function measures the performance of a classification model whose output is a probability value between 0 and 1. When the cross entropy is large, the deviation of the output from the target is large, and vice versa; hence this error measure is appropriate only for classification models. The cross-entropy cost function depends on the relative errors and not on the sum of squares of absolute errors, so it gives the same weight to small and large error values. For a binary problem with N samples, the log loss is

Log loss = −(1/N) × Σ [ t1 × log(y1) + (1 − t1) × log(1 − y1) ]
  • y1 is predicted probability of positive class
  • 1- y1 is predicted probability of negative class
  • t1  = 1 for positive class and 0 for negative class (actual values)



Let us calculate log loss for a few random values to get the gist of the above mathematical function:

  • Logloss(1, 0.1) = 2.303. 
  • Logloss(1, 0.5) = 0.693. 
  • Logloss(1, 0.9) = 0.105


If we plot this relationship, we will get a curve as follows:



Figure-14:

It’s apparent from the gentle downward slope towards the right that the Log Loss gradually declines as the predicted probability improves. Moving in the opposite direction though, the Log Loss ramps up very rapidly as the predicted probability approaches 0.

So, the lower the log loss, the better the model. However, there is no absolute threshold for a good log loss; it is use-case or application dependent.

Whereas the AUC is computed with regard to binary classification with a varying decision threshold, log loss actually takes “certainty” of classification into account.

The above log loss function is also called the Binary Cross-Entropy error or the Negative Log-Likelihood. For multiclass problems with K classes, the binary cross-entropy equation can be extended to sum over all K classes instead of 2. The multiclass cross-entropy function is also called categorical cross-entropy.
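
A short sketch with scikit-learn's log_loss; it expects the predicted probability of the positive class (or a full probability matrix in the multiclass case), and the numbers below are hypothetical:

import numpy as np
from sklearn.metrics import log_loss

y_true = [1, 1, 1, 0, 0]
y_prob = [0.9, 0.5, 0.1, 0.2, 0.4]   # predicted probability of class 1 for each sample

print(log_loss(y_true, y_prob))      # average of -log(corrected probability) per instance
print(-np.log(0.1))                  # single-instance check: Logloss(1, 0.1) = 2.303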


Matthews Correlation Coefficient:

The Matthews correlation coefficient (MCC) is used in machine learning as a measure of the quality of binary (two-class) classifications, introduced by biochemist Brian W. Matthews in 1975. Although the MCC is equivalent to Karl Pearson's phi coefficient, which was developed decades earlier, the term MCC is widely used in the field of bioinformatics.
While there is no perfect way of describing the confusion matrix of true and false positives and negatives by a single number, the Matthews correlation coefficient is generally regarded as being one of the best such measures. Other measures, such as the proportion of correct predictions (also termed accuracy), are not useful when the two classes are of very different sizes. For example, assigning every object to the larger set achieves a high proportion of correct predictions, but is not generally a useful classification.
The MCC can be calculated directly from the confusion matrix using the formula:

MCC = (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
The coefficient takes into account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes.

The MCC is in essence a correlation coefficient between the observed and predicted binary classifications; it returns a value between −1 and +1. A coefficient of +1 represents a perfect prediction, 0 no better than random prediction and −1 indicates total disagreement between prediction and observation.

The statistic is also known as the phi coefficient. MCC is related to the chi-square statistic for a 2×2 contingency table. According to some ML practitioners and scientists Matthews correlation coefficient is the most informative single score to establish the quality of a binary classifier prediction in a confusion matrix context.

MCC for Multiclass
The Matthews correlation coefficient has been generalized to the multiclass case (K different classes). This generalization was called the R_K statistic by the author, and it is defined in terms of a K × K confusion matrix C.


Advantages of MCC over Accuracy and F1 Scores

The Matthews correlation coefficient is more informative than F1 score and Accuracy measure in evaluating binary classification problems, because it takes into account the balance ratios of the four confusion matrix categories (true positives, true negatives, false positives, false negatives).

Though Accuracy and F1 score are widely employed in statistics, both can be misleading, since they do not fully consider the size of the four classes of the confusion matrix in their final score computation.
Suppose, for example, you have a very imbalanced validation set made of 100 elements, 95 of which are positive elements, and only 5 are negative elements. And suppose also you made some mistakes in designing and training your machine learning classifier, and now you have an algorithm which always predicts positive. Imagine that you are not aware of this issue.
By applying your only-positive predictor to your imbalanced validation set, therefore, you obtain values for the confusion matrix categories:

TP = 95, FP = 5; TN = 0, FN = 0.

These values lead to the following performance scores: Accuracy = 95%, and F1 score = 97.44%. By reading these over-optimistic scores, then you will be very happy and will think that your machine learning algorithm is doing an excellent job. Obviously, you would be on the wrong track.
On the contrary, to avoid such misleading conclusions, we can use the Matthews correlation coefficient (MCC), whose formula is given above (worst value = −1, best value = +1). Because it considers the proportion of each class of the confusion matrix in its formula, its score is high only if the classifier is doing well on both the negative and the positive elements. For the example above the MCC works out to 0 (the undefined 0/0 case is conventionally taken as 0), correctly signalling that the classifier is useless on the negative class.
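
The imbalanced example above can be checked with scikit-learn; the sketch below reconstructs the 100 validation labels and compares the three scores:

from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

y_true = [1] * 95 + [0] * 5   # 95 positives, 5 negatives
y_pred = [1] * 100            # a broken classifier that always predicts positive

print(accuracy_score(y_true, y_pred))     # 0.95 -- looks great
print(f1_score(y_true, y_pred))           # about 0.9744 -- also looks great
print(matthews_corrcoef(y_true, y_pred))  # 0.0 (scikit-learn treats the 0/0 case as 0) -- reveals the problem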

Zero_One_loss and Brier Score: 

Zero_one_loss is a common loss function used with classification learning. It assigns 0 to loss for a correct classification and 1 for an incorrect classification.

In multilabel classification, the zero_one_loss function corresponds to the subset zero-one loss: for each sample, the entire set of labels must be correctly predicted, otherwise the loss for that sample is equal to one.

Zero-one loss = (1/N) × Σ(i=1..N) I(ŷ_i ≠ y_i)

where N is the number of samples, I(·) is the indicator function, and in the multilabel case the comparison is over the complete set of M class labels.

Figure-15:

Note:
In multiclass (single-label) classification problems, the Hamming loss coincides with the subset zero-one loss; in that case it corresponds to the Hamming distance between y_true and y_pred.
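
A small sketch contrasting zero_one_loss and hamming_loss on a hypothetical multilabel example, where the difference between the two becomes visible:

import numpy as np
from sklearn.metrics import zero_one_loss, hamming_loss

y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 0],    # one of three labels wrong
                   [0, 1, 0]])   # exactly right

print(zero_one_loss(y_true, y_pred))  # 0.5 -- one of the two samples is not an exact (subset) match
print(hamming_loss(y_true, y_pred))   # about 0.167 -- only 1 of the 6 individual labels is wrong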


Brier Score
The Brier score is a proper score function that measures the accuracy of probabilistic predictions. It is applicable to tasks in which predictions must assign probabilities to a set of mutually exclusive discrete outcomes. The set of possible outcomes can be either binary or categorical in nature, and the probabilities assigned to this set of outcomes must sum to one (where each individual probability is in the range of 0 to 1). It was proposed by Glenn W. Brier in 1950.
The Brier score can be thought of as a cost function. More precisely, across all items i = 1, ..., N in a set of N predictions, the Brier score measures the mean squared difference between:

  • The predicted probability pi assigned to the possible outcomes for item i. 
  • The actual outcome oi 

Therefore, the lower the Brier score is for a set of predictions, the better the predictions are calibrated.
Note that the Brier score, in its most common formulation, takes on a value between zero and one, since this is the largest possible difference between a predicted probability (which must be between zero and one) and the actual outcome (which can take on values of only 0 or 1). In the original (1950) formulation of the Brier score, the range is double, from zero to two.
The Brier score is appropriate for binary and categorical outcomes that can be structured as true or false, but is inappropriate for ordinal variables which can take on three or more values.
Suppose that one is forecasting the probability p that it will rain on a given day. Then the Brier score is calculated as follows:
  • If the forecast is 100% (p = 1) and it rains, then the Brier score is 0, the best score achievable.
  • If the forecast is 100% and it does not rain, then the Brier score is 1, the worst score achievable.
  • If the forecast is 70% (p = 0.70) and it rains, then the Brier score is (0.70 − 1)² = 0.09.
  • If the forecast is 30% (p = 0.30) and it rains, then the Brier score is (0.30 − 1)² = 0.49.
  • If the forecast is 50% (p = 0.50), then the Brier score is (0.50 − 1)² = (0.50 − 0)² = 0.25, regardless of whether it rains.
The Brier score can be decomposed into 3 additive components: Uncertainty, Reliability, and Resolution. (Murphy 1973)
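
The rain-forecast cases above can be verified with scikit-learn's brier_score_loss, which implements the common 0-to-1 formulation as the mean of (p_i − o_i)²; a minimal sketch:

from sklearn.metrics import brier_score_loss

# Outcomes: 1 = it rained, 0 = it did not; forecasts are the predicted probabilities of rain
y_rain     = [1,   0,   1,    1,    1]
p_forecast = [1.0, 1.0, 0.70, 0.30, 0.50]

# Per-forecast squared errors: 0, 1, 0.09, 0.49, 0.25 -- the score is their mean
print(brier_score_loss(y_rain, p_forecast))  # (0 + 1 + 0.09 + 0.49 + 0.25) / 5 = 0.366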


Figure-16:


Log loss score vs Brier score
The log loss score heavily penalizes predicted probabilities that are far away from their expected value. The Brier score is gentler than log loss but still penalizes in proportion to the distance from the expected value.

Either of these measures may be appropriate, depending on what you want to concentrate on.
The Brier score is basically the sum of squared errors of the classwise probability estimates. It will inform you as to both how accurate the model is and how "confidently" accurate the model is.
You would not want to use the Brier score for an ordinal classification problem, where, for example, missing class 1 by predicting class 2 is better than predicting class 3: the Brier score weights all misses equally.
Cross entropy (log loss) will, basically, measure the relative uncertainty between classes your model produces relative to the true classes. Over the past decade or so, it's become one of the very standard model scoring statistics for multiclass (and binary) classification problems.

Figure-17:



Mean absolute error: 

The mean absolute error (MAE) is the simplest regression error metric to understand. We calculate the residual for every data point, take the absolute value of each so that negative and positive residuals do not cancel out, and then take the average of all these residuals. Effectively, MAE describes the typical magnitude of the residuals:

MAE = (1/N) × Σ |y_i − ŷ_i|
Mean absolute percentage error

The mean absolute percentage error (MAPE) is the percentage equivalent of MAE. The equation is similar to that of MAE, with adjustments to convert everything into percentages:

MAPE = (100% / N) × Σ |(ŷ_i − y_i) / y_i|

Mean percentage error

The mean percentage error (MPE) equation is exactly like that of MAPE; the only difference is that it lacks the absolute value operation:

MPE = (100% / N) × Σ (ŷ_i − y_i) / y_i
MPE is useful to us because it allows us to see if our model systematically underestimates (more negative error) or overestimates (positive error).


Mean square error: 

The mean square error (MSE) is just like the MAE, but squares the differences before averaging them instead of taking absolute values:

MSE = (1/N) × Σ (y_i − ŷ_i)²

The table below will give a quick summary of the acronyms and their basic characteristics.

Acronym   Full Name                        Residual Operation   Robust To Outliers?
MAE       Mean Absolute Error              Absolute Value       Yes
MSE       Mean Squared Error               Square               No
RMSE      Root Mean Squared Error          Square               No
MAPE      Mean Absolute Percentage Error   Absolute Value       Yes
MPE       Mean Percentage Error            N/A                  Yes

Table-5

Root Mean Squared Error (RMSE)
RMSE is the most popular evaluation metric used in regression problems. It follows an assumption that errors are unbiased and follow a normal distribution. Here are the key points to consider on RMSE:
  1. The square root brings the error back to the scale of the target variable while still reflecting large deviations.
  2. The 'squared' nature of this metric prevents positive and negative errors from cancelling out, so it aptly reflects the plausible magnitude of the error term.
  3. It avoids the use of absolute error values, which are often undesirable in mathematical calculations.
  4. When we have more samples, reconstructing the error distribution using RMSE is considered to be more reliable.
  5. RMSE is highly affected by outlier values. Hence, make sure you’ve removed outliers from your data set prior to using this metric.
  6. As compared to mean absolute error, RMSE gives higher weightage and punishes large errors.
The RMSE metric is given by:

RMSE = sqrt( (1/N) × Σ (y_i − ŷ_i)² )

where N is the total number of observations.
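
A compact sketch computing these regression metrics on hypothetical data; scikit-learn provides MAE and MSE directly, while MAPE, MPE and RMSE are one-liners with NumPy:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 360.0])

mae  = mean_absolute_error(y_true, y_pred)
mse  = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mape = np.mean(np.abs((y_pred - y_true) / y_true)) * 100   # in percent
mpe  = np.mean((y_pred - y_true) / y_true) * 100           # sign shows over/under-estimation

print(mae, mse, rmse, mape, mpe)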

Max-Error: 

While RMSE is the most common metric, it can be hard to interpret. One alternative is to look at quantiles of the distribution of the absolute percentage errors. The Max-Error metric is the worst-case error between the predicted and the true values: Max-Error = max_i |y_i − ŷ_i|.

Root Mean Squared Logarithmic Error
RMSLE measures the ratio between the actual and predicted values:

RMSLE = sqrt( (1/N) × Σ (log(ŷ_i + 1) − log(y_i + 1))² )

This can be written as

RMSLE = sqrt( (1/N) × Σ (log((ŷ_i + 1) / (y_i + 1)))² )
The RMSLE is used when we don’t want to penalize huge differences in the predicted and the actual values when both predicted and true values are huge numbers.



  1. If both predicted and actual values are small: RMSE and RMSLE are same.
  2. If either predicted or the actual value is big: RMSE > RMSLE
  3. If both predicted and actual values are big: RMSE > RMSLE (RMSLE becomes almost negligible)

R-Squared: 

We learned that when the RMSE decreases, the model’s performance will improve. But these values alone are not intuitive.
In the case of a classification problem, if the model has an accuracy of 0.8, we can gauge how good it is against a random model, which has an accuracy of 0.5; the random model acts as a benchmark. But for RMSE we do not have such a benchmark to compare against.
This is where the R-Squared metric comes in. The formula for R-Squared is as follows:

R² = 1 − MSE(model) / MSE(baseline)
where


  • MSE (model): Mean Squared Error of the predictions against the actual values
  • MSE (baseline): Mean Squared Error of the mean prediction against the actual values

In other words, it tells us how good our regression model is compared to a very simple model that just predicts the mean value of the target (computed from the training set) for every sample.

Adjusted R-Squared
A model performing equal to the baseline would give an R-Squared of 0; the better the model, the higher the value, and a model with all predictions correct would give an R-Squared of 1. However, on adding new features to the model, the R-Squared value either increases or stays the same: R-Squared does not penalize features that add no value to the model. An improved version is the adjusted R-Squared, given by:

Adjusted R² = 1 − [ (1 − R²) × (n − 1) / (n − (k + 1)) ]

where

k: number of features
n: number of samples

As you can see, this metric takes the number of features into account. When we add more features, the denominator term n − (k + 1) decreases, so the subtracted term (1 − R²) × (n − 1) / (n − (k + 1)) increases.
If R-Squared does not increase enough to compensate, the added feature is not valuable for our model: we subtract a larger value from 1 and the adjusted R², in turn, decreases.
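
A small sketch of R-Squared with scikit-learn and the adjusted R-Squared computed from it; the data, n and k below are hypothetical:

import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0])
y_pred = np.array([2.8, 5.3, 6.6, 9.4, 10.9, 13.4])

r2 = r2_score(y_true, y_pred)           # 1 - MSE(model) / MSE(mean baseline)

n, k = len(y_true), 2                   # n samples, k features (assumed for illustration)
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - (k + 1))

print(r2, adjusted_r2)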




Figure Credits:


Figure-1: blog.minitab.com
Figure-3: wiki.awf.forst.uni-goettingen.de
Figure-4: sciencewitheberhart.weebly.com
Figure-5: datascience.stackexchange.com
Figure-6: www.medcalc.org
Figure-7: www.datasciencecentral.com
Figure-8: machinelearning-blog.com
Figure-9: stackoverflow.com
Figure-10: www.differencebetween.net
Figure-11: www.pinclipart.com
Figure-12: math.stackexchange.com
Figure-13: stats.stackexchange.com
Figure-14: ml-cheatsheet.readthedocs.io
Figure-15: fa.bianp.net
Figure-16: www.scisports.com



