Metrics for Evaluating AI/ML Algorithms
Metrics for Evaluation (Performance Measures) of AI/ML Algorithms
A metric is any number that provides measured information. The performance of learning models is evaluated with various types of metrics. Evaluating a machine learning model is similar to hypothesis testing in statistics: just as the value of a population parameter has to be statistically inferred from sample statistics, an AI/ML model is evaluated using a finite sampled data set. The available data set is split into train and test sets. Trained models are never evaluated on the training data but on the test set. Evaluation can be done by holding out a test set, by cross-validation, or by bootstrapping.
Classification Accuracy
Accuracy is the simplest metric for measuring the performance of a trained ML model. It is the number of correct predictions divided by the total number of predictions made on a given set of observed data.
Classification Rate/Accuracy:
Classification Rate or Accuracy is given by the relation:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Accuracy is a simplistic measure that can be misleading on many real-world problems. Consider a two-class classifier whose output is one of two possible judgments: Positive or Negative. Given a test set and a specific classifier, each example falls into one of four possible outcomes:
- A positive example classified as positive: a true positive (TP).
- A positive example misclassified as negative: a false negative (FN).
- A negative example classified as negative: a true negative (TN).
- A negative example misclassified as positive: a false positive (FP).
The problem with class imbalance
The accuracy measure assumes that the dataset is balanced, or approximately balanced, with a 50:50 split of positive and negative classes. In the real world, imbalanced data sets are the rule rather than the exception. On a dataset with a 99:1 split of negatives to positives, the accuracy measure can lead to a wrong evaluation: a classifier that always predicts the negative class is 99% accurate yet useless.
Examples of unbalanced data sets: 1% fraudulent finance transactions versus 99% genuine, 95% healthy versus 5% diseased, 10% customer churn versus 90% retained, 99.5% defect-free factory items versus 0.5% defective, 99.999% of the human population are not terrorists, and so on.
Confusion Matrix:
A confusion matrix is a summary of prediction results on a classification problem. The numbers of correct and incorrect predictions are summarized with count values, broken down by class. The confusion matrix shows the ways in which a classification model is confused when it makes predictions. It is also called a classification matrix. For a binary problem it is a 2×2 matrix, here with the columns as the true classes and the rows as the hypothesized (predicted) classes. Consider class 1 as positive and class 0 as negative; the row/column arrangement may also be interchanged.
Definition of the Terms:
- Positive (P): the true observation is positive (for example: is an apple).
- Negative (N): the true observation is not positive (for example: is not an apple).
- True Positive (TP): the observation is positive and is predicted to be positive.
- False Negative (FN): the observation is positive but is predicted to be negative.
- True Negative (TN): the observation is negative and is predicted to be negative.
- False Positive (FP): the observation is negative but is predicted to be positive.
The following table summarizes the above definitions.

| | Actual Positive | Actual Negative |
|---|---|---|
| Predicted Positive | TP | FP |
| Predicted Negative | FN | TN |
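As a minimal sketch (assuming scikit-learn is available), the four counts and the accuracy can be read off a confusion matrix as follows; note that scikit-learn places the true classes on the rows, the opposite of the table above, and the labels here are made up for illustration:

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Illustrative ground truth and predictions (1 = positive, 0 = negative)
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

# ravel() unpacks the 2x2 matrix in scikit-learn's ordering: TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FN, TN, FP:", tp, fn, tn, fp)

# Accuracy = (TP + TN) / (TP + TN + FP + FN)
print("Accuracy:", accuracy_score(y_true, y_pred))
```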
Expected cost
The confusion matrix contains the frequencies of the four different outcomes. The most precise way to deal with the class-imbalance problem is to use these four numbers to calculate an expected cost (or, equivalently, an expected benefit). The standard expected value of a probabilistic random variable is obtained by taking the probability of each outcome and multiplying it by its corresponding value. We use the same method to compute the expected cost of the confusion-matrix outcomes, after evaluating the cost or benefit of each of the four outcomes. The final calculation is the sum
Expected cost = p(P) × cost(P) + p(N) × cost(N)
In the above equation p(P) and p(N) are the prior probabilities of the positive and negative classes respectively, also called the class priors; cost(P) is the cost of dealing with a positive example and cost(N) the cost of dealing with a negative example. The cost of the positive class is calculated as
cost(P) = p(TP) × cost(TP) + p(FN) × cost(FN)
and the cost of the negative class is calculated similarly (a benefit is simply treated as a negative cost). The class priors p(P) and p(N) can be estimated directly from the data. The four rates p(TP), p(FN), p(TN), p(FP) are computed from the classifier's confusion matrix. The cost() and benefit() values are extrinsic values that cannot be derived from the data; they are estimated from expert knowledge of the domain.
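A minimal sketch of this calculation, assuming illustrative confusion-matrix counts and hypothetical cost/benefit values chosen only for the example (benefits are encoded as positive numbers and costs as negative, so the result is an expected value per classified example):

```python
# Confusion-matrix frequencies (illustrative counts)
TP, FN, FP, TN = 70, 30, 40, 860

pos, neg = TP + FN, FP + TN
p_P, p_N = pos / (pos + neg), neg / (pos + neg)   # class priors estimated from data

# Rates conditioned on the true class
p_TP, p_FN = TP / pos, FN / pos
p_FP, p_TN = FP / neg, TN / neg

# Costs/benefits are extrinsic, domain-expert estimates (hypothetical values here)
b_TP, c_FN = 50.0, -200.0    # benefit of a caught positive, cost of a miss
b_TN, c_FP = 0.0, -10.0      # benefit of a correct negative, cost of a false alarm

expected_value = p_P * (p_TP * b_TP + p_FN * c_FN) + \
                 p_N * (p_TN * b_TN + p_FP * c_FP)
print("Expected value per classified example:", expected_value)
```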
Precision:
To get the value of precision we divide the number of correctly classified positive examples by the total number of examples predicted as positive. High precision indicates that a large percentage of the samples predicted as positive are indeed positive, i.e., the number of false positives (FP) is small. Precision tells us, when the model predicts yes, how often it is correct.
The precision of label "1", denoted P1, is
P1 = TP1 / (TP1 + FP1)
and the precision of label "0", denoted P0, is
P0 = TP0 / (TP0 + FP0)
Precision vs Accuracy
The difference between the precision and accuracy metrics is illustrated in the figures.
Recall:
Recall (sensitivity) is defined as the ratio of the number of correctly classified positive examples to the total number of positive examples. High recall indicates that a large percentage of the true positive class is correctly recognized, i.e., the number of false negatives (FN) is small. Recall tells us, when the actual answer is yes, how often the model predicts yes. Recall is given by the relation:
Recall = TP / (TP + FN)
Figure: illustration of how precision and recall can be used for object detection in images. Precision is the fraction of predictions labelled true that are actually true; recall is the fraction of the actually-true instances that are retrieved. (Also refer to the Dice Similarity Index.)
Precision vs. Recall for Imbalanced Classification
You may decide to use precision or recall for a problem with imbalanced classes. Maximizing precision minimizes the number of false positives, whereas maximizing recall minimizes the number of false negatives. To summarize:
- Precision: appropriate when minimizing false positives is the focus.
- Recall: appropriate when minimizing false negatives is the focus.
Macro, Micro and Weighted Average Methods
Macro-average precision
Calculate the metric for each label and take the unweighted mean. This does not take class-label imbalance into account. The method is straightforward: just average the model's precision over the class labels. For example, the macro-average precision for class labels 1 and 0 is computed as
Macro-average precision = (P1 + P0) / 2
Micro-average precision
Calculate the metric globally by counting the total true positives, false negatives and false positives.
Micro-average precision = (TP1 + TP0) / (TP1 + TP0 + FP1 + FP0)
Weighted-average precision
Calculate the metric for each label and take the average weighted by support (the number of true instances of each label). This alters the macro average to account for label imbalance:
Weighted-average precision = W1 × P1 + W0 × P0
where W1 and W0 are the support fractions of labels 1 and 0 (so W1 + W0 = 1).
(The above macro, micro and weighted averaging methods apply equally to recall.)
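A minimal scikit-learn sketch of the three averaging modes on made-up imbalanced labels:

```python
from sklearn.metrics import precision_score

# Illustrative imbalanced labels (class 1 is the minority)
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]

print("Per-class [P0, P1]:", precision_score(y_true, y_pred, average=None))
print("Macro             :", precision_score(y_true, y_pred, average='macro'))
print("Micro             :", precision_score(y_true, y_pred, average='micro'))
print("Weighted          :", precision_score(y_true, y_pred, average='weighted'))
```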
F1-measure:
Since we have two measures (precision and recall), it helps to have a single measurement that represents both of them. Consider a binary classifier for which we need the best precision and recall at the same time. Computing the arithmetic mean of precision and recall to get the best of both is not a good solution; the F1-measure uses the harmonic mean in place of the arithmetic mean, which is pulled toward the smaller of the two values rather than the larger. The F1 score is calculated as
F1 = 2 × Precision × Recall / (Precision + Recall)
Assume Recall = 0.95 and Precision = 0.91. Then
F1-measure = (2 × 0.95 × 0.91) / (0.91 + 0.95) ≈ 0.93
The F1-measure is always nearer to the smaller of precision and recall.
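For instance, a quick check of the arithmetic versus harmonic mean for these numbers (plain Python, no ML library needed):

```python
precision, recall = 0.91, 0.95

arithmetic_mean = (precision + recall) / 2
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print("Arithmetic mean:", round(arithmetic_mean, 4))  # 0.93
print("F1 (harmonic)  :", round(f1, 4))               # ~0.9296, nearer to 0.91
```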
Advantage of the F1 measure (a simple use case)
Assume that an e-commerce company implements a simple recommender system. A customer has sent a query about a product, and the recommender should display a list of possible recommendations. The dataset contains 1000 products, of which 100 are relevant (positive) and 900 are non-relevant (negative). The task is to classify them as relevant (1) or non-relevant (0).
The company decides to drive false negatives to 0 by treating all 1000 products as relevant, i.e., the non-relevant products are also labelled relevant. There are then 900 false positives and 100 true positives. Since every truly positive product is predicted positive there are no false negatives, and since every product is predicted positive there are no true negatives. Then
Precision = 100 / (100 + 900) = 0.1
Recall = 100 / (100 + 0) = 1
If we take the arithmetic mean of precision and recall to represent the quality of the classifier as a single number, we get (1 + 0.1) / 2 = 0.55. Even though the classifier is wrong on 90% of the items, this performance measure is 0.55.
The F1 score, however, is F1 = 2 × (0.1 × 1) / (0.1 + 1) ≈ 0.18, which reflects the poor precision much more faithfully.
Now consider a data set with an extremely large (effectively infinite) number of negative-class elements and a single positive element, and a dumb model that predicts positive for every instance. This means
Precision: 0.0
Recall: 1.0
Now:
Arithmetic mean: 0.5
Harmonic mean: 0.0
The arithmetic mean scores this model at 50%, despite it being the worst possible outcome, whereas the harmonic mean gives an F-measure of 0. The 0.5 comes from a dumb classifier that ignores the input and just happens to predict the one positive item correctly; the harmonic mean of 0 is the accurate verdict, since the model is useless for all practical purposes.
Fbeta measure:
There are situations, however, in which a data scientist would like to give more importance (weight) to either precision or recall. Altering the F1 expression slightly to include an adjustable parameter beta for this purpose, we get
Fβ = (1 + β²) × Precision × Recall / (β² × Precision + Recall)
Fβ measures the effectiveness of a model with respect to a user who attaches β times as much importance to recall as to precision. The best value of the Fβ measure is 1 and the worst is 0. If β > 1 the measure favours recall, and if β < 1 it favours precision.
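A minimal scikit-learn sketch, with illustrative labels, showing how beta shifts the balance between precision and recall:

```python
from sklearn.metrics import f1_score, fbeta_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

print("F1                       :", f1_score(y_true, y_pred))
print("F2   (favours recall)    :", fbeta_score(y_true, y_pred, beta=2))
print("F0.5 (favours precision) :", fbeta_score(y_true, y_pred, beta=0.5))
```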
ROC and AUC:
Let us first try to understand the ROC (Receiver Operating Characteristic) curve. Referring to Table-1, we observe that for a probabilistic model we get a different (sensitivity, specificity) pair at each classification threshold: for each value of sensitivity we get a different specificity, and the two vary as shown in Figure-6.
Figure-6:
The ROC curve is the plot of sensitivity against (1 − specificity). (1 − specificity) is also known as the false positive rate (FPR) and sensitivity is also known as the true positive rate (TPR). The ROC curve for a certain model is shown in Figure-7.
Figure-7:
Consider a threshold for which we get the following confusion matrix.

| Actual \ Predicted | 1 | 0 | Total | Ratio |
|---|---|---|---|---|
| 1 | 3834 | 16 | 3850 | TPR = 99.6% |
| 0 | 634 | 951 | 1590 | FPR = 40.19% |

Table-2
As you can see, the sensitivity (TPR) at this threshold is 99.6% and (1 − specificity), the FPR, is about 40.2%. This coordinate becomes one point on our ROC curve.
AUC:
The area under the ROC curve (AUC) quantifies the ROC with a single numeric measure. Note that the area of the entire unit square equals 1. The following are a few rules of thumb for evaluating the AUC:
- 0.9 - 1.0 = excellent (A)
- 0.8 - 0.9 = good (B)
- 0.7 - 0.8 = fair (C)
- 0.6 - 0.7 = poor (D)
- 0.5 - 0.6 = fail (F)
A large AUC above 0.9 is excellent, but it might simply reflect over-fitting. In such cases it becomes very important to do in-time and out-of-time validations.
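A minimal scikit-learn sketch (the labels and scores below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Illustrative true labels and predicted probabilities for the positive class
y_true  = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 1])
y_score = np.array([0.1, 0.3, 0.35, 0.8, 0.4, 0.55, 0.7, 0.9, 0.2, 0.65])

# One (FPR, TPR) point per candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC:", roc_auc_score(y_true, y_score))
```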
Precision-Recall curve:
A precision-recall curve is a plot of the precision (y-axis) against the recall (x-axis) for different thresholds, much like the ROC curve. A naïve way to draw a precision-recall curve is to connect the precision-recall points, where each point is an (x, y) pair in precision-recall space with x the recall and y the precision.
The precision-recall curve shows how the precision vs. recall relationship changes as we vary the threshold for identifying a positive in our model. The threshold is the value above which a data point is considered to belong to the positive class.
With imbalanced and skewed data sets, precision-recall (PR) curves give a more informative picture of an algorithm's performance. Receiver Operating Characteristic (ROC) curves are generally used to present results for binary decision problems in machine learning, but an algorithm that optimizes the area under the ROC curve is not guaranteed to optimize the area under the PR curve.
The precision-recall curve shows the
tradeoff between precision and recall for
different thresholds. A high area under the curve represents
both high recall and high precision, where
high precision relates to a low false positive rate, and
high recall relates to a low false negative rate.
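A minimal scikit-learn sketch of the PR curve and its area on made-up imbalanced data:

```python
from sklearn.metrics import precision_recall_curve, auc, average_precision_score

# Imbalanced illustration: only 3 positives out of 10
y_true  = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.15, 0.05, 0.6, 0.4, 0.55, 0.8, 0.7]

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print("Area under PR curve:", auc(recall, precision))
print("Average precision  :", average_precision_score(y_true, y_score))
```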
Hamming loss:
Hamming loss measures the loss generated in the bit string of class labels during prediction. It takes the exclusive-or (XOR) between the actual and predicted label matrices and then averages over the data set. The following cases illustrate the calculation.

Case 1: Actual same as predicted
Actual = [[0 1], [1 1]], Predicted = [[0 1], [1 1]]
XOR output = [[0 0], [0 0]], HL = 0.0

Case 2: Actual is the inverse of the predictions
Actual = [[0 1], [1 1]], Predicted = [[1 0], [0 0]]
XOR output = [[1 1], [1 1]], HL = 1.0

Case 3: Actual partially same as predicted
Actual = [[0 1], [1 1]], Predicted = [[0 0], [0 1]]
XOR output = [[0 1], [1 0]], HL = 0.5

Hamming loss is computed as
HL = (1 / (N × L)) × Σ XOR(actual label, predicted label), summed over all N samples and L labels,
where N is the number of data samples and L is the number of labels.
Hamming loss equals (1 − accuracy) for binary single-label classification, so using HL does not add much in the binary case, since it is directly related to accuracy. Accuracy, however, is ambiguous in the multi-label case.
For the multi-label case, HL computes the Hamming loss between the actual labels and the predicted labels, i.e., the fraction of labels that are incorrectly predicted. HL thus provides one clear single performance value for the multi-label case, in contrast to precision/recall/F1, which can only be evaluated for independent binary classifiers per label.
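A small sketch reproducing Case 3 above, assuming scikit-learn and NumPy are available:

```python
import numpy as np
from sklearn.metrics import hamming_loss

# Multilabel targets: rows are samples, columns are labels (Case 3 above)
y_true = np.array([[0, 1],
                   [1, 1]])
y_pred = np.array([[0, 0],
                   [0, 1]])

# XOR of actual and predicted, averaged over the N*L entries
print("Manual :", np.logical_xor(y_true, y_pred).mean())   # 0.5
print("sklearn:", hamming_loss(y_true, y_pred))             # 0.5
```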
Cohen's Kappa Coefficient and Jaccard Score
Cohen's kappa coefficient:
Cohen's kappa is a measure of agreement that accounts for how much agreement could be expected by chance; it does this by taking into account the class imbalance and the classifier's tendency to vote Yes or No. It is a statistic that measures inter-rater agreement for qualitative (categorical) items.
Cohen's kappa is generally considered a more robust measure than a simple percentage-agreement calculation, since it takes into account the agreement occurring by chance. It measures the agreement between two raters who each classify N items into C mutually exclusive categories. The Cohen kappa coefficient is
k = (po − pe) / (1 − pe)
where
- po = the relative observed agreement among the raters,
- pe = the hypothetical probability of chance agreement.
po and pe are computed from the observed data, using the probabilities of each rater randomly assigning each category.
If the raters are in complete agreement then k =
1.
If there is no agreement among the raters other than what
would be expected by chance (as given by pe), then k ≤ 0.
Problem
There are 50 applications for grants. Each grant proposal was read by two readers, A and B, and each reader said either "Yes" or "No" to the proposal. The agree/disagree counts between A and B are shown in Table-3: agreement counts lie on the main diagonal and disagreement counts on the off-diagonal. Calculate Cohen's kappa coefficient.
| A \ B | Yes | No |
|---|---|---|
| Yes | 20 | 5 |
| No | 10 | 15 |

Table-3
Solution:
Note that 20 proposals were granted by both reader A and reader B, and 15 proposals were rejected by both readers.
The observed proportionate agreement is
p0 = (20 + 15) / 50 = 0.7
To calculate pe (the probability of random agreement) we note that:
- Reader A said "Yes" to 25 applicants and "No" to 25 applicants; thus reader A said "Yes" 50% of the time.
- Reader B said "Yes" to 30 applicants and "No" to 20 applicants; thus reader B said "Yes" 60% of the time.
Using the joint probability formula P(A and B) = P(A) × P(B), where P(A) and P(B) are the independent probabilities of A and B rating "Yes", the probability that both readers would say "Yes" by chance is 0.50 × 0.60 = 0.30. Similarly, the probability that both would say "No" by chance is 0.50 × 0.40 = 0.20. The overall probability of chance (random) agreement is therefore pe = 0.30 + 0.20 = 0.50.
Therefore Cohen's kappa coefficient k = (0.7 − 0.5) / (1 − 0.5) = 0.4.
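A quick check of this result with scikit-learn, reconstructing the 50 paired ratings from Table-3:

```python
from sklearn.metrics import cohen_kappa_score

# Rebuild the 50 (reader A, reader B) rating pairs from Table-3
ratings = [("Yes", "Yes")] * 20 + [("Yes", "No")] * 5 + \
          [("No", "Yes")] * 10 + [("No", "No")] * 15
rater_a = [a for a, b in ratings]
rater_b = [b for a, b in ratings]

print("Cohen's kappa:", cohen_kappa_score(rater_a, rater_b))   # 0.4
```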
Jaccard Score:
The Jaccard index, also known as Intersection over Union (IoU) and as the Jaccard similarity coefficient (originally given the French name coefficient de communauté by Paul Jaccard), is a statistic used for gauging the similarity and diversity of sample sets. The Jaccard coefficient measures similarity between finite sample sets and is defined as the size of the intersection divided by the size of the union of the sample sets:
J(A, B) = |A ∩ B| / |A ∪ B|
This measure is related to the Dice coefficient D by J = D / (2 − D).
Jaccard Score for Multiclass
A sample confusion matrix for a multiclass problem is shown below:

| True \ Predicted | A | B | C |
|---|---|---|---|
| A | AA | AB | AC |
| B | BA | BB | BC |
| C | CA | CB | CC |

Table-4: Confusion matrix for a sample classification
Comparing the computation of accuracy with the Jaccard score for this matrix:
The accuracy is
Accuracy = (AA + BB + CC) / (sum of all nine entries)
The average Jaccard score (a.k.a. average Jaccard coefficient) is the mean of the per-class Jaccard scores, where for class A
J(A) = AA / (AA + AB + AC + BA + CA)
and J(B), J(C) are defined analogously, so the average Jaccard score is (J(A) + J(B) + J(C)) / 3.
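A minimal scikit-learn sketch with illustrative three-class labels (0, 1, 2 standing in for A, B, C):

```python
from sklearn.metrics import accuracy_score, jaccard_score

# Illustrative 3-class labels: 0 = A, 1 = B, 2 = C
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 0, 2, 2, 0, 2]

print("Accuracy             :", accuracy_score(y_true, y_pred))
print("Per-class Jaccard    :", jaccard_score(y_true, y_pred, average=None))
print("Macro-average Jaccard:", jaccard_score(y_true, y_pred, average='macro'))
```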
Hinge loss:
In machine learning, the hinge loss is a loss function used for training classifiers. It is used for "maximum-margin" classification, most notably for support vector machines (SVMs); an SVM has a notion of a margin. Suppose the margin is 0.2 and a set of actual and computed values is as shown in the table.

margin = 0.2
| item | actual | computed | hinge loss |
|---|---|---|---|
| [0] | +1 | 0.55 | 0 |
| [1] | +1 | 0.25 | 0 |
| [2] | +1 | 0.15 | 0.05 |
| [3] | +1 | -0.25 | 0.45 |
| [4] | -1 | -0.35 | 0 |
| [5] | -1 | -0.98 | 0 |
| [6] | -1 | -0.05 | 0.15 |
| [7] | -1 | +0.25 | 0.45 |
If the computed output
value is any positive value, the prediction is class +1 and vice versa.
For item [0], the actual is
+1 and the computed is +0.55 so this is a correct prediction and because the
computed value is greater than the margin of 0.2 there is no hinge loss error.
For item [1], the actual is
+1 and the computed is +0.25 so the same situation occurs.
For item [2], the actual is
+1 and the computed is +0.15 so the classification is correct, but the computed
is too close (less than the margin of 0.2) to zero so there’s a small hinge
loss even though the classification is correct.
For item [3], the actual is +1 and the computed is -0.25, so the classification is wrong and there is a large hinge loss.
For item [4], the actual is -1 and the computed is -0.35, so the classification is correct, and there is no hinge loss because the computed value is far enough (more than the 0.2 margin) from the boundary of 0.
For item [5], the actual is
-1 and the computed is -0.98 so this is the same situation as item [4] and so
no hinge loss.
For item [6], the actual is
-1 and the computed is -0.05 the classification is correct but there is a
moderate hinge loss because the computed is too close to zero.
For item [7], the actual is
-1 and the computed is +0.25 so the classification is wrong and there’s a large
hinge loss. Notice the symmetry with item [3].
To summarize, when working
with an SVM, if a computed value gives a correct classification and is larger
than the margin, there is no hinge loss. If a computed value gives a correct
classification but is too close to zero (where too close is defined by a
margin) there is a small hinge loss. If a computed value gives an incorrect
classification there will always be a hinge loss.
This is the conceptual idea
of hinge loss for a margin-based classifier.
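A small sketch that reproduces the table above, assuming the margin-based variant max(0, margin − t·y); note that the standard hinge loss used by most libraries fixes the margin at 1, as in the formula below.

```python
import numpy as np

def margin_hinge_loss(t, y, margin=0.2):
    """Hinge loss max(0, margin - t*y) for true labels t in {-1, +1} and raw scores y."""
    return np.maximum(0.0, margin - t * y)

t = np.array([+1, +1, +1, +1, -1, -1, -1, -1])
y = np.array([0.55, 0.25, 0.15, -0.25, -0.35, -0.98, -0.05, 0.25])

print(margin_hinge_loss(t, y))   # [0.   0.   0.05 0.45 0.   0.   0.15 0.45]
```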
For an intended output t = ±1 and a classifier score y, the hinge loss of the prediction y is defined as
hinge loss = max(0, 1 − t × y)
Note that y here should be the raw output of the classifier's decision function (the score), not the predicted class label. When t and y have the same sign (meaning y predicts the right class) and |y| is at least the margin (1 in this standard form), the hinge loss is 0. When they have opposite signs, the loss increases linearly with |y|.
Figure-12: Hinge loss when the true class is +1.
Figure-13: Hinge loss when the true class is -1.
Log Loss (Binary Cross Entropy):
The AUC of the ROC uses the predicted probabilities to determine the model's performance, but it has a limitation: the ROC does not take into account the model's capability to assign a higher probability to samples that are more likely to be positive.
In that case we can use the log loss, which is the negative average of the log of the corrected predicted probabilities for each instance. The cross-entropy loss function measures the performance of a classification model whose output is a probability value between 0 and 1. When the cross entropy is large, the deviation of the output from the target is large, and vice versa; hence this error measure is appropriate only for classification models. The cross-entropy cost function depends on relative errors rather than on the sum of squares of absolute errors, and so gives comparable weight to small and large error values. For a single instance the binary cross entropy is
Log loss = −[t1 × log(y1) + (1 − t1) × log(1 − y1)]
where
- y1 is the predicted probability of the positive class,
- 1 − y1 is the predicted probability of the negative class,
- t1 = 1 for the positive class and 0 for the negative class (the actual value).
Let us calculate log loss for a few random values to get the
gist of the above mathematical function:
- Logloss(1, 0.1) = 2.303.
- Logloss(1, 0.5) = 0.693.
- Logloss(1, 0.9) = 0.105
If we plot this relationship, we will get a curve as
follows:
Figure-14:
It’s apparent from the gentle downward slope towards the
right that the Log Loss gradually declines as the predicted probability
improves. Moving in the opposite direction though, the Log Loss ramps up very
rapidly as the predicted probability approaches 0.
So, lower the log loss, better the model. However, there is
no absolute measure on a good log loss and it is use-case or application
dependent.
Whereas the AUC is computed with regard to binary
classification with a varying decision threshold, log loss actually takes
“certainty” of classification into account.
The above log loss function is also called Binary Cross Entropy Error
or Negative Log Likelihood. For multiclass problems with K classifications the binary cross entropy equation can be extended
to include all K classes instead of 2
classes. The multiclass cross entropy function is also called categorical
cross entropy.
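A minimal sketch, assuming scikit-learn, that reproduces the quoted single-instance values and shows the library call:

```python
import numpy as np
from sklearn.metrics import log_loss

def binary_cross_entropy(t, y):
    """-[t*log(y) + (1-t)*log(1-y)] averaged over samples."""
    t, y = np.asarray(t, float), np.asarray(y, float)
    return -np.mean(t * np.log(y) + (1 - t) * np.log(1 - y))

# Reproduce the single-instance values quoted above
for p in (0.1, 0.5, 0.9):
    print(f"Logloss(1, {p}) =", round(binary_cross_entropy([1], [p]), 3))

# scikit-learn equivalent on a small illustrative batch
print("Batch log loss:", log_loss([1, 0, 1, 1], [0.9, 0.1, 0.8, 0.35]))
```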
Matthews Correlation Coefficient:
The Matthews correlation coefficient (MCC) is used
in machine learning as a measure of the quality of binary
(two-class) classifications, introduced by biochemist Brian W.
Matthews in 1975. Although the MCC is equivalent to Karl Pearson's phi
coefficient, which was developed decades earlier, the term MCC is widely used
in the field of bioinformatics.
While there is no perfect way of describing
the confusion matrix of true and false positives and negatives by a
single number, the Matthews correlation coefficient is generally regarded as
being one of the best such measures. Other measures, such as the proportion of
correct predictions (also termed accuracy), are not useful when the two
classes are of very different sizes. For example, assigning every object to the
larger set achieves a high proportion of correct predictions, but is not
generally a useful classification.
The MCC can be calculated directly from the confusion matrix using the formula:
MCC = (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))
The coefficient takes into account true and false positives and
negatives and is generally regarded as a balanced measure which can be used
even if the classes are of very different sizes.
The MCC is in essence a correlation coefficient between the observed and
predicted binary classifications; it returns a value between −1 and +1. A
coefficient of +1 represents a perfect prediction, 0 no better than random
prediction and −1 indicates total disagreement between prediction and
observation.
The statistic is also known as the phi coefficient. MCC is related
to the chi-square statistic for a 2×2 contingency table. According to some ML practitioners and scientists Matthews correlation coefficient is the most informative
single score to establish the quality of a binary classifier prediction in a
confusion matrix context.
MCC for Multiclass
The Matthews correlation
coefficient has been generalized to the multiclass case (for K different classes). This
generalization was called the Rk
statistic by the author, and is also defined in terms of a confusion matrix.
The Matthews correlation coefficient is more informative than F1 score
and Accuracy measure in evaluating binary classification problems, because it
takes into account the balance ratios of the four confusion matrix categories
(true positives, true negatives, false positives, false negatives).
Though Accuracy and F1
score are widely employed in statistics, both can be misleading, since they do
not fully consider the size of the four classes of the confusion matrix in
their final score computation.
Suppose, for example, you
have a very imbalanced validation set made of 100 elements, 95 of which are
positive elements, and only 5 are negative elements. And suppose also you made
some mistakes in designing and training your machine learning classifier, and
now you have an algorithm which always predicts positive. Imagine that you are
not aware of this issue.
By applying your
only-positive predictor to your imbalanced validation set, therefore, you
obtain values for the confusion matrix categories:
TP = 95, FP = 5, TN = 0, FN = 0.
These values lead to the following performance scores: accuracy = 95% and F1 score = 97.44%. Reading these over-optimistic scores, you would be very happy and think that your machine learning algorithm is doing an excellent job, when obviously you would be on the wrong track.
On the contrary, to avoid these dangerously misleading illusions, we can use the Matthews correlation coefficient (MCC), whose worst value is −1 and best value is +1. By considering the proportion of each class of the confusion matrix in its formula, its score is high only if the classifier is doing well on both the negative and the positive elements.
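A quick check with scikit-learn (which, by convention, returns an MCC of 0 when the denominator of the formula is zero, as happens here):

```python
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

# 95 positives, 5 negatives; a broken model that always predicts positive
y_true = [1] * 95 + [0] * 5
y_pred = [1] * 100

print("Accuracy:", accuracy_score(y_true, y_pred))      # 0.95
print("F1 score:", f1_score(y_true, y_pred))            # ~0.974
print("MCC     :", matthews_corrcoef(y_true, y_pred))   # 0.0 (degenerate denominator)
```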
Zero-One Loss and Brier Score:
Zero-one loss is a common loss function for classification learning. It assigns a loss of 0 to a correct classification and 1 to an incorrect classification.
In multilabel classification, the zero_one_loss function corresponds to the subset zero-one loss: for each sample, the entire set of labels must be correctly predicted, otherwise the loss for that sample equals one. Averaged over N samples,
Zero-one loss = (1 / N) × Σ I(predicted labels ≠ actual labels)
where the indicator I(·) is 1 when the predicted label set differs from the actual label set in any of the M class labels, and M is the total number of class labels.
Note: in multiclass (single-label) classification problems, the Hamming loss is equivalent to the subset zero-one loss; in such cases the Hamming loss equals the normalized Hamming distance between y_true and y_pred.
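A minimal scikit-learn sketch contrasting the subset zero-one loss with the Hamming loss on a small multilabel example:

```python
import numpy as np
from sklearn.metrics import zero_one_loss, hamming_loss

# Multilabel case: subset zero-one loss counts a sample wrong unless ALL labels match
y_true = np.array([[0, 1], [1, 1]])
y_pred = np.array([[0, 0], [1, 1]])

print("Subset zero-one loss:", zero_one_loss(y_true, y_pred))  # 0.5 (first sample wrong)
print("Hamming loss        :", hamming_loss(y_true, y_pred))   # 0.25 (1 of 4 labels wrong)
```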
Brier Score
The Brier score is a proper scoring function that measures the accuracy of probabilistic predictions. It is applicable to tasks in which predictions must assign probabilities to a set of mutually exclusive discrete outcomes. The set of possible outcomes can be either binary or categorical in nature, and the probabilities assigned to this set of outcomes must sum to one (with each individual probability in the range 0 to 1). It was proposed by Glenn W. Brier in 1950.
The Brier score can be thought of as a cost function: across all items i = 1, ..., N in a set of N predictions, it measures the mean squared difference between
- the predicted probability pi assigned to the possible outcomes for item i, and
- the actual outcome oi:
Brier score = (1 / N) × Σ (pi − oi)²
Therefore, the lower the Brier score for a set of predictions, the better the predictions are calibrated.
Note that the Brier score, in its most common
formulation, takes on a value between zero and one, since this is the largest
possible difference between a predicted probability (which must be between zero
and one) and the actual outcome (which can take on values of only 0 or 1). In
the original (1950) formulation of the Brier score, the range is double, from
zero to two.
The Brier score is appropriate for binary and categorical
outcomes that can be structured as true or false, but is inappropriate for
ordinal variables which can take on three or more values.
Suppose that one is
forecasting the probability p that it will rain on
a given day. Then the Brier score is calculated as follows:
- If the forecast is 100% (p = 1) and it rains, the Brier score is 0, the best score achievable.
- If the forecast is 100% and it does not rain, the Brier score is 1, the worst score achievable.
- If the forecast is 70% (p = 0.70) and it rains, the Brier score is (0.70 − 1)² = 0.09.
- If the forecast is 30% (p = 0.30) and it rains, the Brier score is (0.30 − 1)² = 0.49.
- If the forecast is 50% (p = 0.50), the Brier score is (0.50 − 1)² = (0.50 − 0)² = 0.25, regardless of whether it rains.
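A minimal scikit-learn sketch that scores the five hypothetical rain forecasts above as one batch:

```python
from sklearn.metrics import brier_score_loss

# Did it rain? 1 = yes. Forecast probabilities follow the bullet list above.
y_true = [1,   1,    1,    1,    0]
y_prob = [1.0, 0.70, 0.30, 0.50, 0.50]

# Per-forecast squared errors: 0, 0.09, 0.49, 0.25, 0.25 -> mean = 0.216
print("Brier score:", brier_score_loss(y_true, y_prob))
```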
Figure-16: Log loss score vs Brier score
The log loss heavily penalizes predicted probabilities that are far from their expected value. The Brier score is gentler than log loss but still penalizes in proportion to the distance from the expected value.
Either of these measures may be appropriate, depending on what you want to concentrate on.
The Brier score is basically the sum of squared errors of the classwise probability estimates. It will inform you as to both how accurate the model is and how "confidently" accurate the model is.
You would not want to use the Brier score for an ordinal classification problem where, for example, missing class 1 by predicting class 2 is better than predicting class 3: the Brier score weights all misses equally.
Cross entropy (log loss) will, basically, measure the relative uncertainty between classes your model produces relative to the true classes. Over the past decade or so, it's become one of the very standard model scoring statistics for multiclass (and binary) classification problems.
Mean absolute error:
The mean absolute error (MAE)
is the simplest regression error metric to understand. We’ll calculate the
residual for every data point, taking only the absolute value of each so that
negative and positive residuals do not cancel out. We then take the average of
all these residuals. Effectively, MAE describes the typical magnitude
of the residual.
Mean absolute percentage error
The mean absolute percentage error (MAPE) is the percentage equivalent of MAE. The equation is similar to that of MAE, with adjustments to convert everything into percentages.
Mean percentage error
The
mean percentage error (MPE) equation is exactly like that of MAPE. The only
difference is that it lacks the absolute value operation.
MPE is useful to us because it allows us to see if our model
systematically underestimates (more
negative error) or overestimates (positive
error).
Mean square error:
The mean square error (MSE)
is just like the MAE, but squares the
difference before summing them all instead of using the absolute value.
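A minimal sketch of these error metrics on made-up values (MAPE and MPE are computed by hand, since the sign convention for MPE matters; errors here are predicted minus actual, so a positive MPE means systematic overestimation):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 180.0, 260.0])

errors = y_pred - y_true                                   # positive = overestimate
print("MAE :", mean_absolute_error(y_true, y_pred))        # mean |error|
print("MSE :", mean_squared_error(y_true, y_pred))         # mean error^2
print("MAPE:", np.mean(np.abs(errors) / y_true) * 100)     # percent
print("MPE :", np.mean(errors / y_true) * 100)             # signed percent (bias direction)
```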
The table below gives a quick summary of the acronyms and their basic characteristics.

| Acronym | Full Name | Residual Operation | Robust to Outliers? |
|---|---|---|---|
| MAE | Mean Absolute Error | Absolute value | Yes |
| MSE | Mean Squared Error | Square | No |
| RMSE | Root Mean Squared Error | Square | No |
| MAPE | Mean Absolute Percentage Error | Absolute value | Yes |
| MPE | Mean Percentage Error | N/A | Yes |
Root Mean Squared Error (RMSE)
RMSE is the most popular evaluation metric used in regression problems. It assumes that the errors are unbiased and follow a normal distribution. Key points to consider about RMSE:
- Taking the square root brings the metric back to the scale of the target variable while still emphasizing large deviations.
- The squared nature of the metric prevents positive and negative error values from cancelling out, so it reflects the plausible magnitude of the error term.
- It avoids the use of absolute error values, which is undesirable in many mathematical calculations.
- With more samples, reconstructing the error distribution using RMSE is considered more reliable.
- RMSE is highly affected by outlier values, so make sure outliers are removed from the data set before using this metric.
- Compared with the mean absolute error, RMSE gives higher weight to, and punishes, large errors.
The RMSE metric is given by
RMSE = sqrt((1 / N) × Σ (predicted − actual)²)
where N is the total number of observations.
Max-Error:
While RMSE is the most common metric, it can be hard to interpret. One alternative is to look at quantiles of the distribution of the absolute percentage errors. The max-error metric is the worst-case error between the predicted value and the true value.
Root Mean Squared Logarithmic Error
RMSLE measures the ratio between the actual and predicted values (it operates on the logarithms of the values). The RMSLE is used when we do not want to penalize huge differences between the predicted and the actual values when both of them are huge numbers.
- If both the predicted and actual values are small: RMSE and RMSLE are close to each other.
- If either the predicted or the actual value is big: RMSE > RMSLE.
- If both the predicted and actual values are big: RMSE > RMSLE (RMSLE becomes almost negligible).
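A minimal scikit-learn sketch contrasting RMSE, RMSLE and max error on made-up values where one target is large:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error, max_error

y_true = np.array([60.0, 80.0, 90.0, 750.0])
y_pred = np.array([67.0, 78.0, 91.0, 1005.0])

rmse  = np.sqrt(mean_squared_error(y_true, y_pred))
rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))  # uses log(1 + value) internally

print("RMSE     :", rmse)        # dominated by the 750 vs 1005 miss
print("RMSLE    :", rmsle)       # works on ratios, so the large pair matters far less
print("Max error:", max_error(y_true, y_pred))
```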
R-Squared:
We learned that when the RMSE decreases, the model's performance improves, but these values alone are not intuitive.
In the case of a classification problem, if the model has an accuracy of 0.8, we can gauge how good it is against a random model, which has an accuracy of 0.5; the random model serves as a benchmark. When we talk about the RMSE metric, however, we do not have a benchmark to compare against.
This is where the R-Squared metric comes in. The formula for R-Squared is as follows:
R² = 1 − MSE(model) / MSE(baseline)
where
- MSE(model) is the mean squared error of the predictions against the actual values,
- MSE(baseline) is the mean squared error of the mean prediction against the actual values.
In other words, R-Squared tells how good our regression model is compared with a very simple model that just predicts the mean value of the target from the train set for every example.
Adjusted R-Squared
A model performing equal to the baseline would give an R-Squared of 0; the better the model, the higher the value, and the best model, with all predictions correct, would give an R-Squared of 1. However, on adding new features to the model, the R-Squared value either increases or remains the same: R-Squared does not penalize features that add no value to the model. An improved version of R-Squared is therefore the adjusted R-Squared, given by
Adjusted R² = 1 − [(1 − R²) × (n − 1) / (n − (k + 1))]
where
k: number of features
n: number of samples
As you can see, this metric takes the number of features into account. When we add more features, the denominator term n − (k + 1) decreases, so the factor (n − 1) / (n − (k + 1)) increases. If R-Squared does not increase, the added feature is not valuable to the model; we then subtract a larger quantity from 1, and the adjusted R² in turn decreases.
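A minimal sketch, assuming scikit-learn for R-Squared and computing the adjusted version by hand (the feature count k here is purely illustrative):

```python
from sklearn.metrics import r2_score

y_true = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]
y_pred = [2.8, 5.3, 7.1, 8.6, 11.4, 12.9]

r2 = r2_score(y_true, y_pred)

n, k = len(y_true), 2          # n samples, k features (k is an assumption for the example)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - (k + 1))

print("R-squared         :", round(r2, 4))
print("Adjusted R-squared:", round(adj_r2, 4))
```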
Figure Credits:
Figure-1: blog.minitab.com
Figure-3: wiki.awf.forst.uni-goettingen.de
Figure-4: sciencewitheberhart.weebly.com
Figure-5: datascience.stackexchange.com
Figure-6: www.medcalc.org
Figure-7: www.datasciencecentral.com
Figure-8: machinelearning-blog.com
Figure-9: stackoverflow.com
Figure-10: www.differencebetween.net
Figure-11: www.pinclipart.com
Figure-12: math.stackexchange.com
Figure-13: stats.stackexchange.com
Figure-14: ml-cheatsheet.readthedocs.io
Figure-15: fa.bianp.net
Figure-16: www.scisports.com