Sensitivity, specificity, type I error, type II error, and?
Machine learning and statistics often confuse people, not only because of the algorithms but also because of how the results are evaluated. The confusion matrix and the contingency table, to name but a few, are visualization methods that make the meaning of the evaluation parameters easier to understand. You probably hear about these parameters every day, for example, when people talk about the "sensitivity" and the "specificity" of a COVID-19 PCR test.
In this post, I will put together the seemingly similar concepts of evaluation metrics from both statistics and machine learning. The parameters covered here are chosen based on my own learning experience.
Hypothesis testing in statistics versus prediction performance in machine learning
Hypothesis testing
In statistics, a 2x2 table is used to summarize the statistical results against the truth. A brief note on formulating a hypothesis: let's say we are testing two hypotheses to see which one is closer to the truth. In statistics, if you fail to reject a hypothesis, it does not mean you "accept" it as true; it simply means you cannot rule out the possibility that it is true, and there are other possible hypotheses out there, including the alternative hypothesis. In this case, we do not reach a conclusion. On the other hand, if you "reject" a hypothesis, then the other hypothesis you are testing cannot be rejected, and a conclusion is drawn on that premise. Therefore, we would rather "reject" a hypothesis than "fail to reject" one; in practice, we set things up so that our anticipated outcome survives by rejecting the unwanted outcome. For example, when testing whether drug A and drug B exert different effects on patients, as a researcher I want to see a difference; hence, the null hypothesis (H0) could be that there is no difference between the two drugs, and the alternative hypothesis (H1) could be that there is a difference. Rejecting H0 renders the conclusion that we cannot reject H1, and therefore that there is a statistically significant difference between the effects of the two drugs.
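To make this concrete, here is a minimal sketch of the drug A versus drug B example as a two-sample t-test in Python. The measurements, the significance level, and the choice of scipy are my own illustrative assumptions, not taken from any particular study.

```python
# A minimal sketch of the drug A vs. drug B example using a two-sample t-test.
# The measurements below are made-up numbers, purely for illustration.
from scipy import stats

drug_a = [5.1, 4.8, 5.5, 5.0, 4.9, 5.2]   # hypothetical responses under drug A
drug_b = [5.9, 6.1, 5.7, 6.3, 5.8, 6.0]   # hypothetical responses under drug B

# H0: the mean effects are equal; H1: they differ (two-sided test)
t_stat, p_value = stats.ttest_ind(drug_a, drug_b)

alpha = 0.05  # pre-specified significance level (probability of a type I error)
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0 (a difference is supported)")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0 (no conclusion drawn)")
```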
Machine learning
In machine learning, we want to see how well the model's predictions align with the truth. Hence, a contingency table of "truth" versus "prediction" is generated, and every sample falls into one of four possible outcomes: TP (true positive), TN (true negative), FP (false positive), or FN (false negative). For the explanation of these outcomes and the performance parameters derived from them, please see Figure 2.
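As a small illustration (not tied to any real dataset), here is how the four counts can be obtained from a "truth" column and a "prediction" column; the labels are made up, and scikit-learn's confusion_matrix is just one convenient choice.

```python
# A small sketch of counting the four outcomes (TP, TN, FP, FN) from
# a "truth" column and a "prediction" column; the labels below are made up.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # ground-truth labels (1 = positive, 0 = negative)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```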
Now, you may also hear a lot about "precision" and "recall". What are they? When doing a supervised machine learning project, you first need to label the samples with outcomes, both true and predicted; let's use + (positive) and - (negative) as an example. Figure 3 illustrates the definitions of precision and recall, and of their harmonic mean, the F1-score.
Precision = PPV = TP / (TP + FP)
Recall = sensitivity = TPR = TP / (TP + FN)
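Here is a quick sketch of these formulas in code, using hypothetical counts purely to show how precision, recall, and the F1-score are computed:

```python
# A minimal sketch computing precision, recall, and F1 from the four counts;
# the counts here are hypothetical, just to show the formulas.
tp, fp, fn, tn = 40, 10, 5, 45

precision = tp / (tp + fp)   # PPV: of everything predicted positive, how much is truly positive
recall = tp / (tp + fn)      # sensitivity / TPR: of all true positives, how many are found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of precision and recall

print(f"precision={precision:.3f}, recall={recall:.3f}, F1={f1:.3f}")
```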
You may now realize that precision and recall depend on the model's FP and FN, and hence there is always a "tradeoff" between them. But what is the "threshold" that determines the best combination of precision and recall? What does the threshold mean?
Again, in a machine learning project, if a neural network is used, the classification outcome for a binary task is generated from the probabilities expressing the model's confidence in the two classes; for example, the model may predict 0 with 80% confidence and 1 with 20% confidence, and hence the resulting prediction will be 0. I will not go into detail about how the code converts probabilities into classes; the point is that the probability cut-off that decides where a prediction flips from 0 to 1, that is, the threshold, can also be manipulated in code. So next time you read a machine learning article stating that the decision threshold was chosen based on performance metrics, you will understand that the threshold is the boundary, set by the researchers, that distinguishes a prediction of 0 from a prediction of 1.
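Here is a minimal sketch of what that thresholding step can look like; the probabilities and threshold values are hypothetical, just to show how moving the cut-off changes the predicted classes:

```python
# A sketch of how a decision threshold turns predicted probabilities into class labels.
# The probabilities are hypothetical model outputs for the positive class (class 1).
probs = [0.20, 0.85, 0.45, 0.55, 0.70, 0.30]

threshold = 0.5   # a common default cut-off
preds_default = [1 if p >= threshold else 0 for p in probs]

threshold = 0.7   # a stricter cut-off: fewer FP, but potentially more FN
preds_strict = [1 if p >= threshold else 0 for p in probs]

print(preds_default)  # [0, 1, 0, 1, 1, 0]
print(preds_strict)   # [0, 1, 0, 0, 1, 0]
```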
Other parameters
I have also come across a few other machine learning performance parameters. These parameters will be updated here.
- Matthews correlation coefficient (MCC) (see the short sketch after this list)
(1) Value is always between -1 and 1; 0 means that the classifier is no better than a random flip of a fair coin.
A. when the classifier is perfect (FP = FN = 0), the value of MCC is 1
B. when the classifier always misclassifies (TP = TN = 0), the value of MCC is -1
(2) It is perfectly symmetric: all classes are included, and no class is more important than the other
(3) If you switch the positive and negative labels, you will still get the same value
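For reference, MCC is computed as (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). Below is a small sketch with hypothetical counts, checked against scikit-learn's matthews_corrcoef:

```python
# A minimal sketch of computing the MCC directly from the four counts
# (hypothetical numbers), alongside scikit-learn's built-in implementation.
from math import sqrt
from sklearn.metrics import matthews_corrcoef

tp, tn, fp, fn = 40, 45, 10, 5
mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
print(f"MCC = {mcc:.3f}")

# The same value from label vectors built to match those counts:
y_true = [1] * (tp + fn) + [0] * (tn + fp)
y_pred = [1] * tp + [0] * fn + [0] * tn + [1] * fp
print(f"MCC (sklearn) = {matthews_corrcoef(y_true, y_pred):.3f}")
```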
to be continued…
A screening test tuned for high sensitivity usually trades away some specificity, so it produces many FP. This is why a high-sensitivity test carries a risk when used for general (mass) testing: the large number of FP among a mostly healthy population would create unnecessary public fear.
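As a back-of-the-envelope illustration (all numbers below are hypothetical assumptions, not real COVID-19 figures), here is why mass screening can yield a large absolute number of FP even with a reasonably good test:

```python
# A back-of-the-envelope sketch (all numbers hypothetical) of why mass screening
# with an imperfect test produces many false positives when prevalence is low.
population = 1_000_000
prevalence = 0.01        # 1% of people are truly infected
sensitivity = 0.95       # P(test positive | infected)
specificity = 0.90       # P(test negative | not infected)

infected = population * prevalence
healthy = population - infected

true_positives = infected * sensitivity
false_positives = healthy * (1 - specificity)

ppv = true_positives / (true_positives + false_positives)
print(f"TP = {true_positives:.0f}, FP = {false_positives:.0f}, PPV = {ppv:.2%}")
# With these assumptions: TP = 9500, FP = 99000, PPV is about 8.76%
```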