python - Total number of TP, TN, FP & FN do not sum up to total number of observed values

I was going through the Classification on imbalanced data tutorial by TensorFlow, which uses Kaggle's Credit Card Fraud Detection dataset. In the section where the data is split you can see that the number of training examples is 182276 and the number of validation samples is 45569. To evaluate the baseline model they use Keras's built-in metrics: TruePositives, FalsePositives, TrueNegatives and FalseNegatives.

However, if you look at the training logs in the Train the model section, you can see that the sum TP + FP + TN + FN is not equal to the number of training examples. Nor does the sum for the validation data equal the number of validation examples.

Part 1

EPOCH 1

TP = 64
FP = 25
TN = 139431.9780
FN = 188.3956
TP+FP+TN+FN = 139709.3736

The above sum is nowhere close to 182276. The same is true for all subsequent epochs. Why is this the case?

Part 2

As the number of epochs increases, the total sum decreases further. For example, compare the values for epochs 1 and 2.

EPOCH 2

TP = 25
FP = 5.67
TN = 93973.1538
FN = 136.2967
TP+FP+TN+FN = 94135.1205

The total sum has now dropped by a further 45574. The same is true for the later epochs.

  1. Shouldn't the total sum be the same?
  2. If not then why does it keep on decreasing?

Part 3

Why are the values of TP, FP, FN and TN floating-point numbers, for both training and validation? As per my understanding these should always be integers. As per the explanation in the Understanding useful metrics section, these values represent counts and should therefore be integers.



1 Answer


This is not an exact answer, but I want to share my observations about this. I downloaded the data from Kaggle and ran the same example 3 times. Here is how I used the metrics:

from tensorflow import keras  # import needed for the metrics below

metrics = [
    keras.metrics.FalseNegatives(name="fn"),
    keras.metrics.FalsePositives(name="fp"),
    keras.metrics.TrueNegatives(name="tn"),
    keras.metrics.TruePositives(name="tp"),
    keras.metrics.Precision(name="precision"),
    keras.metrics.Recall(name="recall"),
]

I also tried tf.keras.metrics, but both gave the same results, so I stuck with the one in the tutorial.
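
For reference, this is roughly how I pass that metrics list to a model. The architecture below is just a placeholder of my own (a small dense network on 30 input features), not necessarily what the tutorial builds:

from tensorflow import keras

# Placeholder model; my data has 30 features, the tutorial's has 29.
model = keras.Sequential([
    keras.Input(shape=(30,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])

# `metrics` is the list defined above; passing it to compile() makes each
# epoch's log line report fn/fp/tn/tp alongside precision and recall.
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=metrics)

# model.fit(train_features, train_labels, epochs=30) would then produce logs like the ones below.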

Since a sum of fractional counts doesn't make sense, I decided to try it myself with the same data. For training I have 227846 samples.

Part 1)

Epoch 2/30
fn: 40.0000 - fp: 5700.0000 - tn: 221729.0000 - tp: 377.0000 - precision: 0.0620 - recall: 0.9041

Summing them up: 40 + 5700 + 221729 + 377 = 227846.

Part 2)

Epoch 25/30
fn: 9.0000 - fp: 3501.0000 - tn: 223928.0000 - tp: 408.0000 - precision: 0.1044 - recall: 0.9784

Summing them all: 3501 + 223928 + 408 + 9 = 227846.

The same thing applies for validation too (both Part 1 and Part 2).
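
To convince myself this isn't a coincidence, here is a small sanity check on synthetic data (my own sketch, not code from the tutorial): after a full pass over the labels and predictions, the four counts should add up to the number of samples.

import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000).astype("float32")  # fake labels
y_pred = rng.random(1000).astype("float32")               # fake predicted probabilities

counters = [
    keras.metrics.TruePositives(),
    keras.metrics.FalsePositives(),
    keras.metrics.TrueNegatives(),
    keras.metrics.FalseNegatives(),
]
for m in counters:
    m.update_state(y_true, y_pred)  # default classification threshold is 0.5

total = sum(float(m.result()) for m in counters)
print(total)  # 1000.0 -- always equal to the number of samples seen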

Part 3) I also expected them to be without fractions, i.e. values like 9.0000 rather than 9.2480.
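
One small observation on the formatting (my own check, outside the tutorial): these metrics keep their running counts in float32 variables, so result() prints with a decimal point even when the value is a whole number.

from tensorflow import keras

tp = keras.metrics.TruePositives()
tp.update_state([0, 1, 1, 1], [0.1, 0.9, 0.8, 0.2])
print(tp.result())  # tf.Tensor(2.0, shape=(), dtype=float32) -- a count, but stored as a float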

Here is an original example from the Keras docs: [screenshot of training-log output from the Keras docs example, showing two epochs with whole-number fn/fp/tn/tp values]

The sums for both of those epochs are equal to each other. Since I was not sure, I suspected there might be something wrong with the data/processing in the tutorial, so I decided to download the data from the link given in the tutorial:

Epoch 8/10
fn: 22.0000 - fp: 5191.0000 - tn: 222238.0000 - tp: 395.0000 

I applied the same process to both datasets, to see whether there is something fishy about the data pre-processing in the tutorial. My training set has shape (227846, 30); in the tutorial it is (182276, 29). The 182276 simply comes from a different split ratio, but the 29 features is the real difference.
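
One guess about where the 29 vs 30 difference could come from (an assumption on my part, I have not verified it against the tutorial's preprocessing): the raw Kaggle CSV has 31 columns (Time, V1..V28, Amount, Class), so keeping everything except Class gives 30 features, while additionally dropping Time gives 29.

import pandas as pd

raw_df = pd.read_csv("creditcard.csv")  # path to the Kaggle file is assumed
print(raw_df.shape)                     # (284807, 31)

labels = raw_df.pop("Class")            # remove the label column
print(raw_df.shape)                     # (284807, 30), matching my (227846, 30) after the split

raw_df.pop("Time")                      # dropping Time as well would leave 29 features,
print(raw_df.shape)                     # (284807, 29), matching the tutorial's (182276, 29)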

I conclude there might be something about the data processing in the tutorial that I cannot see; the more I look at it, the blinder I become. I am unable to tell whether something in the processing is actually wrong.

So these are my results. I also looked into the metrics themselves, and all of them were integers.

