## PyCM

PyCM is a multi-class confusion matrix library written in Python that supports both input data vectors and direct matrix, and a proper tool for post-classification model evaluation that supports most classes and overall statistics parameters. PyCM is the swiss-army knife of confusion matrices, targeted mainly at data scientists that need a broad array of metrics for predictive models and accurate evaluation of a large variety of classifiers.

Fig1. ConfusionMatrix Block Diagram

Open Hub | |

PyPI Counter | |

Github Stars |

Branch | master | dev |

CI |

Code Quality |

## Installation

⚠️ PyCM 2.4 is the last version to support **Python 2.7** & **Python 3.4**

⚠️ Plotting capability requires **Matplotlib (>= 3.0.0)** or **Seaborn (>= 0.9.1)**

### Source code

- Download Version 3.2 or Latest Source
- Run
`pip install -r requirements.txt`

or`pip3 install -r requirements.txt`

(Need root access) - Run
`python3 setup.py install`

or`python setup.py install`

(Need root access)

### PyPI

- Check Python Packaging User Guide
- Run
`pip install pycm==3.2`

or`pip3 install pycm==3.2`

(Need root access)

### Conda

- Check Conda Managing Package
`conda install -c sepandhaghighi pycm`

(Need root access)

### Easy install

- Run
`easy_install --upgrade pycm`

(Need root access)

### MATLAB

- Download and install MATLAB (>=8.5, 64/32 bit)
- Download and install Python3.x (>=3.5, 64/32 bit)
- [x] Select
`Add to PATH`

option - [x] Select
`Install pip`

option

- [x] Select
- Run
`pip install pycm`

or`pip3 install pycm`

(Need root access) - Configure Python interpreter

```
>> pyversion PYTHON_EXECUTABLE_FULL_PATH
```

- Visit MATLAB Examples

### Docker

- Run
`docker pull sepandhaghighi/pycm`

(Need root access) - Configuration :
- Ubuntu 16.04
- Python 3.6

## Usage

### From vector

```
>>> from pycm import *
>>> y_actu = [2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2] # or y_actu = numpy.array([2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2])
>>> y_pred = [0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2] # or y_pred = numpy.array([0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2])
>>> cm = ConfusionMatrix(actual_vector=y_actu, predict_vector=y_pred) # Create CM From Data
>>> cm.classes
[0, 1, 2]
>>> cm.table
{0: {0: 3, 1: 0, 2: 0}, 1: {0: 0, 1: 1, 2: 2}, 2: {0: 2, 1: 1, 2: 3}}
>>> print(cm)
Predict 0 1 2
Actual
0 3 0 0
1 0 1 2
2 2 1 3
Overall Statistics :
95% CI (0.30439,0.86228)
ACC Macro 0.72222
ARI 0.09206
AUNP 0.66667
AUNU 0.69444
Bangdiwala B 0.37255
Bennett S 0.375
CBA 0.47778
CSI 0.17778
Chi-Squared 6.6
Chi-Squared DF 4
Conditional Entropy 0.95915
Cramer V 0.5244
Cross Entropy 1.59352
F1 Macro 0.56515
F1 Micro 0.58333
FNR Macro 0.38889
FNR Micro 0.41667
FPR Macro 0.22222
FPR Micro 0.20833
Gwet AC1 0.38931
Hamming Loss 0.41667
Joint Entropy 2.45915
KL Divergence 0.09352
Kappa 0.35484
Kappa 95% CI (-0.07708,0.78675)
Kappa No Prevalence 0.16667
Kappa Standard Error 0.22036
Kappa Unbiased 0.34426
Krippendorff Alpha 0.37158
Lambda A 0.16667
Lambda B 0.42857
Mutual Information 0.52421
NIR 0.5
Overall ACC 0.58333
Overall CEN 0.46381
Overall J (1.225,0.40833)
Overall MCC 0.36667
Overall MCEN 0.51894
Overall RACC 0.35417
Overall RACCU 0.36458
P-Value 0.38721
PPV Macro 0.56667
PPV Micro 0.58333
Pearson C 0.59568
Phi-Squared 0.55
RCI 0.34947
RR 4.0
Reference Entropy 1.5
Response Entropy 1.48336
SOA1(Landis & Koch) Fair
SOA2(Fleiss) Poor
SOA3(Altman) Fair
SOA4(Cicchetti) Poor
SOA5(Cramer) Relatively Strong
SOA6(Matthews) Weak
Scott PI 0.34426
Standard Error 0.14232
TNR Macro 0.77778
TNR Micro 0.79167
TPR Macro 0.61111
TPR Micro 0.58333
Zero-one Loss 5
Class Statistics :
Classes 0 1 2
ACC(Accuracy) 0.83333 0.75 0.58333
AGF(Adjusted F-score) 0.9136 0.53995 0.5516
AGM(Adjusted geometric mean) 0.83729 0.692 0.60712
AM(Difference between automatic and manual classification) 2 -1 -1
AUC(Area under the ROC curve) 0.88889 0.61111 0.58333
AUCI(AUC value interpretation) Very Good Fair Poor
AUPR(Area under the PR curve) 0.8 0.41667 0.55
BCD(Bray-Curtis dissimilarity) 0.08333 0.04167 0.04167
BM(Informedness or bookmaker informedness) 0.77778 0.22222 0.16667
CEN(Confusion entropy) 0.25 0.49658 0.60442
DOR(Diagnostic odds ratio) None 4.0 2.0
DP(Discriminant power) None 0.33193 0.16597
DPI(Discriminant power interpretation) None Poor Poor
ERR(Error rate) 0.16667 0.25 0.41667
F0.5(F0.5 score) 0.65217 0.45455 0.57692
F1(F1 score - harmonic mean of precision and sensitivity) 0.75 0.4 0.54545
F2(F2 score) 0.88235 0.35714 0.51724
FDR(False discovery rate) 0.4 0.5 0.4
FN(False negative/miss/type 2 error) 0 2 3
FNR(Miss rate or false negative rate) 0.0 0.66667 0.5
FOR(False omission rate) 0.0 0.2 0.42857
FP(False positive/type 1 error/false alarm) 2 1 2
FPR(Fall-out or false positive rate) 0.22222 0.11111 0.33333
G(G-measure geometric mean of precision and sensitivity) 0.7746 0.40825 0.54772
GI(Gini index) 0.77778 0.22222 0.16667
GM(G-mean geometric mean of specificity and sensitivity) 0.88192 0.54433 0.57735
IBA(Index of balanced accuracy) 0.95062 0.13169 0.27778
ICSI(Individual classification success index) 0.6 -0.16667 0.1
IS(Information score) 1.26303 1.0 0.26303
J(Jaccard index) 0.6 0.25 0.375
LS(Lift score) 2.4 2.0 1.2
MCC(Matthews correlation coefficient) 0.68313 0.2582 0.16903
MCCI(Matthews correlation coefficient interpretation) Moderate Negligible Negligible
MCEN(Modified confusion entropy) 0.26439 0.5 0.6875
MK(Markedness) 0.6 0.3 0.17143
N(Condition negative) 9 9 6
NLR(Negative likelihood ratio) 0.0 0.75 0.75
NLRI(Negative likelihood ratio interpretation) Good Negligible Negligible
NPV(Negative predictive value) 1.0 0.8 0.57143
OC(Overlap coefficient) 1.0 0.5 0.6
OOC(Otsuka-Ochiai coefficient) 0.7746 0.40825 0.54772
OP(Optimized precision) 0.70833 0.29545 0.44048
P(Condition positive or support) 3 3 6
PLR(Positive likelihood ratio) 4.5 3.0 1.5
PLRI(Positive likelihood ratio interpretation) Poor Poor Poor
POP(Population) 12 12 12
PPV(Precision or positive predictive value) 0.6 0.5 0.6
PRE(Prevalence) 0.25 0.25 0.5
Q(Yule Q - coefficient of colligation) None 0.6 0.33333
QI(Yule Q interpretation) None Moderate Weak
RACC(Random accuracy) 0.10417 0.04167 0.20833
RACCU(Random accuracy unbiased) 0.11111 0.0434 0.21007
TN(True negative/correct rejection) 7 8 4
TNR(Specificity or true negative rate) 0.77778 0.88889 0.66667
TON(Test outcome negative) 7 10 7
TOP(Test outcome positive) 5 2 5
TP(True positive/hit) 3 1 3
TPR(Sensitivity, recall, hit rate, or true positive rate) 1.0 0.33333 0.5
Y(Youden index) 0.77778 0.22222 0.16667
dInd(Distance index) 0.22222 0.67586 0.60093
sInd(Similarity index) 0.84287 0.52209 0.57508
>>> cm.print_matrix()
Predict 0 1 2
Actual
0 3 0 0
1 0 1 2
2 2 1 3
>>> cm.print_normalized_matrix()
Predict 0 1 2
Actual
0 1.0 0.0 0.0
1 0.0 0.33333 0.66667
2 0.33333 0.16667 0.5
>>> cm.print_matrix(one_vs_all=True,class_name=0) # One-Vs-All, new in version 1.4
Predict 0 ~
Actual
0 3 0
~ 2 7
>>> cm = ConfusionMatrix(y_actu, y_pred, classes=[1,0,2]) # classes, new in version 3.2
>>> cm.print_matrix()
Predict 1 0 2
Actual
1 1 0 2
0 0 3 0
2 1 2 3
>>> cm = ConfusionMatrix(y_actu, y_pred, classes=[1,0,4]) # classes, new in version 3.2
>>> cm.print_matrix()
Predict 1 0 4
Actual
1 1 0 0
0 0 3 0
4 0 0 0
```

### Direct CM

```
>>> from pycm import *
>>> cm2 = ConfusionMatrix(matrix={"Class1": {"Class1": 1, "Class2":2}, "Class2": {"Class1": 0, "Class2": 5}}) # Create CM Directly
>>> cm2
pycm.ConfusionMatrix(classes: ['Class1', 'Class2'])
>>> print(cm2)
Predict Class1 Class2
Actual
Class1 1 2
Class2 0 5
Overall Statistics :
95% CI (0.44994,1.05006)
ACC Macro 0.75
ARI 0.17241
AUNP 0.66667
AUNU 0.66667
Bangdiwala B 0.68421
Bennett S 0.5
CBA 0.52381
CSI 0.52381
Chi-Squared 1.90476
Chi-Squared DF 1
Conditional Entropy 0.34436
Cramer V 0.48795
Cross Entropy 1.2454
F1 Macro 0.66667
F1 Micro 0.75
FNR Macro 0.33333
FNR Micro 0.25
FPR Macro 0.33333
FPR Micro 0.25
Gwet AC1 0.6
Hamming Loss 0.25
Joint Entropy 1.29879
KL Divergence 0.29097
Kappa 0.38462
Kappa 95% CI (-0.354,1.12323)
Kappa No Prevalence 0.5
Kappa Standard Error 0.37684
Kappa Unbiased 0.33333
Krippendorff Alpha 0.375
Lambda A 0.33333
Lambda B 0.0
Mutual Information 0.1992
NIR 0.625
Overall ACC 0.75
Overall CEN 0.44812
Overall J (1.04762,0.52381)
Overall MCC 0.48795
Overall MCEN 0.29904
Overall RACC 0.59375
Overall RACCU 0.625
P-Value 0.36974
PPV Macro 0.85714
PPV Micro 0.75
Pearson C 0.43853
Phi-Squared 0.2381
RCI 0.20871
RR 4.0
Reference Entropy 0.95443
Response Entropy 0.54356
SOA1(Landis & Koch) Fair
SOA2(Fleiss) Poor
SOA3(Altman) Fair
SOA4(Cicchetti) Poor
SOA5(Cramer) Relatively Strong
SOA6(Matthews) Weak
Scott PI 0.33333
Standard Error 0.15309
TNR Macro 0.66667
TNR Micro 0.75
TPR Macro 0.66667
TPR Micro 0.75
Zero-one Loss 2
Class Statistics :
Classes Class1 Class2
ACC(Accuracy) 0.75 0.75
AGF(Adjusted F-score) 0.53979 0.81325
AGM(Adjusted geometric mean) 0.73991 0.5108
AM(Difference between automatic and manual classification) -2 2
AUC(Area under the ROC curve) 0.66667 0.66667
AUCI(AUC value interpretation) Fair Fair
AUPR(Area under the PR curve) 0.66667 0.85714
BCD(Bray-Curtis dissimilarity) 0.125 0.125
BM(Informedness or bookmaker informedness) 0.33333 0.33333
CEN(Confusion entropy) 0.5 0.43083
DOR(Diagnostic odds ratio) None None
DP(Discriminant power) None None
DPI(Discriminant power interpretation) None None
ERR(Error rate) 0.25 0.25
F0.5(F0.5 score) 0.71429 0.75758
F1(F1 score - harmonic mean of precision and sensitivity) 0.5 0.83333
F2(F2 score) 0.38462 0.92593
FDR(False discovery rate) 0.0 0.28571
FN(False negative/miss/type 2 error) 2 0
FNR(Miss rate or false negative rate) 0.66667 0.0
FOR(False omission rate) 0.28571 0.0
FP(False positive/type 1 error/false alarm) 0 2
FPR(Fall-out or false positive rate) 0.0 0.66667
G(G-measure geometric mean of precision and sensitivity) 0.57735 0.84515
GI(Gini index) 0.33333 0.33333
GM(G-mean geometric mean of specificity and sensitivity) 0.57735 0.57735
IBA(Index of balanced accuracy) 0.11111 0.55556
ICSI(Individual classification success index) 0.33333 0.71429
IS(Information score) 1.41504 0.19265
J(Jaccard index) 0.33333 0.71429
LS(Lift score) 2.66667 1.14286
MCC(Matthews correlation coefficient) 0.48795 0.48795
MCCI(Matthews correlation coefficient interpretation) Weak Weak
MCEN(Modified confusion entropy) 0.38998 0.51639
MK(Markedness) 0.71429 0.71429
N(Condition negative) 5 3
NLR(Negative likelihood ratio) 0.66667 0.0
NLRI(Negative likelihood ratio interpretation) Negligible Good
NPV(Negative predictive value) 0.71429 1.0
OC(Overlap coefficient) 1.0 1.0
OOC(Otsuka-Ochiai coefficient) 0.57735 0.84515
OP(Optimized precision) 0.25 0.25
P(Condition positive or support) 3 5
PLR(Positive likelihood ratio) None 1.5
PLRI(Positive likelihood ratio interpretation) None Poor
POP(Population) 8 8
PPV(Precision or positive predictive value) 1.0 0.71429
PRE(Prevalence) 0.375 0.625
Q(Yule Q - coefficient of colligation) None None
QI(Yule Q interpretation) None None
RACC(Random accuracy) 0.04688 0.54688
RACCU(Random accuracy unbiased) 0.0625 0.5625
TN(True negative/correct rejection) 5 1
TNR(Specificity or true negative rate) 1.0 0.33333
TON(Test outcome negative) 7 1
TOP(Test outcome positive) 1 7
TP(True positive/hit) 1 5
TPR(Sensitivity, recall, hit rate, or true positive rate) 0.33333 1.0
Y(Youden index) 0.33333 0.33333
dInd(Distance index) 0.66667 0.66667
sInd(Similarity index) 0.5286 0.5286
>>> cm2.stat(summary=True)
Overall Statistics :
ACC Macro 0.75
F1 Macro 0.66667
FPR Macro 0.33333
Kappa 0.38462
Overall ACC 0.75
PPV Macro 0.85714
SOA1(Landis & Koch) Fair
TPR Macro 0.66667
Zero-one Loss 2
Class Statistics :
Classes Class1 Class2
ACC(Accuracy) 0.75 0.75
AUC(Area under the ROC curve) 0.66667 0.66667
AUCI(AUC value interpretation) Fair Fair
F1(F1 score - harmonic mean of precision and sensitivity) 0.5 0.83333
FN(False negative/miss/type 2 error) 2 0
FP(False positive/type 1 error/false alarm) 0 2
FPR(Fall-out or false positive rate) 0.0 0.66667
N(Condition negative) 5 3
P(Condition positive or support) 3 5
POP(Population) 8 8
PPV(Precision or positive predictive value) 1.0 0.71429
TN(True negative/correct rejection) 5 1
TON(Test outcome negative) 7 1
TOP(Test outcome positive) 1 7
TP(True positive/hit) 1 5
TPR(Sensitivity, recall, hit rate, or true positive rate) 0.33333 1.0
>>> cm3 = ConfusionMatrix(matrix={"Class1": {"Class1": 1, "Class2":0}, "Class2": {"Class1": 2, "Class2": 5}},transpose=True) # Transpose Matrix
>>> cm3.print_matrix()
Predict Class1 Class2
Actual
Class1 1 2
Class2 0 5
```

`matrix()`

and`normalized_matrix()`

renamed to`print_matrix()`

and`print_normalized_matrix()`

in`version 1.5`

### Activation threshold

`threshold`

is added in `version 0.9`

for real value prediction.

For more information visit Example3

### Load from file

`file`

is added in `version 0.9.5`

in order to load saved confusion matrix with `.obj`

format generated by `save_obj`

method.

For more information visit Example4

### Sample weights

`sample_weight`

is added in `version 1.2`

For more information visit Example5

### Transpose

`transpose`

is added in `version 1.2`

in order to transpose input matrix (only in `Direct CM`

mode)

### Relabel

`relabel`

method is added in `version 1.5`

in order to change ConfusionMatrix classnames.

```
>>> cm.relabel(mapping={0:"L1",1:"L2",2:"L3"})
>>> cm
pycm.ConfusionMatrix(classes: ['L1', 'L2', 'L3'])
```

### Position

`position`

method is added in `version 2.8`

in order to find the indexes of observations in `predict_vector`

which made TP, TN, FP, FN.

```
>>> cm.position()
{0: {'FN': [], 'FP': [0, 7], 'TP': [1, 4, 9], 'TN': [2, 3, 5, 6, 8, 10, 11]}, 1: {'FN': [5, 10], 'FP': [3], 'TP': [6], 'TN': [0, 1, 2, 4, 7, 8, 9, 11]}, 2: {'FN': [0, 3, 7], 'FP': [5, 10], 'TP': [2, 8, 11], 'TN': [1, 4, 6, 9]}}
```

### To array

`to_array`

method is added in `version 2.9`

in order to returns the confusion matrix in the form of a NumPy array. This can be helpful to apply different operations over the confusion matrix for different purposes such as aggregation, normalization, and combination.

```
>>> cm.to_array()
array([[3, 0, 0],
[0, 1, 2],
[2, 1, 3]])
>>> cm.to_array(normalized=True)
array([[1. , 0. , 0. ],
[0. , 0.33333, 0.66667],
[0.33333, 0.16667, 0.5 ]])
>>> cm.to_array(normalized=True,one_vs_all=True, class_name="L1")
array([[1. , 0. ],
[0.22222, 0.77778]])
```

### Combine

`combine`

method is added in `version 3.0`

in order to merge two confusion matrices. This option will be useful in mini-batch learning.

```
>>> cm_combined = cm2.combine(cm3)
>>> cm_combined.print_matrix()
Predict Class1 Class2
Actual
Class1 2 4
Class2 0 10
```

### Plot

`plot`

method is added in `version 3.0`

in order to plot a confusion matrix using Matplotlib or Seaborn.

```
>>> cm.plot()
```

```
>>> from matplotlib import pyplot as plt
>>> cm.plot(cmap=plt.cm.Greens,number_label=True,plot_lib="matplotlib")
```

```
>>> cm.plot(cmap=plt.cm.Reds,normalized=True,number_label=True,plot_lib="seaborn")
```

### Online help

`online_help`

function is added in `version 1.1`

in order to open each statistics definition in web browser

```
>>> from pycm import online_help
>>> online_help("J")
>>> online_help("SOA1(Landis & Koch)")
>>> online_help(2)
```

- List of items are available by calling
`online_help()`

(without argument) - If PyCM website is not available, set
`alt_link = True`

(new in`version 2.4`

)

### Parameter recommender

This option has been added in `version 1.9`

to recommend the most related parameters considering the characteristics of the input dataset.

The suggested parameters are selected according to some characteristics of the input such as being balance/imbalance and binary/multi-class.

All suggestions can be categorized into three main groups: imbalanced dataset, binary classification for a balanced dataset, and multi-class classification for a balanced dataset.

The recommendation lists have been gathered according to the respective paper of each parameter and the capabilities which had been claimed by the paper.

```
>>> cm.imbalance
False
>>> cm.binary
False
>>> cm.recommended_list
['MCC', 'TPR Micro', 'ACC', 'PPV Macro', 'BCD', 'Overall MCC', 'Hamming Loss', 'TPR Macro', 'Zero-one Loss', 'ERR', 'PPV Micro', 'Overall ACC']
```

### Compare

In `version 2.0`

, a method for comparing several confusion matrices is introduced. This option is a combination of several overall and class-based benchmarks. Each of the benchmarks evaluates the performance of the classification algorithm from good to poor and give them a numeric score. The score of good and poor performances are 1 and 0, respectively.

After that, two scores are calculated for each confusion matrices, overall and class-based. The overall score is the average of the score of six overall benchmarks which are Landis & Koch, Fleiss, Altman, Cicchetti, Cramer, and Matthews. In the same manner, the class-based score is the average of the score of six class-based benchmarks which are Positive Likelihood Ratio Interpretation, Negative Likelihood Ratio Interpretation, Discriminant Power Interpretation, AUC value Interpretation, Matthews Correlation Coefficient Interpretation and Yule's Q Interpretation. It should be noticed that if one of the benchmarks returns none for one of the classes, that benchmarks will be eliminated in total averaging. If the user sets weights for the classes, the averaging over the value of class-based benchmark scores will transform to a weighted average.

If the user sets the value of `by_class`

boolean input `True`

, the best confusion matrix is the one with the maximum class-based score. Otherwise, if a confusion matrix obtains the maximum of both overall and class-based scores, that will be reported as the best confusion matrix, but in any other case, the compared object doesn’t select the best confusion matrix.

```
>>> cm2 = ConfusionMatrix(matrix={0:{0:2,1:50,2:6},1:{0:5,1:50,2:3},2:{0:1,1:7,2:50}})
>>> cm3 = ConfusionMatrix(matrix={0:{0:50,1:2,2:6},1:{0:50,1:5,2:3},2:{0:1,1:55,2:2}})
>>> cp = Compare({"cm2":cm2,"cm3":cm3})
>>> print(cp)
Best : cm2
Rank Name Class-Score Overall-Score
1 cm2 9.05 2.55
2 cm3 6.05 1.98333
>>> cp.best
pycm.ConfusionMatrix(classes: [0, 1, 2])
>>> cp.sorted
['cm2', 'cm3']
>>> cp.best_name
'cm2'
```

### Acceptable data types

#### ConfusionMatrix

`actual_vector`

: python`list`

or numpy`array`

of any stringable objects`predict_vector`

: python`list`

or numpy`array`

of any stringable objects`matrix`

:`dict`

`digit`

:`int`

`threshold`

:`FunctionType (function or lambda)`

`file`

:`File object`

`sample_weight`

: python`list`

or numpy`array`

of numbers`transpose`

:`bool`

`classes`

: python`list`

- Run
`help(ConfusionMatrix)`

for`ConfusionMatrix`

object details

#### Compare

`cm_dict`

: python`dict`

of`ConfusionMatrix`

object (`str`

:`ConfusionMatrix`

)`by_class`

:`bool`

`weight`

: python`dict`

of class weights (`class_name`

:`float`

)`digit`

:`int`

- Run
`help(Compare)`

for`Compare`

object details

For more information visit here

## Try PyCM in your browser!

PyCM can be used online in interactive Jupyter Notebooks via the Binder service! Try it out now! :

- Check
`Examples`

in`Document`

folder

## Issues & bug reports

Just fill an issue and describe it. We'll check it ASAP!

or send an email to [email protected].

- Please complete the issue template

## Outputs

## Dependencies

master | dev |

## References

1- J. R. Landis and G. G. Koch, "The measurement of observer agreement for categorical data,"biometrics, pp. 159-174, 1977.

2- D. M. Powers, "Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation,"arXiv preprint arXiv:2010.16061, 2020.

3- C. Sammut and G. I. Webb,Encyclopedia of machine learning. Springer Science & Business Media, 2011.

4- J. L. Fleiss, "Measuring nominal scale agreement among many raters,"Psychological bulletin, vol. 76, no. 5, p. 378, 1971.

5- D. G. Altman,Practical statistics for medical research. CRC press, 1990.

6- K. L. Gwet, "Computing inter-rater reliability and its variance in the presence of high agreement,"British Journal of Mathematical and Statistical Psychology, vol. 61, no. 1, pp. 29-48, 2008.

7- W. A. Scott, "Reliability of content analysis: The case of nominal scale coding,"Public opinion quarterly, pp. 321-325, 1955.

8- E. M. Bennett, R. Alpert, and A. Goldstein, "Communications through limited-response questioning,"Public Opinion Quarterly, vol. 18, no. 3, pp. 303-308, 1954.

9- D. V. Cicchetti, "Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology,"Psychological assessment, vol. 6, no. 4, p. 284, 1994.

10- R. B. Davies, "Algorithm AS 155: The distribution of a linear combination of χ2 random variables,"Applied Statistics, pp. 323-333, 1980.

11- S. Kullback and R. A. Leibler, "On information and sufficiency,"The annals of mathematical statistics, vol. 22, no. 1, pp. 79-86, 1951.

12- L. A. Goodman and W. H. Kruskal, "Measures of association for cross classifications, IV: Simplification of asymptotic variances,"Journal of the American Statistical Association, vol. 67, no. 338, pp. 415-421, 1972.

13- L. A. Goodman and W. H. Kruskal, "Measures of association for cross classifications III: Approximate sampling theory,"Journal of the American Statistical Association, vol. 58, no. 302, pp. 310-364, 1963.

14- T. Byrt, J. Bishop, and J. B. Carlin, "Bias, prevalence and kappa,"Journal of clinical epidemiology, vol. 46, no. 5, pp. 423-429, 1993.

15- M. Shepperd, D. Bowes, and T. Hall, "Researcher bias: The use of machine learning in software defect prediction,"IEEE Transactions on Software Engineering, vol. 40, no. 6, pp. 603-616, 2014.

16- X. Deng, Q. Liu, Y. Deng, and S. Mahadevan, "An improved method to construct basic probability assignment based on the confusion matrix for classification problem,"Information Sciences, vol. 340, pp. 250-261, 2016.

17- J.-M. Wei, X.-J. Yuan, Q.-H. Hu, and S.-Q. Wang, "A novel measure for evaluating classifiers,"Expert Systems with Applications, vol. 37, no. 5, pp. 3799-3809, 2010.

18- I. Kononenko and I. Bratko, "Information-based evaluation criterion for classifier's performance,"Machine learning, vol. 6, no. 1, pp. 67-80, 1991.

19- R. Delgado and J. D. Núnez-González, "Enhancing confusion entropy as measure for evaluating classifiers," inThe 13th International Conference on Soft Computing Models in Industrial and Environmental Applications, 2018: Springer, pp. 79-89.

20- J. Gorodkin, "Comparing two K-category assignments by a K-category correlation coefficient,"Computational biology and chemistry, vol.28, no. 5-6, pp. 367-374, 2004.

21- C. O. Freitas, J. M. De Carvalho, J. Oliveira, S. B. Aires, and R. Sabourin, "Confusion matrix disagreement for multiple classifiers," inIberoamerican Congress on Pattern Recognition, 2007: Springer, pp. 387-396.

22- P. Branco, L. Torgo, and R. P. Ribeiro, "Relevance-based evaluation metrics for multi-class imbalanced domains," inPacific-Asia Conference on Knowledge Discovery and Data Mining, 2017: Springer, pp. 698-710.

23- D. Ballabio, F. Grisoni, and R. Todeschini, "Multivariate comparison of classification performance measures,"Chemometrics and Intelligent Laboratory Systems, vol. 174, pp. 33-44, 2018.

24- J. Cohen, "A coefficient of agreement for nominal scales,"Educational and psychological measurement, vol. 20, no. 1, pp. 37-46, 1960.

25- S. Siegel, "Nonparametric statistics for the behavioral sciences," 1956.

26- H. Cramér,Mathematical methods of statistics. Princeton university press, 1999.

27- B. W. Matthews, "Comparison of the predicted and observed secondary structure of T4 phage lysozyme,"Biochimica et Biophysica Acta (BBA)-Protein Structure, vol. 405, no. 2, pp. 442-451, 1975.

28- J. A. Swets, "The relative operating characteristic in psychology: a technique for isolating effects of response bias finds wide use in the study of perception and cognition,"Science, vol. 182, no. 4116, pp. 990-1000, 1973.

29- P. Jaccard, "Étude comparative de la distribution florale dans une portion des Alpes et des Jura,"Bull Soc Vaudoise Sci Nat, vol. 37, pp. 547-579, 1901.

30- T. M. Cover and J. A. Thomas,Elements of Information Theory. John Wiley & Sons, 2012.

31- E. S. Keeping,Introduction to statistical inference. Courier Corporation, 1995.

32- V. Sindhwani, P. Bhattacharya, and S. Rakshit, "Information theoretic feature crediting in multiclass support vector machines," inProceedings of the 2001 SIAM International Conference on Data Mining, 2001: SIAM, pp. 1-18.

33- M. Bekkar, H. K. Djemaa, and T. A. Alitouche, "Evaluation measures for models assessment over imbalanced data sets,"J Inf Eng Appl, vol. 3, no. 10, 2013.

34- W. J. Youden, "Index for rating diagnostic tests,"Cancer, vol. 3, no. 1, pp. 32-35, 1950.

35- S. Brin, R. Motwani, J. D. Ullman, and S. Tsur, "Dynamic itemset counting and implication rules for market basket data," inProceedings of the 1997 ACM SIGMOD international conference on Management of data, 1997, pp. 255-264.

36- S. Raschka, "MLxtend: Providing machine learning and data science utilities and extensions to Python's scientific computing stack,"Journal of open source software, vol. 3, no. 24, p. 638, 2018.

37- J. R. Bray and J. T. Curtis, "An ordination of the upland forest communities of southern Wisconsin," Ecological monographs, vol. 27, no. 4, pp. 325-349, 1957.

38- J. L. Fleiss, J. Cohen, and B. S. Everitt, "Large sample standard errors of kappa and weighted kappa,"Psychological bulletin, vol. 72, no. 5, p. 323, 1969.

39- M. Felkin, "Comparing classification results between n-ary and binary problems," inQuality Measures in Data Mining: Springer, 2007, pp. 277-301.

40- R. Ranawana and V. Palade, "Optimized precision-a new measure for classifier performance evaluation," in2006 IEEE International Conference on Evolutionary Computation, 2006: IEEE, pp. 2254-2261.

41- V. García, R. A. Mollineda, and J. S. Sánchez, "Index of balanced accuracy: A performance measure for skewed class distributions," inIberian conference on pattern recognition and image analysis, 2009: Springer, pp. 441-448.

42- P. Branco, L. Torgo, and R. P. Ribeiro, "A survey of predictive modeling on imbalanced domains,"ACM Computing Surveys (CSUR), vol. 49, no. 2, pp. 1-50, 2016.

43- K. Pearson, "Notes on Regression and Inheritance in the Case of Two Parents," inProceedings of the Royal Society of London, p. 240-242, 1895.

44- W. J. Conover,Practical nonparametric statistics. John Wiley & Sons, 1998.

45- G. U. Yule, "On the methods of measuring association between two attributes,"Journal of the Royal Statistical Society, vol. 75, no. 6, pp. 579-652, 1912.

46- R. Batuwita and V. Palade, "A new performance measure for class imbalance learning. application to bioinformatics problems," in2009 International Conference on Machine Learning and Applications, 2009: IEEE, pp. 545-550.

47- D. K. Lee, "Alternatives to P value: confidence interval and effect size,"Korean journal of anesthesiology, vol. 69, no. 6, p. 555, 2016.

48- M. A. Raslich, R. J. Markert, and S. A. Stutes, "Selecting and interpreting diagnostic tests,"Biochemia Medica, vol. 17, no. 2, pp. 151-161, 2007.

49- D. E. Hinkle, W. Wiersma, and S. G. Jurs,Applied statistics for the behavioral sciences. Houghton Mifflin College Division, 2003.

50- A. Maratea, A. Petrosino, and M. Manzo, "Adjusted F-measure and kernel scaling for imbalanced data learning,"Information Sciences, vol. 257, pp. 331-341, 2014.

51- L. Mosley, "A balanced approach to the multi-class imbalance problem," 2013.

52- M. Vijaymeena and K. Kavitha, "A survey on similarity measures in text mining,"Machine Learning and Applications: An International Journal, vol. 3, no. 2, pp. 19-28, 2016.

53- Y. Otsuka, "The faunal character of the Japanese Pleistocene marine Mollusca, as evidence of climate having become colder during the Pleistocene in Japan,"Biogeograph Soc Japan, vol. 6, no. 16, pp. 165-170, 1936.

54- A. Tversky, "Features of similarity,"Psychological review, vol. 84, no. 4, p. 327, 1977.

55- K. Boyd, K. H. Eng, and C. D. Page, "Area under the precision-recall curve: point estimates and confidence intervals," inJoint European conference on machine learning and knowledge discovery in databases, 2013: Springer, pp. 451-466.

56- J. Davis and M. Goadrich, "The relationship between Precision-Recall and ROC curves," inProceedings of the 23rd international conference on Machine learning, 2006, pp. 233-240.

57- M. Kuhn, "Building predictive models in R using the caret package,"J Stat Softw, vol. 28, no. 5, pp. 1-26, 2008.

58- V. Labatut and H. Cherifi, "Accuracy measures for the comparison of classifiers,"arXiv preprint arXiv:1207.3790, 2012.

59- S. Wallis, "Binomial confidence intervals and contingency tests: mathematical fundamentals and the evaluation of alternative methods,"Journal of Quantitative Linguistics, vol. 20, no. 3, pp. 178-208, 2013.

60- D. Altman, D. Machin, T. Bryant, and M. Gardner,Statistics with confidence: confidence intervals and statistical guidelines. John Wiley & Sons, 2013.

61- J. A. Hanley and B. J. McNeil, "The meaning and use of the area under a receiver operating characteristic (ROC) curve,"Radiology, vol. 143, no. 1, pp. 29-36, 1982.

62- E. B. Wilson, "Probable inference, the law of succession, and statistical inference,"Journal of the American Statistical Association, vol. 22, no. 158, pp. 209-212, 1927.

63- A. Agresti and B. A. Coull, "Approximate is better than “exact” for interval estimation of binomial proportions,"The American Statistician, vol. 52, no. 2, pp. 119-126, 1998.

64- C. S. Peirce, "The numerical measure of the success of predictions,"Science, no. 93, pp. 453-454, 1884.

65- E. W. Steyerberg, B. Van Calster, and M. J. Pencina, "Performance measures for prediction models and markers: evaluation of predictions and classifications,"Revista Española de Cardiología (English Edition), vol. 64, no. 9, pp. 788-794, 2011.

66- A. J. Vickers and E. B. Elkin, "Decision curve analysis: a novel method for evaluating prediction models,"Medical Decision Making, vol. 26, no. 6, pp. 565-574, 2006.

67- G. W. Bohrnstedt and D. Knoke,"Statistics for social data analysis," 1982.

68- W. M. Rand, "Objective criteria for the evaluation of clustering methods,"Journal of the American Statistical association, vol. 66, no. 336, pp. 846-850, 1971.

69- J. M. Santos and M. Embrechts, "On the use of the adjusted rand index as a metric for evaluating supervised classification," inInternational conference on artificial neural networks, 2009: Springer, pp. 175-184.

70- J. Cohen, "Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit,"Psychological bulletin, vol. 70, no. 4, p. 213, 1968.

71- R. Bakeman and J. M. Gottman,Observing interaction: An introduction to sequential analysis. Cambridge university press, 1997.

72- S. Bangdiwala, "A graphical test for observer agreement," in45th International Statistical Institute Meeting, 1985, vol. 1985, p. 307.

73- K. Bangdiwala and H. Bryan, "Using SAS software graphical procedures for the observer agreement chart," inProceedings of the SAS Users Group International Conference, 1987, vol. 12, pp. 1083-1088.

74- A. F. Hayes and K. Krippendorff, "Answering the call for a standard reliability measure for coding data,"Communication methods and measures, vol. 1, no. 1, pp. 77-89, 2007.

75- M. Aickin, "Maximum likelihood estimation of agreement in the constant predictive probability model, and its relation to Cohen's kappa,"Biometrics, pp. 293-302, 1990.

76- N. A. Macmillan and C. D. Creelman,Detectiontheory: A user's guide. Psychology press, 2004.

## Cite

If you use PyCM in your research, we would appreciate citations to the following paper :

Haghighi, S., Jasemi, M., Hessabi, S. and Zolanvari, A. (2018). PyCM: Multiclass confusion matrix library in Python. Journal of Open Source Software, 3(25), p.729.

@article{Haghighi2018, doi = {10.21105/joss.00729}, url = {https://doi.org/10.21105/joss.00729}, year = {2018}, month = {may}, publisher = {The Open Journal}, volume = {3}, number = {25}, pages = {729}, author = {Sepand Haghighi and Masoomeh Jasemi and Shaahin Hessabi and Alireza Zolanvari}, title = {{PyCM}: Multiclass confusion matrix library in Python}, journal = {Journal of Open Source Software} }

Download PyCM.bib