[pymvpa] Permutation test with unbalanced data, confusion matrices of each permutation round?

Tue Jun 15 15:25:36 BST 2021

Dear PyMVPA users,

I have an unbalanced dataset of 11 classes with varying numbers of events per class (a consequence of the data-driven approach). I am running leave-one-subject-out cross-validation to classify the events. I was able to implement a permutation test in which the event labels are shuffled in the training set and the classifier is then tested on the real data of the left-out subject. What I would like to do is classify all 11 classes jointly (not one versus the others, and not all pairwise classifications) and then assess whether the classifier predicts each class statistically above chance level. Because the data is unbalanced, I cannot assume that the chance level is equal for each class (the naive chance level would be 1/11 ≈ 0.09). With the code below I was able to create a null distribution of classification accuracies for the whole model, but I cannot assess the statistical significance of each class by comparing the class accuracies with the null distribution of the whole model.

I would like to extract the confusion matrix of the predictions for each permutation round (event labels shuffled in the training set). From these confusion matrices I could build a separate null distribution of prediction accuracy for each class and then assess the significance of each class separately. I assumed this would be fairly simple to do, but I could not find anything related in the documentation or on this mailing list.
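
To make the intended per-class test concrete, here is a minimal NumPy sketch, not PyMVPA code. It assumes the confusion matrices from the permutation rounds are already available as a list of arrays, which is exactly the part I do not know how to obtain. It also assumes that rows index the true class and columns the predicted class; a matrix laid out the other way would need to be transposed first. The helper names per_class_accuracy and per_class_pvalues are hypothetical.

```python
import numpy as np

def per_class_accuracy(confmat):
    """Fraction of correctly predicted events per class.

    Assumption: rows = true class, columns = predicted class.
    """
    confmat = np.asarray(confmat, dtype=float)
    row_totals = confmat.sum(axis=1)
    # Classes without any test events get NaN instead of a division error
    return np.divide(np.diag(confmat), row_totals,
                     out=np.full(row_totals.shape, np.nan),
                     where=row_totals > 0)

def per_class_pvalues(observed_confmat, null_confmats):
    """Right-tailed permutation p-value for each class."""
    observed = per_class_accuracy(observed_confmat)
    # One row of per-class accuracies per permutation round
    null = np.array([per_class_accuracy(m) for m in null_confmats])
    n_perm = null.shape[0]
    # Count permutations that reach or exceed the observed accuracy
    exceed = (null >= observed).sum(axis=0)
    # The +1 terms count the observed labelling as one permutation
    return (exceed + 1.0) / (n_perm + 1.0)
```

With this, each class is compared against its own null distribution, so the unequal chance levels caused by the class imbalance are handled by the permutations themselves rather than by the 1/11 assumption.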

The relevant sections of my current PyMVPA code:

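# Imports (assumed here; not shown in my original excerpt)
from mvpa2.suite import *
import numpy as np
from scipy import stats
from sklearn.neural_network import MLPClassifier

# Classifier
clf = MLPClassifier(alpha=1, max_iter=1000)  # Neural net classifier
wrapped_clf = SKLLearnerAdapter(clf)  # Wrap the scikit-learn estimator for PyMVPA
# Workaround: recent SciPy releases no longer provide stats.chisqprob
stats.chisqprob = lambda chisq, df: stats.chi2.sf(chisq, df)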

# Permutator
permutator = AttributePermutator('targets', count=1000)
distr_est = MCNullDist(permutator, tail='right', enable_ca=['dist_samples'])

# Cross validation
cv = CrossValidation(wrapped_clf, NFoldPartitioner(attr='subject'),
                     errorfx=lambda p, t: np.mean(p == t),
                     postproc=mean_sample(),
                     null_dist=distr_est,
                     enable_ca=['stats'])

bsc_null_results = cv(ds_mni)
perm_accu = cv.null_dist.ca.dist_samples  # Null distribution of accuracies for the whole model
accuracy_bs = bsc_null_results.S  # Real accuracy for the whole model
confmat_bs = cv.ca.stats.matrix  # Confusion matrix for classification using real data in the training set

Is it possible to extract the confusion matrices for each permutation round? If not, I would be thankful for any advice on how to assess the significance of each class accuracy separately, given the unbalanced data.
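
If PyMVPA cannot return these matrices, one fallback might be to write the permutation loop by hand. The sketch below is only an illustration under stated assumptions: it uses plain NumPy, a toy nearest-centroid classifier standing in for the MLP, and synthetic data. All function names (fit_centroids, predict_centroids, confusion_matrix, loso_confmat) are hypothetical. Labels are shuffled across the whole training set in each fold, while the left-out subject keeps its real labels.

```python
import numpy as np

def fit_centroids(X, y, classes):
    # Toy classifier (stand-in for the real one): one mean vector per class.
    # Assumes every class occurs at least once in the training labels.
    return np.array([X[y == c].mean(axis=0) for c in classes])

def predict_centroids(centroids, X, classes):
    # Assign each sample to the class with the nearest centroid
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(dist, axis=1)]

def confusion_matrix(y_true, y_pred, classes):
    # Rows = true class, columns = predicted class
    index = {c: i for i, c in enumerate(classes)}
    mat = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(y_true, y_pred):
        mat[index[t], index[p]] += 1
    return mat

def loso_confmat(X, y, subjects, classes, rng=None):
    """Leave-one-subject-out confusion matrix, summed over folds.

    If rng is given, the training labels are shuffled in every fold;
    the test labels of the left-out subject are never touched.
    """
    total = np.zeros((len(classes), len(classes)), dtype=int)
    for s in np.unique(subjects):
        train, test = subjects != s, subjects == s
        y_train = y[train]
        if rng is not None:
            y_train = rng.permutation(y_train)
        centroids = fit_centroids(X[train], y_train, classes)
        y_pred = predict_centroids(centroids, X[test], classes)
        total += confusion_matrix(y[test], y_pred, classes)
    return total

# Synthetic, unbalanced example: 4 subjects, 12/6/3 events per class each
rng = np.random.default_rng(0)
classes = np.array([0, 1, 2])
y = np.tile(np.repeat(classes, [12, 6, 3]), 4)
subjects = np.repeat(np.arange(4), 21)
X = rng.normal(size=(y.size, 5)) + y[:, None]  # class-dependent mean shift

observed = loso_confmat(X, y, subjects, classes)
n_perm = 200
null_confmats = np.array([loso_confmat(X, y, subjects, classes, rng)
                          for _ in range(n_perm)])
```

Here null_confmats has shape (n_perm, n_classes, n_classes), so null_confmats[:, k, k] / null_confmats[:, k, :].sum(axis=1) gives the null distribution of accuracy for class k. Whether shuffling should instead be restricted to within each training subject is a design choice I am unsure about.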

Sincerely,
Severi Santavirta