[pymvpa] mvpa noob questions...
Michael Hanke
mih at debian.org
Fri Sep 28 10:15:17 UTC 2012
Hi,
On Thu, Sep 27, 2012 at 02:53:11PM -0700, Ray Schumacher wrote:
> We have a study dataset with subject, label (stroke/no stroke), and
> 60 features; I'd like to make an SVM classifier and test its
> significance, most important features, etc. I get results, but also
> a few cryptic (to me) errors and some warnings.
>
> Also, if I try NFoldPartitioner() rather than HalfPartitioner(...) I
> get a traceback about missing chunks, so it seems I need to set them
> explicitly?
Yes. You could do:
ds.sa['chunks'] = np.arange(len(ds))
if you have no specific requirements.
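If your samples are repeated measurements of the same subjects, per-subject chunks are usually what you want instead, so that NFoldPartitioner() does leave-one-subject-out cross-validation. A plain-NumPy sketch (no PyMVPA needed; the 'subjects' list here is a made-up stand-in for your ds.sa['subject']):

```python
import numpy as np

# Derive integer chunk IDs from per-sample subject labels:
# samples sharing a subject end up in the same chunk.
subjects = ['S001', 'S001', 'S002', 'S003', 'S003']
_, chunk_ids = np.unique(subjects, return_inverse=True)
# ds.sa['chunks'] = chunk_ids  # one chunk per subject
print(chunk_ids.tolist())  # [0, 0, 1, 2, 2]
```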
More comments inline below...
> I also can't get searchlight set up correctly.
>
>
> My test code, based on the tutorial:
> from mvpa2.tutorial_suite import *
>
> d = file(r"C:\temp\test3_redo.csv").readlines()
> lol= [x[:-1].split(",") for x in d]
> print lol[0]
> ## the list of subject names
> subjects = [r[0] for r in lol]
> ## the feature data
> dat = [[float(c) for c in row[6:]] for row in lol]
>
> labels = [r[1] for r in lol]
> tmp = [l.replace('Normal', '0')
>        for l in [l.replace('Stroke', '1') for l in labels]]
> ## the truth values
> labels = [int(x) for x in tmp]
>
> ds = Dataset(samples=dat)
> ds.sa['subject'] = subjects
> ds.sa['targets'] = labels
> print ds, '\n'
>
> clf = LinearCSVMC()
> cvte = CrossValidation(clf,
>                        HalfPartitioner(count=2, selection_strategy='random',
>                                        attr='subject'),
>                        errorfx=lambda p, t: np.mean(p == t),
>                        enable_ca=['stats'])
> cv_results = cvte(ds)
> print cvte.ca.stats.as_string(description=True)
> print cvte.ca.stats.matrix
>
> aov = OneWayAnova()
> f = aov(ds)
> print 'aov:', f
>
> fsel = SensitivityBasedFeatureSelection(
>     OneWayAnova(),
>     FixedNElementTailSelector(5, mode='select', tail='upper'))
> fsel.train(ds)
> ds_p = fsel(ds)
> print '\nfixed:', ds_p.shape
>
> results = cvte(ds_p)
> print np.round(cvte.ca.stats.stats['ACC%'], 1)
> print cvte.ca.stats.matrix
> print
>
> fsel = SensitivityBasedFeatureSelection(
>     OneWayAnova(),
>     FractionTailSelector(0.05, mode='select', tail='upper'))
> fclf = FeatureSelectionClassifier(clf, fsel)
> cvte = CrossValidation(fclf,
>                        HalfPartitioner(count=2, selection_strategy='random',
>                                        attr='subject'),
>                        enable_ca=['stats'])
> results = cvte(ds)
> print 'fractional', np.round(cvte.ca.stats.stats['ACC%'], 1)
>
>
> Errors:
> 'gcc' is not recognized as an internal or external command,
> operable program or batch file.
> C:\Python27\lib\site-packages\scipy\integrate\quadpack.py:288:
You haven't built PyMVPA's extensions properly. How did you install
it? Here is also the usual advice: "Don't do this at home!" On Windows
it is a lot more convenient to grab the NeuroDebian virtual machine and
install PyMVPA inside it -- that shouldn't take you more than 5 minutes
to set up.
> UserWarning: Extremely bad integrand behavior occurs at some points
> of the
> integration interval.
> warnings.warn(msg)
> C:\Python27\lib\site-packages\mvpa2\misc\errorfx.py:102:
> RuntimeWarning: invalid value encountered in divide
> ([0], np.cumsum(t)/t.sum(dtype=np.float), [1]))
> C:\Python27\lib\site-packages\scipy\stats\stats.py:274:
> RuntimeWarning: invalid value encountered in double_scalars
> return np.mean(x,axis)/factor
> C:\Python27\lib\site-packages\mvpa2\misc\errorfx.py:106:
> RuntimeWarning: invalid value encountered in divide
> ([0], np.cumsum(~t)/(~t).sum(dtype=np.float), [1]))
> C:\Python27\lib\site-packages\mvpa2\clfs\transerror.py:678:
> RuntimeWarning: invalid value encountered in divide
> stats['PPV'] = stats['TP'] / (1.0*stats["P'"])
> C:\Python27\lib\site-packages\mvpa2\clfs\transerror.py:679:
> RuntimeWarning: invalid value encountered in divide
> stats['NPV'] = stats['TN'] / (1.0*stats["N'"])
> C:\Python27\lib\site-packages\mvpa2\clfs\transerror.py:680:
> RuntimeWarning: invalid value encountered in divide
> stats['FDR'] = stats['FP'] / (1.0*stats["P'"])
> C:\Python27\lib\site-packages\mvpa2\measures\anova.py:111:
> RuntimeWarning: invalid value encountered in divide
> msb = ssbn / float(dfbn)
>
Not sure why this happens -- it could be specific to your input dataset.
> Output:
> ['S001', 'Stroke', 'Structural', 'DL', 'A3+4', 'L', '33175.5142',
> '14408.18074', '10849.84165', '8059.24706', '8010.452299', '14',
> '45', '40', '55', '50', '56060.79132', '24908.80989', '16687.6343',
> '10154.6501', '7901.745475', '14', '45', '40', '50', '30',
> '64268.60726', '12620.57744', '992.4884881', '825.5158143',
> '751.0024413', '19', '27', '33', '67', '40', '2170.966193',
> '1879.560843', '1741.498856', '1340.718439', '959.5283252', '32',
> '15', '19', '42', '23', '0', '0', '0', '0', '0', '0', '0', '0', '0',
> '0', '13424.62045', '9142.678538', '8140.41212', '6403.125282',
> '5807.041425', '15', '19', '66', '32', '41']
> <Dataset: 62x60 at float64, <sa: subject,targets>>
>
> WARNING: Only 1 sets have estimates assigned from 2 sets. ROC
> estimates might be incorrect.
Here you go: the classifier never predicted target 0 correctly.
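You can read that directly off the confusion matrix in your output. A plain-NumPy sketch, using the values that cvte.ca.stats.matrix printed for you:

```python
import numpy as np

# Confusion matrix as printed: rows = predictions, columns = targets.
cm = np.array([[ 0, 23],
               [24, 15]])
# Per-target recall (TPR) = correct predictions / column total.
recall = cm.diagonal() / cm.sum(axis=0)
print(recall)  # target 0: 0.0 -- never predicted correctly; target 1: ~0.39
```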
> * Please note: warnings are printed only once, but underlying
> problem might occur many times *
> ----------.
> predictions\targets   0.0    1
>       `------        ----  ----  P'  N'  FP  FN  PPV   NPV   TPR   SPC   FDR   MCC    F1    AUC
>        0.0             0    23   23  39  23  24  0     0.38  0     0.39  1     -0.61  0     nan
>        1              24    15   39  23  24  23  0.38  0     0.39  0     0.62  -0.61  0.39  nan
> Per target:          ----  ----
>        P              24    38
>        N              38    24
>        TP              0    15
>        TN             15     0
> Summary \ Means:     ----  ----  31  31  23.5 23.5 0.19 0.19 0.2  0.2   0.81  -0.61  0.19  nan
>        CHI^2         25.68       p=1.1e-05
>        ACC            0.24
>        ACC%          24.19
>        # of sets       2
>
> Statistics computed in 1-vs-rest fashion per each target.
> Abbreviations (for details see http://en.wikipedia.org/wiki/ROC_curve):
> TP : true positive (AKA hit)
> TN : true negative (AKA correct rejection)
> FP : false positive (AKA false alarm, Type I error)
> FN : false negative (AKA miss, Type II error)
> TPR: true positive rate (AKA hit rate, recall, sensitivity)
> TPR = TP / P = TP / (TP + FN)
> FPR: false positive rate (AKA false alarm rate, fall-out)
> FPR = FP / N = FP / (FP + TN)
> ACC: accuracy
> ACC = (TP + TN) / (P + N)
> SPC: specificity
> SPC = TN / (FP + TN) = 1 - FPR
> PPV: positive predictive value (AKA precision)
> PPV = TP / (TP + FP)
> NPV: negative predictive value
> NPV = TN / (TN + FN)
> FDR: false discovery rate
> FDR = FP / (FP + TP)
> MCC: Matthews Correlation Coefficient
> MCC = (TP*TN - FP*FN)/sqrt(P N P' N')
> F1 : F1 score
> F1 = 2TP / (P + P') = 2TP / (2TP + FP + FN)
> AUC: Area under (AUC) curve
> CHI^2: Chi-square of confusion matrix
> LOE(ACC): Linear Order Effect in ACC across sets
> # of sets: number of target/prediction sets which were provided
>
> [[ 0 23]
> [24 15]]
> aov: <Dataset: 1x60 at float64, <fa: fprob>>
>
> fixed: (62, 5)
> WARNING: Obtained degenerate data with zero norm for training of
> <LinearCSVMC>. Scaling of C cannot be done.
There seems to be a problem with the dataset. Do you have invariant
features in it?
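A quick plain-NumPy check for that ('X' here is a made-up stand-in for your ds.samples):

```python
import numpy as np

# Flag invariant (zero-variance) features -- columns that never change
# across samples. Such features would explain the "degenerate data with
# zero norm" warning from the SVM.
X = np.array([[1.0, 0.0, 3.0],
              [2.0, 0.0, 3.0],
              [4.0, 0.0, 3.0]])
invariant = X.std(axis=0) == 0
print(np.flatnonzero(invariant))  # columns 1 and 2 never vary
```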
Michael
--
Michael Hanke
http://mih.voxindeserto.de
More information about the Pkg-ExpPsy-PyMVPA
mailing list