[med-svn] [Git][med-team/q2-sample-classifier][upstream] New upstream version 2021.8.0
Andreas Tille (@tille)
gitlab at salsa.debian.org
Mon Sep 27 20:19:26 BST 2021
Andreas Tille pushed to branch upstream at Debian Med / q2-sample-classifier
Commits:
7c4b78fa by Andreas Tille at 2021-09-27T21:05:46+02:00
New upstream version 2021.8.0
- - - - -
26 changed files:
- + .github/workflows/ci.yml
- − .travis.yml
- LICENSE
- README.md
- − ci/recipe/conda_build_config.yaml
- ci/recipe/meta.yaml
- q2_sample_classifier/__init__.py
- q2_sample_classifier/_format.py
- q2_sample_classifier/_transformer.py
- q2_sample_classifier/_type.py
- q2_sample_classifier/_version.py
- q2_sample_classifier/assets/index.html
- q2_sample_classifier/classify.py
- q2_sample_classifier/plugin_setup.py
- q2_sample_classifier/tests/__init__.py
- + q2_sample_classifier/tests/data/true_targets.tsv
- q2_sample_classifier/tests/test_actions.py
- q2_sample_classifier/tests/test_base_class.py
- q2_sample_classifier/tests/test_classifier.py
- q2_sample_classifier/tests/test_estimators.py
- q2_sample_classifier/tests/test_types_formats_transformers.py
- q2_sample_classifier/tests/test_utilities.py
- q2_sample_classifier/tests/test_visualization.py
- q2_sample_classifier/utilities.py
- q2_sample_classifier/visuals.py
- setup.py
Changes:
=====================================
.github/workflows/ci.yml
=====================================
@@ -0,0 +1,55 @@
+# This file is automatically generated by busywork.qiime2.org and
+# template-repos - any manual edits made to this file will be erased when
+# busywork performs maintenance updates.
+
+name: ci
+
+on:
+ pull_request:
+ push:
+ branches:
+ - master
+
+jobs:
+ lint:
+ runs-on: ubuntu-latest
+ steps:
+ - name: checkout source
+ uses: actions/checkout@v2
+
+ - name: set up python 3.8
+ uses: actions/setup-python@v1
+ with:
+ python-version: 3.8
+
+ - name: install dependencies
+ run: python -m pip install --upgrade pip
+
+ - name: lint
+ run: |
+ pip install -q https://github.com/qiime2/q2lint/archive/master.zip
+ q2lint
+ pip install -q flake8
+ flake8
+
+ build-and-test:
+ needs: lint
+ strategy:
+ matrix:
+ os: [ubuntu-latest, macos-latest]
+ runs-on: ${{ matrix.os }}
+ steps:
+ - name: checkout source
+ uses: actions/checkout@v2
+ with:
+ fetch-depth: 0
+
+ - name: set up git repo for versioneer
+ run: git fetch --depth=1 origin +refs/tags/*:refs/tags/*
+
+ - uses: qiime2/action-library-packaging@alpha1
+ with:
+ package-name: q2-sample-classifier
+ build-target: dev
+ additional-tests: py.test --pyargs q2_sample_classifier
+ library-token: ${{ secrets.LIBRARY_TOKEN }}
=====================================
.travis.yml deleted
=====================================
@@ -1,25 +0,0 @@
-sudo: false
-language: python
-before_install:
- - export MPLBACKEND='Agg'
- - wget -q https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
- - export MINICONDA_PREFIX="$HOME/miniconda"
- - bash miniconda.sh -b -p $MINICONDA_PREFIX
- - export PATH="$MINICONDA_PREFIX/bin:$PATH"
- - conda config --set always_yes yes
- - conda update -q conda
- - conda info -a
-install:
- - wget -q https://raw.githubusercontent.com/qiime2/environment-files/master/latest/staging/qiime2-latest-py36-linux-conda.yml
- - conda env create -q -n test-env --file qiime2-latest-py36-linux-conda.yml
- - source activate test-env
- - conda install -q pytest-cov
- - pip install flake8 coveralls
- - pip install -q https://github.com/qiime2/q2lint/archive/master.zip
- - make install
-script:
- - make lint
- - make test-cov
-after_success:
- - coveralls
-
=====================================
LICENSE
=====================================
@@ -1,6 +1,6 @@
BSD 3-Clause License
-Copyright (c) 2017-2020, QIIME 2 development team.
+Copyright (c) 2017-2021, QIIME 2 development team.
All rights reserved.
Redistribution and use in source and binary forms, with or without
=====================================
README.md
=====================================
@@ -1,35 +1,5 @@
# q2-sample-classifier
-[![Build Status](https://travis-ci.org/qiime2/q2-sample-classifier.svg?branch=master)](https://travis-ci.org/qiime2/q2-sample-classifier) [![Coverage Status](https://coveralls.io/repos/github/qiime2/q2-sample-classifier/badge.svg?branch=master)](https://coveralls.io/github/qiime2/q2-sample-classifier?branch=master) [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1468879.svg)](https://doi.org/10.5281/zenodo.1468879) [![status](http://joss.theoj.org/papers/d828a4ecf73eb6a147f8634e9054eeee/status.svg)](http://joss.theoj.org/papers/d828a4ecf73eb6a147f8634e9054eeee)
+![](https://github.com/qiime2/q2-sample-classifier/workflows/ci/badge.svg)
-
-QIIME 2 plugin for machine learning prediction of sample data.
-
-Microbiome studies often aim to predict outcomes or differentiate samples based on their microbial compositions, tasks that can be efficiently performed by supervised learning methods. The q2-sample-classifier plugin makes these methods more accessible, reproducible, and interpretable to a broad audience of microbiologists, clinicians, and others who wish to utilize supervised learning methods for predicting sample characteristics based on microbiome composition or other "omics" data.
-
-## Installation
-
-Follow the QIIME 2 core distribution installation instructions at https://qiime2.org/ to install q2-sample-classifier as part of the QIIME 2 analysis platform.
-
-To test deployment, run:
-```
-pytest --disable-pytest-warnings --pyargs q2_sample_classifier
-```
-
-## Example usage
-
-This is a QIIME 2 plugin. For details on QIIME 2 and tutorials demonstrating how to use this plugin, see the [QIIME 2 documentation](https://qiime2.org/). Tutorials for this plugin can be found [here](https://docs.qiime2.org/2018.8/tutorials/sample-classifier/).
-
-Not sure which learning model to use? A good starting point is [this flowchart](http://scikit-learn.org/dev/tutorial/machine_learning_map/index.html). Most of the classification and regression models shown in that chart (and a few extras) are implemented in q2-sample-classifier.
-
-## API documentation
-
-API documentation can be found [here](https://docs.qiime2.org/2018.8/plugins/available/sample-classifier/).
-
-## Help
-
-For user support, see the [QIIME 2 Forum](https://forum.qiime2.org). Bug reports and feature requests can also be made [via a new issue](https://github.com/qiime2/q2-sample-classifier/issues/new/choose).
-
-## Contributing
-
-QIIME 2 is an open-source project, and we are very interested in contributions from the community. Please see the [contributing guidelines](https://github.com/qiime2/q2-sample-classifier/blob/master/.github/CONTRIBUTING.md) if you would like to get involved.
+This is a QIIME 2 plugin. For details on QIIME 2, see https://qiime2.org..
\ No newline at end of file
=====================================
ci/recipe/conda_build_config.yaml deleted
=====================================
@@ -1,2 +0,0 @@
-python:
- - 3.6
=====================================
ci/recipe/meta.yaml
=====================================
@@ -1,6 +1,5 @@
{% set data = load_setup_py_data() %}
{% set version = data.get('version') or 'placehold' %}
-{% set release = '.'.join(version.split('.')[:2]) %}
package:
name: q2-sample-classifier
@@ -19,19 +18,27 @@ requirements:
run:
- python {{ python }}
- - pandas >=1
+ - pandas {{ pandas }}
- scipy
+ - numpy
- joblib
- scikit-learn >=0.22.1
- scikit-bio
- seaborn >=0.8
- fastcluster
- - qiime2 {{ release }}.*
- - q2-types {{ release }}.*
- - q2templates {{ release }}.*
- - q2-feature-table {{ release }}.*
+ - qiime2 {{ qiime2_epoch }}.*
+ - q2-types {{ qiime2_epoch }}.*
+ - q2templates {{ qiime2_epoch }}.*
+ - q2-feature-table {{ qiime2_epoch }}.*
test:
+ requires:
+ - qiime2 >={{ qiime2 }}
+ - q2-types >={{ q2_types }}
+ - q2templates >={{ q2templates }}
+ - q2-feature-table >={{ q2_feature_table }}
+ - pytest
+
imports:
- q2_sample_classifier
- qiime2.plugins.sample_classifier
=====================================
q2_sample_classifier/__init__.py
=====================================
@@ -1,5 +1,5 @@
# ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
#
# Distributed under the terms of the Modified BSD License.
#
@@ -10,10 +10,11 @@ from ._format import (
BooleanSeriesFormat, BooleanSeriesDirectoryFormat,
PredictionsFormat, PredictionsDirectoryFormat, ImportanceFormat,
ImportanceDirectoryFormat, SampleEstimatorDirFmt, PickleFormat,
- ProbabilitiesFormat, ProbabilitiesDirectoryFormat)
+ ProbabilitiesFormat, ProbabilitiesDirectoryFormat,
+ TrueTargetsDirectoryFormat)
from ._type import (BooleanSeries, ClassifierPredictions, RegressorPredictions,
Importance, SampleEstimator, Classifier, Regressor,
- Probabilities)
+ Probabilities, TrueTargets)
from ._version import get_versions
@@ -26,4 +27,5 @@ __all__ = ['BooleanSeriesFormat', 'BooleanSeriesDirectoryFormat',
'SampleEstimatorDirFmt', 'PickleFormat', 'BooleanSeries',
'ClassifierPredictions', 'RegressorPredictions', 'Importance',
'Classifier', 'Regressor', 'SampleEstimator', 'Probabilities',
- 'ProbabilitiesFormat', 'ProbabilitiesDirectoryFormat']
+ 'ProbabilitiesFormat', 'ProbabilitiesDirectoryFormat',
+ 'TrueTargets', 'TrueTargetsDirectoryFormat']
=====================================
q2_sample_classifier/_format.py
=====================================
@@ -1,5 +1,5 @@
# ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
#
# Distributed under the terms of the Modified BSD License.
#
@@ -176,3 +176,8 @@ class ProbabilitiesFormat(_MultiColumnNumericFormat):
ProbabilitiesDirectoryFormat = model.SingleFileDirectoryFormat(
'ProbabilitiesDirectoryFormat', 'class_probabilities.tsv',
ProbabilitiesFormat)
+
+
+TrueTargetsDirectoryFormat = model.SingleFileDirectoryFormat(
+ 'TrueTargetsDirectoryFormat', 'true_targets.tsv',
+ PredictionsFormat)
=====================================
q2_sample_classifier/_transformer.py
=====================================
@@ -1,5 +1,5 @@
# ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
#
# Distributed under the terms of the Modified BSD License.
#
=====================================
q2_sample_classifier/_type.py
=====================================
@@ -1,5 +1,5 @@
# ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
#
# Distributed under the terms of the Modified BSD License.
#
@@ -26,3 +26,5 @@ Importance = SemanticType(
'Importance', variant_of=FeatureData.field['type'])
Probabilities = SemanticType(
'Probabilities', variant_of=SampleData.field['type'])
+TrueTargets = SemanticType(
+ 'TrueTargets', variant_of=SampleData.field['type'])
=====================================
q2_sample_classifier/_version.py
=====================================
@@ -23,9 +23,9 @@ def get_keywords():
# setup.py/versioneer.py will grep for the variable names, so they must
# each be defined on a line of their own. _version.py will just call
# get_keywords().
- git_refnames = " (HEAD -> master, tag: 2020.11.1)"
- git_full = "392e5454b7cb6a91d43968d62178fb6a216e193f"
- git_date = "2020-12-05 20:44:51 +0000"
+ git_refnames = " (tag: 2021.8.0)"
+ git_full = "916eb0799fa2c95a04766b44c22335c6a097ce13"
+ git_date = "2021-09-09 18:35:33 +0000"
keywords = {"refnames": git_refnames, "full": git_full, "date": git_date}
return keywords
=====================================
q2_sample_classifier/assets/index.html
=====================================
@@ -6,6 +6,25 @@
{% block content %}
+{% if warning_msg %}
+<div class="panel-group" id="warnings" role="tablist" aria-multiselectable="true">
+ <div class="panel panel-warning">
+ <div class="panel-heading" role="tab" id="warnings-heading">
+ <h4 class="panel-title">
+ <a role="button" data-toggle="collapse" data-parent="#warnings" href="#warnings-list" aria-expanded="true" aria-controls="warnings-list">
+ Warnings (click here to collapse/expand):
+ </a>
+ </h4>
+ </div>
+ <div id="warnings-list" class="panel-collapse collapse in" role="tabpanel" aria-labelledby="warnings-heading">
+ <div class="alert alert-warning col-md-12">
+ <p><strong>{{ warning_msg }}</strong></p>
+ </div>
+ </div>
+ </div>
+</div>
+{% endif %}
+
<div class="row">
{% if predictions %}
<h1>Model Accuracy</h1>
@@ -65,6 +84,9 @@
<br>
<p>Download as PDF</p>
</a>
+ <a href="rfe_scores.tsv">
+ <p>Download as TSV</p>
+ </a>
</div>
{% endif %}
{% if result %}
=====================================
q2_sample_classifier/classify.py
=====================================
@@ -1,5 +1,5 @@
# ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
#
# Distributed under the terms of the Modified BSD License.
#
@@ -163,8 +163,10 @@ def classify_samples(ctx,
confusion = ctx.get_action('sample_classifier', 'confusion_matrix')
heat = ctx.get_action('sample_classifier', 'heatmap')
- X_train, X_test = split(table, metadata, test_size, random_state,
- stratify=True, missing_samples=missing_samples)
+ X_train, X_test, y_train, y_test = split(table, metadata, test_size,
+ random_state,
+ stratify=True,
+ missing_samples=missing_samples)
sample_estimator, importance = fit(
X_train, metadata, step, cv, random_state, n_jobs, n_estimators,
@@ -183,7 +185,7 @@ def classify_samples(ctx,
group_samples=True, missing_samples=missing_samples)
return (sample_estimator, importance, predictions, summary,
- accuracy_results, probabilities, _heatmap)
+ accuracy_results, probabilities, _heatmap, y_train, y_test)
def regress_samples(ctx,
@@ -207,8 +209,10 @@ def regress_samples(ctx,
summarize_estimator = ctx.get_action('sample_classifier', 'summarize')
scatter = ctx.get_action('sample_classifier', 'scatterplot')
- X_train, X_test = split(table, metadata, test_size, random_state,
- stratify, missing_samples=missing_samples)
+ X_train, X_test, y_train, y_test = split(table, metadata, test_size,
+ random_state,
+ stratify,
+ missing_samples=missing_samples)
sample_estimator, importance = fit(
X_train, metadata, step, cv, random_state, n_jobs, n_estimators,
@@ -303,15 +307,12 @@ def split_table(table: biom.Table, metadata: qiime2.MetadataColumn,
test_size: float = defaults['test_size'],
random_state: int = None, stratify: str = True,
missing_samples: str = defaults['missing_samples']
- ) -> (biom.Table, biom.Table):
+ ) -> (biom.Table, biom.Table, pd.Series, pd.Series):
column = metadata.name
X_train, X_test, y_train, y_test = _prepare_training_data(
table, metadata, column, test_size, random_state, load_data=True,
stratify=stratify, missing_samples=missing_samples)
- # TODO: we can consider returning the metadata (y_train, y_test) if a
- # SampleData[Metadata] type comes into existence. For now we will just
- # throw this out.
- return X_train, X_test
+ return X_train, X_test, y_train, y_test
def regress_samples_ncv(
@@ -366,6 +367,7 @@ def confusion_matrix(output_dir: str,
missing_samples: str = defaults['missing_samples'],
vmin: int = 'auto', vmax: int = 'auto',
palette: str = defaults['palette']) -> None:
+
if vmin == 'auto':
vmin = None
if vmax == 'auto':
=====================================
q2_sample_classifier/plugin_setup.py
=====================================
@@ -1,5 +1,5 @@
# ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
#
# Distributed under the terms of the Modified BSD License.
#
@@ -33,11 +33,13 @@ from ._format import (SampleEstimatorDirFmt,
PredictionsFormat,
PredictionsDirectoryFormat,
ProbabilitiesFormat,
- ProbabilitiesDirectoryFormat)
+ ProbabilitiesDirectoryFormat,
+ TrueTargetsDirectoryFormat)
from ._type import (ClassifierPredictions, RegressorPredictions,
SampleEstimator, BooleanSeries, Importance,
- Classifier, Regressor, Probabilities)
+ Classifier, Regressor, Probabilities,
+ TrueTargets)
import q2_sample_classifier
citations = Citations.load('citations.bib', package='q2_sample_classifier')
@@ -102,7 +104,7 @@ parameters = {
'missing_samples': Str % Choices(['error', 'ignore'])},
'splitter': {
'test_size': Float % Range(0.0, 1.0, inclusive_end=False,
- inclusive_start=False)},
+ inclusive_start=True)},
'rfe': {
'step': Float % Range(0.0, 1.0, inclusive_end=False,
inclusive_start=False),
@@ -226,15 +228,21 @@ plugin.pipelines.register_function(
('feature_importance', FeatureData[Importance]),
('predictions', SampleData[ClassifierPredictions])
] + pipeline_outputs + [
- ('probabilities', SampleData[Probabilities]),
- ('heatmap', Visualization)],
+ ('probabilities', SampleData[Probabilities]),
+ ('heatmap', Visualization),
+ ('training_targets', SampleData[TrueTargets]),
+ ('test_targets', SampleData[TrueTargets])],
input_descriptions={'table': input_descriptions['table']},
parameter_descriptions=classifier_pipeline_parameter_descriptions,
output_descriptions={
**pipeline_output_descriptions,
'probabilities': input_descriptions['probabilities'],
'heatmap': 'A heatmap of the top 50 most important features from the '
- 'table.'},
+ 'table.',
+ 'training_targets': 'Series containing true target values of '
+ 'train samples',
+ 'test_targets': 'Series containing true target values '
+ 'of test samples'},
name='Train and test a cross-validated supervised learning classifier.',
description=description.format(
'categorical', 'supervised learning classifier')
@@ -258,7 +266,7 @@ plugin.pipelines.register_function(
'metadata': 'Categorical metadata column to use as prediction target.',
'k': 'Number of nearest neighbors',
'palette': 'The color palette to use for plotting.',
- },
+ },
output_descriptions={
'predictions': 'leave one out predictions for each sample',
'accuracy_results': 'Accuracy results visualization.',
@@ -486,7 +494,9 @@ plugin.methods.register_function(
'metadata': MetadataColumn[Numeric | Categorical],
**parameters['regressor']},
outputs=[('training_table', FeatureTable[T]),
- ('test_table', FeatureTable[T])],
+ ('test_table', FeatureTable[T]),
+ ('training_targets', SampleData[TrueTargets]),
+ ('test_targets', SampleData[TrueTargets])],
input_descriptions={'table': 'Feature table containing all features that '
'should be used for target prediction.'},
parameter_descriptions={
@@ -497,7 +507,11 @@ plugin.methods.register_function(
'metadata': 'Numeric metadata column to use as prediction target.'},
output_descriptions={
'training_table': 'Feature table containing training samples',
- 'test_table': 'Feature table containing test samples'},
+ 'test_table': 'Feature table containing test samples',
+ 'training_targets': 'Series containing true target values of '
+ 'train samples',
+ 'test_targets': 'Series containing true target values of '
+ 'test samples'},
name='Split a feature table into training and testing sets.',
description=(
'Split a feature table into training and testing sets. By default '
@@ -623,7 +637,7 @@ plugin.pipelines.register_function(
# Registrations
plugin.register_semantic_types(
SampleEstimator, BooleanSeries, Importance, ClassifierPredictions,
- RegressorPredictions, Classifier, Regressor, Probabilities)
+ RegressorPredictions, Classifier, Regressor, Probabilities, TrueTargets)
plugin.register_semantic_type_to_format(
SampleEstimator[Classifier],
artifact_format=SampleEstimatorDirFmt)
@@ -645,9 +659,13 @@ plugin.register_semantic_type_to_format(
plugin.register_semantic_type_to_format(
SampleData[Probabilities],
artifact_format=ProbabilitiesDirectoryFormat)
+plugin.register_semantic_type_to_format(
+ SampleData[TrueTargets],
+ artifact_format=TrueTargetsDirectoryFormat)
plugin.register_formats(
SampleEstimatorDirFmt, BooleanSeriesFormat, BooleanSeriesDirectoryFormat,
ImportanceFormat, ImportanceDirectoryFormat, PredictionsFormat,
PredictionsDirectoryFormat, ProbabilitiesFormat,
- ProbabilitiesDirectoryFormat)
+ ProbabilitiesDirectoryFormat,
+ TrueTargetsDirectoryFormat)
importlib.import_module('q2_sample_classifier._transformer')
=====================================
q2_sample_classifier/tests/__init__.py
=====================================
@@ -1,5 +1,5 @@
# ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
#
# Distributed under the terms of the Modified BSD License.
#
=====================================
q2_sample_classifier/tests/data/true_targets.tsv
=====================================
@@ -0,0 +1,9 @@
+sample-id delivery
+10249.C041.08SS Vaginal
+10249.C055.08SS Cesarean
+10249.C027.07SS Vaginal
+10249.C042.07SS Vaginal
+10249.C005.08SS Cesarean
+10249.C056.09SS Cesarean
+10249.C035.07SD Vaginal
+10249.C001.10SS Vaginal
=====================================
q2_sample_classifier/tests/test_actions.py
=====================================
@@ -1,11 +1,15 @@
# ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
#
# Distributed under the terms of the Modified BSD License.
#
# The full license is in the file LICENSE, distributed with this software.
# ----------------------------------------------------------------------------
+
+import os
+
import pandas as pd
+import pandas.testing as pdt
import numpy as np
import biom
@@ -43,11 +47,21 @@ class NowLetsTestTheActions(SampleClassifierTestPluginBase):
md2.index.name = 'SampleID'
self.md2 = qiime2.Metadata(md2)
- # let's make sure the correct transformers are in place! See issue 114
- # if this runs without error, that's good enough for me. We already
- # validate the function above.
+ # let's make sure the function runs w/o errors and that the correct
+ # transformers are in place (see issue 114)
def test_action_split_table(self):
- sample_classifier.actions.split_table(self.tab, self.md, test_size=0.5)
+ res = sample_classifier.actions.split_table(
+ self.tab, self.md, test_size=0.5)
+ y_train = res.training_targets.view(pd.Series)
+ y_test = res.test_targets.view(pd.Series)
+
+ # test whether extracted target is correct
+ self.assertEqual(y_train.name, 'bugs')
+
+ # test if complete target column is covered
+ y_all = y_train.append(y_test).sort_index()
+ y_all.index.name = 'SampleID'
+ pdt.assert_series_equal(y_all, self.md._series)
def test_metatable(self):
exp = biom.Table(
@@ -155,6 +169,15 @@ class TestSummarize(SampleEstimatorTestBase):
def test_summary_with_rfecv(self):
summarize(self.temp_dir.name, self.pipeline)
+ self.assertTrue('rfe_plot.pdf' in os.listdir(self.temp_dir.name))
+ self.assertTrue('rfe_plot.png' in os.listdir(self.temp_dir.name))
+ self.assertTrue('rfe_scores.tsv' in os.listdir(self.temp_dir.name))
+
def test_summary_without_rfecv(self):
+ # nuke the rfe_scores to test the other branch of _summarize_estimator
del self.pipeline.rfe_scores
summarize(self.temp_dir.name, self.pipeline)
+
+ self.assertFalse('rfe_plot.pdf' in os.listdir(self.temp_dir.name))
+ self.assertFalse('rfe_plot.png' in os.listdir(self.temp_dir.name))
+ self.assertFalse('rfe_scores.tsv' in os.listdir(self.temp_dir.name))
=====================================
q2_sample_classifier/tests/test_base_class.py
=====================================
@@ -1,5 +1,5 @@
# ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
#
# Distributed under the terms of the Modified BSD License.
#
=====================================
q2_sample_classifier/tests/test_classifier.py
=====================================
@@ -1,5 +1,5 @@
# ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
#
# Distributed under the terms of the Modified BSD License.
#
@@ -10,7 +10,7 @@ import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
-import pandas.util.testing as pdt
+import pandas.testing as pdt
import biom
import qiime2
=====================================
q2_sample_classifier/tests/test_estimators.py
=====================================
@@ -1,5 +1,5 @@
# ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
#
# Distributed under the terms of the Modified BSD License.
#
@@ -7,7 +7,7 @@
# ----------------------------------------------------------------------------
import os
import pandas as pd
-import pandas.util.testing as pdt
+import pandas.testing as pdt
import biom
import shutil
import json
@@ -113,7 +113,7 @@ class EstimatorsTests(SampleClassifierTestPluginBase):
self.mdc_ecam_fp = _load_nmc('ecam_map_maturity.txt', 'month')
self.exp_imp = pd.read_csv(
self.get_data_path('importance.tsv'), sep='\t', header=0,
- index_col=0)
+ index_col=0, names=['feature', 'importance'])
self.exp_pred = pd.read_csv(
self.get_data_path('predictions.tsv'), sep='\t', header=0,
index_col=0, squeeze=True)
@@ -152,7 +152,7 @@ class EstimatorsTests(SampleClassifierTestPluginBase):
[1, 0, 4, 4],
[4, 4, 0, 1],
[4, 4, 1, 0],
- ], ids=sample_ids)
+ ], ids=sample_ids)
dm = qiime2.Artifact.import_data('DistanceMatrix', distance_matrix)
categories = pd.Series(('skinny', 'skinny', 'fat', 'fat'),
@@ -182,7 +182,7 @@ class EstimatorsTests(SampleClassifierTestPluginBase):
[2, 0, 1, 1],
[3, 1, 0, 1],
[3, 1, 1, 0],
- ], ids=sample_ids)
+ ], ids=sample_ids)
dm = qiime2.Artifact.import_data('DistanceMatrix', distance_matrix)
categories = pd.Series(('fat', 'skinny', 'skinny', 'skinny'),
@@ -212,7 +212,7 @@ class EstimatorsTests(SampleClassifierTestPluginBase):
[2, 0, 3, 4],
[1, 3, 0, 3],
[5, 4, 3, 0],
- ], ids=sample_ids)
+ ], ids=sample_ids)
dm = qiime2.Artifact.import_data('DistanceMatrix', distance_matrix)
categories = pd.Series(('fat', 'skinny', 'skinny', 'skinny'),
@@ -251,6 +251,20 @@ class EstimatorsTests(SampleClassifierTestPluginBase):
msg='Accuracy of %s classifier was %f, but expected %f' % (
classifier, accuracy, seeded_results[classifier]))
+ # test if training classifier with pipeline classify_samples raises
+ # warning when test_size = 0.0
+ def test_classify_samples_w_all_train_set(self):
+ with self.assertWarnsRegex(Warning, "not representative of "
+ "your model's performance"):
+ table_fp = self.get_data_path('chardonnay.table.qza')
+ table = qiime2.Artifact.load(table_fp)
+ sample_classifier.actions.classify_samples(
+ table=table, metadata=self.mdc_chard_fp,
+ test_size=0.0, cv=1, n_estimators=10, n_jobs=1,
+ estimator='RandomForestClassifier', random_state=123,
+ parameter_tuning=False, optimize_feature_selection=False,
+ missing_samples='ignore')
+
# test that the plugin methods/visualizers work
def test_regress_samples_ncv(self):
y_pred, importances = regress_samples_ncv(
@@ -289,7 +303,8 @@ class EstimatorsTests(SampleClassifierTestPluginBase):
name='prediction')
exp_importances = pd.DataFrame(
[0.595111111111111, 0.23155555555555551, 0.17333333333333334],
- index=pd.Index(['o3', 'o1', 'o2']), columns=['importance'])
+ index=pd.Index(['o3', 'o1', 'o2'], name='feature'),
+ columns=['importance'])
exp_probabilities = pd.DataFrame(
[[0.5, 0.5], [0., 1.], [0., 1.], [0.5, 0.5], [0.5, 0.5],
[0.5, 0.5], [0.5, 0.5], [0., 1.], [1., 0.], [1., 0.]],
@@ -390,28 +405,31 @@ class EstimatorsTests(SampleClassifierTestPluginBase):
missing_samples='ignore')
def test_split_table_no_rounding_error(self):
- X_train, X_test = split_table(
+ X_train, X_test, y_train, y_test = split_table(
self.table_chard_fp, self.mdc_chard_fp, test_size=0.5,
random_state=123, stratify=True, missing_samples='ignore')
self.assertEqual(len(X_train.ids()) + len(X_test.ids()), 21)
+ self.assertEqual(y_train.shape[0] + y_test.shape[0], 21)
def test_split_table_no_split(self):
- X_train, X_test = split_table(
+ X_train, X_test, y_train, y_test = split_table(
self.table_chard_fp, self.mdc_chard_fp, test_size=0.0,
random_state=123, stratify=True, missing_samples='ignore')
self.assertEqual(len(X_train.ids()), 21)
+ self.assertEqual(y_train.shape[0], 21)
def test_split_table_invalid_test_size(self):
with self.assertRaisesRegex(ValueError, "at least two samples"):
- X_train, X_test = split_table(
+ X_train, X_test, y_train, y_test = split_table(
self.table_chard_fp, self.mdc_chard_fp, test_size=1.0,
random_state=123, stratify=True, missing_samples='ignore')
def test_split_table_percnorm(self):
- X_train, X_test = split_table(
+ X_train, X_test, y_train, y_test = split_table(
self.table_percnorm, self.mdc_percnorm, test_size=0.5,
random_state=123, stratify=True, missing_samples='ignore')
self.assertEqual(len(X_train.ids()) + len(X_test.ids()), 4)
+ self.assertEqual(y_train.shape[0] + y_test.shape[0], 4)
# test experimental functions
def test_detect_outliers(self):
@@ -450,6 +468,7 @@ class EstimatorsTests(SampleClassifierTestPluginBase):
sample_ids = pred.index.intersection(exp.index)
pred = pred.loc[sample_ids]
exp = exp.loc[sample_ids]
+ # verify predictions:
# test that expected number of correct results is achieved (these
# are mostly quite high as we would expect (total n=21))
correct_results = np.sum(pred == exp)
@@ -458,6 +477,17 @@ class EstimatorsTests(SampleClassifierTestPluginBase):
msg='Accuracy of %s classifier was %f, but expected %f' % (
classifier, correct_results,
seeded_predict_results[classifier]))
+ # verify probabilities
+ # test whether all are in correct range (0 to 1)
+ ls_pred_classes = prob.columns.tolist()
+ ls_correct_range = [col for col in ls_pred_classes if
+ prob[col].between(
+ 0, 1, inclusive=True).all()]
+ self.assertEqual(len(ls_correct_range), prob.shape[1],
+ msg='Predicted probabilities of class {}'
+ 'are not in range [0,1]'.format(
+ [col for col in ls_pred_classes
+ if col not in ls_correct_range]))
def test_predict_regressions(self):
for regressor in ['RandomForestRegressor', 'ExtraTreesRegressor',
@@ -526,7 +556,7 @@ seeded_results = {
'ExtraTreesClassifier': 0.454545454545,
'GradientBoostingClassifier': 0.272727272727,
'AdaBoostClassifier': 0.272727272727,
- 'LinearSVC': 0.727272727273,
+ 'LinearSVC': 0.818182,
'SVC': 0.36363636363636365,
'KNeighborsClassifier': 0.363636363636,
'RandomForestRegressor': 23.226508,
=====================================
q2_sample_classifier/tests/test_types_formats_transformers.py
=====================================
@@ -1,5 +1,5 @@
# ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
#
# Distributed under the terms of the Modified BSD License.
#
@@ -7,7 +7,7 @@
# ----------------------------------------------------------------------------
import os
import pandas as pd
-import pandas.util.testing as pdt
+import pandas.testing as pdt
import numpy as np
import shutil
import tempfile
@@ -28,7 +28,8 @@ from q2_sample_classifier import (
RegressorPredictions, ImportanceFormat, ImportanceDirectoryFormat,
Importance, PickleFormat, ProbabilitiesFormat,
ProbabilitiesDirectoryFormat, Probabilities, Classifier, Regressor,
- SampleEstimator, SampleEstimatorDirFmt)
+ SampleEstimator, SampleEstimatorDirFmt,
+ TrueTargetsDirectoryFormat, TrueTargets)
from q2_sample_classifier.visuals import (
_custom_palettes, _plot_heatmap_from_confusion_matrix,)
from q2_sample_classifier._format import JSONFormat
@@ -356,6 +357,21 @@ class TestSemanticTypes(SampleClassifierTestPluginBase):
for palette in _custom_palettes().keys():
_plot_heatmap_from_confusion_matrix(confused, palette)
+ # test TrueTarget
+ def test_TrueTargets_semantic_type_registration(self):
+ self.assertRegisteredSemanticType(TrueTargets)
+
+ # test TrueTargetDirectoryFormats
+ def test_TrueTargets_dir_fmt_validate_positive(self):
+ filepath = self.get_data_path('true_targets.tsv')
+ shutil.copy(filepath, self.temp_dir.name)
+ format = TrueTargetsDirectoryFormat(self.temp_dir.name, mode='r')
+ format.validate()
+
+ def test_TrueTarget_to_TrueTargets_dir_fmt_registration(self):
+ self.assertSemanticTypeRegisteredToFormat(
+ SampleData[TrueTargets], TrueTargetsDirectoryFormat)
+
class TestTypes(SampleClassifierTestPluginBase):
def test_sample_estimator_semantic_type_registration(self):
=====================================
q2_sample_classifier/tests/test_utilities.py
=====================================
@@ -1,5 +1,5 @@
# ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
#
# Distributed under the terms of the Modified BSD License.
#
@@ -12,7 +12,7 @@ from sklearn.svm import LinearSVC
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
-import pandas.util.testing as pdt
+import pandas.testing as pdt
import qiime2
@@ -121,8 +121,9 @@ class UtilitiesTests(SampleClassifierTestPluginBase):
# and this should not occur now, but theoretically should just concat and
# sort but not collapse if all column names are unique
def test_mean_feature_importance_do_not_collapse(self):
- imps = [pd.DataFrame([4, 3, 2, 1], columns=["importance0"]),
- pd.DataFrame([16, 15, 14, 13], columns=["importance1"])]
+ imps = [pd.DataFrame([4.0, 3.0, 2.0, 1.0], columns=["importance0"]),
+ pd.DataFrame([16.0, 15.0, 14.0, 13.0],
+ columns=["importance1"])]
exp = pd.concat(imps, axis=1)
pdt.assert_frame_equal(_mean_feature_importance(imps), exp)
=====================================
q2_sample_classifier/tests/test_visualization.py
=====================================
@@ -1,12 +1,12 @@
# ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
#
# Distributed under the terms of the Modified BSD License.
#
# The full license is in the file LICENSE, distributed with this software.
# ----------------------------------------------------------------------------
import pandas as pd
-import pandas.util.testing as pdt
+import pandas.testing as pdt
from os import mkdir, listdir
from os.path import join
import biom
=====================================
q2_sample_classifier/utilities.py
=====================================
@@ -1,5 +1,5 @@
# ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
#
# Distributed under the terms of the Modified BSD License.
#
@@ -18,7 +18,7 @@ from sklearn.ensemble import (RandomForestRegressor, RandomForestClassifier,
ExtraTreesClassifier, ExtraTreesRegressor,
AdaBoostClassifier, GradientBoostingClassifier,
AdaBoostRegressor, GradientBoostingRegressor)
-from sklearn.svm import LinearSVC, SVR, SVC
+from sklearn.svm import SVR, SVC
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
@@ -47,20 +47,6 @@ parameters = {
"min_weight_fraction_leaf": [0.0001, 0.001, 0.01]},
'bootstrap': {"bootstrap": [True, False]},
'criterion': {"criterion": ["gini", "entropy"]},
- 'linear_svm': {"C": [1, 0.5, 0.1, 0.9, 0.8],
- # should probably include penalty in grid search, but:
- # Unsupported set of arguments: The combination of
- # penalty='l1' and loss='hinge' is not supported
- # "penalty": ["l1", "l2"],
- "loss": ["hinge", "squared_hinge"],
- "tol": [0.00001, 0.0001, 0.001]
- # should probably include this in grid search, as
- # dual=False is preferred when samples>features. However:
- # Unsupported set of arguments: The combination of
- # penalty='l2' and loss='hinge' are not supported when
- # dual=False
- # "dual": [True, False]
- },
'svm': {"C": [1, 0.5, 0.1, 0.9, 0.8],
"tol": [0.00001, 0.0001, 0.001, 0.01],
"shrinking": [True, False]},
@@ -171,6 +157,9 @@ def _split_training_data(feature_data, targets, column, test_size=0.2,
except ValueError:
_stratification_error()
else:
+ warning_msg = _warn_zero_test_split()
+ warnings.warn(warning_msg, UserWarning)
+
X_train, X_test, y_train, y_test = (
feature_data, feature_data, targets, targets)
@@ -438,11 +427,14 @@ def _predict_and_plot(output_dir, y_test, y_pred, vmin=None, vmax=None,
else:
predictions = _linear_regress(y_test, y_pred)
predict_plot = _regplot_from_dataframe(y_test, y_pred)
+
if output_dir is not None:
predict_plot.get_figure().savefig(
join(output_dir, 'predictions.png'), bbox_inches='tight')
predict_plot.get_figure().savefig(
join(output_dir, 'predictions.pdf'), bbox_inches='tight')
+
+ plt.close('all')
return predictions, predict_plot
@@ -480,6 +472,14 @@ def _plot_accuracy(output_dir, predictions, truth, probabilities,
or numeric data inside two pd.Series
'''
truth = truth.to_series()
+
+ # check if test_size == 0.0 and all predictions are complete dataset
+ if (missing_samples == 'ignore') & (
+ predictions.shape[0] == truth.shape[0]):
+ warning_msg = _warn_zero_test_split()
+ else:
+ warning_msg = None
+
predictions, truth = _match_series_or_die(
predictions, truth, missing_samples)
@@ -499,7 +499,7 @@ def _plot_accuracy(output_dir, predictions, truth, probabilities,
# output to viz
_visualize(output_dir=output_dir, estimator=None, cm=predictions,
roc=probabilities, optimize_feature_selection=False,
- title=plot_title)
+ title=plot_title, warning_msg=warning_msg)
def sort_importances(importances, ascending=False):
@@ -523,6 +523,11 @@ def _summarize_estimator(output_dir, sample_estimator):
rfep.savefig(join(output_dir, 'rfe_plot.pdf'))
plt.close('all')
optimize_feature_selection = True
+ # generate rfe scores file
+ df = pd.DataFrame(data={'rfe_score': sample_estimator.rfe_scores},
+ index=sample_estimator.rfe_scores.index)
+ df.index.name = 'feature_count'
+ df.to_csv(join(output_dir, 'rfe_scores.tsv'), sep='\t', index=True)
# if the rfe_scores attribute does not exist, do nothing
except AttributeError:
optimize_feature_selection = False
@@ -533,7 +538,8 @@ def _summarize_estimator(output_dir, sample_estimator):
def _visualize(output_dir, estimator, cm, roc,
- optimize_feature_selection=True, title='results'):
+ optimize_feature_selection=True, title='results',
+ warning_msg=None):
pd.set_option('display.max_colwidth', None)
@@ -558,7 +564,8 @@ def _visualize(output_dir, estimator, cm, roc,
'result': result,
'predictions': cm,
'roc': roc,
- 'optimize_feature_selection': optimize_feature_selection})
+ 'optimize_feature_selection': optimize_feature_selection,
+ 'warning_msg': warning_msg})
def _visualize_knn(output_dir, params: pd.Series):
@@ -678,14 +685,11 @@ def predict_probabilities(estimator, test_set, index):
have their class probabilities predicted.
index: array-like of sample names
'''
- # most classifiers have a predict_proba attribute
- try:
- probs = pd.DataFrame(estimator.predict_proba(test_set),
- index=index, columns=estimator.classes_)
- # SVMs use the decision_function attribute
- except AttributeError:
- probs = pd.DataFrame(estimator.decision_function(test_set),
- index=index, columns=estimator.classes_)
+ # all used classifiers have a predict_proba attribute
+ # (approximated for SVCs)
+ probs = pd.DataFrame(estimator.predict_proba(test_set),
+ index=index, columns=estimator.classes_)
+
return probs
@@ -763,11 +767,13 @@ def _select_estimator(estimator, n_jobs, n_estimators, random_state=None):
estimator = GradientBoostingClassifier(
n_estimators=n_estimators, random_state=random_state)
elif estimator == 'LinearSVC':
- param_dist = parameters['linear_svm']
- estimator = LinearSVC(random_state=random_state)
+ param_dist = parameters['svm']
+ estimator = SVC(kernel='linear', random_state=random_state,
+ gamma='scale', probability=True)
elif estimator == 'SVC':
param_dist = parameters['svm']
- estimator = SVC(kernel='rbf', random_state=random_state, gamma='scale')
+ estimator = SVC(kernel='rbf', random_state=random_state,
+ gamma='scale', probability=True)
elif estimator == 'KNeighborsClassifier':
param_dist = parameters['kneighbors']
estimator = KNeighborsClassifier(algorithm='auto')
@@ -845,3 +851,11 @@ def _warn_feature_selection():
'the parameter settings requested. See documentation or try a '
'different estimator model.'))
warnings.warn(warning, UserWarning)
+
+
+def _warn_zero_test_split():
+ return 'Using test_size = 0.0, you are using your complete dataset for ' \
+ 'fitting the estimator. Hence, any returned model evaluations are ' \
+ 'based on that same training dataset and are not representative of ' \
+ 'your model\'s performance on a previously unseen dataset. Please ' \
+ 'consider evaluating this model on a separate dataset.'
=====================================
q2_sample_classifier/visuals.py
=====================================
@@ -1,5 +1,5 @@
# ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
#
# Distributed under the terms of the Modified BSD License.
#
@@ -10,7 +10,7 @@ from sklearn.metrics import (
mean_squared_error, confusion_matrix, accuracy_score, roc_curve, auc)
from sklearn.preprocessing import label_binarize
from itertools import cycle
-from scipy import interp
+from numpy import interp
import pandas as pd
import numpy as np
import seaborn as sns
@@ -72,7 +72,7 @@ def _regplot_from_dataframe(x, y, plot_style="whitegrid", arb=True,
color="grey"):
'''Seaborn regplot with true 1:1 ratio set by arb (bool).'''
sns.set_style(plot_style)
- reg = sns.regplot(x, y, color=color)
+ reg = sns.regplot(x=x, y=y, color=color)
plt.xlabel('True value')
plt.ylabel('Predicted value')
if arb is True:
=====================================
setup.py
=====================================
@@ -1,5 +1,5 @@
# ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
#
# Distributed under the terms of the Modified BSD License.
#
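
Note on the utilities.py change above: the new `_warn_zero_test_split()` helper only builds the message text, and the caller decides whether to raise it with `warnings.warn` or pass it on to the visualization as `warning_msg`. The following is a minimal, standard-library-only sketch of that pattern, not the plugin's code. The names `zero_split_message`, `split` and `run_split` are hypothetical, the message is shortened, and collecting the warning with `warnings.catch_warnings` is an illustration of how a caller could capture the text rather than something the commit does.

```python
import warnings


def zero_split_message():
    # Hypothetical stand-in for the helper in the diff: it only builds text.
    return ('Using test_size = 0.0, you are using your complete dataset for '
            'fitting the estimator.')


def split(samples, test_size):
    """Toy split: with test_size == 0.0, train and test are the same data."""
    if test_size == 0.0:
        # Emit the warning at the call site, as the diff does.
        warnings.warn(zero_split_message(), UserWarning)
        return samples, samples
    cut = int(round(len(samples) * (1 - test_size)))
    return samples[:cut], samples[cut:]


def run_split(samples, test_size):
    """Run split() and collect any UserWarning text for later display."""
    with warnings.catch_warnings(record=True) as caught:
        # Make sure repeated warnings are not suppressed by default filters.
        warnings.simplefilter('always')
        train, test = split(samples, test_size)
    messages = [str(w.message) for w in caught
                if issubclass(w.category, UserWarning)]
    return train, test, messages
```

Keeping the message in one helper means the console warning and the text shown in a report cannot drift apart.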
View it on GitLab: https://salsa.debian.org/med-team/q2-sample-classifier/-/commit/7c4b78faf33bc50e958843964685cf5d73d1374d