[med-svn] [Git][med-team/q2-sample-classifier][upstream] New upstream version 2021.8.0

Andreas Tille (@tille) gitlab@salsa.debian.org
Mon Sep 27 20:19:26 BST 2021



Andreas Tille pushed to branch upstream at Debian Med / q2-sample-classifier


Commits:
7c4b78fa by Andreas Tille at 2021-09-27T21:05:46+02:00
New upstream version 2021.8.0
- - - - -


26 changed files:

- + .github/workflows/ci.yml
- − .travis.yml
- LICENSE
- README.md
- − ci/recipe/conda_build_config.yaml
- ci/recipe/meta.yaml
- q2_sample_classifier/__init__.py
- q2_sample_classifier/_format.py
- q2_sample_classifier/_transformer.py
- q2_sample_classifier/_type.py
- q2_sample_classifier/_version.py
- q2_sample_classifier/assets/index.html
- q2_sample_classifier/classify.py
- q2_sample_classifier/plugin_setup.py
- q2_sample_classifier/tests/__init__.py
- + q2_sample_classifier/tests/data/true_targets.tsv
- q2_sample_classifier/tests/test_actions.py
- q2_sample_classifier/tests/test_base_class.py
- q2_sample_classifier/tests/test_classifier.py
- q2_sample_classifier/tests/test_estimators.py
- q2_sample_classifier/tests/test_types_formats_transformers.py
- q2_sample_classifier/tests/test_utilities.py
- q2_sample_classifier/tests/test_visualization.py
- q2_sample_classifier/utilities.py
- q2_sample_classifier/visuals.py
- setup.py


Changes:

=====================================
.github/workflows/ci.yml
=====================================
@@ -0,0 +1,55 @@
+# This file is automatically generated by busywork.qiime2.org and
+# template-repos - any manual edits made to this file will be erased when
+# busywork performs maintenance updates.
+
+name: ci
+
+on:
+  pull_request:
+  push:
+    branches:
+      - master
+
+jobs:
+  lint:
+    runs-on: ubuntu-latest
+    steps:
+    - name: checkout source
+      uses: actions/checkout@v2
+
+    - name: set up python 3.8
+      uses: actions/setup-python@v1
+      with:
+        python-version: 3.8
+
+    - name: install dependencies
+      run: python -m pip install --upgrade pip
+
+    - name: lint
+      run: |
+        pip install -q https://github.com/qiime2/q2lint/archive/master.zip
+        q2lint
+        pip install -q flake8
+        flake8
+
+  build-and-test:
+    needs: lint
+    strategy:
+      matrix:
+        os: [ubuntu-latest, macos-latest]
+    runs-on: ${{ matrix.os }}
+    steps:
+    - name: checkout source
+      uses: actions/checkout@v2
+      with:
+        fetch-depth: 0
+
+    - name: set up git repo for versioneer
+      run: git fetch --depth=1 origin +refs/tags/*:refs/tags/*
+
+    - uses: qiime2/action-library-packaging@alpha1
+      with:
+        package-name: q2-sample-classifier
+        build-target: dev
+        additional-tests: py.test --pyargs q2_sample_classifier
+        library-token: ${{ secrets.LIBRARY_TOKEN }}


=====================================
.travis.yml deleted
=====================================
@@ -1,25 +0,0 @@
-sudo: false
-language: python
-before_install:
-  - export MPLBACKEND='Agg'
-  - wget -q https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
-  - export MINICONDA_PREFIX="$HOME/miniconda"
-  - bash miniconda.sh -b -p $MINICONDA_PREFIX
-  - export PATH="$MINICONDA_PREFIX/bin:$PATH"
-  - conda config --set always_yes yes
-  - conda update -q conda
-  - conda info -a
-install:
-  - wget -q https://raw.githubusercontent.com/qiime2/environment-files/master/latest/staging/qiime2-latest-py36-linux-conda.yml
-  - conda env create -q -n test-env --file qiime2-latest-py36-linux-conda.yml
-  - source activate test-env
-  - conda install -q pytest-cov
-  - pip install flake8 coveralls
-  - pip install -q https://github.com/qiime2/q2lint/archive/master.zip
-  - make install
-script:
-  - make lint
-  - make test-cov
-after_success:
-  - coveralls
-


=====================================
LICENSE
=====================================
@@ -1,6 +1,6 @@
 BSD 3-Clause License
 
-Copyright (c) 2017-2020, QIIME 2 development team.
+Copyright (c) 2017-2021, QIIME 2 development team.
 All rights reserved.
 
 Redistribution and use in source and binary forms, with or without


=====================================
README.md
=====================================
@@ -1,35 +1,5 @@
 # q2-sample-classifier
 
-[![Build Status](https://travis-ci.org/qiime2/q2-sample-classifier.svg?branch=master)](https://travis-ci.org/qiime2/q2-sample-classifier) [![Coverage Status](https://coveralls.io/repos/github/qiime2/q2-sample-classifier/badge.svg?branch=master)](https://coveralls.io/github/qiime2/q2-sample-classifier?branch=master) [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.1468879.svg)](https://doi.org/10.5281/zenodo.1468879) [![status](http://joss.theoj.org/papers/d828a4ecf73eb6a147f8634e9054eeee/status.svg)](http://joss.theoj.org/papers/d828a4ecf73eb6a147f8634e9054eeee)
+![](https://github.com/qiime2/q2-sample-classifier/workflows/ci/badge.svg)
 
-
-QIIME 2 plugin for machine learning prediction of sample data.
-
-Microbiome studies often aim to predict outcomes or differentiate samples based on their microbial compositions, tasks that can be efficiently performed by supervised learning methods. The q2-sample-classifier plugin makes these methods more accessible, reproducible, and interpretable to a broad audience of microbiologists, clinicians, and others who wish to utilize supervised learning methods for predicting sample characteristics based on microbiome composition or other "omics" data.
-
-## Installation
-
-Follow the QIIME 2 core distribution installation instructions at https://qiime2.org/ to install q2-sample-classifier as part of the QIIME 2 analysis platform.
-
-To test deployment, run:
-```
-pytest --disable-pytest-warnings --pyargs q2_sample_classifier
-```
-
-## Example usage
-
-This is a QIIME 2 plugin. For details on QIIME 2 and tutorials demonstrating how to use this plugin, see the [QIIME 2 documentation](https://qiime2.org/). Tutorials for this plugin can be found [here](https://docs.qiime2.org/2018.8/tutorials/sample-classifier/).
-
-Not sure which learning model to use? A good starting point is [this flowchart](http://scikit-learn.org/dev/tutorial/machine_learning_map/index.html). Most of the classification and regression models shown in that chart (and a few extras) are implemented in q2-sample-classifier.
-
-## API documentation
-
-API documentation can be found [here](https://docs.qiime2.org/2018.8/plugins/available/sample-classifier/).
-
-## Help
-
-For user support, see the [QIIME 2 Forum](https://forum.qiime2.org). Bug reports and feature requests can also be made [via a new issue](https://github.com/qiime2/q2-sample-classifier/issues/new/choose).
-
-## Contributing
-
-QIIME 2 is an open-source project, and we are very interested in contributions from the community. Please see the [contributing guidelines](https://github.com/qiime2/q2-sample-classifier/blob/master/.github/CONTRIBUTING.md) if you would like to get involved.
+This is a QIIME 2 plugin. For details on QIIME 2, see https://qiime2.org..
\ No newline at end of file


=====================================
ci/recipe/conda_build_config.yaml deleted
=====================================
@@ -1,2 +0,0 @@
-python:
-  - 3.6


=====================================
ci/recipe/meta.yaml
=====================================
@@ -1,6 +1,5 @@
 {% set data = load_setup_py_data() %}
 {% set version = data.get('version') or 'placehold' %}
-{% set release = '.'.join(version.split('.')[:2]) %}
 
 package:
   name: q2-sample-classifier
@@ -19,19 +18,27 @@ requirements:
 
   run:
     - python {{ python }}
-    - pandas >=1
+    - pandas {{ pandas }}
     - scipy
+    - numpy
     - joblib
     - scikit-learn >=0.22.1
     - scikit-bio
     - seaborn >=0.8
     - fastcluster
-    - qiime2 {{ release }}.*
-    - q2-types {{ release }}.*
-    - q2templates {{ release }}.*
-    - q2-feature-table {{ release }}.*
+    - qiime2 {{ qiime2_epoch }}.*
+    - q2-types {{ qiime2_epoch }}.*
+    - q2templates {{ qiime2_epoch }}.*
+    - q2-feature-table {{ qiime2_epoch }}.*
 
 test:
+  requires:
+    - qiime2 >={{ qiime2 }}
+    - q2-types >={{ q2_types }}
+    - q2templates >={{ q2templates }}
+    - q2-feature-table >={{ q2_feature_table }}
+    - pytest
+
   imports:
     - q2_sample_classifier
     - qiime2.plugins.sample_classifier


=====================================
q2_sample_classifier/__init__.py
=====================================
@@ -1,5 +1,5 @@
 # ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
 #
 # Distributed under the terms of the Modified BSD License.
 #
@@ -10,10 +10,11 @@ from ._format import (
     BooleanSeriesFormat, BooleanSeriesDirectoryFormat,
     PredictionsFormat, PredictionsDirectoryFormat, ImportanceFormat,
     ImportanceDirectoryFormat, SampleEstimatorDirFmt, PickleFormat,
-    ProbabilitiesFormat, ProbabilitiesDirectoryFormat)
+    ProbabilitiesFormat, ProbabilitiesDirectoryFormat,
+    TrueTargetsDirectoryFormat)
 from ._type import (BooleanSeries, ClassifierPredictions, RegressorPredictions,
                     Importance, SampleEstimator, Classifier, Regressor,
-                    Probabilities)
+                    Probabilities, TrueTargets)
 from ._version import get_versions
 
 
@@ -26,4 +27,5 @@ __all__ = ['BooleanSeriesFormat', 'BooleanSeriesDirectoryFormat',
            'SampleEstimatorDirFmt', 'PickleFormat', 'BooleanSeries',
            'ClassifierPredictions', 'RegressorPredictions', 'Importance',
            'Classifier', 'Regressor', 'SampleEstimator', 'Probabilities',
-           'ProbabilitiesFormat', 'ProbabilitiesDirectoryFormat']
+           'ProbabilitiesFormat', 'ProbabilitiesDirectoryFormat',
+           'TrueTargets', 'TrueTargetsDirectoryFormat']


=====================================
q2_sample_classifier/_format.py
=====================================
@@ -1,5 +1,5 @@
 # ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
 #
 # Distributed under the terms of the Modified BSD License.
 #
@@ -176,3 +176,8 @@ class ProbabilitiesFormat(_MultiColumnNumericFormat):
 ProbabilitiesDirectoryFormat = model.SingleFileDirectoryFormat(
     'ProbabilitiesDirectoryFormat', 'class_probabilities.tsv',
     ProbabilitiesFormat)
+
+
+TrueTargetsDirectoryFormat = model.SingleFileDirectoryFormat(
+    'TrueTargetsDirectoryFormat', 'true_targets.tsv',
+    PredictionsFormat)
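
The new TrueTargetsDirectoryFormat wraps a single true_targets.tsv (one sample-id column plus one target column, as in the test fixture added further down) and reuses PredictionsFormat, so SampleData[TrueTargets] artifacts can be read back through the existing pandas transformers. A minimal sketch, with a hypothetical artifact file name:

    import pandas as pd
    import qiime2

    # 'training_targets.qza' is a placeholder for any SampleData[TrueTargets]
    # artifact produced by split_table or the classify_samples pipeline
    targets = qiime2.Artifact.load('training_targets.qza')
    y_train = targets.view(pd.Series)  # index: sample IDs, values: true targets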


=====================================
q2_sample_classifier/_transformer.py
=====================================
@@ -1,5 +1,5 @@
 # ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
 #
 # Distributed under the terms of the Modified BSD License.
 #


=====================================
q2_sample_classifier/_type.py
=====================================
@@ -1,5 +1,5 @@
 # ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
 #
 # Distributed under the terms of the Modified BSD License.
 #
@@ -26,3 +26,5 @@ Importance = SemanticType(
     'Importance', variant_of=FeatureData.field['type'])
 Probabilities = SemanticType(
     'Probabilities', variant_of=SampleData.field['type'])
+TrueTargets = SemanticType(
+    'TrueTargets', variant_of=SampleData.field['type'])


=====================================
q2_sample_classifier/_version.py
=====================================
@@ -23,9 +23,9 @@ def get_keywords():
     # setup.py/versioneer.py will grep for the variable names, so they must
     # each be defined on a line of their own. _version.py will just call
     # get_keywords().
-    git_refnames = " (HEAD -> master, tag: 2020.11.1)"
-    git_full = "392e5454b7cb6a91d43968d62178fb6a216e193f"
-    git_date = "2020-12-05 20:44:51 +0000"
+    git_refnames = " (tag: 2021.8.0)"
+    git_full = "916eb0799fa2c95a04766b44c22335c6a097ce13"
+    git_date = "2021-09-09 18:35:33 +0000"
     keywords = {"refnames": git_refnames, "full": git_full, "date": git_date}
     return keywords
 


=====================================
q2_sample_classifier/assets/index.html
=====================================
@@ -6,6 +6,25 @@
 
 {% block content %}
 
+{% if warning_msg %}
+<div class="panel-group" id="warnings" role="tablist" aria-multiselectable="true">
+  <div class="panel panel-warning">
+    <div class="panel-heading" role="tab" id="warnings-heading">
+      <h4 class="panel-title">
+        <a role="button" data-toggle="collapse" data-parent="#warnings" href="#warnings-list" aria-expanded="true" aria-controls="warnings-list">
+          Warnings (click here to collapse/expand):
+        </a>
+      </h4>
+    </div>
+    <div id="warnings-list" class="panel-collapse collapse in" role="tabpanel" aria-labelledby="warnings-heading">
+      <div class="alert alert-warning col-md-12">
+        <p><strong>{{ warning_msg }}</strong></p>
+      </div>
+    </div>
+  </div>
+</div>
+{% endif %}
+
 <div class="row">
   {% if predictions %}
   <h1>Model Accuracy</h1>
@@ -65,6 +84,9 @@
         <br>
         <p>Download as PDF</p>
       </a>
+      <a href="rfe_scores.tsv">
+        <p>Download as TSV</p>
+      </a>
     </div>
     {% endif %}
     {% if result %}


=====================================
q2_sample_classifier/classify.py
=====================================
@@ -1,5 +1,5 @@
 # ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
 #
 # Distributed under the terms of the Modified BSD License.
 #
@@ -163,8 +163,10 @@ def classify_samples(ctx,
     confusion = ctx.get_action('sample_classifier', 'confusion_matrix')
     heat = ctx.get_action('sample_classifier', 'heatmap')
 
-    X_train, X_test = split(table, metadata, test_size, random_state,
-                            stratify=True, missing_samples=missing_samples)
+    X_train, X_test, y_train, y_test = split(table, metadata, test_size,
+                                             random_state,
+                                             stratify=True,
+                                             missing_samples=missing_samples)
 
     sample_estimator, importance = fit(
         X_train, metadata, step, cv, random_state, n_jobs, n_estimators,
@@ -183,7 +185,7 @@ def classify_samples(ctx,
                        group_samples=True, missing_samples=missing_samples)
 
     return (sample_estimator, importance, predictions, summary,
-            accuracy_results, probabilities, _heatmap)
+            accuracy_results, probabilities, _heatmap, y_train, y_test)
 
 
 def regress_samples(ctx,
@@ -207,8 +209,10 @@ def regress_samples(ctx,
     summarize_estimator = ctx.get_action('sample_classifier', 'summarize')
     scatter = ctx.get_action('sample_classifier', 'scatterplot')
 
-    X_train, X_test = split(table, metadata, test_size, random_state,
-                            stratify, missing_samples=missing_samples)
+    X_train, X_test, y_train, y_test = split(table, metadata, test_size,
+                                             random_state,
+                                             stratify,
+                                             missing_samples=missing_samples)
 
     sample_estimator, importance = fit(
         X_train, metadata, step, cv, random_state, n_jobs, n_estimators,
@@ -303,15 +307,12 @@ def split_table(table: biom.Table, metadata: qiime2.MetadataColumn,
                 test_size: float = defaults['test_size'],
                 random_state: int = None, stratify: str = True,
                 missing_samples: str = defaults['missing_samples']
-                ) -> (biom.Table, biom.Table):
+                ) -> (biom.Table, biom.Table, pd.Series, pd.Series):
     column = metadata.name
     X_train, X_test, y_train, y_test = _prepare_training_data(
         table, metadata, column, test_size, random_state, load_data=True,
         stratify=stratify, missing_samples=missing_samples)
-    # TODO: we can consider returning the metadata (y_train, y_test) if a
-    # SampleData[Metadata] type comes into existence. For now we will just
-    # throw this out.
-    return X_train, X_test
+    return X_train, X_test, y_train, y_test
 
 
 def regress_samples_ncv(
@@ -366,6 +367,7 @@ def confusion_matrix(output_dir: str,
                      missing_samples: str = defaults['missing_samples'],
                      vmin: int = 'auto', vmax: int = 'auto',
                      palette: str = defaults['palette']) -> None:
+
     if vmin == 'auto':
         vmin = None
     if vmax == 'auto':
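
With this change, split_table (and, further down, the classify_samples pipeline) also returns the true target values of the train/test split as SampleData[TrueTargets] outputs instead of discarding them. A usage sketch via the Artifact API; the file names and the metadata column name are placeholders:

    import pandas as pd
    import qiime2
    from qiime2.plugins import sample_classifier

    table = qiime2.Artifact.load('table.qza')  # any FeatureTable artifact
    column = qiime2.Metadata.load('metadata.tsv').get_column('delivery')

    res = sample_classifier.actions.split_table(
        table=table, metadata=column, test_size=0.2, random_state=123,
        missing_samples='ignore')

    # new outputs alongside training_table and test_table
    y_train = res.training_targets.view(pd.Series)
    y_test = res.test_targets.view(pd.Series)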


=====================================
q2_sample_classifier/plugin_setup.py
=====================================
@@ -1,5 +1,5 @@
 # ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
 #
 # Distributed under the terms of the Modified BSD License.
 #
@@ -33,11 +33,13 @@ from ._format import (SampleEstimatorDirFmt,
                       PredictionsFormat,
                       PredictionsDirectoryFormat,
                       ProbabilitiesFormat,
-                      ProbabilitiesDirectoryFormat)
+                      ProbabilitiesDirectoryFormat,
+                      TrueTargetsDirectoryFormat)
 
 from ._type import (ClassifierPredictions, RegressorPredictions,
                     SampleEstimator, BooleanSeries, Importance,
-                    Classifier, Regressor, Probabilities)
+                    Classifier, Regressor, Probabilities,
+                    TrueTargets)
 import q2_sample_classifier
 
 citations = Citations.load('citations.bib', package='q2_sample_classifier')
@@ -102,7 +104,7 @@ parameters = {
         'missing_samples': Str % Choices(['error', 'ignore'])},
     'splitter': {
         'test_size': Float % Range(0.0, 1.0, inclusive_end=False,
-                                   inclusive_start=False)},
+                                   inclusive_start=True)},
     'rfe': {
         'step': Float % Range(0.0, 1.0, inclusive_end=False,
                               inclusive_start=False),
@@ -226,15 +228,21 @@ plugin.pipelines.register_function(
              ('feature_importance', FeatureData[Importance]),
              ('predictions', SampleData[ClassifierPredictions])
              ] + pipeline_outputs + [
-                ('probabilities', SampleData[Probabilities]),
-                ('heatmap', Visualization)],
+        ('probabilities', SampleData[Probabilities]),
+        ('heatmap', Visualization),
+        ('training_targets', SampleData[TrueTargets]),
+        ('test_targets', SampleData[TrueTargets])],
     input_descriptions={'table': input_descriptions['table']},
     parameter_descriptions=classifier_pipeline_parameter_descriptions,
     output_descriptions={
         **pipeline_output_descriptions,
         'probabilities': input_descriptions['probabilities'],
         'heatmap': 'A heatmap of the top 50 most important features from the '
-                   'table.'},
+                   'table.',
+        'training_targets': 'Series containing true target values of '
+        'train samples',
+        'test_targets': 'Series containing true target values '
+        'of test samples'},
     name='Train and test a cross-validated supervised learning classifier.',
     description=description.format(
         'categorical', 'supervised learning classifier')
@@ -258,7 +266,7 @@ plugin.pipelines.register_function(
         'metadata': 'Categorical metadata column to use as prediction target.',
         'k': 'Number of nearest neighbors',
         'palette': 'The color palette to use for plotting.',
-        },
+    },
     output_descriptions={
         'predictions': 'leave one out predictions for each sample',
         'accuracy_results': 'Accuracy results visualization.',
@@ -486,7 +494,9 @@ plugin.methods.register_function(
         'metadata': MetadataColumn[Numeric | Categorical],
         **parameters['regressor']},
     outputs=[('training_table', FeatureTable[T]),
-             ('test_table', FeatureTable[T])],
+             ('test_table', FeatureTable[T]),
+             ('training_targets', SampleData[TrueTargets]),
+             ('test_targets', SampleData[TrueTargets])],
     input_descriptions={'table': 'Feature table containing all features that '
                         'should be used for target prediction.'},
     parameter_descriptions={
@@ -497,7 +507,11 @@ plugin.methods.register_function(
         'metadata': 'Numeric metadata column to use as prediction target.'},
     output_descriptions={
         'training_table': 'Feature table containing training samples',
-        'test_table': 'Feature table containing test samples'},
+        'test_table': 'Feature table containing test samples',
+        'training_targets': 'Series containing true target values of '
+        'train samples',
+        'test_targets': 'Series containing true target values of '
+        'test samples'},
     name='Split a feature table into training and testing sets.',
     description=(
         'Split a feature table into training and testing sets. By default '
@@ -623,7 +637,7 @@ plugin.pipelines.register_function(
 # Registrations
 plugin.register_semantic_types(
     SampleEstimator, BooleanSeries, Importance, ClassifierPredictions,
-    RegressorPredictions, Classifier, Regressor, Probabilities)
+    RegressorPredictions, Classifier, Regressor, Probabilities, TrueTargets)
 plugin.register_semantic_type_to_format(
     SampleEstimator[Classifier],
     artifact_format=SampleEstimatorDirFmt)
@@ -645,9 +659,13 @@ plugin.register_semantic_type_to_format(
 plugin.register_semantic_type_to_format(
     SampleData[Probabilities],
     artifact_format=ProbabilitiesDirectoryFormat)
+plugin.register_semantic_type_to_format(
+    SampleData[TrueTargets],
+    artifact_format=TrueTargetsDirectoryFormat)
 plugin.register_formats(
     SampleEstimatorDirFmt, BooleanSeriesFormat, BooleanSeriesDirectoryFormat,
     ImportanceFormat, ImportanceDirectoryFormat, PredictionsFormat,
     PredictionsDirectoryFormat, ProbabilitiesFormat,
-    ProbabilitiesDirectoryFormat)
+    ProbabilitiesDirectoryFormat,
+    TrueTargetsDirectoryFormat)
 importlib.import_module('q2_sample_classifier._transformer')


=====================================
q2_sample_classifier/tests/__init__.py
=====================================
@@ -1,5 +1,5 @@
 # ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
 #
 # Distributed under the terms of the Modified BSD License.
 #


=====================================
q2_sample_classifier/tests/data/true_targets.tsv
=====================================
@@ -0,0 +1,9 @@
+sample-id	delivery
+10249.C041.08SS	Vaginal
+10249.C055.08SS	Cesarean
+10249.C027.07SS	Vaginal
+10249.C042.07SS	Vaginal
+10249.C005.08SS	Cesarean
+10249.C056.09SS	Cesarean
+10249.C035.07SD	Vaginal
+10249.C001.10SS	Vaginal


=====================================
q2_sample_classifier/tests/test_actions.py
=====================================
@@ -1,11 +1,15 @@
 # ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
 #
 # Distributed under the terms of the Modified BSD License.
 #
 # The full license is in the file LICENSE, distributed with this software.
 # ----------------------------------------------------------------------------
+
+import os
+
 import pandas as pd
+import pandas.testing as pdt
 import numpy as np
 import biom
 
@@ -43,11 +47,21 @@ class NowLetsTestTheActions(SampleClassifierTestPluginBase):
         md2.index.name = 'SampleID'
         self.md2 = qiime2.Metadata(md2)
 
-    # let's make sure the correct transformers are in place! See issue 114
-    # if this runs without error, that's good enough for me. We already
-    # validate the function above.
+    # let's make sure the function runs w/o errors and that the correct
+    # transformers are in place (see issue 114)
     def test_action_split_table(self):
-        sample_classifier.actions.split_table(self.tab, self.md, test_size=0.5)
+        res = sample_classifier.actions.split_table(
+            self.tab, self.md, test_size=0.5)
+        y_train = res.training_targets.view(pd.Series)
+        y_test = res.test_targets.view(pd.Series)
+
+        # test whether extracted target is correct
+        self.assertEqual(y_train.name, 'bugs')
+
+        # test if complete target column is covered
+        y_all = y_train.append(y_test).sort_index()
+        y_all.index.name = 'SampleID'
+        pdt.assert_series_equal(y_all, self.md._series)
 
     def test_metatable(self):
         exp = biom.Table(
@@ -155,6 +169,15 @@ class TestSummarize(SampleEstimatorTestBase):
     def test_summary_with_rfecv(self):
         summarize(self.temp_dir.name, self.pipeline)
 
+        self.assertTrue('rfe_plot.pdf' in os.listdir(self.temp_dir.name))
+        self.assertTrue('rfe_plot.png' in os.listdir(self.temp_dir.name))
+        self.assertTrue('rfe_scores.tsv' in os.listdir(self.temp_dir.name))
+
     def test_summary_without_rfecv(self):
+        # nuke the rfe_scores to test the other branch of _summarize_estimator
         del self.pipeline.rfe_scores
         summarize(self.temp_dir.name, self.pipeline)
+
+        self.assertFalse('rfe_plot.pdf' in os.listdir(self.temp_dir.name))
+        self.assertFalse('rfe_plot.png' in os.listdir(self.temp_dir.name))
+        self.assertFalse('rfe_scores.tsv' in os.listdir(self.temp_dir.name))


=====================================
q2_sample_classifier/tests/test_base_class.py
=====================================
@@ -1,5 +1,5 @@
 # ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
 #
 # Distributed under the terms of the Modified BSD License.
 #


=====================================
q2_sample_classifier/tests/test_classifier.py
=====================================
@@ -1,5 +1,5 @@
 # ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
 #
 # Distributed under the terms of the Modified BSD License.
 #
@@ -10,7 +10,7 @@ import pandas as pd
 import numpy as np
 from sklearn.ensemble import RandomForestClassifier
 from sklearn.feature_selection import RFECV
-import pandas.util.testing as pdt
+import pandas.testing as pdt
 import biom
 
 import qiime2


=====================================
q2_sample_classifier/tests/test_estimators.py
=====================================
@@ -1,5 +1,5 @@
 # ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
 #
 # Distributed under the terms of the Modified BSD License.
 #
@@ -7,7 +7,7 @@
 # ----------------------------------------------------------------------------
 import os
 import pandas as pd
-import pandas.util.testing as pdt
+import pandas.testing as pdt
 import biom
 import shutil
 import json
@@ -113,7 +113,7 @@ class EstimatorsTests(SampleClassifierTestPluginBase):
         self.mdc_ecam_fp = _load_nmc('ecam_map_maturity.txt', 'month')
         self.exp_imp = pd.read_csv(
             self.get_data_path('importance.tsv'), sep='\t', header=0,
-            index_col=0)
+            index_col=0, names=['feature', 'importance'])
         self.exp_pred = pd.read_csv(
             self.get_data_path('predictions.tsv'), sep='\t', header=0,
             index_col=0, squeeze=True)
@@ -152,7 +152,7 @@ class EstimatorsTests(SampleClassifierTestPluginBase):
             [1, 0, 4, 4],
             [4, 4, 0, 1],
             [4, 4, 1, 0],
-            ], ids=sample_ids)
+        ], ids=sample_ids)
 
         dm = qiime2.Artifact.import_data('DistanceMatrix', distance_matrix)
         categories = pd.Series(('skinny', 'skinny', 'fat', 'fat'),
@@ -182,7 +182,7 @@ class EstimatorsTests(SampleClassifierTestPluginBase):
             [2, 0, 1, 1],
             [3, 1, 0, 1],
             [3, 1, 1, 0],
-            ], ids=sample_ids)
+        ], ids=sample_ids)
 
         dm = qiime2.Artifact.import_data('DistanceMatrix', distance_matrix)
         categories = pd.Series(('fat', 'skinny', 'skinny', 'skinny'),
@@ -212,7 +212,7 @@ class EstimatorsTests(SampleClassifierTestPluginBase):
             [2, 0, 3, 4],
             [1, 3, 0, 3],
             [5, 4, 3, 0],
-            ], ids=sample_ids)
+        ], ids=sample_ids)
 
         dm = qiime2.Artifact.import_data('DistanceMatrix', distance_matrix)
         categories = pd.Series(('fat', 'skinny', 'skinny', 'skinny'),
@@ -251,6 +251,20 @@ class EstimatorsTests(SampleClassifierTestPluginBase):
                 msg='Accuracy of %s classifier was %f, but expected %f' % (
                     classifier, accuracy, seeded_results[classifier]))
 
+    # test if training classifier with pipeline classify_samples raises
+    # warning when test_size = 0.0
+    def test_classify_samples_w_all_train_set(self):
+        with self.assertWarnsRegex(Warning, "not representative of "
+                                   "your model's performance"):
+            table_fp = self.get_data_path('chardonnay.table.qza')
+            table = qiime2.Artifact.load(table_fp)
+            sample_classifier.actions.classify_samples(
+                table=table, metadata=self.mdc_chard_fp,
+                test_size=0.0, cv=1, n_estimators=10, n_jobs=1,
+                estimator='RandomForestClassifier', random_state=123,
+                parameter_tuning=False, optimize_feature_selection=False,
+                missing_samples='ignore')
+
     # test that the plugin methods/visualizers work
     def test_regress_samples_ncv(self):
         y_pred, importances = regress_samples_ncv(
@@ -289,7 +303,8 @@ class EstimatorsTests(SampleClassifierTestPluginBase):
             name='prediction')
         exp_importances = pd.DataFrame(
             [0.595111111111111, 0.23155555555555551, 0.17333333333333334],
-            index=pd.Index(['o3', 'o1', 'o2']), columns=['importance'])
+            index=pd.Index(['o3', 'o1', 'o2'], name='feature'),
+            columns=['importance'])
         exp_probabilities = pd.DataFrame(
             [[0.5, 0.5], [0., 1.], [0., 1.], [0.5, 0.5], [0.5, 0.5],
              [0.5, 0.5], [0.5, 0.5], [0., 1.], [1., 0.], [1., 0.]],
@@ -390,28 +405,31 @@ class EstimatorsTests(SampleClassifierTestPluginBase):
             missing_samples='ignore')
 
     def test_split_table_no_rounding_error(self):
-        X_train, X_test = split_table(
+        X_train, X_test, y_train, y_test = split_table(
             self.table_chard_fp, self.mdc_chard_fp, test_size=0.5,
             random_state=123, stratify=True, missing_samples='ignore')
         self.assertEqual(len(X_train.ids()) + len(X_test.ids()), 21)
+        self.assertEqual(y_train.shape[0] + y_test.shape[0], 21)
 
     def test_split_table_no_split(self):
-        X_train, X_test = split_table(
+        X_train, X_test, y_train, y_test = split_table(
             self.table_chard_fp, self.mdc_chard_fp, test_size=0.0,
             random_state=123, stratify=True, missing_samples='ignore')
         self.assertEqual(len(X_train.ids()), 21)
+        self.assertEqual(y_train.shape[0], 21)
 
     def test_split_table_invalid_test_size(self):
         with self.assertRaisesRegex(ValueError, "at least two samples"):
-            X_train, X_test = split_table(
+            X_train, X_test, y_train, y_test = split_table(
                 self.table_chard_fp, self.mdc_chard_fp, test_size=1.0,
                 random_state=123, stratify=True, missing_samples='ignore')
 
     def test_split_table_percnorm(self):
-        X_train, X_test = split_table(
+        X_train, X_test, y_train, y_test = split_table(
             self.table_percnorm, self.mdc_percnorm, test_size=0.5,
             random_state=123, stratify=True, missing_samples='ignore')
         self.assertEqual(len(X_train.ids()) + len(X_test.ids()), 4)
+        self.assertEqual(y_train.shape[0] + y_test.shape[0], 4)
 
     # test experimental functions
     def test_detect_outliers(self):
@@ -450,6 +468,7 @@ class EstimatorsTests(SampleClassifierTestPluginBase):
             sample_ids = pred.index.intersection(exp.index)
             pred = pred.loc[sample_ids]
             exp = exp.loc[sample_ids]
+            # verify predictions:
             # test that expected number of correct results is achieved (these
             # are mostly quite high as we would expect (total n=21))
             correct_results = np.sum(pred == exp)
@@ -458,6 +477,17 @@ class EstimatorsTests(SampleClassifierTestPluginBase):
                 msg='Accuracy of %s classifier was %f, but expected %f' % (
                     classifier, correct_results,
                     seeded_predict_results[classifier]))
+            # verify probabilities
+            # test whether all are in correct range (0 to 1)
+            ls_pred_classes = prob.columns.tolist()
+            ls_correct_range = [col for col in ls_pred_classes if
+                                prob[col].between(
+                                    0, 1, inclusive=True).all()]
+            self.assertEqual(len(ls_correct_range), prob.shape[1],
+                             msg='Predicted probabilities of class {}'
+                             'are not in range [0,1]'.format(
+                [col for col in ls_pred_classes
+                 if col not in ls_correct_range]))
 
     def test_predict_regressions(self):
         for regressor in ['RandomForestRegressor', 'ExtraTreesRegressor',
@@ -526,7 +556,7 @@ seeded_results = {
     'ExtraTreesClassifier': 0.454545454545,
     'GradientBoostingClassifier': 0.272727272727,
     'AdaBoostClassifier': 0.272727272727,
-    'LinearSVC': 0.727272727273,
+    'LinearSVC': 0.818182,
     'SVC': 0.36363636363636365,
     'KNeighborsClassifier': 0.363636363636,
     'RandomForestRegressor': 23.226508,
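
The new test_classify_samples_w_all_train_set above exercises a behavioral change in utilities.py (further down in this diff): requesting test_size=0.0 now trains and evaluates on the same samples and emits a UserWarning instead of silently reporting training-set accuracy. A short sketch of what callers can expect, reusing the placeholder table and metadata column from the split_table example above:

    import warnings

    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')
        sample_classifier.actions.classify_samples(
            table=table, metadata=column, test_size=0.0, cv=1,
            n_estimators=10, estimator='RandomForestClassifier',
            random_state=123, missing_samples='ignore')

    assert any('not representative' in str(w.message) for w in caught)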


=====================================
q2_sample_classifier/tests/test_types_formats_transformers.py
=====================================
@@ -1,5 +1,5 @@
 # ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
 #
 # Distributed under the terms of the Modified BSD License.
 #
@@ -7,7 +7,7 @@
 # ----------------------------------------------------------------------------
 import os
 import pandas as pd
-import pandas.util.testing as pdt
+import pandas.testing as pdt
 import numpy as np
 import shutil
 import tempfile
@@ -28,7 +28,8 @@ from q2_sample_classifier import (
     RegressorPredictions, ImportanceFormat, ImportanceDirectoryFormat,
     Importance, PickleFormat, ProbabilitiesFormat,
     ProbabilitiesDirectoryFormat, Probabilities, Classifier, Regressor,
-    SampleEstimator, SampleEstimatorDirFmt)
+    SampleEstimator, SampleEstimatorDirFmt,
+    TrueTargetsDirectoryFormat, TrueTargets)
 from q2_sample_classifier.visuals import (
     _custom_palettes, _plot_heatmap_from_confusion_matrix,)
 from q2_sample_classifier._format import JSONFormat
@@ -356,6 +357,21 @@ class TestSemanticTypes(SampleClassifierTestPluginBase):
         for palette in _custom_palettes().keys():
             _plot_heatmap_from_confusion_matrix(confused, palette)
 
+    # test TrueTarget
+    def test_TrueTargets_semantic_type_registration(self):
+        self.assertRegisteredSemanticType(TrueTargets)
+
+    # test TrueTargetDirectoryFormats
+    def test_TrueTargets_dir_fmt_validate_positive(self):
+        filepath = self.get_data_path('true_targets.tsv')
+        shutil.copy(filepath, self.temp_dir.name)
+        format = TrueTargetsDirectoryFormat(self.temp_dir.name, mode='r')
+        format.validate()
+
+    def test_TrueTarget_to_TrueTargets_dir_fmt_registration(self):
+        self.assertSemanticTypeRegisteredToFormat(
+            SampleData[TrueTargets], TrueTargetsDirectoryFormat)
+
 
 class TestTypes(SampleClassifierTestPluginBase):
     def test_sample_estimator_semantic_type_registration(self):


=====================================
q2_sample_classifier/tests/test_utilities.py
=====================================
@@ -1,5 +1,5 @@
 # ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
 #
 # Distributed under the terms of the Modified BSD License.
 #
@@ -12,7 +12,7 @@ from sklearn.svm import LinearSVC
 from sklearn.feature_extraction import DictVectorizer
 from sklearn.pipeline import Pipeline
 from sklearn.ensemble import RandomForestClassifier
-import pandas.util.testing as pdt
+import pandas.testing as pdt
 
 import qiime2
 
@@ -121,8 +121,9 @@ class UtilitiesTests(SampleClassifierTestPluginBase):
     # and this should not occur now, but theoretically should just concat and
     # sort but not collapse if all column names are unique
     def test_mean_feature_importance_do_not_collapse(self):
-        imps = [pd.DataFrame([4, 3, 2, 1], columns=["importance0"]),
-                pd.DataFrame([16, 15, 14, 13], columns=["importance1"])]
+        imps = [pd.DataFrame([4.0, 3.0, 2.0, 1.0], columns=["importance0"]),
+                pd.DataFrame([16.0, 15.0, 14.0, 13.0],
+                columns=["importance1"])]
         exp = pd.concat(imps, axis=1)
         pdt.assert_frame_equal(_mean_feature_importance(imps), exp)
 


=====================================
q2_sample_classifier/tests/test_visualization.py
=====================================
@@ -1,12 +1,12 @@
 # ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
 #
 # Distributed under the terms of the Modified BSD License.
 #
 # The full license is in the file LICENSE, distributed with this software.
 # ----------------------------------------------------------------------------
 import pandas as pd
-import pandas.util.testing as pdt
+import pandas.testing as pdt
 from os import mkdir, listdir
 from os.path import join
 import biom


=====================================
q2_sample_classifier/utilities.py
=====================================
@@ -1,5 +1,5 @@
 # ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
 #
 # Distributed under the terms of the Modified BSD License.
 #
@@ -18,7 +18,7 @@ from sklearn.ensemble import (RandomForestRegressor, RandomForestClassifier,
                               ExtraTreesClassifier, ExtraTreesRegressor,
                               AdaBoostClassifier, GradientBoostingClassifier,
                               AdaBoostRegressor, GradientBoostingRegressor)
-from sklearn.svm import LinearSVC, SVR, SVC
+from sklearn.svm import SVR, SVC
 from sklearn.linear_model import Ridge, Lasso, ElasticNet
 from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
 from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
@@ -47,20 +47,6 @@ parameters = {
                  "min_weight_fraction_leaf": [0.0001, 0.001, 0.01]},
     'bootstrap': {"bootstrap": [True, False]},
     'criterion': {"criterion": ["gini", "entropy"]},
-    'linear_svm': {"C": [1, 0.5, 0.1, 0.9, 0.8],
-                   # should probably include penalty in grid search, but:
-                   # Unsupported set of arguments: The combination of
-                   # penalty='l1' and loss='hinge' is not supported
-                   # "penalty": ["l1", "l2"],
-                   "loss": ["hinge", "squared_hinge"],
-                   "tol": [0.00001, 0.0001, 0.001]
-                   # should probably include this in grid search, as
-                   # dual=False is preferred when samples>features. However:
-                   # Unsupported set of arguments: The combination of
-                   # penalty='l2' and loss='hinge' are not supported when
-                   # dual=False
-                   # "dual": [True, False]
-                   },
     'svm': {"C": [1, 0.5, 0.1, 0.9, 0.8],
             "tol": [0.00001, 0.0001, 0.001, 0.01],
             "shrinking": [True, False]},
@@ -171,6 +157,9 @@ def _split_training_data(feature_data, targets, column, test_size=0.2,
         except ValueError:
             _stratification_error()
     else:
+        warning_msg = _warn_zero_test_split()
+        warnings.warn(warning_msg, UserWarning)
+
         X_train, X_test, y_train, y_test = (
             feature_data, feature_data, targets, targets)
 
@@ -438,11 +427,14 @@ def _predict_and_plot(output_dir, y_test, y_pred, vmin=None, vmax=None,
     else:
         predictions = _linear_regress(y_test, y_pred)
         predict_plot = _regplot_from_dataframe(y_test, y_pred)
+
     if output_dir is not None:
         predict_plot.get_figure().savefig(
             join(output_dir, 'predictions.png'), bbox_inches='tight')
         predict_plot.get_figure().savefig(
             join(output_dir, 'predictions.pdf'), bbox_inches='tight')
+
+    plt.close('all')
     return predictions, predict_plot
 
 
@@ -480,6 +472,14 @@ def _plot_accuracy(output_dir, predictions, truth, probabilities,
     or numeric data inside two pd.Series
     '''
     truth = truth.to_series()
+
+    # check if test_size == 0.0 and all predictions are complete dataset
+    if (missing_samples == 'ignore') & (
+            predictions.shape[0] == truth.shape[0]):
+        warning_msg = _warn_zero_test_split()
+    else:
+        warning_msg = None
+
     predictions, truth = _match_series_or_die(
         predictions, truth, missing_samples)
 
@@ -499,7 +499,7 @@ def _plot_accuracy(output_dir, predictions, truth, probabilities,
     # output to viz
     _visualize(output_dir=output_dir, estimator=None, cm=predictions,
                roc=probabilities, optimize_feature_selection=False,
-               title=plot_title)
+               title=plot_title, warning_msg=warning_msg)
 
 
 def sort_importances(importances, ascending=False):
@@ -523,6 +523,11 @@ def _summarize_estimator(output_dir, sample_estimator):
         rfep.savefig(join(output_dir, 'rfe_plot.pdf'))
         plt.close('all')
         optimize_feature_selection = True
+        # generate rfe scores file
+        df = pd.DataFrame(data={'rfe_score': sample_estimator.rfe_scores},
+                          index=sample_estimator.rfe_scores.index)
+        df.index.name = 'feature_count'
+        df.to_csv(join(output_dir, 'rfe_scores.tsv'), sep='\t', index=True)
     # if the rfe_scores attribute does not exist, do nothing
     except AttributeError:
         optimize_feature_selection = False
@@ -533,7 +538,8 @@ def _summarize_estimator(output_dir, sample_estimator):
 
 
 def _visualize(output_dir, estimator, cm, roc,
-               optimize_feature_selection=True, title='results'):
+               optimize_feature_selection=True, title='results',
+               warning_msg=None):
 
     pd.set_option('display.max_colwidth', None)
 
@@ -558,7 +564,8 @@ def _visualize(output_dir, estimator, cm, roc,
         'result': result,
         'predictions': cm,
         'roc': roc,
-        'optimize_feature_selection': optimize_feature_selection})
+        'optimize_feature_selection': optimize_feature_selection,
+        'warning_msg': warning_msg})
 
 
 def _visualize_knn(output_dir, params: pd.Series):
@@ -678,14 +685,11 @@ def predict_probabilities(estimator, test_set, index):
               have their class probabilities predicted.
     index: array-like of sample names
     '''
-    # most classifiers have a predict_proba attribute
-    try:
-        probs = pd.DataFrame(estimator.predict_proba(test_set),
-                             index=index, columns=estimator.classes_)
-    # SVMs use the decision_function attribute
-    except AttributeError:
-        probs = pd.DataFrame(estimator.decision_function(test_set),
-                             index=index, columns=estimator.classes_)
+    # all used classifiers have a predict_proba attribute
+    # (approximated for SVCs)
+    probs = pd.DataFrame(estimator.predict_proba(test_set),
+                         index=index, columns=estimator.classes_)
+
     return probs
 
 
@@ -763,11 +767,13 @@ def _select_estimator(estimator, n_jobs, n_estimators, random_state=None):
         estimator = GradientBoostingClassifier(
             n_estimators=n_estimators, random_state=random_state)
     elif estimator == 'LinearSVC':
-        param_dist = parameters['linear_svm']
-        estimator = LinearSVC(random_state=random_state)
+        param_dist = parameters['svm']
+        estimator = SVC(kernel='linear', random_state=random_state,
+                        gamma='scale', probability=True)
     elif estimator == 'SVC':
         param_dist = parameters['svm']
-        estimator = SVC(kernel='rbf', random_state=random_state, gamma='scale')
+        estimator = SVC(kernel='rbf', random_state=random_state,
+                        gamma='scale', probability=True)
     elif estimator == 'KNeighborsClassifier':
         param_dist = parameters['kneighbors']
         estimator = KNeighborsClassifier(algorithm='auto')
@@ -845,3 +851,11 @@ def _warn_feature_selection():
          'the parameter settings requested. See documentation or try a '
          'different estimator model.'))
     warnings.warn(warning, UserWarning)
+
+
+def _warn_zero_test_split():
+    return 'Using test_size = 0.0, you are using your complete dataset for ' \
+        'fitting the estimator. Hence, any returned model evaluations are ' \
+        'based on that same training dataset and are not representative of ' \
+        'your model\'s performance on a previously unseen dataset. Please ' \
+        'consider evaluating this model on a separate dataset.'
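
Two related changes above: the 'LinearSVC' option is now backed by SVC(kernel='linear', probability=True) sharing the generic 'svm' parameter grid, and predict_probabilities drops the decision_function fallback because every registered classifier now exposes predict_proba. A minimal scikit-learn sketch of the difference this makes, on toy data that is purely illustrative:

    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=60, n_features=8, random_state=0)

    # probability=True enables Platt-scaled predict_proba, which plain
    # LinearSVC does not provide; gamma is ignored for a linear kernel
    clf = SVC(kernel='linear', probability=True, gamma='scale', random_state=0)
    clf.fit(X, y)
    probs = clf.predict_proba(X[:5])  # shape (5, n_classes); rows sum to 1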


=====================================
q2_sample_classifier/visuals.py
=====================================
@@ -1,5 +1,5 @@
 # ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
 #
 # Distributed under the terms of the Modified BSD License.
 #
@@ -10,7 +10,7 @@ from sklearn.metrics import (
     mean_squared_error, confusion_matrix, accuracy_score, roc_curve, auc)
 from sklearn.preprocessing import label_binarize
 from itertools import cycle
-from scipy import interp
+from numpy import interp
 import pandas as pd
 import numpy as np
 import seaborn as sns
@@ -72,7 +72,7 @@ def _regplot_from_dataframe(x, y, plot_style="whitegrid", arb=True,
                             color="grey"):
     '''Seaborn regplot with true 1:1 ratio set by arb (bool).'''
     sns.set_style(plot_style)
-    reg = sns.regplot(x, y, color=color)
+    reg = sns.regplot(x=x, y=y, color=color)
     plt.xlabel('True value')
     plt.ylabel('Predicted value')
     if arb is True:
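
Both hunks above are compatibility fixes: scipy.interp was a re-export of numpy.interp (deprecated and since removed from SciPy), so switching the import is a drop-in replacement, and regplot is now called with explicit x=/y= keywords, which newer seaborn releases expect. A tiny illustration with made-up values:

    import numpy as np
    import seaborn as sns

    # identical linear interpolation, now imported from numpy
    mean_tpr = np.interp(np.linspace(0, 1, 5), [0.0, 0.5, 1.0], [0.0, 0.8, 1.0])

    # keyword arguments instead of positional x, y
    ax = sns.regplot(x=np.array([1.0, 2.0, 3.0]), y=np.array([1.1, 1.9, 3.2]),
                     color='grey')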


=====================================
setup.py
=====================================
@@ -1,5 +1,5 @@
 # ----------------------------------------------------------------------------
-# Copyright (c) 2017-2020, QIIME 2 development team.
+# Copyright (c) 2017-2021, QIIME 2 development team.
 #
 # Distributed under the terms of the Modified BSD License.
 #



View it on GitLab: https://salsa.debian.org/med-team/q2-sample-classifier/-/commit/7c4b78faf33bc50e958843964685cf5d73d1374d


