[med-svn] [Git][med-team/lefse][upstream] New upstream version 1.1.2
Andreas Tille (@tille)
gitlab at salsa.debian.org
Tue Sep 28 07:25:46 BST 2021
Andreas Tille pushed to branch upstream at Debian Med / lefse
Commits:
1b6d11f2 by Andreas Tille at 2021-09-28T06:33:44+02:00
New upstream version 1.1.2
- - - - -
26 changed files:
- + .github/workflows/python_publish.yml
- − .hg_archival.txt
- − .hgtags
- + README.md
- example/run.sh → example/bioconda-lefse_run.sh
- − format_input.py
- − lefse.py
- __init__.py → lefse/__init__.py
- + lefse/__pycache__/__init__.cpython-38.pyc
- + lefse/__pycache__/lefse.cpython-38.pyc
- + lefse/lefse.py
- + lefse/lefse.pyc
- lefse2circlader.py → lefse/lefse2circlader.py
- + lefse/lefse_format_input.py
- + lefse/lefse_plot_cladogram.py
- plot_features.py → lefse/lefse_plot_features.py
- plot_res.py → lefse/lefse_plot_res.py
- + lefse/lefse_run.py
- qiime2lefse.py → lefse/qiime2lefse.py
- lefsebiom/AbundanceTable.py
- lefsebiom/CClade.py
- lefsebiom/ValidateData.py
- − plot_cladogram.py
- requirements.txt
- − run_lefse.py
- + setup.py
Changes:
=====================================
.github/workflows/python_publish.yml
=====================================
@@ -0,0 +1,31 @@
+# This workflows will upload a Python Package using Twine when a release is created
+# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries
+
+name: Upload Python Package
+
+on:
+ release:
+ types: [created]
+
+jobs:
+ deploy:
+
+ runs-on: ubuntu-latest
+
+ steps:
+ - uses: actions/checkout at v2
+ - name: Set up Python
+ uses: actions/setup-python at v2
+ with:
+ python-version: '3.7'
+ - name: Install dependencies
+ run: |
+ python -m pip install --upgrade pip
+ pip install setuptools wheel twine
+ - name: Build and publish
+ env:
+ TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
+ TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
+ run: |
+ python setup.py sdist bdist_wheel
+ twine upload dist/*
=====================================
.hg_archival.txt deleted
=====================================
@@ -1,4 +0,0 @@
-repo: e20290ebdfe008633514e90d97edc6091bf0b21d
-node: 3ce86f78d59ecd02eaf168afb6e1bd898bb3facc
-branch: default
-tag: 1.0.8
=====================================
.hgtags deleted
=====================================
@@ -1,3 +0,0 @@
-89e1a36a3c257de1ccab84f2318fe53229c8f7d0 1.0.5
-f9b6a0dcd6453e6da472d7972b5bddbe8c1abab1 1.0.6
-4e95b89d544e42532887f9f3e1a35d9accf5190f 1.0.7
=====================================
README.md
=====================================
@@ -0,0 +1,28 @@
+**LEfSe**
+==============
+
+LEfSe (Linear discriminant analysis Effect Size) determines the features
+(organisms, clades, operational taxonomic units, genes, or functions)
+most likely to explain differences between classes by coupling standard
+tests for statistical significance with additional tests encoding
+biological consistency and effect relevance.
+
+LEfSe is available as a [Galaxy module](http://huttenhower.org/galaxy/),
+a Conda formula, a Docker image, and included in bioBakery (VM and
+cloud). For additional information, please refer to the [LEfSe
+paper](http://www.ncbi.nlm.nih.gov/pubmed/21702898).
+
+## Installation
+
+LEfSe can be installed with Conda or run from a Docker image. Please
+note, if you are using bioBakery (Vagrant VM or cloud) you do not need
+to install LEfSe because the tool and its dependencies are already
+installed.
+
+Install with Conda: `$ conda install -c bioconda lefse`
+
+Install with Docker: `$ docker run -it biobakery/lefse bash`
+
+LEfSe requires R v. 3.6 or higher and the R libraries survival, mvtnorm, modeltools, coin, MASS.
+
+We provide support for LEfSe users. Please join our [bioBakery Support Forum](https://forum.biobakery.org/c/Downstream-analysis-and-statistics/LEfSe) designated specifically for LEfSe users.
=====================================
example/run.sh → example/bioconda-lefse_run.sh
=====================================
@@ -1,12 +1,21 @@
+## script changed from original by for bioconda-lefse by jfg 2017.10.31
+## adapted from the original lefse run.sh available at:
+## https://bitbucket.org/nsegata/lefse/src/54694b4b0d9e335ff1ecafff8db4f1e0cf7004da/example/run.sh?at=default&fileviewer=file-view-default
+##
+##
+
# Download a 3-classes example (with subclasses and subjects) from huttenhower.sph.harvard.edu
# It is a small subset of the HMP 16S dataset for finding biomarkers characterizing
# different level of oxygen availability in different bodysites
-wget http://huttenhower.sph.harvard.edu/webfm_send/129 -O hmp_aerobiosis_small.txt
+wget https://github.com/biobakery/biobakery/raw/master/demos/biobakery_demos/data/lefse/input/hmp_small_aerobiosis.txt -O hmp_aerobiosis_small.txt
+
+# As using LEfSe through bioconda, need to activate the LEfSe installation:
+conda activate lefse
# Running the LEfSe commands with -h gives the list of available options
-# format_input.py convert the input data matrix to the format for LEfSe.
+# lefse-format_input.py convert the input data matrix to the format for LEfSe.
#
# In this example we use the class information in the first line (-c 1)
# the subclass in the second line (-s 2) and the subject in the third (-u 3).
@@ -14,38 +23,42 @@ wget http://huttenhower.sph.harvard.edu/webfm_send/129 -O hmp_aerobiosis_small.t
# the value -1 for them.
# -o 1000000 scales the feature such that the sum (of the same taxonomic leve)
# is 1M: this is done only for obtaining more meaningful values for the LDA score
-../format_input.py hmp_aerobiosis_small.txt hmp_aerobiosis_small.in -c 1 -s 2 -u 3 -o 1000000
+lefse_format_input.py hmp_aerobiosis_small.txt hmp_aerobiosis_small.in -c 1 -s 2 -u 3 -o 1000000
# run_lefse.py performs the actual statistica analysis
#
# Apply LEfSe on the formatted data producing the results (to be further processed
# for visualization with the other modules). The option available
# can be listed using the -h option
-../run_lefse.py hmp_aerobiosis_small.in hmp_aerobiosis_small.res
+run_lefse.py hmp_aerobiosis_small.in hmp_aerobiosis_small.res
-# plot_res.py visualizes the output
+# lefse_plot_res.py visualizes the output
#
# Plot the list of biomarkers with their effect size
# Severak graphical options are available for personalizing the output
-../plot_res.py hmp_aerobiosis_small.res hmp_aerobiosis_small.png
+lefse_plot_res.py hmp_aerobiosis_small.res hmp_aerobiosis_small.png
-# plot_cladogram.py visualizes the output on a hierarchical tree
+# lefse_plot_cladogram.py visualizes the output on a hierarchical tree
#
# Plot the representation of the biomarkers on the hierarchical tree
# specified in the input data (using | in the name of the features)
# In this case we will obtain the RDP taxonomy.
# This is an early implementation of the module. I'm working on an improved version
# that will be released independently from LEfSe
-../plot_cladogram.py hmp_aerobiosis_small.res hmp_aerobiosis_small.cladogram.png --format png
+lefse_plot_cladogram.py hmp_aerobiosis_small.res hmp_aerobiosis_small.cladogram.png --format png
# Create a directory for storing the raw-data representation of the discovered biomarkers
mkdir biomarkers_raw_images
-# plot_features.py visualizes the raw-data features
+# lefse_plot_features.py visualizes the raw-data features
#
# The module for exporting the raw-data representation of the features.
# With the default options we will obtain the images for all the features that are
# detected as biomarkers
-../plot_features.py hmp_aerobiosis_small.in hmp_aerobiosis_small.res biomarkers_raw_images/
+lefse_plot_features.py hmp_aerobiosis_small.in hmp_aerobiosis_small.res biomarkers_raw_images/
+## Turn lefse back off
+conda deactivate lefse
+## bonus: seasonal greetings
+# echo '~ Oíche Shamhna féile dhuit!'
=====================================
format_input.py deleted
=====================================
@@ -1,453 +0,0 @@
-#!/usr/bin/env python
-
-import sys,os,argparse,pickle,re,numpy
-
-
-
-
-#***************************************************************************************************************
-#* Log of change *
-#* January 16, 2014 - George Weingart - george.weingart at gmail.com *
-#* *
-#* biom Support *
-#* Modified the program to enable it to accept biom files as input *
-#* *
-#* Added two optional input parameters: *
-#* 1. biom_c is the name of the biom metadata to be used as class *
-#* 2. biom_s is the name of the biom metadata to be used as subclass *
-#* class and subclass are used in the same context as the original *
-#* parameters class and subclass *
-#* These parameters are totally optional, the default is the program *
-#* chooses as class the first metadata received from the conversion *
-#* of the biom file into a sequential (pcl) file as generated by *
-#* breadcrumbs, and similarly, the second metadata is selected as *
-#* subclass. *
-#* The syntax or logic for the original non-biom case was NOT changed. *
-#* *
-#* <******************* IMPORTANT NOTE *************************> *
-#* The biom case requires breadcrumbs and therefore there is a *
-#* a conditional import of the breadcrumbs modules *
-#* If the User uses a biom input and breadcrumbs is not detected, *
-#* the run is abnormally ended *
-#* breadcrumbs itself needs a biom environment, so if the immport *
-#* of biom in breadcrumbs fails, the run is also abnormally
-#* ended (Only if the input file was biom) *
-#* *
-#* USAGE EXAMPLES *
-#* -------------- *
-#* Case #1: Using a sequential file as input (Old version - did not change *
-#* ./format_input.py hmp_aerobiosis_small.txt hmp_aerobiosis_small.in -c 1 -s 2 -u 3 -o 1000000 *
-#* Case #2: Using a biom file as input *
-#* ./format_input.py hmp_aerobiosis_small.biom hmp_aerobiosis_small.in -o 1000000 *
-#* Case #3: Using a biom file as input and override the class and subclass *
-#* ./format_input.py lefse.biom hmp_aerobiosis_small.in -biom_c oxygen_availability -biom_s body_site -o 1000000
-#* *
-#***************************************************************************************************************
-
-def read_input_file(inp_file, CommonArea):
-
- if inp_file.endswith('.biom'): #* If the file format is biom:
- CommonArea = biom_processing(inp_file) #* Process in biom format
- return CommonArea #* And return the CommonArea
-
- with open(inp_file) as inp:
- CommonArea['ReturnedData'] = [[v.strip() for v in line.strip().split("\t")] for line in inp.readlines()]
- return CommonArea
-
-def transpose(data):
- return zip(*data)
-
-def read_params(args):
- parser = argparse.ArgumentParser(description='LEfSe formatting modules')
- parser.add_argument('input_file', metavar='INPUT_FILE', type=str, help="the input file, feature hierarchical level can be specified with | or . and those symbols must not be present for other reasons in the input file.")
- parser.add_argument('output_file', metavar='OUTPUT_FILE', type=str,
- help="the output file containing the data for LEfSe")
- parser.add_argument('--output_table', type=str, required=False, default="",
- help="the formatted table in txt format")
- parser.add_argument('-f',dest="feats_dir", choices=["c","r"], type=str, default="r",
- help="set whether the features are on rows (default) or on columns")
- parser.add_argument('-c',dest="class", metavar="[1..n_feats]", type=int, default=1,
- help="set which feature use as class (default 1)")
- parser.add_argument('-s',dest="subclass", metavar="[1..n_feats]", type=int, default=None,
- help="set which feature use as subclass (default -1 meaning no subclass)")
- parser.add_argument('-o',dest="norm_v", metavar="float", type=float, default=-1.0,
- help="set the normalization value (default -1.0 meaning no normalization)")
- parser.add_argument('-u',dest="subject", metavar="[1..n_feats]", type=int, default=None,
- help="set which feature use as subject (default -1 meaning no subject)")
- parser.add_argument('-m',dest="missing_p", choices=["f","s"], type=str, default="d",
- help="set the policy to adopt with missin values: f removes the features with missing values, s removes samples with missing values (default f)")
- parser.add_argument('-n',dest="subcl_min_card", metavar="int", type=int, default=10,
- help="set the minimum cardinality of each subclass (subclasses with low cardinalities will be grouped together, if the cardinality is still low, no pairwise comparison will be performed with them)")
-
- parser.add_argument('-biom_c',dest="biom_class", type=str,
- help="For biom input files: Set which feature use as class ")
- parser.add_argument('-biom_s',dest="biom_subclass", type=str,
- help="For biom input files: set which feature use as subclass ")
-
- args = parser.parse_args()
-
- return vars(args)
-
-def remove_missing(data,roc):
- if roc == "c": data = transpose(data)
- max_len = max([len(r) for r in data])
- to_rem = []
- for i,r in enumerate(data):
- if len([v for v in r if not( v == "" or v.isspace())]) < max_len: to_rem.append(i)
- if len(to_rem):
- for i in to_rem.reverse():
- data.pop(i)
- if roc == "c": return transpose(data)
- return data
-
-
-def sort_by_cl(data,n,c,s,u):
- def sort_lines1(a,b):
- return int(a[c] > b[c])*2-1
- def sort_lines2u(a,b):
- if a[c] != b[c]: return int(a[c] > b[c])*2-1
- return int(a[u] > b[u])*2-1
- def sort_lines2s(a,b):
- if a[c] != b[c]: return int(a[c] > b[c])*2-1
- return int(a[s] > b[s])*2-1
- def sort_lines3(a,b):
- if a[c] != b[c]: return int(a[c] > b[c])*2-1
- if a[s] != b[s]: return int(a[s] > b[s])*2-1
- return int(a[u] > b[u])*2-1
- if n == 3: data.sort(sort_lines3)
- if n == 2:
- if s is None:
- data.sort(sort_lines2u)
- else:
- data.sort(sort_lines2s)
- if n == 1: data.sort(sort_lines1)
- return data
-
-def group_small_subclasses(cls,min_subcl):
- last = ""
- n = 0
- repl = []
- dd = [list(cls['class']),list(cls['subclass'])]
- for d in dd:
- if d[1] != last:
- if n < min_subcl and last != "":
- repl.append(d[1])
- last = d[1]
- n = 1
- for i,d in enumerate(dd):
- if d[1] in repl: dd[i][1] = "other"
- dd[i][1] = str(dd[i][0])+"_"+str(dd[i][1])
- cls['class'] = dd[0]
- cls['subclass'] = dd[1]
- return cls
-
-def get_class_slices(data):
- previous_class = data[0][0]
- previous_subclass = data[0][1]
- subclass_slices = []
- class_slices = []
- last_cl = 0
- last_subcl = 0
- class_hierarchy = []
- subcls = []
- for i,d in enumerate(data):
- if d[1] != previous_subclass:
- subclass_slices.append((previous_subclass,(last_subcl,i)))
- last_subcl = i
- subcls.append(previous_subclass)
- if d[0] != previous_class:
- class_slices.append((previous_class,(last_cl,i)))
- class_hierarchy.append((previous_class,subcls))
- subcls = []
- last_cl = i
- previous_subclass = d[1]
- previous_class = d[0]
- subclass_slices.append((previous_subclass,(last_subcl,i+1)))
- subcls.append(previous_subclass)
- class_slices.append((previous_class,(last_cl,i+1)))
- class_hierarchy.append((previous_class,subcls))
- return dict(class_slices), dict(subclass_slices), dict(class_hierarchy)
-
-def numerical_values(feats,norm):
- mm = []
- for k,v in feats.items():
- feats[k] = [float(val) for val in v]
- if norm < 0.0: return feats
- tr = zip(*(feats.values()))
- mul = []
- fk = feats.keys()
- hie = True if sum([k.count(".") for k in fk]) > len(fk) else False
- for i in range(len(feats.values()[0])):
- if hie: mul.append(sum([t for j,t in enumerate(tr[i]) if fk[j].count(".") < 1 ]))
- else: mul.append(sum(tr[i]))
- if hie and sum(mul) == 0:
- mul = []
- for i in range(len(feats.values()[0])):
- mul.append(sum(tr[i]))
- for i,m in enumerate(mul):
- if m == 0: mul[i] = 0.0
- else: mul[i] = float(norm) / m
- for k,v in feats.items():
- feats[k] = [val*mul[i] for i,val in enumerate(v)]
- if numpy.mean(feats[k]) and (numpy.std(feats[k])/numpy.mean(feats[k])) < 1e-10:
- feats[k] = [ float(round(kv*1e6)/1e6) for kv in feats[k]]
- return feats
-
-def add_missing_levels2(ff):
-
- if sum( [f.count(".") for f in ff] ) < 1: return ff
-
- dn = {}
-
- added = True
- while added:
- added = False
- for f in ff:
- lev = f.count(".")
- if lev == 0: continue
- if lev not in dn: dn[lev] = [f]
- else: dn[lev].append(f)
- for fn in sorted(dn,reverse=True):
- for f in dn[fn]:
- fc = ".".join(f.split('.')[:-1])
- if fc not in ff:
- ab_all = [ff[fg] for fg in ff if (fg.count(".") == 0 and fg == fc) or (fg.count(".") > 0 and fc == ".".join(fg.split('.')[:-1]))]
- ab =[]
- for l in [f for f in zip(*ab_all)]:
- ab.append(sum([float(ll) for ll in l]))
- ff[fc] = ab
- added = True
- if added:
- break
-
- return ff
-
-
-def add_missing_levels(ff):
- if sum( [f.count(".") for f in ff] ) < 1: return ff
-
- clades2leaves = {}
- for f in ff:
- fs = f.split(".")
- if len(fs) < 2:
- continue
- for l in range(len(fs)):
- n = ".".join( fs[:l] )
- if n in clades2leaves:
- clades2leaves[n].append( f )
- else:
- clades2leaves[n] = [f]
- for k,v in clades2leaves.items():
- if k and k not in ff:
- ff[k] = [sum(a) for a in zip(*[[float(fn) for fn in ff[vv]] for vv in v])]
- return ff
-
-
-
-def modify_feature_names(fn):
- ret = fn
-
- for v in [' ',r'\$',r'\@',r'#',r'%',r'\^',r'\&',r'\*',r'\"',r'\'']:
- ret = [re.sub(v,"",f) for f in ret]
- for v in ["/",r'\(',r'\)',r'-',r'\+',r'=',r'{',r'}',r'\[',r'\]',
- r',',r'\.',r';',r':',r'\?',r'\<',r'\>',r'\.',r'\,']:
- ret = [re.sub(v,"_",f) for f in ret]
- for v in ["\|"]:
- ret = [re.sub(v,".",f) for f in ret]
-
- ret2 = []
- for r in ret:
- if r[0] in ['0','1','2','3','4','5','6','7','8','9']:
- ret2.append("f_"+r)
- else: ret2.append(r)
-
- return ret2
-
-
-def rename_same_subcl(cl,subcl):
- toc = []
- for sc in set(subcl):
- if len(set([cl[i] for i in range(len(subcl)) if sc == subcl[i]])) > 1:
- toc.append(sc)
- new_subcl = []
- for i,sc in enumerate(subcl):
- if sc in toc: new_subcl.append(cl[i]+"_"+sc)
- else: new_subcl.append(sc)
- return new_subcl
-
-
-#*************************************************************************************
-#* Modifications by George Weingart, Jan 15, 2014 *
-#* If the input file is biom: *
-#* a. Load an AbundanceTable (Using breadcrumbs) *
-#* b. Create a sequential file from the AbundanceTable (de-facto - pcl) *
-#* c. Use that file as input to the rest of the program *
-#* d. Calculate the c,s,and u parameters, either from the values the User entered *
-#* from the meta data values in the biom file or set up defaults *
-#* <<<------------- I M P O R T A N T N O T E ------------------->> *
-#* breadcrumbs src directory must be included in the PYTHONPATH *
-#* <<<------------- I M P O R T A N T N O T E ------------------->> *
-#*************************************************************************************
-def biom_processing(inp_file):
- CommonArea = dict() #* Set up a dictionary to return
- CommonArea['abndData'] = AbundanceTable.funcMakeFromFile(inp_file, #* Create AbundanceTable from input biom file
- cDelimiter = None,
- sMetadataID = None,
- sLastMetadataRow = None,
- sLastMetadata = None,
- strFormat = None)
-
- #****************************************************************
- #* Building the data element here *
- #****************************************************************
- ResolvedData = list() #This is the Resolved data that will be returned
- IDMetadataName = CommonArea['abndData'].funcGetIDMetadataName() #* ID Metadataname
- IDMetadata = [CommonArea['abndData'].funcGetIDMetadataName()] #* The first Row
- for IDMetadataEntry in CommonArea['abndData'].funcGetMetadataCopy()[IDMetadataName]: #* Loop on all the metadata values
- IDMetadata.append(IDMetadataEntry)
- ResolvedData.append(IDMetadata) #Add the IDMetadata with all its values to the resolved area
- for key, value in CommonArea['abndData'].funcGetMetadataCopy().iteritems():
- if key != IDMetadataName:
- MetadataEntry = list() #* Set it up
- MetadataEntry.append(key) #* And post it to the area
- for x in value:
- MetadataEntry.append(x) #* Append the metadata value name
- ResolvedData.append(MetadataEntry)
- for AbundanceDataEntry in CommonArea['abndData'].funcGetAbundanceCopy(): #* The Abundance Data
- lstAbundanceDataEntry = list(AbundanceDataEntry) #Convert tuple to list
- ResolvedData.append(lstAbundanceDataEntry) #Append the list to the metadata list
- CommonArea['ReturnedData'] = ResolvedData #Post the results
- return CommonArea
-
-
-#*******************************************************************************
-#* Check the params and override in the case of biom *
-#*******************************************************************************
-def check_params_for_biom_case(params, CommonArea):
- CommonArea['MetadataNames'] = list() #Metadata names
- params['original_class'] = params['class'] #Save the original class
- params['original_subclass'] = params['subclass'] #Save the original subclass
- params['original_subject'] = params['subject'] #Save the original subclass
-
-
- TotalMetadataEntriesAndIDInBiomFile = len(CommonArea['abndData'].funcGetMetadataCopy()) # The number of metadata entries
- for i in range(0,TotalMetadataEntriesAndIDInBiomFile): #* Populate the meta data names table
- CommonArea['MetadataNames'].append(CommonArea['ReturnedData'][i][0]) #Add the metadata name
-
-
- #****************************************************
- #* Setting the params here *
- #****************************************************
-
- if TotalMetadataEntriesAndIDInBiomFile > 0: #If there is at least one entry - has to be the subject
- params['subject'] = 1
- if TotalMetadataEntriesAndIDInBiomFile == 2: #If there are 2 - The first is the subject and the second has to be the metadata, and that is the class
- params['class'] = 2
- if TotalMetadataEntriesAndIDInBiomFile == 3: #If there are 3: Set up default that the second entry is the class and the third is the subclass
- params['class'] = 2
- params['subclass'] = 3
- FlagError = False #Set up error flag
-
- if not params['biom_class'] is None and not params['biom_subclass'] is None: #Check if the User passed a valid class and subclass
- if params['biom_class'] in CommonArea['MetadataNames']:
- params['class'] = CommonArea['MetadataNames'].index(params['biom_class']) +1 #* Set up the index for that metadata
- else:
- FlagError = True
- if params['biom_subclass'] in CommonArea['MetadataNames']:
- params['subclass'] = CommonArea['MetadataNames'].index(params['biom_subclass']) +1 #* Set up the index for that metadata
- else:
- FlagError = True
- if FlagError == True: #* If the User passed an invalid class
- print "**Invalid biom class or subclass passed - Using defaults: First metadata=class, Second Metadata=subclass\n"
- params['class'] = 2
- params['subclass'] = 3
- return params
-
-
-
-if __name__ == '__main__':
- CommonArea = dict() #Build a Common Area to pass variables in the biom case
- params = read_params(sys.argv)
-
- #*************************************************************
- #* Conditionally import breadcrumbs if file is a biom file *
- #* If it is and no breadcrumbs found - abnormally exit *
- #*************************************************************
- if params['input_file'].endswith('.biom'):
- try:
- from lefsebiom.ConstantsBreadCrumbs import *
- from lefsebiom.AbundanceTable import *
- except ImportError:
- sys.stderr.write("************************************************************************************************************ \n")
- sys.stderr.write("* Error: Breadcrumbs libraries not detected - required to process biom files - run abnormally terminated * \n")
- sys.stderr.write("************************************************************************************************************ \n")
- exit(1)
-
-
- if type(params['subclass']) is int and int(params['subclass']) < 1:
- params['subclass'] = None
- if type(params['subject']) is int and int(params['subject']) < 1:
- params['subject'] = None
-
-
- CommonArea = read_input_file(sys.argv[1], CommonArea) #Pass The CommonArea to the Read
- data = CommonArea['ReturnedData'] #Select the data
-
- if sys.argv[1].endswith('biom'): #* Check if biom:
- params = check_params_for_biom_case(params, CommonArea) #Check the params for the biom case
-
- if params['feats_dir'] == "c":
- data = transpose(data)
-
- ncl = 1
- if not params['subclass'] is None: ncl += 1
- if not params['subject'] is None: ncl += 1
-
- first_line = zip(*data)[0]
-
- first_line = modify_feature_names(list(first_line))
-
- data = zip( first_line,
- *sort_by_cl(zip(*data)[1:],
- ncl,
- params['class']-1,
- params['subclass']-1 if not params['subclass'] is None else None,
- params['subject']-1 if not params['subject'] is None else None))
-# data.insert(0,first_line)
-# data = remove_missing(data,params['missing_p'])
- cls = {}
-
- cls_i = [('class',params['class']-1)]
- if params['subclass'] > 0: cls_i.append(('subclass',params['subclass']-1))
- if params['subject'] > 0: cls_i.append(('subject',params['subject']-1))
- cls_i.sort(lambda x, y: -cmp(x[1],y[1]))
- for v in cls_i: cls[v[0]] = data.pop(v[1])[1:]
- if not params['subclass'] > 0: cls['subclass'] = [str(cl)+"_subcl" for cl in cls['class']]
-
- cls['subclass'] = rename_same_subcl(cls['class'],cls['subclass'])
-# if 'subclass' in cls.keys(): cls = group_small_subclasses(cls,params['subcl_min_card'])
- class_sl,subclass_sl,class_hierarchy = get_class_slices(zip(*cls.values()))
-
- feats = dict([(d[0],d[1:]) for d in data])
-
- feats = add_missing_levels(feats)
-
- feats = numerical_values(feats,params['norm_v'])
- out = {}
- out['feats'] = feats
- out['norm'] = params['norm_v']
- out['cls'] = cls
- out['class_sl'] = class_sl
- out['subclass_sl'] = subclass_sl
- out['class_hierarchy'] = class_hierarchy
-
- if params['output_table']:
- with open( params['output_table'], "w") as outf:
- if 'class' in cls: outf.write( "\t".join(list(["class"])+list(cls['class'])) + "\n" )
- if 'subclass' in cls: outf.write( "\t".join(list(["subclass"])+list(cls['subclass'])) + "\n" )
- if 'subject' in cls: outf.write( "\t".join(list(["subject"])+list(cls['subject'])) + "\n" )
- for k,v in out['feats'].items(): outf.write( "\t".join([k]+[str(vv) for vv in v]) + "\n" )
-
- with open(params['output_file'], 'wb') as back_file:
- pickle.dump(out,back_file)
-
=====================================
lefse.py deleted
=====================================
@@ -1,241 +0,0 @@
-import os,sys,math,pickle
-import random as lrand
-import rpy2.robjects as robjects
-import argparse
-import numpy
-#import svmutil
-
-def init():
- lrand.seed(1982)
- robjects.r('library(splines)')
- robjects.r('library(stats4)')
- robjects.r('library(survival)')
- robjects.r('library(mvtnorm)')
- robjects.r('library(modeltools)')
- robjects.r('library(coin)')
- robjects.r('library(MASS)')
-
-def get_class_means(class_sl,feats):
- means = {}
- clk = class_sl.keys()
- for fk,f in feats.items():
- means[fk] = [numpy.mean((f[class_sl[k][0]:class_sl[k][1]])) for k in clk]
- return clk,means
-
-def save_res(res,filename):
- with open(filename, 'w') as out:
- for k,v in res['cls_means'].items():
- out.write(k+"\t"+str(math.log(max(max(v),1.0),10.0))+"\t")
- if k in res['lda_res_th']:
- for i,vv in enumerate(v):
- if vv == max(v):
- out.write(str(res['cls_means_kord'][i])+"\t")
- break
- out.write(str(res['lda_res'][k]))
- else: out.write("\t")
- out.write( "\t" + (res['wilcox_res'][k] if 'wilcox_res' in res and k in res['wilcox_res'] else "-")+"\n")
-
-def load_data(input_file, nnorm = False):
- with open(input_file, 'rb') as inputf:
- inp = pickle.load(inputf)
- if nnorm: return inp['feats'],inp['cls'],inp['class_sl'],inp['subclass_sl'],inp['class_hierarchy'],inp['norm']
- else: return inp['feats'],inp['cls'],inp['class_sl'],inp['subclass_sl'],inp['class_hierarchy']
-
-def load_res(input_file):
- with open(input_file, 'rb') as inputf:
- inp = pickle.load(inputf)
- return inp['res'],inp['params'],inp['class_sl'],inp['subclass_sl']
-
-
-def test_kw_r(cls,feats,p,factors):
- robjects.globalenv["y"] = robjects.FloatVector(feats)
- for i,f in enumerate(factors):
- robjects.globalenv['x'+str(i+1)] = robjects.FactorVector(robjects.StrVector(cls[f]))
- fo = "y~x1"
- #for i,f in enumerate(factors[1:]):
- # if f == "subclass" and len(set(cls[f])) <= len(set(cls["class"])): continue
- # if len(set(cls[f])) == len(cls[f]): continue
- # fo += "+x"+str(i+2)
- kw_res = robjects.r('kruskal.test('+fo+',)$p.value')
- return float(tuple(kw_res)[0]) < p, float(tuple(kw_res)[0])
-
-def test_rep_wilcoxon_r(sl,cl_hie,feats,th,multiclass_strat,mul_cor,fn,min_c,comp_only_same_subcl,curv=False):
- comp_all_sub = not comp_only_same_subcl
- tot_ok = 0
- alpha_mtc = th
- all_diff = []
- for pair in [(x,y) for x in cl_hie.keys() for y in cl_hie.keys() if x < y]:
- dir_cmp = "not_set" #
- l_subcl1, l_subcl2 = (len(cl_hie[pair[0]]), len(cl_hie[pair[1]]))
- if mul_cor != 0: alpha_mtc = th*l_subcl1*l_subcl2 if mul_cor == 2 else 1.0-math.pow(1.0-th,l_subcl1*l_subcl2)
- ok = 0
- curv_sign = 0
- first = True
- for i,k1 in enumerate(cl_hie[pair[0]]):
- br = False
- for j,k2 in enumerate(cl_hie[pair[1]]):
- if not comp_all_sub and k1[len(pair[0]):] != k2[len(pair[1]):]:
- ok += 1
- continue
- cl1 = feats[sl[k1][0]:sl[k1][1]]
- cl2 = feats[sl[k2][0]:sl[k2][1]]
- med_comp = False
- if len(cl1) < min_c or len(cl2) < min_c:
- med_comp = True
- sx,sy = numpy.median(cl1),numpy.median(cl2)
- if cl1[0] == cl2[0] and len(set(cl1)) == 1 and len(set(cl2)) == 1:
- tres, first = False, False
- elif not med_comp:
- robjects.globalenv["x"] = robjects.FloatVector(cl1+cl2)
- robjects.globalenv["y"] = robjects.FactorVector(robjects.StrVector(["a" for a in cl1]+["b" for b in cl2]))
- pv = float(robjects.r('pvalue(wilcox_test(x~y,data=data.frame(x,y)))')[0])
- tres = pv < alpha_mtc*2.0
- if first:
- first = False
- if not curv and ( med_comp or tres ):
- dir_cmp = sx < sy
- #if sx == sy: br = True
- elif curv:
- dir_cmp = None
- if med_comp or tres:
- curv_sign += 1
- dir_cmp = sx < sy
- else: br = True
- elif not curv and med_comp:
- if ((sx < sy) != dir_cmp or sx == sy): br = True
- elif curv:
- if tres and dir_cmp == None:
- curv_sign += 1
- dir_cmp = sx < sy
- if tres and dir_cmp != (sx < sy):
- br = True
- curv_sign = -1
- elif not tres or (sx < sy) != dir_cmp or sx == sy: br = True
- if br: break
- ok += 1
- if br: break
- if curv: diff = curv_sign > 0
- else: diff = (ok == len(cl_hie[pair[1]])*len(cl_hie[pair[0]])) # or (not comp_all_sub and dir_cmp != "not_set")
- if diff: tot_ok += 1
- if not diff and multiclass_strat: return False
- if diff and not multiclass_strat: all_diff.append(pair)
- if not multiclass_strat:
- tot_k = len(cl_hie.keys())
- for k in cl_hie.keys():
- nk = 0
- for a in all_diff:
- if k in a: nk += 1
- if nk == tot_k-1: return True
- return False
- return True
-
-
-
-def contast_within_classes_or_few_per_class(feats,inds,min_cl,ncl):
- ff = zip(*[v for n,v in feats.items() if n != 'class'])
- cols = [ff[i] for i in inds]
- cls = [feats['class'][i] for i in inds]
- if len(set(cls)) < ncl:
- return True
- for c in set(cls):
- if cls.count(c) < min_cl:
- return True
- cols_cl = [x for i,x in enumerate(cols) if cls[i] == c]
- for i,col in enumerate(zip(*cols_cl)):
- if (len(set(col)) <= min_cl and min_cl > 1) or (min_cl == 1 and len(set(col)) <= 1):
- return True
- return False
-
-def test_lda_r(cls,feats,cl_sl,boots,fract_sample,lda_th,tol_min,nlogs):
- fk = feats.keys()
- means = dict([(k,[]) for k in feats.keys()])
- feats['class'] = list(cls['class'])
- clss = list(set(feats['class']))
- for uu,k in enumerate(fk):
- if k == 'class': continue
- ff = [(feats['class'][i],v) for i,v in enumerate(feats[k])]
- for c in clss:
- if len(set([float(v[1]) for v in ff if v[0] == c])) > max(float(feats['class'].count(c))*0.5,4): continue
- for i,v in enumerate(feats[k]):
- if feats['class'][i] == c:
- feats[k][i] = math.fabs(feats[k][i] + lrand.normalvariate(0.0,max(feats[k][i]*0.05,0.01)))
- rdict = {}
- for a,b in feats.items():
- if a == 'class' or a == 'subclass' or a == 'subject':
- rdict[a] = robjects.StrVector(b)
- else: rdict[a] = robjects.FloatVector(b)
- robjects.globalenv["d"] = robjects.DataFrame(rdict)
- lfk = len(feats[fk[0]])
- rfk = int(float(len(feats[fk[0]]))*fract_sample)
- f = "class ~ "+fk[0]
- for k in fk[1:]: f += " + " + k.strip()
- ncl = len(set(cls['class']))
- min_cl = int(float(min([cls['class'].count(c) for c in set(cls['class'])]))*fract_sample*fract_sample*0.5)
- min_cl = max(min_cl,1)
- pairs = [(a,b) for a in set(cls['class']) for b in set(cls['class']) if a > b]
-
- for k in fk:
- for i in range(boots):
- means[k].append([])
- for i in range(boots):
- for rtmp in range(1000):
- rand_s = [lrand.randint(0,lfk-1) for v in range(rfk)]
- if not contast_within_classes_or_few_per_class(feats,rand_s,min_cl,ncl): break
- rand_s = [r+1 for r in rand_s]
- means[k][i] = []
- for p in pairs:
- robjects.globalenv["rand_s"] = robjects.IntVector(rand_s)
- robjects.globalenv["sub_d"] = robjects.r('d[rand_s,]')
- z = robjects.r('z <- suppressWarnings(lda(as.formula('+f+'),data=sub_d,tol='+str(tol_min)+'))')
- robjects.r('w <- z$scaling[,1]')
- robjects.r('w.unit <- w/sqrt(sum(w^2))')
- robjects.r('ss <- sub_d[,-match("class",colnames(sub_d))]')
- if 'subclass' in feats:
- robjects.r('ss <- ss[,-match("subclass",colnames(ss))]')
- if 'subject' in feats:
- robjects.r('ss <- ss[,-match("subject",colnames(ss))]')
- robjects.r('xy.matrix <- as.matrix(ss)')
- robjects.r('LD <- xy.matrix%*%w.unit')
- robjects.r('effect.size <- abs(mean(LD[sub_d[,"class"]=="'+p[0]+'"]) - mean(LD[sub_d[,"class"]=="'+p[1]+'"]))')
- scal = robjects.r('wfinal <- w.unit * effect.size')
- rres = robjects.r('mm <- z$means')
- rowns = list(rres.rownames)
- lenc = len(list(rres.colnames))
- coeff = [abs(float(v)) if not math.isnan(float(v)) else 0.0 for v in scal]
- res = dict([(pp,[float(ff) for ff in rres.rx(pp,True)] if pp in rowns else [0.0]*lenc ) for pp in [p[0],p[1]]])
- for j,k in enumerate(fk):
- gm = abs(res[p[0]][j] - res[p[1]][j])
- means[k][i].append((gm+coeff[j])*0.5)
- res = {}
- for k in fk:
- m = max([numpy.mean([means[k][kk][p] for kk in range(boots)]) for p in range(len(pairs))])
- res[k] = math.copysign(1.0,m)*math.log(1.0+math.fabs(m),10)
- return res,dict([(k,x) for k,x in res.items() if math.fabs(x) > lda_th])
-
-
-def test_svm(cls,feats,cl_sl,boots,fract_sample,lda_th,tol_min,nsvm):
- return NULL
-"""
- fk = feats.keys()
- clss = list(set(cls['class']))
- y = [clss.index(c)*2-1 for c in list(cls['class'])]
- xx = [feats[f] for f in fk]
- if nsvm:
- maxs = [max(v) for v in xx]
- mins = [min(v) for v in xx]
- x = [ dict([(i+1,(v-mins[i])/(maxs[i]-mins[i])) for i,v in enumerate(f)]) for f in zip(*xx)]
- else: x = [ dict([(i+1,v) for i,v in enumerate(f)]) for f in zip(*xx)]
-
- lfk = len(feats[fk[0]])
- rfk = int(float(len(feats[fk[0]]))*fract_sample)
- mm = []
-
- best_c = svmutil.svm_ms(y, x, [pow(2.0,i) for i in range(-5,10)],'-t 0 -q')
- for i in range(boots):
- rand_s = [lrand.randint(0,lfk-1) for v in range(rfk)]
- r = svmutil.svm_w([y[yi] for yi in rand_s], [x[xi] for xi in rand_s], best_c,'-t 0 -q')
- mm.append(r[:len(fk)])
- m = [numpy.mean(v) for v in zip(*mm)]
- res = dict([(v,m[i]) for i,v in enumerate(fk)])
- return res,dict([(k,x) for k,x in res.items() if math.fabs(x) > lda_th])
-"""
=====================================
__init__.py → lefse/__init__.py
=====================================
=====================================
lefse/__pycache__/__init__.cpython-38.pyc
=====================================
Binary files /dev/null and b/lefse/__pycache__/__init__.cpython-38.pyc differ
=====================================
lefse/__pycache__/lefse.cpython-38.pyc
=====================================
Binary files /dev/null and b/lefse/__pycache__/lefse.cpython-38.pyc differ
=====================================
lefse/lefse.py
=====================================
@@ -0,0 +1,265 @@
+import os,sys,math,pickle
+import random as lrand
+import rpy2.robjects as robjects
+import argparse
+import numpy
+#import svmutil
+
+def init():
+ lrand.seed(1982)
+ robjects.r('library(splines)')
+ robjects.r('library(stats4)')
+ robjects.r('library(survival)')
+ robjects.r('library(mvtnorm)')
+ robjects.r('library(modeltools)')
+ robjects.r('library(coin)')
+ robjects.r('library(MASS)')
+
+def get_class_means(class_sl,feats):
+ means = {}
+ clk = list(class_sl.keys())
+ for fk,f in feats.items():
+ means[fk] = [numpy.mean((f[class_sl[k][0]:class_sl[k][1]])) for k in clk]
+ return clk,means
+
+def save_res(res,filename):
+ with open(filename, 'w') as out:
+ for k,v in res['cls_means'].items():
+ out.write(k+"\t"+str(math.log(max(max(v),1.0),10.0))+"\t")
+ if k in res['lda_res_th']:
+ for i,vv in enumerate(v):
+ if vv == max(v):
+ out.write(str(res['cls_means_kord'][i])+"\t")
+ break
+ out.write(str(res['lda_res'][k]))
+ else: out.write("\t")
+ out.write( "\t" + (res['wilcox_res'][k] if 'wilcox_res' in res and k in res['wilcox_res'] else "-")+"\n")
+
+def load_data(input_file, nnorm = False):
+ with open(input_file, 'rb') as inputf:
+ inp = pickle.load(inputf)
+ if nnorm: return inp['feats'],inp['cls'],inp['class_sl'],inp['subclass_sl'],inp['class_hierarchy'],inp['norm']
+ else: return inp['feats'],inp['cls'],inp['class_sl'],inp['subclass_sl'],inp['class_hierarchy']
+
+def load_res(input_file):
+ with open(input_file, 'rb') as inputf:
+ inp = pickle.load(inputf)
+ return inp['res'],inp['params'],inp['class_sl'],inp['subclass_sl']
+
+
+def test_kw_r(cls,feats,p,factors):
+ robjects.globalenv["y"] = robjects.FloatVector(feats)
+ for i,f in enumerate(factors):
+ robjects.globalenv['x'+str(i+1)] = robjects.FactorVector(robjects.StrVector(cls[f]))
+ fo = "y~x1"
+ #for i,f in enumerate(factors[1:]):
+ # if f == "subclass" and len(set(cls[f])) <= len(set(cls["class"])): continue
+ # if len(set(cls[f])) == len(cls[f]): continue
+ # fo += "+x"+str(i+2)
+ kw_res = robjects.r('kruskal.test('+fo+',)$p.value')
+ return float(tuple(kw_res)[0]) < p, float(tuple(kw_res)[0])
+
+def test_rep_wilcoxon_r(sl,cl_hie,feats,th,multiclass_strat,mul_cor,fn,min_c,comp_only_same_subcl,curv=False):
+ comp_all_sub = not comp_only_same_subcl
+ tot_ok = 0
+ alpha_mtc = th
+ all_diff = []
+ for pair in [(x,y) for x in cl_hie.keys() for y in cl_hie.keys() if x < y]:
+ dir_cmp = "not_set" #
+ l_subcl1, l_subcl2 = (len(cl_hie[pair[0]]), len(cl_hie[pair[1]]))
+ if mul_cor != 0: alpha_mtc = th*l_subcl1*l_subcl2 if mul_cor == 2 else 1.0-math.pow(1.0-th,l_subcl1*l_subcl2)
+ ok = 0
+ curv_sign = 0
+ first = True
+ for i,k1 in enumerate(cl_hie[pair[0]]):
+ br = False
+ for j,k2 in enumerate(cl_hie[pair[1]]):
+ if not comp_all_sub and k1[len(pair[0]):] != k2[len(pair[1]):]:
+ ok += 1
+ continue
+ cl1 = feats[sl[k1][0]:sl[k1][1]]
+ cl2 = feats[sl[k2][0]:sl[k2][1]]
+ med_comp = False
+ if len(cl1) < min_c or len(cl2) < min_c:
+ med_comp = True
+ sx,sy = numpy.median(cl1),numpy.median(cl2)
+ if cl1[0] == cl2[0] and len(set(cl1)) == 1 and len(set(cl2)) == 1:
+ tres, first = False, False
+ elif not med_comp:
+ robjects.globalenv["x"] = robjects.FloatVector(cl1+cl2)
+ robjects.globalenv["y"] = robjects.FactorVector(robjects.StrVector(["a" for a in cl1]+["b" for b in cl2]))
+ pv = float(robjects.r('pvalue(wilcox_test(x~y,data=data.frame(x,y)))')[0])
+ tres = pv < alpha_mtc*2.0
+ if first:
+ first = False
+ if not curv and ( med_comp or tres ):
+ dir_cmp = sx < sy
+ #if sx == sy: br = True
+ elif curv:
+ dir_cmp = None
+ if med_comp or tres:
+ curv_sign += 1
+ dir_cmp = sx < sy
+ else: br = True
+ elif not curv and med_comp:
+ if ((sx < sy) != dir_cmp or sx == sy): br = True
+ elif curv:
+ if tres and dir_cmp == None:
+ curv_sign += 1
+ dir_cmp = sx < sy
+ if tres and dir_cmp != (sx < sy):
+ br = True
+ curv_sign = -1
+ elif not tres or (sx < sy) != dir_cmp or sx == sy: br = True
+ if br: break
+ ok += 1
+ if br: break
+ if curv: diff = curv_sign > 0
+ else: diff = (ok == len(cl_hie[pair[1]])*len(cl_hie[pair[0]])) # or (not comp_all_sub and dir_cmp != "not_set")
+ if diff: tot_ok += 1
+ if not diff and multiclass_strat: return False
+ if diff and not multiclass_strat: all_diff.append(pair)
+ if not multiclass_strat:
+ tot_k = len(list(cl_hie.keys()))
+ for k in cl_hie.keys():
+ nk = 0
+ for a in all_diff:
+ if k in a: nk += 1
+ if nk == tot_k-1: return True
+ return False
+ return True
+
+
+
+def contast_within_classes_or_few_per_class(feats,inds,min_cl,ncl):
+ ff = list(zip(*[v for n,v in feats.items() if n != 'class']))
+ cols = [ff[i] for i in inds]
+ cls = [feats['class'][i] for i in inds]
+ if len(set(cls)) < ncl:
+ return True
+ for c in set(cls):
+ if cls.count(c) < min_cl:
+ return True
+ cols_cl = [x for i,x in enumerate(cols) if cls[i] == c]
+ for i,col in enumerate(zip(*cols_cl)):
+ if (len(set(col)) <= min_cl and min_cl > 1) or (min_cl == 1 and len(set(col)) <= 1):
+ return True
+ return False
+
+def test_lda_r(cls,feats,cl_sl,boots,fract_sample,lda_th,tol_min,nlogs):
+ fk = list(feats.keys())
+ means = dict([(k,[]) for k in feats.keys()])
+ feats['class'] = list(cls['class'])
+ clss = list(set(feats['class']))
+
+ for uu,k in enumerate(fk):
+ if k == 'class':
+ continue
+
+ ff = [(feats['class'][i],v) for i,v in enumerate(feats[k])]
+
+ for c in clss:
+ if len(set([float(v[1]) for v in ff if v[0] == c])) > max(float(feats['class'].count(c))*0.5,4):
+ continue
+
+ for i,v in enumerate(feats[k]):
+ if feats['class'][i] == c:
+ feats[k][i] = math.fabs(feats[k][i] + lrand.normalvariate(0.0,max(feats[k][i]*0.05,0.01)))
+
+ rdict = {}
+
+ for a,b in feats.items():
+ if a == 'class' or a == 'subclass' or a == 'subject':
+ rdict[a] = robjects.StrVector(b)
+ else:
+ rdict[a] = robjects.FloatVector(b)
+
+ robjects.globalenv["d"] = robjects.DataFrame(rdict)
+ lfk = len(feats[fk[0]])
+ rfk = int(float(len(feats[fk[0]]))*fract_sample)
+ f = "class ~ "+fk[0]
+
+ for k in fk[1:]:
+ f += " + " + k.strip()
+
+ ncl = len(set(cls['class']))
+ min_cl = int(float(min([cls['class'].count(c) for c in set(cls['class'])]))*fract_sample*fract_sample*0.5)
+ min_cl = max(min_cl,1)
+ pairs = [(a,b) for a in set(cls['class']) for b in set(cls['class']) if a > b]
+
+ for k in fk:
+ for i in range(boots):
+ means[k].append([])
+
+ for i in range(boots):
+ for rtmp in range(1000):
+ rand_s = [lrand.randint(0,lfk-1) for v in range(rfk)]
+ if not contast_within_classes_or_few_per_class(feats,rand_s,min_cl,ncl):
+ break
+
+ rand_s = [r+1 for r in rand_s]
+ means[k][i] = []
+
+ for p in pairs:
+ robjects.globalenv["rand_s"] = robjects.IntVector(rand_s)
+ robjects.globalenv["sub_d"] = robjects.r('d[rand_s,]')
+ z = robjects.r('z <- suppressWarnings(lda(as.formula('+f+'),data=sub_d,tol='+str(tol_min)+'))')
+ robjects.r('w <- z$scaling[,1]')
+ robjects.r('w.unit <- w/sqrt(sum(w^2))')
+ robjects.r('ss <- sub_d[,-match("class",colnames(sub_d))]')
+
+ if 'subclass' in feats:
+ robjects.r('ss <- ss[,-match("subclass",colnames(ss))]')
+
+ if 'subject' in feats:
+ robjects.r('ss <- ss[,-match("subject",colnames(ss))]')
+
+ robjects.r('xy.matrix <- as.matrix(ss)')
+ robjects.r('LD <- xy.matrix%*%w.unit')
+ robjects.r('effect.size <- abs(mean(LD[sub_d[,"class"]=="'+p[0]+'"]) - mean(LD[sub_d[,"class"]=="'+p[1]+'"]))')
+ scal = robjects.r('wfinal <- w.unit * effect.size')
+ rres = robjects.r('mm <- z$means')
+ rowns = list(rres.rownames)
+ lenc = len(list(rres.colnames))
+ coeff = [abs(float(v)) if not math.isnan(float(v)) else 0.0 for v in scal]
+ res = dict([(pp,[float(ff) for ff in rres.rx(pp,True)] if pp in rowns else [0.0]*lenc ) for pp in [p[0],p[1]]])
+
+ for j,k in enumerate(fk):
+ gm = abs(res[p[0]][j] - res[p[1]][j])
+ means[k][i].append((gm+coeff[j])*0.5)
+
+ res = {}
+
+ for k in fk:
+ m = max([numpy.mean([means[k][kk][p] for kk in range(boots)]) for p in range(len(pairs))])
+ res[k] = math.copysign(1.0,m)*math.log(1.0+math.fabs(m),10)
+
+ return res,dict([(k,x) for k,x in res.items() if math.fabs(x) > lda_th])
+
+
+def test_svm(cls,feats,cl_sl,boots,fract_sample,lda_th,tol_min,nsvm):
+ return None
+"""
+ fk = feats.keys()
+ clss = list(set(cls['class']))
+ y = [clss.index(c)*2-1 for c in list(cls['class'])]
+ xx = [feats[f] for f in fk]
+ if nsvm:
+ maxs = [max(v) for v in xx]
+ mins = [min(v) for v in xx]
+ x = [ dict([(i+1,(v-mins[i])/(maxs[i]-mins[i])) for i,v in enumerate(f)]) for f in zip(*xx)]
+ else: x = [ dict([(i+1,v) for i,v in enumerate(f)]) for f in zip(*xx)]
+
+ lfk = len(feats[fk[0]])
+ rfk = int(float(len(feats[fk[0]]))*fract_sample)
+ mm = []
+
+ best_c = svmutil.svm_ms(y, x, [pow(2.0,i) for i in range(-5,10)],'-t 0 -q')
+ for i in range(boots):
+ rand_s = [lrand.randint(0,lfk-1) for v in range(rfk)]
+ r = svmutil.svm_w([y[yi] for yi in rand_s], [x[xi] for xi in rand_s], best_c,'-t 0 -q')
+ mm.append(r[:len(fk)])
+ m = [numpy.mean(v) for v in zip(*mm)]
+ res = dict([(v,m[i]) for i,v in enumerate(fk)])
+ return res,dict([(k,x) for k,x in res.items() if math.fabs(x) > lda_th])
+"""
=====================================
lefse/lefse.pyc
=====================================
Binary files /dev/null and b/lefse/lefse.pyc differ
=====================================
lefse2circlader.py → lefse/lefse2circlader.py
=====================================
@@ -1,6 +1,4 @@
-#!/usr/bin/env python
-
-from __future__ import with_statement
+#!/usr/bin/env python3
import sys
import os
@@ -19,7 +17,8 @@ def read_params(args):
return vars(parser.parse_args())
-def lefse2circlader(par):
+def lefse2circlader():
+ par = read_params(sys.argv)
finp,fout = bool(par['inp_f']), bool(par['out_f'])
with open(par['inp_f']) if finp else sys.stdin as inpf:
@@ -36,8 +35,7 @@ def lefse2circlader(par):
out_file.write( "\t".join( c ) + "\n" )
if __name__ == '__main__':
- params = read_params(sys.argv)
- lefse2circlader(params)
+ lefse2circlader()
=====================================
lefse/lefse_format_input.py
=====================================
@@ -0,0 +1,464 @@
+#!/usr/bin/env python3
+
+import sys,os,argparse,pickle,re,numpy
+
+import functools
+from lefsebiom.ConstantsBreadCrumbs import *
+from lefsebiom.AbundanceTable import *
+
+#***************************************************************************************************************
+#* Log of change *
+#* January 16, 2014 - George Weingart - george.weingart at gmail.com *
+#* *
+#* biom Support *
+#* Modified the program to enable it to accept biom files as input *
+#* *
+#* Added two optional input parameters: *
+#* 1. biom_c is the name of the biom metadata to be used as class *
+#* 2. biom_s is the name of the biom metadata to be used as subclass *
+#* class and subclass are used in the same context as the original *
+#* parameters class and subclass *
+#* These parameters are totally optional, the default is the program *
+#* chooses as class the first metadata received from the conversion *
+#* of the biom file into a sequential (pcl) file as generated by *
+#* breadcrumbs, and similarly, the second metadata is selected as *
+#* subclass. *
+#* The syntax or logic for the original non-biom case was NOT changed. *
+#* *
+#* <******************* IMPORTANT NOTE *************************> *
+#* The biom case requires breadcrumbs and therefore there is a *
+#* a conditional import of the breadcrumbs modules *
+#* If the User uses a biom input and breadcrumbs is not detected, *
+#* the run is abnormally ended *
+#* breadcrumbs itself needs a biom environment, so if the immport *
+#* of biom in breadcrumbs fails, the run is also abnormally
+#* ended (Only if the input file was biom) *
+#* *
+#* USAGE EXAMPLES *
+#* -------------- *
+#* Case #1: Using a sequential file as input (Old version - did not change *
+#* ./format_input.py hmp_aerobiosis_small.txt hmp_aerobiosis_small.in -c 1 -s 2 -u 3 -o 1000000 *
+#* Case #2: Using a biom file as input *
+#* ./format_input.py hmp_aerobiosis_small.biom hmp_aerobiosis_small.in -o 1000000 *
+#* Case #3: Using a biom file as input and override the class and subclass *
+#* ./format_input.py lefse.biom hmp_aerobiosis_small.in -biom_c oxygen_availability -biom_s body_site -o 1000000
+#* *
+#***************************************************************************************************************
+
+def read_input_file(inp_file, CommonArea):
+
+ if inp_file.endswith('.biom'): #* If the file format is biom:
+ CommonArea = biom_processing(inp_file) #* Process in biom format
+ return CommonArea #* And return the CommonArea
+
+ with open(inp_file) as inp:
+ CommonArea['ReturnedData'] = [[v.strip() for v in line.strip().split("\t")] for line in inp.readlines()]
+ return CommonArea
+
+def transpose(data):
+ return list(zip(*data))
+
+def read_params(args):
+ parser = argparse.ArgumentParser(description='LEfSe formatting modules')
+ parser.add_argument('input_file', metavar='INPUT_FILE', type=str, help="the input file, feature hierarchical level can be specified with | or . and those symbols must not be present for other reasons in the input file.")
+ parser.add_argument('output_file', metavar='OUTPUT_FILE', type=str,
+ help="the output file containing the data for LEfSe")
+ parser.add_argument('--output_table', type=str, required=False, default="",
+ help="the formatted table in txt format")
+ parser.add_argument('-f',dest="feats_dir", choices=["c","r"], type=str, default="r",
+ help="set whether the features are on rows (default) or on columns")
+ parser.add_argument('-c',dest="class", metavar="[1..n_feats]", type=int, default=1,
+ help="set which feature use as class (default 1)")
+ parser.add_argument('-s',dest="subclass", metavar="[1..n_feats]", type=int, default=None,
+ help="set which feature use as subclass (default -1 meaning no subclass)")
+ parser.add_argument('-o',dest="norm_v", metavar="float", type=float, default=-1.0,
+ help="set the normalization value (default -1.0 meaning no normalization)")
+ parser.add_argument('-u',dest="subject", metavar="[1..n_feats]", type=int, default=None,
+ help="set which feature use as subject (default -1 meaning no subject)")
+ parser.add_argument('-m',dest="missing_p", choices=["f","s"], type=str, default="d",
+ help="set the policy to adopt with missing values: f removes the features with missing values, s removes samples with missing values (default f)")
+ parser.add_argument('-n',dest="subcl_min_card", metavar="int", type=int, default=10,
+ help="set the minimum cardinality of each subclass (subclasses with low cardinalities will be grouped together, if the cardinality is still low, no pairwise comparison will be performed with them)")
+
+ parser.add_argument('-biom_c',dest="biom_class", type=str,
+ help="For biom input files: Set which feature use as class ")
+ parser.add_argument('-biom_s',dest="biom_subclass", type=str,
+ help="For biom input files: set which feature use as subclass ")
+
+ args = parser.parse_args()
+
+ return vars(args)
+
+def remove_missing(data,roc):
+ if roc == "c": data = transpose(data)
+ max_len = max([len(r) for r in data])
+ to_rem = []
+ for i,r in enumerate(data):
+ if len([v for v in r if not( v == "" or v.isspace())]) < max_len: to_rem.append(i)
+ if len(to_rem):
+ for i in to_rem.reverse():
+ data.pop(i)
+ if roc == "c": return transpose(data)
+ return data
+
+
+def sort_by_cl(data,n,c,s,u):
+ def sort_lines1(a,b):
+ return int(a[c] > b[c])*2-1
+
+ def sort_lines2u(a,b):
+ if a[c] != b[c]:
+ return int(a[c] > b[c])*2-1
+
+ return int(a[u] > b[u])*2-1
+
+ def sort_lines2s(a,b):
+ if a[c] != b[c]:
+ return int(a[c] > b[c])*2-1
+
+ return int(a[s] > b[s])*2-1
+
+ def sort_lines3(a,b):
+ if a[c] != b[c]:
+ return int(a[c] > b[c])*2-1
+
+ if a[s] != b[s]:
+ return int(a[s] > b[s])*2-1
+
+ return int(a[u] > b[u])*2-1
+
+ if n == 3:
+ data.sort(key = functools.cmp_to_key(lambda a,b: sort_lines3(a,b)))
+
+ if n == 2:
+ if s is None:
+ data.sort(key = functools.cmp_to_key(lambda a,b: sort_lines2u(a,b)))
+ else:
+ data.sort(key = functools.cmp_to_key(lambda a,b: sort_lines2s(a,b)))
+
+ if n == 1:
+ data.sort(key = functools.cmp_to_key(lambda a,b: sort_lines1(a,b)))
+
+ return data
+
+def group_small_subclasses(cls,min_subcl):
+ last = ""
+ n = 0
+ repl = []
+ dd = [list(cls['class']),list(cls['subclass'])]
+ for d in dd:
+ if d[1] != last:
+ if n < min_subcl and last != "":
+ repl.append(d[1])
+ last = d[1]
+ n = 1
+ for i,d in enumerate(dd):
+ if d[1] in repl: dd[i][1] = "other"
+ dd[i][1] = str(dd[i][0])+"_"+str(dd[i][1])
+ cls['class'] = dd[0]
+ cls['subclass'] = dd[1]
+ return cls
+
+def get_class_slices(data):
+ previous_class = data[0][0]
+ previous_subclass = data[0][1]
+ subclass_slices = []
+ class_slices = []
+ last_cl = 0
+ last_subcl = 0
+ class_hierarchy = []
+ subcls = []
+ for i,d in enumerate(data):
+ if d[1] != previous_subclass:
+ subclass_slices.append((previous_subclass,(last_subcl,i)))
+ last_subcl = i
+ subcls.append(previous_subclass)
+ if d[0] != previous_class:
+ class_slices.append((previous_class,(last_cl,i)))
+ class_hierarchy.append((previous_class,subcls))
+ subcls = []
+ last_cl = i
+ previous_subclass = d[1]
+ previous_class = d[0]
+ subclass_slices.append((previous_subclass,(last_subcl,i+1)))
+ subcls.append(previous_subclass)
+ class_slices.append((previous_class,(last_cl,i+1)))
+ class_hierarchy.append((previous_class,subcls))
+ return dict(class_slices), dict(subclass_slices), dict(class_hierarchy)
+
+def numerical_values(feats,norm):
+ mm = []
+ for k,v in feats.items():
+ feats[k] = [float(val) for val in v]
+ if norm < 0.0: return feats
+ tr = list(zip(*(list(feats.values()))))
+ mul = []
+ fk = list(feats.keys())
+ hie = True if sum([k.count(".") for k in fk]) > len(fk) else False
+ for i in range(len(list(feats.values())[0])):
+ if hie: mul.append(sum([t for j,t in enumerate(tr[i]) if fk[j].count(".") < 1 ]))
+ else: mul.append(sum(tr[i]))
+ if hie and sum(mul) == 0:
+ mul = []
+ for i in range(len(list(feats.values())[0])):
+ mul.append(sum(tr[i]))
+ for i,m in enumerate(mul):
+ if m == 0: mul[i] = 0.0
+ else: mul[i] = float(norm) / m
+ for k,v in feats.items():
+ feats[k] = [val*mul[i] for i,val in enumerate(v)]
+ if numpy.mean(feats[k]) and (numpy.std(feats[k])/numpy.mean(feats[k])) < 1e-10:
+ feats[k] = [ float(round(kv*1e6)/1e6) for kv in feats[k]]
+ return feats
+
+def add_missing_levels2(ff):
+
+ if sum( [f.count(".") for f in ff] ) < 1: return ff
+
+ dn = {}
+
+ added = True
+ while added:
+ added = False
+ for f in ff:
+ lev = f.count(".")
+ if lev == 0: continue
+ if lev not in dn: dn[lev] = [f]
+ else: dn[lev].append(f)
+ for fn in sorted(dn,reverse=True):
+ for f in dn[fn]:
+ fc = ".".join(f.split('.')[:-1])
+ if fc not in ff:
+ ab_all = [ff[fg] for fg in ff if (fg.count(".") == 0 and fg == fc) or (fg.count(".") > 0 and fc == ".".join(fg.split('.')[:-1]))]
+ ab =[]
+ for l in [f for f in zip(*ab_all)]:
+ ab.append(sum([float(ll) for ll in l]))
+ ff[fc] = ab
+ added = True
+ if added:
+ break
+
+ return ff
+
+
+def add_missing_levels(ff):
+ if sum( [f.count(".") for f in ff] ) < 1: return ff
+
+ clades2leaves = {}
+ for f in ff:
+ fs = f.split(".")
+ if len(fs) < 2:
+ continue
+ for l in range(len(fs)):
+ n = ".".join( fs[:l] )
+ if n in clades2leaves:
+ clades2leaves[n].append( f )
+ else:
+ clades2leaves[n] = [f]
+ for k,v in clades2leaves.items():
+ if k and k not in ff:
+ ff[k] = [sum(a) for a in zip(*[[float(fn) for fn in ff[vv]] for vv in v])]
+ return ff
+
+
+def modify_feature_names(fn):
+ ret = fn
+
+ for v in [' ',r'\$',r'\@',r'#',r'%',r'\^',r'\&',r'\*',r'\"',r'\'']:
+ ret = [re.sub(v,"",f) for f in ret]
+
+ for v in ["/",r'\(',r'\)',r'-',r'\+',r'=',r'{',r'}',r'\[',r'\]',
+ r',',r'\.',r';',r':',r'\?',r'\<',r'\>',r'\.',r'\,']:
+ ret = [re.sub(v,"_",f) for f in ret]
+
+ for v in ["\|"]:
+ ret = [re.sub(v,".",f) for f in ret]
+
+ ret2 = []
+ for r in ret:
+ if r[0] in ['0','1','2','3','4','5','6','7','8','9','_']:
+ ret2.append("f_"+r)
+ else:
+ ret2.append(r)
+
+ return ret2
+
+
+def rename_same_subcl(cl,subcl):
+ toc = []
+ for sc in set(subcl):
+ if len(set([cl[i] for i in range(len(subcl)) if sc == subcl[i]])) > 1:
+ toc.append(sc)
+ new_subcl = []
+ for i,sc in enumerate(subcl):
+ if sc in toc: new_subcl.append(cl[i]+"_"+sc)
+ else: new_subcl.append(sc)
+ return new_subcl
+
+
+#*************************************************************************************
+#* Modifications by George Weingart, Jan 15, 2014 *
+#* If the input file is biom: *
+#* a. Load an AbundanceTable (Using breadcrumbs) *
+#* b. Create a sequential file from the AbundanceTable (de-facto - pcl) *
+#* c. Use that file as input to the rest of the program *
+#* d. Calculate the c,s,and u parameters, either from the values the User entered *
+#* from the meta data values in the biom file or set up defaults *
+#* <<<------------- I M P O R T A N T N O T E ------------------->> *
+#* breadcrumbs src directory must be included in the PYTHONPATH *
+#* <<<------------- I M P O R T A N T N O T E ------------------->> *
+#*************************************************************************************
+def biom_processing(inp_file):
+ CommonArea = dict() #* Set up a dictionary to return
+ CommonArea['abndData'] = AbundanceTable.funcMakeFromFile(inp_file, #* Create AbundanceTable from input biom file
+ cDelimiter = None,
+ sMetadataID = None,
+ sLastMetadataRow = None,
+ sLastMetadata = None,
+ strFormat = None)
+
+ #****************************************************************
+ #* Building the data element here *
+ #****************************************************************
+ ResolvedData = list() #This is the Resolved data that will be returned
+ IDMetadataName = CommonArea['abndData'].funcGetIDMetadataName() #* ID Metadataname
+ IDMetadata = [CommonArea['abndData'].funcGetIDMetadataName()] #* The first Row
+ IDMetadata.extend([IDMetadataEntry for IDMetadataEntry in CommonArea['abndData'].funcGetMetadataCopy()[IDMetadataName]]) #* Loop on all the metadata values
+
+ ResolvedData.append(IDMetadata) #Add the IDMetadata with all its values to the resolved area
+ for key, value in CommonArea['abndData'].funcGetMetadataCopy().items():
+ if key != IDMetadataName:
+ MetadataEntry = [key] + value #* Set it up
+ ResolvedData.append(MetadataEntry)
+ for AbundanceDataEntry in CommonArea['abndData'].funcGetAbundanceCopy(): #* The Abundance Data
+ lstAbundanceDataEntry = list(AbundanceDataEntry) #Convert tuple to list
+ ResolvedData.append(lstAbundanceDataEntry) #Append the list to the metadata list
+ CommonArea['ReturnedData'] = ResolvedData #Post the results
+ return CommonArea
+
+
+#*******************************************************************************
+#* Check the params and override in the case of biom *
+#*******************************************************************************
+def check_params_for_biom_case(params, CommonArea):
+ CommonArea['MetadataNames'] = list() #Metadata names
+ params['original_class'] = params['class'] #Save the original class
+ params['original_subclass'] = params['subclass'] #Save the original subclass
+ params['original_subject'] = params['subject'] #Save the original subclass
+
+
+ TotalMetadataEntriesAndIDInBiomFile = len(CommonArea['abndData'].funcGetMetadataCopy()) # The number of metadata entries
+ for i in range(0,TotalMetadataEntriesAndIDInBiomFile): #* Populate the meta data names table
+ CommonArea['MetadataNames'].append(CommonArea['ReturnedData'][i][0]) #Add the metadata name
+
+
+ #****************************************************
+ #* Setting the params here *
+ #****************************************************
+
+ if TotalMetadataEntriesAndIDInBiomFile > 0: #If there is at least one entry - has to be the subject
+ params['subject'] = 1
+ if TotalMetadataEntriesAndIDInBiomFile == 2: #If there are 2 - The first is the subject and the second has to be the metadata, and that is the class
+ params['class'] = 2
+ if TotalMetadataEntriesAndIDInBiomFile == 3: #If there are 3: Set up default that the second entry is the class and the third is the subclass
+ params['class'] = 2
+ params['subclass'] = 3
+ FlagError = False #Set up error flag
+
+ if not params['biom_class'] is None and not params['biom_subclass'] is None: #Check if the User passed a valid class and subclass
+ if params['biom_class'] in CommonArea['MetadataNames']:
+ params['class'] = CommonArea['MetadataNames'].index(params['biom_class'])+1 #* Set up the index for that metadata
+ else:
+ FlagError = True
+ if params['biom_subclass'] in CommonArea['MetadataNames']:
+ params['subclass'] = CommonArea['MetadataNames'].index(params['biom_subclass'])+1 #* Set up the index for that metadata
+ else:
+ FlagError = True
+ if FlagError == True: #* If the User passed an invalid class
+ print("**Invalid biom class or subclass passed - Using defaults: First metadata=class, Second Metadata=subclass\n")
+ params['class'] = 2
+ params['subclass'] = 3
+ return params
+
+def format_input():
+ CommonArea = dict() #Build a Common Area to pass variables in the biom case
+ params = read_params(sys.argv)
+
+ if type(params['subclass']) is int and int(params['subclass']) < 1:
+ params['subclass'] = None
+ if type(params['subject']) is int and int(params['subject']) < 1:
+ params['subject'] = None
+
+
+ CommonArea = read_input_file(sys.argv[1], CommonArea) #Pass The CommonArea to the Read
+ data = CommonArea['ReturnedData'] #Select the data
+
+ if sys.argv[1].endswith('biom'): #* Check if biom:
+ params = check_params_for_biom_case(params, CommonArea) #Check the params for the biom case
+
+ if params['feats_dir'] == "c":
+ data = transpose(data)
+
+ ncl = 1
+ if not params['subclass'] is None: ncl += 1
+ if not params['subject'] is None: ncl += 1
+
+ first_line = list(zip(*data))[0]
+
+ first_line = modify_feature_names(list(first_line))
+
+ data = list(zip( first_line,
+ *sort_by_cl(list(zip(*data))[1:],
+ ncl,
+ params['class']-1,
+ params['subclass']-1 if not params['subclass'] is None else None,
+ params['subject']-1 if not params['subject'] is None else None)))
+# data.insert(0,first_line)
+# data = remove_missing(data,params['missing_p'])
+ cls = {}
+
+ cls_i = [('class',params['class']-1)]
+ if params['subclass'] is not None and params['subclass'] > 0:
+ cls_i.append(('subclass',params['subclass']-1))
+
+ if params['subject'] is not None and params['subject'] > 0:
+ cls_i.append(('subject',params['subject']-1))
+
+ cls_i.sort(key = functools.cmp_to_key(lambda x,y: -((x[1] > y[1]) - (x[1] < y[1]))))
+
+ for v in cls_i:
+ cls[v[0]] = data.pop(v[1])[1:]
+
+ if params['subclass'] is None:
+ cls['subclass'] = [str(cl)+"_subcl" for cl in cls['class']]
+
+ cls['subclass'] = rename_same_subcl(cls['class'],cls['subclass'])
+# if 'subclass' in cls.keys(): cls = group_small_subclasses(cls,params['subcl_min_card'])
+ class_sl,subclass_sl,class_hierarchy = get_class_slices(list(zip(cls['class'], cls['subclass'], cls['subject'])))
+
+ feats = dict([(d[0],d[1:]) for d in data])
+
+ feats = add_missing_levels(feats)
+
+ feats = numerical_values(feats,params['norm_v'])
+ out = {}
+ out['feats'] = feats
+ out['norm'] = params['norm_v']
+ out['cls'] = cls
+ out['class_sl'] = class_sl
+ out['subclass_sl'] = subclass_sl
+ out['class_hierarchy'] = class_hierarchy
+
+ if params['output_table']:
+ with open( params['output_table'], "w") as outf:
+ if 'class' in cls: outf.write( "\t".join(list(["class"])+list(cls['class'])) + "\n" )
+ if 'subclass' in cls: outf.write( "\t".join(list(["subclass"])+list(cls['subclass'])) + "\n" )
+ if 'subject' in cls: outf.write( "\t".join(list(["subject"])+list(cls['subject'])) + "\n" )
+ for k,v in out['feats'].items(): outf.write( "\t".join([k]+[str(vv) for vv in v]) + "\n" )
+
+ with open(params['output_file'], 'wb') as back_file:
+ pickle.dump(out,back_file)
+
+
+if __name__ == '__main__':
+ format_input()
\ No newline at end of file
=====================================
lefse/lefse_plot_cladogram.py
=====================================
@@ -0,0 +1,349 @@
+#!/usr/bin/env python3
+
+import os,sys,matplotlib,argparse,string
+# from io import StringIO
+matplotlib.use('Agg')
+from pylab import *
+from lefse.lefse import *
+import numpy as np
+
+colors = ['r','g','b','m','c',[1.0,0.5,0.0],[0.0,1.0,0.0],[0.33,0.125,0.0],[0.75,0.75,0.75],'k']
+dark_colors = [[0.4,0.0,0.0],[0.0,0.2,0.0],[0.0,0.0,0.4],'m','c',[1.0,0.5,0.0],[0.0,1.0,0.0],[0.33,0.125,0.0],[0.75,0.75,0.75],'k']
+
+class CladeNode:
+ def __init__(self, name, abundance, viz = True):
+ self.id = name
+ self.name = name.split(".")
+ self.last_name = self.name[-1]
+ self.abundance = abundance
+ self.pos = (-1.0,-1.0)
+ self.children = {}
+ self.isleaf = True
+ self.color = 'y'
+ self.next_leaf = -1
+ self.prev_leaf = -1
+ self.viz = viz
+ def __repr__(self):
+ return self.last_name
+ def add_child(self,node):
+ self.isleaf = False
+ self.children[node.__repr__()] = node
+ def get_children(self):
+ ck = sorted(self.children.keys())
+ return [self.children[k] for k in ck]
+ def get_color(self):
+ return self.color
+ def set_color(self,c):
+ self.color = c
+ def set_pos(self,pos):
+ self.pos = pos
+
+def read_params(args):
+ parser = argparse.ArgumentParser(description='Cladoplot')
+ parser.add_argument('input_file', metavar='INPUT_FILE', type=str, help="tab delimited input file")
+ parser.add_argument('output_file', metavar='OUTPUT_FILE', type=str, help="the file for the output image")
+ parser.add_argument('--clade_sep',dest="clade_sep", type=float, default=1.5)
+ parser.add_argument('--max_lev',dest="max_lev", type=int, default=-1)
+ parser.add_argument('--max_point_size',dest="max_point_size", type=float, default=6.0)
+ parser.add_argument('--min_point_size',dest="min_point_size", type=float, default=1)
+ parser.add_argument('--point_edge_width',dest="markeredgewidth", type=float, default=.25)
+ parser.add_argument('--siblings_connector_width',dest="siblings_connector_width", type=float, default=2)
+ parser.add_argument('--parents_connector_width',dest="parents_connector_width", type=float, default=0.75)
+ parser.add_argument('--radial_start_lev',dest="radial_start_lev", type=int, default=1)
+ parser.add_argument('--labeled_start_lev',dest="labeled_start_lev", type=int, default=2)
+ parser.add_argument('--labeled_stop_lev',dest="labeled_stop_lev", type=int, default=5)
+ parser.add_argument('--abrv_start_lev',dest="abrv_start_lev", type=int, default=3)
+ parser.add_argument('--abrv_stop_lev',dest="abrv_stop_lev", type=int, default=5)
+ parser.add_argument('--expand_void_lev',dest="expand_void_lev", type=int, default=1)
+ parser.add_argument('--class_legend_vis',dest="class_legend_vis", type=int, default=1)
+ parser.add_argument('--colored_connector',dest="colored_connectors", type=int, default=1)
+ parser.add_argument('--alpha',dest="alpha", type=float, default=0.2)
+ parser.add_argument('--title',dest="title", type=str, default="Cladogram")
+ parser.add_argument('--sub_clade',dest="sub_clade", type=str, default="")
+ parser.add_argument('--title_font_size',dest="title_font_size", type=str, default="14")
+ parser.add_argument('--right_space_prop',dest="r_prop", type=float, default=0.1)
+ parser.add_argument('--left_space_prop',dest="l_prop", type=float, default=0.1)
+ parser.add_argument('--label_font_size',dest="label_font_size", type=str, default="6")
+ parser.add_argument('--background_color',dest="back_color", type=str, choices=["k","w"], default="w", help="set the color of the background")
+ parser.add_argument('--colored_labels',dest="col_lab", type=int, choices=[0,1], default=1, help="draw the label with class color (1) or in black (0)")
+ parser.add_argument('--class_legend_font_size',dest="class_legend_font_size", type=str, default="10")
+ parser.add_argument('--dpi',dest="dpi", type=int, default=72)
+ parser.add_argument('--format', dest="format", choices=["png","svg","pdf"], default="svg", type=str, help="the format for the output file")
+ parser.add_argument('--all_feats', dest="all_feats", type=str, default="")
+ args = parser.parse_args()
+ return vars(args)
+
+def cmp_names(la,lb):
+ if len(la) != len(lb): return False
+ for p in [(a,b) for i,a in enumerate(la) for j,b in enumerate(lb) if i == j]:
+ if p[0] != p[1]: return False
+ return True
+
+def build_tree(father,all_nodes,l,depth,viz):
+ cc = [n for n in all_nodes if len(n.name) > len(father.name) and cmp_names(father.name,n.name[:len(father.name)])]
+ children = [n for n in cc if len(n.name) == len(father.name)+1]
+ if len(children) == 0 and l < depth -1: # !!!
+ nc = CladeNode(father.id+"."+father.id.split(".")[-1],1.0,viz)
+ father.add_child(nc)
+ children.append(nc)
+ for child in children:
+ build_tree(child,cc,l+1,depth,viz)
+ father.add_child(child)
+
+def get_all_nodes(father):
+ ret = [father]
+ children = father.get_children()
+ for c in children:
+ ret += get_all_nodes(c)
+ return ret
+
+def read_data(input_file,params):
+ with open(input_file, 'r') as inp:
+ if params['sub_clade'] == "":
+ rows = [line.strip().split()[:-1] for line in inp.readlines() if params['max_lev'] < 1 or line.split()[0].count(".") < params['max_lev']]
+ else: rows = [line.split(params['sub_clade']+".")[1].strip().split()[:-1] for line in inp.readlines() if ( params['max_lev'] < 1 or line.split()[0].count(".") < params['max_lev'] ) and line.startswith(params['sub_clade']+".")]
+ all_names = [lin[0] for lin in rows]
+ to_add = []
+
+ abundances = [float(v) for v in list(zip(*rows))[1] if float(v) >= 0.0]
+ tree = {}
+ tree['classes'] = list(set([v[2] for v in rows if len(v)>2]))
+ tree['classes'].sort()
+ all_nodes = [CladeNode("root."+row[0],float(row[1])) for row in rows]
+
+ depth = max([len(n.name) for n in all_nodes])
+
+ n2 = ["_".join(nn.name) for nn in all_nodes]
+ for i,nn in enumerate(all_nodes):
+ n = nn
+ while "_".join(n.name[:-1]) not in n2 and len(n.name) > 1:
+ n = CladeNode(".".join(n.name[:-1]),n.abundance)
+ all_nodes.append(n)
+ n2.append("_".join(n.name))
+
+ cls2 = []
+ if params['all_feats'] != "":
+ cls2 = sorted(params['all_feats'].split(":"))
+ for i,v in enumerate(rows):
+ if len(v)>2:
+ if len(cls2) > 0: all_nodes[i].set_color(colors[cls2.index(v[2])%len(colors)])
+ else:
+ if v[2].count('rgbcol') > 0:
+ ccc = [float(tt) for tt in v[2].split('_')[1:]]
+ all_nodes[i].set_color(ccc)
+ else: all_nodes[i].set_color(colors[sorted(tree['classes']).index(v[2])%len(colors)])
+ root = CladeNode("root",-1.0)
+ root.set_pos((0.0,0.0))
+
+ build_tree(root,all_nodes,0,depth,params['expand_void_lev']==1)
+
+ all_nodes = get_all_nodes(root)
+
+ tree['root'] = root
+ tree['max_abs'] = max(abundances)
+ tree['min_abs'] = min(abundances)
+ levs = []
+ for i in range(depth):
+ depthi = [n for n in all_nodes if len(n.name) == i+1]
+ levs.append(len(depthi))
+ tree['nlev'] = levs
+ return tree
+
+def add_all_pos(father,n,distn,seps,tsep,mlev,last_leaf=-1,nc=1):
+ children = father.get_children()
+ leaves = True if children[0].isleaf else False
+ for i,child in enumerate(children):
+ if leaves:
+ n += 1.0
+ men = 0.5 if len(children) == 1 else 0.0
+ child.set_pos((n*distn-men*float(distn)+tsep,(len(father.name))/float(mlev-1)))
+ if last_leaf != -1:
+ child.prev_leaf = last_leaf
+ last_leaf.next_leaf = child
+ last_leaf = child
+ else:
+ ln = n
+ ltsep = tsep
+ n,tsep,last_leaf = add_all_pos(child,n,distn,seps,tsep,mlev,last_leaf,len(children))
+ nn = (ln + n)*0.5*distn
+ ssep = (ltsep + tsep)*0.5
+ if n-ln == 1:
+ ssep = ltsep
+ child.set_pos((nn+ssep,(len(father.name))/float(mlev-1)))
+ tsep += seps[len(father.name)-1]
+ return n,tsep,last_leaf
+
+def plot_points(father,params,pt_scale,ax):
+ children = father.get_children()
+ children.sort(key = lambda a: -int(a.get_color() == 'y')*a.abundance)
+ x,r = father.pos[0], father.pos[1]
+ for i,child in enumerate(children):
+ xc,rc = plot_points(child,params,pt_scale,ax)
+ if not father.viz: return x,r
+ ps = pt_scale[0]+father.abundance/pt_scale[1]+pt_scale[0]
+ col = father.get_color()
+ pw = params['markeredgewidth'] if col == 'y' else params['markeredgewidth']*3.0
+ if x==0 and r==0: ax.plot(x,r, 'o',markersize=ps,color=col,markeredgewidth=0.01,markeredgecolor=params['fore_color'])
+ else: ax.plot(x,r, 'o',markersize=ps,color=col,markeredgewidth=pw,markeredgecolor=params['fore_color'])
+ return x,r
+
+def plot_lines(father,params,depth,ax,xf):
+ children = father.get_children()
+ x,r = father.pos[0], father.pos[1]
+ for i,child in enumerate(children):
+ xc,rc = plot_lines(child,params,depth,ax,x)
+ if i == 0: x_first, r_first = xc, rc
+ if len(father.name) >= depth-params['radial_start_lev']:
+ col = params['fore_color']
+ lw=params['parents_connector_width']
+ if not child.viz: continue
+ if father.get_color() != 'y' and father.get_color() == child.get_color() and params['colored_connectors']:
+ col = child.get_color()
+ lw *=2.5
+ if col != params['fore_color']:
+ ax.plot([x,xc],[r,rc],"-",color=params['fore_color'],lw=lw*1.5)
+ ax.plot([x,xc],[r,rc],"-",color=col,lw=lw)
+
+ if not father.viz or (len(children) == 1 and not children[0].viz): return x,r
+ if len(father.name) < depth-params['radial_start_lev']:
+ col = params['fore_color']
+ lw=params['parents_connector_width']
+ if father.get_color() != 'y':
+ f =True
+ for child in children:
+ if child.get_color() != father.get_color() or not params['colored_connectors']:
+ f = False
+ break
+ if f:
+ col = father.get_color()
+ lw *= 2.5
+ if not (x==0 and r==0):
+ xx = xc if len(children) > 0 else x
+ if len(children) == 0: rc = r
+ xt = x if len(children)>1 else xx
+ if col != params['fore_color']:
+ ax.plot([x,xt],[r,rc],"-",color=params['fore_color'],lw=lw*1.5)
+ ax.plot([x,xt],[r,rc],"-",color=col,lw=lw)
+ if len(children) > 0 and 1 < len(father.name) < depth-params['radial_start_lev']:
+ xs = arange(x_first,xc,0.01)
+ ys = [rc for t in xs]
+ ax.plot(xs,ys,"-",color=col,lw=params['siblings_connector_width'],markeredgecolor=params['fore_color'])
+ return x,r
+
+def uniqueid():
+ for l in string.ascii_lowercase: yield l
+ for l in string.ascii_lowercase:
+ for i in range(10):
+ yield l+str(i)
+ i = 0
+ while True:
+ yield str(i)
+ i += 1
+
+def plot_names(father,params,depth,ax,u_i,seps):
+ children = father.get_children()
+ l = len(father.name)
+ if len(children)==0:
+ if father.prev_leaf == -1 or father.next_leaf == -1:
+ fr_0, fr_1 = father.pos[0], father.pos[0]
+ else: fr_0, fr_1 = (father.pos[0]+father.prev_leaf.pos[0])*0.5, (father.pos[0]+father.next_leaf.pos[0])*0.5
+ for i,child in enumerate(children):
+ fr,to = plot_names(child,params,depth,ax,u_i,seps)
+ if i == 0: fr_0 = fr
+ fr_1 = to
+ if father.get_color() != 'y' and params['labeled_start_lev'] < l <= params['labeled_stop_lev']+1:
+ col = father.get_color()
+ dd = params['labeled_stop_lev'] - params['labeled_start_lev'] + 1
+ de = depth - 1
+ dim = 1.0/float(de)
+ perc_ext = 0.65 if dim > 0.1 else 1.0
+ clto = (de-l+1)*dim+dim*(dd+1-(l-dd-1))*perc_ext
+ clto = (de-l+1)*dim+dim*(dd-(l-params['labeled_start_lev'])+1)*perc_ext
+ des = float(180.0*(fr_0+fr_1)/np.pi)*0.5-90
+ lab = ""
+ txt = father.last_name
+ if params['abrv_start_lev'] < l <= params['abrv_stop_lev'] + 1:
+ ide = next(u_i)
+ lab = str(ide)+": "+father.last_name
+ txt = str(ide)
+# ax.bar(fr_0, clto, width = fr_1-fr_0, bottom = float(l-1)/float(depth-1), alpha = params['alpha'], color=col, edgecolor=col)
+ ax.bar(fr_0, clto, width = fr_1-fr_0, bottom = float(l-1)/float(de), alpha = params['alpha'], color=col, edgecolor=col)
+ ax.bar(0.0, 0.0, width = 0.0, bottom = 0.0, alpha = 1.0, color=col, edgecolor=params['fore_color'], label=lab)
+ if l <= params['abrv_stop_lev'] + 1:
+ if not params['col_lab']: col = params['fore_color']
+ else:
+ if col not in colors: col = params['fore_color']
+ else: col = dark_colors[colors.index(col)%len(dark_colors)]
+ ax.text((fr_0+fr_1)*0.5, clto+float(l-1)/float(de)-dim*perc_ext/2.0, txt, size = params['label_font_size'], rotation=des, ha ="center", va="center", color=col)
+ return fr_0, fr_1
+
+def draw_tree(out_file,tree,params):
+ plt_size = 7
+ nlev = tree['nlev']
+ pt_scale = (params['min_point_size'],max(1.0,((tree['max_abs']-tree['min_abs']))/(params['max_point_size']-params['min_point_size'])))
+ depth = len(nlev)
+ sep = (2.0*np.pi)/float(nlev[-1])
+ seps = [params['clade_sep']*sep/float(depth-i+1) for i in range(1,len(tree['nlev'])+1)]
+ totseps = sum([s*nlev[i] for i,s in enumerate(seps[:-1])])
+ clade_sep_err = True if totseps > np.pi else False
+ while totseps > np.pi:
+ params['clade_sep'] *= 0.75
+ seps = [params['clade_sep']*sep/(float(depth-i+1)*0.25) for i in range(1,len(tree['nlev'])+1)]
+ totseps = sum([s*nlev[i] for i,s in enumerate(seps[:-1])])
+ if clade_sep_err: print('clade_sep parameter too large, lowered to',params['clade_sep'])
+
+ fig = plt.figure(edgecolor=params['back_color'],facecolor=params['back_color'])
+ ax = fig.add_subplot(111, polar=True, frame_on=False, facecolor=params['back_color'] )
+ plt.subplots_adjust(right=1.0-params['r_prop'],left=params['l_prop'])
+ ax.grid(False)
+ xticks([])
+ yticks([])
+
+ ds = (2.0*np.pi-totseps)/float(nlev[-1])
+
+ add_all_pos(tree['root'],0.0,ds,seps,0.0,depth)
+
+ plot_lines(tree['root'],params,depth,ax,0)
+ plot_points(tree['root'],params,pt_scale,ax)
+ plot_names(tree['root'],params,depth,ax,uniqueid(),seps)
+ r = np.arange(0, 3.0, 0.01)
+ theta = 2*np.pi*r
+
+ def get_col_attr(x):
+ return hasattr(x, 'set_color') and not hasattr(x, 'set_facecolor')
+
+ h, l = ax.get_legend_handles_labels()
+ if len(l) > 0:
+ # Each column allows at most 35 species (rows)
+ ncol = len(l)//35+1
+ leg = ax.legend(bbox_to_anchor=(1.02, 1), frameon=False, loc=2, borderaxespad=0.,
+ prop={'size':params['label_font_size']},ncol=ncol)
+ if leg != None:
+ gca().add_artist(leg)
+ for o in leg.findobj(get_col_attr):
+ o.set_color(params['fore_color'])
+
+ cll = sorted(tree['classes']) if params['all_feats'] == "" else sorted(params['all_feats'].split(":"))
+ nll = [ax.bar(0.0, 0.0, width = 0.0, bottom = 0.0, color=colors[i%len(colors)], label=c) for i,c in enumerate(cll) if c in tree['classes']]
+ cl = [c for c in cll if c in tree['classes']]
+
+ ax.set_title(params['title'],size=params['title_font_size'],color=params['fore_color'])
+
+ if params['class_legend_vis']:
+ l2 = legend(nll, cl, loc=2, prop={'size':params['class_legend_font_size']}, frameon=False)
+ if l2 != None:
+ for o in l2.findobj(get_col_attr):
+ o.set_color(params['fore_color'])
+ # add bbox to deal with legnd overflow
+ plt.savefig(out_file,format=params['format'],facecolor=params['back_color'],edgecolor=params['fore_color'],dpi=params['dpi'], bbox_inches='tight')
+ plt.close()
+
+def plot_cladogram():
+ params = read_params(sys.argv)
+ params['fore_color'] = 'w' if params['back_color'] == 'k' else 'k'
+ clad_tree = read_data(params['input_file'],params)
+ draw_tree(params['output_file'],clad_tree,params)
+
+
+if __name__ == '__main__':
+ plot_cladogram()
=====================================
plot_features.py → lefse/lefse_plot_features.py
=====================================
@@ -1,9 +1,9 @@
-#!/usr/bin/env python
+#!/usr/bin/env python3
import os,sys,matplotlib,zipfile,argparse,string
matplotlib.use('Agg')
from pylab import *
-from lefse import *
+from lefse.lefse import *
import random as rand
colors = ['r','g','b','m','c']
@@ -40,7 +40,7 @@ def read_data(file_data,file_feats,params):
with open(file_feats, 'r') as features:
feats_to_plot = [(f.split()[:-1],len(f.split()) == 5) for f in features.readlines()]
if not feats_to_plot:
- print "No features to plot\n"
+ print("No features to plot\n")
sys.exit(0)
feats,cls,class_sl,subclass_sl,class_hierarchy,params['norm_v'] = load_data(file_data, True)
if params['feature_num'] > 0:
@@ -51,13 +51,13 @@ def read_data(file_data,file_feats,params):
if params['f'] == "one" and f[0][0] != params['feature_name']: continue
features[f[0][0]] = {'dim':float(f[0][1]), 'abundances':feats[f[0][0]], 'sig':f[1], 'cls':cls, 'class_sl':class_sl, 'subclass_sl':subclass_sl, 'class_hierarchy':class_hierarchy}
if not features:
- print "No features to plot\n"
+ print("No features to plot\n")
sys.exit(0)
return features
def plot(name,k_n,feat,params):
fig = plt.figure(figsize=(params['width'], params['height']),edgecolor=params['fore_color'],facecolor=params['back_color'])
- ax = fig.add_subplot(111,axis_bgcolor=params['back_color'])
+ ax = fig.add_subplot(111,facecolor=params['back_color'])
subplots_adjust(bottom=0.15)
max_m = 0.0
@@ -137,13 +137,13 @@ def plot(name,k_n,feat,params):
plt.close()
return name
-if __name__ == '__main__':
+def plot_features():
params = read_params(sys.argv)
params['fore_color'] = 'w' if params['back_color'] == 'k' else 'k'
features = read_data(params['input_file_1'],params['input_file_2'],params)
if params['archive'] == "zip": file = zipfile.ZipFile(params['output_file'], "w")
for k,f in features.items():
- print "Exporting ", k
+ print("Exporting ", k)
if params['archive'] == "zip":
of = plot("/tmp/"+str(int(f['sig']))+"_"+"-".join(k.split("."))+"."+params['format'],k,f,params)
file.write(of, os.path.basename(of), zipfile.ZIP_DEFLATED)
@@ -151,3 +151,7 @@ if __name__ == '__main__':
if params['f'] == 'one': plot(params['output_file'],k,f,params)
else: plot(params['output_file']+str(int(f['sig']))+"_"+"-".join(k.split("."))+"."+params['format'],k,f,params)
if params['archive'] == "zip": file.close()
+
+
+if __name__ == '__main__':
+ plot_features()
\ No newline at end of file
=====================================
plot_res.py → lefse/lefse_plot_res.py
=====================================
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/env python3
import os,sys
import matplotlib
@@ -6,7 +6,7 @@ matplotlib.use('Agg')
from pylab import *
from collections import defaultdict
-from lefse import *
+from lefse.lefse import *
import argparse
colors = ['r','g','b','m','c','y','k','w']
@@ -44,7 +44,7 @@ def read_data(input_file,output_file,otu_only):
rows = [line.strip().split()[:-1] for line in inp.readlines() if len(line.strip().split())>3 and len(line.strip().split()[0].split('.'))==8] # a feature with length 8 will have an OTU id associated with it
classes = list(set([v[2] for v in rows if len(v)>2]))
if len(classes) < 1:
- print "No differentially abundant features found in "+input_file
+ print("No differentially abundant features found in "+input_file)
os.system("touch "+output_file)
sys.exit()
data = {}
@@ -59,7 +59,7 @@ def plot_histo_hor(path,params,data,bcl,report_features):
cls = sorted(data['cls'])
if bcl: data['rows'].sort(key=lambda ab: fabs(float(ab[3]))*(cls.index(ab[2])*2-1))
else:
- mmax = max([fabs(float(a)) for a in zip(*data['rows'])[3]])
+ mmax = max([fabs(float(a)) for a in list(zip(*list(data['rows'])))[3]])
data['rows'].sort(key=lambda ab: fabs(float(ab[3]))/mmax+(cls.index(ab[2])+1))
pos = arange(len(data['rows']))
head = 0.75
@@ -67,11 +67,11 @@ def plot_histo_hor(path,params,data,bcl,report_features):
ht = head + tail
ints = max(len(pos)*0.2,1.5)
fig = plt.figure(figsize=(params['width'], ints + ht), edgecolor=params['back_color'],facecolor=params['back_color'])
- ax = fig.add_subplot(111,frame_on=False,axis_bgcolor=params['back_color'])
+ ax = fig.add_subplot(111,frame_on=False,facecolor=params['back_color'])
ls, rs = params['ls'], 1.0-params['rs']
plt.subplots_adjust(left=ls,right=rs,top=1-head*(1.0-ints/(ints+ht)), bottom=tail*(1.0-ints/(ints+ht)))
- fig.canvas.set_window_title('LDA results')
+ fig.canvas.manager.set_window_title('LDA results')
l_align = {'horizontalalignment':'left', 'verticalalignment':'baseline'}
r_align = {'horizontalalignment':'right', 'verticalalignment':'baseline'}
@@ -94,9 +94,9 @@ def plot_histo_hor(path,params,data,bcl,report_features):
ax.barh(pos[i],vv, align='center', color=col, label=lab, height=0.8, edgecolor=params['fore_color'])
mv = max([abs(float(v[3])) for v in data['rows']])
if report_features:
- print 'OTU\tLDA_score\tCLass'
+ print('OTU\tLDA_score\tCLass')
for i in out_data:
- print '%s\t%s\t%s' %(i, out_data[i][0], out_data[i][1])
+ print('%s\t%s\t%s' %(i, out_data[i][0], out_data[i][1]))
for i,r in enumerate(data['rows']):
indcl = cls.index(data['rows'][i][2])
if params['n_scl'] < 0: rr = r[0]
@@ -136,9 +136,9 @@ def plot_histo_ver(path,params,data,report_features):
if params['n_scl'] < 0: nam = [d[0] for d in data['rows']]
else: nam = [d[0].split(".")[-min(d[0].count("."),params['n_scl'])] for d in data['rows']]
fig = plt.figure(edgecolor=params['back_color'],facecolor=params['back_color'],figsize=(params['width'], params['height']))
- ax = fig.add_subplot(111,axis_bgcolor=params['back_color'])
+ ax = fig.add_subplot(111,facecolor=params['back_color'])
plt.subplots_adjust(top=0.9, left=params['ls'], right=params['rs'], bottom=0.3)
- fig.canvas.set_window_title('LDA results')
+ fig.canvas.manager.set_window_title('LDA results')
l_align = {'horizontalalignment':'left', 'verticalalignment':'baseline'}
r_align = {'horizontalalignment':'right', 'verticalalignment':'baseline'}
added = []
@@ -156,9 +156,9 @@ def plot_histo_ver(path,params,data,report_features):
vv = fabs(float(v[3]))
ax.bar(pos[i],vv, align='center', color=col, label=lab)
if report_features:
- print 'OTU\tLDA_score\tCLass'
+ print('OTU\tLDA_score\tCLass')
for i in out_data:
- print '%s\t%s\t%s' %(i, out_data[i][0], out_data[i][1])
+ print('%s\t%s\t%s' %(i, out_data[i][0], out_data[i][1]))
xticks(pos,nam,rotation=-20, ha = 'left',size=params['feature_font_size'])
ax.set_title(params['title'],size=params['title_font_size'])
ax.set_ylabel("LDA SCORE (log 10)")
@@ -169,11 +169,13 @@ def plot_histo_ver(path,params,data,report_features):
plt.savefig(path,format=params['format'],facecolor=params['back_color'],edgecolor=params['fore_color'],dpi=params['dpi'])
plt.close()
-if __name__ == '__main__':
+def plot_res():
params = read_params(sys.argv)
params['fore_color'] = 'w' if params['back_color'] == 'k' else 'k'
data = read_data(params['input_file'],params['output_file'],params['otu_only'])
if params['orientation'] == 'v': plot_histo_ver(params['output_file'],params,data,params['report_features'])
else: plot_histo_hor(params['output_file'],params,data,len(data['cls']) == 2,params['report_features'])
-
+
+if __name__ == '__main__':
+ plot_res()
\ No newline at end of file
=====================================
lefse/lefse_run.py
=====================================
@@ -0,0 +1,108 @@
+#!/usr/bin/env python3
+
+import os,sys,math,pickle
+from lefse.lefse import *
+
+def read_params(args):
+ parser = argparse.ArgumentParser(description='LEfSe 1.1.01')
+ parser.add_argument('input_file', metavar='INPUT_FILE', type=str, help="the input file")
+ parser.add_argument('output_file', metavar='OUTPUT_FILE', type=str,
+ help="the output file containing the data for the visualization module")
+ parser.add_argument('-o',dest="out_text_file", metavar='str', type=str, default="",
+ help="set the file for exporting the result (only concise textual form)")
+ parser.add_argument('-a',dest="anova_alpha", metavar='float', type=float, default=0.05,
+ help="set the alpha value for the Anova test (default 0.05)")
+ parser.add_argument('-w',dest="wilcoxon_alpha", metavar='float', type=float, default=0.05,
+ help="set the alpha value for the Wilcoxon test (default 0.05)")
+ parser.add_argument('-l',dest="lda_abs_th", metavar='float', type=float, default=2.0,
+ help="set the threshold on the absolute value of the logarithmic LDA score (default 2.0)")
+ parser.add_argument('--nlogs',dest="nlogs", metavar='int', type=int, default=3,
+ help="max log ingluence of LDA coeff")
+ parser.add_argument('--verbose',dest="verbose", metavar='int', choices=[0,1], type=int, default=0,
+ help="verbose execution (default 0)")
+ parser.add_argument('--wilc',dest="wilc", metavar='int', choices=[0,1], type=int, default=1,
+ help="wheter to perform the Wicoxon step (default 1)")
+ parser.add_argument('-r',dest="rank_tec", metavar='str', choices=['lda','svm'], type=str, default='lda',
+ help="select LDA or SVM for effect size (default LDA)")
+ parser.add_argument('--svm_norm',dest="svm_norm", metavar='int', choices=[0,1], type=int, default=1,
+ help="whether to normalize the data in [0,1] for SVM feature waiting (default 1 strongly suggested)")
+ parser.add_argument('-b',dest="n_boots", metavar='int', type=int, default=30,
+ help="set the number of bootstrap iteration for LDA (default 30)")
+ parser.add_argument('-e',dest="only_same_subcl", metavar='int', type=int, default=0,
+ help="set whether perform the wilcoxon test only among the subclasses with the same name (default 0)")
+ parser.add_argument('-c',dest="curv", metavar='int', type=int, default=0,
+ help="set whether perform the wilcoxon test ing the Curtis's approach [BETA VERSION] (default 0)")
+ parser.add_argument('-f',dest="f_boots", metavar='float', type=float, default=0.67,
+ help="set the subsampling fraction value for each bootstrap iteration (default 0.66666)")
+ parser.add_argument('-s',dest="strict", choices=[0,1,2], type=int, default=0,
+ help="set the multiple testing correction options. 0 no correction (more strict, default), 1 correction for independent comparisons, 2 correction for dependent comparison")
+# parser.add_argument('-m',dest="m_boots", type=int, default=5,
+# help="minimum cardinality of classes in each bootstrapping iteration (default 5)")
+ parser.add_argument('--min_c',dest="min_c", metavar='int', type=int, default=10,
+ help="minimum number of samples per subclass for performing wilcoxon test (default 10)")
+ parser.add_argument('-t',dest="title", metavar='str', type=str, default="",
+ help="set the title of the analysis (default input file without extension)")
+ parser.add_argument('-y',dest="multiclass_strat", choices=[0,1], type=int, default=0,
+ help="(for multiclass tasks) set whether the test is performed in a one-against-one ( 1 - more strict!) or in a one-against-all setting ( 0 - less strict) (default 0)")
+ args = parser.parse_args()
+
+ params = vars(args)
+ if params['title'] == "":
+ params['title'] = params['input_file'].split("/")[-1].split('.')[0]
+
+ return params
+
+
+def lefse_run():
+ init()
+ params = read_params(sys.argv)
+ feats,cls,class_sl,subclass_sl,class_hierarchy = load_data(params['input_file'])
+ kord,cls_means = get_class_means(class_sl,feats)
+ wilcoxon_res = {}
+ kw_n_ok = 0
+ nf = 0
+ for feat_name,feat_values in list(feats.items()):
+ if params['verbose']:
+ print("Testing feature",str(nf),": ",feat_name)
+ nf += 1
+ kw_ok,pv = test_kw_r(cls,feat_values,params['anova_alpha'],sorted(cls.keys()))
+ if not kw_ok:
+ if params['verbose']: print("\tkw ko")
+ del feats[feat_name]
+ wilcoxon_res[feat_name] = "-"
+ continue
+ if params['verbose']: print("\tkw ok\t")
+
+ if not params['wilc']: continue
+ kw_n_ok += 1
+ res_wilcoxon_rep = test_rep_wilcoxon_r(subclass_sl,class_hierarchy,feat_values,params['wilcoxon_alpha'],params['multiclass_strat'],params['strict'],feat_name,params['min_c'],params['only_same_subcl'],params['curv'])
+ wilcoxon_res[feat_name] = str(pv) if res_wilcoxon_rep else "-"
+ if not res_wilcoxon_rep:
+ if params['verbose']: print("wilc ko")
+ del feats[feat_name]
+ elif params['verbose']: print("wilc ok\t")
+
+ if len(feats) > 0:
+ print("Number of significantly discriminative features:", len(feats), "(", kw_n_ok, ") before internal wilcoxon")
+ if params['lda_abs_th'] < 0.0:
+ lda_res,lda_res_th = dict([(k,0.0) for k,v in feats.items()]), dict([(k,v) for k,v in feats.items()])
+ else:
+ if params['rank_tec'] == 'lda': lda_res,lda_res_th = test_lda_r(cls,feats,class_sl,params['n_boots'],params['f_boots'],params['lda_abs_th'],0.0000000001,params['nlogs'])
+ elif params['rank_tec'] == 'svm': lda_res,lda_res_th = test_svm(cls,feats,class_sl,params['n_boots'],params['f_boots'],params['lda_abs_th'],0.0,params['svm_norm'])
+ else: lda_res,lda_res_th = dict([(k,0.0) for k,v in feats.items()]), dict([(k,v) for k,v in feats.items()])
+ else:
+ print("Number of significantly discriminative features:", len(feats), "(", kw_n_ok, ") before internal wilcoxon")
+ print("No features with significant differences between the two classes")
+ lda_res,lda_res_th = {},{}
+ outres = {}
+ outres['lda_res_th'] = lda_res_th
+ outres['lda_res'] = lda_res
+ outres['cls_means'] = cls_means
+ outres['cls_means_kord'] = kord
+ outres['wilcox_res'] = wilcoxon_res
+ print("Number of discriminative features with abs LDA score >",params['lda_abs_th'],":",len(lda_res_th))
+ save_res(outres,params["output_file"])
+
+
+if __name__ == '__main__':
+ lefse_run()
\ No newline at end of file
=====================================
qiime2lefse.py → lefse/qiime2lefse.py
=====================================
@@ -1,4 +1,4 @@
-#!/usr/bin/env python
+#!/usr/bin/env python3
import sys
@@ -40,7 +40,13 @@ def read_params(args):
-def qiime2lefse( fin, fmd, fout, all_md, sel_md ):
+def qiime2lefse():
+ pars = read_params( sys.argv )
+ fin = pars['in']
+ fmd = pars['md']
+ fout = pars['out']
+ all_md = not pars['c'] and not pars['s'] and not pars['u']
+ sel_md = [pars['c'],pars['s'],pars['u']]
with (fin if fin==sys.stdin else open(fin)) as inpf :
lines = [list(ll) for ll in
(zip(*[l.strip().split('\t')
@@ -62,7 +68,7 @@ def qiime2lefse( fin, fmd, fout, all_md, sel_md ):
mdd = dict(zip(mdf,l[1:]))
md[l[0]] = mdd
- selected_md = md.values()[0].keys() if md else []
+ selected_md = list(md.values())[0].keys() if md else []
if not all_md:
selected_md = [s for s in sel_md if s]
@@ -79,10 +85,4 @@ def qiime2lefse( fin, fmd, fout, all_md, sel_md ):
outf.write( "\t".join(l) + "\n" )
if __name__ == '__main__':
- pars = read_params( sys.argv )
-
- qiime2lefse( fin = pars['in'],
- fmd = pars['md'],
- fout = pars['out'],
- all_md = not pars['c'] and not pars['s'] and not pars['u'],
- sel_md = [pars['c'],pars['s'],pars['u']])
+ qiime2lefse()
=====================================
lefsebiom/AbundanceTable.py
=====================================
The diff for this file was not included because it is too large.
=====================================
lefsebiom/CClade.py
=====================================
@@ -32,7 +32,6 @@ __maintainer__ = "Timothy Tickle"
__email__ = "ttickle at sph.harvard.edu"
__status__ = "Development"
-import blist
import sys
class CClade:
@@ -64,7 +63,7 @@ class CClade:
Set all the values given as a list in the same order given.
"""
- self.m_adValues = blist.blist( [0] ) * len( adValues )
+ self.m_adValues = [0] * len( adValues )
for i, d in enumerate( adValues ):
if d:
self.m_adValues[i] = d
=====================================
lefsebiom/ValidateData.py
=====================================
@@ -449,7 +449,7 @@ class ValidateData:
#Check elements
listSize = len(parameterValue)
- for i in xrange(0,listSize):
+ for i in range(0,listSize):
if(not ValidateData.funcIsValidNumeric(parameterValue[i])):
return False
return True
@@ -472,7 +472,7 @@ class ValidateData:
#Check elements
listSize = len(parameterValue)
- for i in xrange(0,listSize):
+ for i in range(0,listSize):
if(not ValidateData.funcIsValidString(parameterValue[i])):
return False
return True
@@ -520,7 +520,7 @@ class ValidateData:
return False
#Check key elements
- keyList = parameterValue.keys()
+ keyList = list(parameterValue.keys())
keyListSize = len(keyList)
for i in range(0,keyListSize):
if keyList[i] == None:
@@ -530,7 +530,7 @@ class ValidateData:
return False
#Check key elements
- itemList = parameterValue.values()
+ itemList = list(parameterValue.values())
itemListSize = len(itemList)
for i in range(0,itemListSize):
=====================================
plot_cladogram.py deleted
=====================================
@@ -1,342 +0,0 @@
-#!/usr/bin/env python
-
-import os,sys,matplotlib,argparse,string
-matplotlib.use('Agg')
-from pylab import *
-from lefse import *
-import numpy as np
-
-colors = ['r','g','b','m','c',[1.0,0.5,0.0],[0.0,1.0,0.0],[0.33,0.125,0.0],[0.75,0.75,0.75],'k']
-dark_colors = [[0.4,0.0,0.0],[0.0,0.2,0.0],[0.0,0.0,0.4],'m','c',[1.0,0.5,0.0],[0.0,1.0,0.0],[0.33,0.125,0.0],[0.75,0.75,0.75],'k']
-
-class CladeNode:
- def __init__(self, name, abundance, viz = True):
- self.id = name
- self.name = name.split(".")
- self.last_name = self.name[-1]
- self.abundance = abundance
- self.pos = (-1.0,-1.0)
- self.children = {}
- self.isleaf = True
- self.color = 'y'
- self.next_leaf = -1
- self.prev_leaf = -1
- self.viz = viz
- def __repr__(self):
- return self.last_name
- def add_child(self,node):
- self.isleaf = False
- self.children[node.__repr__()] = node
- def get_children(self):
- ck = sorted(self.children.keys())
- return [self.children[k] for k in ck]
- def get_color(self):
- return self.color
- def set_color(self,c):
- self.color = c
- def set_pos(self,pos):
- self.pos = pos
-
-def read_params(args):
- parser = argparse.ArgumentParser(description='Cladoplot')
- parser.add_argument('input_file', metavar='INPUT_FILE', type=str, help="tab delimited input file")
- parser.add_argument('output_file', metavar='OUTPUT_FILE', type=str, help="the file for the output image")
- parser.add_argument('--clade_sep',dest="clade_sep", type=float, default=1.5)
- parser.add_argument('--max_lev',dest="max_lev", type=int, default=-1)
- parser.add_argument('--max_point_size',dest="max_point_size", type=float, default=6.0)
- parser.add_argument('--min_point_size',dest="min_point_size", type=float, default=1)
- parser.add_argument('--point_edge_width',dest="markeredgewidth", type=float, default=.25)
- parser.add_argument('--siblings_connector_width',dest="siblings_connector_width", type=float, default=2)
- parser.add_argument('--parents_connector_width',dest="parents_connector_width", type=float, default=0.75)
- parser.add_argument('--radial_start_lev',dest="radial_start_lev", type=int, default=1)
- parser.add_argument('--labeled_start_lev',dest="labeled_start_lev", type=int, default=2)
- parser.add_argument('--labeled_stop_lev',dest="labeled_stop_lev", type=int, default=5)
- parser.add_argument('--abrv_start_lev',dest="abrv_start_lev", type=int, default=3)
- parser.add_argument('--abrv_stop_lev',dest="abrv_stop_lev", type=int, default=5)
- parser.add_argument('--expand_void_lev',dest="expand_void_lev", type=int, default=1)
- parser.add_argument('--class_legend_vis',dest="class_legend_vis", type=int, default=1)
- parser.add_argument('--colored_connector',dest="colored_connectors", type=int, default=1)
- parser.add_argument('--alpha',dest="alpha", type=float, default=0.2)
- parser.add_argument('--title',dest="title", type=str, default="Cladogram")
- parser.add_argument('--sub_clade',dest="sub_clade", type=str, default="")
- parser.add_argument('--title_font_size',dest="title_font_size", type=str, default="14")
- parser.add_argument('--right_space_prop',dest="r_prop", type=float, default=0.1)
- parser.add_argument('--left_space_prop',dest="l_prop", type=float, default=0.1)
- parser.add_argument('--label_font_size',dest="label_font_size", type=str, default="6")
- parser.add_argument('--background_color',dest="back_color", type=str, choices=["k","w"], default="w", help="set the color of the background")
- parser.add_argument('--colored_labels',dest="col_lab", type=int, choices=[0,1], default=1, help="draw the label with class color (1) or in black (0)")
- parser.add_argument('--class_legend_font_size',dest="class_legend_font_size", type=str, default="10")
- parser.add_argument('--dpi',dest="dpi", type=int, default=72)
- parser.add_argument('--format', dest="format", choices=["png","svg","pdf"], default="svg", type=str, help="the format for the output file")
- parser.add_argument('--all_feats', dest="all_feats", type=str, default="")
- args = parser.parse_args()
- return vars(args)
-
-def cmp_names(la,lb):
- if len(la) != len(lb): return False
- for p in [(a,b) for i,a in enumerate(la) for j,b in enumerate(lb) if i == j]:
- if p[0] != p[1]: return False
- return True
-
-def build_tree(father,all_nodes,l,depth,viz):
- cc = [n for n in all_nodes if len(n.name) > len(father.name) and cmp_names(father.name,n.name[:len(father.name)])]
- children = [n for n in cc if len(n.name) == len(father.name)+1]
- if len(children) == 0 and l < depth -1: # !!!
- nc = CladeNode(father.id+"."+father.id.split(".")[-1],1.0,viz)
- father.add_child(nc)
- children.append(nc)
- for child in children:
- build_tree(child,cc,l+1,depth,viz)
- father.add_child(child)
-
-def get_all_nodes(father):
- ret = [father]
- children = father.get_children()
- for c in children:
- ret += get_all_nodes(c)
- return ret
-
-def read_data(input_file,params):
- with open(input_file, 'r') as inp:
- if params['sub_clade'] == "": rows = [line.strip().split()[:-1] for line in inp.readlines() if params['max_lev'] < 1 or line.split()[0].count(".") < params['max_lev']]
- else: rows = [line.split(params['sub_clade']+".")[1].strip().split()[:-1] for line in inp.readlines() if ( params['max_lev'] < 1 or line.split()[0].count(".") < params['max_lev'] ) and line.startswith(params['sub_clade']+".")]
- all_names = [lin[0] for lin in rows]
- to_add = []
-
- abundances = [float(v) for v in zip(*rows)[1] if v >= 0.0]
- tree = {}
- tree['classes'] = list(set([v[2] for v in rows if len(v)>2]))
- tree['classes'].sort()
- all_nodes = [CladeNode("root."+row[0],float(row[1])) for row in rows]
-
- depth = max([len(n.name) for n in all_nodes])
-
- n2 = ["_".join(nn.name) for nn in all_nodes]
- for i,nn in enumerate(all_nodes):
- n = nn
- while "_".join(n.name[:-1]) not in n2 and len(n.name) > 1:
- n = CladeNode(".".join(n.name[:-1]),n.abundance)
- all_nodes.append(n)
- n2.append("_".join(n.name))
-
- cls2 = []
- if params['all_feats'] != "":
- cls2 = sorted(params['all_feats'].split(":"))
- for i,v in enumerate(rows):
- if len(v)>2:
- if len(cls2) > 0: all_nodes[i].set_color(colors[cls2.index(v[2])%len(colors)])
- else:
- if v[2].count('rgbcol') > 0:
- ccc = [float(tt) for tt in v[2].split('_')[1:]]
- all_nodes[i].set_color(ccc)
- else: all_nodes[i].set_color(colors[sorted(tree['classes']).index(v[2])%len(colors)])
- root = CladeNode("root",-1.0)
- root.set_pos((0.0,0.0))
-
- build_tree(root,all_nodes,0,depth,params['expand_void_lev']==1)
-
- all_nodes = get_all_nodes(root)
-
- tree['root'] = root
- tree['max_abs'] = max(abundances)
- tree['min_abs'] = min(abundances)
- levs = []
- for i in range(depth):
- depthi = [n for n in all_nodes if len(n.name) == i+1]
- levs.append(len(depthi))
- tree['nlev'] = levs
- return tree
-
-def add_all_pos(father,n,distn,seps,tsep,mlev,last_leaf=-1,nc=1):
- children = father.get_children()
- leaves = True if children[0].isleaf else False
- for i,child in enumerate(children):
- if leaves:
- n += 1.0
- men = 0.5 if len(children) == 1 else 0.0
- child.set_pos((n*distn-men*float(distn)+tsep,(len(father.name))/float(mlev-1)))
- if last_leaf != -1:
- child.prev_leaf = last_leaf
- last_leaf.next_leaf = child
- last_leaf = child
- else:
- ln = n
- ltsep = tsep
- n,tsep,last_leaf = add_all_pos(child,n,distn,seps,tsep,mlev,last_leaf,len(children))
- nn = (ln + n)*0.5*distn
- ssep = (ltsep + tsep)*0.5
- if n-ln == 1:
- ssep = ltsep
- child.set_pos((nn+ssep,(len(father.name))/float(mlev-1)))
- tsep += seps[len(father.name)-1]
- return n,tsep,last_leaf
-
-def plot_points(father,params,pt_scale,ax):
- children = father.get_children()
- children.sort(key = lambda a: -int(a.get_color() == 'y')*a.abundance)
- x,r = father.pos[0], father.pos[1]
- for i,child in enumerate(children):
- xc,rc = plot_points(child,params,pt_scale,ax)
- if not father.viz: return x,r
- ps = pt_scale[0]+father.abundance/pt_scale[1]+pt_scale[0]
- col = father.get_color()
- pw = params['markeredgewidth'] if col == 'y' else params['markeredgewidth']*3.0
- if x==0 and r==0: ax.plot(x,r, 'o',markersize=ps,color=col,markeredgewidth=0.01,markeredgecolor=params['fore_color'])
- else: ax.plot(x,r, 'o',markersize=ps,color=col,markeredgewidth=pw,markeredgecolor=params['fore_color'])
- return x,r
-
-def plot_lines(father,params,depth,ax,xf):
- children = father.get_children()
- x,r = father.pos[0], father.pos[1]
- for i,child in enumerate(children):
- xc,rc = plot_lines(child,params,depth,ax,x)
- if i == 0: x_first, r_first = xc, rc
- if len(father.name) >= depth-params['radial_start_lev']:
- col = params['fore_color']
- lw=params['parents_connector_width']
- if not child.viz: continue
- if father.get_color() != 'y' and father.get_color() == child.get_color() and params['colored_connectors']:
- col = child.get_color()
- lw *=2.5
- if col != params['fore_color']:
- ax.plot([x,xc],[r,rc],"-",color=params['fore_color'],lw=lw*1.5)
- ax.plot([x,xc],[r,rc],"-",color=col,lw=lw)
-
- if not father.viz or (len(children) == 1 and not children[0].viz): return x,r
- if len(father.name) < depth-params['radial_start_lev']:
- col = params['fore_color']
- lw=params['parents_connector_width']
- if father.get_color() != 'y':
- f =True
- for child in children:
- if child.get_color() != father.get_color() or not params['colored_connectors']:
- f = False
- break
- if f:
- col = father.get_color()
- lw *= 2.5
- if not (x==0 and r==0):
- xx = xc if len(children) > 0 else x
- if len(children) == 0: rc = r
- xt = x if len(children)>1 else xx
- if col != params['fore_color']:
- ax.plot([x,xt],[r,rc],"-",color=params['fore_color'],lw=lw*1.5)
- ax.plot([x,xt],[r,rc],"-",color=col,lw=lw)
- if len(children) > 0 and 1 < len(father.name) < depth-params['radial_start_lev']:
- xs = arange(x_first,xc,0.01)
- ys = [rc for t in xs]
- ax.plot(xs,ys,"-",color=col,lw=params['siblings_connector_width'],markeredgecolor=params['fore_color'])
- return x,r
-
-def uniqueid():
- for l in string.lowercase: yield l
- for l in string.lowercase:
- for i in range(10):
- yield l+str(i)
- i = 0
- while True:
- yield str(i)
- i += 1
-
-def plot_names(father,params,depth,ax,u_i,seps):
- children = father.get_children()
- l = len(father.name)
- if len(children)==0:
- if father.prev_leaf == -1 or father.next_leaf == -1:
- fr_0, fr_1 = father.pos[0], father.pos[0]
- else: fr_0, fr_1 = (father.pos[0]+father.prev_leaf.pos[0])*0.5, (father.pos[0]+father.next_leaf.pos[0])*0.5
- for i,child in enumerate(children):
- fr,to = plot_names(child,params,depth,ax,u_i,seps)
- if i == 0: fr_0 = fr
- fr_1 = to
- if father.get_color() != 'y' and params['labeled_start_lev'] < l <= params['labeled_stop_lev']+1:
- col = father.get_color()
- dd = params['labeled_stop_lev'] - params['labeled_start_lev'] + 1
- de = depth - 1
- dim = 1.0/float(de)
- perc_ext = 0.65 if dim > 0.1 else 1.0
- clto = (de-l+1)*dim+dim*(dd+1-(l-dd-1))*perc_ext
- clto = (de-l+1)*dim+dim*(dd-(l-params['labeled_start_lev'])+1)*perc_ext
- des = float(180.0*(fr_0+fr_1)/np.pi)*0.5-90
- lab = ""
- txt = father.last_name
- if params['abrv_start_lev'] < l <= params['abrv_stop_lev'] + 1:
- ide = u_i.next()
- lab = str(ide)+": "+father.last_name
- txt = str(ide)
-# ax.bar(fr_0, clto, width = fr_1-fr_0, bottom = float(l-1)/float(depth-1), alpha = params['alpha'], color=col, edgecolor=col)
- ax.bar(fr_0, clto, width = fr_1-fr_0, bottom = float(l-1)/float(de), alpha = params['alpha'], color=col, edgecolor=col)
- ax.bar(0.0, 0.0, width = 0.0, bottom = 0.0, alpha = 1.0, color=col, edgecolor=params['fore_color'], label=lab)
- if l <= params['abrv_stop_lev'] + 1:
- if not params['col_lab']: col = params['fore_color']
- else:
- if col not in colors: col = params['fore_color']
- else: col = dark_colors[colors.index(col)%len(dark_colors)]
- ax.text((fr_0+fr_1)*0.5, clto+float(l-1)/float(de)-dim*perc_ext/2.0, txt, size = params['label_font_size'], rotation=des, ha ="center", va="center", color=col)
- return fr_0, fr_1
-
-def draw_tree(out_file,tree,params):
- plt_size = 7
- nlev = tree['nlev']
- pt_scale = (params['min_point_size'],max(1.0,((tree['max_abs']-tree['min_abs']))/(params['max_point_size']-params['min_point_size'])))
- depth = len(nlev)
- sep = (2.0*np.pi)/float(nlev[-1])
- seps = [params['clade_sep']*sep/float(depth-i+1) for i in range(1,len(tree['nlev'])+1)]
- totseps = sum([s*nlev[i] for i,s in enumerate(seps[:-1])])
- clade_sep_err = True if totseps > np.pi else False
- while totseps > np.pi:
- params['clade_sep'] *= 0.75
- seps = [params['clade_sep']*sep/(float(depth-i+1)*0.25) for i in range(1,len(tree['nlev'])+1)]
- totseps = sum([s*nlev[i] for i,s in enumerate(seps[:-1])])
- if clade_sep_err: print 'clade_sep parameter too large, lowered to',params['clade_sep']
-
- fig = plt.figure(edgecolor=params['back_color'],facecolor=params['back_color'])
- ax = fig.add_subplot(111, polar=True, frame_on=False, axis_bgcolor=params['back_color'] )
- plt.subplots_adjust(right=1.0-params['r_prop'],left=params['l_prop'])
- ax.grid(False)
- xticks([])
- yticks([])
-
- ds = (2.0*np.pi-totseps)/float(nlev[-1])
-
- add_all_pos(tree['root'],0.0,ds,seps,0.0,depth)
-
- plot_lines(tree['root'],params,depth,ax,0)
- plot_points(tree['root'],params,pt_scale,ax)
- plot_names(tree['root'],params,depth,ax,uniqueid(),seps)
-
- r = np.arange(0, 3.0, 0.01)
- theta = 2*np.pi*r
-
- def get_col_attr(x):
- return hasattr(x, 'set_color') and not hasattr(x, 'set_facecolor')
-
- h, l = ax.get_legend_handles_labels()
- if len(l) > 0:
- leg = ax.legend(bbox_to_anchor=(1.05, 1), frameon=False, loc=2, borderaxespad=0.,prop={'size':params['label_font_size']})
- if leg != None:
- gca().add_artist(leg)
- for o in leg.findobj(get_col_attr):
- o.set_color(params['fore_color'])
-
- cll = sorted(tree['classes']) if params['all_feats'] == "" else sorted(params['all_feats'].split(":"))
- nll = [ax.bar(0.0, 0.0, width = 0.0, bottom = 0.0, color=colors[i%len(colors)], label=c) for i,c in enumerate(cll) if c in tree['classes']]
- cl = [c for c in cll if c in tree['classes']]
-
- ax.set_title(params['title'],size=params['title_font_size'],color=params['fore_color'])
-
- if params['class_legend_vis']:
- l2 = legend(nll, cl, loc=2, prop={'size':params['class_legend_font_size']}, frameon=False)
- if l2 != None:
- for o in l2.findobj(get_col_attr):
- o.set_color(params['fore_color'])
-
- plt.savefig(out_file,format=params['format'],facecolor=params['back_color'],edgecolor=params['fore_color'],dpi=params['dpi'])
- plt.close()
-
-if __name__ == '__main__':
- params = read_params(sys.argv)
- params['fore_color'] = 'w' if params['back_color'] == 'k' else 'k'
- clad_tree = read_data(params['input_file'],params)
- draw_tree(params['output_file'],clad_tree,params)
-
=====================================
requirements.txt
=====================================
@@ -1,5 +1,3 @@
-
-- R
-- R libraries: splines, stats4, survival, mvtnorm, modeltools, coin, MASS
-- python libraries: rpy2 (v. 2.1 or higher), numpy, matplotlib (v. 1.0 or higher), argparse
-
+- R (v. 3.6 or higher)
+- R libraries: survival, mvtnorm, modeltools, coin, MASS
+- python libraries: rpy2 (v. 2.1 or higher), numpy, matplotlib (v. 1.0 or higher), scipy, cython, biom-format
\ No newline at end of file
=====================================
run_lefse.py deleted
=====================================
@@ -1,103 +0,0 @@
-#!/usr/bin/env python
-
-import os,sys,math,pickle
-from lefse import *
-
-def read_params(args):
- parser = argparse.ArgumentParser(description='LEfSe 1.0')
- parser.add_argument('input_file', metavar='INPUT_FILE', type=str, help="the input file")
- parser.add_argument('output_file', metavar='OUTPUT_FILE', type=str,
- help="the output file containing the data for the visualization module")
- parser.add_argument('-o',dest="out_text_file", metavar='str', type=str, default="",
- help="set the file for exporting the result (only concise textual form)")
- parser.add_argument('-a',dest="anova_alpha", metavar='float', type=float, default=0.05,
- help="set the alpha value for the Anova test (default 0.05)")
- parser.add_argument('-w',dest="wilcoxon_alpha", metavar='float', type=float, default=0.05,
- help="set the alpha value for the Wilcoxon test (default 0.05)")
- parser.add_argument('-l',dest="lda_abs_th", metavar='float', type=float, default=2.0,
- help="set the threshold on the absolute value of the logarithmic LDA score (default 2.0)")
- parser.add_argument('--nlogs',dest="nlogs", metavar='int', type=int, default=3,
- help="max log ingluence of LDA coeff")
- parser.add_argument('--verbose',dest="verbose", metavar='int', choices=[0,1], type=int, default=0,
- help="verbose execution (default 0)")
- parser.add_argument('--wilc',dest="wilc", metavar='int', choices=[0,1], type=int, default=1,
- help="wheter to perform the Wicoxon step (default 1)")
- parser.add_argument('-r',dest="rank_tec", metavar='str', choices=['lda','svm'], type=str, default='lda',
- help="select LDA or SVM for effect size (default LDA)")
- parser.add_argument('--svm_norm',dest="svm_norm", metavar='int', choices=[0,1], type=int, default=1,
- help="whether to normalize the data in [0,1] for SVM feature waiting (default 1 strongly suggested)")
- parser.add_argument('-b',dest="n_boots", metavar='int', type=int, default=30,
- help="set the number of bootstrap iteration for LDA (default 30)")
- parser.add_argument('-e',dest="only_same_subcl", metavar='int', type=int, default=0,
- help="set whether perform the wilcoxon test only among the subclasses with the same name (default 0)")
- parser.add_argument('-c',dest="curv", metavar='int', type=int, default=0,
- help="set whether perform the wilcoxon test ing the Curtis's approach [BETA VERSION] (default 0)")
- parser.add_argument('-f',dest="f_boots", metavar='float', type=float, default=0.67,
- help="set the subsampling fraction value for each bootstrap iteration (default 0.66666)")
- parser.add_argument('-s',dest="strict", choices=[0,1,2], type=int, default=0,
- help="set the multiple testing correction options. 0 no correction (more strict, default), 1 correction for independent comparisons, 2 correction for independent comparison")
-# parser.add_argument('-m',dest="m_boots", type=int, default=5,
-# help="minimum cardinality of classes in each bootstrapping iteration (default 5)")
- parser.add_argument('--min_c',dest="min_c", metavar='int', type=int, default=10,
- help="minimum number of samples per subclass for performing wilcoxon test (default 10)")
- parser.add_argument('-t',dest="title", metavar='str', type=str, default="",
- help="set the title of the analysis (default input file without extension)")
- parser.add_argument('-y',dest="multiclass_strat", choices=[0,1], type=int, default=0,
- help="(for multiclass tasks) set whether the test is performed in a one-against-one ( 1 - more strict!) or in a one-against-all setting ( 0 - less strict) (default 0)")
- args = parser.parse_args()
-
- params = vars(args)
- if params['title'] == "": params['title'] = params['input_file'].split("/")[-1].split('.')[0]
- return params
-
-
-
-if __name__ == '__main__':
- init()
- params = read_params(sys.argv)
- feats,cls,class_sl,subclass_sl,class_hierarchy = load_data(params['input_file'])
- kord,cls_means = get_class_means(class_sl,feats)
- wilcoxon_res = {}
- kw_n_ok = 0
- nf = 0
- for feat_name,feat_values in feats.items():
- if params['verbose']:
- print "Testing feature",str(nf),": ",feat_name,
- nf += 1
- kw_ok,pv = test_kw_r(cls,feat_values,params['anova_alpha'],sorted(cls.keys()))
- if not kw_ok:
- if params['verbose']: print "\tkw ko"
- del feats[feat_name]
- wilcoxon_res[feat_name] = "-"
- continue
- if params['verbose']: print "\tkw ok\t",
-
- if not params['wilc']: continue
- kw_n_ok += 1
- res_wilcoxon_rep = test_rep_wilcoxon_r(subclass_sl,class_hierarchy,feat_values,params['wilcoxon_alpha'],params['multiclass_strat'],params['strict'],feat_name,params['min_c'],params['only_same_subcl'],params['curv'])
- wilcoxon_res[feat_name] = str(pv) if res_wilcoxon_rep else "-"
- if not res_wilcoxon_rep:
- if params['verbose']: print "wilc ko"
- del feats[feat_name]
- elif params['verbose']: print "wilc ok\t"
-
- if len(feats) > 0:
- print "Number of significantly discriminative features:", len(feats), "(", kw_n_ok, ") before internal wilcoxon"
- if params['lda_abs_th'] < 0.0:
- lda_res,lda_res_th = dict([(k,0.0) for k,v in feats.items()]), dict([(k,v) for k,v in feats.items()])
- else:
- if params['rank_tec'] == 'lda': lda_res,lda_res_th = test_lda_r(cls,feats,class_sl,params['n_boots'],params['f_boots'],params['lda_abs_th'],0.0000000001,params['nlogs'])
- elif params['rank_tec'] == 'svm': lda_res,lda_res_th = test_svm(cls,feats,class_sl,params['n_boots'],params['f_boots'],params['lda_abs_th'],0.0,params['svm_norm'])
- else: lda_res,lda_res_th = dict([(k,0.0) for k,v in feats.items()]), dict([(k,v) for k,v in feats.items()])
- else:
- print "Number of significantly discriminative features:", len(feats), "(", kw_n_ok, ") before internal wilcoxon"
- print "No features with significant differences between the two classes"
- lda_res,lda_res_th = {},{}
- outres = {}
- outres['lda_res_th'] = lda_res_th
- outres['lda_res'] = lda_res
- outres['cls_means'] = cls_means
- outres['cls_means_kord'] = kord
- outres['wilcox_res'] = wilcoxon_res
- print "Number of discriminative features with abs LDA score >",params['lda_abs_th'],":",len(lda_res_th)
- save_res(outres,params["output_file"])
=====================================
setup.py
=====================================
@@ -0,0 +1,32 @@
+import setuptools
+from setuptools.command.install import install
+from io import open
+import os
+
+install_requires = ["numpy", "matplotlib", "biom-format", "rpy2"]
+setuptools.setup(
+ name='lefse',
+ version='1.1.2',
+ author='Nicola Segata',
+ author_email='nicola.segata at unitn.it',
+ url='http://github.com/SegataLab/lefse/',
+ packages = ['lefse', 'lefsebiom'],
+ long_description_content_type='text/markdown',
+ long_description=open('README.md').read(),
+ entry_points={
+ 'console_scripts': [
+ 'lefse_format_input.py = lefse.lefse_format_input:format_input',
+ 'lefse_plot_cladogram.py = lefse.lefse_plot_cladogram:plot_cladogram',
+ 'lefse_plot_features.py = lefse.lefse_plot_features:plot_features',
+ 'lefse_plot_res.py = lefse.lefse_plot_res:plot_res',
+ 'lefse_run.py = lefse.lefse_run:lefse_run',
+ 'lefse2circlader.py = lefse.lefse2circlader:lefse2circlader',
+ 'qiime2lefse.py = lefse.qiime2lefse:qiime2lefse'
+ ]
+ },
+ description="""
+ LEfSe determines the features (organisms, clades, operational taxonomic units, genes, or functions)
+ most likely to explain differences between classes by coupling standard tests for statistical significance with additional tests encoding
+ biological consistency and effect relevance.""",
+ install_requires=install_requires
+)
View it on GitLab: https://salsa.debian.org/med-team/lefse/-/commit/1b6d11f20a2393d8e497b7d8ac991775d29c8c24
--
View it on GitLab: https://salsa.debian.org/med-team/lefse/-/commit/1b6d11f20a2393d8e497b7d8ac991775d29c8c24
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20210928/131d315a/attachment-0001.htm>
More information about the debian-med-commit
mailing list