[med-svn] [Git][med-team/proteinortho][master] 4 commits: New upstream version 6.0.30+dfsg

Tue Apr 20 14:11:34 BST 2021


Nilesh Patra pushed to branch master at Debian Med / proteinortho


Commits:
3746e15d by Nilesh Patra at 2021-04-20T18:33:16+05:30
New upstream version 6.0.30+dfsg
- - - - -
3a1c34bb by Nilesh Patra at 2021-04-20T18:33:19+05:30
Update upstream source from tag 'upstream/6.0.30+dfsg'

Update to upstream version '6.0.30+dfsg'
with Debian dir c91bfdce448714479e0530c576d46711d7d427dd
- - - - -
db0cf5ee by Nilesh Patra at 2021-04-20T18:34:35+05:30
Refresh patches

- - - - -
2626c7e2 by Nilesh Patra at 2021-04-20T18:34:51+05:30
Interim changelog entry

- - - - -


12 changed files:

- CHANGELOG
- CHANGEUID
- Makefile
- README.md
- debian/changelog
- debian/patches/baseline.patch
- debian/patches/deb_diamond
- manual.html
- proteinortho6.pl
- proteinorthoHelper.html
- src/proteinortho2xml.pl
- src/proteinortho_grab_proteins.pl


Changes:

=====================================
CHANGELOG
=====================================
@@ -288,4 +288,12 @@
 	10. Jan (5379)
 		last update introduced a bug that removes the *.blast-graph if --step=3 is used
 	29. Jan (5399)
-		fixed a bug (https://gitlab.com/paulklemm_PHD/proteinortho/-/issues/44) involving the --isoform=trinity option (search pattern was too strict), thanks to Sasha Sh !
\ No newline at end of file
+		fixed a bug (https://gitlab.com/paulklemm_PHD/proteinortho/-/issues/44) involving the --isoform=trinity option (search pattern was too strict), thanks to Sasha Sh !
+	3. Feb (5444)
+		fixed another --isoform=uniprot bug + more STDERR output 
+	15. Feb (5544)
+		new -step=4 : for each orthology group (of proteinortho.tsv) build a hmm profile and search in the input files to enrich the group.
+		proteinortho_grab_proteins.pl now supports multiple cpu threads (-cpus)
+	6. April (5584)
+		fixed a bug, where generated databases are always overwritten (https://gitlab.com/paulklemm_PHD/proteinortho/-/issues/48). Thanks to Bjoern 
+		fixed another bug that caused the compilation of proteinortho_clustering to fail on CentOS (OMP_PROC_BIND=true) (https://gitlab.com/paulklemm_PHD/proteinortho/-/issues/39). 


=====================================
CHANGEUID
=====================================
@@ -1 +1 @@
-5399
+5584


=====================================
Makefile
=====================================
@@ -60,8 +60,12 @@ UNAME_S=$(shell uname -s)_$(shell uname -m)
 # output dir of make (make install moves these to PREFIX)
 BUILDDIR=src/BUILD/$(UNAME_S)
 
+ifndef CC
 CC=cc
+endif
+ifndef CXX
 CXX=g++
+endif
 
 IS_COLOR_COMPATIBLE:=$(shell tput color 2>/dev/null)
 ifdef IS_COLOR_COMPATIBLE


=====================================
README.md
=====================================
@@ -42,7 +42,7 @@ Connected components within this graph can be considered as putative co-ortholog
  <details><summary>6.0.13: (Click to expand)</summary>
 
   - added -p=autoblast : this option alows the comparison of aminoacid and nucleotide sequences. E.g. Proteom-vs-Genome: find the protein that corresponds to a given gene (/cluster). 
-  - added -isoform={ncbi,uniprot,trinity} option : The reciprocal best hit graph is build using isoform information (isoforms are treated equivalent).
+  - added -isoform={ncbi,uniprot,trinity} option : The reciprocal best hit graph is build using isoform information (isoforms are treated equivalent). [more information about --isoform](https://gitlab.com/paulklemm_PHD/proteinortho/-/wikis/FAQ#how-does-the-isoform-work)
 </details>
 
 6.0.14 : public release to https://usegalaxy.eu/
@@ -93,7 +93,7 @@ If you need brew (see [here](https://brew.sh/index_de))
 
 <br>
 
-#### Easy installation with docker [![install with docker](https://img.shields.io/badge/install%20with-docker-brightgreen.svg?style=flat)](https://quay.io/repository/biocontainers/proteinortho)
+#### Deploy with docker [![install with docker](https://img.shields.io/badge/install%20with-docker-brightgreen.svg?style=flat)](https://quay.io/repository/biocontainers/proteinortho)
 
     docker pull quay.io/biocontainers/proteinortho:TAG
 
@@ -137,14 +137,14 @@ Or you can integrate proteinortho into your own galaxy instance using: [proteino
 
 <br>
 
-#### Easy installation with dpkg (root privileges are required)
+#### Installation with dpkg (root privileges are required)
 
 The deb package can be downloaded here: [https://packages.debian.org/unstable/proteinortho](https://packages.debian.org/unstable/proteinortho).
 Afterwards the deb package can be installed with `sudo dpkg -i proteinortho*deb`.
 
 <br>
 
-#### *(Easy installation with apt-get)*
+#### *(Installation with apt-get)*
 
 **! Disclamer: Work in progress !**
 *proteinortho will be released to stable with Debian 11 (~2021), then proteinortho can be installed with `apt-get install proteinortho` (currently this installes the outdated version v5.16b)*
@@ -375,8 +375,10 @@ Open `proteinorthoHelper.html` in your favorite browser or visit [lechnerlab.de/
 
     </details>     
         
-  - **--isoform**={ncbi,uniprot,trinity}
+  - **--isoform**={ncbi,uniprot,trinity} [more information about --isoform](https://gitlab.com/paulklemm_PHD/proteinortho/-/wikis/FAQ#how-does-the-isoform-work)
    
+    Merge isoforms to a single entity. 
+
     <details><summary>ncbi</summary> 
         
         isoforms are specified in ncbi style 
@@ -419,16 +421,13 @@ Open `proteinorthoHelper.html` in your favorite browser or visit [lechnerlab.de/
         (...)
         ---
         
-        The protein id is TRINITY_DN1000_c115_g5a and the isoform id is specified with _i1
+        The protein id is TRINITY_DN1000_c115_g5a and the isoform id is specified with i1
         
     </details>
 
  **Search options (step 1-2)**
   (output: <myproject>.blast-graph)
 
-<details>
-  <summary>(Click to expand)</summary>
-
   - **--p**=algorithm (default: diamond) 
 
     <details>
@@ -443,10 +442,9 @@ Open `proteinorthoHelper.html` in your favorite browser or visit [lechnerlab.de/
 
         - diamond : Only for protein files! standard diamond procedure and for
         genes/proteins of length >40 with the additional --sensitive flag
-        Warning: Please use version 0.9.29 or later to avoid this known bug: https://gitlab.com/paulklemm_PHD/proteinortho/issues/24
+        Warning: Please use version 0.9.29 or later to avoid this known bug: #24
 
-        - lastn,lastp : lastal. -n : dna files, -p protein files (BLOSUM62
-        scoring matrix)!
+        - lastn,lastp : lastal. -n : dna files, -p protein files (BLOSUM62 scoring matrix)!
 
         - rapsearch : Only for protein files!
 
@@ -463,15 +461,18 @@ Open `proteinorthoHelper.html` in your favorite browser or visit [lechnerlab.de/
     </details>
     <br>
 
+  - **--sim**=float (default: 0.95)
+    min. reciprocal similarity for additional hits. 1 : only the best reciprocal hits are reported, 0 : all possible reciprocal blast matches (within the -evalue) are reported.
+
+<details>
+  <summary>More (Click to expand)</summary>
+
   - **--e**=evalue (default: 1e-05)
     E-value for blast
 
   - **--selfblast**
     apply selfblast, detects paralogs without orthologs
 
-  - **--sim**=float (default: 0.95)
-    min. similarity for additional hits
-
   - **--identity**=number (default: 25)
     min. percent identity of best blast hits
 
@@ -488,7 +489,7 @@ Open `proteinorthoHelper.html` in your favorite browser or visit [lechnerlab.de/
   (output: <myproject>.ffadj-graph, <myproject>.poff.tsv (tab separated file)-graph)
 
 <details>
-  <summary>(Click to expand)</summary>
+  <summary>More (Click to expand)</summary>
 
   - **--synteny**
     activate PoFF extension to separate similar by contextual adjacencies
@@ -509,8 +510,12 @@ Open `proteinorthoHelper.html` in your favorite browser or visit [lechnerlab.de/
  **Clustering options (step 3)**
   (output: <myproject>.proteinortho.tsv, <myproject>.proteinortho.html, <myproject>.proteinortho-graph)
 
+  - **--conn**=float (default: 0.1)
+    min. algebraic connectivity. <b>This is the main parameter for the clustering step.</b> Choose larger values then more splits are done, resulting in more and smaller clusters. (There are still cluster with an alg. conn. below this given threshold allowed if the protein to species ratio is good enough, see -minspecies option below)
+
 <details>
-  <summary>(Click to expand)</summary>
+
+  <summary>More (Click to expand)</summary>
 
   - **--singles**
     report singleton genes without any hit
@@ -518,9 +523,6 @@ Open `proteinorthoHelper.html` in your favorite browser or visit [lechnerlab.de/
   - **--purity**=float (default: 1e-7)
     avoid spurious graph assignments
 
-  - **--conn**=float (default: 0.1)
-    min. algebraic connectivity. <b>This is the main parameter for the clustering step.</b> Choose larger values then more splits are done, resulting in more and smaller clusters.
-
   - **--minspecies**=float (default: 1, must be >=0)
     min. number of genes per species. If a group is found with up to (minspecies) genes/species, it wont be split again (regardless of the connectivity).
 
@@ -547,15 +549,15 @@ Open `proteinorthoHelper.html` in your favorite browser or visit [lechnerlab.de/
 
  **Misc options**
 
+  - **--checkfasta**
+    checks input fasta files if the given algorithm can process the given fasta file.
+
 <details>
   <summary>(Click to expand)</summary>
 
   - **--cleanblast**
     cleans blast-graph with proteinortho_cleanupblastgraph
 
-  - **--checkfasta**
-    checks input fasta files if the given algorithm can process the given fasta file.
-
   - **--desc**
     write description files (for NCBI FASTA input only)
 


=====================================
debian/changelog
=====================================
@@ -1,3 +1,10 @@
+proteinortho (6.0.30+dfsg-1) UNRELEASED; urgency=medium
+
+  * New upstream version 6.0.30+dfsg
+  * Refresh patches
+
+ -- Nilesh Patra <nilesh at debian.org>  Tue, 20 Apr 2021 18:34:39 +0530
+
 proteinortho (6.0.28+dfsg-1) unstable; urgency=medium
 
   * New upstream version


=====================================
debian/patches/baseline.patch
=====================================
@@ -5,7 +5,7 @@ Last-Update: Sun, 15 Nov 2020 20:11:22 +0100
 
 --- a/Makefile
 +++ b/Makefile
-@@ -158,10 +158,10 @@ ifeq ($(USELAPACK),TRUE)
+@@ -162,10 +162,10 @@
  ifeq ($(USEPRECOMPILEDLAPACK),TRUE)
  ifeq ($(STATIC),TRUE)
  	@echo "[ 20%] Building **proteinortho_clustering** with LAPACK (static linking)";
@@ -19,7 +19,7 @@ Last-Update: Sun, 15 Nov 2020 20:11:22 +0100
  			echo "$(CXX) -O2 -static $(CPPFLAGS) $(CXXFLAGS) -fopenmp  -o $@ $< $(LDFLAGS) $(LDLIBS) -Wl,--allow-multiple-definition -llapack -lblas -lgfortran -pthread -Wl,--whole-archive -lpthread -Wl,--no-whole-archive -lquadmath" && $(CXX) -O2 -static $(CPPFLAGS) $(CXXFLAGS) -fopenmp  -o $@ $< $(LDFLAGS) $(LDLIBS) -Wl,--allow-multiple-definition -llapack -lblas -lgfortran -pthread -Wl,--whole-archive -lpthread -Wl,--no-whole-archive -lquadmath && ([ $$? -eq 0 ] && $@ -test && [ $$? -eq 0 ] ) || ( \
  				echo "......$(ORANGE)static linking failed, now I try dynamic linking.$(NC)"; \
  				echo "$(CXX) -O2  $(CPPFLAGS) $(CXXFLAGS) -fopenmp  -o $@ $< $(LDFLAGS) $(LDLIBS) -llapack -lblas -pthread -Wl,--whole-archive -lpthread -Wl,--no-whole-archive" && $(CXX) -O2  $(CPPFLAGS) $(CXXFLAGS) -fopenmp  -o $@ $< $(LDFLAGS) $(LDLIBS) -llapack -lblas -pthread -Wl,--whole-archive -lpthread -Wl,--no-whole-archive && ([ $$? -eq 0 ] && $@ -test && [ $$? -eq 0 ] && echo "......OK dynamic linking was successful for proteinortho_clustering!";) || ( \
-@@ -178,8 +178,8 @@ ifeq ($(STATIC),TRUE)
+@@ -182,8 +182,8 @@
  							echo "$(CXX) -O2 $(CPPFLAGS) $(CXXFLAGS) -fopenmp -o $@ $< -Isrc/lapack-3.8.0/build/include/ -Lsrc/lapack-3.8.0/build/lib/ -llapack -lblas $(LDFLAGS) $(LDLIBS) -lgfortran" && $(CXX) -O2 $(CPPFLAGS) $(CXXFLAGS) -fopenmp -o $@ $< -Isrc/lapack-3.8.0/build/include/ -Lsrc/lapack-3.8.0/build/lib/ -llapack -lblas $(LDFLAGS) $(LDLIBS) -lgfortran && echo "......OK dynamic linking was successful for proteinortho_clustering!" || ( echo "" ) ; ) ) ) ) ) )
  else
  	@echo "[ 20%] Building **proteinortho_clustering** with LAPACK (dynamic linking)";


=====================================
debian/patches/deb_diamond
=====================================
@@ -4,7 +4,7 @@ Forwarded: not-needed
 
 --- a/proteinortho6.pl
 +++ b/proteinortho6.pl
-@@ -548,6 +548,9 @@
+@@ -561,6 +561,9 @@
    $NC="";
  }
  
@@ -16,7 +16,7 @@ Forwarded: not-needed
  ##########################################################################################
 --- a/Makefile
 +++ b/Makefile
-@@ -278,7 +278,7 @@
+@@ -282,7 +282,7 @@
  	fi
  
  	@echo -n " [3/12] -p=diamond test: "
@@ -25,7 +25,7 @@ Forwarded: not-needed
  		echo "$(ORANGE)diamond missing, skipping...$(NC)"; \
  	else \
  		./proteinortho6.pl -silent -force -project=test_diamond -p=diamond test/*.faa; \
-@@ -287,7 +287,7 @@
+@@ -291,7 +291,7 @@
  	fi
  
  	@echo -n " [4/12] -p=diamond (--moresensitive) test (subparaBlast): "


=====================================
manual.html
=====================================
@@ -1,241 +1,208 @@
 <h1 id="proteinortho">Proteinortho</h1>
-
-<p>Proteinortho is a tool to detect orthologous genes within different species. For doing so, it compares similarities of given gene sequences and clusters them to find significant groups. The algorithm was designed to handle large-scale data and can be applied to hundreds of species at one. Details can be found in <a href="https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-124">Lechner et al., BMC Bioinformatics. 2011 Apr 28;12:124.</a>
-To enhance the prediction accuracy, the relative order of genes (synteny) can be used as additional feature for the discrimination of orthologs. The corresponding extension, namely PoFF (manuscript in preparation), is already build in Proteinortho. The general workflow of proteinortho is depicted [<img src="https://www.dropbox.com/s/7ubl1ginn3fmf8k/proteinortho_workflow.jpg?dl=0" alt="here" />].</p>
-
-<h1 id="newfeaturesofproteinorthoversion6">New Features of Proteinortho Version 6!</h1>
-
+<p>Proteinortho is a tool to detect orthologous genes within different species.</p>
+<p>For doing so, it compares similarities of given gene sequences and clusters them to find significant groups. 
+The algorithm was designed to handle large-scale data and can be applied to hundreds of species at one. 
+Details can be found in (<a href="https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-124">doi:10.1186/1471-2105-12-124</a>).
+To enhance the prediction accuracy, the relative order of genes (synteny) can be used as additional feature for the discrimination of orthologs. The corresponding extension, namely PoFF (<a href="https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0105015">doi:10.1371/journal.pone.0105015</a>), is already build in Proteinortho. The general workflow of proteinortho: </p>
+<p><img src="https://www.uni-marburg.de/de/fb16/ipc/ag-lechner/graph.png/@@images/image/unimr_lead_image_sd" alt="proteinortho.workflow.png" height="250"></p>
+<p><strong>Input</strong>: Multiple fasta files (orange boxes) with many proteins/genes (circles). </p>
+<p>First an initial all vs. all comparison between all proteins of all species is performed to determine protein similarities (upper right image). <br>
+The second stage is the clustering of similar genes to meaningful co-orthologous groups (lower right image). <br>
+Connected components within this graph can be considered as putative co-orthologous groups in theory and are returned in the output (lower left image).</p>
+<p><strong>Output</strong>: Groups (*.proteinortho) and pairs (*.proteinortho-graph) of orthologs proteins/genes.</p>
+<h1 id="new-features-of-proteinortho-version-6">New Features of Proteinortho Version 6</h1>
 <ul>
-<li><p>Implementation of various Blast alternatives for step (for -step=2 the -p= options): Diamond, MMseqs2, Last, Topaz, Rapsearch2, Blat, Ublast and Usearch</p></li>
-
-<li><p>Multithreading support for the clustering step (-step=3)</p></li>
-
-<li><p>Integration of the LAPACK Fortran Library for a faster clustering step (-step=3)</p></li>
-
-<li><p>Integration of the bitscore weights in the connectivity calculation for more data dependant splits (-step=3)
-<details>
-<summary>Minor features: (Click to expand)</summary></p></li>
-
-<li><p>Output now supports OrthoXML (-xml) and HTML.</p></li>
-
-<li><p>Various test routines (make test).</p></li>
-
+<li>Implementation of various Blast alternatives for step (for -step=2 the -p= options): Diamond, MMseqs2, Last, Topaz, Rapsearch2, Blat, Ublast and Usearch</li>
+<li>Multithreading support for the clustering step (-step=3)</li>
+<li>Integration of the LAPACK Fortran Library for a faster clustering step (-step=3)</li>
+<li>Integration of the bitscore weights in the connectivity calculation for more data dependant splits (-step=3)</li>
+<li><p>Continuous Integration & Continuous Development <a href="https://gitlab.com/paulklemm_PHD/proteinortho/pipelines"><img src="https://gitlab.com/paulklemm_PHD/proteinortho/badges/master/pipeline.svg" alt="pipeline status"></a> 
+<details></p>
+<summary>New minor features: (Click to expand)</summary>
+</li>
+<li><p>Output now supports OrthoXML (-xml) and HTML.</p>
+</li>
+<li><a href="https://gitlab.com/paulklemm_PHD/proteinortho/wikis/Tools%20and%20additional%20programs">proteinortho_history.pl</a> a new tool for tracking proteins (or pairs of proteins) in the workflow of proteinortho.</li>
+<li><a href="https://gitlab.com/paulklemm_PHD/proteinortho/wikis/Tools%20and%20additional%20programs">proteinortho_summary.pl</a></li>
+<li>Various test routines (make test).</li>
 <li><p>New heuristics for connectivity calculation (-step=3).
-</details></p></li>
+</details><details></p>
+<summary>6.0.12: (Click to expand)</summary>
+</li>
+<li><p>improved <a href="https://gitlab.com/paulklemm_PHD/proteinortho/wikis/Tools%20and%20additional%20programs">proteinortho_history.pl</a> : now the program is "smarter" in detecting files automatically</p>
+</li>
+<li>added <a href="https://gitlab.com/paulklemm_PHD/proteinortho/wikis/Tools%20and%20additional%20programs">proteinortho_summary.pl</a> : a tool for summarizing the proteinortho-graph on species level. With the output it is easy to identify weak connected species.   </li>
+<li><p>removed the diamond spam
+</details>
+<details><summary>6.0.13: (Click to expand)</summary></p>
+</li>
+<li><p>added -p=autoblast : this option alows the comparison of aminoacid and nucleotide sequences. E.g. Proteom-vs-Genome: find the protein that corresponds to a given gene (/cluster). </p>
+</li>
+<li>added -isoform={ncbi,uniprot,trinity} option : The reciprocal best hit graph is build using isoform information (isoforms are treated equivalent).
+</details></li>
 </ul>
-
-<h1 id="continuousintegration">Continuous Integration</h1>
-
-<p>supports
-The badge 
-<a href="https://gitlab.com/paulklemm_PHD/proteinortho/commits/master"><img src="https://gitlab.com/paulklemm_PHD/proteinortho/badges/master/pipeline.svg" alt="pipeline status" /></a> indicates the current status of the continuous integration (CI) among various platforms (ubuntu, centos, debian, fedora) and GNU c++ versions (5, 6, latest)
-The whole git repository gets deployed on a clean docker imager (gcc:latest,gcc:5,ubuntu:latest,fedora:latest,debian:latest,centos:latest) and compiled (make all) and tested (make test). The badge is green only if all test are passed. For more information see <a href="https://gitlab.com/paulklemm_PHD/proteinortho/wikis/Continuous%20Integration">Continuous Integration (proteinortho wiki)</a>.</p>
-
-<h1 id="tableofcontents">Table of Contents</h1>
-
+<p>6.0.14 : public release to <a href="https://usegalaxy.eu/">https://usegalaxy.eu/</a></p>
+<p>A more detailed list of all changes: <a href="https://gitlab.com/paulklemm_PHD/proteinortho/blob/master/CHANGELOG">CHANGELOG</a></p>
+<h1 id="table-of-contents">Table of Contents</h1>
 <ol>
 <li><a href="#installation">Installation</a></li>
-
 <li><a href="#synopsis">Synopsis and Description</a></li>
-
 <li><a href="#options">Options/Parameters</a></li>
-
 <li><a href="#poff">PoFF synteny extension</a></li>
-
 <li><a href="#output">Output description</a></li>
-
 <li><a href="#examples">Examples</a></li>
-
-<li><a href="https://gitlab.com/paulklemm_PHD/proteinortho/wikis/Error-Codes">Error Codes and Troubleshooting</a> <- look here if you cannot compile/run (proteinortho wiki)</li>
-
-<li><a href="https://gitlab.com/paulklemm_PHD/proteinortho/wikis/Large-compute-jobs-(the--jobs-option)">Large compute jobs example</a> (proteinortho wiki)</li>
-
-<li><a href="https://gitlab.com/paulklemm_PHD/proteinortho/wikis/biological-example">Biological example</a> (proteinortho wiki)</li>
 </ol>
-
-<p>Bug reports: See chapter 7. or send a mail to incoming+paulklemm-phd-proteinortho-7278443-issue- at incoming.gitlab.com (Please include the 'Parameter-vector' that is printed for all errors)
-You can also send a mail to lechner at staff.uni-marburg.de.</p>
-
+<h1 id="-proteinortho-wiki-https-gitlab-com-paulklemm_phd-proteinortho-wikis-table-of-contents"><a href="https://gitlab.com/paulklemm_PHD/proteinortho/wikis/">Proteinortho-Wiki</a> Table of Contents</h1>
+<ol>
+<li><a href="https://gitlab.com/paulklemm_PHD/proteinortho/wikis/Tools%20and%20additional%20programs">Tools and additional programs</a></li>
+<li><a href="https://gitlab.com/paulklemm_PHD/proteinortho/wikis/Error-Codes">Error Codes and Troubleshooting</a> <- look here if you cannot compile/run proteinortho</li>
+<li><a href="https://gitlab.com/paulklemm_PHD/proteinortho/wikis/Large-compute-jobs-(the--jobs-option">Large compute jobs example</a>)</li>
+<li><a href="https://gitlab.com/paulklemm_PHD/proteinortho/wikis/FAQ">FAQ</a> <br>
+<a href="https://gitlab.com/paulklemm_PHD/proteinortho/wikis/">(...)</a></li>
+</ol>
+<p>Bug reports: Please have a look at chapter <a href="https://gitlab.com/paulklemm_PHD/proteinortho/wikis/Error-Codes">2.</a> first or send a mail to incoming+paulklemm-phd-proteinortho-7278443-issue- at incoming.gitlab.com. (please include the 'parameter-vector' that is printed for all errors)
+You can also send mails to lechner at staff.uni-marburg.de. Any suggestions, feedback and comments are welcome!</p>
 <h1 id="installation">Installation</h1>
-
-<p><strong>Proteinortho comes with precompiled binaries of all executables (Linux/x86) so you should be able to run perl proteinortho6.pl in the downloaded directory.</strong>
+<p> <strong>Proteinortho comes with precompiled binaries of all executables (Linux/x86) so you should be able to run perl proteinortho6.pl in the downloaded directory.</strong>
 You could also move all executables to your favorite directory (e.g. with make install PREFIX=/home/paul/bin).
-If you cannot execute the src/BUILD/Linux<em>x86</em>64/proteinortho_clustering, then you have to recompile with make, see the section 2. Building and installing proteinortho from source.</p>
-
+If you cannot execute the src/BUILD/Linux_x86_64/proteinortho_clustering, then you have to recompile with make, see the section 2. Building and installing proteinortho from source.</p>
 <p><br></p>
+<h4 id="easy-installation-with-bio-conda-for-linux-osx-install-with-bioconda-https-img-shields-io-badge-install-20with-bioconda-brightgreen-svg-style-flat-http-bioconda-github-io-recipes-proteinortho-readme-html-alt-https-img-shields-io-conda-dn-bioconda-proteinortho-svg-style-flat-https-bioconda-github-io-recipes-proteinortho-readme-html-">Easy installation with (bio)conda (for Linux + OSX) <a href="http://bioconda.github.io/recipes/proteinortho/README.html"><img src="https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat" alt="install with bioconda"></a> <a href="https://bioconda.github.io/recipes/proteinortho/README.html"><img src="https://img.shields.io/conda/dn/bioconda/proteinortho.svg?style=flat" alt="alt"></a></h4>
+<pre><code>conda <span class="hljs-keyword">install</span> proteinortho
+</code></pre><p>If you need conda (see <a href="https://docs.anaconda.com/anaconda/install/">here</a>) and the bioconda channel: <code>conda config --add channels defaults && conda config --add channels bioconda && conda config --add channels conda-forge</code>.</p>
+<p><br></p>
+<h4 id="easy-installation-with-brew-for-osx-install-with-brew-https-img-shields-io-badge-install-20with-brew-brightgreen-svg-style-flat-https-formulae-brew-sh-formula-proteinortho-dl-https-img-shields-io-badge-dynamic-json-svg-label-downloads-query-27analytics-27-27install-27-27365d-27-27proteinortho-27-url-https-3a-2f-2fformulae-brew-sh-2fapi-2fformula-2fproteinortho-json-color-green-https-formulae-brew-sh-formula-proteinortho-">Easy installation with brew (for OSX) <a href="https://formulae.brew.sh/formula/proteinortho"><img src="https://img.shields.io/badge/install%20with-brew-brightgreen.svg?style=flat" alt="install with brew"></a> <a href="https://formulae.brew.sh/formula/proteinortho"><img src="https://img.shields.io/badge/dynamic/json.svg?label=downloads&query=$[%27analytics%27][%27install%27][%27365d%27][%27proteinortho%27]&url=https%3A%2F%2Fformulae.brew.sh%2Fapi%2Fformula%2Fproteinortho.json&color=green" alt="dl"></a></h4>
+<pre><code><span class="hljs-keyword">brew </span><span class="hljs-keyword">install </span>proteinortho
+</code></pre><p>If you need brew (see <a href="https://brew.sh/index_de">here</a>)</p>
+<p><br></p>
+<h4 id="easy-installation-with-docker-install-with-docker-https-img-shields-io-badge-install-20with-docker-brightgreen-svg-style-flat-https-quay-io-repository-biocontainers-proteinortho-">Easy installation with docker <a href="https://quay.io/repository/biocontainers/proteinortho"><img src="https://img.shields.io/badge/install%20with-docker-brightgreen.svg?style=flat" alt="install with docker"></a></h4>
+<pre><code>docker pull quay.io<span class="hljs-regexp">/biocontainers/</span><span class="hljs-string">proteinortho:</span>TAG
+</code></pre><p>with TAG specified <a href="https://quay.io/repository/biocontainers/proteinortho?tab=tags">here</a> (e.g. 6.0.23--hfd40d39_0).</p>
+<details>
+  <summary>how to docker (Click to expand)</summary>
 
-<h4 id="easyinstallationwithbiocondaforlinuxosx">Easy installation with (bio)conda (for Linux + OSX)</h4>
-
-<pre><code>conda install proteinortho
-</code></pre>
-
-<p>If you need conda (see <a href="https://docs.anaconda.com/anaconda/install/">here</a>) and the bioconda channel: <code>conda config --add channels defaults && conda config --add channels bioconda && conda config --add channels conda-forge</code>.</p>
-
-<p><a href="http://bioconda.github.io/recipes/proteinortho/README.html"><img src="https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat" alt="install with bioconda" /></a> <a href="https://bioconda.github.io/recipes/proteinortho/README.html"><img src="https://img.shields.io/conda/dn/bioconda/proteinortho.svg?style=flat" alt="alt" /></a></p>
+  <br>
 
-<p><br> </p>
+  To start a bash shell 
 
-<h4 id="easyinstallationwithbrewforosx">Easy installation with brew (for OSX)</h4>
+  <code>docker run --rm -it quay.io/biocontainers/proteinortho:6.0.22--hfd40d39_0 bash</code>
 
-<pre><code>brew install proteinortho
-</code></pre>
+  Here you can start/use proteinortho.
+  You can change "6.0.22--hfd40d39_0" with any tag/version that is available <a href="https://quay.io/repository/biocontainers/proteinortho?tab=tags">here</a>. Sadly there is no ":latest" tag available ...
 
-<p>If you need brew (see <a href="https://brew.sh/index_de">here</a>)</p>
+  ### Now lets try to mount your home in docker
 
-<p><a href="https://formulae.brew.sh/formula/proteinortho"><img src="https://img.shields.io/badge/install%20with-brew-brightgreen.svg?style=flat" alt="install with brew" /></a> <a href="https://formulae.brew.sh/formula/proteinortho"><img src="https://img.shields.io/badge/dynamic/json.svg?label=downloads&query=$[%27analytics%27][%27install%27][%27365d%27][%27proteinortho%27]&url=https%3A%2F%2Fformulae.brew.sh%2Fapi%2Fformula%2Fproteinortho.json&color=green" alt="dl" /></a></p>
+  This is neccessary if you want to access your local files:
 
-<p><br></p>
+  <code>docker run --rm --mount "type=bind,src=/home/$(id -un),dst=/home/$(id -un)" -u $(id -u):$(id -g) -it quay.io/biocontainers/proteinortho:6.0.22--hfd40d39_0 bash</code>
 
-<h4 id="easyinstallationwithdocker">Easy installation with docker</h4>
+  now you have your home directory mounted to /home/YOURNAME. (load your bashrc within docker : <code>source /home/YOURNAME/.bashrc</code>)
 
-<pre><code>docker pull quay.io/biocontainers/proteinortho
-</code></pre>
 
-<p><a href="https://quay.io/repository/biocontainers/proteinortho"><img src="https://img.shields.io/badge/install%20with-docker-brightgreen.svg?style=flat" alt="install with docker" /></a></p>
+</details>
 
 <p><br></p>
-
-<h4 id="easyinstallationwithdpkgrootprivilegesarerequired">Easy installation with dpkg (root privileges are required)</h4>
-
+<h4 id="available-at-galaxy-europe">Available at Galaxy Europe</h4>
+<p>Simply go to the european galaxy server and search for proteinortho:</p>
+<pre><code><span class="hljs-symbol">https:</span><span class="hljs-comment">//usegalaxy.eu</span>
+</code></pre><p>Or you can integrate proteinortho into your own galaxy instance using: <a href="https://toolshed.g2.bx.psu.edu/view/iuc/proteinortho/4850f0d15f01">proteinortho (iuc repository)</a></p>
+<p><br></p>
+<h4 id="easy-installation-with-dpkg-root-privileges-are-required-">Easy installation with dpkg (root privileges are required)</h4>
 <p>The deb package can be downloaded here: <a href="https://packages.debian.org/unstable/proteinortho">https://packages.debian.org/unstable/proteinortho</a>.
 Afterwards the deb package can be installed with <code>sudo dpkg -i proteinortho*deb</code>.</p>
-
 <p><br></p>
-
-<h4 id="easyinstallationwithaptget"><em>(Easy installation with apt-get)</em></h4>
-
+<h4 id="-easy-installation-with-apt-get-"><em>(Easy installation with apt-get)</em></h4>
 <p><strong>! Disclamer: Work in progress !</strong>
-<em>proteinortho will be released to stable with Debian 11 (~2021), then proteinortho can be installed with <code>sudo apt-get install proteinortho</code> (currently this installes the outdated version v5.16b)</em></p>
-
+<em>proteinortho will be released to stable with Debian 11 (~2021), then proteinortho can be installed with <code>apt-get install proteinortho</code> (currently this installes the outdated version v5.16b)</em></p>
 <p><br></p>
-
-<h4 id="1prerequisites">1. Prerequisites</h4>
-
-<p>Proteinortho uses standard software which is often installed already or is part of then package repositories and can thus easily be installed. The sources come with a precompiled version of Proteinortho for 64bit Linux.</p>
-
-<p><details>
-  <summary>To <b>run</b> Proteinortho, you need: (Click to expand)</summary></p>
-
-<ul>
-<li><p>At least one of the following the following programs (default is diamond):</p>
-
-<ul>
-<li>NCBI BLAST+ or NCBI BLAST legacy (to test this, type tblastn. apt-get install ncbi-blast+)</li>
-
-<li>Diamond (apt-get install diamond, brew install diamond, conda install diamond, https://github.com/bbuchfink/diamond)</li>
-
-<li>Last (http://last.cbrc.jp/)</li>
-
-<li>Rapsearch (https://github.com/zhaoyanswill/RAPSearch2)</li>
-
-<li>Topaz (https://github.com/ajm/topaz)</li>
-
-<li>usearch (https://www.drive5.com/usearch/download.html)</li>
-
-<li>ublast (is part of usearch)</li>
-
-<li>blat (http://hgdownload.soe.ucsc.edu/admin/)</li>
-
-<li>mmseqs2 (conda install mmseqs2, https://github.com/soedinglab/MMseqs2)</li></ul></li>
-
-<li><p>Perl v5.08 or higher (to test this, type perl -v in the command line)</p></li>
-
-<li><p>Python v2.6.0 or higher to include synteny analysis (to test this, type 'python -V' in the command line) </p></li>
-
-<li><p>Perl standard modules (these should come with Perl): Thread::Queue, File::Basename, Pod::Usage, threads (if you miss one just install with <code>cpan install ...</code> )
-</details></p></li>
-</ul>
-
-<p><br>
+<h4 id="prerequisites-for-compiling-proteinortho-from-source">Prerequisites for compiling proteinortho from source</h4>
+<p>Proteinortho uses standard software that is often already installed or is available from the package repositories and can thus be installed easily. The sources come with a precompiled version of Proteinortho for 64-bit Linux (x86).</p>
 <details>
-  <summary>To <b>compile</b> Proteinortho (linux/osx), you need: (Click to expand)</summary></p>
-
-<ul>
-<li>GNU make (to test this, type 'make' in the command line)</li>
-
-<li>GNU g++ v4.1 or higher (to test this, type 'g++ --version' in the command line) </li>
-
-<li>openmp (to test this, type 'g++ -fopenmp' in the command line) </li>
-
-<li>(optional) gfortran for compiling LAPACK (to test this, type 'whereis gfortran' in the command line)</li>
-
-<li>(optional) CMake for compiling LAPACK (to test this, type 'cmake' in the command line), OR you can use your own compiled version of lapack (you can get this with 'apt-get install liblapack3') and run 'make USEPRECOMPILEDLAPACK=TRUE'</li>
-</ul>
-
-<p></details></p>
+  <summary>To <b>run</b> Proteinortho, you need: (Click to expand)</summary>
+
+
+   - At least one of the following programs (default is diamond):
+
+     - NCBI BLAST+ or NCBI BLAST legacy (to test this, type tblastn. apt-get install ncbi-blast+)
+     - Diamond (apt-get install diamond, brew install diamond, conda install diamond, <a href="https://github.com/bbuchfink/diamond">https://github.com/bbuchfink/diamond</a>)
+     - Last (<a href="http://last.cbrc.jp/">http://last.cbrc.jp/</a>)
+     - Rapsearch (<a href="https://github.com/zhaoyanswill/RAPSearch2">https://github.com/zhaoyanswill/RAPSearch2</a>)
+     - Topaz (<a href="https://github.com/ajm/topaz">https://github.com/ajm/topaz</a>)
+     - usearch (<a href="https://www.drive5.com/usearch/download.html">https://www.drive5.com/usearch/download.html</a>)
+     - ublast (is part of usearch)
+     - blat (<a href="http://hgdownload.soe.ucsc.edu/admin/">http://hgdownload.soe.ucsc.edu/admin/</a>)
+     - mmseqs2 (conda install mmseqs2, <a href="https://github.com/soedinglab/MMseqs2">https://github.com/soedinglab/MMseqs2</a>)
+   - Perl v5.08 or higher (to test this, type perl -v in the command line)
+   - (optional) Python v3.0 or higher to include synteny analysis (to test this, type 'python -V' in the command line)
+   - Perl standard modules (these should come with Perl): Thread::Queue, File::Basename, Pod::Usage, threads (if one is missing, install it with <code>cpan install ...</code> )
+</details>
 
 <p><br></p>
+<details>
+  <summary>To <b>compile</b> Proteinortho (linux/osx), you need: (Click to expand)</summary>
 
-<h4 id="2buildingandinstallingproteinorthofromsourcelinuxandosx">2. Building and installing proteinortho from source (linux and osx)</h4>
-
-<p>Here you can use a working lapack library, check this with 'dpkg --get-selections | grep lapack'. Install lapack e.g. with 'apt-get install libatlas3-base' or liblapack3.</p>
-
-<p>If you dont have Lapack, then 'make' will automatically compiles Lapack v3.8.0 for you !</p>
-
-<p>Fetch the latest source code archive downloaded from <a href="https://gitlab.com/paulklemm_PHD/proteinortho/-/archive/master/proteinortho-master.zip">here</a> 
-<details> <summary>or from here (Click to expand)</summary></p>
-
-<blockquote>
-  <p>git clone https://gitlab.com/paulklemm_PHD/proteinortho</p>
-  
-  <p>wget https://gitlab.com/paulklemm_PHD/proteinortho/-/archive/master/proteinortho-master.zip
-  </details>
-  <br></p>
-</blockquote>
-
-<ul>
-<li><code>tar -xzvf proteinortho*.tar.gz</code> or <code>unzip proteinortho*.zip</code> : Extract the files</li>
-
-<li><code>cd proteinortho*</code> : Change directory into the extracted folder</li>
+   - GNU make (to test this, type 'make' in the command line)
+   - GNU g++ v4.1 or higher (to test this, type 'g++ --version' in the command line)
+   - openmp (to test this, type 'g++ -fopenmp' in the command line)
+   - (optional) gfortran for compiling LAPACK (to test this, type 'whereis gfortran' in the command line)
+   - (optional) CMake for compiling LAPACK (to test this, type 'cmake' in the command line), OR you can use your own compiled version of lapack (you can get this with 'apt-get install liblapack3') and run 'make USEPRECOMPILEDLAPACK=TRUE'
 
-<li>You can now run proteinortho6.pl directly (linux only).</li>
+</details>
 
-<li><code>make clean && make</code> : If you want to recompile Proteinortho. (For osx you need a newer g++ compiler to support multithreading, see below)</li>
+<p><br></p>
+<h4 id="building-and-installing-proteinortho-from-source-linux-and-osx-">Building and installing proteinortho from source (linux and osx)</h4>
+<p>  Here you can use a working lapack library, check this with 'dpkg --get-selections | grep lapack'. Install lapack e.g. with 'apt-get install libatlas3-base' or liblapack3.</p>
+<p>  If you don't have LAPACK, 'make' will automatically compile LAPACK v3.8.0 for you!</p>
+<p>  Download the latest source code archive from <a href="https://gitlab.com/paulklemm_PHD/proteinortho/-/archive/master/proteinortho-master.zip">here</a></p>
+<details> <summary>or from here (Click to expand)</summary>
 
-<li><code>make install</code> or <code>make install PREFIX=~/bin</code> if you dont have root privileges. </li>
+  > git clone <a href="https://gitlab.com/paulklemm_PHD/proteinortho">https://gitlab.com/paulklemm_PHD/proteinortho</a>
 
-<li><code>make test</code> : To make sure Proteinortho works as expected. The output should look like below (3. Make test output).</li>
-</ul>
+  > wget <a href="https://gitlab.com/paulklemm_PHD/proteinortho/-/archive/master/proteinortho-master.zip">https://gitlab.com/paulklemm_PHD/proteinortho/-/archive/master/proteinortho-master.zip</a>
+</details>
+<br>
 
-<p><details>
-  <summary><b>OSX additional informations (the -fopenmp error)</b></summary></p>
+  - <code>tar -xzvf proteinortho*.tar.gz</code> or <code>unzip proteinortho*.zip</code> : Extract the files
+  - <code>cd proteinortho*</code> : Change directory into the extracted folder
+  - You can now run proteinortho6.pl directly (linux only).
+  - <code>make clean && make</code> : If you want to recompile Proteinortho. (For osx you need a newer g++ compiler to support multithreading, see below)
+  - <code>make install</code> or <code>make install PREFIX=~/bin</code> if you don't have root privileges.
+  - <code>make test</code> : To make sure Proteinortho works as expected. The output should look like below (3. Make test output).
 
+<details>
+  <summary><b>Additional information for OSX (the -fopenmp error)</b></summary>
 <pre>
-Install a newer g++ compiler for -fopenmp support (multithreading) with brew (get brew here https://brew.sh/index_de)
-<pre><code>brew install gcc --without-multilib
-</code></pre>
+Install a newer g++ compiler for -fopenmp support (multithreading) with brew (get brew here <a href="https://brew.sh/index_de">https://brew.sh/index_de</a>)
 
-Then you should have a g++-7 or whatever newer version that there is (g++-8,9,...). 
-Next you have to tell make to use this new compiler with one of the following:<pre><code>ln -s /usr/local/bin/gcc-7 /usr/local/bin/gcc
-ln -s /usr/local/bin/g++-7 /usr/local/bin/g++
-</code></pre>
+<code>brew install gcc --without-multilib</code>
 
-OR(!) specify the new g++ in 'make CXX=/usr/local/bin/g++-7 all'
-</pre>
-
-<p></details></p>
+You should then have g++-7 or a newer version (g++-8, g++-9, ...).
+Next you have to tell make to use this new compiler with one of the following:
+<code>ln -s /usr/local/bin/gcc-7 /usr/local/bin/gcc
+ln -s /usr/local/bin/g++-7 /usr/local/bin/g++</code>
 
-<p><details>
-  <summary>'make' successful output (Click to expand)</summary></p>
+OR(!) specify the new g++ in 'make CXX=/usr/local/bin/g++-7 all'
+</pre>
+</details>
 
+<details>
+  <summary>'make' successful output (Click to expand)</summary>
 <pre>
 [  0%] Prepare proteinortho_clustering ...
-[ 20%] Building **proteinortho_clustering** with LAPACK (static/dynamic linking)
-[ 25%] Building **graphMinusRemovegraph**
-[ 50%] Building **cleanupblastgraph**
-[ 75%] Building **po_tree**
+[ 20%] Building <strong>proteinortho_clustering</strong> with LAPACK (static/dynamic linking)
+[ 25%] Building <strong>graphMinusRemovegraph</strong>
+[ 50%] Building <strong>cleanupblastgraph</strong>
+[ 75%] Building <strong>po_tree</strong>
 [100%] Everything is compiled with no errors.
 </pre>
 
-<p>The compilation of proteinortho_clustering has multiple fall-back routines. If everything fails please look here <a href="https://gitlab.com/paulklemm_PHD/proteinortho/wikis/Error%20Codes">Troubleshooting (proteinortho wiki)</a>.</p>
-
-<p></details></p>
+The compilation of proteinortho_clustering has multiple fall-back routines. If everything fails, please see <a href="https://gitlab.com/paulklemm_PHD/proteinortho/wikis/Error%20Codes">Troubleshooting (proteinortho wiki)</a>.
 
-<h4 id="3maketestoutput">3. Make test output</h4>
-
-<p><details>
-  <summary>'make test' successful output (Click to expand)</summary></p>
+</details>
 
+<h4 id="3-make-test-output">3. Make test output</h4>
+<details>
+  <summary>'make test' successful output (Click to expand)</summary>
 <pre>
 Everything is compiled with no errors.
 [TEST] 1. basic proteinortho6.pl -step=2 tests
@@ -252,42 +219,23 @@ Everything is compiled with no errors.
  [11/11] -p=mmseqsp (mmseqs) test: passed
 [TEST] 2. -step=3 tests (proteinortho_clustering)
  [1/2] various test functions of proteinortho_clustering (-test): passed
- [2/2] Compare results of 'with lapack' and 'without lapack': passed
+ [2/2] Compare results of 'with lapack' and 'without lapack': passed
 [TEST] Clean up all test files...
 [TEST] All tests passed
 </pre>
-
-<p></details></p>
+</details>
 
 <p>If you have problems compiling/running the program go to <a href="https://gitlab.com/paulklemm_PHD/proteinortho/wikis/Error%20Codes">Troubleshooting (proteinortho wiki)</a>.</p>
-
 <p><br></p>
-
 <h1 id="synopsis">SYNOPSIS</h1>
-
-<blockquote>
-  <p><strong>proteinortho6.pl [options] \<fasta file(s)\></strong> (one fasta for each species, at least 2)</p>
-</blockquote>
-
-<p>OR</p>
-
 <blockquote>
-  <p><strong>proteinortho [options] \<fasta file(s)\></strong></p>
+<p><strong>proteinortho [options] \<fasta file(s)\></strong></p>
 </blockquote>
-
+<p>   one fasta for each species; at least 2</p>
 <h1 id="description">DESCRIPTION</h1>
-
-<p><strong>proteinortho</strong> is a tool to detect orthologous genes within different
-  species. For doing so, it compares similarities of given gene sequences
-  and clusters them to find significant groups. The algorithm was designed
-  to handle large-scale data and can be applied to hundreds of species at
-  one. Details can be found in Lechner et al., BMC Bioinformatics. 2011 Apr
-  28;12:124. To enhance the prediction accuracy, the relative order of genes
-  (synteny) can be used as additional feature for the discrimination of
-  orthologs. The corresponding extension, namely PoFF (manuscript in
-  preparation), is already build in Proteinortho.</p>
-
-<p>Proteinortho assumes, that you have all your gene sequences in FASTA
+<p>  <strong>proteinortho</strong> is a tool to detect orthologous genes within different
+  species. </p>
+<p>  Proteinortho assumes that you have all your gene sequences in FASTA
   format either represented as amino acids or as nucleotides. The source
   code archive contains some examples, namely C.faa, E.faa, L.faa, M.faa
   located in the test/ directory. <strong>By default Proteinortho assumes amino</strong>
@@ -296,239 +244,309 @@ Everything is compiled with no errors.
   -p=blastn+ (or some other algorithm). (In case you only have NCBI
   BLAST legacy installed, you need to specify this too, either by adding
   -p=blastp or -p=blastn respectively.) The full command for the example
-  files would thus be </p>
-
+  files would thus be</p>
 <blockquote>
-  <p>proteinortho6.pl -project=test test/C.faa test/E.faa</p>
+<p>proteinortho6.pl -project=test test/C.faa test/E.faa test/L.faa test/M.faa</p>
 </blockquote>
-
-<p>test/L.faa test/M.faa. Instead of naming the FASTA files one by one, you
+<p>  Instead of naming the FASTA files one by one, you
   could also use test/*.faa. Please note that the parameter
   -project=test is optional, for naming the output. With this, you can set the prefix of the output
   files generated by Proteinortho. If you skip the project parameter, the
   default project name will be myproject.</p>
-
-<h1 id="optionsgraphicaluserinterface">OPTIONS graphical user interface</h1>
-
+<h1 id="options-graphical-user-interface">OPTIONS graphical user interface</h1>
 <p>Open <code>proteinorthoHelper.html</code> in your favorite browser or visit <a href="http://lechnerlab.de/proteinortho/">lechnerlab.de/proteinortho</a> online for an interactive exploration of the different options of proteinortho.</p>
-
 <h1 id="options">OPTIONS</h1>
-
-<p><strong>Main parameters</strong> (can be used with -- or -)</p>
-
+<p> <strong>Main parameters</strong> (can be used with -- or -)</p>
 <ul>
 <li><p><strong>--project</strong>=name (default: myproject)
-prefix for all resulting file names</p></li>
-
+prefix for all resulting file names</p>
+</li>
 <li><p><strong>--cpus</strong>=number (default: all available)
 the number of processors to use (multicore/processor support)</p>
-
 <ul>
 <li><p><strong>--ram</strong>=number (default: 90% of free memory)
-maximal used ram threshold for LAPACK and the input graph in MB</p></li>
-
+maximal used ram threshold for LAPACK and the input graph in MB</p>
+</li>
 <li><p><strong>--verbose</strong>={0,1,2} (default: 1)
-verbose level. 1:keeps you informed about the progress</p></li>
-
+verbose level. 1:keeps you informed about the progress</p>
+</li>
 <li><p><strong>--silent</strong>
-sets verbose level to 0.</p></li>
-
+sets verbose level to 0.</p>
+</li>
 <li><p><strong>--temp</strong>=directory(.)
-path to the temporary files</p></li>
-
+path to the temporary files</p>
+</li>
 <li><p><strong>--force</strong>
-forces the recalculation of the blast results in any case in step=2. Also forces the recreation of the database generation in step=1</p></li>
-
+forces the recalculation of the blast results in step=2 in any case. Also forces the databases to be regenerated in step=1</p>
+</li>
 <li><p><strong>--clean</strong>
-removes all database-index-files generated by the -p algorithm afterwards</p></li>
-
+removes all database-index-files generated by the -p algorithm afterwards</p>
+</li>
 <li><p><strong>--step</strong>={0,1,2,3} (default: 0)
-0 -> all. 1 -> prepare blast (build db). 2 -> run all-versus-all
-blast. 3 -> run the clustering.</p></li></ul>
-
-<p><strong>Search options (step 1-2)</strong>
-(output: <myproject>.blast-graph)</p></li>
+0 -> all. 1 -> prepare blast (build db). 2 -> run all-versus-all
+blast. 3 -> run the clustering.</p>
+</li>
 </ul>
+<details>
+ <summary>(Show more information)</summary>
+
+   proteinortho test/*faa 
 
-<p><details>
-  <summary>(Click to expand)</summary></p>
+   # the following 3 commands produce the same results as the command above
+   proteinortho -step=1 test/*faa 
+   proteinortho -step=2 test/*faa 
+   proteinortho -step=3
+
+</details>   
 
 <ul>
-<li><p><strong>--p</strong>=algorithm (default: diamond)</p>
+<li><strong>--keep</strong>
+stores temporary blast results for reuse (same -project= name is mandatory)</li>
+</ul>
+<details>
+ <summary>(Show more information)</summary>
 
-<p><details>
-  <summary>show all algorithms (Click to expand)</summary></p>
+   # 1. generate db files
 
-<pre><code>- autoblast,blastn_legacy,blastp_legacy,tblastx_legacy : legacy blast family (shell commands: blastall -) family. The suffix 'n' or 'p' indicates nucleotide or protein input files.
+   proteinortho -step=1 -project=test -keep infile/*fasta
 
-- autoblast : standard blast+ family 
-automatically detects: blastn,blastp,tblastx,blastx depending on the input (can also be mixed together!)
+   # 2. run the all-versus-all blast of some input files (infile/) 
 
-- blastn+,blastp+,tblastx+ : standard blast+ family (shell commands: blastn,blastp,tblastx)
-family. The suffix 'n' or 'p' indicates nucleotide or protein input files.
+   proteinortho -step=2 -project=test -keep infile/*fasta
 
-- diamond : Only for protein files! standard diamond procedure and for
-genes/proteins of length >40 with the additional --sensitive flag
+   # now you can add more fasta files to infile/ and reuse everything computed so far
 
-- lastn,lastp : lastal. -n : dna files, -p protein files (BLOSUM62
-scoring matrix)!
+   proteinortho -step=2 -project=test -keep infile/*fasta
 
-- rapsearch : Only for protein files! 
+   # finally run clustering
 
-- mmseqsp,mmseqsn : mmseqs2. -n : dna files, -p protein files
+   proteinortho -step=3 -project=test -keep
 
-- topaz : Only for protein files!
+</details>     
 
-- usearch : usearch_local procedure with -id 0 (minimum identity
-percentage).
+<ul>
+<li><strong>--isoform</strong>={ncbi,uniprot,trinity}</li>
+</ul>
+<details><summary>ncbi</summary> 
 
-- ublast : usearch_ublast procedure.
+   isoforms are specified in ncbi style 
 
-- blatp,blatn : blat. -n : dna files, -p protein files
-</code></pre>
+   ---
+   ><strong>ENSMUSP00000021091.8</strong> pep chromosome:GRCm38:11:74673949:74724670:-1 <strong>gene:ENSMUSG00000020745.15</strong> transcript:ENSMUST00000021091.14 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:Pafah1b1 description:platelet-activating factor acetylhydrolase, <strong>isoform</strong> 1b, subunit 1 [Source:MGI Symbol;Acc:MGI:109520]
+   ><strong>ENSMUSP00000099578.2</strong> pep chromosome:GRCm38:11:74673950:74723858:-1 <strong>gene:ENSMUSG00000020745.15</strong> transcript:ENSMUST00000102520.8 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:Pafah1b1 description:platelet-activating factor acetylhydrolase, <strong>isoform</strong> 1b, subunit 1 [Source:MGI Symbol;Acc:MGI:109520]<br>   ---
 
-<p></details>
-<br></p></li>
+   The protein identifiers differ (ENSMUSP00000021091.8, ENSMUSP00000099578.2) but the gene id is the same (ENSMUSG00000020745.15). The word 'isoform' is also mandatory!
 
-<li><p><strong>--e</strong>=evalue (default: 1e-05)
-E-value for blast</p></li>
+</details>
+<details><summary>uniprot</summary> 
+
+   isoforms are specified in uniprot style using the '*_additional.fa' files 
 
-<li><p><strong>--selfblast</strong>
-apply selfblast, detects paralogs without orthologs</p></li>
+   E.g. C.fa: 
 
-<li><p><strong>--sim</strong>=float (default: 0.95)
-min. similarity for additional hits</p></li>
+   ---
+   >tr|ADHA2|R4GDP1_DANRE Gamma-aminobutyric
+   (...)
+   ---
 
-<li><p><strong>--identity</strong>=number (default: 25)
-min. percent identity of best blast hits</p></li>
+   C_additional.fa: 
 
-<li><p><strong>--cov</strong>=number (default: 50)
-min. coverage of best blast alignments in %</p></li>
+   ---
+   >tr|QDHQ4|R4GDP1_DANRE isoform of ADHA2
+   (...)
+   ---
+
+   QDHQ4 is the isoform of ADHA2. Simply add the *_additional.fa files to the proteinortho call!
 
-<li><p><strong>--subparaBlast</strong>='options'
-additional parameters for the search tool (-p=blastp+,diamond,...) example -subpara='-seg no'
-or -subpara='--more-sensitive' for diamond
 </details>
-<br></p>
+<details><summary>trinity</summary> 
 
-<p><strong>Synteny options (optional, step 2)</strong>
-(output: <myproject>.ffadj-graph, <myproject>.poff.tsv (tab separated file)-graph)</p></li>
-</ul>
+   isoforms are specified in trinity style:
 
-<p><details>
-  <summary>(Click to expand)</summary></p>
+   ---
+   >TRINITY_DN1000_c115_g5_i1 len=247 path=[31015:0-148 23018:149-246]
+   (...)
+   ---
 
-<ul>
-<li><p><strong>--synteny</strong>
-activate PoFF extension to separate similar by contextual adjacencies
-(requires .gff for each .fasta)</p></li>
+   The protein id is TRINITY_DN1000_c115_g5 and the isoform id is specified with i1
 
-<li><p><strong>--dups</strong>=number (default: 0)
-PoFF: number of reiterations for adjacencies heuristic, to determine
-duplicated regions</p></li>
+</details>
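As a rough illustration of the ncbi and trinity header conventions described above, here is a minimal Python sketch. It only mirrors the example headers shown in this manual; the function name `isoform_key` and both regular expressions are assumptions made for illustration and are not taken from proteinortho6.pl, whose actual search patterns may differ.

```python
import re

# Hypothetical helper (not part of proteinortho): map a FASTA header to
# (gene_id, isoform_id) so that isoforms of one gene share the same gene_id.
TRINITY = re.compile(r"^(?P<gene>\S+)_i(?P<iso>\d+)$")   # e.g. ..._g5_i1
NCBI_GENE = re.compile(r"\bgene:(\S+)")                   # e.g. gene:ENSMUSG...


def isoform_key(header, style):
    """Return (gene_id, isoform_id), or None if the header does not match."""
    header = header.lstrip(">")
    first_token = header.split()[0]
    if style == "trinity":
        # the trailing _i<number> names the isoform, the rest names the gene
        m = TRINITY.match(first_token)
        return (m.group("gene"), "i" + m.group("iso")) if m else None
    if style == "ncbi":
        # per the text above, the word 'isoform' must occur in the header
        m = NCBI_GENE.search(header)
        if m and "isoform" in header:
            return (m.group(1), first_token)
        return None
    raise ValueError("unsupported style: " + style)
```

Two headers that yield the same gene id under this sketch would be treated as isoforms of one gene; the uniprot style is left out because it depends on a second file rather than on the header alone.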
 
-<li><p><strong>--cs</strong>=number (default: 3)
-PoFF: Size of a maximum common substring (MCS) for adjacency matches</p></li>
+<p><strong>Search options (step 1-2)</strong>
+(output: <myproject>.blast-graph)</p>
+<ul>
+<li><strong>--p</strong>=algorithm (default: diamond) </li>
+</ul>
+<p><details></p>
+ <summary>show all options (Click to expand)</summary>
 
-<li><p><strong>--alpha</strong>=number (default: .5)
-PoFF: weight of adjacencies vs. sequence similarity
+<ul>
+<li><p>autoblast : automatically detects the blast+ program (blastp,blastn,tblastn,blastx) depending on the input (can also be mixed together!)</p>
+</li>
+<li><p>blastn_legacy,blastp_legacy,tblastx_legacy : legacy blast family (shell commands: blastall -). The suffix 'n' or 'p' indicates nucleotide or protein input files.</p>
+</li>
+<li><p>blastn+,blastp+,tblastx+ : standard blast family (shell commands: blastn,blastp,tblastx).
+The suffix 'n' or 'p' indicates nucleotide or protein input files.</p>
+</li>
+<li><p>diamond : Only for protein files! standard diamond procedure and for
+genes/proteins of length >40 with the additional --sensitive flag
+Warning: Please use version 0.9.29 or later to avoid this known bug: #24</p>
+</li>
+<li><p>lastn,lastp : lastal. -n : dna files, -p protein files (BLOSUM62 scoring matrix)!</p>
+</li>
+<li><p>rapsearch : Only for protein files!</p>
+</li>
+<li><p>mmseqsp,mmseqsn : mmseqs2. -n : dna files, -p protein files</p>
+</li>
+<li><p>topaz : Only for protein files!</p>
+</li>
+<li><p>usearch : usearch_local procedure with -id 0 (minimum identity
+percentage).</p>
+</li>
+<li><p>ublast : usearch_ublast procedure.</p>
+</li>
+<li><p>blatp,blatn : blat. -n : dna files, -p protein files
 </details>
 <br></p>
-
-<p><strong>Clustering options (step 3)</strong>
-(output: <myproject>.proteinortho.tsv, <myproject>.proteinortho.html, <myproject>.proteinortho-graph)</p></li>
+</li>
 </ul>
+<ul>
+<li><strong>--sim</strong>=float (default: 0.95)
+min. reciprocal similarity for additional hits. 1 : only the best reciprocal hits are reported, 0 : all possible reciprocal blast matches (within the -evalue) are reported.</li>
+</ul>
+</li>
+</ul>
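The effect of --sim can be pictured with a small Python sketch. It assumes that similarity is measured relative to the best score found for a query, which the option text above does not state explicitly; the function name `adaptive_best_hits` and the (target, score) tuple layout are hypothetical and do not reflect proteinortho's internal data structures.

```python
def adaptive_best_hits(hits, sim=0.95):
    """Keep every hit whose score is at least sim * (best score).

    hits: list of (target_id, score) tuples for one query.
    sim=1 keeps only the best-scoring hit(s); sim=0 keeps every hit.
    """
    if not hits:
        return []
    best = max(score for _, score in hits)
    return [(target, score) for target, score in hits if score >= sim * best]
```

With the default of 0.95, a hit scoring 96 next to a best hit of 100 is kept, while one scoring 90 is dropped.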
+<details>
+  <summary>More (Click to expand)</summary>
 
-<p><details>
-  <summary>(Click to expand)</summary></p>
+  - <strong>--e</strong>=evalue (default: 1e-05)
+    E-value for blast
 
-<ul>
-<li><p><strong>--singles</strong>
-report singleton genes without any hit</p></li>
+  - <strong>--selfblast</strong>
+    apply selfblast, detects paralogs without orthologs
+
+  - <strong>--identity</strong>=number (default: 25)
+    min. percent identity of best blast hits
+
+  - <strong>--cov</strong>=number (default: 50)
+    min. coverage of best blast alignments in %
 
-<li><p><strong>--purity</strong>=float (default: 1e-7)
-avoid spurious graph assignments</p></li>
+  - <strong>--subparaBlast</strong>='options'
+    additional parameters for the search tool (-p=blastp+,diamond,...) example -subpara='-seg no'
+    or -subpara='--more-sensitive' for diamond
+</details>
+<br>
 
-<li><p><strong>--conn</strong>=float (default: 0.1)
-min. algebraic connectivity. <b>This is the main parameter for the clustering step.</b> Choose larger values then more splits are done, resulting in more and smaller clusters.</p></li>
+ <strong>Synteny options (optional, step 2)</strong>
+  (output: <myproject>.ffadj-graph, <myproject>.poff.tsv (tab separated file)-graph)
 
-<li><p><strong>--minspecies</strong>=float (default: 1, must be >=0)
-min. number of genes per species. If a group is found with up to (minspecies) genes/species, it wont be split again (regardless of the connectivity).</p></li>
+<details>
+  <summary>More (Click to expand)</summary>
 
-<li><p><strong>--nograph</strong>
-do not generate *-graph file (pairwise orthology relations)</p></li>
+  - <strong>--synteny</strong>
+    activate PoFF extension to separate similar sequences by contextual adjacencies
+    (requires .gff for each .fasta)
 
-<li><p><strong>--subparaCluster</strong>='options'
-additional parameters for the clustering algorithm (proteinortho_clustering) example -subparaCluster='-maxnodes 10000'. 
-Note: -rmgraph cannot be set. All other parameters of subparaCluster are replacing the default values (like -cpus or -minSpecies)</p></li>
+  - <strong>--dups</strong>=number (default: 0)
+    PoFF: number of reiterations for adjacencies heuristic, to determine
+    duplicated regions
 
-<li><p><strong>--xml</strong>
-do generate an orthologyXML file (see http://www.orthoxml.org for more information). You can also use proteinortho2xml.pl <myproject.proteinortho>.</p></li>
+  - <strong>--cs</strong>=number (default: 3)
+    PoFF: Size of a maximum common substring (MCS) for adjacency matches
 
-<li><p><strong>--exactstep3</strong>
-perform step 3 without the k-mere heuristic (much slower for huge
-datasets but more precise)</p></li>
+  - <strong>--alpha</strong>=number (default: .5)
+    PoFF: weight of adjacencies vs. sequence similarity
+</details>
+<br>
 
-<li><p><strong>--mcl</strong>
-perform the clustering without the k-mere heuristic. The k-mere heuristic is only applied for very large connected components (>1e+6 nodes) and if the algorithm would start to iteratate very slowly</p></li>
-</ul>
+ <strong>Clustering options (step 3)</strong>
+  (output: <myproject>.proteinortho.tsv, <myproject>.proteinortho.html, <myproject>.proteinortho-graph)
 
-<p></details>
-<br></p>
+  - <strong>--conn</strong>=float (default: 0.1)
+    min. algebraic connectivity. <b>This is the main parameter for the clustering step.</b> With larger values, more splits are made, resulting in more and smaller clusters. (Clusters with an algebraic connectivity below this threshold are still allowed if the protein-to-species ratio is good enough; see the -minspecies option below.)
 
-<p><strong>Misc options</strong></p>
+<details>
 
-<p><details>
-  <summary>(Click to expand)</summary></p>
+  <summary>More (Click to expand)</summary>
 
-<ul>
-<li><p><strong>--cleanblast</strong>
-cleans blast-graph with proteinortho_cleanupblastgraph</p></li>
+  - <strong>--singles</strong>
+    report singleton genes without any hit
 
-<li><p><strong>--checkfasta</strong>
-checks input fasta files if the given algorithm can process the given fasta file.</p></li>
+  - <strong>--purity</strong>=float (default: 1e-7)
+    avoid spurious graph assignments
 
-<li><p><strong>--desc</strong>
-write description files (for NCBI FASTA input only)</p></li>
+  - <strong>--minspecies</strong>=float (default: 1, must be >=0)
+    min. number of genes per species. If a group is found with up to (minspecies) genes/species, it won't be split again (regardless of the connectivity).
 
-<li><p><strong>--binpath</strong>=directory (default: PATH)
-path to your local executables (blast, diamond, mcl, ...)</p></li>
+  - <strong>--nograph</strong>
+    do not generate <em>-graph file (pairwise orthology relations)
 
-<li><p><strong>--debug</strong>
-gives detailed information for bug tracking</p></li>
-</ul>
+  - <strong>--subparaCluster</strong>='options'
+    additional parameters for the clustering algorithm (proteinortho_clustering) example -subparaCluster='-maxnodes 10000'.
+    Note: -rmgraph cannot be set. All other parameters of subparaCluster are replacing the default values (like -cpus or -minSpecies)
 
-<p></details>
-<br></p>
+  - <strong>--xml</strong>
+    do generate an orthologyXML file (see <a href="http://www.orthoxml.org">http://www.orthoxml.org</a> for more information). You can also use proteinortho2xml.pl <myproject.proteinortho>.
 
-<p><strong>Large compute jobs</strong></p>
+  - <strong>--exactstep3</strong>
+    perform step 3 without the k-mer heuristic (much slower for huge
+    datasets but more precise)
 
-<ul>
-<li><p><strong>--jobs</strong>=M/N
-If you want to involve multiple machines or separate a Proteinortho
-run into smaller chunks, use the -jobs=<strong>M</strong>/<strong>N</strong> option. First, run
-'proteinortho6.pl -steps=1 ...' to generate the indices. Then you can
-run 'proteinortho6.pl -steps=2 -jobs=<strong>M</strong>/<strong>N</strong> ...' to run small chunks
-separately. Instead of <strong>M</strong> and <strong>N</strong> numbers must be set representing the
-number of jobs you want to divide the run into (<strong>M</strong>) and the job
-division to be performed by the process. E.g. to divide a Proteinortho
-run into 4 jobs to run on several machines, use 'proteinortho6.pl -steps=2 -jobs=1/4', 'proteinortho6.pl -steps=2 -jobs=1/4', 'proteinortho6.pl -steps=2 -jobs=2/4', 'proteinortho6.pl -steps=2 -jobs=3/4', 'proteinortho6.pl -steps=2 -jobs=4/4'.</p>
-
-<p>See <a href="https://gitlab.com/paulklemm_PHD/proteinortho/wikis/Large-compute-jobs-(the--jobs-option)">Large compute jobs, the --jobs option (proteinortho wiki)</a> for more details.</p></li>
-</ul>
+  - <strong>--mcl</strong>
+    perform the clustering without the k-mer heuristic. The k-mer heuristic is only applied for very large connected components (>1e+6 nodes) and if the algorithm would start to iterate very slowly
 
-<p><br></p>
 
-<h1 id="poff">PoFF</h1>
+</details>
+<br>
+
+ <strong>Misc options</strong>
 
-<p>The PoFF extension allows you to use the relative order of genes (synteny)
+  - <strong>--checkfasta</strong>
+    checks whether the given algorithm can process the input fasta files.
+
+<details>
+  <summary>(Click to expand)</summary>
+
+  - <strong>--cleanblast</strong>
+    cleans blast-graph with proteinortho_cleanupblastgraph
+
+  - <strong>--desc</strong>
+    write description files (for NCBI FASTA input only)
+
+  - <strong>--binpath</strong>=directory (default: $PATH)
+    path to your local executables (blast, diamond, mcl, ...)
+
+  - <strong>--debug</strong>
+    gives detailed information for bug tracking
+
+</details>
+<br>
+
+ <strong>Large compute jobs</strong>
+  - <strong>--jobs</strong>=M/N
+    If you want to involve multiple machines or separate a Proteinortho
+    run into smaller chunks, use the -jobs=<strong>M</strong>/<strong>N</strong> option. First, run
+    'proteinortho6.pl -step=1 ...' to generate the indices. Then you can
+    run 'proteinortho6.pl -step=2 -jobs=<strong>M</strong>/<strong>N</strong> ...' to run small chunks
+    separately. Replace <strong>M</strong> and <strong>N</strong> with numbers: <strong>N</strong> is the
+    number of jobs you want to divide the run into and <strong>M</strong> is the job
+    division to be performed by the process. E.g. to divide a Proteinortho
+    run into 4 jobs to run on several machines, use 'proteinortho6.pl -step=2 -jobs=1/4', 'proteinortho6.pl -step=2 -jobs=2/4', 'proteinortho6.pl -step=2 -jobs=3/4', 'proteinortho6.pl -step=2 -jobs=4/4'.
+
+    See <a href="https://gitlab.com/paulklemm_PHD/proteinortho/wikis/Large-compute-jobs-(the--jobs-option)">Large compute jobs, the --jobs option (proteinortho wiki)</a> for more details.
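To make the M/N scheme concrete, the following Python sketch prints one command line per chunk. It is only a convenience for building the command strings: the function name `job_commands` and the default arguments are hypothetical, and the `-step` spelling follows the `--step` entry in the option list above.

```python
def job_commands(n_jobs, extra_args="-project=test test/*.faa"):
    """Build one 'proteinortho6.pl -step=2 -jobs=M/N' command per chunk M of N."""
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    return [
        "proteinortho6.pl -step=2 -jobs={}/{} {}".format(m, n_jobs, extra_args)
        for m in range(1, n_jobs + 1)
    ]


if __name__ == "__main__":
    # step 1 (index generation) is assumed to have been run once beforehand
    for command in job_commands(4):
        print(command)
```

Each printed line can then be submitted to a different machine or queue slot; once all chunks have finished, the clustering step is run once as usual.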
+
+<br>
+
+<h1 id="poff">PoFF</h1>
+
+  The PoFF extension allows you to use the relative order of genes (synteny)
   as an additional criterion to disentangle complex co-orthology relations.
   To do so, add the parameter -synteny. You can use it to either come closer
   to one-to-one orthology relations by preferring syntenically conserved
   copies in the presence of two very similar paralogs (default), or just to
   reduce noise in the predictions by detecting multiple copies of genomic
   areas (add the parameter -dups=3). Please note that you need additional
-  data to include synteny, namely the gene positions in GFF3 format. 
+  data to include synteny, namely the gene positions in GFF3 format.
   As Proteinortho is primarily made for proteins, it will only accept GFF
   entries of type CDS (column #3 in the GFF-file). The attributes column
   (#9) must contain Name=GENE IDENTIFIER where GENE IDENTIFIER corresponds
@@ -537,268 +555,244 @@ run into 4 jobs to run on several machines, use 'proteinortho6.pl -steps=2 -jobs
   files are provided in the source code archive. Hence, we can run
   proteinortho6.pl -project=test -synteny test/A1.faa test/B1.faa test/E1.faa
   test/F1.faa to add synteny information to the calculations. Of course,
-  this only makes sense if species are sufficiently similar. You won't gain
+  this only makes sense if species are sufficiently similar. You won't gain
   much when comparing e.g. bacteria with fungi. When the analysis is done
   you will find an additional file in your current working directory, namely
   test.poff.tsv (tab separated file). This file is equivalent to the test.proteinortho.tsv file (above) but
   can be considered more accurate as synteny was involved for its
-  construction.</p>
-
-<h1 id="output">Output</h1>
-
-<p><strong>BLAST Search (step 1-2)</strong></p>
-
-<p><details>
-  <summary>myproject.blast-graph (Click to expand)</summary></p>
-
-<pre><code>filtered raw blast data based on adaptive reciprocal best blast
-matches (= reciprocal best match plus all reciprocal matches within a
-range of 95% by default) The first two rows are just comments
-explaining the meaning of each row. Whenever a further comment line (starting
-with #) follows, it indicates results comparing the two species is
-about to follow. E.g. # M.faa L.faa tells that the next lines representing
-results for species M and L. All matches are reciprocal matches. If
-e.g. a match for M_15 L_15 is shown, L_15 M_15 exists implicitly.
-E-Values and bit scores for both directions are given behind each
-match.
-The 4 comment numbers ('# 3.8e-124        434.9...') are representing the median values of  
-evalue_ab, bitscore_ab, evalue_ba and bitscore_ba.
-
-  # file_a    file_b
-  # a   b     evalue_ab     bitscore_ab   evalue_ba     bitscore_ba 
-  # E.faa     C.faa   
-  # 3.8e-124        434.9   2.8e-126        442.2
-  E_11  C_11  5.9e-51 190.7   5.6e-50 187.61
-  E_10  C_10  3.8e-124    434.9   2.8e-126    442.2
-  ...
-</code></pre>
-
-<p></details>
- <br></p>
-
-<p><strong>Clustering (step 3)</strong></p>
-
-<p><details>
+  construction.
+
+# Output
+ <strong>BLAST Search (step 1-2)</strong>
+
+<details>
+  <summary>myproject.blast-graph (Click to expand)</summary>
+
+    filtered raw blast data based on adaptive reciprocal best blast
+    matches (= reciprocal best match plus all reciprocal matches within a range of 95% by default)
+
+    A line starting with # indicates the two species that are analysed below. E.g. '# M.faa L.faa' tells that the next lines are for species M versus species L.
+
+    All matches are reciprocal matches. If
+    e.g. a match for M_15 L_15 is shown, L_15 M_15 exists implicitly.
+
+    E-Values and bit scores for both directions are given behind each
+    match. 
+
+    The 4 numbers below the species (e.g. '# 3.8e-124        434.9...') represent the median values of evalue_ab, bitscore_ab, evalue_ba and bitscore_ba for this comparison.
+
+      # file_a    file_b
+      # a   b     evalue_ab     bitscore_ab   evalue_ba     bitscore_ba
+      # E.faa     C.faa
+      # 3.8e-124        434.9   2.8e-126        442.2
+      E_11  C_11  5.9e-51 190.7   5.6e-50 187.61
+      E_10  C_10  3.8e-124    434.9   2.8e-126    442.2
+      ...
+ </details>
+ <br>
+
+ <strong>Clustering (step 3)</strong>
+
+<details>
   <summary>myproject.proteinortho-graph (Click to expand)</summary>
-    clustered myproject.blast-graph. Its connected components are represented in myproject.proteinortho.tsv / myproject.proteinortho.html. The format of myproject.blast-graph is the same as the
-    blast-graph (see above).</p>
 
-<pre><code>  # file_a    file_b
-  # a   b     evalue_ab     bitscore_ab   evalue_ba     bitscore_ba
-  # E.faa     C.faa
-  E_10  C_10  3.8e-124    434.9   2.8e-126    442.2
-  E_11  C_11  5.9e-51 190.7   5.6e-50 187.6
-</code></pre>
+    clustered version of the myproject.blast-graph.
 
-<p></details>
- <br></p>
+    Its connected components are represented in myproject.proteinortho.tsv / myproject.proteinortho.html.
 
-<p><details>
+    The format of myproject.proteinortho-graph is equivalent to that of myproject.blast-graph (see above).
+
+      # file_a    file_b
+      # a   b     evalue_ab     bitscore_ab   evalue_ba     bitscore_ba
+      # E.faa     C.faa
+      E_10  C_10  3.8e-124    434.9   2.8e-126    442.2
+      E_11  C_11  5.9e-51 190.7   5.6e-50 187.6
+      ...
+ </details>
+ <br>
+
+ <details>
   <summary> myproject.proteinortho.tsv (Click to expand)</summary>
-    The connected components. The first line starting with #is a comment
-    line indicating the meaning of each column for each of the following
-    lines which represent an orthologous group each. The very first column
-    indicates the number of species covered by this group. The second
-    column indicates the number of genes included in the group. Often,
-    this number will equal the number of species, meaning that there is a
-    single ortholog in each species. If the number of genes is bigger than
-    the number of species, there are co-orthologs present. The third
-    column gives rise to the algebraic connectivity of the respective
-    group. Basically, this indicates how densely the genes are connected
-    in the orthology graph that was used for clustering. A connectivity of
-    1 indicates a perfect dense cluster with each gene similar to each
-    other gene. By default, Proteinortho splits each group into two more
-    dense subgroups when the connectivity is below 0.1 (can be user defined).
-    Hint: you can open this file in Excel / Numbers / Open Office.</p>
-
-<pre><code>  # Species   Genes   Alg.-Conn.    C.faa   C2.faa  E.faa   L.faa   M.faa
-  2   5     0.16  *     *     *     L_643,L_641   M_649,M_640,M_642
-  3   6     0.138   C_164,C_166,C_167,C_2   *     *     L_2   M_2
-  2   4     0.489   *     *     *     L_645,L_647   M_644,M_646
-</code></pre>
-
-<p></details>
- <br></p>
-
-<p><details>
+
+    The connected components of myproject.proteinortho-graph. 
+
+    The very first column indicates the number of species covered by this group. 
+    The second column indicates the number of genes included in this group. 
+
+    If the number of genes is bigger than the number of species, there are co-orthologs present. 
+
+    The third column gives the algebraic connectivity of the respective group. This indicates how densely the genes are connected
+    in the orthology graph that was used for clustering. 
+    A connectivity of 1 indicates a perfectly dense cluster with each gene being connected/orthologous to each
+    other gene.
+
+    By default, Proteinortho splits each group into two more dense subgroups when the connectivity is below 0.1 (can be user defined).
+
+    Hint: you can open this file in Excel / Numbers / Open Office.
+
+      # Species   Genes   Alg.-Conn.    C.faa   C2.faa  E.faa   L.faa   M.faa
+      2   5     0.16  *     *     *     L_643,L_641   M_649,M_640,M_642
+      3   6     0.138   C_164,C_166,C_167,C_2   *     *     L_2   M_2
+      2   4     0.489   *     *     *     L_645,L_647   M_644,M_646
+
+ </details>
+ <br>
+
+<a href="https://gitlab.com/paulklemm_PHD/proteinortho/wikis/Tools-and-additional-programs#proteinortho-graphblast-graph-species-summary-table">myproject.proteinortho-graph.summary</a>
+
+ <br>
+ <details>
   <summary> myproject.proteinortho.html (Click to expand)</summary>
     The html version of the myproject.proteinortho.tsv file
  </details>
- <br></p>
+ <br>
 
-<p><strong>POFF (-synteny)</strong></p>
+ <strong>POFF (-synteny)</strong>
 
-<p>The synteny based graph files (myproject.ffadj-graph and
+  The synteny based graph files (myproject.ffadj-graph and
   myproject.poff.tsv (tab separated file)-graph) have two additional columns: same_strand and
   simscore. The first one indicates whether the two genes of a match are located on
   the same strand (1) or not (-1). The second one is an internal score
   which can be interpreted as a normalized weight ranging from 0 to 1 based
   on the respective e-values. Moreover, a second comment line follows
-  after the species lines, e.g.</p>
+  after the species lines, e.g.
 
-<pre><code># M.faa L.faa
-# Scores: 4   39    34.000000     39.000000
-</code></pre>
+    # M.faa L.faa
+    # Scores: 4   39    34.000000     39.000000
 
-<p><details>
-  <summary>myproject.ffadj-graph (Click to expand)</summary></p>
+  <details>
+  <summary>myproject.ffadj-graph (Click to expand)</summary>
 
-<pre><code>filtered blast data based on adaptive reciprocal best blast matches
-and synteny (only if -synteny is set)
-</code></pre>
+    filtered blast data based on adaptive reciprocal best blast matches
+    and synteny (only if -synteny is set)
 
-<p></details>
- <br></p>
+ </details>
+ <br>
 
-<p><details>
-  <summary>myproject.poff.tsv (tab separated file)-graph (Click to expand)</summary></p>
+  <details>
+  <summary>myproject.poff.tsv (tab separated file)-graph (Click to expand)</summary>
 
-<pre><code>clustered ffadj graph. Its connected components are represented in
-myproject.poff.tsv (tab separated file) (only if -synteny is set)
-</code></pre>
+    clustered ffadj graph. Its connected components are represented in
+    myproject.poff.tsv (tab separated file) (only if -synteny is set)
 
-<p></details>
- <br></p>
+ </details>
+ <br>
 
-<h1 id="examples">EXAMPLES</h1>
 
-<p><strong>Calling proteinortho</strong>
+# EXAMPLES
+ <strong>Calling proteinortho</strong>
   Sequences are typically given in plain fasta format like the files in
-  test/</p>
+  test/
 
-<p>test/C.faa:</p>
+  test/C.faa:
 
-<pre><code>>C_10
-VVLCRYEIGGLAQVLDTQFDMYTNCHKMCSADSQVTYKEAANLTARVTTDRQKEPLTGGY
-HGAKLGFLGCSLLRSRDYGYPEQNFHAKTDLFALPMGDHYCGDEGSGNAYLCDFDNQYGR
-...
-</code></pre>
+    >C_10
+    VVLCRYEIGGLAQVLDTQFDMYTNCHKMCSADSQVTYKEAANLTARVTTDRQKEPLTGGY
+    HGAKLGFLGCSLLRSRDYGYPEQNFHAKTDLFALPMGDHYCGDEGSGNAYLCDFDNQYGR
+    ...
 
-<p>test/E.faa:</p>
+   test/E.faa:
 
-<pre><code>>E_10
-CVLDNYQIALLRNVLPKLFMTKNFIEGMCGGGGEENYKAMTRATAKSTTDNQNAPLSGGF
-NDGKMGTGCLPSAAKNYKYPENAVSGASNLYALIVGESYCGDENDDKAYLCDVNQYAPNV
-...
-</code></pre>
+    >E_10
+    CVLDNYQIALLRNVLPKLFMTKNFIEGMCGGGGEENYKAMTRATAKSTTDNQNAPLSGGF
+    NDGKMGTGCLPSAAKNYKYPENAVSGASNLYALIVGESYCGDENDDKAYLCDVNQYAPNV
+    ...
 
-<p>To run proteinortho for these sequences, simply call</p>
+  To run proteinortho for these sequences, simply call
 
-<pre><code>perl proteinortho6.pl test/C.faa test/E.faa test/L.faa test/M.faa
-</code></pre>
+    perl proteinortho6.pl test/C.faa test/E.faa test/L.faa test/M.faa
 
-<p>To give the outputs the name 'test', call</p>
+  To give the outputs the name 'test', call
 
-<pre><code>perl proteinortho6.pl -project=test test/*faa
-</code></pre>
+    perl proteinortho6.pl -project=test test/*faa
 
-<p>To use blast instead of the default diamond, call</p>
+  To use blast instead of the default diamond, call
 
-<pre><code>perl proteinortho6.pl -project=test -p=blastp+ test/*faa
-</code></pre>
+    perl proteinortho6.pl -project=test -p=blastp+ test/*faa
 
-<p>If installed with make install, you can also call</p>
+  If installed with make install, you can also call
 
-<pre><code>proteinortho -project=test -p=blastp+ test/*faa
-</code></pre>
+    proteinortho -project=test -p=blastp+ test/*faa
 
-<h1 id="hints">Hints</h1>
 
-<p>Using .faa to indicate that your file contains amino acids and .fna to
-  show it contains nucleotides makes life much easier.</p>
+# Hints
+  Using .faa to indicate that your file contains amino acids and .fna to
+  show it contains nucleotides makes life much easier.
 
-<p>Sequence IDs must be unique within a single FASTA file. Consider renaming
+  Sequence IDs must be unique within a single FASTA file. Consider renaming
   otherwise. Note: Until version 5.15, sequence IDs had to be unique among
   the whole dataset. Proteinortho now keeps track of name and species to
-  avoid the necessissity of renaming.</p>
+  avoid the necessity of renaming.
 
-<p>You need write permissions in the directory of your FASTA files as
+  You need write permissions in the directory of your FASTA files as
   Proteinortho will create blast databases. If this is not the case,
-  consider using symbolic links to the FASTA files.</p>
+  consider using symbolic links to the FASTA files.
 
-<p>The directory src contains useful tools, e.g. proteinortho<em>grab</em>proteins.pl which
+  The directory src contains useful tools, e.g. proteinortho_grab_proteins.pl which
   fetches protein sequences of orthologous groups from Proteinortho output
-  table. (These files are installed during 'make install')</p>
+  table. (These files are installed during 'make install')
 
-<h1 id="kmereheuristic">Kmere Heuristic</h1>
 
-<h2 id="example1">Example 1</h2>
+# Kmere Heuristic
 
-<p>In the following example a huge blast graph is used for step 3 (clustering). 
-The first connected component contains 7410694 nodes, hence the kmere heuristic is activated.
-Since the fiedler vector would result in a good split, the kmere heuristic is then deactivated immediatly.</p>
-
-<p><details>
-  <summary>as fallback (Click to expand)</summary></p>
 
-<pre><code>...
-[CRITICAL WARNING]   Failed to partition subgraph with 6929 nodes into (6929,0,0) sized groups, now using kmere heuristic as fall-back.
-...
-</code></pre>
+## Example 1
 
-<p></details></p>
+In the following example a huge blast graph is used for step 3 (clustering).
+The first connected component contains 7410694 nodes, hence the kmere heuristic is activated.
+Since the Fiedler vector would result in a good split, the kmere heuristic is then deactivated immediately.
 
-<p><details>
-<summary>working example for large graphs (Click to expand)</summary></p>
+<details>
+  <summary>as fallback (Click to expand)</summary>
 
-<pre><code>...
-17:32:15 [DEBUG] (kmere-heuristic) The current connected component is so large that the k-mere heuristic can be used. First: Testing if a normal split would result in a good partition (|.|>20%) of the CC.
- [WARNING] (kmere-heuristic) A normal split would NOT result in a good partition (|.|>20%) of the CC, therefore  the k-mere heuristic is now used. The current connected component will be split in 3.85373 (= number of proteins <6929> / ( n
-odes per species <1> * number of species <1798>)) groups greedily accordingly to the fiedler vector.
-...
-</code></pre>
+    ...
+    [CRITICAL WARNING]   Failed to partition subgraph with 6929 nodes into (6929,0,0) sized groups, now using kmere heuristic as fall-back.
+    ...
 
-<p></details></p>
+</details>
 
-<p><details>
-<summary>example for large graphs, where kmere is tested but not needed (Click to expand)</summary></p>
+<details>
+<summary>working example for large graphs (Click to expand)</summary>
 
-<pre><code>...
-20:27:07 [DEBUG] (kmere-heuristic) The current connected component is so large that the k-mere heuristic can be used. First: Testing if a normal split would result in a good partition (|.|>20%) of the CC.
-20:27:09 [DEBUG] (kmere-heuristic) A normal split would result in a good partition (|.|>20%) of the CC, therefore returning now to the normal algorithm (no k-mere heuristic).
-...
-</code></pre>
+    ...
+    17:32:15 [DEBUG] (kmere-heuristic) The current connected component is so large that the k-mere heuristic can be used. First: Testing if a normal split would result in a good partition (|.|>20%) of the CC.
+     [WARNING] (kmere-heuristic) A normal split would NOT result in a good partition (|.|>20%) of the CC, therefore  the k-mere heuristic is now used. The current connected component will be split in 3.85373 (= number of proteins <6929> / ( n
+    odes per species <1> * number of species <1798>)) groups greedily accordingly to the fiedler vector.
+    ...
 
-<p></details></p>
+</details>
 
-<h1 id="creditwherecreditisdue">Credit where credit is due</h1>
+<details>
+<summary>example for large graphs, where kmere is tested but not needed (Click to expand)</summary>
 
-<ul>
-<li>The all-versus-all BLAST-analysis (-step=2) is only possible with (one of) the following underlying algorithms:
+    ...
+    20:27:07 [DEBUG] (kmere-heuristic) The current connected component is so large that the k-mere heuristic can be used. First: Testing if a normal split would result in a good partition (|.|>20%) of the CC.
+    20:27:09 [DEBUG] (kmere-heuristic) A normal split would result in a good partition (|.|>20%) of the CC, therefore returning now to the normal algorithm (no k-mere heuristic).
+    ...
 
+</details>
 
+<h1 id="credit-where-credit-is-due">Credit where credit is due</h1>
 <ul>
-<li>NCBI BLAST+ or NCBI BLAST legacy (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE<em>TYPE=BlastDocs&DOC</em>TYPE=Download)</li>
-
-<li>Diamond (doi:10.1038/nmeth.3176, https://github.com/bbuchfink/diamond)</li>
-
-<li>Last (doi:10.1101/gr.113985.110, http://last.cbrc.jp/)</li>
-
-<li>Rapsearch2 (doi:10.1093/bioinformatics/btr595, https://github.com/zhaoyanswill/RAPSearch2)</li>
-
-<li>Topaz (doi:10.1186/s12859-018-2290-3, https://github.com/ajm/topaz)</li>
-
-<li>usearch,ublast (doi:10.1093/bioinformatics/btq461, https://www.drive5.com/usearch/download.html)</li>
-
-<li>blat (http://hgdownload.soe.ucsc.edu/admin/)</li>
-
-<li>mmseqs2 (doi:10.1038/nbt.3988 (2017). https://github.com/soedinglab/MMseqs2)</li></ul>
+<li>The all-versus-all BLAST-analysis (-step=2) is only possible with (one of) the following underlying algorithms:<ul>
+<li>NCBI BLAST+ or NCBI BLAST legacy (<a href="https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download">https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download</a>)</li>
+<li>Diamond (doi:10.1038/nmeth.3176, <a href="https://github.com/bbuchfink/diamond">https://github.com/bbuchfink/diamond</a>)</li>
+<li>Last (doi:10.1101/gr.113985.110, <a href="http://last.cbrc.jp/">http://last.cbrc.jp/</a>)</li>
+<li>Rapsearch2 (doi:10.1093/bioinformatics/btr595, <a href="https://github.com/zhaoyanswill/RAPSearch2">https://github.com/zhaoyanswill/RAPSearch2</a>)</li>
+<li>Topaz (doi:10.1186/s12859-018-2290-3, <a href="https://github.com/ajm/topaz">https://github.com/ajm/topaz</a>)</li>
+<li>usearch,ublast (doi:10.1093/bioinformatics/btq461, <a href="https://www.drive5.com/usearch/download.html">https://www.drive5.com/usearch/download.html</a>)</li>
+<li>blat (<a href="http://hgdownload.soe.ucsc.edu/admin/">http://hgdownload.soe.ucsc.edu/admin/</a>)</li>
+<li>mmseqs2 (doi:10.1038/nbt.3988 (2017). <a href="https://github.com/soedinglab/MMseqs2">https://github.com/soedinglab/MMseqs2</a>)</li>
+</ul>
 </li>
-
-<li>The clustering step (-step=3) got a huge speedup with the integration of LAPACK (Univ. of Tennessee; Univ. of California, Berkeley; Univ. of Colorado Denver; and NAG Ltd., http://www.netlib.org/lapack/)</li>
-
-<li>The html output of the *proteinortho.tsv (orthology groups) is enhanced by clusterize (https://github.com/NeXTs/Clusterize.js), reducing the scroll lag.</li>
+<li>The clustering step (-step=3) got a huge speedup with the integration of LAPACK (Univ. of Tennessee; Univ. of California, Berkeley; Univ. of Colorado Denver; and NAG Ltd., <a href="http://www.netlib.org/lapack/">http://www.netlib.org/lapack/</a>)</li>
+<li>The html output of the *proteinortho.tsv (orthology groups) is enhanced by clusterize (<a href="https://github.com/NeXTs/Clusterize.js">https://github.com/NeXTs/Clusterize.js</a>), reducing the scroll lag.</li>
 </ul>
-
-<h1 id="onlineinformation">ONLINE INFORMATION</h1>
-
-<p>For download and online information, see
+<h1 id="online-information">ONLINE INFORMATION</h1>
+<p>  For download and online information, see
   <a href="https://www.bioinf.uni-leipzig.de/Software/proteinortho/">https://www.bioinf.uni-leipzig.de/Software/proteinortho/</a>
   or
   <a href="https://gitlab.com/paulklemm_PHD/proteinortho">https://gitlab.com/paulklemm_PHD/proteinortho</a></p>
-
 <h1 id="references">REFERENCES</h1>
-
-<p>Lechner, M., Findeisz, S., Steiner, L., Marz, M., Stadler, P. F., &
+<p>  Lechner, M., Findeisz, S., Steiner, L., Marz, M., Stadler, P. F., &
   Prohaska, S. J. (2011). Proteinortho: detection of (co-) orthologs in
-  large-scale analysis. BMC bioinformatics, 12(1), 124.</p>
\ No newline at end of file
+  large-scale analysis. BMC bioinformatics, 12(1), 124.</p>
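
Note on the -jobs=M/N scheme described in the manual.html hunk above: it can be sketched as a small helper that lists one command per chunk. This is only an illustration. `job_commands` is a hypothetical name, and the command strings simply mirror the manual's own example (including its `-steps=2` spelling); nothing here is part of Proteinortho itself.

```python
def job_commands(n_jobs, extra_args=""):
    """Return one 'proteinortho6.pl' command line per job division M of N."""
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    commands = []
    for m in range(1, n_jobs + 1):  # divisions are numbered 1..N
        parts = ["proteinortho6.pl", "-steps=2", f"-jobs={m}/{n_jobs}"]
        if extra_args:
            parts.append(extra_args)  # e.g. project name and input files
        commands.append(" ".join(parts))
    return commands


for command in job_commands(4):
    print(command)
```

Each printed line could then be submitted to a different machine once the indices from the first step exist.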

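Note on the myproject.proteinortho.tsv layout described in the manual.html hunk above: the column meanings (species count, gene count, algebraic connectivity, then one column per input file) can be illustrated with a short reader. This is a minimal sketch, assuming tab-separated columns and `*` for a species without a member, as in the manual's sample rows. `parse_groups` and the dictionary keys are hypothetical names chosen for this example.

```python
import csv
import io

# Sample rows taken from the manual's example table (tab separated).
SAMPLE = (
    "# Species\tGenes\tAlg.-Conn.\tC.faa\tC2.faa\tE.faa\tL.faa\tM.faa\n"
    "2\t5\t0.16\t*\t*\t*\tL_643,L_641\tM_649,M_640,M_642\n"
    "3\t6\t0.138\tC_164,C_166,C_167,C_2\t*\t*\tL_2\tM_2\n"
    "2\t4\t0.489\t*\t*\t*\tL_645,L_647\tM_644,M_646\n"
)


def parse_groups(handle):
    """Parse orthology groups from a proteinortho.tsv-style table."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)        # first line names the columns
    species_files = header[3:]   # columns 4.. are the input files
    groups = []
    for row in reader:
        if not row or row[0].startswith("#"):
            continue             # skip blank and further comment lines
        members = {
            name: cell.split(",")
            for name, cell in zip(species_files, row[3:])
            if cell != "*"       # '*' marks a species without a member
        }
        groups.append({
            "species": int(row[0]),
            "genes": int(row[1]),
            "connectivity": float(row[2]),
            "members": members,
        })
    return groups


groups = parse_groups(io.StringIO(SAMPLE))
# More genes than species means co-orthologs are present in the group.
co_orthologs = [g for g in groups if g["genes"] > g["species"]]
print(len(groups), "groups,", len(co_orthologs), "with co-orthologs")
```

Reading from a real file would only require passing an open file handle instead of the `io.StringIO` wrapper.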

=====================================
proteinortho6.pl
=====================================
@@ -118,6 +118,11 @@ ncbi -> if the word 'isoform' is found
 uniprot -> 'Isoform of XYZ' (You need to add the *_additional.fasta files to the analysis)
 trinity -> using '_iX' suffix
 
+All isoforms are united to a single entity and treated as one. 
+Extracting a group with an isoform will result in all isoforms.
+For more information have a look at: 
+https://gitlab.com/paulklemm_PHD/proteinortho/-/wikis/FAQ#how-does-the-isoform-work
+
 =back
 
 =head2 Search options (step 1-2)
@@ -158,7 +163,11 @@ apply selfblast, detects paralogs without orthologs
 
 =item B<--sim>=float (default: 0.95)
 
-min. similarity for additional hits
+The -sim relaxes the one best reciprocal hit to all adaptive reciprocal best hits. 
+E.g. for -sim=0.9 all reciprocal hits within 90% of the highest bitscore are returned. 
+This can reduce false negatives at the cost of possibly increasing false positives too.
+-sim=1 will result in only the best reciprocal hits. In most cases this corresponds to one-to-one orthologs.
+-sim=0 will result in all reciprocal hits within the evalue threshold.
 
 =item B<--identity>=number (default: 25)
 
@@ -452,11 +461,11 @@ Lechner, M., Findeisz, S., Steiner, L., Marz, M., Stadler, P. F., & Prohaska, S.
 ##########################################################################################
 use strict;
 use warnings "all";
+use threads;
+use threads::shared;
 use File::Basename;
 use Cwd;
 use Cwd 'abs_path';
-use threads;
-use threads::shared;
 use Thread::Queue;
 use Pod::Usage; # --man
 use POSIX;
@@ -464,7 +473,7 @@ use POSIX;
 ##########################################################################################
 # Variables
 ##########################################################################################
-our $version = "6.0.28";
+our $version = "6.0.30";
 our $step = 0;    # 0/1/2/3 -> do all / only apply step 1 / only apply step 2 / only apply step 3
 our $verbose = 1; # 0/1   -> don't / be verbose
 our $debug = 0;   # 0/1   -> don't / show debug data
@@ -516,6 +525,8 @@ our $syn_lock :shared;
 our $all_jobs_submitted :shared = 0;
 our $po_path = "";
 our $run_id = "";
+our $hmm_search_evalue = "1e-100";
+our $hmm_minprots = 15; 
 our %gene_counter;    # Holds the number of genes for each data file (for sorting)
 our %max_gene_length_diamond;    # Holds the maximum length of genes for each data file (for diamond -> -sensitive option)
 our $threads_per_process :shared = 1; # Number of subthreads for blast
@@ -539,6 +550,7 @@ our $BLUE="\033[1;36m";
 our $ORANGE="\033[1;33m";
 our $NC="\033[0m"; # No Color
 our %isoform_mapping;
+our %isoform_mapping_best;
 
 my $tput=`tput color 2>/dev/null`; # test if the shell supports colors
 $tput=~s/[\r\n]+$//;
@@ -556,7 +568,7 @@ our @files = ();
 our @files_cleanup = ();
 our %files_map;
 foreach my $option (@ARGV) {
-  if ($option =~ m/^--?step=(0|1|2|3)$/)      { $step = $1;   }
+  if ($option =~ m/^--?step=(0|1|2|3|4)$/)      { $step = $1;   }
   elsif ($option =~ m/^--?verbose$/)      { $verbose = 1;  }
   elsif ($option =~ m/^--?verbose=([012])$/)      { $verbose = $1;  }
   elsif ($option =~ m/^--?silent$/)      { $verbose = 0;  }
@@ -587,6 +599,9 @@ foreach my $option (@ARGV) {
   elsif ($option =~ m/^--?cpus=(\d*)$/)       { $cpus = $1; }
   elsif ($option =~ m/^--?cpus=auto$/)          { $cpus = 0; }
   elsif ($option =~ m/^--?alpha=([0-9\.]+)$/)     { $alpha = $1; }
+  elsif ($option =~ m/^--?hmm_?evalue=([0-9\.eE+-]+)$/)     { $hmm_search_evalue = $1; }
+  elsif ($option =~ m/^--?hmm_?e=([0-9\.eE+-]+)$/)     { $hmm_search_evalue = $1; }
+  elsif ($option =~ m/^--?hmm_?minprots?=([0-9]+)$/)     { $hmm_minprots = $1; }
   elsif ($option =~ m/^--?purity=([0-9\.]+)$/)      { $purity = $1; }
   elsif ($option =~ m/^--?report=([0-9]+)$/)      { $report = $1; }
   elsif ($option =~ m/^--?minspecies=([0-9.]+)$/)       { if($1>=0){$minspecies = $1;}else{&Error("the argument -minspecies=$1 is invalid.$RED minspecies needs to be >=0!$NC\nminspecies: the min. number of genes per species. If a group is found with up to (minspecies) genes/species, it wont be split again (regardless of the connectivity).");} }
@@ -637,6 +652,7 @@ foreach my $option (@ARGV) {
   else  {&print_usage(); &reset_locale();die "$RED"."[Error]$NC $ORANGE Invalid command line option: \'$option\'! $NC\n\n"; }
 }
 
+
 if($jobnumber != -1 && $tmp_path eq "" ){print STDERR "Please consider to use the -tmp option in combination with -jobs if you use proteinortho on a computing cluster system. Usually these clusters have dedicated local scratch directories.\n";}
 if($selfblast){$checkblast=1;}
 
@@ -709,7 +725,6 @@ if($gccversion_main eq "" || $gccversion_main < 5){
   $ompprocbind="true";
 }
 
-
 our $simgraph = "$project.blast-graph$run_id";    # Output file graph
 our $syngraph = "$project.ffadj-graph$run_id";    # Output file synteny
 our $csimgraph = "$project.proteinortho-graph";   # Output file graph
@@ -736,7 +751,7 @@ sub get_parameter{
   return "Parameter-vector : (",'version',"=$version",",",'step',"=$step",",",'verbose',"=$verbose",",",'debug',"=$debug",",",'exactstep3',"=$exactstep3",",",'synteny',"=$synteny",",",'duplication',"=$duplication",",",'cs',"=$cs",",",'alpha',"=$alpha",",",'connectivity',"=$connectivity",",",'cpus',"=$cpus",",",'evalue',"=$evalue",",",'purity',"=$purity",",",'coverage',"=$coverage",",",'identity',"=$identity",",",'blastmode',"=$blastmode",",",'sim',"=$sim",",",'report',"=$report",",",'keep',"=$keep",",",'force',"=$force",",",'selfblast',"=$selfblast",",",'twilight',"=$twilight",",",'singles',"=$singles",",",'clean',"=$clean",",",'blastOptions',"=$blastOptions",",",'nograph',"=$nograph",",",'xml',"=$doxml",",",'desc',"=$desc",",",'tmp_path',"=$tmp_path,",'blastversion',"=$blastversion",",",'binpath',"=$binpath",",",'makedb',"=$makedb",",",'blast',"=$blast",",",'jobs_todo',"=$jobs_todo",",",'project',"=$project",",",'po_path',"=$po_path",",",'run_id',"=$run_id",",",'threads_per_process',"=$threads_per_process",",","useMcl","=$useMcl",',freemem',"=$freemem_inMB",")\n";
 }
 
-if (-e "$project.proteinortho.tsv" && $step!=1 && ( ( scalar(@files) > 2 && $step!=3 ) || $step==3 ) ) {
+if (-e "$project.proteinortho.tsv" && $step!=1 && $step!=4 && ( ( scalar(@files) > 2 && $step!=3 ) || $step==3 ) ) {
   print STDERR "!!!$ORANGE\n[Warning]:$NC Data files for project '$project' already exists. Previous output files might be overwritten.\n$NC";
   print STDERR "Press 'strg+c' to prevent me from proceeding or wait 10 seconds to continue...\n!!!\n";
     sleep 10;
@@ -744,7 +759,7 @@ if (-e "$project.proteinortho.tsv" && $step!=1 && ( ( scalar(@files) > 2 && $ste
 }
 
 # if step=1 -> generate blast db, no need for unlinking results
-if($step!=1){
+if($step!=1 && $step!=4){
   if($step!=3){ # if step 3 then dont destroy the blast results ...
     unlink($simgraph);
     unlink($syngraph);
@@ -837,7 +852,7 @@ if($isoform eq "uniprot"){
 }
 
 # Always do
-if($step < 3){ # don't check for step=3=clustering (not necessary) and if step=2, then files are allready checked with step=1
+if($step < 3 || $step==4){ # don't check for step=3=clustering (not necessary) and if step=2, then files are allready checked with step=1
   &check_files;   # Check files, count sequences
 }
 @files = ();
@@ -882,7 +897,7 @@ if ($step == 0 || $step == 2) {
 
   # check if database exists for inputs:
   foreach my $file (keys %gene_counter) {
-    if($blastmode !~ m/blat/ && `ls '${file}'.${blastmode}* 1>/dev/null 2>&1` ne ""){ # test here
+    if($blastmode !~ m/blat/ && `ls '${file}'.${blastmode}* 2>/dev/null` ne ""){ # test here
       my $a = "$tmp_path/DB/".basename($file);
       if(`ls '${a}'.${blastmode}* 1>/dev/null 2>&1` ne ""){                           # or here
         &Error("I did not find the database for ".$_[0]." (".$a."). Did you run --step=1 and maybe removed the databases? Please rerun 'proteinortho --step=1 --force /path/to/fastas' such that the databases can be recreated and then proceed with -step=2 and -step=3.");
@@ -900,6 +915,9 @@ if ($step == 0 || $step == 2) {
   &run_blast;   # Run blasts
 }
 
+undef %max_gene_length_diamond;
+%max_gene_length_diamond=();
+
 # Check for duplicated edges if --unique or --checkblast
 if($checkblast == 1){
 
@@ -942,8 +960,22 @@ if($checkblast == 1){
 if ($step == 0 || $step == 3) {
   if($verbose){print STDERR "\n$GREEN**Step 3**$NC\n";}
   &cluster;             # form clusters and write outputs
-  if($verbose){print STDERR "\n$GREEN"."All finished.$NC\n";}
 }
+
+# Step 4
+our $hmm_QUEUE = Thread::Queue->new();    # A new empty queue
+our $hmm_QUEUE_maxNumEl=0;
+our $hmm_proteinorthofile = $synteny ? $syntable : $simtable;
+our $hmm_header="";
+our @hmm_filenames;
+our %hmm_filenames2colid;
+our $hmm_num_cols = 0;
+
+if ($step == 4) {  
+  if($verbose){print STDERR "\n$GREEN**Step 4**$NC\n";}
+  &hmmenriched;             
+}
+
 if ($clean) {&clean;}           # remove blast indices
 
 if ($nograph) {
@@ -956,18 +988,178 @@ if (!$keep && $tmp_path =~ m/\/proteinortho_cache_[^\/]+\d*\/$/ && $step!=1){sys
 
 foreach my $file (@files_cleanup){unlink($file);}
 
+if($verbose){print STDERR "\n$GREEN"."All finished.$NC\n";}
+
 ##########################################################################################
 # Functions
 ##########################################################################################
 
 sub clean {
-  print STDERR "Removing temporary files...\n";
+  if($verbose){ print STDERR "Removing temporary files...\n" }
   foreach my $file (@files) {
     system("rm $file.$blastmode"."*");
   }
+}
+
+sub workertest {
+  local $SIG{KILL} = sub { threads->exit };
+  my $tid = threads->tid();
+
+  while(my $job = $hmm_QUEUE->dequeue()){
+    print STDERR "$tid - -> start\n";
+    system("sleep 2");
+    print STDERR "$tid -> end\n";
+  }
+}
+
+sub hmmenriched {
+
+  if(!-e "$tmp_path/proteinortho_grab_proteins.$hmm_proteinorthofile.done"){
+    system("proteinortho_grab_proteins.pl -cpus=$cpus -minprots=$hmm_minprots -ignoreWarning -tofiles='$tmp_path/' $hmm_proteinorthofile ".join(" ",@files)."; touch $tmp_path/proteinortho_grab_proteins.$hmm_proteinorthofile.done");
+    if ($? != 0) { &Error("'proteinortho_grab_proteins.pl' failed with code $?.$NC (Please visit https://gitlab.com/paulklemm_PHD/proteinortho/wikis/Error%20Code).") }
+  }
+ 
+  my %orthogroupfiles;
+  foreach my $fp (glob("$tmp_path/*OrthoGroup*.fasta")) { if($fp=~m/OrthoGroup([0-9]+)\.fasta/){ $orthogroupfiles{$1}=$fp }}
+
+  my $i = -1;
+  open(my $FH,"<$hmm_proteinorthofile");
+  while(<$FH>){
+    chomp;
+    my $line=$_;
+    if(/^#/){
+      #print RESULTFH "$line\n";
+      $hmm_header .= $line;
+      @hmm_filenames = split("\t",$line);
+      for (my $i = 0; $i < scalar @hmm_filenames; $i++) { 
+        my $base_file = $hmm_filenames[$i];
+        $base_file=~s/\.fasta|\.faa|\.fna//g;
+        $hmm_filenames2colid{$base_file}=$i 
+      }
+      next
+    }
 
+    $i=$i+1;
+    my @a=split("\t",$line);
+
+    if($hmm_num_cols==0){ $hmm_num_cols=scalar @a }
+    
+    my $groupfile = "";
+    if( exists $orthogroupfiles{$i} && $a[1] >= $hmm_minprots ){ $groupfile = basename($orthogroupfiles{$i}); }
+    $hmm_QUEUE->enqueue($groupfile."$;$line$;$i");
+  }
+  close($FH);
+
+  for (my $i = 0; $i < $cpus; $i++) { $hmm_QUEUE->enqueue(undef) }
+  $hmm_QUEUE_maxNumEl = $hmm_QUEUE->pending;
+
+  for (my $i = 0; $i < $cpus; $i++) { threads->create('hmmWorker') } # spawn a thread for each core
+
+  system("head -n1 $hmm_proteinorthofile >$hmm_proteinorthofile.enriched");
+
+  open(my $FH_RESULT,">>$hmm_proteinorthofile.enriched");
+  my %hmm_has_protein;
+  foreach (threads->list()) {
+    $_->join(); # await all threads
+    my $tid=$_->tid();
+    open(my $FH,"<$tmp_path/tid.$tid.out");
+    while(<$FH>){
+      chomp;
+      my $line=$_;
+      my @a = split("\t",$line);
+      my $species=0;
+      my $prots=0;
+      for (my $j = 3; $j < scalar @a; $j++) {
+        my @newel;
+        foreach (split(",",$a[$j])){
+          if(!exists $hmm_has_protein{$_}){
+            $hmm_has_protein{$_}=();
+            push(@newel,$_);
+          }
+        }
+        $a[$j] = join(",",@newel);
+        $prots += scalar @newel;
+        if($a[$j] eq ""){$a[$j]="*"}else{$species++}
+      }
+      $a[0]=$species;
+      $a[1]=$prots;
+      if($species>0){print $FH_RESULT join("\t",@a)."\n"}
+    }
+    close($FH);
+  }
+  close($FH_RESULT);
+}
+
+sub hmmWorker {
+  local $SIG{KILL} = sub { threads->exit };
+
+  my $tid = threads->tid();
+  my %has_protein;
+
+  open(my $FHret,">$tmp_path/tid.$tid.out");
+  while(defined(my $job = $hmm_QUEUE->dequeue())){
+
+    $job = "$tmp_path/$job";
+    if ($verbose){ print STDERR "\r[tid=$tid] -step=4 in progress : ".(int(10000*(1-$hmm_QUEUE->pending/$hmm_QUEUE_maxNumEl))/100)."%" }
+
+    my ($groupfile,$line,$i) = split("$;",$job);
+    my @tmp_hmm_cols = split("\t",$line);
+    if($tmp_hmm_cols[1] < $hmm_minprots){ print $FHret join("\t",@tmp_hmm_cols)."\n"; next }
+
+    @has_protein{ split(",",join(",", @tmp_hmm_cols[3 .. $#tmp_hmm_cols ])) }=();
+
+    if($verbose>1) {print STDERR "aligning grouped sequences\n";}
+
+    if(!-e "$groupfile.aln"){
+      system("mafft --anysymbol --auto --thread 1 $groupfile 2>/dev/null | MSAtrimmer.pl -fasta -HminPairId=5% -norealn - > $groupfile.aln 2>/dev/null");
+      if ($? != 0) { &Error("'mafft' ($groupfile) failed with code $?.$NC (Please visit https://gitlab.com/paulklemm_PHD/proteinortho/wikis/Error%20Code).") }
+    }
+
+    if($verbose>1) {print STDERR "build hmm database\n";}
+
+    if(!-e "$groupfile.aln.hmm"){
+      system("hmmbuild $groupfile.aln.hmm $groupfile.aln >/dev/null 2>/dev/null");
+      if ($? != 0) { &Error("'hmmbuild' ($groupfile) failed with code $?.$NC (Please visit https://gitlab.com/paulklemm_PHD/proteinortho/wikis/Error%20Code).") }
+    }
+
+    if($verbose>1) {print STDERR "search hmm database against input sequences\n";}
+
+    my $newprots=0;
+    my $newspecies=0;
+
+    foreach my $file (@files){
+      my $base_file = basename($file);
+      $base_file=~s/\.fasta|\.faa|\.fna//g;
+      if(!exists $hmm_filenames2colid{$base_file}){print STDERR "$base_file invalid"; die;}
+      my $coli = $hmm_filenames2colid{$base_file};
+      
+      my $hmm_ret=`hmmsearch --cpu 1 -E $hmm_search_evalue $groupfile.aln.hmm $file 2>/dev/null | grep '\>\>'`;# |  2>/dev/null
+
+      foreach (split("\n",$hmm_ret)) {
+        $_=~s/^>> ?//;$_=~s/[ \t].*$//; chomp;
+        if(!exists $has_protein{$_}){
+          if( $tmp_hmm_cols[$coli] eq "" || $tmp_hmm_cols[$coli] eq "*" ){ $newspecies++;$tmp_hmm_cols[$coli]="" }else{ $_=",$_" }
+          $tmp_hmm_cols[$coli] .= $_; 
+          $newprots++;
+        }
+      }
+    }
+
+    if($verbose>1) {print STDERR "$tid evaluating results\n";}
+    
+    if($newprots>0){
+      $tmp_hmm_cols[0] += $newspecies;
+      $tmp_hmm_cols[1] += $newprots;
+      $tmp_hmm_cols[2] =  "NA";
+    }
+    if($tmp_hmm_cols[1]>=2 || ($singles && $tmp_hmm_cols[1]>=1)){ 
+      print $FHret join("\t",@tmp_hmm_cols)."\n";
+    }
+  }
+  close($FHret);
 }
 
+
 sub cluster {
 
   if(!$useMcl){
@@ -1012,10 +1204,14 @@ sub cluster {
 
     system ("OMP_PROC_BIND=$ompprocbind $po_path/proteinortho_clustering $cluster_verbose_level -minspecies $minspecies -ram ".$freemem_inMB." -kmere ".(1-$exactstep3)." -debug $debug -cpus $cpus -weighted 1 -conn $connectivity -purity $purity ".($clusterOptions ne "" ? "$clusterOptions" : "" )." -rmgraph '$rm_simgraph' '$simgraph'* >'$simtable' ".($verbose == 2 ? "" : "2>/dev/null"));
     if ($? != 0) {
-          &Error("'proteinortho_clustering' failed with code $?.$NC (Please visit https://gitlab.com/paulklemm_PHD/proteinortho/wikis/Error%20Code)\nMaybe your operating system does not support the statically compiled version, please try recompiling proteinortho with 'make clean' and 'make' (and 'make install PREFIX=...').");
-        }
-    }else{
-  my $stderrout = `$po_path/proteinortho_do_mcl.pl $cpus $simgraph*`;
+      if($verbose){print STDERR "$ORANGE"."[WARNING]$NC proteinortho_clustering failed. I will now retry without the OMP_PROC_BIND flag.$NC\n";} # minimum 5 MB
+      system ("$po_path/proteinortho_clustering $cluster_verbose_level -minspecies $minspecies -ram ".$freemem_inMB." -kmere ".(1-$exactstep3)." -debug $debug -cpus $cpus -weighted 1 -conn $connectivity -purity $purity ".($clusterOptions ne "" ? "$clusterOptions" : "" )." -rmgraph '$rm_simgraph' '$simgraph'* >'$simtable' ".($verbose == 2 ? "" : "2>/dev/null"));
+      if ($? != 0) {
+        &Error("'proteinortho_clustering' failed with code $?.$NC (Please visit https://gitlab.com/paulklemm_PHD/proteinortho/wikis/Error%20Code)\nMaybe your operating system does not support the statically compiled version, please try recompiling proteinortho with 'make clean' and 'make' (and 'make install PREFIX=...').");
+      }
+    }
+  }else{
+    my $stderrout = `$po_path/proteinortho_do_mcl.pl $cpus $simgraph*`;
     if ($? != 0) {
       print STDERR $stderrout;
       &Error("proteinortho_do_mcl.pl failed with code $?.$NC (Please visit https://gitlab.com/paulklemm_PHD/proteinortho/wikis/Error%20Code)\nDo you have a woring mcl version installed? (Try 'mcl' in the terminal) If not please install with e.g. 'apt-get install mcl'.");
@@ -1076,8 +1272,14 @@ sub cluster {
       if($verbose == 2){ $cluster_verbose_level = "-debug 1 "; }
 
       system ("OMP_PROC_BIND=$ompprocbind $po_path/proteinortho_clustering $cluster_verbose_level ".($clusterOptions ne "" ? "$clusterOptions" : "" )." -minspecies $minspecies -ram ".$freemem_inMB." -kmere ".(1-$exactstep3)." -debug $debug -cpus $cpus -weighted 1 -conn $connectivity -purity $purity -rmgraph '$rm_syngraph' '$syngraph'* >'$syntable' ".($verbose == 2 ? "" : "2>/dev/null"));
+      
+      if($verbose){print STDERR "$ORANGE"."[WARNING]$NC proteinortho_clustering failed. I will now retry without the OMP_PROC_BIND flag.$NC\n";} # minimum 5 MB
       if ($? != 0) {
-        &Error("proteinortho_clustering failed with code $?.$NC (Please visit https://gitlab.com/paulklemm_PHD/proteinortho/wikis/Error%20Code)\nDid you use a static version? Maybe your operating system does not support the static compiled version, please recompile 'make clean' and 'make' or 'make USEPRECOMPILEDLAPACK=FALSE'.");
+        if($verbose){print STDERR "$ORANGE"."[WARNING]$NC proteinortho_clustering failed. I will now retry without the OMP_PROC_BIND flag.$NC\n";} # minimum 5 MB
+        system ("$po_path/proteinortho_clustering $cluster_verbose_level ".($clusterOptions ne "" ? "$clusterOptions" : "" )." -minspecies $minspecies -ram ".$freemem_inMB." -kmere ".(1-$exactstep3)." -debug $debug -cpus $cpus -weighted 1 -conn $connectivity -purity $purity -rmgraph '$rm_syngraph' '$syngraph'* >'$syntable' ".($verbose == 2 ? "" : "2>/dev/null"));
+        if ($? != 0) {
+          &Error("proteinortho_clustering failed with code $?.$NC (Please visit https://gitlab.com/paulklemm_PHD/proteinortho/wikis/Error%20Code)\nDid you use a static version? Maybe your operating system does not support the static compiled version, please recompile 'make clean' and 'make' or 'make USEPRECOMPILEDLAPACK=FALSE'.");
+        }
       }
     }else{
       system($po_path.'/proteinortho_do_mcl.pl '.$cpus.' '.$syngraph.'*');
@@ -1149,7 +1351,7 @@ Options:
                       2 -> run blast (and ff-adj, if -synteny is set)
                       3 -> clustering
                       0 -> all (default)
-         -isoform=    Enables the isoform processing:
+         -isoform=    Enables the isoform merging:
                       ncbi -> if the word 'isoform' is found 
                       uniprot -> 'Isoform of XYZ' (You need to add the *_additional.fasta files to the analysis)
                       trinity -> using '_iX' suffix
@@ -1161,7 +1363,10 @@ Options:
                       The suffix '_legacy' indicates legacy blastall (otherwise blast+ is used).
          -checkfasta  Checks if the given fasta files are compatible with the algorithm of -p
          -e=          E-value for blast [default: 1e-05]
-
+         -sim=        min. reciprocal similarity for additional hits (0..1) [default: 0.95]
+                      1 : only the best reciprocal hits are reported
+                      0 : all possible reciprocal blast matches (within the -evalue) are reported
+                      
          [Synteny options]
          -synteny     activate PoFF extension to separate similar sequences print
                       by contextual adjacencies (requires .gff for each .fasta)
@@ -1171,7 +1376,7 @@ Options:
          [Clustering options]
          -singles     report singleton genes without any hit
          -conn=       min. algebraic connectivity [default: 0.1]
-         -xml        produces an OrthoXML formatted file of the *.proteinortho.
+         -xml         produces an OrthoXML formatted file of the *.proteinortho.
 
          (...) show more with --help
 
@@ -1204,10 +1409,12 @@ Options:
                       2 -> run blast (and ff-adj, if -synteny is set)
                       3 -> clustering
                       0 -> all (default)
-         -isoform     Enables the isoform processing:
+         -isoform=    Enables the isoform merging. All isoforms are united to a single entity and treated as one. Extracting a group with an isoform will result in all isoforms.
                       ncbi -> if the word 'isoform' is found 
                       uniprot -> 'Isoform of XYZ' (You need to add the *_additional.fasta files to the analysis)
                       trinity -> using '_iX' suffix
+                      For more information have a look at: 
+                        https://gitlab.com/paulklemm_PHD/proteinortho/-/wikis/FAQ#how-does-the-isoform-work
 
          [Search options]
          -p=          blast program [default: diamond]
@@ -1233,7 +1440,9 @@ Options:
          -checkfasta  Checks if the given fasta files are compatible with the algorithm of -p
          -e=          E-value for blast [default: 1e-05]
          -selfblast   apply selfblast, detects paralogs without orthologs
-         -sim=        min. similarity for additional hits (0..1) [default: 0.95]
+         -sim=        min. reciprocal similarity for additional hits (0..1) [default: 0.95]
+                      1 : only the best reciprocal hits are reported
+                      0 : all possible reciprocal blast matches (within the -evalue) are reported
          -identity=   min. percent identity of best blast hits [default: 25]
          -cov=        min. coverage of best blast alignments in % [default: 50]
          -subparaBlast=    additional parameters for the search tool
@@ -1440,7 +1649,7 @@ sub workerthread {
     else {
       %lengths = %{&get_gene_lengths($file_i,$file_j)};
     }
-    my %reciprocal_matches = %{&match(\%lengths,$file_i,$file_j,$result_ij,$result_ji)};
+    my %reciprocal_matches = %{&match(\%lengths,$file_j,$file_i,$result_ij,$result_ji)};
 
     # Remove secondary hits if better exist (test here instead of later)
     %reciprocal_matches = %{&adaptive_best_blast_matches(\%reciprocal_matches)};
@@ -1509,9 +1718,15 @@ sub workerthread {
           }
           # Remap to full ID
           my $a = $track{$file_i.$i};
-          if (defined($full_id_map_i{$a})) {$a = $full_id_map_i{$a};}
           my $b = $track{$file_j.$j};
+          
+          #if($isoform ne "" && exists $isoform_mapping_best{$file_j." ".$a." vs ".$file_i." ".$b}){
+          #  ($a,$b)=split(" ",$isoform_mapping_best{$file_j." ".$a." vs ".$file_i." ".$b});
+          #} 
+
+          if (defined($full_id_map_i{$a})) {$a = $full_id_map_i{$a};}
           if (defined($full_id_map_j{$b})) {$b = $full_id_map_j{$b};}
+
           # Store
           $synteny{"$b $a"} = $score;
 
@@ -1567,7 +1782,19 @@ sub workerthread {
       open(GRAPH,">>$simgraph") || &Error("Could not open file '$simgraph': $!");
       print GRAPH "# $short_file_j\t$short_file_i\n# ".median(@arr0)."\t".median(@arr1)."\t".median(@arr2)."\t".median(@arr3)."\n";
       foreach (sort keys %reciprocal_matches) {
-        my $line = "$_ ".$reciprocal_matches{$_};
+        my ($a,$b) = split(" ",$_);
+
+        #if($isoform ne "" && exists $isoform_mapping_best{$file_j." ".$a." vs ".$file_i." ".$b}){
+        #  ($a,$b)=split(" ",$isoform_mapping_best{$file_j." ".$a." vs ".$file_i." ".$b});
+        #} 
+
+# if($a =~ /137_TRINITY_DN39932_c6_g1/){print STDERR "RECIPROCAL $_\n"}
+# elsif($b =~ /137_TRINITY_DN39932_c6_g1/){print STDERR "RECIPROCAL $_\n"}
+# elsif($a =~ /138_TRINITY_DN47563_c4_g1/){print STDERR "RECIPROCAL $_\n"}
+# elsif($b =~ /138_TRINITY_DN47563_c4_g1/){print STDERR "RECIPROCAL $_\n"}
+
+        my $line = "$a $b ".$reciprocal_matches{$_};
+
         $line =~ s/ /\t/g;
         print GRAPH "$line\n";
       }
@@ -1819,7 +2046,7 @@ sub match {
   my @j = @{(shift)};
 
   my %legal_i = &get_legal_matches(\%length,$file_i,$file_j,@i);
-  my %legal_j = &get_legal_matches(\%length,$file_i,$file_j,@j);
+  my %legal_j = &get_legal_matches(\%length,$file_j,$file_i,@j);
 
   return &get_reciprocal_matches(\%legal_i,\%legal_j);
 }
@@ -1861,7 +2088,7 @@ sub get_legal_matches {
 
     ## Check for criteria
     # Well formatted
-    if (!defined($local_bitscore))              {next;}
+    if (!defined($local_bitscore)){next;}
     if ($evalue < $local_evalue) {next;} # 5.17, post filter e-value
 
     # Percent identity
@@ -1870,35 +2097,66 @@ sub get_legal_matches {
     # Min. length
     if ($blastmode eq "tblastx+" || $blastmode eq "tblastx") {$alignment_length *= 3;}
 
-    if($blastmode eq "autoblast"){
-      if($file_i eq "nucl" && $file_j eq "nucl"){
-        $alignment_length /= 3;
-      }
-      # if i is nucl and j is prot -> then the alignment length is in aa length (all nucl lengths are allready /3) -> ok
-    }
+    #if($blastmode eq "autoblast"){
+    #  if($file_i eq "nucl" && $file_j eq "nucl"){
+    #    $alignment_length /= 3;
+    #  }
+    #  # if i is nucl and j is prot -> then the alignment length is in aa length (all nucl lengths are allready /3) -> ok
+    #}
 
     if ($alignment_length < $length{$query_id}*($coverage/100)+0.5)     {if ($debug>1) {print STDERR "!$query_id -> $subject_id is removed because of coverage threshold\n";} next;}
     if ($alignment_length < $length{$subject_id}*($coverage/100)+0.5)     {if ($debug>1) {print STDERR "!$query_id -> $subject_id is removed because of coverage threshold\n";} next;}
 
     if($isoform ne ""){
-      if(exists $isoform_mapping{$query_id} ){ $query_id=$isoform_mapping{$query_id}; }
-      if(exists $isoform_mapping{$subject_id} ){ $subject_id=$isoform_mapping{$subject_id}; }
-    }
-   
-    # It hit itself (only during selfblast)
-    # if ($selfblast && $query_id eq $subject_id)           {next;} # 5.16 reuse IDs
-    ## Listing them in the graph is okay, clustering will ignore them
 
-    # Similar hits? Take the better one
-    if (defined($result{"$query_id $subject_id"})) {
-      my ($remote_evalue, $remote_bitscore) = split(" ",$result{"$query_id $subject_id"});
-      if ($local_evalue > $remote_evalue) {next;}
-      if ($local_bitscore < $remote_bitscore) {next;}
-    }
+      my $query_id_iso=$query_id;
+      my $subject_id_iso=$subject_id;
+      if(exists $isoform_mapping{$file_i." ".$query_id} ){   $query_id_iso   = $isoform_mapping{$file_i." ".$query_id} }
+      if(exists $isoform_mapping{$file_j." ".$subject_id} ){ $subject_id_iso = $isoform_mapping{$file_j." ".$subject_id} }
+
+      # It hit itself (only during selfblast)
+      # if ($selfblast && $query_id eq $subject_id)           {next;} # 5.16 reuse IDs
+      ## Listing them in the graph is okay, clustering will ignore them
 
-    # Store data
-    if ($debug>1) {print STDERR "!$query_id -> $subject_id ($local_evalue,$local_bitscore)\n";}
-    $result{"$query_id $subject_id"} = "$local_evalue $local_bitscore";
+      # Similar hits? Take the better one
+      if (defined($result{"$query_id_iso $subject_id_iso"})) {
+        my ($remote_evalue, $remote_bitscore) = split(" ",$result{"$query_id_iso $subject_id_iso"});
+        if ($local_evalue > $remote_evalue) {next;}
+        if ($local_bitscore < $remote_bitscore) {next;}
+      }
+
+      # save the best isoform id (e.g. for trinity the query_id_iso contains the id without _iXX, this mapping resolves the missing suffix)
+      #$isoform_mapping_best{$file_i." ".$query_id_iso." vs ".$file_j." ".$subject_id_iso}   = $query_id." ". $subject_id;
+
+# if($query_id =~ /137_TRINITY_DN39932_c6_g1/){print STDERR "$_\n"}
+# elsif($subject_id =~ /137_TRINITY_DN39932_c6_g1/){print STDERR "$_\n"}
+# elsif($query_id =~ /138_TRINITY_DN47563_c4_g1/){print STDERR "$_\n"}
+# elsif($subject_id =~ /138_TRINITY_DN47563_c4_g1/){print STDERR "$_\n"}
+
+      #print STDERR "DEBUG:: $query_id_iso --> $query_id\n";
+
+      # Store data
+      if ($debug>1) {print STDERR "!$query_id -> $subject_id ($local_evalue,$local_bitscore)\n";}
+      $result{"$query_id_iso $subject_id_iso"} = "$local_evalue $local_bitscore";
+
+    }else{
+
+      # It hit itself (only during selfblast)
+      # if ($selfblast && $query_id eq $subject_id)           {next;} # 5.16 reuse IDs
+      ## Listing them in the graph is okay, clustering will ignore them
+
+      # Similar hits? Take the better one
+      if (defined($result{"$query_id $subject_id"})) {
+        my ($remote_evalue, $remote_bitscore) = split(" ",$result{"$query_id $subject_id"});
+        if ($local_evalue > $remote_evalue) {next;}
+        if ($local_bitscore < $remote_bitscore) {next;}
+      }
+
+      # Store data
+      if ($debug>1) {print STDERR "!$query_id -> $subject_id ($local_evalue,$local_bitscore)\n";}
+      $result{"$query_id $subject_id"} = "$local_evalue $local_bitscore";
+    }
+   
   }
 
   return %result;
@@ -1936,9 +2194,7 @@ sub generate_indices {
   if($verbose){print STDERR "Generating indices";if($force){print STDERR " anyway (forced).\n"}else{print STDERR ".\n";}}
   if ($blastmode eq "rapsearch") {
     foreach my $file (@_) {
-      #if ($file =~ /\s/) {print STDERR ("$ORANGE\n[WARNING]$NC : File name '$file' contains whitespaces. This might lead to undesired effects. If you encounter unusual behavior, please change the file name!$NC\n");}
-
-      if(!$force && `ls '${file}'.${blastmode}* 1>/dev/null 2>&1` ne ""){
+      if(!$force && `ls '${file}'.${blastmode}* 2>/dev/null` ne ""){
         if ($verbose) {print STDERR "The database for '$file' is present and will be used\n";}
       }else{
         if ($verbose) {print STDERR "Building database for '$file'\t(".$gene_counter{$file}." sequences)\n";}
@@ -1957,9 +2213,7 @@ sub generate_indices {
   }
   elsif ($blastmode eq "diamond" ) {
     foreach my $file (@_) {
-      #if ($file =~ /\s/) {print STDERR ("$ORANGE\n[WARNING]$NC : File name '$file' contains whitespaces. This might lead to undesired effects. If you encounter unusual behavior, please change the file name!$NC\n");}
-
-      if(!$force && `ls '${file}'.${blastmode}* 1>/dev/null 2>&1` ne ""){
+      if(!$force && `ls '${file}'.${blastmode}* 2>/dev/null` ne ""){
         if ($verbose) {print STDERR "The database for '$file' is present and will be used\n";}
       }else{
         if ($verbose) {print STDERR "Building database for '$file'\t(".$gene_counter{$file}." sequences)\n";}
@@ -1981,7 +2235,7 @@ sub generate_indices {
     	#if ($file =~ /\s/) {print STDERR ("$ORANGE\n[WARNING]$NC : File name '$file' contains whitespaces. This might lead to undesired effects. If you encounter unusual behavior, please change the file name!$NC\n");}
 
 
-      if(!$force && `ls '${file}'.${blastmode}* 1>/dev/null 2>&1` ne ""){
+      if(!$force && `ls '${file}'.${blastmode}* 2>/dev/null` ne ""){
         if ($verbose) {print STDERR "The database for '$file' is present and will be used\n";}
       }else{
         if ($verbose) {print STDERR "Building database for '$file'\t(".$gene_counter{$file}." sequences)\n";}
@@ -2000,9 +2254,7 @@ sub generate_indices {
   }
   elsif ($blastmode eq "mmseqsp") {
     foreach my $file (@_) {
-      #if ($file =~ /\s/) {print STDERR ("$ORANGE\n[WARNING]$NC : File name '$file' contains whitespaces. This might lead to undesired effects. If you encounter unusual behavior, please change the file name!$NC\n");}
-
-      if(!$force && `ls '${file}'.${blastmode}* 1>/dev/null 2>&1` ne ""){
+      if(!$force && `ls '${file}'.${blastmode}* 2>/dev/null` ne ""){
         if ($verbose) {print STDERR "The database for '$file' is present and will be used\n";}
       }else{
         if ($verbose) {print STDERR "Building database for '$file'\t(".$gene_counter{$file}." sequences)\n";}
@@ -2021,10 +2273,7 @@ sub generate_indices {
   }
   elsif ($blastmode eq "mmseqsn") {
     foreach my $file (@_) {
-            #if ($file =~ /\s/) {print STDERR ("$ORANGE\n[WARNING]$NC : File name '$file' contains whitespaces. This might lead to undesired effects. If you encounter unusual behavior, please change the file name!$NC\n");}
-
-
-      if(!$force && `ls '${file}'.${blastmode}* 1>/dev/null 2>&1` ne ""){
+      if(!$force && `ls '${file}'.${blastmode}* 2>/dev/null` ne ""){
         if ($verbose) {print STDERR "The database for '$file' is present and will be used\n";}
       }else{
         if ($verbose) {print STDERR "Building database for '$file'\t(".$gene_counter{$file}." sequences)\n";}
@@ -2043,10 +2292,7 @@ sub generate_indices {
   }
   elsif ($blastmode eq "usearch" || $blastmode eq "ublast") {
     foreach my $file (@_) {
-            #if ($file =~ /\s/) {print STDERR ("$ORANGE\n[WARNING]$NC : File name '$file' contains whitespaces. This might lead to undesired effects. If you encounter unusual behavior, please change the file name!$NC\n");}
-
-
-      if(!$force && `ls '${file}'.${blastmode}* 1>/dev/null 2>&1` ne ""){
+      if(!$force && `ls '${file}'.${blastmode}* 2>/dev/null` ne ""){
         if ($verbose) {print STDERR "The database for '$file' is present and will be used\n";}
       }else{
         if ($verbose) {print STDERR "Building database for '$file'\t(".$gene_counter{$file}." sequences)\n";}
@@ -2065,9 +2311,7 @@ sub generate_indices {
   }
   elsif ($blastmode eq "lastp" || $blastmode eq "lastn") {
     foreach my $file (@_) {
-            #if ($file =~ /\s/) {print STDERR ("$ORANGE\n[WARNING]$NC : File name '$file' contains whitespaces. This might lead to undesired effects. If you encounter unusual behavior, please change the file name!$NC\n");}
-
-      if(!$force && `ls '${file}'.${blastmode}* 1>/dev/null 2>&1` ne ""){
+      if(!$force && `ls '${file}'.${blastmode}* 2>/dev/null` ne ""){
         if ($verbose) {print STDERR "The database for '$file' is present and will be used\n";}
       }else{
         if ($verbose) {print STDERR "Building database for '$file'\t(".$gene_counter{$file}." sequences)\n";}
@@ -2102,9 +2346,7 @@ sub generate_indices {
   }
   elsif ($blastmode =~ m/blast.*\+$/) {  # new blast+
     foreach my $file (@_) {
-            #if ($file =~ /\s/) {print STDERR ("$ORANGE\n[WARNING]$NC : File name '$file' contains whitespaces. This might lead to undesired effects. If you encounter unusual behavior, please change the file name!$NC\n");}
-
-      if(!$force && `ls '${file}'.${blastmode}* 1>/dev/null 2>&1` ne ""){
+      if(!$force && `ls '${file}'.${blastmode}* 2>/dev/null` ne ""){
         if ($verbose) {print STDERR "The database for '$file' is present and will be used\n";}
       }else{
         if ($verbose) {print STDERR "Building database for '$file'\t(".$gene_counter{$file}." sequences)\n";}
@@ -2122,9 +2364,7 @@ sub generate_indices {
     unlink('formatdb.log');
   }elsif ($blastmode =~ m/autoblast/) {  # new blast+
     foreach my $file (@_) {
-            #if ($file =~ /\s/) {print STDERR ("$ORANGE\n[WARNING]$NC : File name '$file' contains whitespaces. This might lead to undesired effects. If you encounter unusual behavior, please change the file name!$NC\n");}
-
-      if(!$force && `ls '${file}'.${blastmode}* 1>/dev/null 2>&1` ne ""){
+      if(!$force && `ls '${file}'.${blastmode}* 2>/dev/null` ne ""){
         if ($verbose) {print STDERR "The database for '$file' is present and will be used\n";}
       }else{
         if ($verbose) {print STDERR "Building database for '$file'\t(".$gene_counter{$file}." sequences)\n";}
@@ -2146,9 +2386,7 @@ sub generate_indices {
     unlink('formatdb.log');
   }else { # old blastall
     foreach my $file (@_) {
-            #if ($file =~ /\s/) {print STDERR ("$ORANGE\n[WARNING]$NC : File name '$file' contains whitespaces. This might lead to undesired effects. If you encounter unusual behavior, please change the file name!$NC\n");}
-
-      if(!$force && `ls '${file}'.${blastmode}* 1>/dev/null 2>&1` ne ""){
+      if(!$force && `ls '${file}'.${blastmode}* 2>/dev/null` ne ""){
         if ($verbose) {print STDERR "The database for '$file' is present and will be used\n";}
       }else{
         if ($verbose) {print STDERR "Building database for '$file'\t(".$gene_counter{$file}." sequences)\n";}
@@ -2653,19 +2891,19 @@ sub read_details {
     if ($curLine =~ />/) { #head line -> gene name and description
 
       if($isoform eq "uniprot"){
-        if($curLine =~ m/isoform of ([^ ,]+)/i || $curLine =~ m/isoform ([^ ,]+)/i){
+        if($curLine =~ m/isoform of ([^ ,]+)/i){
           my $iso = $1;
           $curLine =~ s/[\r\n]+$//;#chomp only removes last \n newline, now also \r are removed and all occurences
           $curLine =~ s/^>//;
           $curLine =~ s/\s.*//;
-          $isoform_mapping{$curLine} = $iso;
+          $isoform_mapping{$file." ".$curLine} = $iso;
 
-          if($debug){print STDERR "found isoform '$curLine' => '$iso'\n";}
+          if($debug){print STDERR "found isoform '$file $curLine' => '$iso'\n";}
         }
         $curLine =~ s/[\r\n]+$//;#chomp only removes last \n newline, now also \r are removed and all occurences
         $curLine =~ s/^>//;
         $curLine =~ s/\s.*//;
-        $isoform_mapping_ncbiuniprot_correction{&convertUniprotAndNCBI($curLine)}=$curLine;
+        $isoform_mapping_ncbiuniprot_correction{$file." ".&convertUniprotAndNCBI($curLine)}=$file." ".$curLine;
 
       }elsif($isoform eq "trinity"){
 
@@ -2675,9 +2913,11 @@ sub read_details {
           $curLine =~ s/^>//;
           $curLine =~ s/\s.*//;
           $iso =~ s/^>//;
-          $isoform_mapping{$curLine}=$iso;
+          $isoform_mapping{$file." ".$curLine}=$iso;
+
+#print STDERR "DEBUG::1) ".$file." ".$curLine."\n";
 
-          if($debug){print STDERR "found isoform '$curLine' => '$iso'\n";}
+          if($debug){print STDERR "found isoform '$file $curLine' => '$iso'\n";}
         }
       }elsif($isoform eq "ncbi"){
         if($curLine =~ m/gene:([^ ]+).*isoform/i || $curLine =~ m/isoform.*Acc:MGI:([0-9])/i){
@@ -2685,14 +2925,14 @@ sub read_details {
           $curLine =~ s/[\r\n]+$//;#chomp only removes last \n newline, now also \r are removed and all occurences
           $curLine =~ s/^>//;
           $curLine =~ s/\s.*//;
-          $isoform_mapping{$curLine} = $iso;
+          $isoform_mapping{$file." ".$curLine} = $iso;
 
-          if($debug){print STDERR "found isoform '$curLine' => '$iso'\n";}
+          if($debug){print STDERR "found isoform '$file $curLine' => '$iso'\n";}
         }
         $curLine =~ s/[\r\n]+$//;#chomp only removes last \n newline, now also \r are removed and all occurences
         $curLine =~ s/^>//;
         $curLine =~ s/\s.*//;
-        $isoform_mapping_ncbiuniprot_correction{$curLine}=$curLine;
+        $isoform_mapping_ncbiuniprot_correction{$file." ".$curLine}=$file." ".$curLine;
       }
 
       if(!$force && $checkfasta && exists($allowedAlphabet->{$blastmode}) && $cur_gene_is_valid<1){last;}


=====================================
proteinorthoHelper.html
=====================================
@@ -248,6 +248,16 @@ The resulting proteinortho command (replace proteinortho with 'perl proteinortho
 <div class="fb-number form-group field-cov"><label for="cov" class="fb-number-label"><b>-cov</b> : min. coverage of best blast alignments in %</label><br>    <input type="number" class="form-control" name="cov" value="50" min="0" step="any" max="100" id="cov"></div>
 <div class="fb-text form-group field-subparaBlast"><label for="subparaBlast" class="fb-text-label"><b>-subparaBlast</b> : additional parameters for the search tool(-p). Careful when changing -p, then this option may lead to errors.<span class="tooltip-element" tooltip="example -subpara='-seg no' for -p=blastp+ or -subpara='--more-sensitive' for -p=diamond. Do not use '=' here!">?</span></label><br>    '<input type="text" pattern="[^'=]+" class="form-control" name="subparaBlast" id="subparaBlast" title="example -subpara='-seg no' or -subpara='--more-sensitive' for diamond. Do not use '=' here!">'</div>
 <div class="fb-checkbox"><input name="additionalFlags1[]" id="checkfasta" value="checkfasta" type="checkbox"><label for="additionalFlags1-6"><b>-checkfasta</b> : checks if the given algorithm (-p) can process the input fasta file.</label></div>
+<br>
+<div class="fb-radio-group form-group field-isoform"><label for="isoform" class="fb-radio-group-label"><b>-isoform</b> : Enables the isoform processing.</label>
+  <div class="radio-group">
+      <div class="fb-radio"><input name="isoform" id="isoform-no" value="blastn" type="radio" checked><label for="isoform-no">disabled</label></div>
+      <div class="fb-radio"><input name="isoform" id="isoform-0" value="blastn" type="radio"><label for="isoform-0">ncbi (Different proteins have the same gene id. The word isoform is mandatory!)</label></div>
+    <div class="fb-radio"><input name="isoform" id="isoform-1" value="blastp" type="radio"><label for="isoform-1">uniprot (Isoforms are specified in uniprot style using the '*_additional.fa' files, just add these files to the analysis)</label></div>
+    <div class="fb-radio"><input name="isoform" id="isoform-2" value="tblastx" type="radio"><label for="isoform-2">trinity (Isoforms are given in the trinity ids TRINITY_DN1000_c115_g5_i1 -> isoform 1)</label></div> 
+  </div>
+</div>
+
 <br><hr>
 <div class=""><h2 id="control-5359817"><b>Synteny</b> options (optional, step 2)</h2></div>
 <hr><br>
@@ -324,6 +334,15 @@ run into smaller chunks, use the -jobs=<strong>M</strong>/<strong>N</strong> opt
 
     var output="proteinortho";
 
+    var isoformno=document.getElementById('isoform-no').checked;
+    var isoform0=document.getElementById('isoform-0').checked;
+    var isoform1=document.getElementById('isoform-1').checked;
+    var isoform2=document.getElementById('isoform-2').checked;
+
+    if(isoform0){output+=" -isoform=ncbi";}
+    else if(isoform1){output+=" -isoform=uniprot";}
+    else if(isoform2){output+=" -isoform=trinity";}
+
     var step0=document.getElementById('step-0').checked;
     var step1=document.getElementById('step-1').checked;
     var step2=document.getElementById('step-2').checked;
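
Note on the hunk above: the new radio group contributes at most one -isoform flag to the generated command line, and the "disabled" choice contributes nothing. A minimal standalone sketch of that mapping follows. The function name isoformFlag and its plain-object argument are hypothetical and not part of manual.html, which reads the state from the DOM via getElementById.

```javascript
// Hypothetical helper (not in manual.html): maps the radio-button state to
// the -isoform flag that the page appends to the generated command line.
function isoformFlag(state) {
  if (state.ncbi) { return " -isoform=ncbi"; }
  else if (state.uniprot) { return " -isoform=uniprot"; }
  else if (state.trinity) { return " -isoform=trinity"; }
  return ""; // "disabled" selected: nothing is appended
}

// Usage: build the command prefix the same way the page does.
var output = "proteinortho" + isoformFlag({ uniprot: true });
console.log(output);
```

Because the branches are chained with else-if, only the first checked option can take effect, which matches the mutually exclusive radio inputs.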


=====================================
src/proteinortho2xml.pl
=====================================
@@ -15,6 +15,7 @@ my $po_file = $ARGV[0];
 my $cur_id = 0;
 
 my @species;
+my @species_num_prots;
 my $headerisset=0;
 my %protID2id;
 my $orthologygroup="";
@@ -115,6 +116,8 @@ if($po_file=~m/proteinortho-graph/){
 	    if(length($line)>0 && substr($line,0,1) ne "#"){ 
 	    	chomp($line);
 
+	    	if(!$headerisset){print STDERR "[ERROR] header is missing.\n";die}
+
 	   		my $curorthologygroup = ""; # create template orthology group
 
 	   		my @linesplt=split(/\t/,$line); # -> first 3 cols are number of species/genes and the algebraic connectivity of the given group
@@ -131,6 +134,7 @@ if($po_file=~m/proteinortho-graph/){
 
 					for(my $j = 0 ; $j < scalar(@linesplt2) ; $j=$j+1){
 
+						if( $linesplt2[$j] eq "*" ){next}
 						# $linesplt2[$j] is a protein (name like tr|A8JEJ4|A8JEJ4_CHLRE) of the current ortho group
 
 						if(!exists($protID2id{$i."#".$linesplt2[$j]})){ # map down to a integer A8JEJ4->0 ...
@@ -144,6 +148,8 @@ if($po_file=~m/proteinortho-graph/){
 						$protId=~s/^[sptr]{2}\|//g; # protein names can be 'sp|O08314|TGT_HELPY' -> 'O08314'
 						$protId=~s/\|[^|]+$//g; # protein names can be 'sp|O08314|TGT_HELPY' -> 'O08314'
 
+						if(scalar @species_num_prots != scalar @linesplt){ @species_num_prots=(0)x(scalar @linesplt) }
+						$species_num_prots[$i-3]++;
 						$species[$i-3].="\t\t\t\t".'<gene id="'.$protID2id{$i."#".$linesplt2[$j]}.'" protId="'.$protId."\"/>\n";
 						$curorthologygroup.="\t\t\t".'<geneRef id="'.$protID2id{$i."#".$linesplt2[$j]}."\"/>\n";
 					}
@@ -153,6 +159,7 @@ if($po_file=~m/proteinortho-graph/){
 	   			}
 	   		}
 	   		if($orthologygroup ne ""){$orthologygroup.="\n";}
+	   		if($con eq "NA"){$con="-1"}
 	   		$orthologygroup.="\t\t<orthologGroup id=\"$cur_id\">\n\t\t\t<score id=\"algcon\" value=\"$con\"/>\n".$curorthologygroup."\t\t</orthologGroup>";
 
 			$cur_id=$cur_id+1;
@@ -196,25 +203,30 @@ print "\n\t<notes>\n\t\tProteinortho OrthoXML file.\n\t</notes>\n";
 if($po_file=~m/proteinortho-graph/){
 	foreach my $cur_species (keys %species_graph){
 
-		print $species_graph{$cur_species}{"header"};
+		if(scalar keys %{$species_graph{$cur_species}} > 0){
+
+			print $species_graph{$cur_species}{"header"};
+
+			foreach my $prot (keys %{$species_graph{$cur_species}}){
+				if($prot eq "header"){next;}
 
-		foreach my $prot (keys %{$species_graph{$cur_species}}){
-			if($prot eq "header"){next;}
+				my $prot_id = $prot;
 
-	   		my $prot_id = $prot;
+				if($prot=~m/^[^\|]+\|([^\|]+)\|[^\|]+/){ # extract the true name (A8JEJ4)
+					$prot_id=$1;
+				}
 
-			if($prot=~m/^[^\|]+\|([^\|]+)\|[^\|]+/){ # extract the true name (A8JEJ4)
-				$prot_id=$1;
+				print "\t\t\t\t".'<gene id="'.$protID2id{$cur_species."#".$prot}.'" protId="'.$prot_id."\"/>\n";
 			}
 
-			print "\t\t\t\t".'<gene id="'.$protID2id{$cur_species."#".$prot}.'" protId="'.$prot_id."\"/>\n";
+			print "\t\t\t</genes>\n\t\t</database>\n\t</species>\n";
 		}
-
-		print "\t\t\t</genes>\n\t\t</database>\n\t</species>\n";
 	}
 }else{
-	foreach my $cur_species (@species){
-		print $cur_species."\t\t\t</genes>\n\t\t</database>\n\t</species>\n";
+	for (my $i = 0; $i < scalar @species; $i++) {
+		if($species_num_prots[$i]>0){
+			print $species[$i]."\t\t\t</genes>\n\t\t</database>\n\t</species>\n";
+		}
 	}
 }
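
Note on the last hunk above: species that received no proteins are now skipped when the species blocks are printed. A minimal sketch of that filter, written in JavaScript for illustration only (the script itself is Perl). The function name speciesBlocks and the sample data are hypothetical; the two arrays stand in for @species and @species_num_prots.

```javascript
// Hypothetical illustration of the filter added to proteinortho2xml.pl:
// emit a species block only if at least one protein was counted for it.
function speciesBlocks(species, numProts) {
  var closing = "\t\t\t</genes>\n\t\t</database>\n\t</species>\n";
  var out = [];
  for (var i = 0; i < species.length; i++) {
    // An uncounted (undefined) or zero entry is skipped.
    if (numProts[i] > 0) { out.push(species[i] + closing); }
  }
  return out;
}

// Usage with made-up data: the second species has no proteins and is dropped.
var blocks = speciesBlocks(["<species A>\n", "<species B>\n"], [2, 0]);
console.log(blocks.length);
```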
 


=====================================
src/proteinortho_grab_proteins.pl
=====================================
@@ -28,12 +28,18 @@
 # @author Paul Klemm
 # @email klemmp at staff.uni-marburg.de
 # @company Bioinformatics, University of Leipzig
-# @version 5
-# @date 1-29-2021
+# @version 6
+# @date 2-10-2021
 #
 ##########################################################################################
 
+use strict;
+use threads;
+use threads::shared;
 use POSIX;
+use File::Basename;
+use Thread::Queue;
+our $QUEUE = Thread::Queue->new();    # A new empty queue
 
 my $usage = <<'ENDUSAGE';
 proteinortho_grab_proteins.pl        greps all genes/proteins of a given fasta file
@@ -48,9 +54,13 @@ proteinortho_grab_proteins.pl (options) QUERY FASTA1 (FASTA2 ...)
 	FASTA*	fasta file(s) (database)
 
 	(options):
-		-tofiles, -t  print everything to files instead of stdout files are called OrthoGroup**.fasta for a proteinortho.tsv file
+		-tofiles, -t  print everything to files instead of stdout files are called OrthoGroup.fasta for a proteinortho.tsv file
+		-tofiles=DIR  additionally specifies an output directory for the OrthoGroup files
 		-E            enables regex matching otherwise the string is escaped (e.g. | -> \|)
 		-exact        search patters are extended with a \b, that indicates end of word.
+		-cpus=INT     the number of parallel open files for reading, this is strictly limited by the I/O bandwidth (default:1).
+		              for fast SSD drives, you can increase this to gain speed.
+		-minprot=X    if you give a proteinortho.tsv file, this filters out groups with less than X proteins (default:0).
 		-source, -s   adds the filename (FASTA1,...) to the found gene-name
 		-F=s          char delimiter for multiple identifier if QUERY is a string input (default: ',')
 		-isoform      if you use proteinortho with --isoform option, then you need to set this option here too. 
@@ -98,28 +108,37 @@ DESCRIPTION
  
 ENDUSAGE
 
-my $query;
-my $help;
-my $tofiles=0;
-my $isoform=0;
-my $justid;
-my $prefix=">";
-my $doregex=0;
-my $source=0;
-my $exact=0;
-my $del=',';
-my @ARGV_copy=@ARGV;
-
-my @ARGV_copyiddone=(1) x (scalar(@ARGV_copy));
-my $ARGV_copyiddone_counter=scalar(@ARGV_copy);
+
+
+our $query;
+our $help;
+our $tofiles="-1";
+our $isoform=0;
+our $justid;
+our $prefix=">";
+our $doregex=0;
+our $source=0;
+our $exact=0;
+our $del=',';
+our @ARGV_copy=@ARGV;
+our $ignoreWarning=0;
+our $minprot = 0;
+our $cpus=1;
+
+our @ARGV_copyiddone=(1) x (scalar(@ARGV_copy));
+our $ARGV_copyiddone_counter=scalar(@ARGV_copy);
 for(my $v = 0 ; $v < scalar @ARGV_copy ; $v++){
 	$ARGV_copyiddone[$v]=1;
 	if($ARGV_copy[$v] =~ m/--?(help|h)$/){$help=1;}
-	elsif($ARGV_copy[$v] =~ m/^--?(tofiles|t)$/){$tofiles=1;}
+	elsif($ARGV_copy[$v] =~ m/^--?(tofiles|t)$/){$tofiles="";}
+	elsif($ARGV_copy[$v] =~ m/^--?(tofiles|t)=(.*)$/){$tofiles="$2/";}
 	elsif($ARGV_copy[$v] =~ m/^--?(source|s)$/){$source=1;}
 	elsif($ARGV_copy[$v] =~ m/^--?F=(.*)$/){$del=$1;}
+	elsif($ARGV_copy[$v] =~ m/^--?minprots?=(.*)$/){$minprot=$1;}
+	elsif($ARGV_copy[$v] =~ m/^--?cpus?=(.*)$/){$cpus=$1;}
 	elsif($ARGV_copy[$v] =~ m/^--?E$/){$doregex=1;}
 	elsif($ARGV_copy[$v] =~ m/^--?isoform$/){$isoform=1;}
+	elsif($ARGV_copy[$v] =~ m/^--?ignoreWarning$/){$ignoreWarning=1;}
 	elsif($ARGV_copy[$v] =~ m/^--?exact$/){$exact=1;}
 	elsif($ARGV_copy[$v] =~ m/^-.+/){print $usage; print STDERR "ERROR: invalid option ".$ARGV_copy[$v]."!\n\n";exit(1);}
 	elsif(!defined($query)){$query = $ARGV_copy[$v];}
@@ -144,8 +163,7 @@ if($fail){
 	exit(1);
 }
 
-my %qdata;
-# my $qdata_count = {};
+our %qdata;
 
 our $orthogroupcounter=0;
 our $genecounter=0;
@@ -157,6 +175,7 @@ for(my $v = 0 ; $v < scalar @ARGV_copy ; $v++){
 	$numOfFastas++;
 }
 our @filenames;
+our @thread_return :shared = ("")x$cpus;
 
 my $foundHeader=0;
 
@@ -164,14 +183,16 @@ my $foundHeader=0;
 
 sub processLine{
 	my $line = shift;
-	my $prefix = shift;
+	my $prefix_group = shift;
 
 	$line=~s/[\r\n]+$//; 
 
 	my @sp = split(/\t/,$line);
 	if(substr($line,0,1) eq "#"){$foundHeader=1;@filenames=@sp; next;}
 	if(scalar(@sp)>3){
-		for(my $v = 3 ; $v < scalar @sp ; $v++){
+	 	if( @sp > 3 && $sp[1] < $minprot ){$orthogroupcounter++; next}
+
+	 	for(my $v = 3 ; $v < scalar @sp ; $v++){
 			if($sp[$v] eq "*" || $sp[$v] eq ""){next;}
 			my @spp = split(",",$sp[$v]);
 
@@ -180,16 +201,17 @@ sub processLine{
 				if($spp[$vv] eq "*" || $spp[$vv] eq ""){next;}
 				$spp[$vv]=~s/^\(//;$spp[$vv]=~s/\)$//;
 
-				if(!exists $filenames[$v]){
-					$filenames[$v]=""
-				}
+				if(!exists $filenames[$v]){ $filenames[$v]="" }
+
+#print join (",",@filenames)."\n";die;
+#print STDERR $filenames[$v].":>".$spp[$vv]."<\n";
 
-				$qdata{$filenames[$v]}{$spp[$vv]}=$prefix.".OrthoGroup".$orthogroupcounter;
+				$qdata{$filenames[$v]}{$spp[$vv]}=$prefix_group.".OrthoGroup".$orthogroupcounter;
 				$genecounter++;
 			}
 		}
+		$orthogroupcounter++;
 	}
-	$orthogroupcounter++;
 }
 
 unless(open(my $FH,'<',$query)) {
@@ -206,9 +228,9 @@ unless(open(my $FH,'<',$query)) {
 		}
 	}
 }else{
-	if($isoform && $exact){print STDERR "[STDERR] WARNING The -isoform option is not compatible with -exact if a proteinortho file is given. -exact is now unset.\n";$exact=0;}
-	elsif(!$isoform && !$exact){print STDERR "[STDERR] WARNING The -exact option is mandatory if a proteinortho file is given. -exact is now set.\n";$exact=1;}
-	if($doregex){print STDERR "[STDERR] WARNING The -E option is not allowed if a proteinortho file is given. -E is now unset.\n";$doregex=0;}
+	if($isoform && $exact){print STDERR "[proteinortho_grab_proteins.pl] WARNING The -isoform option is not compatible with -exact if a proteinortho file is given. -exact is now unset.\n";$exact=0;}
+	elsif(!$isoform && !$exact){print STDERR "[proteinortho_grab_proteins.pl] WARNING The -exact option is mandatory if a proteinortho file is given. -exact is now set.\n";$exact=1;}
+	if($doregex){print STDERR "[proteinortho_grab_proteins.pl] WARNING The -E option is not allowed if a proteinortho file is given. -E is now unset.\n";$doregex=0;}
 
 	my $query_basename=$query;
 	if($query_basename =~ m/\/([^\/]+)$/){
@@ -217,167 +239,185 @@ unless(open(my $FH,'<',$query)) {
 
 	while(<$FH>){ &processLine($_, $query_basename) }
 	close($FH);
-	print STDERR "[STDERR] Done reading the query $query file. Now I know $orthogroupcounter groups with $genecounter genes/proteins in total.\n";
+	print STDERR "[proteinortho_grab_proteins.pl] Done reading the query $query file. Now I know $orthogroupcounter groups with $genecounter genes/proteins in total.\n";
 }
 
-
 if( $foundHeader==0 && $numOfFastas > 3 && $genecounter > 20){
-	print STDERR "\nWARNING : The header of the proteinortho file is missing, this can increase the runtime dramatically. Please include the first line (starting with '#'), to accelerate this program.\n$NC\n";
+	print STDERR "\nWARNING : The header of the proteinortho file is missing, this can increase the runtime dramatically. Please include the first line (starting with '#'), to accelerate this program.\n\n";
 	sleep 1;
 }
 
-if( $tofiles==1 && ($orthogroupcounter > 100) ){
-	print STDERR "\n!!!\nWARNING : This call will produce $orthogroupcounter files (one for each orthology group) !\nIn the *.html file you can individually extract single groups by clicking on the front part of a row.\n$NC";
+if( $tofiles ne "-1" && ($orthogroupcounter > 100) && !$ignoreWarning ){
+	print STDERR "\n!!!\nWARNING : This call will produce $orthogroupcounter files (one for each orthology group) !\nIn the *.html file you can individually extract single groups by clicking on the front part of a row.\n";
 	print STDERR "Press 'strg+c' to prevent me from proceeding or wait 20 seconds to continue...\n!!!\n";
   	sleep 20;
 	print STDERR "\nWell then, proceeding...\n\n";
 	sleep 1;
 }
 
-my $cur_gene="";
-my $cur_gene_filename="";
-my %cur_gene_firsttime;
-my $genecounterfound=0;
-my $basename = "";
+for (my $v = 0 ; $v < scalar @ARGV_copy ; $v++){ if($ARGV_copyiddone[$v]){next;} $QUEUE->enqueue($ARGV_copy[$v]) }
+for (my $i = 0; $i < $cpus; $i++) {$QUEUE->enqueue(undef)}
 
-my $fastai=1;
+for (my $i = 0; $i < $cpus; $i++) { threads->create('worker') } # spawn a thread for each core
 
-my %foundIDs;
+my %cache_files;
+my %master;
+my $genecounterfound = 0;
+my %genefound;
+foreach my $t (threads->list()) {
+	$t->join;
+	my @ret = split("$;$;",$thread_return[$t->tid]); # await thread
+	my $first=shift @ret;
 
-for(my $v = 0 ; $v < scalar @ARGV_copy ; $v++){
+	$genecounterfound += scalar split("$;",$first);
+	map { $genefound{$_}=1 } split("$;",$first);
 
-	if($ARGV_copyiddone[$v]){next;}
+	for (my $i = 0; $i < scalar @ret; $i++) {
+		my ($key,$load)=split("$;",$ret[$i]);
+		if(!exists $master{$key}){$master{$key}=""}
+		$master{$key}.=$load;
+	}
+}
+foreach my $key (keys %master) {
+	if($tofiles ne "-1"){
+		open(my $FHOUT,">$key");
+		print $FHOUT $master{$key};
+		close($FHOUT);
+	}else{
+		print $master{$key};
+	}
+}
 
-	print STDERR "[STDERR] ($fastai/$numOfFastas) : ";if($basename ne ""){print STDERR "Done reading $basename. "}print STDERR "Start reading the fasta file ".($ARGV_copy[$v])."\n";
-	$fastai++;
+sub worker {
+	local $SIG{KILL} = sub { threads->exit };
+	my $tid = threads->tid();
 
-	$basename = $ARGV_copy[$v];
-	if($basename =~ m/\/([^\/]+)$/){
-		$basename=$1;
-	}
+	my @genefound;
+	my %cache;
 
-	open(my $FH,'<',$ARGV_copy[$v]);
-	my $geneprintswitch = 0;
-	
-	while(<$FH>){
-		$_=~s/[\r\n]+$//;
+	while(my $job = $QUEUE->dequeue()){
+		
+		my $cur_gene_filename="";
+		my %cur_gene_firsttime;
+		my $cur_gene="";
+		my $basename = basename($job);
+
+		print STDERR "[proteinortho_grab_proteins.pl] Start processing the fasta file ".($job)." (tid=$tid)\n";
 
-		if($_ eq "" || length $_ < 2 || substr($_,0,1) eq "#"){next;}
+		open(my $FH,'<',$job);
+		my $geneprintswitch = 0;
 		
-		my $curLine=$_;
-
-		if(substr($curLine,0,1) eq $prefix ){
-			$geneprintswitch = 0;
-
-			if($cur_gene ne ""){
-				$cur_gene_filename=~s/\\b//g;
-				$cur_gene_filename=~s/[^a-zA-Z0-9.]//g;
-				if($tofiles){ # print to files
-					my $writemodus=">>";
-					if(!exists $cur_gene_firsttime{$cur_gene_filename}){$writemodus=">";$cur_gene_firsttime{$cur_gene_filename}=1;}
-					open($FHOUT,$writemodus,$cur_gene_filename);
-					print $FHOUT $cur_gene;
-					close($FHOUT);
-				}else{
-					print $cur_gene;
-				}
-				$cur_gene="";
-			}
+		while(<$FH>){
+			$_=~s/[\r\n]+$//;
+
+			if($_ eq "" || length $_ < 2 || substr($_,0,1) eq "#"){next;}
+			
+			my $curLine=$_;
 
-			if($exact && exists $qdata{$basename}){
+			if(substr($curLine,0,1) eq $prefix ){
 
-				my $genename=$curLine;
-				my @arr=split(" ",$genename);
-				if(scalar(@arr)>0){$genename=$arr[0];}
-				$genename=~s/^>//;
+				$geneprintswitch = 0;
 
-				if(exists $qdata{$basename}{$genename}){
+				if($cur_gene ne ""){
+					$cur_gene_filename=~s/\\b//g;
+					$cur_gene_filename=~s/[^a-zA-Z0-9.]//g;
 					
-					my $headerstr=$curLine;
-					if($source){$headerstr=$headerstr." ".$basename;}
+					if(!exists $cache{($tofiles ne "-1" ? $tofiles : "" ).$cur_gene_filename}){$cache{($tofiles ne "-1" ? $tofiles : "" ).$cur_gene_filename}=""}
+					$cache{($tofiles ne "-1" ? $tofiles : "" ).$cur_gene_filename} .= $cur_gene;
+				
+					$cur_gene="";
+				}
 
-					$cur_gene.=$headerstr."\n";
-					$cur_gene_filename=$qdata{$basename}{$genename}.".fasta";
-					$geneprintswitch = 1;
-					$genecounterfound++;
+				if($exact && exists $qdata{$basename}){
 
-					delete $qdata{$basename}{$genename};
-				}
+					my $genename=$curLine;
+					my @arr=split(" ",$genename);
+					if(scalar(@arr)>0){$genename=$arr[0];}
+					$genename=~s/^>//;
 
-			}else{ # fallback, if the basename (filename) does not exists, try all 
+					if(exists $qdata{$basename}{$genename}){
+					
+						my $headerstr=$curLine;
+						if($source){$headerstr=$headerstr." ".$basename;}
 
-				foreach my $filename (keys %qdata) { 
-					if($geneprintswitch){last;}
-					foreach my $key (keys %{$qdata{$filename}}) { 
-						my $regexv=$key; 
-						my $curLine_test = $curLine;
-						
-						if(!$doregex && !$exact){$regexv=quotemeta($regexv);}
+						$cur_gene.=$headerstr."\n";
+						$cur_gene_filename=$qdata{$basename}{$genename}.".fasta";
+						$geneprintswitch = 1;
+						push(@genefound,$genename);
+					}
 
-						my $test_match = 0;
-						if( !$exact ){
-							
-							# use regular expression
-							$test_match = $curLine_test =~ $regexv;
+				}else{ # fallback, if the basename (filename) does not exists, try all 
 
-						}else{
-							
-							# directly compare starting with first 5 character of fasta entry as offset
-							my $offset = 1; # start at 1 -> fasta entries starts with ">"
-							while( !( $test_match = substr($curLine,$offset,length $key) eq $key ) && $offset < 5 ){
-								$offset++;
-							} 
-						}
+					foreach my $filename (keys %qdata) { 
+						if($geneprintswitch){last;}
+						foreach my $key (keys %{$qdata{$filename}}) { 
 
-						if( $test_match ){
+							if(!defined $qdata{$filename}{$key}){next}
 
-							if($qdata{$filename}{$key} eq ""){
-								print STDERR "[STDERR] WARNING The input ($key) was found multiple times in the fasta files ".(!$exact ? "(maybe try --exact)." : ".")."\n";
+							my $regexv = $key; 
+							my $curLine_test = $curLine;
+							
+							if(!$doregex && !$exact){$regexv=quotemeta($regexv);}
+
+							my $test_match = 0;
+							if( !$exact ){
+								
+								# use regular expression
+								$test_match = $curLine_test =~ $regexv;
+
+							}else{
+								
+								# directly compare starting with first 5 character of fasta entry as offset
+								my $offset = 1; # start at 1 -> fasta entries starts with ">"
+								while( !( $test_match = substr($curLine,$offset,length $key) eq $key ) && $offset < 5 ){
+									$offset++;
+								} 
 							}
 
-							my $headerstr=$curLine;
-							if($source){$headerstr=$headerstr." ".$basename;}
+							if( $test_match ){
+
+								if( $qdata{$filename}{$key} eq ""){
+									print STDERR "[proteinortho_grab_proteins.pl] WARNING The input ($key) was found multiple times in the fasta files ".(!$exact ? "(maybe try --exact)." : ".")."\n";
+								}
 
-							$cur_gene.=$headerstr."\n";	
-							$cur_gene_filename = $qdata{$filename}{$key}.".fasta";
-							$geneprintswitch = 1;
-							$genecounterfound++;
+								my $headerstr=$curLine;
+								if($source){$headerstr=$headerstr." ".$basename;}
 
-							delete $qdata{$filename}{$key};
+								$cur_gene.=$headerstr."\n";	
+								$cur_gene_filename = $qdata{$filename}{$key}.".fasta";
+								$geneprintswitch = 1;
+								push(@genefound,$key);
 
-							last;
+								last;
+							}
 						}
 					}
-				}	
-			}
+				}
+			}else{ if($geneprintswitch){ $cur_gene.=$curLine."\n" } }
+		}
+		close($FH);
 
-		}else{
-			if($geneprintswitch){
-				$cur_gene.=$curLine."\n";
-			}
+		if($cur_gene ne ""){
+			$cur_gene_filename=~s/\\b//g;
+			$cur_gene_filename=~s/[^a-zA-Z0-9.]//g;
+
+			if(!exists $cache{($tofiles ne "-1" ? $tofiles : "" ).$cur_gene_filename}){$cache{($tofiles ne "-1" ? $tofiles : "" ).$cur_gene_filename}=""}
+			$cache{($tofiles ne "-1" ? $tofiles : "" ).$cur_gene_filename} .= $cur_gene;
 		}
 	}
-	close($FH);
-}
 
-if($cur_gene ne ""){
-	$cur_gene_filename=~s/\\b//g;
-	$cur_gene_filename=~s/[^a-zA-Z0-9.]//g;
-	if($tofiles){ # print to files
-		my $writemodus=">>";
-		if(!exists $cur_gene_firsttime{$cur_gene_filename}){$writemodus=">";$cur_gene_firsttime{$cur_gene_filename}=1;}
-		open($FHOUT,$writemodus,$cur_gene_filename);
-		print $FHOUT $cur_gene;
-	}else{
-		print $cur_gene;
-	}
-	close($FHOUT);
+	my $ret=join("$;",@genefound);
+	foreach my $key (keys %cache) { $ret.="$;$;$key$;".$cache{$key} }
+
+	$thread_return[$tid]=$ret;
 }
 
+
 if($genecounter != $genecounterfound){
-	print STDERR "[STDERR] WARNING The input ($query) contains $genecounter queries, but I extracted $genecounterfound entries out of the fasta(s).";
-	if(!$exact){print STDERR " If this is not desired, please consider using the -exact option";}elsif($genecounter > $genecounterfound){print STDERR "\n-> This should not have happen, maybe some fasta files are missing as input?\n(If you cannot solve this error, please send a report to incoming+paulklemm-phd-proteinortho-7278443-issue-\@incoming.gitlab.com or visit https://gitlab.com/paulklemm_PHD/proteinortho/wikis/Error%20Codes for more help. Further more all mails to lechner\@staff.uni-marburg.de are welcome)\n";}
-	print STDERR "\n";
+	print STDERR "[proteinortho_grab_proteins.pl] WARNING The input ($query) contains $genecounter queries, but I extracted $genecounterfound entries out of the fasta(s).";
+	if(!$exact){print STDERR " If this is not desired, please consider using the -exact option";}
+	#elsif($genecounter > $genecounterfound){print STDERR "\n-> This should not have happen, maybe some fasta files are missing as input?\n(If you cannot solve this error, please send a report to incoming+paulklemm-phd-proteinortho-7278443-issue-\@incoming.gitlab.com or visit https://gitlab.com/paulklemm_PHD/proteinortho/wikis/Error%20Codes for more help. Further more all mails to lechner\@staff.uni-marburg.de are welcome)\n";}
+	print STDERR "\n\n";
 
 	if($genecounterfound < $genecounter){
 		print STDERR "The following ids were not found:\n";
@@ -387,11 +427,27 @@ if($genecounter != $genecounterfound){
 				if(10 < $counter++){last}
 				print STDERR $key."\n"; 
 			}
-			if(10 < $counter){print STDERR " ...\n";last}
+			if(10 < $counter){
+				print STDERR " ...\n";		
+
+				open(FH,">missing_ids.txt");
+				foreach my $filenamee (keys %qdata) { 
+					foreach my $keyy (keys %{$qdata{$filenamee}}) { 
+						if(!exists $genefound{$keyy}){ print FH "$keyy\n" }
+					}
+				}
+				close(FH);
+				print STDERR "I produced a file containing all missing ids in the current working directory (missing_ids.txt)\n";
+
+				last
+			}
 		}
-		print STDERR "\nPlease make sure that the upper ids are part of the given fasta files (try searching these in the given fasta files) !\n";
+		print STDERR "\nPlease make sure that those ids are part of the given fasta files (try searching these in the given fasta files) !\n";
+	}
+	if($genecounterfound == 0){
+		print STDERR "\nIf you used the --isoform option in proteinortho, then please set -isoform here too (no need to specify the isoform type, e.g. uniprot)!\n";
 	}
 	
 }else{
-	print STDERR "[STDERR] All entries of the query are found in the fasta(s).\n";
+	print STDERR "[proteinortho_grab_proteins.pl] All entries of the query are found in the fasta(s).\n";
 }
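
Note on the threading change above: each worker hands its results back as one string. The found gene names are joined with Perl's $; separator, and every cached output file is appended as a record introduced by a doubled separator, with a single separator between file name and content. The main thread splits on the doubled separator, counts the names in the first field and concatenates the contents per file name. A minimal sketch of that layout, written in JavaScript for illustration only (the script itself is Perl). The function names pack and unpack are hypothetical, and SEP assumes Perl's default $; value "\034".

```javascript
// Assumption: Perl's default subscript separator $; is "\034" (0x1c).
var SEP = "\x1c";

// Hypothetical counterpart of the end of worker(): gene list first,
// then one "SEP SEP key SEP content" record per cached output file.
function pack(geneFound, cache) {
  var ret = geneFound.join(SEP);
  Object.keys(cache).forEach(function (key) {
    ret += SEP + SEP + key + SEP + cache[key];
  });
  return ret;
}

// Hypothetical counterpart of the join loop in the main thread.
function unpack(ret) {
  var records = ret.split(SEP + SEP);
  var first = records.shift();
  var genes = first === "" ? [] : first.split(SEP);
  var files = {};
  records.forEach(function (rec) {
    var idx = rec.indexOf(SEP);
    var key = rec.slice(0, idx);
    files[key] = (files[key] || "") + rec.slice(idx + 1);
  });
  return { genes: genes, files: files };
}

// Usage with made-up data.
var packed = pack(["g1", "g2"], { "q.OrthoGroup0.fasta": ">g1\nAAA\n" });
var result = unpack(packed);
console.log(result.genes.length, Object.keys(result.files));
```

The layout relies on the separator never occurring in FASTA headers or sequences, which holds for ordinary text input.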



View it on GitLab: https://salsa.debian.org/med-team/proteinortho/-/compare/44662a1b14bb7bc3293f4f96c25ae56eba775c93...2626c7e2faf2cc39d9001e5678051ad76b90bd3e


