[med-svn] [Git][med-team/proteinortho][upstream] New upstream version 6.1.7+dfsg

Andreas Tille (@tille) gitlab at salsa.debian.org
Tue Jan 17 15:30:02 GMT 2023



Andreas Tille pushed to branch upstream at Debian Med / proteinortho


Commits:
42208790 by Andreas Tille at 2023-01-17T16:21:02+01:00
New upstream version 6.1.7+dfsg
- - - - -


6 changed files:

- CHANGELOG
- CHANGEUID
- Makefile
- proteinortho6.pl
- src/.gitlab-ci.yml
- src/proteinortho_clustering.cpp


Changes:

=====================================
CHANGELOG
=====================================
@@ -149,19 +149,19 @@
 		proteinoprtho 6.0 alpha release
 	28. Nov
 		Makefile update (lapack zipped,...)
-	5. Dez
+	5. Dec
 		MCL integration (-mcl option in proteinortho.pl)
 		XML bugfix (species with . in the name did confuse the xml parser)
-	6. Dez
+	6. Dec
 		MCL pre/postprocessing (src/do_mcl.pl)
 		double -> float in proteinortho_clustering.cpp
 		weights are unsigned shorts again (only the last commit was unsigned int) proteinortho_clustering.cpp
-	10. Dez
+	10. Dec
 		gff4fasta update: Test for additional naming schemes
-	20.-21. Dez
+	20.-21. Dec
 		src/do_mcl.pl performance increase, dev/blastgraph2CCblastgraphs.cpp improvement
 		orthoXML improvement (still not accepted by orthobechmarkproject)
-	25. Dez
+	25. Dec
 		no lapack version (src/proteinortho_clustering_nolapack.cpp)
 		no cmake version (make all_nocmake, needs lapack installed)
 2019
@@ -273,7 +273,7 @@
 		refined the Makefile for proteinortho_clustering, now it additionally tests if the program is executable with the -test option		
 	12. Oct (5122)
 		enhancement of the Makefile (more verbose, added standard compiler flags, cleanup)
-	2. Dez (5195)
+	2. Dec (5195)
 		fixing proteinortho_grab_protein.pl (https://gitlab.com/paulklemm_PHD/proteinortho/-/issues/41)
 		and ids that are not found are printed to STDERR 
 2021
@@ -374,5 +374,43 @@
 		- added the with and without lapack test back to the Makefile (make test or make test_step3)
 		- refined feature for proteinortho_clustering -maxnodes: if max-nodes are exceeded, remove the worst 10% of edges of this CC (by weight) until the CC breaks into smaller CC
 	23. Nov (6051) v6.1.4
-		- small bugfix: newlines of the output of proteinortho_summary.pl are now printed to STDOUT instead of STDERR, thanks to Matthias Bernt
-		- bugfix: for large project the step=1 could soft lock itself sometimes caused by the log_worker (if used with many cores)
+		- small bugfix: newlines of the output of proteinortho_summary.pl are now printed to STDOUT instead of STDERR, thanks to Matthias Bernt !9
+		- bugfix: for large project the step=1 could soft lock itself sometimes caused by the log_worker (if used with many cores) #69
+	23. Nov (6056) v6.1.5
+		- bugfix: proteinortho_clustering -core will split before evaluating the cluster
+		- improvement: proteinortho_clustering the new purity detection was too careful (resulting in very long runtimes), now only a fixed number of purities are tested improving the runtime dramatically for large inputs
+		- bugfix: the bug of #69 was not fixed entirely
+	23. Nov (6060) v6.1.5
+		- new feature for proteinortho_clustering: -abc enables a abc type graph input (undirected)
+		- some minor adjustments to the -core option of proteinortho_clustering
+	26. Nov (6065) v6.1.5
+		- fixed a bug that sometimes will result in a suspended but running proteinortho instance (diamond needs separate WD for each call)
+	26. Nov (6066) v6.1.6
+		- small bugfix for -step=1 and input files with forbidden symbols (invalid fasta files)
+	30. Nov (6070) v6.1.6
+		- bug #69 did still occure, therefore a new wrapper for diamond, blast, ... is introduced that tests for stalemates and restarts the program a few times until it will fail.
+		- ci/cd clean up, introduced a own docker-worker for testing  
+	1. Dec (6075) v6.1.6
+		- introduced a new default value for maxnodes (species^2) for proteinortho_clustering, usually this improves the runtime by a lot
+		- some minor improvements to proteinortho_clustering
+		- improved stderr output for proteinortho6.pl
+		- added -threads_per_process option to proteinortho6.pl to fine tune the threading system (-step=2 only)
+	5. Dec (6080) v6.1.6
+		- diamond uses all cores available to generate datasets (step=1), this is now restricted to the specified number of cores (-cpus) 
+		- new feature: a fallback system for the blast commands of step=2: if e.g. a diamond calls zombifies (0% cpup running without terminating), proteinortho automatically detects this and restarts this comparison a few times instead of running indefinitely waiting on diamond
+	6. Dec (6081) v6.1.6
+		- disabled the fallback system for mmseqs as mmseqs spawns the true workload in another thread
+	12. Dec (6082) v6.1.6
+		- fixed a bug introduced with 6.1.3 where proteinortho_clustering maxnodes option can lead to infinite loops 
+		- small improvement to the runtime of proteinortho_clustering
+	13. Dec (6085) v6.1.7
+		- fixed a bug where -p=mmseqsp results in empty output 
+	15. Dec (6086) v6.1.7
+		- fixed a bug where multiple -jobs can delete the cache of other job-runs (if used without -keep). Now -jobs defines different individual caches
+	22. Dec (6087) v6.1.7
+		- improved error output if -step=2 fails (includes now the the $! error message), see #70, thanks to Matthias Bernt
+	25. Dec (6088) v6.1.7
+		- minor changes to step=2 stderr messages
+	28. Dec (6088) v6.1.7
+		- added a wait on each kill call of proteinortho_exec
+		
\ No newline at end of file


=====================================
CHANGEUID
=====================================
@@ -1 +1 @@
-6051
+6089


=====================================
Makefile
=====================================
@@ -449,5 +449,5 @@ test_clean2:
 
 .PHONY: clean
 clean:
-	rm -rf src/BUILD test/C.faa.* test/E.faa.* test/C2.faa.* test/L.faa.* test/M.faa.*
+	rm -rf src/BUILD test/C*.faa.* test/E.faa.* test/C2.faa.* test/L.faa.* test/M.faa.*
 	rm -rf src/lapack-3.8.0/


=====================================
proteinortho6.pl
=====================================
@@ -95,6 +95,11 @@ You can adjust the parameters e.g. a more strict -e evalue cut off and write the
 
 the number of processors to use (multicore/processor support) for step 2 and step 3
 
+=item B<--threads_per_process>=number (default: 1)
+
+number of threads per process, 
+e.g. using -cpus=4 and -threads_per_process=2 will spawn 4 workerthreads using 2 cpu cores each = total of 8 cores
+
 =item B<--verbose>={0,1,2} (default: 1)
 
 verbose level. 1:keeps you informed about the progress
@@ -509,7 +514,7 @@ use POSIX;
 ##########################################################################################
 # Variables
 ##########################################################################################
-our $version = "6.1.4";
+our $version = "6.1.7";
 our $step = 0;    # 0/1/2/3 -> do all / only apply step 1 / only apply step 2 / only apply step 3
 our $verbose = 1; # 0/1   -> don't / be verbose
 our $core = 0; 
@@ -630,13 +635,12 @@ foreach my $option (@ARGV) {
   elsif ($option =~ m/^--?cleanupblast$/)       { $checkblast = 1;  }
   elsif ($option =~ m/^--?te?mp=(.+)$/)     { $tmp_path = $1;
                 # make sure it ends with /
-                unless ($tmp_path =~ /\/$/) {$tmp_path .= "/";}my $pwd=$ENV{"HOME"}; $tmp_path=~s/~/$pwd/g;
+                unless ($tmp_path =~ /\/$/) {$tmp_path .= "/";}
+                my $pwd=$ENV{"HOME"}; $tmp_path=~s/~/$pwd/g;
                 if(! -d $tmp_path || ! -R $tmp_path || ! -W $tmp_path ){
                   &Error(" -tmp=$tmp_path is not accessible. Check if the directory exists and is read- and writable.");
                 }else{
-                  if($tmp_path =~ m/proteinortho_cache/){
-                    $keep=1;
-                  }
+                  if($tmp_path =~ m/proteinortho_cache/){ $keep=1; }
                 }
               }
   elsif ($option =~ m/^--?debug$/)          { $debug = 1;  }
@@ -646,6 +650,7 @@ foreach my $option (@ARGV) {
   elsif ($option =~ m/^--?e=(.*)$/)       { $evalue = $1; }
   elsif ($option =~ m/^--?range=([0-9]+)$/)       { $range = $1; }
   elsif ($option =~ m/^--?cpus=(\d*)$/)       { $cpus = $1; }
+  elsif ($option =~ m/^--?threads_per_process=(\d*)$/)       { $threads_per_process = $1; }
   elsif ($option =~ m/^--?cpus=auto$/)          { $cpus = 0; }
   elsif ($option =~ m/^--?alpha=([0-9\.]+)$/)     { $alpha = $1; }
   elsif ($option =~ m/^--?hmm_?evalue=([0-9\.eE+-]+)$/)     { $hmm_search_evalue = $1; }
@@ -750,6 +755,96 @@ our $blastmode_pendant = { # if you choose blastp+ and the input is nucleotide -
 
 my $restart_counter=-1; # restart only once
 
+our $proteinortho_exec='proteinortho_exec(){ # works with sh as well as bash (not csh)
+  max_monitor_time=25920; # 3 days in 10s (sleep_time is 10s)
+  max_restarts=10; # if max restarts are reached -> terminate with exit 1
+  sleep_time=10;
+  max_sus=60; # 10 min of suspension = restart
+  restart_counter=0;
+
+  work(){ 
+    # fork the current process into a monitor and the input cmd
+    # for debug could use this: force a OK after first run : #if [ "$restart_counter" -eq "1" ]; then cmd="sleep 1"; fi 
+
+    eval "$cmd &"; # execute the input command in the background
+    pid=$!;
+    '.($debug ? 'echo "[main] started $pid";': "").'
+    
+    trap \'max_monitor_time=0; kill -1 $pid 2>/dev/null; wait $pid 2>/dev/null;\' 2 9 EXIT;
+
+    (
+      # this is the monitor script, every update interval check if cmd is suspended or not -> if too many suspended calls are found terminate this monitor as well as the cmd call
+      it=0; 
+      is_sus=0;
+      sus_counter=0;
+      sleep $sleep_time;
+      '.($debug ? 'echo "[monitor] starting $pid";': "").'
+      while [ "$it" -lt "$max_monitor_time" ] & [ "$pid" != "-1" ] & kill -0 $pid 2>/dev/null; do 
+        # test of the cmd command has 0 cpu usage -> if yes then increment the sus_counter
+        if [ "$(ps -o pcpu $pid | tail -n1 | grep " 0.")" != "" ];then is_sus=1; sus_counter=$((sus_counter+1)); else sus_counter=0; fi;
+        '.($debug ? 'ps -o pcpu $pid; echo "[monitor] $pid=is_sus:$is_sus sus_counter:$sus_counter"': "").'
+        if [ "$sus_counter" -gt "$max_sus" ]; then '.($debug ? 'echo "max_sus reached : kill $pid"': "").'
+          kill $pid 2>/dev/null;
+          wait $pid 2>/dev/null;
+          return 1; # signal FAIL -> max_sus is reached
+        fi;
+        sleep $sleep_time; it=$((it+1)); 
+      done
+      '.($debug ? 'echo "[monitor] all ok $pid";': "").'
+      return 0; # signal OK
+    ) &
+    pid_monitor=$!;
+
+    wait $pid; # wait on a finish from cmd -> if the monitor stops, this will halt as well
+    ret_pid=$?;
+    
+    if [ "$ret_pid" -eq "0" ]; then # command did run with OK output -> terminate this script (if monitor stops: this will never be 0)
+      if kill -0 $pid_monitor 2>/dev/null; then kill $pid_monitor 2>/dev/null; wait $pid_monitor 2>/dev/null; fi; # kill the monitor if it is running
+      return $ret_pid; 
+    fi;
+
+    wait $pid_monitor; # await the monitor (one last sleep tick)
+    ret_monitor=$?; # this will be 1 if max_sus is reached
+
+    '.($debug ? 'echo "[$restart_counter] pid:$pid (monitor=$pid_monitor) => ret_pid=$ret_pid, ret_monitor=$ret_monitor"': "").'
+
+    if [ "$ret_monitor" != "0" ]; then # restart the process -> another work() call
+      if [ "$restart_counter" -lt "$max_restarts" ];then
+        restart_counter=$((restart_counter+1));
+        '.($debug ? 'echo "[main] $pid restart (restart_counter=$restart_counter) return work";': "").'
+        return &work;
+      else
+        '.($debug ? 'echo "[main] $pid max_restarts reached return 1";': "").'
+        return 1; # max restarts are reached -> terminate with 1
+      fi
+    else
+      '.($debug ? 'echo "[main] $pid return $ret_pid";': "").'
+      return $ret_pid; # monitor was fine but the program did terminate -> passthrough
+    fi
+  }
+
+  if [ "$1" != "" ]; then
+    cmd=$1;
+    pid=-1;
+
+    precheck=1;
+    if ! command -v ps >/dev/null 2>&1; then precheck=0; fi;
+    if ! command -v grep >/dev/null 2>&1; then precheck=0; fi;
+    if ! command -v tail >/dev/null 2>&1; then precheck=0; fi;
+    if ! command -v kill >/dev/null 2>&1; then precheck=0; fi;
+    if ! command -v wait >/dev/null 2>&1; then precheck=0; fi;
+
+    if [ "$precheck" -eq "1" ]; then
+      #trap \'max_monitor_time=0; echo "KILL"; jobs -p; jobs -p | xargs kill -1 2>/dev/null; echo "KILL DONE";jobs -p; exit\' 2 9 EXIT;
+      exit &work;
+    else
+      '.($debug ? 'echo "precheck failed, using now a simple eval call"': "").'
+      eval $cmd;
+      exit $?;
+    fi
+  fi
+}';
+
 ##########################################################################################
 # Check parameters
 ##########################################################################################
@@ -781,12 +876,11 @@ if ($jobnumber != -1) {
   $run_id = "_$jobnumber"."_".$split_to_X_jobs;
 }
 
-
-our $gccversionstr = `gcc --version 2>/dev/null`;
-our $gccversion_main = "";
-if ($? == 0) {
-  $gccversion_main=($gccversionstr =~ m/^[^\n]+ ([\d]+)\.[\d\.]+\n/);
-}
+#our $gccversionstr = `gcc --version 2>/dev/null`;
+#our $gccversion_main = "";
+#if ($? == 0) {
+#  $gccversion_main=($gccversionstr =~ m/^[^\n]+ ([\d]+)\.[\d\.]+\n/);
+#}
 #our $ompprocbind="close";
 #if($gccversion_main eq "" || $gccversion_main < 5){
 #  $ompprocbind="true";
@@ -896,20 +990,20 @@ if($step!=1 && $step!=4){
   unlink($desctable);
 }
 
-if($tmp_path eq ""){$tmp_path=".";}
-if( $tmp_path !~ qr/proteinortho_cache_$project/ ){
-  if(! -d $tmp_path."/proteinortho_cache_$project"){
-    system("mkdir $tmp_path/proteinortho_cache_$project");
-    if( -d $tmp_path."/proteinortho_cache_$project"){
-      $tmp_path.="/proteinortho_cache_$project/";
+if( $tmp_path eq "" ){$tmp_path=".";}
+my $suffixjob="";
+if($jobnumber != -1){ $suffixjob = "_${jobnumber}_${jobs_todo}" }
+if( $tmp_path !~ qr/proteinortho_cache_$project$suffixjob/ ){
+  if(! -d $tmp_path."/proteinortho_cache_$project$suffixjob"){
+    system("mkdir $tmp_path/proteinortho_cache_$project$suffixjob");
+    if( -d $tmp_path."/proteinortho_cache_$project$suffixjob"){
+      $tmp_path.="/proteinortho_cache_$project$suffixjob/";
     }
   }else{
-    $tmp_path.="/proteinortho_cache_$project/";
+    $tmp_path.="/proteinortho_cache_$project$suffixjob/";
   }
 }else{
-  if(!-d $tmp_path){
-   mkdir("$tmp_path");
-  }
+  if(! -d $tmp_path){ mkdir("$tmp_path"); }
 }
 if($inproject ne $project){ system("cd '$tmp_path/'; ln -s ../proteinortho_cache_$inproject/* ."); }
 
@@ -1001,15 +1095,15 @@ if(scalar keys %isoform_mapping == 0 && $isoform ne ""){print STDERR "\n!!!!\nWA
 if ($step == 0 || $step == 1) {
   if($verbose){print STDERR "\n$GREEN**Step 1**$NC\n";}
 
-  $do_log=1;
-  $_->kill('KILL')->detach foreach threads->list(); 
-  threads->create('log_Worker');
-  foreach my $file (@files) { $step1_QUEUE->enqueue($file); }
-  if($debug){print STDERR "Creating ".(scalar @files > $cpus+10 || $debug ? $cpus : 1)." generate_indices_Worker (cpus=$cpus) for ".$step1_QUEUE->pending." jobs\n"}
-  if($debug){print STDERR "currently running ".(threads->list())." threads (pre spawn)\n";}
-  threads->create('generate_indices_Worker');
+  #$do_log=1;
+  #$_->kill('KILL')->detach foreach threads->list(); 
+  #threads->create('log_Worker');
+  #foreach my $file (@files) { $step1_QUEUE->enqueue($file); }
+  #if($debug){print STDERR "Creating ".(scalar @files > $cpus+10 || $debug ? $cpus : 1)." generate_indices_Worker (cpus=$cpus) for ".$step1_QUEUE->pending." jobs\n"}
+  #if($debug){print STDERR "currently running ".(threads->list())." threads (pre spawn)\n";}
+  #threads->create('generate_indices_Worker');
   
-  # INFO:
+  # !! INFO: !!
   # here we use only one thread to generate the db files !
   # -> diamond will fail if multiple makedb instances are running in parallel, 
   #    prducing un-joinable running threads (stuck within the `$cmd 2>&1` in the generate_indices)
@@ -1018,11 +1112,12 @@ if ($step == 0 || $step == 1) {
   
   # code for multiple threads (does not work sometimes for diamond): 
   #for (my $i = 0; $i < ((scalar @files > 500 && scalar @files > $cpus+10) || $debug ? $cpus : 1) ; $i++) { my $tmp = threads->create('generate_indices_Worker') } # spawn a thread for each core
-  if($debug){print STDERR "currently running ".(threads->list())." threads (post spawn)\n";}
-  $_->join() foreach threads->list(); 
-#  $_->kill('KILL')->detach foreach threads->list(); 
+  #if($debug){print STDERR "currently running ".(threads->list())." threads (post spawn)\n";}
+  #$_->join() foreach threads->list(); 
 
-  #&generate_indices(@files);  # Generate index files for blast
+  foreach my $file (@files){
+    &generate_indices($file);
+  }; # Generate index files for blast
   if ($desc) {
     &write_descriptions;  # Write sequence description file
   }
@@ -1510,6 +1605,8 @@ Options:
          -project=    prefix for all result file names [default: myproject]
          -inproject=  load data from this namespace instead (works with intermediate files for step=2 and blast-graph for step=3)
          -cpus=       number of processors to use [default: auto]
+         -threads_per_process= number of threads per process [default: $threads_per_process], 
+                      e.g. using -cpus=4 and -threads_per_process=2 will spawn 4 workerthreads using 2 cpu cores each = total of 8 cores
          -silent      sets verbose to 0
          -temp=       path for temporary files [default: working directory]
          -keep        stores temporary blast results for reuse (proteinortho_cache_project directory). 
@@ -1679,6 +1776,7 @@ sub run_blast {
   &print_blast_stats();
 
   # Spawn worker threads
+
   for (my $i = 0; $i < $cpus; $i++) {threads->create('workerthread');}
 
   # For each file against each other file
@@ -2090,6 +2188,7 @@ sub get_gene_lengths {
   my $last_gene = "";
   my $lengths_lastgene=0;
   while (my $file = shift @_) {
+    my $abs_path_file=abs_path "$file "; $abs_path_file=~s/ $//g;
     open(FASTA,"<$file") || &Error("Could not open '$file': $!");
     while (<FASTA>) {
       $_=~s/[\r\n]+$//;
@@ -2098,7 +2197,7 @@ sub get_gene_lengths {
         if($last_gene ne ""){
           if( (exists $lengths{$last_gene} && $lengths_lastgene > $lengths{$last_gene}) || !exists $lengths{$last_gene}){
             $lengths{$last_gene}=$lengths_lastgene;
-            if($blastmode eq "autoblast" && $autoblast_fileis{$file} eq "nucl"){
+            if($blastmode eq "autoblast" && $autoblast_fileis{$abs_path_file} eq "nucl"){
               $lengths{$last_gene}/=3;
             }
             $lengths{&convertUniprotAndNCBI($last_gene)}=$lengths{$last_gene};
@@ -2121,7 +2220,7 @@ sub get_gene_lengths {
     if($last_gene ne ""){
       if( (exists $lengths{$last_gene} && $lengths_lastgene > $lengths{$last_gene}) || !exists $lengths{$last_gene}){
         $lengths{$last_gene}=$lengths_lastgene;
-        if($blastmode eq "autoblast" && $autoblast_fileis{$file} eq "nucl"){
+        if($blastmode eq "autoblast" && $autoblast_fileis{$abs_path_file} eq "nucl"){
           $lengths{$last_gene}/=3;
         }
         $lengths{&convertUniprotAndNCBI($last_gene)}=$lengths{$last_gene};
@@ -2527,14 +2626,15 @@ sub generate_indices {
         if ($verbose) {$msg.= "The database for '$file' is present and will be used\n";}
       }else{
         if ($verbose) {$msg.= "Building database for '$file'\t(".$gene_counter{$file}." sequences)\n";}
-        if(!exists $autoblast_fileis{$file}){
+        my $abs_path_file=abs_path "$file "; $abs_path_file=~s/ $//g;
+        if(!exists $autoblast_fileis{$abs_path_file}){
           $msg.= "$ORANGE\n[WARNING]$NC I could not detect the type of '$file', i assume aminoacid sequences...\n";
-          $autoblast_fileis{$file}="prot";
+          $autoblast_fileis{$abs_path_file}="prot";
         }
-        if ($debug) {$msg.= "$makedb -dbtype ".$autoblast_fileis{$file}." -in '$file' -out '$file.$blastmode'\n";}
-        $cmd="$makedb -dbtype ".$autoblast_fileis{$file}." -in '$file' -out '$file.$blastmode'";
+        if ($debug) {$msg.= "$makedb -dbtype ".$autoblast_fileis{$abs_path_file}." -in '$file' -out '$file.$blastmode'\n";}
+        $cmd="$makedb -dbtype ".$autoblast_fileis{$abs_path_file}." -in '$file' -out '$file.$blastmode'";
         my $makedb_ret = `$cmd 2>&1`;
-        if ($? != 0) {$msg.= ("$ORANGE\n[WARNING]$NC ".$blastmode." failed to create a database ($makedb -dbtype ".$autoblast_fileis{$file}." -in '$file' -out '$file.$blastmode'). The output is:\n-----------------\n$makedb_ret-----------------\nI will now proceed with writing the database files to the DB/ directory in $tmp_path (-tmp)."); if($step==1){$msg.= "$ORANGE Please ensure that you use -tmp=$tmp_path -keep (and use the same -project= name) for future analysis.$NC";}print "\n";
+        if ($? != 0) {$msg.= ("$ORANGE\n[WARNING]$NC ".$blastmode." failed to create a database ($makedb -dbtype ".$autoblast_fileis{$abs_path_file}." -in '$file' -out '$file.$blastmode'). The output is:\n-----------------\n$makedb_ret-----------------\nI will now proceed with writing the database files to the DB/ directory in $tmp_path (-tmp)."); if($step==1){$msg.= "$ORANGE Please ensure that you use -tmp=$tmp_path -keep (and use the same -project= name) for future analysis.$NC";}print "\n";
 
           mkdir("$tmp_path/DB");
           if($step==1){$oldkeep=$keep;$keep=1;}
@@ -2570,7 +2670,8 @@ sub generate_indices {
   }
 ##MARK_FOR_NEW_BLAST_ALGORITHM
   if($debug){print STDERR "generate_indices:$file almost end\n"}
-  $log_QUEUE->enqueue($msg);
+  if($verbose){print STDERR $msg}
+  #$log_QUEUE->enqueue($msg);
   if($debug){print STDERR "generate_indices:$file end\n"}
   return;
 }
@@ -2583,12 +2684,15 @@ sub blast {
   my $a = $_[0];
   my $b = $_[1];
   my $thread_id = $_[2];
-  if( $blastmode eq "mmseqsp" || $blastmode eq "mmseqsn" || $blastmode eq "mmseqs" ){ if(!-d "$tmp_path/mmseqs".$thread_id ){mkdir("$tmp_path/mmseqs".$thread_id);}else{rmdir("$tmp_path/mmseqs".$thread_id);mkdir("$tmp_path/mmseqs".$thread_id);}}
+  #if( $blastmode eq "mmseqsp" || $blastmode eq "mmseqsn" || $blastmode eq "mmseqs" ){ if(!-d "$tmp_path/mmseqs".$thread_id ){mkdir("$tmp_path/mmseqs".$thread_id);}else{rmdir("$tmp_path/mmseqs".$thread_id);mkdir("$tmp_path/mmseqs".$thread_id);}}
 
   $a = basename($a);
   $b = basename($b);
 
-  my $bla = "$tmp_path$b.vs.$a.".$blastmode;
+  my $bla = "$b.vs.$a.".$blastmode;
+
+  my $WD="${tmp_path}/${bla}_WD";
+  mkdir($WD);
 
   my $blastOptions_diamond=''; #additional option --sensitive (if gene length is larger than 40)
   my $max_genlen = 1;
@@ -2607,12 +2711,12 @@ sub blast {
 
   my ($fileType) = $a =~ /\.(\w*)$/;
 
-  $a = $_[0]; # get a b again for blast algorithm (now they are not basenamed)
-  $b = $_[1];
+  $a = abs_path "$_[0] "; # get a b again for blast algorithm (now they are not basenamed)
+  $b = abs_path "$_[1] ";
+  $a=~s/ $//g; # this is a trick, to get the abs_path of the symlink (not the target) 1. add whitespace to filename 2. call abs_path 3. remove whitespace
+  $b=~s/ $//g; # otherwise the target is printed, which does not neccesarilly contain the DB files
 
-  if(-e $tmp_path."/DB/".basename($a)){
-    $a = $tmp_path."/DB/".basename($a);
-  }
+  if(-e $tmp_path."/DB/".basename($a)){ $a = "../DB/".basename($a); }
 
   #
   # 1. run the blast algorithm -> produce a $bla.tmp
@@ -2625,131 +2729,145 @@ sub blast {
   # 
   # sh -ic '{ cat 1>&3; kill 0; } | { sleep 2; kill 0; }' 3>&1
 
-  if    ($blastmode eq "blastp_legacy" || $blastmode eq "blastn_legacy" || $blastmode eq "tblastx_legacy") {lock($threads_per_process); $command = $binpath."blastall -a $threads_per_process -d '$a.$blastmode' -i '$b' -p $blastmode -m8 -e $evalue $blastOptions $printSTDERR ".($keep ? " | tee '$bla.tmp'" : "");}
-  elsif ($blastmode eq "autoblast" && $autoblast_fileis{$a} eq "prot" && $autoblast_fileis{$b} eq "prot")  {lock($threads_per_process); $command = $binpath."blastp -num_threads $threads_per_process -db '$a.$blastmode' -query '$b' -evalue $evalue -outfmt 6 $blastOptions $printSTDERR ".($keep ? " | tee '$bla.tmp'" : "");}
-  elsif ($blastmode eq "autoblast" && $autoblast_fileis{$a} eq "nucl" && $autoblast_fileis{$b} eq "nucl")  {lock($threads_per_process); $command = $binpath."blastn -num_threads $threads_per_process -db '$a.$blastmode' -query '$b' -evalue $evalue -outfmt 6 $blastOptions $printSTDERR ".($keep ? " | tee '$bla.tmp'" : "");}
-  elsif ($blastmode eq "autoblast" && $autoblast_fileis{$a} eq "nucl" && $autoblast_fileis{$b} eq "prot")  {lock($threads_per_process); $command = $binpath."tblastn -num_threads $threads_per_process -db '$a.$blastmode' -query '$b' -evalue $evalue -outfmt 6 $blastOptions $printSTDERR ".($keep ? " | tee '$bla.tmp'" : "");}
-  elsif ($blastmode eq "autoblast" && $autoblast_fileis{$a} eq "prot" && $autoblast_fileis{$b} eq "nucl")  {lock($threads_per_process); $command = $binpath."blastx -num_threads $threads_per_process -db '$a.$blastmode' -query '$b' -evalue $evalue -outfmt 6 $blastOptions $printSTDERR ".($keep ? " | tee '$bla.tmp'" : "");}
-  elsif ($blastmode eq "blastp+")   {lock($threads_per_process); $command = $binpath."blastp -num_threads $threads_per_process -db '$a.$blastmode' -query '$b' -evalue $evalue -outfmt 6 $blastOptions $printSTDERR ".($keep ? " | tee '$bla.tmp'" : "");}
-  elsif ($blastmode eq "blastn+")   {lock($threads_per_process); $command = $binpath."blastn -num_threads $threads_per_process -db '$a.$blastmode' -query '$b' -evalue $evalue -outfmt 6 $blastOptions $printSTDERR ".($keep ? " | tee '$bla.tmp'" : "");}
-  elsif ($blastmode eq "tblastx+")  {lock($threads_per_process); $command = $binpath."tblastx -num_threads $threads_per_process -db '$a.$blastmode' -query '$b' -evalue $evalue -outfmt 6 $blastOptions $printSTDERR ".($keep ? " | tee '$bla.tmp'" : "");}
-  elsif ($blastmode eq "blatp")     {lock($threads_per_process); $command = $binpath."blat -prot '$a' '$b' -out=blast8 '$bla.tmp' $blastOptions >\/dev\/null $printSTDERR";}
-  elsif ($blastmode eq "blatn")     {lock($threads_per_process); $command = $binpath."blat '$a' '$b' -out=blast8 '$bla.tmp' $blastOptions >\/dev\/null $printSTDERR";}
-  elsif ($blastmode eq "blatx")     {lock($threads_per_process); $command = $binpath."blat '$a' '$b' -t=dnax -q=dnax -out=blast8 '$bla.tmp' $blastOptions >\/dev\/null $printSTDERR";}
-  #elsif ($blastmode eq "pblatp")      {lock($threads_per_process); $command = $binpath."pblat -prot '$a' '$b' -threads=$threads_per_process -out=blast8 '$bla' >\/dev\/null 2>&1";}
-  #elsif ($blastmode eq "pblatn")      {lock($threads_per_process); $command = $binpath."pblat '$a' '$b' -threads=$threads_per_process -out=blast8 '$bla' >\/dev\/null 2>&1";}
-  #elsif ($blastmode eq "pblatx")      {lock($threads_per_process); $command = $binpath."pblat '$a' '$b' -t=dnax -q=dnax -threads=$threads_per_process -out=blast8 '$bla' >\/dev\/null 2>&1";}
-  elsif ($blastmode eq "diamond" )  {lock($threads_per_process); $command = $binpath."diamond blastp --threads $threads_per_process --db '$a.$blastmode' --query '$b' -e $evalue --outfmt 6 --quiet $blastOptions $blastOptions_diamond $printSTDERR ".($keep ? " | tee '$bla.tmp'" : "");}
-  elsif ($blastmode eq "topaz" )    {lock($threads_per_process); $command = $binpath."topaz search -T $threads_per_process -p '$a.$blastmode' -f '$b' -E $evalue --blasttab $blastOptions >'$bla.tmp' $printSTDERR";}
-  elsif ($blastmode eq "ublast")    {lock($threads_per_process); $command = $binpath."usearch -ublast '$b' -db '$a.$blastmode' -threads $threads_per_process -evalue $evalue -blast6out '$bla.tmp' $blastOptions >\/dev\/null $printSTDERR";}
-  elsif ($blastmode eq "usearch")   {lock($threads_per_process); $command = $binpath."usearch -usearch_local '$b' -db '$a.$blastmode' -threads $threads_per_process -evalue $evalue -id 0 -blast6out '$bla.tmp' $blastOptions >\/dev\/null $printSTDERR";}
-  elsif ($blastmode eq "rapsearch") {lock($threads_per_process); $command = $binpath."rapsearch -s T -d '$a.$blastmode' -q '$b' -z $threads_per_process -e $evalue -out '$bla.tmp' $blastOptions >\/dev\/null $printSTDERR";}
-  elsif ($blastmode eq "lastp")     {lock($threads_per_process); $command = $binpath."lastal -E$evalue -fBlastTab+ -pBL62 -P$threads_per_process '$a.$blastmode' '$b' $blastOptions > '$bla.tmp' $printSTDERR";}
-  elsif ($blastmode eq "lastn")     {lock($threads_per_process); $command = $binpath."lastal -E$evalue -fBlastTab+ -P$threads_per_process '$a.$blastmode' '$b' $blastOptions > '$bla.tmp' $printSTDERR";}
-  elsif ($blastmode eq "mmseqsn" || $blastmode eq "mmseqsp")       {lock($threads_per_process); $command = $binpath."mmseqs search '$b.$blastmode' '$a.$blastmode' '$bla.tmp' $tmp_path/mmseqs$thread_id --threads $threads_per_process -e $evalue $blastOptions >\/dev\/null $printSTDERR";}
+  if    ($blastmode eq "blastp_legacy" || $blastmode eq "blastn_legacy" || $blastmode eq "tblastx_legacy") {lock($threads_per_process); $command = $binpath."blastall -a $threads_per_process -d '$a.$blastmode' -i '$b' -p $blastmode -m8 -e $evalue $blastOptions $printSTDERR >'../$bla.tmp'";}# .($keep ? " | tee '../$bla.tmp'" : "");}
+  elsif ($blastmode eq "autoblast" && $autoblast_fileis{$a} eq "prot" && $autoblast_fileis{$b} eq "prot")  {lock($threads_per_process); $command = $binpath."blastp -num_threads $threads_per_process -db '$a.$blastmode' -query '$b' -evalue $evalue -outfmt 6 $blastOptions $printSTDERR >'../$bla.tmp'";}# .($keep ? " | tee '../$bla.tmp'" : "");}
+  elsif ($blastmode eq "autoblast" && $autoblast_fileis{$a} eq "nucl" && $autoblast_fileis{$b} eq "nucl")  {lock($threads_per_process); $command = $binpath."blastn -num_threads $threads_per_process -db '$a.$blastmode' -query '$b' -evalue $evalue -outfmt 6 $blastOptions $printSTDERR >'../$bla.tmp'";}# .($keep ? " | tee '../$bla.tmp'" : "");}
+  elsif ($blastmode eq "autoblast" && $autoblast_fileis{$a} eq "nucl" && $autoblast_fileis{$b} eq "prot")  {lock($threads_per_process); $command = $binpath."tblastn -num_threads $threads_per_process -db '$a.$blastmode' -query '$b' -evalue $evalue -outfmt 6 $blastOptions $printSTDERR >'../$bla.tmp'";}# .($keep ? " | tee '../$bla.tmp'" : "");}
+  elsif ($blastmode eq "autoblast" && $autoblast_fileis{$a} eq "prot" && $autoblast_fileis{$b} eq "nucl")  {lock($threads_per_process); $command = $binpath."blastx -num_threads $threads_per_process -db '$a.$blastmode' -query '$b' -evalue $evalue -outfmt 6 $blastOptions $printSTDERR >'../$bla.tmp'";}# .($keep ? " | tee '../$bla.tmp'" : "");}
+  elsif ($blastmode eq "blastp+")   {lock($threads_per_process); $command = $binpath."blastp -num_threads $threads_per_process -db '$a.$blastmode' -query '$b' -evalue $evalue -outfmt 6 $blastOptions $printSTDERR >'../$bla.tmp'";}# .($keep ? " | tee '../$bla.tmp'" : "");}
+  elsif ($blastmode eq "blastn+")   {lock($threads_per_process); $command = $binpath."blastn -num_threads $threads_per_process -db '$a.$blastmode' -query '$b' -evalue $evalue -outfmt 6 $blastOptions $printSTDERR >'../$bla.tmp'";}# .($keep ? " | tee '../$bla.tmp'" : "");}
+  elsif ($blastmode eq "tblastx+")  {lock($threads_per_process); $command = $binpath."tblastx -num_threads $threads_per_process -db '$a.$blastmode' -query '$b' -evalue $evalue -outfmt 6 $blastOptions $printSTDERR >'../$bla.tmp'";}# .($keep ? " | tee '../$bla.tmp'" : "");}
+  elsif ($blastmode eq "blatp")     {lock($threads_per_process); $command = $binpath."blat -prot '$a' '$b' -out=blast8 '../$bla.tmp' $blastOptions >\/dev\/null $printSTDERR";}
+  elsif ($blastmode eq "blatn")     {lock($threads_per_process); $command = $binpath."blat '$a' '$b' -out=blast8 '../$bla.tmp' $blastOptions >\/dev\/null $printSTDERR";}
+  elsif ($blastmode eq "blatx")     {lock($threads_per_process); $command = $binpath."blat '$a' '$b' -t=dnax -q=dnax -out=blast8 '../$bla.tmp' $blastOptions >\/dev\/null $printSTDERR";}
+  #elsif ($blastmode eq "pblatp")      {lock($threads_per_process); $command = $binpath."pblat -prot '$a' '$b' -threads=$threads_per_process -out=blast8 '../$bla' >\/dev\/null 2>&1";}
+  #elsif ($blastmode eq "pblatn")      {lock($threads_per_process); $command = $binpath."pblat '$a' '$b' -threads=$threads_per_process -out=blast8 '../$bla' >\/dev\/null 2>&1";}
+  #elsif ($blastmode eq "pblatx")      {lock($threads_per_process); $command = $binpath."pblat '$a' '$b' -t=dnax -q=dnax -threads=$threads_per_process -out=blast8 '../$bla' >\/dev\/null 2>&1";}
+  elsif ($blastmode eq "diamond" )  {lock($threads_per_process); $command = $binpath."diamond blastp --threads $threads_per_process --db '$a.$blastmode' --query '$b' -e $evalue --outfmt 6 --quiet $blastOptions $blastOptions_diamond $printSTDERR >'../$bla.tmp'";}# .($keep ? " | tee '../$bla.tmp'" : "");}
+  elsif ($blastmode eq "topaz" )    {lock($threads_per_process); $command = $binpath."topaz search -T $threads_per_process -p '$a.$blastmode' -f '$b' -E $evalue --blasttab $blastOptions >'../$bla.tmp' $printSTDERR";}
+  elsif ($blastmode eq "ublast")    {lock($threads_per_process); $command = $binpath."usearch -ublast '$b' -db '$a.$blastmode' -threads $threads_per_process -evalue $evalue -blast6out '../$bla.tmp' $blastOptions >\/dev\/null $printSTDERR";}
+  elsif ($blastmode eq "usearch")   {lock($threads_per_process); $command = $binpath."usearch -usearch_local '$b' -db '$a.$blastmode' -threads $threads_per_process -evalue $evalue -id 0 -blast6out '../$bla.tmp' $blastOptions >\/dev\/null $printSTDERR";}
+  elsif ($blastmode eq "rapsearch") {lock($threads_per_process); $command = $binpath."rapsearch -s T -d '$a.$blastmode' -q '$b' -z $threads_per_process -e $evalue -out '../$bla.tmp' $blastOptions >\/dev\/null $printSTDERR";}
+  elsif ($blastmode eq "lastp")     {lock($threads_per_process); $command = $binpath."lastal -E$evalue -fBlastTab+ -pBL62 -P$threads_per_process '$a.$blastmode' '$b' $blastOptions > '../$bla.tmp' $printSTDERR";}
+  elsif ($blastmode eq "lastn")     {lock($threads_per_process); $command = $binpath."lastal -E$evalue -fBlastTab+ -P$threads_per_process '$a.$blastmode' '$b' $blastOptions > '../$bla.tmp' $printSTDERR";}
+  elsif ($blastmode eq "mmseqsn" || $blastmode eq "mmseqsp")       {lock($threads_per_process); $command = $binpath."mmseqs search '$b.$blastmode' '$a.$blastmode' '../$bla.tmp' mmseqs$thread_id --threads $threads_per_process -e $evalue $blastOptions >\/dev\/null $printSTDERR";}
   else  {&Error("This should not happen! Please submit the FASTA file(s) and the parameter vector (above) to incoming+paulklemm-phd-proteinortho-7278443-issue-\@incoming.gitlab.com to help fixing this issue.");}
   ##MARK_FOR_NEW_BLAST_ALGORITHM
 
   if($identical || $range>=0){ $command=~s/-?-outfmt 6 (?!qseqid)//g; } # remove additional outfmt 6 (not followed by qseqid)
   # NOTE diamond fails on --outfmt '6 qseqid ...' but is ok with --outfmt 6 qseqid ... 
 
+  my $plain_command=$command;
+  if ( $blastmode eq "mmseqsn" || $blastmode eq "mmseqsp" ){
+    $command = "(cd '$WD' && $command)\n";
+  }else{
+    $command = "cat << 'EOF'|sh;\n$proteinortho_exec; (cd '$WD' && proteinortho_exec \"($command)\")\nEOF\n";
+  }
+
   # File does not exists yet or I am forced to rewrite it
-  if (!(-e $bla) || $force) {
+  if (!(-e "$tmp_path/$bla") || $force) {
 
-    if (-e $bla && $force) { unlink("$bla"); }
+    if (-e "$tmp_path/$bla" && $force) { unlink("$tmp_path/$bla"); }
 
     if ($debug || $verbose==2) {print STDERR "$command\n";}                     # 5.16
 
     if ($blastmode eq "diamond") {
-      @data=`$command`; 
-      if ($? != 0) {
-        $command = $binpath."diamond blastp --ignore-warnings --threads $threads_per_process --db '$a.$blastmode' --query '$b' -e $evalue --outfmt 6 --quiet $blastOptions $blastOptions_diamond $printSTDERR ".($keep ? " | tee '$bla.tmp'" : "");
-        @data=`$command`;
-        if ($? != 0) {&TypicalBlastError($blastmode,$command);}
-      }
-      if($debug eq "test_sort"){while (<"$bla.tmp">){if ($_ =~ /[^\t]+([,])[^\t]+[eE]/) {print "found forbidden symbol '$1' at $_ in $bla\n";&reset_locale();die;}}}
-      if($keep){system("mv '$bla.tmp' '$bla'");}
+      
+      system("$command");
+      if ($? != 0) {&TypicalBlastError($blastmode,$plain_command,$?,$!);}
+      #if ($? != 0) {
+      #  $command = $binpath."diamond blastp --ignore-warnings --threads $threads_per_process --db '$a.$blastmode' --query '$b' -e $evalue --outfmt 6 --quiet $blastOptions $blastOptions_diamond $printSTDERR ".($keep ? " | tee '../$bla.tmp'" : "");
+      #  @data=`(cd '$WD' && $command)`;
+      #  if ($? != 0) {&TypicalBlastError($blastmode,$plain_command,$?,$!);}
+      #}
+      if($debug eq "test_sort"){while (<"$tmp_path/$bla.tmp">){if ($_ =~ /[^\t]+([,])[^\t]+[eE]/) {print "found forbidden symbol '$1' at $_ in $tmp_path/$bla\n";&reset_locale();die;}}}
+      system("mv '$tmp_path/$bla.tmp' '$tmp_path/$bla'");
 
     }elsif ($blastmode eq "rapsearch") {
 
       system("$command");
-      if ($? != 0) {&TypicalBlastError($blastmode,$command);}
+      if ($? != 0) {&TypicalBlastError($blastmode,$plain_command,$?,$!);}
       # NOTE -s f does not work everytime, therefore here is a conversion of -s t to f (ln -> evalues)
-      open(OUTM,">>$bla.m82");open(INM,"<$bla.tmp.m8");
+      open(OUTM,">>$tmp_path/$bla.m82");open(INM,"<$tmp_path/$bla.tmp.m8");
       while (<INM>){if(length($_)==0 || substr($_,0,1)eq"#"){next;}my @arr=split("\t",$_); print OUTM $arr[0]."\t".$arr[1]."\t".$arr[2]."\t".$arr[3]."\t".$arr[4]."\t".$arr[5]."\t".$arr[6]."\t".$arr[7]."\t".$arr[8]."\t".$arr[9]."\t".exp($arr[10])."\t".$arr[11];}
       close(OUTM);close(INM);
-      if($debug eq "test_sort"){while (<"$bla.tmp.m8">){if ($_ =~ /[^\t]+([,])[^\t]+[eE]/) {print "found forbidden symbol '$1' at $_ in $bla.m8\n";&reset_locale();die;}}}
-      unlink "$bla.aln"; #remove aln file
-      system("tail -n +6 '$bla.m82' >'$bla'"); # remove head/comment lines of rapsearch
-      unlink '$bla.m82';
-      unlink '$bla.tmp.m8';
+      if($debug eq "test_sort"){while (<"$tmp_path/$bla.tmp.m8">){if ($_ =~ /[^\t]+([,])[^\t]+[eE]/) {print "found forbidden symbol '$1' at $_ in $tmp_path/$bla.m8\n";&reset_locale();die;}}}
+      unlink "$tmp_path/$bla.aln"; #remove aln file
+      system("tail -n +6 '$tmp_path/$bla.m82' >'$tmp_path/$bla'"); # remove head/comment lines of rapsearch
+      unlink "$tmp_path/$bla.m82";
+      unlink "$tmp_path/$bla.tmp.m8";
 
     }elsif ($blastmode eq "usearch" || $blastmode eq "ublast") {
 
       system("$command");
-      if ($? != 0) {&TypicalBlastError($blastmode,$command);}
-      if($debug eq "test_sort"){while (<"$bla.tmp">){if ($_ =~ /[^\t]+([,])[^\t]+[eE]/) {print "found forbidden symbol '$1' at $_ in $bla\n";&reset_locale();die;}}}
-      system("perl $po_path/proteinortho_formatUsearch.pl '$bla.tmp' >'$bla'"); # problem with ublast/usearch: gene names include the description.
-      unlink "$bla.tmp";
+      if ($? != 0) {&TypicalBlastError($blastmode,$plain_command,$?,$!);}
+      if($debug eq "test_sort"){while (<"$tmp_path/$bla.tmp">){if ($_ =~ /[^\t]+([,])[^\t]+[eE]/) {print "found forbidden symbol '$1' at $_ in $tmp_path/$bla\n";&reset_locale();die;}}}
+      system("perl $po_path/proteinortho_formatUsearch.pl '$tmp_path/$bla.tmp' >'$tmp_path/$bla'"); # problem with ublast/usearch: gene names include the description.
+      unlink "$tmp_path/$bla.tmp";
 
     }elsif ($blastmode eq "topaz") {
 
       system("$command");
-      if ($? != 0) {&TypicalBlastError($blastmode,$command);}
-      if($debug eq "test_sort"){while (<"$bla.tmp">){if ($_ =~ /[^\t]+([,])[^\t]+[eE]/) {print "found forbidden symbol '$1' at $_ in $bla\n";&reset_locale();die;}}}
-      system("tail -n +2 '$bla.tmp' > '$bla'");
+      if ($? != 0) {&TypicalBlastError($blastmode,$plain_command,$?,$!);}
+      if($debug eq "test_sort"){while (<"$tmp_path/$bla.tmp">){if ($_ =~ /[^\t]+([,])[^\t]+[eE]/) {print "found forbidden symbol '$1' at $_ in $tmp_path/$bla\n";&reset_locale();die;}}}
+      system("tail -n +2 '$tmp_path/$bla.tmp' > '$tmp_path/$bla'");
       unlink "timing.txt";
-      unlink "$bla.tmp";
+      unlink "$tmp_path/$bla.tmp";
 
     }elsif ($blastmode eq "lastp" || $blastmode eq "lastn") {
 
       system("$command");
-      if ($? != 0) {&TypicalBlastError($blastmode,$command);}
-      if($debug eq "test_sort"){while (<"$bla.tmp">){if ($_ =~ /[^\t]+([,])[^\t]+[eE]/) {print "found forbidden symbol '$1' at $_ in $bla\n";&reset_locale();die;}}}
-      system("mv '$bla.tmp' '$bla'");
+      if ($? != 0) {&TypicalBlastError($blastmode,$plain_command,$?,$!);}
+      if($debug eq "test_sort"){while (<"$tmp_path/$bla.tmp">){if ($_ =~ /[^\t]+([,])[^\t]+[eE]/) {print "found forbidden symbol '$1' at $_ in $tmp_path/$bla\n";&reset_locale();die;}}}
+      system("mv '$tmp_path/$bla.tmp' '$tmp_path/$bla'");
 
     }elsif ($blastmode =~ m/.*blat.*/) {
 
       system("$command");
-      if ($? != 0) {&TypicalBlastError($blastmode,$command);}
-      if($debug eq "test_sort"){while (<"$bla.tmp">){if ($_ =~ /[^\t]+([,])[^\t]+[eE]/) {print "found forbidden symbol '$1' at $_ in $bla\n";&reset_locale();die;}}}
-      system('awk -F\'\t\' \'{if($11<'.$evalue.')print $0}\' \''.$bla.'.tmp\' > \''.$bla.'\'');
-      unlink "$bla.tmp";
+      if ($? != 0) {&TypicalBlastError($blastmode,$plain_command,$?,$!);}
+      if($debug eq "test_sort"){while (<"$tmp_path/$bla.tmp">){if ($_ =~ /[^\t]+([,])[^\t]+[eE]/) {print "found forbidden symbol '$1' at $_ in $tmp_path/$bla\n";&reset_locale();die;}}}
+      system('awk -F\'\t\' \'{if($11<'.$evalue.')print $0}\' \''."$tmp_path/$bla".'.tmp\' > \''."$tmp_path/$bla".'\'');
+      unlink "$tmp_path/$bla.tmp";
     
     }elsif ($blastmode eq "mmseqsn" || $blastmode eq "mmseqsp") {
 
       system("$command");
-      if ($? != 0) {&TypicalBlastError($blastmode,$command);}
-      system($binpath."mmseqs convertalis '$b.$blastmode' '$a.$blastmode' '$bla.tmp' '$bla.tmp2' >\/dev\/null 2>\/dev\/null");
+      if ($? != 0) {&TypicalBlastError($blastmode,$plain_command,$?,$!);}
+
+      if($threads_per_process>1){
+        system("mv '$tmp_path/${bla}'.tmp.index '$tmp_path/${bla}'.tmp-index");
+        system("mv '$tmp_path/${bla}'.tmp.dbtype '$tmp_path/${bla}'.tmp-dbtype");
+        system("if [ ! -e '$tmp_path/$bla' ]; then cat '$tmp_path/$bla'.tmp.* > '$tmp_path/$bla'.tmp; fi");
+        system("mv '$tmp_path/${bla}'.tmp-index '$tmp_path/${bla}'.tmp.index");
+        system("mv '$tmp_path/${bla}'.tmp-dbtype '$tmp_path/${bla}'.tmp.dbtype");
+      }
+      
+      system($binpath."mmseqs convertalis '$b.$blastmode' '$a.$blastmode' '$tmp_path/$bla.tmp' '$tmp_path/$bla.tmp2' >\/dev\/null 2>\/dev\/null");
 
-      if($debug eq "test_sort"){while (<"$bla.tmp">){if ($_ =~ /[^\t]+([,])[^\t]+[eE]/) {print "found forbidden symbol '$1' at $_ in $bla.tmp\n";&reset_locale();die;}}}
-      system("rm '$bla'.*index*");
-      system('sed -re \'s/([0-9\.]+)(E)(\-|\+)([0-9\.]+)/\1e\3\4/g\' \''.$bla.'.tmp2\' >\''.$bla.'\' 2>\/dev\/null');
-      unlink "$bla.dbtype";
-      unlink "$bla.tmp2";
-      unlink "$bla.index";
-      unlink "$bla.tmp";
+      if($debug eq "test_sort"){while (<"$tmp_path/$bla.tmp">){if ($_ =~ /[^\t]+([,])[^\t]+[eE]/) {print "found forbidden symbol '$1' at $_ in $tmp_path/$bla.tmp\n";&reset_locale();die;}}}
+      system('sed -re \'s/([0-9\.]+)(E)(\-|\+)([0-9\.]+)/\1e\3\4/g\' \''."$tmp_path/$bla".'.tmp2\' >\''."$tmp_path/$bla".'\' 2>\/dev\/null');
 
     }else { # -p=blastp,blastn,autoblast, ...
 
-      @data=`$command`;
-      if ($? != 0) {&TypicalBlastError($blastmode,$command);}
-      if($debug eq "test_sort"){while (<"$bla.tmp">){if ($_ =~ /[^\t]+([,])[^\t]+[eE]/) {print "found forbidden symbol '$1' at $_ in $bla.tmp\n";&reset_locale();die;}}}
-      if($keep){system("mv '$bla.tmp' '$bla'");}
+      # @data=`$command`;
+      system("$command");
+      if ($? != 0) {&TypicalBlastError($blastmode,$plain_command,$?,$!);}
+      if($debug eq "test_sort"){while (<"$tmp_path/$bla.tmp">){if ($_ =~ /[^\t]+([,])[^\t]+[eE]/) {print "found forbidden symbol '$1' at $_ in $tmp_path/$bla.tmp\n";&reset_locale();die;}}}
+      system("mv '$tmp_path/$bla.tmp' '$tmp_path/$bla'");
     }
 
 ##MARK_FOR_NEW_BLAST_ALGORITHM
 
-    if(-e $bla && scalar @data == 0){
-      @data = &readFile($bla);
-      if(!$keep){unlink $bla}
+    if(-e "$tmp_path/$bla" && scalar @data == 0){
+      @data = &readFile("$tmp_path/$bla");
+      if(!$keep){unlink "$tmp_path/$bla"}
     }
 
   }else{
-    if ($verbose) {print STDERR "\nNote: '$bla' exists, using pre-calculated data\n";}
+    if ($verbose) {print STDERR "\nNote: '$tmp_path/$bla' exists, using pre-calculated data\n";}
 
-    @data = &readFile($bla);
-    if(!$keep){unlink $bla}
+    @data = &readFile("$tmp_path/$bla");
+    if(!$keep){unlink "$tmp_path/$bla"}
   }
 
   if($selfblast){
@@ -2783,6 +2901,13 @@ sub readFile {
   return @data;
 }
 
+sub failed_to_detected_msg{
+  print STDERR ("\n!!!$ORANGE\n[WARNING]$NC Failed to detect '$blastmode'.$NC\nPlease install $blastmode in $binpath (or specify another binpath with -binpath=/home/...)\nI try now -p=blastp+ as fallback.\n");
+  print STDERR "\nPress 'strg+c' to prevent me from proceeding or wait 10 seconds to continue...\n!!!\n";
+   sleep 10;
+  print STDERR "\nWell then, proceeding...\n\n";
+}
+
 sub check_bins {
 
   if($blastmode eq "blast"){print STDERR "$ORANGE\n[WARNING]$NC There is no blastmode -p=$blastmode. I will choose the protein search blastp+ and continue...$NC"; $blastmode="blastp+";}
@@ -2862,10 +2987,7 @@ sub check_bins {
         return;
       }
     }
-    print STDERR ("\n!!!$ORANGE\n[WARNING]$NC Failed to detect '$blastmode'.$NC\nPlease install $blastmode in $binpath (or specify another binpath with -binpath=/home/...)\nI try now -p=blastp+ as fallback.\n");
-    print STDERR "\nPress 'strg+c' to prevent me from proceeding or wait 10 seconds to continue...\n!!!\n";
-     sleep 10;
-    print STDERR "\nWell then, proceeding...\n\n";
+    &failed_to_detected_msg();
     $blastmode="blastp+";
     goto RESTART;
     &Error("Failed to detect '$blastmode'! Tried to call '$binpath/$blastmode'\nPlease install $blastmode in $binpath (or specify another binpath with -binpath=/home/...).");
@@ -2879,10 +3001,7 @@ sub check_bins {
       $makedb="";
       return;
     }else{
-      print STDERR ("\n!!!$ORANGE\n[WARNING]$NC Failed to detect '$blastmode'.$NC\nPlease install $blastmode in $binpath (or specify another binpath with -binpath=/home/...)\nI try now -p=blastp+ as fallback.\n");
-      print STDERR "\nPress 'strg+c' to prevent me from proceeding or wait 10 seconds to continue...\n!!!\n";
-       sleep 10;
-      print STDERR "\nWell then, proceeding...\n\n";
+      &failed_to_detected_msg();
       $blastmode="blastp+";
       goto RESTART;
     }
@@ -2911,10 +3030,7 @@ sub check_bins {
      if($verbose){print STDERR "Detected '$blastmode' version $versionnumber\n";$blastversion=$versionnumber;}
       return;
     }else{
-      print STDERR ("\n!!!$ORANGE\n[WARNING]$NC Failed to detect '$blastmode'.$NC\nPlease install $blastmode in $binpath (or specify another binpath with -binpath=/home/...)\nI try now -p=blastp+ as fallback.\n");
-      print STDERR "\nPress 'strg+c' to prevent me from proceeding or wait 10 seconds to continue...\n!!!\n";
-       sleep 10;
-      print STDERR "\nWell then, proceeding...\n\n";
+      &failed_to_detected_msg();
       $blastmode="blastp+";
       goto RESTART;
     }
@@ -2947,7 +3063,7 @@ sub check_bins {
     if (defined($out) && $out =~ /diamond\sversion\s(.+)\n/) {
       my @version = split(/\s+/,$1);
       my $versionnumber = pop @version;
-      $makedb = $binpath."diamond makedb $makeBlastOptions --in";
+      $makedb = $binpath."diamond makedb -p $cpus $makeBlastOptions --in";
       if($verbose){print STDERR "Detected '$blastmode' version $versionnumber\n";$blastversion=$versionnumber;}
       if($versionnumber =~ m/^0\.9\.(\d+)/){ if($1 < 29){
         print STDERR "\n!!!! \nWARNING '$blastmode' version $versionnumber has a known bug that incorrectly computes the length of an alignment, thus the coverage threshold can produce wrong results leading in false negatives. See https://gitlab.com/paulklemm_PHD/proteinortho/issues/24 for more details.\n\n >>> Please update diamond to 0.9.29 or higher <<<\n";
@@ -2956,10 +3072,7 @@ sub check_bins {
         print STDERR "\nWell then, proceeding...\n\n";} }
       return;
     }else{
-      print STDERR ("\n!!!$ORANGE\n[WARNING]$NC Failed to detect '$blastmode'.$NC\nPlease install $blastmode in $binpath (or specify another binpath with -binpath=/home/...)\nI try now -p=blastp+ as fallback.\n");
-      print STDERR "\nPress 'strg+c' to prevent me from proceeding or wait 10 seconds to continue...\n!!!\n";
-       sleep 10;
-      print STDERR "\nWell then, proceeding...\n\n";
+      &failed_to_detected_msg();
       $blastmode="blastp+";
       goto RESTART;
     }
@@ -2975,10 +3088,7 @@ sub check_bins {
       if($verbose){print STDERR "Detected '$blastmode' version $versionnumber\n";$blastversion=$versionnumber;}
       return;
     }else{
-      print STDERR ("\n!!!$ORANGE\n[WARNING]$NC Failed to detect '$blastmode'.$NC\nPlease install $blastmode in $binpath (or specify another binpath with -binpath=/home/...)\nI try now -p=blastp+ as fallback.\n");
-      print STDERR "\nPress 'strg+c' to prevent me from proceeding or wait 10 seconds to continue...\n!!!\n";
-       sleep 10;
-      print STDERR "\nWell then, proceeding...\n\n";
+      &failed_to_detected_msg();
       $blastmode="blastp+";
       goto RESTART;
     }
@@ -2994,10 +3104,7 @@ sub check_bins {
       if($verbose){print STDERR "Detected '$blastmode' version $versionnumber\n";$blastversion=$versionnumber;}
       return;
     }else{
-      print STDERR ("\n!!!$ORANGE\n[WARNING]$NC Failed to detect '$blastmode'.$NC\nPlease install $blastmode in $binpath (or specify another binpath with -binpath=/home/...)\nI try now -p=blastp+ as fallback.\n");
-      print STDERR "\nPress 'strg+c' to prevent me from proceeding or wait 10 seconds to continue...\n!!!\n";
-       sleep 10;
-      print STDERR "\nWell then, proceeding...\n\n";
+      &failed_to_detected_msg();
       $blastmode="blastp+";
       goto RESTART;
     }
@@ -3013,10 +3120,7 @@ sub check_bins {
       if($verbose){print STDERR "Detected '$blastmode' version $versionnumber\n";$blastversion=$versionnumber;}
       return;
     }else{
-      print STDERR ("\n!!!$ORANGE\n[WARNING]$NC Failed to detect '$blastmode'.$NC\nPlease install $blastmode in $binpath (or specify another binpath with -binpath=/home/...)\nI try now -p=blastp+ as fallback.\n");
-      print STDERR "\nPress 'strg+c' to prevent me from proceeding or wait 10 seconds to continue...\n!!!\n";
-       sleep 10;
-      print STDERR "\nWell then, proceeding...\n\n";
+      &failed_to_detected_msg();
       $blastmode="blastp+";
       goto RESTART;
     }
@@ -3041,6 +3145,7 @@ sub check_files {
   @files = grep { !( $_ =~ /\.g[ft]f[34]?$/ ) } @files;
 
   foreach my $file (@files) {
+    if($debug){print STDERR "->$file\n";}
     if(-e $file && $file=~/\.gz$/){
       if($verbose){print STDERR "ERROR input '$file' seems to be compressed...\n";}
       exit 1;
@@ -3049,8 +3154,8 @@ sub check_files {
     }elsif(-e $file){
       $files_dedup{abs_path($file)}=$file;
       #if ($verbose) {print STDERR "Checking $file... ";}
-      #&read_details($file,0);
-      $check_files_QUEUE->enqueue($file);
+      &read_details($file,0);
+      #$check_files_QUEUE->enqueue($file);
       #if ($blastmode eq "autoblast") {print STDERR " (".$autoblast_fileis{$file}.") ";}
       #if ($verbose) {print STDERR "ok\n";}
     }elsif(!-e $file){
@@ -3063,13 +3168,6 @@ sub check_files {
 
   if($debug){print STDERR "Checking input files queue:".($check_files_QUEUE->pending())."\n"}
 
-  $do_log=1;
-  $_->kill('KILL')->detach foreach threads->list(); 
-  threads->create('log_Worker');
-  if($debug){print STDERR "Creating ".(scalar @files > $cpus+10 || $debug ? $cpus : 1)." check_files_Worker (cpus=$cpus)\n"}
-  for (my $i = 0; $i < ((scalar @files > 500 && scalar @files > $cpus+10) || $debug ? $cpus : 1) ; $i++) { my $tmp = threads->create('check_files_Worker') } 
-  $_->join() foreach threads->list(); 
-
   my $all_only_one_entry=1;
   foreach my $file (@files) {
     if($gene_counter{$file}>1){$all_only_one_entry=0;}
@@ -3114,6 +3212,8 @@ sub convertUniprotAndNCBI {
 
 sub read_details {
   my $file = shift;
+  if($debug){print STDERR "read_details($file)\n"}
+  my $abs_path_file=abs_path "$file "; $abs_path_file=~s/ $//g;
   my $from_synteny_call = shift;
   my %ids;        # local test for duplicated IDs
   my %genes;
@@ -3135,9 +3235,9 @@ sub read_details {
   if (!-e $file)    {&Error("File '$file' not found!");}
   open(FASTA,"<$file") || &Error("Could not open '$file': $!");
   while (<FASTA>) {
-    my $curLine=$_;
-    if ($curLine =~ /^[^a-z#>]/i){$msg.="\nERROR found line with forbidden symbols in '$file':\n$_\n"; exit 1}
-    if ($curLine =~/^$/){$did_found_emptyline=1;}
+    my $curLine=$_;$curLine=~s/\r//g;
+    if ($curLine =~/^ *$/){$did_found_emptyline=1;}
+    elsif ($curLine =~ /^[^a-z#>]/i){&Error("\nERROR found line with forbidden symbols in '$file':\n$_\n"); exit 1}
     if ($curLine =~ />/) { #head line -> gene name and description
       if($isoform eq "uniprot"){
         lock(%isoform_mapping);
@@ -3189,11 +3289,11 @@ sub read_details {
       if(!$force && $checkfasta && exists($allowedAlphabet->{$blastmode}) && ($genelength>50 && ( ($allowedAlphabet->{$blastmode} eq "n" && $ATCGNoccurences/$genelength < 0.5) || ($allowedAlphabet->{$blastmode} eq "a" && $ATCGNoccurences/$genelength > 0.8)))){$cur_gene_is_valid= -1;last;}
 
       if( $blastmode eq "autoblast" && 
-        !exists $autoblast_fileis{$file} && $genelength>50 ){
+        !exists $autoblast_fileis{$abs_path_file} && $genelength>50 ){
         if( $genelength>50 && $ATCGNoccurences/$genelength < 0.5 ){
-          $autoblast_fileis{$file} = "prot";
+          $autoblast_fileis{$abs_path_file} = "prot";
         }elsif( $ATCGNoccurences/$genelength > 0.8 ){
-          $autoblast_fileis{$file} = "nucl";
+          $autoblast_fileis{$abs_path_file} = "nucl";
         }
       }
       
@@ -3210,7 +3310,7 @@ sub read_details {
         $ids{$curLine} = $file;
       }
       if(index($curLine, ",") != -1 && $file=~/_clean/){
-        $msg.="\n$ORANGE [ERROR]$NC input '$file' contains a gene-name with a comma, this causes problems with the proteinortho.tsv output, I cannot clean the file, is the input gzipped?\nThe line is:\n$curLine\n";
+        &Error("\n$ORANGE [ERROR]$NC input '$file' contains a gene-name with a comma, this causes problems with the proteinortho.tsv output, I cannot clean the file, is the input gzipped?\nThe line is:\n$curLine\n");
         exit 1;
       }elsif(index($curLine, ",") != -1){
         # 6.0.32 : check if gene name contains a comma -> this will cause problems with the proteinortho.tsv output (gene cluster speparator)
@@ -3277,7 +3377,7 @@ sub read_details {
     close(FASTA_CLEAN);
     if($step==2){
       $msg.=("\n$ORANGE [WARNING]$NC Restarting the indices generation.$NC");
-      $msg.= &generate_indices($file_clean);
+      &generate_indices($file_clean);
     }
     $_ eq $file and $_ = $file_clean for @files;
     $_ eq $file and $_ = $file_clean for @files_cleanup;
@@ -3295,21 +3395,21 @@ sub read_details {
     }
   }
   if($blastmode eq "autoblast" && 
-    !exists $autoblast_fileis{$file} && $genelength>50 ){
+    !exists $autoblast_fileis{$abs_path_file} && $genelength>50 ){
     if( $genelength>50 && $ATCGNoccurences/$genelength < 0.5 ){
-      $autoblast_fileis{$file} = "prot";
+      $autoblast_fileis{$abs_path_file} = "prot";
     }elsif( $ATCGNoccurences/$genelength > 0.8 ){
-      $autoblast_fileis{$file} = "nucl";
+      $autoblast_fileis{$abs_path_file} = "nucl";
     }
   }
 
   if($did_found_emptyline){
     $msg.=("$ORANGE [WARNING]$NC Found empty line in $file, removing it with perl -lne.$NC");
-    system('perl -lne \'if($_ ne ""){print "$_";}\' '.$file.' >'.$file.'.tmp');
+    system('perl -lne \'if($_ !~/^ *$/){print "$_";}\' '.$file.' >'.$file.'.tmp');
     system('mv '.$file.'.tmp '.$file);
     if($step==2){
       $msg.=("$ORANGE [WARNING]$NC Restarting the indices generation.$NC");
-      $msg.= &generate_indices($file);
+      &generate_indices($file);
     }
   }
 
@@ -3355,10 +3455,12 @@ sub read_details {
     }
   }
 
-  if ($blastmode eq "autoblast") {$msg .= " (".$autoblast_fileis{$file}.") ";}
+  if ($blastmode eq "autoblast") {$msg .= " (".$autoblast_fileis{$abs_path_file}.") ";}
   $msg.=" ok\n";
 
-  $log_QUEUE->enqueue($msg);
+  if($verbose){print STDERR $msg}
+  if($debug){print STDERR "read_details($file):done\n"}
+  #$log_QUEUE->enqueue($msg);
   unless ($synteny) {return;}
 
   my %coordinates;
@@ -3407,7 +3509,11 @@ sub read_details {
 sub TypicalBlastError {
   my $blastmode=shift;
   my $command=shift;
-  &Error($blastmode." failed with code $?$NC ($command). (Please visit https://gitlab.com/paulklemm_PHD/proteinortho/wikis/Error%20Code)\nThe most common sources of this error are:\n- no space left on device error.\n- outdated $blastmode, please update $blastmode or consider another -p algorithm.\n- the databases are missing. Maybe you ran --step=1 and moved the databases afterwards? Please rerun 'proteinortho --step=1 --force /path/to/fastas'\n- maybe the fasta files contain mixed nucleotide and aminoacid sequences or are not suited for $blastmode? (For example diamond only processes protein sequences) Try 'proteinortho --step=1 --check --force /path/to/fastas'.")
+  $command=~s/2>\/dev\/null//g;
+  my $retcode=shift;
+  my $retmsg=shift;
+  &Error("$NC".$blastmode." failed\n${ORANGE}return error code$NC = $retcode\n${ORANGE}full error command$NC = $command\n${ORANGE}return error message$NC = $retmsg")
+  # The most common sources of this error are:\n- no space left on device error.\n- outdated $blastmode, please update $blastmode or consider another -p algorithm.\n- the databases are missing. Maybe you ran --step=1 and moved the databases afterwards? Please rerun 'proteinortho --step=1 --force /path/to/fastas'\n- maybe the fasta files contain mixed nucleotide and aminoacid sequences or are not suited for $blastmode? (For example diamond only processes protein sequences) Try 'proteinortho --step=1 --check --force /path/to/fastas'.
 }
 sub Error {
   $debug=1;
@@ -3464,76 +3570,76 @@ sub gff4fasta {
 }
 
 sub get_po_path {
-  my @tmppath = fileparse($0); # path to the C++-part of this program
+  my @fileparse0 = fileparse($0); # path to the C++-part of this program
 
   my $uname=`uname -s`;
   $uname=~s/[\r\n]+$//;
   $uname.="_".`uname -m`;
   $uname=~s/[\r\n]+$//;
 
-  if(!-x $tmppath[1]."/proteinortho_clustering"){
-    if(-x $tmppath[1]."/src/BUILD/$uname/proteinortho_clustering"){
-      $tmppath[1]=$tmppath[1]."/src/BUILD/$uname";
-      if($debug){print STDERR "Detected ".$tmppath[1]."\n";}
+  if(!-x $fileparse0[1]."/proteinortho_clustering"){
+    if(-x $fileparse0[1]."/src/BUILD/$uname/proteinortho_clustering"){
+      $fileparse0[1]=$fileparse0[1]."/src/BUILD/$uname";
+      if($debug){print STDERR "Detected ".$fileparse0[1]."\n";}
     }elsif(-x "/usr/bin/proteinortho_clustering"){
-      $tmppath[1]="/usr/bin/";
-      if($debug){print STDERR "Detected ".$tmppath[1]."\n";}
+      $fileparse0[1]="/usr/bin/";
+      if($debug){print STDERR "Detected ".$fileparse0[1]."\n";}
     }elsif(-x "/usr/local/bin/proteinortho_clustering"){
-      $tmppath[1]="/usr/local/bin/";
-      if($debug){print STDERR "Detected ".$tmppath[1]."\n";}
+      $fileparse0[1]="/usr/local/bin/";
+      if($debug){print STDERR "Detected ".$fileparse0[1]."\n";}
     }elsif(-x "$binpath/proteinortho_clustering"){
-      $tmppath[1]="$binpath/";
-      if($debug){print STDERR "Detected ".$tmppath[1]."\n";}
+      $fileparse0[1]="$binpath/";
+      if($debug){print STDERR "Detected ".$fileparse0[1]."\n";}
     }else{
       my $p=`which proteinortho_clustering`;
       $p=~s/^([^ ]+)\/proteinortho_clustering.*$/$1/;
       chomp($p);
-      $tmppath[1]=$p;
+      $fileparse0[1]=$p;
       if($debug){print STDERR "Detected (PATH enviroment variable)\n";}
     }
   }
 
-  if(!-x $tmppath[1]."/proteinortho_clustering"){
+  if(!-x $fileparse0[1]."/proteinortho_clustering"){
     &Error("cannot find 'proteinortho_clustering'$NC in: the current directory '.', ./src/, ./src/BUILD/$uname , /usr/bin, /usr/local/bin, \$PATH, -binpath=$binpath.\nPlease do one of the following:\n A. recompile proteinortho (with 'make clean', 'make' and 'make install' or 'make install PREFIX=...') or consider a installation with conda/brew (see the README for more informations)\n B. execute from within the downloaded directory, there are precompiled binaries for Linux_x86_64\n C. specify the path to the binaries with -binpath=...\n");
     exit 1;
   }
-  # if(!-x "$tmppath[1]/proteinortho_cleanupblastgraph"){
-  #   &Error("cannot find proteinortho_cleanupblastgraph in $tmppath[1].\nPlease do one of the following:\n A. recompile proteinortho (with 'make clean', 'make' and 'make install' or 'make install PREFIX=...') or consider a installation with conda/brew (see the README for more informations)\n B. execute from within the downloaded directory, there are precompiled binaries for Linux_x86_64\n C. specify the path to the binaries with -binpath=...\n";
+  # if(!-x "$fileparse0[1]/proteinortho_cleanupblastgraph"){
+  #   &Error("cannot find proteinortho_cleanupblastgraph in $fileparse0[1].\nPlease do one of the following:\n A. recompile proteinortho (with 'make clean', 'make' and 'make install' or 'make install PREFIX=...') or consider a installation with conda/brew (see the README for more informations)\n B. execute from within the downloaded directory, there are precompiled binaries for Linux_x86_64\n C. specify the path to the binaries with -binpath=...\n";
   #   exit 0;
   # }
-  # if(!-x "$tmppath[1]/po_tree"){
-  #   &Error("cannot find po_tree in $tmppath[1].\nPlease do one of the following:\n A. recompile proteinortho (with 'make clean', 'make' and 'make install' or 'make install PREFIX=...') or consider a installation with conda/brew (see the README for more informations)\n B. execute from within the downloaded directory, there are precompiled binaries for Linux_x86_64\n C. specify the path to the binaries with -binpath=...\n";
+  # if(!-x "$fileparse0[1]/po_tree"){
+  #   &Error("cannot find po_tree in $fileparse0[1].\nPlease do one of the following:\n A. recompile proteinortho (with 'make clean', 'make' and 'make install' or 'make install PREFIX=...') or consider a installation with conda/brew (see the README for more informations)\n B. execute from within the downloaded directory, there are precompiled binaries for Linux_x86_64\n C. specify the path to the binaries with -binpath=...\n";
   #   exit 0;
   # }
-  if(!-x $tmppath[1]."/proteinortho2html.pl"){
+  if(!-x $fileparse0[1]."/proteinortho2html.pl"){
    &Error("cannot find 'proteinortho2html.pl'$NC in: the current directory '.', ./src/, ./src/BUILD/$uname, /usr/bin, /usr/local/bin, \$PATH, -binpath=$binpath.\nPlease do one of the following:\n A. recompile proteinortho (with 'make clean', 'make' and 'make install' or 'make install PREFIX=...') or consider a installation with conda/brew (see the README for more informations)\n B. execute from within the downloaded directory, there are precompiled binaries for Linux_x86_64\n C. specify the path to the binaries with -binpath=...\n");
     exit 1;
   }
-  # if(!-x "$tmppath[1]/proteinortho2tree.pl"){
-  #   &Error("cannot find proteinortho2tree.pl in $tmppath[1].\nPlease do one of the following:\n A. recompile proteinortho (with 'make clean', 'make' and 'make install' or 'make install PREFIX=...') or consider a installation with conda/brew (see the README for more informations)\n B. execute from within the downloaded directory, there are precompiled binaries for Linux_x86_64\n C. specify the path to the binaries with -binpath=...\n";
+  # if(!-x "$fileparse0[1]/proteinortho2tree.pl"){
+  #   &Error("cannot find proteinortho2tree.pl in $fileparse0[1].\nPlease do one of the following:\n A. recompile proteinortho (with 'make clean', 'make' and 'make install' or 'make install PREFIX=...') or consider a installation with conda/brew (see the README for more informations)\n B. execute from within the downloaded directory, there are precompiled binaries for Linux_x86_64\n C. specify the path to the binaries with -binpath=...\n";
   #   exit 0;
   # }
-  if(!-x $tmppath[1]."/proteinortho_ffadj_mcs.py" && $synteny){
+  if(!-x $fileparse0[1]."/proteinortho_ffadj_mcs.py" && $synteny){
     &Error("cannot find 'proteinortho_ffadj_mcs.py'$NC in: the current directory '.', ./src/, ./src/BUILD/$uname, /usr/bin, /usr/local/bin, \$PATH, -binpath=$binpath.\nPlease do one of the following:\n A. recompile proteinortho (with 'make clean', 'make' and 'make install' or 'make install PREFIX=...') or consider a installation with conda/brew (see the README for more informations)\n B. execute from within the downloaded directory, there are precompiled binaries for Linux_x86_64\n C. specify the path to the binaries with -binpath=...\n");
     exit 1;
   }
-  if(!-x $tmppath[1]."/proteinortho_singletons.pl"){
+  if(!-x $fileparse0[1]."/proteinortho_singletons.pl"){
     &Error("cannot find 'proteinortho_singletons.pl'$NC in: the current directory '.', ./src/, ./src/BUILD/$uname, /usr/bin, /usr/local/bin, \$PATH, -binpath=$binpath.\nPlease do one of the following:\n A. recompile proteinortho (with 'make clean', 'make' and 'make install' or 'make install PREFIX=...') or consider a installation with conda/brew (see the README for more informations)\n B. execute from within the downloaded directory, there are precompiled binaries for Linux_x86_64\n C. specify the path to the binaries with -binpath=...\n");
     exit 1;
   }
-  if(!-x $tmppath[1]."/proteinortho_graphMinusRemovegraph"){
+  if(!-x $fileparse0[1]."/proteinortho_graphMinusRemovegraph"){
     &Error("cannot find 'proteinortho_graphMinusRemovegraph'$NC in: the current directory '.', ./src/, ./src/BUILD/$uname, /usr/bin, /usr/local/bin, \$PATH, -binpath=$binpath.\nPlease do one of the following:\n A. recompile proteinortho (with 'make clean', 'make' and 'make install' or 'make install PREFIX=...') or consider a installation with conda/brew (see the README for more informations)\n B. execute from within the downloaded directory, there are precompiled binaries for Linux_x86_64\n C. specify the path to the binaries with -binpath=...\n");
     exit 1;
   }
-  if(!-x $tmppath[1]."/proteinortho_do_mcl.pl"){
+  if(!-x $fileparse0[1]."/proteinortho_do_mcl.pl"){
     &Error("cannot find 'proteinortho_do_mcl.pl'$NC in: the current directory '.', ./src/, ./src/BUILD/$uname, /usr/bin, /usr/local/bin, \$PATH, -binpath=$binpath.\nPlease do one of the following:\n A. recompile proteinortho (with 'make clean', 'make' and 'make install' or 'make install PREFIX=...') or consider a installation with conda/brew (see the README for more informations)\n B. execute from within the downloaded directory, there are precompiled binaries for Linux_x86_64\n C. specify the path to the binaries with -binpath=...\n");
       exit 1;
   }
-  if(!-x $tmppath[1]."/proteinortho2xml.pl"){
+  if(!-x $fileparse0[1]."/proteinortho2xml.pl"){
     &Error("cannot find 'proteinortho2xml.pl'$NC in: the current directory '.', ./src/, ./src/BUILD/$uname, /usr/bin, /usr/local/bin, \$PATH, -binpath=$binpath.\nPlease do one of the following:\n A. recompile proteinortho (with 'make clean', 'make' and 'make install' or 'make install PREFIX=...') or consider a installation with conda/brew (see the README for more informations)\n B. execute from within the downloaded directory, there are precompiled binaries for Linux_x86_64\n C. specify the path to the binaries with -binpath=...\n");
     exit 1;
   }
-  return $tmppath[1];
+  return $fileparse0[1];
 }
 
 sub edgeweight {


=====================================
src/.gitlab-ci.yml
=====================================
@@ -5,13 +5,14 @@ before_script:
   - alias python=python3
   - DEBIAN_FRONTEND=noninteractive apt-get update && DEBIAN_FRONTEND=noninteractive apt-get -y install cmake diffutils wget ncbi-blast+ time git python3 libblas-dev liblapack-dev
 stages:
-  - test
-#  - test-precompiled-bins
-#  - recompile-and-test
+#  - test
+  - test-precompiled-bins
+  - recompile-and-test
  
 gcc-latest-manyoptions-together:
+  retry: 2
   image: gcc
-  stage: test
+  stage: test-precompiled-bins
   script:
   - echo "installing diamond"
   - wget http://github.com/bbuchfink/diamond/releases/download/v2.0.6/diamond-linux64.tar.gz 2>/dev/null
@@ -24,8 +25,9 @@ gcc-latest-manyoptions-together:
   - rm *descriptions *html *xml; 
 
 gcc-latest-someoptions-one-by-one:
+  retry: 2
   image: gcc
-  stage: test
+  stage: test-precompiled-bins
   script:
   - echo "installing diamond"
   - wget http://github.com/bbuchfink/diamond/releases/download/v2.0.6/diamond-linux64.tar.gz 2>/dev/null
@@ -65,8 +67,9 @@ gcc-latest-someoptions-one-by-one:
   - for f in myproject.*; do if [ "$(grep -v '#' $f | wc -l | awk '{print $1}' | tr -d '\n')" -lt 1 ]; then echo "$f failed size check, output should be more than 10 lines..."; exit 1; fi; done
  
 gcc-latest-all-p:
+  retry: 2
   image: gcc
-  stage: test
+  stage: recompile-and-test
   script:
   - export CWD=$(pwd)
   - echo "installing last"
@@ -92,8 +95,9 @@ gcc-latest-all-p:
   - for f in myproject.*; do if [ "$(grep -v '#' $f | wc -l | awk '{print $1}' | tr -d '\n')" -lt 10 ]; then echo "$f failed size check, output should be more than 10 lines..."; exit 1; fi; done
 
 gcc-latest-diamond:
+  retry: 2
   image: gcc
-  stage: test
+  stage: test-precompiled-bins
   script:
   - echo "installing diamond"
   - wget http://github.com/bbuchfink/diamond/releases/download/v2.0.6/diamond-linux64.tar.gz 2>/dev/null
@@ -107,27 +111,10 @@ gcc-latest-diamond:
   - rm *info; 
   - for f in myproject.*; do if [ "$(grep -v '#' $f | wc -l | awk '{print $1}' | tr -d '\n')" -lt 10 ]; then echo "$f failed size check, output should be more than 10 lines..."; exit 1; fi; done
 
-#nolapack-gcc-latest:
-#  image: gcc
-#  stage: test
-#  script:
-#  - echo "installing diamond"
-#  - wget http://github.com/bbuchfink/diamond/releases/download/v2.0.6/diamond-linux64.tar.gz 2>/dev/null
-#  - tar xzf diamond-linux64.tar.gz
-#  - cp diamond $HOME
-#  - export PATH="$PATH:$HOME"
-#  - echo "start proteinortho tests"
-#  - gcc --version
-#  - make clean
-#  - make LAPACK=FALSE
-#  - make test
-#  - perl proteinortho*pl test/*faa || exit 1
-#  - rm *info; 
-#  - for f in myproject.*; do if [ "$(grep -v '#' $f | wc -l | awk '{print $1}' | tr -d '\n')" -lt 10 ]; then echo "$f failed size check, output should be more than 10 lines..."; exit 1; fi; done
-
-#gcc5:
-#  image: gcc:5
-#  stage: test
+#gcc6:
+#  retry: 2
+#  image: gcc:6
+#  stage: recompile-and-test
 #  script:
 #  - echo "installing topaz"
 #  - git clone https://github.com/ajm/topaz
@@ -146,9 +133,10 @@ gcc-latest-diamond:
 #  - make all
 #  - make test
  
-ubuntu-latest0:
+ubuntu-latest-precompiled:
+  retry: 2
   image: ubuntu
-  stage: test
+  stage: test-precompiled-bins
   script:  
   - DEBIAN_FRONTEND=noninteractive apt-get -y update && DEBIAN_FRONTEND=noninteractive apt-get -y install python2.7-minimal gcc gfortran build-essential g++ python3
   - echo "installing topaz"
@@ -169,8 +157,9 @@ ubuntu-latest0:
   - for f in myproject.*; do if [ "$(grep -v '#' $f | wc -l | awk '{print $1}' | tr -d '\n')" -lt 10 ]; then echo "$f failed size check, output should be more than 10 lines..."; exit 1; fi; done
 
 ubuntu-latest:
+  retry: 2
   image: ubuntu
-  stage: test
+  stage: recompile-and-test
   script:
   - apt-get -y update && apt-get -y install gcc && apt-get -y install gfortran && apt-get -y install build-essential g++ && apt-get -y install python3
   - echo "installing topaz"
@@ -193,8 +182,9 @@ ubuntu-latest:
   - for f in myproject.*; do if [ "$(grep -v '#' $f | wc -l | awk '{print $1}' | tr -d '\n')" -lt 10 ]; then echo "$f failed size check, output should be more than 10 lines..."; exit 1; fi; done
 
 debian-latest:
+  retry: 2
   image: debian
-  stage: test
+  stage: recompile-and-test
   script:
   - apt-get -y update && apt-get -y install gcc && apt-get -y install gfortran && apt-get -y install build-essential g++ && apt-get -y install python3
   - echo "installing topaz"
@@ -216,64 +206,67 @@ debian-latest:
   - rm *info; 
   - for f in myproject.*; do if [ "$(grep -v '#' $f | wc -l | awk '{print $1}' | tr -d '\n')" -lt 10 ]; then echo "$f failed size check, output should be more than 10 lines..."; exit 1; fi; done
 
-#fedora-latest:
-#  image: fedora
-#  stage: test
-#  script:
-#  - yum -y groupinstall "Development Tools" 
-#  - yum -y install gcc-c++
-#  - yum -y install cmake
-#  - yum -y install make
-#  - yum -y install tar
-#  - yum -y install which
-#  - yum -y install wget
-#  - yum -y install libstdc++-static
-#  - yum -y install lapack-static
-#  - yum -y install cpan
-#  - yum -y install python
-#  - yum -y install ncbi-blast+
-#  - cpan Thread::Queue
-#  - wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast*-x64-linux.tar.gz 2>/dev/null
-#  - tar -xzvf ncbi-blast*-x64-linux.tar.gz
-#  - cp ncbi-blast*/bin/blastp $HOME
-#  - cp ncbi-blast*/bin/makeblastdb $HOME
-#  - echo "installing diamond"
-#  - wget http://github.com/bbuchfink/diamond/releases/download/v2.0.6/diamond-linux64.tar.gz 2>/dev/null
-#  - tar xzf diamond-linux64.tar.gz
-#  - cp diamond $HOME
-#  - export PATH="$PATH:$HOME"
-#  - echo "start proteinortho tests"
+# fedora-latest:
+#   retry: 2
+#   image: fedora
+#   stage: test
+#   script:
+#   - yum -y groupinstall "Development Tools" 
+#   - yum -y install gcc-c++
+#   - yum -y install cmake
+#   - yum -y install make
+#   - yum -y install tar
+#   - yum -y install which
+#   - yum -y install wget
+#   - yum -y install libstdc++-static
+#   - yum -y install lapack-static
+#   - yum -y install cpan
+#   - yum -y install python
+#   - yum -y install ncbi-blast+
+#   - cpan Thread::Queue
+#   - wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast*-x64-linux.tar.gz 2>/dev/null
+#   - tar -xzvf ncbi-blast*-x64-linux.tar.gz
+#   - cp ncbi-blast*/bin/blastp $HOME
+#   - cp ncbi-blast*/bin/makeblastdb $HOME
+#   - echo "installing diamond"
+#   - wget http://github.com/bbuchfink/diamond/releases/download/v2.0.6/diamond-linux64.tar.gz 2>/dev/null
+#   - tar xzf diamond-linux64.tar.gz
+#   - cp diamond $HOME
+#   - export PATH="$PATH:$HOME"
+#   - echo "start proteinortho tests"
 
-#centos-latest:
-#  image: centos
-#  stage: test
-#  script:
-#  - yum -y groupinstall "Development Tools" 
-#  - yum -y install gcc-c++
-#  - yum -y install cmake
-#  - yum -y install make
-#  - yum -y install tar
-#  - yum -y install which
-#  - yum -y install wget
-#  - yum -y install gcc-gfortran python3 atlas atlas-devel lapack blas
-#  - wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast*-x64-linux.tar.gz 2>/dev/null
-#  - tar -xzvf ncbi-blast*-x64-linux.tar.gz
-#  - cp ncbi-blast*/bin/blastp $HOME
-#  - cp ncbi-blast*/bin/makeblastdb $HOME
-#  - echo "installing diamond"
-#  - wget http://github.com/bbuchfink/diamond/releases/download/v2.0.6/diamond-linux64.tar.gz 2>/dev/null
-#  - tar xzf diamond-linux64.tar.gz
-#  - cp diamond $HOME
-#  - export PATH="$PATH:$HOME"
-#  - echo "start proteinortho tests"
-#  - make clean
-#  - make
-#  - make test
-#  - perl proteinortho*pl test/*faa || exit 1
-#  - rm *info; 
-#  - for f in myproject.*; do if [ "$(grep -v '#' $f | wc -l | awk '{print $1}' | tr -d '\n')" -lt 10 ]; then echo "$f failed size check, output should be more than 10 lines..."; exit 1; fi; done
+# centos-latest:
+#   retry: 2
+#   image: centos
+#   stage: test
+#   script:
+#   - yum -y groupinstall "Development Tools" 
+#   - yum -y install gcc-c++
+#   - yum -y install cmake
+#   - yum -y install make
+#   - yum -y install tar
+#   - yum -y install which
+#   - yum -y install wget
+#   - yum -y install gcc-gfortran python3 atlas atlas-devel lapack blas
+#   - wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast*-x64-linux.tar.gz 2>/dev/null
+#   - tar -xzvf ncbi-blast*-x64-linux.tar.gz
+#   - cp ncbi-blast*/bin/blastp $HOME
+#   - cp ncbi-blast*/bin/makeblastdb $HOME
+#   - echo "installing diamond"
+#   - wget http://github.com/bbuchfink/diamond/releases/download/v2.0.6/diamond-linux64.tar.gz 2>/dev/null
+#   - tar xzf diamond-linux64.tar.gz
+#   - cp diamond $HOME
+#   - export PATH="$PATH:$HOME"
+#   - echo "start proteinortho tests"
+#   - make clean
+#   - make
+#   - make test
+#   - perl proteinortho*pl test/*faa || exit 1
+#   - rm *info; 
+#   - for f in myproject.*; do if [ "$(grep -v '#' $f | wc -l | awk '{print $1}' | tr -d '\n')" -lt 10 ]; then echo "$f failed size check, output should be more than 10 lines..."; exit 1; fi; done
  
 #code_quality:
+#  retry: 2  
 #  image: docker:stable
 #  stage: codequality
 #  variables:


=====================================
src/proteinortho_clustering.cpp
=====================================
@@ -65,13 +65,14 @@ extern "C" {
 				int* lwork, int* iwork, int* liwork, int* info );
 }
 
-struct wedge {unsigned int edge; unsigned short weight;};
-struct protein {vector<wedge> edges; unsigned int species_id; string full_name;};
+struct wedge {unsigned int edge; unsigned short weight=0;};
+struct protein {vector<wedge> edges; unsigned int species_id; string full_name=""; bool is_done=false;};
 
 // Functions          
 float string2float(string);
 void tokenize(const string& , vector<string>& , const string&);
 void parse_file(string);
+void parse_abc_file(string);
 // void remove_edge_index(const unsigned int, const unsigned int);
 float getConnectivity_float(vector<unsigned int>*,bool,vector<float>*);
 double getConnectivity_double(vector<unsigned int>*,bool,vector<double>*);
@@ -82,10 +83,12 @@ void stats(unsigned int,unsigned int,bool);
 string getTime(void);
 	bool param_verbose 		= true;
 	bool param_core 		  = false;
+	bool param_abc 		  = false;
 	float param_con_threshold 	= 0.1;		// as a reference: a chain a-b-c-d has 0.25
 	unsigned int debug_level	= 0;
 	float param_sep_purity 	= -1;		// as a reference: a-b-c will give +/-0.707107 and 2.34857e-08 
 	unsigned int param_max_nodes	= 16777216; // 2^24
+	bool param_max_nodes_was_set=false;
 	float param_min_species	= 1;
 	string param_rmgraph            = "remove.graph";
 	bool param_useWeights = true;
@@ -256,6 +259,12 @@ size_t getCurrentRSS( )
 #endif 
 }
 
+/*void debug_log(string log){
+	std::hash<std::thread::id>{}(std::this_thread::get_id());
+	if(!log_out.count(tid))
+		log_out[tid]=make_shared<ofstream>((param_rmgraph+to_string(tid)).c_str());
+}*/
+
 class dispatch_queue{
   // based on: https://github.com/embeddedartistry/embedded-resources/blob/master/examples/cpp/dispatch.cpp
 
@@ -425,7 +434,7 @@ void dispatch_queue::dispatch_thread_handler(size_t tid){
 // Main
 ///////////////////////////////////////////////////////////
 void printHelp() {
-	cerr << "proteinortho_clustering - Spectral partitioning algorithm (last updated with proteinortho v6.1.2)" << "\n";
+	cerr << "proteinortho_clustering - Spectral partitioning algorithm (last updated with proteinortho v6.1.6)" << "\n";
 	cerr << "-----------------------------------------------------" << "\n";
 	cerr << "This tool is part of Proteinortho" << "\n";
 	cerr << "" << "\n";
@@ -440,9 +449,10 @@ void printHelp() {
 	cerr << "         -cpus int         the number of threads used for openMP ["<<num_cpus<<"]" << "\n";
 	cerr << "         -coreMaxProts int  the maximum number of proteins per species for -core ["<<param_coreMaxProteinsPerSpecies<<"]" << "\n";
 	cerr << "         -coreMinSpecies int  the minimum number of species for -core ["<<param_coreMinSpecies<<"]" << "\n";
+	cerr << "         -abc   flag to indicate the input is a abc formatted graph file instead of a blast-graph (tab-separated). Input is expected to be undirected. c is the similarity score (0-USHRT_MAX) e.g. the blast bitscore" << "\n";
 	cerr << "\ntechnical parameters:" << "\n";
 	cerr << "         -test             various test-functions are called first [not set]" << "\n";
-	cerr << "         -maxnodes int     only consider connected component with up to maxnodes nodes. If exceeded, greedily remove the worst 10 percent of edges (by weight) until satisfied ["<<param_max_nodes<<"]" << "\n";
+	cerr << "         -maxnodes int     only consider connected component with up to maxnodes nodes. If exceeded, greedily remove the worst 10 percent of edges (by weight) until satisfied [number of species ** 2]" << "\n";
 	cerr << "         -maxweight int    only use the edge weights for connected components with maxweight nodes ["<<param_max_nodes_weight<<"]" << "\n";
 	cerr << "         -epsilon float    convergence threshold ["<<param_epsilon<<"]" << "\n";
 	cerr << "         -weighted bool    the spectral partition is calculated using the bitscores ["<<param_useWeights<<"]" << "\n";
@@ -488,6 +498,9 @@ int main(int argc, char *argv[]) {
 			else if (parameter == "-core") {
 				param_core = true;
 			}
+			else if (parameter == "-abc") {
+				param_abc = true;
+			}
 			else if (parameter == "-coreMaxProt" || parameter == "-coreMaxProts") {
 				param_core = true;
 				paras++;
@@ -530,6 +543,7 @@ int main(int argc, char *argv[]) {
 			else if (parameter == "-maxnodes") {
 				paras++;
 				param_max_nodes = string2float(string(argv[paras]));
+				param_max_nodes_was_set=true;
 			}
 			else if (parameter == "-maxweight") {
 				paras++;
@@ -614,7 +628,7 @@ int main(int argc, char *argv[]) {
 
 		if (debug_level > 0) cerr << getTime() << " [DEBUG]   Debug level " << debug_level << "\n";
 
-		if(param_core){param_con_threshold=999;}
+		if(param_core){param_con_threshold=999;} // for -core the -conn parameter is disabled 
 
 		if(getEnvVar("OMP_NUM_THREADS") != "1"){
 			// restart with OMP_NUM_THREADS=1
@@ -627,7 +641,11 @@ int main(int argc, char *argv[]) {
 		// Parse files
 		for (vector<string>::iterator it=files.begin() ; it != files.end(); it++) {
 			if (debug_level > 0) cerr << getTime() << " [DEBUG]   Parsing file " << *it << "\n";
-			parse_file(*it);
+			if(param_abc){
+				parse_abc_file(*it);
+			}else{
+				parse_file(*it);
+			}
 			if (debug_level > 0) cerr << getTime() << " [DEBUG]   I know " << species_counter <<  " species with " << protein_counter << " proteins and " << edges << " edges in sum" << "\n";
 		}
 
@@ -653,6 +671,8 @@ int main(int argc, char *argv[]) {
 		// Stats
 		if (param_verbose) cerr << species_counter << " species" << "\n" << protein_counter << " paired proteins" << "\n" << edges << " bidirectional edges" << "\n";
 
+		if(param_max_nodes_was_set==false){param_max_nodes = species_counter*species_counter; }
+
 		if (debug_level > 0) cerr << getTime() << " [DEBUG]   Maximumum number of nodes for connectivity calculations is " << param_max_nodes << "\n";
 
 		// Prepare sort of output
@@ -828,7 +848,18 @@ void find_CCs_givenNodes(dispatch_queue*,vector<unsigned int>);
 void print_group(ConnectedComponent& , float, size_t, bool);
 float calc_group(vector<unsigned int>*);
 
-ConnectedComponent BFS(map<unsigned int, bool> * done, unsigned int cur_node ){
+ConnectedComponent BFS( map<unsigned int, bool> * done, unsigned int cur_node , float cut_off ){
+
+	/*
+	*
+	* Simple BFS implementation
+	* done map is used to identify allready visited nodes (can be set before call to indicate forbidden nodes)
+	* cut_off = ignore all edges below this cut off
+	* 
+	*/
+	size_t tid=std::hash<std::thread::id>{}(std::this_thread::get_id());
+	if(!graph_clean.count(tid))
+		graph_clean[tid]=make_shared<ofstream>((param_rmgraph+to_string(tid)).c_str());
 
 	ConnectedComponent ret; //return vector
 	list<unsigned int> q;
@@ -845,16 +876,26 @@ ConnectedComponent BFS(map<unsigned int, bool> * done, unsigned int cur_node ){
 			ret.d_sum+=graph[cur_node].edges.size();
 			(*done)[cur_node] = true;
 
-			for (unsigned int i = 0; i < graph[cur_node].edges.size(); i++) {
+			if(graph[cur_node].is_done){continue;}
+
+			for (unsigned int j = 0; j < graph[cur_node].edges.size(); j++) {
 
-				unsigned int adjacency_node = graph[cur_node].edges[i].edge;
+				if(graph[graph[cur_node].edges[j].edge].is_done){continue;}
+
+				if(graph[cur_node].edges[j].weight < cut_off){
+					protein node_i = graph[cur_node];
+					protein node_j = graph[node_i.edges[j].edge];
+					(*graph_clean[tid]) << node_i.full_name << "\t" << species[node_i.species_id] << "\t" << node_j.full_name << "\t" << species[node_j.species_id] << "\n";
+					continue;
+				} // ignore
+
+				unsigned int adjacency_node = graph[cur_node].edges[j].edge;
 
 				if(adjacency_node > graph.size()){
 					cerr << string("[ERROR] : Input graph is invalid. The node "+graph[cur_node].full_name +" is reporting an edge/adjacent node, that is not present in the graph.").c_str() << "\n";throw;
 				}
 
 				if( !done->count(adjacency_node) || !(*done)[adjacency_node] ){
-
 					(*done)[adjacency_node] = true;
 					q_new.push_back(adjacency_node);
 				}
@@ -863,10 +904,6 @@ ConnectedComponent BFS(map<unsigned int, bool> * done, unsigned int cur_node ){
 
 		q=q_new;
 	}
-	//ret.calc_density();
-	//if(ret.density > 1){
-	//	cerr << "[WARNING] : The input graph has duplicated edges, this lead to an invalid graph density of " << ret.density << " (should be <1). Please clean the .blast-graph with 'proteinortho.pl --cleanblast --step=3 --project=...' or use the cleanupblastgraph tool in src/ to remove the duplicated edges." << "\n";
-	//}
 	return ret;
 }
 
@@ -876,45 +913,107 @@ struct compare_ConnectedComponents { //sort from large to small
 	}
 };
 
-void removeLowQualityEdges( ConnectedComponent &cur_cc ){
+void removeLowQualityEdges( ConnectedComponent cur_cc , dispatch_queue *q ){
+	/*
+	 * remove low quality edges
+	 * 
+	 * find the range of values of this CC -> define a cut-off = cut_off=min_w+0.1*(max_w-min_w);
+	 * remove all edges below that value
+	 * redo BFS and start over again for this cluster until -maxnodes are satisfied
+	 * 
+	 */
+	if(debug_level>0) cerr << " [INFO]  removeLowQualityEdges " << cur_cc.size() << " start" << "\n";
+
+	// find the lowest 10% of edge values (min+0.1*(max-min))
+	float min_w = -1; 
+	float max_w = -1;
+	unsigned int original_number_nodes=cur_cc.size();
+	for (unsigned int i = 0; i < cur_cc.size(); i++) {
+		unsigned int from=cur_cc[i];
+		for(unsigned int j = 0 ; j < graph[from].edges.size() ; j++){
+			unsigned int to = graph[from].edges[j].edge;
+			unsigned int weight = graph[from].edges[j].weight;
+			if(min_w == -1 || weight < min_w){min_w=weight;} 
+			if(min_w == -1 || weight > max_w){max_w=weight;}
+		}
+	}
+	float cut_off=min_w+0.1*(max_w-min_w);
 
-		/*
-		 * remove low quality edges
-		 * 
-		 * find the range of values of this CC -> define a cut-off = cut_off=min_w+0.1*(max_w-min_w);
-		 * remove all edges below that value
-		 * redo BFS and start over again for this cluster until -maxnodes are satisfied
-		 * 
-		 */
+	if(debug_level>0) cerr << " [INFO]  removeLowQualityEdges " << cur_cc.size() << " cut_off=" << cut_off << " min_w=" << min_w << " max_w=" << max_w << "\n";
 
-		// find the lowest 10% of edge values (min+0.1*(max-min))
-		float min_w = -1; 
-		float max_w = -1;
-		for (unsigned int i = 0; i < cur_cc.size(); i++) {
-			unsigned int from=cur_cc[i];
-			for(unsigned int j = 0 ; j < graph[from].edges.size() ; j++){
-				unsigned int to = graph[from].edges[j].edge;
-				unsigned int weight = graph[from].edges[j].weight;
-				if(min_w == -1 || weight < min_w){min_w=weight;} 
-				if(min_w == -1 || weight > max_w){max_w=weight;}
-			}
-		}
-		float cut_off=min_w+0.1*(max_w-min_w);
-		if(debug_level>0) cerr << "cut_off="<<cut_off << " min_w="<<min_w<<" max_w="<<max_w<< endl;
-		for (unsigned int i = 0; i < cur_cc.size(); i++) {
-			unsigned int from=cur_cc[i];
-			auto it = std::remove_if(
-				graph[from].edges.begin(), 
-				graph[from].edges.end(), 
-				[cut_off,from](wedge cur_el)->bool 
-				{
-					if(debug_level>0 && cur_el.weight <= cut_off) cerr << "[WARNING] found bad-edge "<< from << "-"<< cur_el.edge << " weight=" << cur_el.weight << endl;
-					return cur_el.weight <= cut_off;
-				}
-			);
-			graph[from].edges.erase(it, graph[from].edges.end());
-		}
+	map<unsigned int, bool> done;	// Keep track on what was done (for each node)
+	bool allNodesAreDone = false;
+
+	vector<ConnectedComponent> CC; // vector of all connected components found
+
+	while( true ){ // CC.size() < num_cpus / gather up to num_cpus connected components
+
+		allNodesAreDone = true;
+
+		for (unsigned int id = 0 ; id < cur_cc.size() ; id++) {
+		
+			unsigned int protein_id = cur_cc[id];
+
+			if (done.count(protein_id) && done[protein_id]){continue;}// We were here already
+			if(debug_level>0) cerr << "find_CCs:start at "<< protein_id << "<" << graph.size() << endl;
 
+			done[protein_id]=true; // mark this node
+			ConnectedComponent cur_cc = BFS(&done,protein_id,cut_off); // get the CC of the current node (protein_id) 
+
+			if(cur_cc.size()==original_number_nodes){
+				// the cutoff is too low -> increase by 10% and restart
+				cut_off+=0.1*(max_w-min_w);
+				allNodesAreDone=false;
+				break;
+			}
+
+			// for -core, skip CC with less than param_coreMinSpecies species
+			if( param_core && cur_cc.species_num==0 ){
+				if(cur_cc.size()<param_coreMinSpecies){allNodesAreDone=false;break;} // if there are less nodes than wanted species -> skip
+				map<unsigned int,bool>cc_species;
+				for (int i = 0; i < cur_cc.size(); ++i){ cc_species[graph[cur_cc[i]].species_id]=true; }
+				cur_cc.species_num=cc_species.size();
+				if(cc_species.size()<param_coreMinSpecies){allNodesAreDone=false;break;} // less species as wanted -> skip
+			}
+
+			// Skip those that are too large (try heuristic)
+			if (cur_cc.size() > param_max_nodes) {
+
+				// reset done vector
+				//for (int i = 0; i < cur_cc.size(); ++i){ done[cur_cc[i]]=false; }
+				if(debug_level>0) cerr << " [WARNING]  Found a very large connected component that contains " << cur_cc.size() << ">" << param_max_nodes << " (maxnodes) elements. This behavior can be adjusted using -maxnodes. Now using a slow heuristic: try to identify and remove edges." << "\n";
+				q->dispatch([cur_cc,q]{ removeLowQualityEdges(cur_cc,q); });
+				//protein_id--; 
+				continue;
+			}
+
+			cur_cc.calc_dsum();
+			cur_cc.calc_density();
+			if(cur_cc.density > 1){ cerr << "[WARNING] : The input graph has duplicated edges, this lead to an invalid graph density of " << cur_cc.density << " (should be <1). Please clean the .blast-graph with 'proteinortho.pl --cleanblast --step=3 --project=...' or use the cleanupblastgraph tool in src/ to remove the duplicated edges." << "\n"; throw; }
+			if (debug_level > 0) cerr << getTime() << " [DEBUG:removeLowQualityEdges] Found connected component: " << cur_cc.size() << " proteins (ID: " << protein_id << "), graph density="<< cur_cc.density << ", sum of degrees="<< cur_cc.d_sum << " ini from " << graph[protein_id].full_name<< "\n";
+
+			q->dispatch([cur_cc,q]{ partition_CC(cur_cc,q,true,false); });
+			allNodesAreDone=false;
+			break;
+		}
+		if(allNodesAreDone)break; // no additional CC can be found -> done
+	}
+	if(debug_level>0) cerr << " [INFO]  removeLowQualityEdges " << cur_cc.size() << " done\n";
+
+	if(debug_level>0) cerr << "cut_off="<<cut_off << " min_w="<<min_w<<" max_w="<<max_w<< endl;
+	// for (unsigned int i = 0; i < cur_cc.size(); i++) {
+	// 	unsigned int from=cur_cc[i];
+	// 	auto it = std::remove_if(
+	// 		graph[from].edges.begin(), 
+	// 		graph[from].edges.end(), 
+	// 		[cut_off,from](wedge cur_el)->bool 
+	// 		{
+	// 			if(debug_level>0 && cur_el.weight <= cut_off) cerr << "[WARNING] found bad-edge "<< from << "-"<< cur_el.edge << " weight=" << cur_el.weight << endl;
+	// 			return cur_el.weight <= cut_off;
+	// 		}
+	// 	);
+	// 	graph[from].edges.erase(it, graph[from].edges.end());
+	// }
 }
 
 void find_CCs_givenNodes(dispatch_queue *q, vector<unsigned int> todo_work ){
@@ -923,8 +1022,8 @@ void find_CCs_givenNodes(dispatch_queue *q, vector<unsigned int> todo_work ){
 	bool allNodesAreDone = false;
 
 	vector<ConnectedComponent> CC; // vector of all connected components found
-	//unsigned int min_i=0;
 
+	// find all nodes outside todo_work and remove them from the done vector !
 	for (unsigned int i = 0 ; i < todo_work.size() ; i++) {
 		unsigned int from = todo_work[i]; 
 		done[ from ] = 0;
@@ -944,12 +1043,14 @@ void find_CCs_givenNodes(dispatch_queue *q, vector<unsigned int> todo_work ){
 		for (unsigned int i = 0 ; i < todo_work.size() ; i++) {
 
 			unsigned int protein_id=todo_work[i];
+			if ( graph[protein_id].is_done ){continue;}
 			if ( done.count(protein_id) && done[protein_id] ){continue;}// We were here already
 
 			//min_i=protein_id;
+			if (debug_level > 0) cerr << getTime() << " [DEBUG:find_CCs_givenNodes] start="<< protein_id << "\n";
 
 			done[protein_id]=true; // mark this node
-			ConnectedComponent cur_cc = BFS(&done,protein_id); // get the CC of the current node (protein_id) 
+			ConnectedComponent cur_cc = BFS(&done,protein_id,0); // get the CC of the current node (protein_id) 
 
 			// Do not report singles
 			if (cur_cc.size() < 2) {continue;} // singletons are from no interest
@@ -958,11 +1059,10 @@ void find_CCs_givenNodes(dispatch_queue *q, vector<unsigned int> todo_work ){
 
 			// Skip those that are too large (try heuristic)
 			if (cur_cc.size() > param_max_nodes) {
-
 				// reset done vector
-				for (int i = 0; i < cur_cc.size(); ++i){ done[cur_cc[i]]=false; }
+				// for (int i = 0; i < cur_cc.size(); ++i){ done[cur_cc[i]]=false; }
 				if(debug_level>0) cerr << " [WARNING]  Found a very large connected component that contains " << cur_cc.size() << ">" << param_max_nodes << " (maxnodes) elements. This behavior can be adjusted using -maxnodes. Now using a slow heuristic: try to identify and remove edges." << "\n";
-				removeLowQualityEdges(cur_cc); i--;
+				removeLowQualityEdges(cur_cc,q);
 				continue;
 			}
 
@@ -977,12 +1077,12 @@ void find_CCs_givenNodes(dispatch_queue *q, vector<unsigned int> todo_work ){
 			cur_cc.calc_dsum();
 			cur_cc.calc_density();
 			if(cur_cc.density > 1){ cerr << "[WARNING] : The input graph has duplicated edges, this lead to an invalid graph density of " << cur_cc.density << " (should be <1). Please clean the .blast-graph with 'proteinortho.pl --cleanblast --step=3 --project=...' or use the cleanupblastgraph tool in src/ to remove the duplicated edges." << "\n"; throw; }
-			if (debug_level > 0) cerr << getTime() << " [DEBUG] Found connected component: " << cur_cc.size() << " proteins (ID: " << protein_id << "), graph density="<< cur_cc.density << ", sum of degrees="<< cur_cc.d_sum << "\n";
+			if (debug_level > 0) cerr << getTime() << " [DEBUG:find_CCs_givenNodes] Found connected component: " << cur_cc.size() << " proteins (ID: " << protein_id << "), graph density="<< cur_cc.density << ", sum of degrees="<< cur_cc.d_sum << "\n";
 
 			q->dispatch([cur_cc,q]{ partition_CC(cur_cc,q,true,false); });
 			
 			allNodesAreDone=false;
-			break;
+			//break;
 		}
 		if(allNodesAreDone)break; // no additional CC can be found -> done
 	}
@@ -1109,8 +1209,10 @@ void partition_CC(ConnectedComponent cur_cc, dispatch_queue *q, bool do_lapack,
 		 */
 
 		float new_purity=0;
-		float best_cut=0;
-		float cut_value=0;
+		float best_cut=0; // stores the current best cut (sum of all edges removed by the given purity), first take a purity of 0 (no purity) and set this value (split + and - compartments of x_hat)
+		vector<float> x_hat_candidate_purity = x_hat;
+		for (unsigned int i = 0; i < x_hat_candidate_purity.size(); i++) { x_hat_candidate_purity[i]=abs(x_hat_candidate_purity[i]); } // purity is a positive threshold
+		sort(x_hat_candidate_purity.begin(),x_hat_candidate_purity.end()); // these values are used as purity tests. Later only subset is used of values not too similar to each other
 		for (unsigned int i = 0; i < x_hat.size(); i++) {
 			if(x_hat[i] < 0) {
 				unsigned int from=cur_cc[i];
@@ -1124,28 +1226,35 @@ void partition_CC(ConnectedComponent cur_cc, dispatch_queue *q, bool do_lapack,
 			}
 		}
 		if(debug_level>0)	cerr << " initial purity=0 best_cut=" << best_cut << endl;
+		float last_test_purity = -1; // omit purity test that are very close to previous values, to reduce runtime
+		int num_purity_tests=0; // test all of the first 25 purity values anyway
 		for(unsigned int pi = 0; pi < (unsigned int)n; pi++){
-			unsigned int from = cur_cc[pi];
-			if(!mapping.count(from) || abs(x_hat[ mapping[from] ])>0.12 ){continue;}
-			float test_purity = abs(x_hat[ mapping[from] ]);
-			float cut_value=0;
-			unsigned int num_nodes=0;
+			//if( x_hat_sort[ pi ] < -0.12 ){continue;} // skip large negative values
+			if( x_hat_candidate_purity[ pi ] > 0.12 ){break;} // there is no further valid value (sorted values)
+			float test_purity = x_hat_candidate_purity[ pi ];
+			if( test_purity == 0 || 
+					test_purity == last_test_purity || 
+					(num_purity_tests>25 && test_purity-last_test_purity<0.01) ){continue;} // current value is very similar to last purity -> omit
+			float cut_value = 0;
+			num_purity_tests++;
+			last_test_purity = test_purity;
+			unsigned int num_nodes = 0; // the number of nodes within the purity bounds
 			for (unsigned int i = 0; i < x_hat.size(); i++) {
-				if(abs(x_hat[i]) <= test_purity) {
+				if(abs(x_hat[i]) <= test_purity) { // add all the edges between this node (below purity) and nodes above the purity threshold to the cut value
 					num_nodes++;
 					unsigned int from=cur_cc[i];
 					for(unsigned int j = 0 ; j < graph[from].edges.size() ; j++){
 						unsigned int to = graph[from].edges[j].edge;
-						if(!mapping.count(to)){continue;}
+						if(!mapping.count(to)){continue;} // sanity check if the edge is in the current CC
 						if(abs(x_hat[mapping[to]]) > test_purity) {
 							cut_value+=graph[from].edges[j].weight;							
 						}
 					}
 				}
 			}
-			if( cut_value>0 && (best_cut==-1 || best_cut>cut_value) && num_nodes>1){
-				new_purity=test_purity;
-				best_cut=cut_value;
+			if( cut_value>0 && (best_cut==-1 || best_cut>cut_value) && num_nodes>1){ // update optimum
+				new_purity = test_purity;
+				best_cut = cut_value;
 				if(debug_level>0)	cerr << " initial purity="<<new_purity<<" best_cut=" << best_cut << endl;
 			}
 		}
@@ -1328,6 +1437,7 @@ void partition_CC(ConnectedComponent cur_cc, dispatch_queue *q, bool do_lapack,
 	if (debug_level > 0) cerr << getTime() << " [DEBUG] splitting CC with "<<cur_cc.size()<<" nodes "<<(debug_level==15 ? "@"+id+"" : "")<< " into ("<<groupA.size()<<""<<(debug_level==15 ? " @"+idA+"" : "")<< ","<<groupB.size()<<""<<(debug_level==15 ? " @"+idB+"" : "")<< ","<<groupZero.size()<<""<<(debug_level==15 ? " @"+idZ+"" : "")<< ") sized groups!"<< "\n";
 
 	// Catch error in laplacien calcs
+	// all nodes are in one partition -> restart with an increased restart level
 	if ( (groupA.size() == 0 && groupB.size() == 0) || 
 		 ( (groupA.size() == 0 || groupB.size() == 0) && groupZero.size() == 0) ){
 		
@@ -1346,9 +1456,39 @@ void partition_CC(ConnectedComponent cur_cc, dispatch_queue *q, bool do_lapack,
 		
 	}else{
 
+		// for the -core algorithm, test if the number of species now decreased by too much or keep on splitting
+		if(param_core){
+			map<unsigned int,bool>cc_speciesA;
+			for (int i = 0; i < groupA.size(); ++i){ cc_speciesA[graph[groupA[i]].species_id]=true; }
+			groupA.species_num=cc_speciesA.size();
+			map<unsigned int,bool>cc_speciesB;
+			for (int i = 0; i < groupB.size(); ++i){ cc_speciesB[graph[groupB[i]].species_id]=true; }
+			groupB.species_num=cc_speciesB.size();
+			map<unsigned int, bool> done; // for zero additionally test if this group is connected -> if not
+
+			if(groupA.species_num!=cur_cc.species_num && // neither A, B or Zero meet the -core criteria -> test zero and if zero also do not hold -> print the original group cur_cc before the split
+					groupB.species_num!=cur_cc.species_num && 
+					param_coreMaxProteinsPerSpecies*cur_cc.species_num > cur_cc.size()){
+
+				map<unsigned int,bool>cc_speciesZero;
+				for (int i = 0; i < groupZero.size(); ++i){ cc_speciesZero[graph[groupZero[i]].species_id]=true; }
+				groupZero.species_num=cc_speciesZero.size();
+
+				if( groupZero.species_num!=cur_cc.species_num ){ // the original CC is still fine -> print
+					print_group(cur_cc,connectivity,tid,false);
+					return; // done with this group, no need to split into A and B by the definition of -core
+				}
+			}
+		}
+
+		// now add the edges between A and B (and Zero) to the output rmgraph
+		// (the rmgraph contains all edges that are removed by the clustering)
+
 		if(!graph_clean.count(tid))
 			graph_clean[tid]=make_shared<ofstream>((param_rmgraph+to_string(tid)).c_str());
 
+		if(debug_level>0) cerr << "[INFO] removing between A and B+Zero" << endl;
+		// remove edges between A and (B or Zero)
 		for(unsigned int i = 0 ; i < groupA.size() ; i++){
 			protein node_i = graph[groupA[i]];
 			for(unsigned int j = 0 ; j < graph[groupA[i]].edges.size() ; j++){
@@ -1358,7 +1498,20 @@ void partition_CC(ConnectedComponent cur_cc, dispatch_queue *q, bool do_lapack,
 					(*graph_clean[tid]) << node_i.full_name << "\t" << species[node_i.species_id] << "\t" << node_j.full_name << "\t" << species[node_j.species_id] << "\n";
 				}
 			}
+			// auto it = std::remove_if(
+			// 	node_i.edges.begin(), 
+			// 	node_i.edges.end(), 
+			// 	[node_i,groupB_set,groupZero_set](wedge cur_el)->bool 
+			// 	{
+			// 		if(debug_level>0 && (groupB_set.count(cur_el.edge) || groupZero_set.count(cur_el.edge))) cerr << "[INFO] removing "<< node_i.full_name << "-"<< graph[cur_el.edge].full_name << " weight=" << cur_el.weight << endl;
+			// 		return groupB_set.count(cur_el.edge) || groupZero_set.count(cur_el.edge);
+			// 	}
+			// );
+			// node_i.edges.erase(it, node_i.edges.end());
 		}
+
+		if(debug_level>0) cerr << "[INFO] removing between B and Zero" << endl;
+		// remove edges between B and Zero
 		for(unsigned int i = 0 ; i < groupB.size() ; i++){
 			protein node_i = graph[groupB[i]];
 			for(unsigned int j = 0 ; j < graph[groupB[i]].edges.size() ; j++){
@@ -1367,37 +1520,36 @@ void partition_CC(ConnectedComponent cur_cc, dispatch_queue *q, bool do_lapack,
 					(*graph_clean[tid]) << node_i.full_name << "\t" << species[node_i.species_id] << "\t" << node_j.full_name << "\t" << species[node_j.species_id] << "\n";
 				}
 			}
+			// auto it = std::remove_if(
+			// 	node_i.edges.begin(), 
+			// 	node_i.edges.end(), 
+			// 	[node_i,groupZero_set](wedge cur_el)->bool 
+			// 	{
+			// 		if(debug_level>0 && groupZero_set.count(cur_el.edge)) cerr << "[INFO] removing "<< node_i.full_name << "-"<< graph[cur_el.edge].full_name << " weight=" << cur_el.weight << endl;
+			// 		return groupZero_set.count(cur_el.edge);
+			// 	}
+			// );
+			// node_i.edges.erase(it, node_i.edges.end());
 		}
 
+		// finally add the new clusters back to the stack of CCs
+		// for -core only add those clusters which meet the criteria of number of species
+
 		if(param_core){
-			map<unsigned int,bool>cc_speciesA;
-			for (int i = 0; i < groupA.size(); ++i){ cc_speciesA[graph[groupA[i]].species_id]=true; }
-			groupA.species_num=cc_speciesA.size();
-			map<unsigned int,bool>cc_speciesB;
-			for (int i = 0; i < groupB.size(); ++i){ cc_speciesB[graph[groupB[i]].species_id]=true; }
-			groupB.species_num=cc_speciesB.size();
-			
-			if( groupA.species_num!=cur_cc.species_num && 
-					groupB.species_num!=cur_cc.species_num && 
-					param_coreMaxProteinsPerSpecies*cur_cc.species_num > cur_cc.size()){
-				print_group(cur_cc,connectivity,tid,false);
-			}else{
-				if(groupA.species_num >= param_coreMinSpecies && groupA.size()>1)
-					q->dispatch([groupA,q]{ partition_CC(groupA,q,true,0); });
-				if(groupB.species_num >= param_coreMinSpecies && groupB.size()>1)
-					q->dispatch([groupB,q]{ partition_CC(groupB,q,true,0); });
-			}
-			return;
+			if((groupA.species_num >= param_coreMinSpecies || param_coreMaxProteinsPerSpecies*cur_cc.species_num > cur_cc.size()) && groupA.size()>1)
+				q->dispatch([groupA,q]{ find_CCs_givenNodes(q,groupA.m_content_CC); });
+			if((groupB.species_num >= param_coreMinSpecies || param_coreMaxProteinsPerSpecies*cur_cc.species_num > cur_cc.size()) && groupB.size()>1)
+				q->dispatch([groupB,q]{ find_CCs_givenNodes(q,groupB.m_content_CC); });
+			if((groupZero.species_num >= param_coreMinSpecies || param_coreMaxProteinsPerSpecies*cur_cc.species_num > cur_cc.size()) && groupZero.size()>1) // only if zero is connected 
+				q->dispatch([groupZero,q]{ find_CCs_givenNodes(q,groupZero.m_content_CC); });
+		}else{
+			if(groupA.size()>1)
+				q->dispatch([groupA,q]{ find_CCs_givenNodes(q,groupA.m_content_CC); });
+			if(groupB.size()>1)
+				q->dispatch([groupB,q]{ find_CCs_givenNodes(q,groupB.m_content_CC); });
+			if(groupZero.size()>1)
+				q->dispatch([groupZero,q]{ find_CCs_givenNodes(q,groupZero.m_content_CC); });
 		}
-
-		if(groupA.size()>1)
-			//q->dispatch([groupA,q]{ partition_CC(groupA,q,true,false); });
-			q->dispatch([groupA,q]{ find_CCs_givenNodes(q,groupA.m_content_CC); });
-		if(groupB.size()>1)
-			//q->dispatch([groupB,q]{ partition_CC(groupB,q,true,false); });
-			q->dispatch([groupB,q]{ find_CCs_givenNodes(q,groupB.m_content_CC); });
-		if(groupZero.size()>1)
-			q->dispatch([groupZero,q]{ find_CCs_givenNodes(q,groupZero.m_content_CC); });
 	}
 }
 
@@ -1415,44 +1567,49 @@ void find_CCs(dispatch_queue *q){
 
 		for (unsigned int protein_id = 0 ; protein_id < graph.size() ; protein_id++) {
 			
+			if ( graph[protein_id].is_done ){continue;}
+
 			if (done.count(protein_id) && done[protein_id]){continue;}// We were here already
 
-			if(debug_level>0) cerr << "find_CCs:start at "<< protein_id << "<" << graph.size() << endl;
+			if(debug_level>0) cerr << getTime() << " [DEBUG:find_CCs] find_CCs:start at "<< protein_id << "<" << graph.size() << endl;
 
 			//min_i=protein_id;
 
 			done[protein_id]=true; // mark this node
-			ConnectedComponent cur_cc = BFS(&done,protein_id); // get the CC of the current node (protein_id) 
+			ConnectedComponent cur_cc = BFS(&done,protein_id,0); // get the CC of the current node (protein_id) 
 
 			// Do not report singles
 			if (cur_cc.size() < 2) {continue;} // singletons are from no interest
 
+			// for -core, skip CC with less than param_coreMinSpecies species
+			if( param_core && cur_cc.species_num==0 ){
+				if(cur_cc.size()<param_coreMinSpecies){allNodesAreDone=false;break;} // if there are less nodes than wanted species -> skip
+				map<unsigned int,bool>cc_species;
+				for (int i = 0; i < cur_cc.size(); ++i){ cc_species[graph[cur_cc[i]].species_id]=true; }
+				cur_cc.species_num=cc_species.size();
+				if(cc_species.size()<param_coreMinSpecies){allNodesAreDone=false;break;} // less species as wanted -> skip
+			}
+
 			// Skip those that are too large (try heuristic)
 			if (cur_cc.size() > param_max_nodes) {
 
 				// reset done vector
-				for (int i = 0; i < cur_cc.size(); ++i){ done[cur_cc[i]]=false; }
+				//for (int i = 0; i < cur_cc.size(); ++i){ done[cur_cc[i]]=false; }
 				if(debug_level>0) cerr << " [WARNING]  Found a very large connected component that contains " << cur_cc.size() << ">" << param_max_nodes << " (maxnodes) elements. This behavior can be adjusted using -maxnodes. Now using a slow heuristic: try to identify and remove edges." << "\n";
-				removeLowQualityEdges(cur_cc); protein_id--;
+				q->dispatch([cur_cc,q]{ removeLowQualityEdges(cur_cc,q); });
+				//protein_id--; 
 				continue;
 			}
 
-			if(param_core && cur_cc.species_num==0){
-				if(cur_cc.size()<param_coreMinSpecies){allNodesAreDone=false;break;} // if there are less nodes than wanted species -> skip
-				map<unsigned int,bool>cc_species;
-				for (int i = 0; i < cur_cc.size(); ++i){ cc_species[graph[cur_cc[i]].species_id]=true; }
-				cur_cc.species_num=cc_species.size();
-				if(cc_species.size()<param_coreMinSpecies){allNodesAreDone=false;break;} // less species as wanted -> skip
-			}
-			
+			if (debug_level > 0) cerr << getTime() << " [DEBUG:find_CCs] Found connected component: " << cur_cc.size() << " proteins (ID: " << protein_id << "), ini from " << graph[protein_id].full_name<< "\n";
 			cur_cc.calc_dsum();
 			cur_cc.calc_density();
 			if(cur_cc.density > 1){ cerr << "[WARNING] : The input graph has duplicated edges, this lead to an invalid graph density of " << cur_cc.density << " (should be <1). Please clean the .blast-graph with 'proteinortho.pl --cleanblast --step=3 --project=...' or use the cleanupblastgraph tool in src/ to remove the duplicated edges." << "\n"; throw; }
-			if (debug_level > 0) cerr << getTime() << " [DEBUG:find_CCs] Found connected component: " << cur_cc.size() << " proteins (ID: " << protein_id << "), graph density="<< cur_cc.density << ", sum of degrees="<< cur_cc.d_sum << " ini from " << graph[protein_id].full_name<< "\n";
+			if (debug_level > 0) cerr << getTime() << " [DEBUG:find_CCs] did calc dsum and density: " << cur_cc.size() << " proteins (ID: " << protein_id << "), graph density="<< cur_cc.density << ", sum of degrees="<< cur_cc.d_sum << " ini from " << graph[protein_id].full_name<< "\n";
 
 			q->dispatch([cur_cc,q]{ partition_CC(cur_cc,q,true,false); });
 			allNodesAreDone=false;
-			break;
+			//break;
 		}
 		if(allNodesAreDone)break; // no additional CC can be found -> done
 	}
@@ -1461,6 +1618,7 @@ void find_CCs(dispatch_queue *q){
 ///////////////////////////////////////////////////////////
 // Major partioning algorithm
 ///////////////////////////////////////////////////////////
+
 void partition_graph() {
 	#ifdef timeAnalysis
 		auto t_tmp = std::chrono::steady_clock::now( );
@@ -1651,6 +1809,120 @@ void parse_file(string file) {
 	#endif
 }
 
+void parse_abc_file(string file) {
+	#ifdef timeAnalysis
+		auto t_tmp = std::chrono::steady_clock::now( );
+	#endif
+
+
+	species.push_back("input");
+	species2id["input"] = species_counter++;
+
+	if (param_verbose) cerr << "Reading " << file << "\n";
+	string line;
+	ifstream graph_file(file.c_str());
+	if (graph_file.is_open()) {
+		while (!graph_file.eof()) {
+			getline(graph_file, line);
+			vector<string> fields;
+			tokenize(line, fields, "\t");
+			// Header line
+			if ((fields.size() == 3) && fields[0].substr(0, 1) != "#") {
+				// a b e1 b1 e2 b2 score
+
+				// 5.16 deal with duplicated IDs by adding file ID to protein ID
+				string ida = fields[0];
+				string idb = fields[1];
+				//fields[0] += " "; fields[0] += to_string(file_a_id);
+				//fields[1] += " "; fields[1] += to_string(file_b_id);
+
+				// 5.16 do not point to yourself
+				if (!fields[0].compare(fields[1])) {continue;}
+
+				//if(debug_level>0) cerr << "parse_file:(" << fields[0]<<","<<fields[1] << ") -> " << (protein2id.find(fields[0]) == protein2id.end()) << endl;
+
+				// A new protein
+				if (protein2id.find(fields[0]) == protein2id.end())	{
+					protein a;
+					a.full_name	= ida;
+					// graph_ram_total+=a.full_name.size()*sizeof(string);
+					a.species_id = 0;
+					// graph_ram_total+=sizeof(unsigned int);
+					protein2id[fields[0]] = protein_counter++;
+					// graph_ram_total+=fields[0].size()*sizeof(string)+sizeof(unsigned int);
+
+					if( graph.size() >= 1073741824-1 ){
+						cerr << string("[CRITICAL ERROR]   Overflow: number of nodes overflow the maximum number for vector allocation (2^30=1073741824).").c_str() << "\n";throw;
+					}
+					graph.push_back(a);
+					// graph_ram_total+=sizeof(protein);
+				}
+				if (protein2id.find(fields[1]) == protein2id.end())	{
+					protein b;
+					b.full_name	= idb;
+					// graph_ram_total+=b.full_name.size()*sizeof(string);
+					b.species_id	= 0;
+					// graph_ram_total+=sizeof(unsigned int);
+					protein2id[fields[1]] = protein_counter++;
+					// graph_ram_total+=fields[1].size()*sizeof(string)+sizeof(unsigned int);
+
+					if( graph.size() >= 1073741824-1 ){
+						cerr << string("[CRITICAL ERROR]   Overflow: number of nodes overflow the maximum number for vector allocation (2^30=1073741824).").c_str() << "\n";throw;
+					}
+					graph.push_back(b);
+					// graph_ram_total+=sizeof(protein);
+				}
+
+				// Bitscores 
+
+				float bit_a = string2float(fields[3]);
+
+				if(bit_a<1){bit_a=1;}
+
+				if(bit_a>USHRT_MAX){
+					cerr << " [WARNING] unsigned short overflow " << bit_a <<  ">USHRT_MAX (bitscore of "<< ida<< " adj. to "<< idb<< ") using "<< USHRT_MAX<< " (USHRT_MAX) instead." << "\n";
+					bit_a=(float)USHRT_MAX;
+				}
+
+				// Add link to graph (reciprocal)					
+				unsigned int a_id = protein2id[fields[0]];
+				unsigned int b_id = protein2id[fields[1]];
+
+				// 5.17, add weight
+				wedge w;
+				w.edge=b_id;
+				// graph_ram_total+=sizeof(unsigned int);
+				w.weight=bit_a;
+				// graph_ram_total+=sizeof(unsigned int);
+				graph[a_id].edges.push_back(w);
+				// graph_ram_total+=sizeof(wedge);
+				w.edge=a_id;
+				// graph_ram_total+=sizeof(unsigned int);
+				w.weight=bit_a;
+				// graph_ram_total+=sizeof(unsigned int);
+				graph[b_id].edges.push_back(w);
+				// graph_ram_total+=sizeof(wedge);
+				edges++;
+			}
+		}
+		graph_file.close();
+	}
+	else {
+		cerr << string("Could not open file " + file).c_str() << "\n";throw;
+	}
+
+	// graph_ram_total_inKB += graph_ram_total/1e+3;
+	// if (debug_level > 0) cerr << getTime() << " [DEBUG]  Expected Memory Usage of the current input graph: " << graph_ram_total_inKB << " KB = "  << graph_ram_total_inKB/1e+3 << " MB. (current input graph are all the currently loaded files)" << "\n";
+
+	if(species_counter==0){species.push_back("0");species_counter++;}
+
+	#ifdef timeAnalysis
+		if(!t_master.count("parse_file")) t_master["parse_file"]=0;
+		t_master["parse_file"] += (float)std::chrono::duration_cast<std::chrono::nanoseconds>( std::chrono::steady_clock::now( ) - t_tmp ).count()/1e+9;
+	#endif
+}
+
+
 ///////////////////////////////////////////////////////////
 // Output
 ///////////////////////////////////////////////////////////
@@ -1708,31 +1980,30 @@ struct protein_degree{
 };
 
 // Group formatting
-void print_group(ConnectedComponent& nodes, float connectivity, size_t tid, bool failed) {
+void print_group(ConnectedComponent& cur_cc, float connectivity, size_t tid, bool failed) {
 
 	#ifdef timeAnalysis
 		auto t_tmp = std::chrono::steady_clock::now( );
 	#endif
 
+	for (unsigned int i = 0; i < cur_cc.size(); i++) { graph[cur_cc[i]].is_done=true; }
+
 	if(debug_level>0) cerr << getTime()<< " [DEBUG] print_group start"<< endl;
 
-	if(nodes.size()<2){return;}
+	if(cur_cc.size()<2){return;}
 
 	map<unsigned int, bool> done;	// Keep track on what was done (for each node)
-	for (unsigned int i = 0 ; i < nodes.size() ; i++) {
-		unsigned int from = nodes[i]; 
+	for (unsigned int i = 0 ; i < cur_cc.size() ; i++) {
+		unsigned int from = cur_cc[i]; 
 		done[ from ] = 0;
 	}
-	for (unsigned int i = 0 ; i < nodes.size() ; i++) {
-		unsigned int from = nodes[i]; 
+	for (unsigned int i = 0 ; i < cur_cc.size() ; i++) {
+		unsigned int from = cur_cc[i]; 
 		for(unsigned int j = 0 ; j < graph[from].edges.size() ; j++){
 			if(!done.count(graph[from].edges[j].edge))
 				done[ graph[from].edges[j].edge ]=1;
 		}
 	}
-	ConnectedComponent cur_cc = BFS(&done,nodes[0]); // get the CC of the current node (protein_id)#
-	if((int)cur_cc.size() - (int)nodes.size() != 0){return;}
-
 
 	if(!proteinorthotmp_clean.count(tid)){
 		proteinorthotmp_clean[tid]=make_shared<ofstream>((param_rmgraph+"_proteinortho_tmp_"+to_string(tid)).c_str());
@@ -1743,16 +2014,16 @@ void print_group(ConnectedComponent& nodes, float connectivity, size_t tid, bool
 
 	unsigned int species_number = 0;
 	// For each protein in group 	 	
-	for (unsigned int i = 0; i < nodes.size(); i++) {
-		unsigned int current_protein = nodes[i];
+	for (unsigned int i = 0; i < cur_cc.size(); i++) {
+		unsigned int current_protein = cur_cc[i];
 		unsigned int current_species = graph[current_protein].species_id;
 		if (line[current_species].size() == 0) 
 			species_number++;
 		line[current_species].push_back(make_pair( graph[current_protein].full_name , graph[current_protein].edges.size() ));
 	}
 
-	(*proteinorthotmp_clean[tid]) << species_number << "\t" << nodes.size() << "\t" << (debug_level>0 ? "@"+getCCid(nodes.m_content_CC)+":":"") << setprecision (3) << (connectivity < 0 ? -connectivity : connectivity) << (failed ? "*": "");
-	// cerr << species_number << "\t" << nodes.size() << "\t" << setprecision (3) << connectivity;
+	(*proteinorthotmp_clean[tid]) << species_number << "\t" << cur_cc.size() << "\t" << (debug_level>0 ? "@"+getCCid(cur_cc.m_content_CC)+":":"") << setprecision (3) << (connectivity < 0 ? -connectivity : connectivity) << (failed ? "*": "");
+	// cerr << species_number << "\t" << cur_cc.size() << "\t" << setprecision (3) << connectivity;
 
 	// List group data
 	for (unsigned int i = 0; i < species_counter; i++) {
@@ -1781,7 +2052,7 @@ void print_group(ConnectedComponent& nodes, float connectivity, size_t tid, bool
 	(*proteinorthotmp_clean[tid]) << "\n";
 	// cerr << "\n";
 
-	if(debug_level>0) cerr << getTime()<< " [DEBUG] print_group @"<<getCCid(nodes.m_content_CC)<<" ERROR="<< ((int)cur_cc.size() - (int)nodes.size()) << endl;
+	if(debug_level>0) cerr << getTime()<< " [DEBUG] print_group @"<<getCCid(cur_cc.m_content_CC)<<" ERROR="<< ((int)cur_cc.size() - (int)cur_cc.size()) << endl;
 
 	//os.close();
 



View it on GitLab: https://salsa.debian.org/med-team/proteinortho/-/commit/42208790e378b027cdbc447f7a50dd4fb343b658

-- 
View it on GitLab: https://salsa.debian.org/med-team/proteinortho/-/commit/42208790e378b027cdbc447f7a50dd4fb343b658
You're receiving this email because of your account on salsa.debian.org.


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20230117/9fdff725/attachment-0001.htm>


More information about the debian-med-commit mailing list