[med-svn] [Git][med-team/proteinortho][upstream] New upstream version 6.1.7+dfsg
Andreas Tille (@tille)
gitlab at salsa.debian.org
Tue Jan 17 15:30:02 GMT 2023
Andreas Tille pushed to branch upstream at Debian Med / proteinortho
Commits:
42208790 by Andreas Tille at 2023-01-17T16:21:02+01:00
New upstream version 6.1.7+dfsg
- - - - -
6 changed files:
- CHANGELOG
- CHANGEUID
- Makefile
- proteinortho6.pl
- src/.gitlab-ci.yml
- src/proteinortho_clustering.cpp
Changes:
=====================================
CHANGELOG
=====================================
@@ -149,19 +149,19 @@
proteinoprtho 6.0 alpha release
28. Nov
Makefile update (lapack zipped,...)
- 5. Dez
+ 5. Dec
MCL integration (-mcl option in proteinortho.pl)
XML bugfix (species with . in the name did confuse the xml parser)
- 6. Dez
+ 6. Dec
MCL pre/postprocessing (src/do_mcl.pl)
double -> float in proteinortho_clustering.cpp
weights are unsigned shorts again (only the last commit was unsigned int) proteinortho_clustering.cpp
- 10. Dez
+ 10. Dec
gff4fasta update: Test for additional naming schemes
- 20.-21. Dez
+ 20.-21. Dec
src/do_mcl.pl performance increase, dev/blastgraph2CCblastgraphs.cpp improvement
orthoXML improvement (still not accepted by orthobechmarkproject)
- 25. Dez
+ 25. Dec
no lapack version (src/proteinortho_clustering_nolapack.cpp)
no cmake version (make all_nocmake, needs lapack installed)
2019
@@ -273,7 +273,7 @@
refined the Makefile for proteinortho_clustering, now it additionally tests if the program is executable with the -test option
12. Oct (5122)
enhancement of the Makefile (more verbose, added standard compiler flags, cleanup)
- 2. Dez (5195)
+ 2. Dec (5195)
fixing proteinortho_grab_protein.pl (https://gitlab.com/paulklemm_PHD/proteinortho/-/issues/41)
and ids that are not found are printed to STDERR
2021
@@ -374,5 +374,43 @@
- added the with and without lapack test back to the Makefile (make test or make test_step3)
- refined feature for proteinortho_clustering -maxnodes: if max-nodes are exceeded, remove the worst 10% of edges of this CC (by weight) until the CC breaks into smaller CC
23. Nov (6051) v6.1.4
- - small bugfix: newlines of the output of proteinortho_summary.pl are now printed to STDOUT instead of STDERR, thanks to Matthias Bernt
- - bugfix: for large project the step=1 could soft lock itself sometimes caused by the log_worker (if used with many cores)
+ - small bugfix: newlines of the output of proteinortho_summary.pl are now printed to STDOUT instead of STDERR, thanks to Matthias Bernt !9
+ - bugfix: for large project the step=1 could soft lock itself sometimes caused by the log_worker (if used with many cores) #69
+ 23. Nov (6056) v6.1.5
+ - bugfix: proteinortho_clustering -core will split before evaluating the cluster
+ - improvement: proteinortho_clustering the new purity detection was too careful (resulting in very long runtimes), now only a fixed number of purities are tested improving the runtime dramatically for large inputs
+ - bugfix: the bug of #69 was not fixed entirely
+ 23. Nov (6060) v6.1.5
+ - new feature for proteinortho_clustering: -abc enables a abc type graph input (undirected)
+ - some minor adjustments to the -core option of proteinortho_clustering
+ 26. Nov (6065) v6.1.5
+ - fixed a bug that sometimes will result in a suspended but running proteinortho instance (diamond needs separate WD for each call)
+ 26. Nov (6066) v6.1.6
+ - small bugfix for -step=1 and input files with forbidden symbols (invalid fasta files)
+ 30. Nov (6070) v6.1.6
+ - bug #69 did still occure, therefore a new wrapper for diamond, blast, ... is introduced that tests for stalemates and restarts the program a few times until it will fail.
+ - ci/cd clean up, introduced a own docker-worker for testing
+ 1. Dec (6075) v6.1.6
+ - introduced a new default value for maxnodes (species^2) for proteinortho_clustering, usually this improves the runtime by a lot
+ - some minor improvements to proteinortho_clustering
+ - improved stderr output for proteinortho6.pl
+ - added -threads_per_process option to proteinortho6.pl to fine tune the threading system (-step=2 only)
+ 5. Dec (6080) v6.1.6
+ - diamond uses all cores available to generate datasets (step=1), this is now restricted to the specified number of cores (-cpus)
+ - new feature: a fallback system for the blast commands of step=2: if e.g. a diamond calls zombifies (0% cpup running without terminating), proteinortho automatically detects this and restarts this comparison a few times instead of running indefinitely waiting on diamond
+ 6. Dec (6081) v6.1.6
+ - disabled the fallback system for mmseqs as mmseqs spawns the true workload in another thread
+ 12. Dec (6082) v6.1.6
+ - fixed a bug introduced with 6.1.3 where proteinortho_clustering maxnodes option can lead to infinite loops
+ - small improvement to the runtime of proteinortho_clustering
+ 13. Dec (6085) v6.1.7
+ - fixed a bug where -p=mmseqsp results in empty output
+ 15. Dec (6086) v6.1.7
+ - fixed a bug where multiple -jobs can delete the cache of other job-runs (if used without -keep). Now -jobs defines different individual caches
+ 22. Dec (6087) v6.1.7
+ - improved error output if -step=2 fails (includes now the the $! error message), see #70, thanks to Matthias Bernt
+ 25. Dec (6088) v6.1.7
+ - minor changes to step=2 stderr messages
+ 28. Dec (6088) v6.1.7
+ - added a wait on each kill call of proteinortho_exec
+
\ No newline at end of file
=====================================
CHANGEUID
=====================================
@@ -1 +1 @@
-6051
+6089
=====================================
Makefile
=====================================
@@ -449,5 +449,5 @@ test_clean2:
.PHONY: clean
clean:
- rm -rf src/BUILD test/C.faa.* test/E.faa.* test/C2.faa.* test/L.faa.* test/M.faa.*
+ rm -rf src/BUILD test/C*.faa.* test/E.faa.* test/C2.faa.* test/L.faa.* test/M.faa.*
rm -rf src/lapack-3.8.0/
=====================================
proteinortho6.pl
=====================================
@@ -95,6 +95,11 @@ You can adjust the parameters e.g. a more strict -e evalue cut off and write the
the number of processors to use (multicore/processor support) for step 2 and step 3
+=item B<--threads_per_process>=number (default: 1)
+
+number of threads per process,
+e.g. using -cpus=4 and -threads_per_process=2 will spawn 4 workerthreads using 2 cpu cores each = total of 8 cores
+
=item B<--verbose>={0,1,2} (default: 1)
verbose level. 1:keeps you informed about the progress
@@ -509,7 +514,7 @@ use POSIX;
##########################################################################################
# Variables
##########################################################################################
-our $version = "6.1.4";
+our $version = "6.1.7";
our $step = 0; # 0/1/2/3 -> do all / only apply step 1 / only apply step 2 / only apply step 3
our $verbose = 1; # 0/1 -> don't / be verbose
our $core = 0;
@@ -630,13 +635,12 @@ foreach my $option (@ARGV) {
elsif ($option =~ m/^--?cleanupblast$/) { $checkblast = 1; }
elsif ($option =~ m/^--?te?mp=(.+)$/) { $tmp_path = $1;
# make sure it ends with /
- unless ($tmp_path =~ /\/$/) {$tmp_path .= "/";}my $pwd=$ENV{"HOME"}; $tmp_path=~s/~/$pwd/g;
+ unless ($tmp_path =~ /\/$/) {$tmp_path .= "/";}
+ my $pwd=$ENV{"HOME"}; $tmp_path=~s/~/$pwd/g;
if(! -d $tmp_path || ! -R $tmp_path || ! -W $tmp_path ){
&Error(" -tmp=$tmp_path is not accessible. Check if the directory exists and is read- and writable.");
}else{
- if($tmp_path =~ m/proteinortho_cache/){
- $keep=1;
- }
+ if($tmp_path =~ m/proteinortho_cache/){ $keep=1; }
}
}
elsif ($option =~ m/^--?debug$/) { $debug = 1; }
@@ -646,6 +650,7 @@ foreach my $option (@ARGV) {
elsif ($option =~ m/^--?e=(.*)$/) { $evalue = $1; }
elsif ($option =~ m/^--?range=([0-9]+)$/) { $range = $1; }
elsif ($option =~ m/^--?cpus=(\d*)$/) { $cpus = $1; }
+ elsif ($option =~ m/^--?threads_per_process=(\d*)$/) { $threads_per_process = $1; }
elsif ($option =~ m/^--?cpus=auto$/) { $cpus = 0; }
elsif ($option =~ m/^--?alpha=([0-9\.]+)$/) { $alpha = $1; }
elsif ($option =~ m/^--?hmm_?evalue=([0-9\.eE+-]+)$/) { $hmm_search_evalue = $1; }
@@ -750,6 +755,96 @@ our $blastmode_pendant = { # if you choose blastp+ and the input is nucleotide -
my $restart_counter=-1; # restart only once
+our $proteinortho_exec='proteinortho_exec(){ # works with sh as well as bash (not csh)
+ max_monitor_time=25920; # 3 days in 10s (sleep_time is 10s)
+ max_restarts=10; # if max restarts are reached -> terminate with exit 1
+ sleep_time=10;
+ max_sus=60; # 10 min of suspension = restart
+ restart_counter=0;
+
+ work(){
+ # fork the current process into a monitor and the input cmd
+ # for debug could use this: force a OK after first run : #if [ "$restart_counter" -eq "1" ]; then cmd="sleep 1"; fi
+
+ eval "$cmd &"; # execute the input command in the background
+ pid=$!;
+ '.($debug ? 'echo "[main] started $pid";': "").'
+
+ trap \'max_monitor_time=0; kill -1 $pid 2>/dev/null; wait $pid 2>/dev/null;\' 2 9 EXIT;
+
+ (
+ # this is the monitor script, every update interval check if cmd is suspended or not -> if too many suspended calls are found terminate this monitor as well as the cmd call
+ it=0;
+ is_sus=0;
+ sus_counter=0;
+ sleep $sleep_time;
+ '.($debug ? 'echo "[monitor] starting $pid";': "").'
+ while [ "$it" -lt "$max_monitor_time" ] & [ "$pid" != "-1" ] & kill -0 $pid 2>/dev/null; do
+ # test of the cmd command has 0 cpu usage -> if yes then increment the sus_counter
+ if [ "$(ps -o pcpu $pid | tail -n1 | grep " 0.")" != "" ];then is_sus=1; sus_counter=$((sus_counter+1)); else sus_counter=0; fi;
+ '.($debug ? 'ps -o pcpu $pid; echo "[monitor] $pid=is_sus:$is_sus sus_counter:$sus_counter"': "").'
+ if [ "$sus_counter" -gt "$max_sus" ]; then '.($debug ? 'echo "max_sus reached : kill $pid"': "").'
+ kill $pid 2>/dev/null;
+ wait $pid 2>/dev/null;
+ return 1; # signal FAIL -> max_sus is reached
+ fi;
+ sleep $sleep_time; it=$((it+1));
+ done
+ '.($debug ? 'echo "[monitor] all ok $pid";': "").'
+ return 0; # signal OK
+ ) &
+ pid_monitor=$!;
+
+ wait $pid; # wait on a finish from cmd -> if the monitor stops, this will halt as well
+ ret_pid=$?;
+
+ if [ "$ret_pid" -eq "0" ]; then # command did run with OK output -> terminate this script (if monitor stops: this will never be 0)
+ if kill -0 $pid_monitor 2>/dev/null; then kill $pid_monitor 2>/dev/null; wait $pid_monitor 2>/dev/null; fi; # kill the monitor if it is running
+ return $ret_pid;
+ fi;
+
+ wait $pid_monitor; # await the monitor (one last sleep tick)
+ ret_monitor=$?; # this will be 1 if max_sus is reached
+
+ '.($debug ? 'echo "[$restart_counter] pid:$pid (monitor=$pid_monitor) => ret_pid=$ret_pid, ret_monitor=$ret_monitor"': "").'
+
+ if [ "$ret_monitor" != "0" ]; then # restart the process -> another work() call
+ if [ "$restart_counter" -lt "$max_restarts" ];then
+ restart_counter=$((restart_counter+1));
+ '.($debug ? 'echo "[main] $pid restart (restart_counter=$restart_counter) return work";': "").'
+ return &work;
+ else
+ '.($debug ? 'echo "[main] $pid max_restarts reached return 1";': "").'
+ return 1; # max restarts are reached -> terminate with 1
+ fi
+ else
+ '.($debug ? 'echo "[main] $pid return $ret_pid";': "").'
+ return $ret_pid; # monitor was fine but the program did terminate -> passthrough
+ fi
+ }
+
+ if [ "$1" != "" ]; then
+ cmd=$1;
+ pid=-1;
+
+ precheck=1;
+ if ! command -v ps >/dev/null 2>&1; then precheck=0; fi;
+ if ! command -v grep >/dev/null 2>&1; then precheck=0; fi;
+ if ! command -v tail >/dev/null 2>&1; then precheck=0; fi;
+ if ! command -v kill >/dev/null 2>&1; then precheck=0; fi;
+ if ! command -v wait >/dev/null 2>&1; then precheck=0; fi;
+
+ if [ "$precheck" -eq "1" ]; then
+ #trap \'max_monitor_time=0; echo "KILL"; jobs -p; jobs -p | xargs kill -1 2>/dev/null; echo "KILL DONE";jobs -p; exit\' 2 9 EXIT;
+ exit &work;
+ else
+ '.($debug ? 'echo "precheck failed, using now a simple eval call"': "").'
+ eval $cmd;
+ exit $?;
+ fi
+ fi
+}';
+
##########################################################################################
# Check parameters
##########################################################################################
@@ -781,12 +876,11 @@ if ($jobnumber != -1) {
$run_id = "_$jobnumber"."_".$split_to_X_jobs;
}
-
-our $gccversionstr = `gcc --version 2>/dev/null`;
-our $gccversion_main = "";
-if ($? == 0) {
- $gccversion_main=($gccversionstr =~ m/^[^\n]+ ([\d]+)\.[\d\.]+\n/);
-}
+#our $gccversionstr = `gcc --version 2>/dev/null`;
+#our $gccversion_main = "";
+#if ($? == 0) {
+# $gccversion_main=($gccversionstr =~ m/^[^\n]+ ([\d]+)\.[\d\.]+\n/);
+#}
#our $ompprocbind="close";
#if($gccversion_main eq "" || $gccversion_main < 5){
# $ompprocbind="true";
@@ -896,20 +990,20 @@ if($step!=1 && $step!=4){
unlink($desctable);
}
-if($tmp_path eq ""){$tmp_path=".";}
-if( $tmp_path !~ qr/proteinortho_cache_$project/ ){
- if(! -d $tmp_path."/proteinortho_cache_$project"){
- system("mkdir $tmp_path/proteinortho_cache_$project");
- if( -d $tmp_path."/proteinortho_cache_$project"){
- $tmp_path.="/proteinortho_cache_$project/";
+if( $tmp_path eq "" ){$tmp_path=".";}
+my $suffixjob="";
+if($jobnumber != -1){ $suffixjob = "_${jobnumber}_${jobs_todo}" }
+if( $tmp_path !~ qr/proteinortho_cache_$project$suffixjob/ ){
+ if(! -d $tmp_path."/proteinortho_cache_$project$suffixjob"){
+ system("mkdir $tmp_path/proteinortho_cache_$project$suffixjob");
+ if( -d $tmp_path."/proteinortho_cache_$project$suffixjob"){
+ $tmp_path.="/proteinortho_cache_$project$suffixjob/";
}
}else{
- $tmp_path.="/proteinortho_cache_$project/";
+ $tmp_path.="/proteinortho_cache_$project$suffixjob/";
}
}else{
- if(!-d $tmp_path){
- mkdir("$tmp_path");
- }
+ if(! -d $tmp_path){ mkdir("$tmp_path"); }
}
if($inproject ne $project){ system("cd '$tmp_path/'; ln -s ../proteinortho_cache_$inproject/* ."); }
@@ -1001,15 +1095,15 @@ if(scalar keys %isoform_mapping == 0 && $isoform ne ""){print STDERR "\n!!!!\nWA
if ($step == 0 || $step == 1) {
if($verbose){print STDERR "\n$GREEN**Step 1**$NC\n";}
- $do_log=1;
- $_->kill('KILL')->detach foreach threads->list();
- threads->create('log_Worker');
- foreach my $file (@files) { $step1_QUEUE->enqueue($file); }
- if($debug){print STDERR "Creating ".(scalar @files > $cpus+10 || $debug ? $cpus : 1)." generate_indices_Worker (cpus=$cpus) for ".$step1_QUEUE->pending." jobs\n"}
- if($debug){print STDERR "currently running ".(threads->list())." threads (pre spawn)\n";}
- threads->create('generate_indices_Worker');
+ #$do_log=1;
+ #$_->kill('KILL')->detach foreach threads->list();
+ #threads->create('log_Worker');
+ #foreach my $file (@files) { $step1_QUEUE->enqueue($file); }
+ #if($debug){print STDERR "Creating ".(scalar @files > $cpus+10 || $debug ? $cpus : 1)." generate_indices_Worker (cpus=$cpus) for ".$step1_QUEUE->pending." jobs\n"}
+ #if($debug){print STDERR "currently running ".(threads->list())." threads (pre spawn)\n";}
+ #threads->create('generate_indices_Worker');
- # INFO:
+ # !! INFO: !!
# here we use only one thread to generate the db files !
# -> diamond will fail if multiple makedb instances are running in parallel,
# prducing un-joinable running threads (stuck within the `$cmd 2>&1` in the generate_indices)
@@ -1018,11 +1112,12 @@ if ($step == 0 || $step == 1) {
# code for multiple threads (does not work sometimes for diamond):
#for (my $i = 0; $i < ((scalar @files > 500 && scalar @files > $cpus+10) || $debug ? $cpus : 1) ; $i++) { my $tmp = threads->create('generate_indices_Worker') } # spawn a thread for each core
- if($debug){print STDERR "currently running ".(threads->list())." threads (post spawn)\n";}
- $_->join() foreach threads->list();
-# $_->kill('KILL')->detach foreach threads->list();
+ #if($debug){print STDERR "currently running ".(threads->list())." threads (post spawn)\n";}
+ #$_->join() foreach threads->list();
- #&generate_indices(@files); # Generate index files for blast
+ foreach my $file (@files){
+ &generate_indices($file);
+ }; # Generate index files for blast
if ($desc) {
&write_descriptions; # Write sequence description file
}
@@ -1510,6 +1605,8 @@ Options:
-project= prefix for all result file names [default: myproject]
-inproject= load data from this namespace instead (works with intermediate files for step=2 and blast-graph for step=3)
-cpus= number of processors to use [default: auto]
+ -threads_per_process= number of threads per process [default: $threads_per_process],
+ e.g. using -cpus=4 and -threads_per_process=2 will spawn 4 workerthreads using 2 cpu cores each = total of 8 cores
-silent sets verbose to 0
-temp= path for temporary files [default: working directory]
-keep stores temporary blast results for reuse (proteinortho_cache_project directory).
@@ -1679,6 +1776,7 @@ sub run_blast {
&print_blast_stats();
# Spawn worker threads
+
for (my $i = 0; $i < $cpus; $i++) {threads->create('workerthread');}
# For each file against each other file
@@ -2090,6 +2188,7 @@ sub get_gene_lengths {
my $last_gene = "";
my $lengths_lastgene=0;
while (my $file = shift @_) {
+ my $abs_path_file=abs_path "$file "; $abs_path_file=~s/ $//g;
open(FASTA,"<$file") || &Error("Could not open '$file': $!");
while (<FASTA>) {
$_=~s/[\r\n]+$//;
@@ -2098,7 +2197,7 @@ sub get_gene_lengths {
if($last_gene ne ""){
if( (exists $lengths{$last_gene} && $lengths_lastgene > $lengths{$last_gene}) || !exists $lengths{$last_gene}){
$lengths{$last_gene}=$lengths_lastgene;
- if($blastmode eq "autoblast" && $autoblast_fileis{$file} eq "nucl"){
+ if($blastmode eq "autoblast" && $autoblast_fileis{$abs_path_file} eq "nucl"){
$lengths{$last_gene}/=3;
}
$lengths{&convertUniprotAndNCBI($last_gene)}=$lengths{$last_gene};
@@ -2121,7 +2220,7 @@ sub get_gene_lengths {
if($last_gene ne ""){
if( (exists $lengths{$last_gene} && $lengths_lastgene > $lengths{$last_gene}) || !exists $lengths{$last_gene}){
$lengths{$last_gene}=$lengths_lastgene;
- if($blastmode eq "autoblast" && $autoblast_fileis{$file} eq "nucl"){
+ if($blastmode eq "autoblast" && $autoblast_fileis{$abs_path_file} eq "nucl"){
$lengths{$last_gene}/=3;
}
$lengths{&convertUniprotAndNCBI($last_gene)}=$lengths{$last_gene};
@@ -2527,14 +2626,15 @@ sub generate_indices {
if ($verbose) {$msg.= "The database for '$file' is present and will be used\n";}
}else{
if ($verbose) {$msg.= "Building database for '$file'\t(".$gene_counter{$file}." sequences)\n";}
- if(!exists $autoblast_fileis{$file}){
+ my $abs_path_file=abs_path "$file "; $abs_path_file=~s/ $//g;
+ if(!exists $autoblast_fileis{$abs_path_file}){
$msg.= "$ORANGE\n[WARNING]$NC I could not detect the type of '$file', i assume aminoacid sequences...\n";
- $autoblast_fileis{$file}="prot";
+ $autoblast_fileis{$abs_path_file}="prot";
}
- if ($debug) {$msg.= "$makedb -dbtype ".$autoblast_fileis{$file}." -in '$file' -out '$file.$blastmode'\n";}
- $cmd="$makedb -dbtype ".$autoblast_fileis{$file}." -in '$file' -out '$file.$blastmode'";
+ if ($debug) {$msg.= "$makedb -dbtype ".$autoblast_fileis{$abs_path_file}." -in '$file' -out '$file.$blastmode'\n";}
+ $cmd="$makedb -dbtype ".$autoblast_fileis{$abs_path_file}." -in '$file' -out '$file.$blastmode'";
my $makedb_ret = `$cmd 2>&1`;
- if ($? != 0) {$msg.= ("$ORANGE\n[WARNING]$NC ".$blastmode." failed to create a database ($makedb -dbtype ".$autoblast_fileis{$file}." -in '$file' -out '$file.$blastmode'). The output is:\n-----------------\n$makedb_ret-----------------\nI will now proceed with writing the database files to the DB/ directory in $tmp_path (-tmp)."); if($step==1){$msg.= "$ORANGE Please ensure that you use -tmp=$tmp_path -keep (and use the same -project= name) for future analysis.$NC";}print "\n";
+ if ($? != 0) {$msg.= ("$ORANGE\n[WARNING]$NC ".$blastmode." failed to create a database ($makedb -dbtype ".$autoblast_fileis{$abs_path_file}." -in '$file' -out '$file.$blastmode'). The output is:\n-----------------\n$makedb_ret-----------------\nI will now proceed with writing the database files to the DB/ directory in $tmp_path (-tmp)."); if($step==1){$msg.= "$ORANGE Please ensure that you use -tmp=$tmp_path -keep (and use the same -project= name) for future analysis.$NC";}print "\n";
mkdir("$tmp_path/DB");
if($step==1){$oldkeep=$keep;$keep=1;}
@@ -2570,7 +2670,8 @@ sub generate_indices {
}
##MARK_FOR_NEW_BLAST_ALGORITHM
if($debug){print STDERR "generate_indices:$file almost end\n"}
- $log_QUEUE->enqueue($msg);
+ if($verbose){print STDERR $msg}
+ #$log_QUEUE->enqueue($msg);
if($debug){print STDERR "generate_indices:$file end\n"}
return;
}
@@ -2583,12 +2684,15 @@ sub blast {
my $a = $_[0];
my $b = $_[1];
my $thread_id = $_[2];
- if( $blastmode eq "mmseqsp" || $blastmode eq "mmseqsn" || $blastmode eq "mmseqs" ){ if(!-d "$tmp_path/mmseqs".$thread_id ){mkdir("$tmp_path/mmseqs".$thread_id);}else{rmdir("$tmp_path/mmseqs".$thread_id);mkdir("$tmp_path/mmseqs".$thread_id);}}
+ #if( $blastmode eq "mmseqsp" || $blastmode eq "mmseqsn" || $blastmode eq "mmseqs" ){ if(!-d "$tmp_path/mmseqs".$thread_id ){mkdir("$tmp_path/mmseqs".$thread_id);}else{rmdir("$tmp_path/mmseqs".$thread_id);mkdir("$tmp_path/mmseqs".$thread_id);}}
$a = basename($a);
$b = basename($b);
- my $bla = "$tmp_path$b.vs.$a.".$blastmode;
+ my $bla = "$b.vs.$a.".$blastmode;
+
+ my $WD="${tmp_path}/${bla}_WD";
+ mkdir($WD);
my $blastOptions_diamond=''; #additional option --sensitive (if gene length is larger than 40)
my $max_genlen = 1;
@@ -2607,12 +2711,12 @@ sub blast {
my ($fileType) = $a =~ /\.(\w*)$/;
- $a = $_[0]; # get a b again for blast algorithm (now they are not basenamed)
- $b = $_[1];
+ $a = abs_path "$_[0] "; # get a b again for blast algorithm (now they are not basenamed)
+ $b = abs_path "$_[1] ";
+ $a=~s/ $//g; # this is a trick, to get the abs_path of the symlink (not the target) 1. add whitespace to filename 2. call abs_path 3. remove whitespace
+ $b=~s/ $//g; # otherwise the target is printed, which does not neccesarilly contain the DB files
- if(-e $tmp_path."/DB/".basename($a)){
- $a = $tmp_path."/DB/".basename($a);
- }
+ if(-e $tmp_path."/DB/".basename($a)){ $a = "../DB/".basename($a); }
#
# 1. run the blast algorithm -> produce a $bla.tmp
@@ -2625,131 +2729,145 @@ sub blast {
#
# sh -ic '{ cat 1>&3; kill 0; } | { sleep 2; kill 0; }' 3>&1
- if ($blastmode eq "blastp_legacy" || $blastmode eq "blastn_legacy" || $blastmode eq "tblastx_legacy") {lock($threads_per_process); $command = $binpath."blastall -a $threads_per_process -d '$a.$blastmode' -i '$b' -p $blastmode -m8 -e $evalue $blastOptions $printSTDERR ".($keep ? " | tee '$bla.tmp'" : "");}
- elsif ($blastmode eq "autoblast" && $autoblast_fileis{$a} eq "prot" && $autoblast_fileis{$b} eq "prot") {lock($threads_per_process); $command = $binpath."blastp -num_threads $threads_per_process -db '$a.$blastmode' -query '$b' -evalue $evalue -outfmt 6 $blastOptions $printSTDERR ".($keep ? " | tee '$bla.tmp'" : "");}
- elsif ($blastmode eq "autoblast" && $autoblast_fileis{$a} eq "nucl" && $autoblast_fileis{$b} eq "nucl") {lock($threads_per_process); $command = $binpath."blastn -num_threads $threads_per_process -db '$a.$blastmode' -query '$b' -evalue $evalue -outfmt 6 $blastOptions $printSTDERR ".($keep ? " | tee '$bla.tmp'" : "");}
- elsif ($blastmode eq "autoblast" && $autoblast_fileis{$a} eq "nucl" && $autoblast_fileis{$b} eq "prot") {lock($threads_per_process); $command = $binpath."tblastn -num_threads $threads_per_process -db '$a.$blastmode' -query '$b' -evalue $evalue -outfmt 6 $blastOptions $printSTDERR ".($keep ? " | tee '$bla.tmp'" : "");}
- elsif ($blastmode eq "autoblast" && $autoblast_fileis{$a} eq "prot" && $autoblast_fileis{$b} eq "nucl") {lock($threads_per_process); $command = $binpath."blastx -num_threads $threads_per_process -db '$a.$blastmode' -query '$b' -evalue $evalue -outfmt 6 $blastOptions $printSTDERR ".($keep ? " | tee '$bla.tmp'" : "");}
- elsif ($blastmode eq "blastp+") {lock($threads_per_process); $command = $binpath."blastp -num_threads $threads_per_process -db '$a.$blastmode' -query '$b' -evalue $evalue -outfmt 6 $blastOptions $printSTDERR ".($keep ? " | tee '$bla.tmp'" : "");}
- elsif ($blastmode eq "blastn+") {lock($threads_per_process); $command = $binpath."blastn -num_threads $threads_per_process -db '$a.$blastmode' -query '$b' -evalue $evalue -outfmt 6 $blastOptions $printSTDERR ".($keep ? " | tee '$bla.tmp'" : "");}
- elsif ($blastmode eq "tblastx+") {lock($threads_per_process); $command = $binpath."tblastx -num_threads $threads_per_process -db '$a.$blastmode' -query '$b' -evalue $evalue -outfmt 6 $blastOptions $printSTDERR ".($keep ? " | tee '$bla.tmp'" : "");}
- elsif ($blastmode eq "blatp") {lock($threads_per_process); $command = $binpath."blat -prot '$a' '$b' -out=blast8 '$bla.tmp' $blastOptions >\/dev\/null $printSTDERR";}
- elsif ($blastmode eq "blatn") {lock($threads_per_process); $command = $binpath."blat '$a' '$b' -out=blast8 '$bla.tmp' $blastOptions >\/dev\/null $printSTDERR";}
- elsif ($blastmode eq "blatx") {lock($threads_per_process); $command = $binpath."blat '$a' '$b' -t=dnax -q=dnax -out=blast8 '$bla.tmp' $blastOptions >\/dev\/null $printSTDERR";}
- #elsif ($blastmode eq "pblatp") {lock($threads_per_process); $command = $binpath."pblat -prot '$a' '$b' -threads=$threads_per_process -out=blast8 '$bla' >\/dev\/null 2>&1";}
- #elsif ($blastmode eq "pblatn") {lock($threads_per_process); $command = $binpath."pblat '$a' '$b' -threads=$threads_per_process -out=blast8 '$bla' >\/dev\/null 2>&1";}
- #elsif ($blastmode eq "pblatx") {lock($threads_per_process); $command = $binpath."pblat '$a' '$b' -t=dnax -q=dnax -threads=$threads_per_process -out=blast8 '$bla' >\/dev\/null 2>&1";}
- elsif ($blastmode eq "diamond" ) {lock($threads_per_process); $command = $binpath."diamond blastp --threads $threads_per_process --db '$a.$blastmode' --query '$b' -e $evalue --outfmt 6 --quiet $blastOptions $blastOptions_diamond $printSTDERR ".($keep ? " | tee '$bla.tmp'" : "");}
- elsif ($blastmode eq "topaz" ) {lock($threads_per_process); $command = $binpath."topaz search -T $threads_per_process -p '$a.$blastmode' -f '$b' -E $evalue --blasttab $blastOptions >'$bla.tmp' $printSTDERR";}
- elsif ($blastmode eq "ublast") {lock($threads_per_process); $command = $binpath."usearch -ublast '$b' -db '$a.$blastmode' -threads $threads_per_process -evalue $evalue -blast6out '$bla.tmp' $blastOptions >\/dev\/null $printSTDERR";}
- elsif ($blastmode eq "usearch") {lock($threads_per_process); $command = $binpath."usearch -usearch_local '$b' -db '$a.$blastmode' -threads $threads_per_process -evalue $evalue -id 0 -blast6out '$bla.tmp' $blastOptions >\/dev\/null $printSTDERR";}
- elsif ($blastmode eq "rapsearch") {lock($threads_per_process); $command = $binpath."rapsearch -s T -d '$a.$blastmode' -q '$b' -z $threads_per_process -e $evalue -out '$bla.tmp' $blastOptions >\/dev\/null $printSTDERR";}
- elsif ($blastmode eq "lastp") {lock($threads_per_process); $command = $binpath."lastal -E$evalue -fBlastTab+ -pBL62 -P$threads_per_process '$a.$blastmode' '$b' $blastOptions > '$bla.tmp' $printSTDERR";}
- elsif ($blastmode eq "lastn") {lock($threads_per_process); $command = $binpath."lastal -E$evalue -fBlastTab+ -P$threads_per_process '$a.$blastmode' '$b' $blastOptions > '$bla.tmp' $printSTDERR";}
- elsif ($blastmode eq "mmseqsn" || $blastmode eq "mmseqsp") {lock($threads_per_process); $command = $binpath."mmseqs search '$b.$blastmode' '$a.$blastmode' '$bla.tmp' $tmp_path/mmseqs$thread_id --threads $threads_per_process -e $evalue $blastOptions >\/dev\/null $printSTDERR";}
+ if ($blastmode eq "blastp_legacy" || $blastmode eq "blastn_legacy" || $blastmode eq "tblastx_legacy") {lock($threads_per_process); $command = $binpath."blastall -a $threads_per_process -d '$a.$blastmode' -i '$b' -p $blastmode -m8 -e $evalue $blastOptions $printSTDERR >'../$bla.tmp'";}# .($keep ? " | tee '../$bla.tmp'" : "");}
+ elsif ($blastmode eq "autoblast" && $autoblast_fileis{$a} eq "prot" && $autoblast_fileis{$b} eq "prot") {lock($threads_per_process); $command = $binpath."blastp -num_threads $threads_per_process -db '$a.$blastmode' -query '$b' -evalue $evalue -outfmt 6 $blastOptions $printSTDERR >'../$bla.tmp'";}# .($keep ? " | tee '../$bla.tmp'" : "");}
+ elsif ($blastmode eq "autoblast" && $autoblast_fileis{$a} eq "nucl" && $autoblast_fileis{$b} eq "nucl") {lock($threads_per_process); $command = $binpath."blastn -num_threads $threads_per_process -db '$a.$blastmode' -query '$b' -evalue $evalue -outfmt 6 $blastOptions $printSTDERR >'../$bla.tmp'";}# .($keep ? " | tee '../$bla.tmp'" : "");}
+ elsif ($blastmode eq "autoblast" && $autoblast_fileis{$a} eq "nucl" && $autoblast_fileis{$b} eq "prot") {lock($threads_per_process); $command = $binpath."tblastn -num_threads $threads_per_process -db '$a.$blastmode' -query '$b' -evalue $evalue -outfmt 6 $blastOptions $printSTDERR >'../$bla.tmp'";}# .($keep ? " | tee '../$bla.tmp'" : "");}
+ elsif ($blastmode eq "autoblast" && $autoblast_fileis{$a} eq "prot" && $autoblast_fileis{$b} eq "nucl") {lock($threads_per_process); $command = $binpath."blastx -num_threads $threads_per_process -db '$a.$blastmode' -query '$b' -evalue $evalue -outfmt 6 $blastOptions $printSTDERR >'../$bla.tmp'";}# .($keep ? " | tee '../$bla.tmp'" : "");}
+ elsif ($blastmode eq "blastp+") {lock($threads_per_process); $command = $binpath."blastp -num_threads $threads_per_process -db '$a.$blastmode' -query '$b' -evalue $evalue -outfmt 6 $blastOptions $printSTDERR >'../$bla.tmp'";}# .($keep ? " | tee '../$bla.tmp'" : "");}
+ elsif ($blastmode eq "blastn+") {lock($threads_per_process); $command = $binpath."blastn -num_threads $threads_per_process -db '$a.$blastmode' -query '$b' -evalue $evalue -outfmt 6 $blastOptions $printSTDERR >'../$bla.tmp'";}# .($keep ? " | tee '../$bla.tmp'" : "");}
+ elsif ($blastmode eq "tblastx+") {lock($threads_per_process); $command = $binpath."tblastx -num_threads $threads_per_process -db '$a.$blastmode' -query '$b' -evalue $evalue -outfmt 6 $blastOptions $printSTDERR >'../$bla.tmp'";}# .($keep ? " | tee '../$bla.tmp'" : "");}
+ elsif ($blastmode eq "blatp") {lock($threads_per_process); $command = $binpath."blat -prot '$a' '$b' -out=blast8 '../$bla.tmp' $blastOptions >\/dev\/null $printSTDERR";}
+ elsif ($blastmode eq "blatn") {lock($threads_per_process); $command = $binpath."blat '$a' '$b' -out=blast8 '../$bla.tmp' $blastOptions >\/dev\/null $printSTDERR";}
+ elsif ($blastmode eq "blatx") {lock($threads_per_process); $command = $binpath."blat '$a' '$b' -t=dnax -q=dnax -out=blast8 '../$bla.tmp' $blastOptions >\/dev\/null $printSTDERR";}
+ #elsif ($blastmode eq "pblatp") {lock($threads_per_process); $command = $binpath."pblat -prot '$a' '$b' -threads=$threads_per_process -out=blast8 '../$bla' >\/dev\/null 2>&1";}
+ #elsif ($blastmode eq "pblatn") {lock($threads_per_process); $command = $binpath."pblat '$a' '$b' -threads=$threads_per_process -out=blast8 '../$bla' >\/dev\/null 2>&1";}
+ #elsif ($blastmode eq "pblatx") {lock($threads_per_process); $command = $binpath."pblat '$a' '$b' -t=dnax -q=dnax -threads=$threads_per_process -out=blast8 '../$bla' >\/dev\/null 2>&1";}
+ elsif ($blastmode eq "diamond" ) {lock($threads_per_process); $command = $binpath."diamond blastp --threads $threads_per_process --db '$a.$blastmode' --query '$b' -e $evalue --outfmt 6 --quiet $blastOptions $blastOptions_diamond $printSTDERR >'../$bla.tmp'";}# .($keep ? " | tee '../$bla.tmp'" : "");}
+ elsif ($blastmode eq "topaz" ) {lock($threads_per_process); $command = $binpath."topaz search -T $threads_per_process -p '$a.$blastmode' -f '$b' -E $evalue --blasttab $blastOptions >'../$bla.tmp' $printSTDERR";}
+ elsif ($blastmode eq "ublast") {lock($threads_per_process); $command = $binpath."usearch -ublast '$b' -db '$a.$blastmode' -threads $threads_per_process -evalue $evalue -blast6out '../$bla.tmp' $blastOptions >\/dev\/null $printSTDERR";}
+ elsif ($blastmode eq "usearch") {lock($threads_per_process); $command = $binpath."usearch -usearch_local '$b' -db '$a.$blastmode' -threads $threads_per_process -evalue $evalue -id 0 -blast6out '../$bla.tmp' $blastOptions >\/dev\/null $printSTDERR";}
+ elsif ($blastmode eq "rapsearch") {lock($threads_per_process); $command = $binpath."rapsearch -s T -d '$a.$blastmode' -q '$b' -z $threads_per_process -e $evalue -out '../$bla.tmp' $blastOptions >\/dev\/null $printSTDERR";}
+ elsif ($blastmode eq "lastp") {lock($threads_per_process); $command = $binpath."lastal -E$evalue -fBlastTab+ -pBL62 -P$threads_per_process '$a.$blastmode' '$b' $blastOptions > '../$bla.tmp' $printSTDERR";}
+ elsif ($blastmode eq "lastn") {lock($threads_per_process); $command = $binpath."lastal -E$evalue -fBlastTab+ -P$threads_per_process '$a.$blastmode' '$b' $blastOptions > '../$bla.tmp' $printSTDERR";}
+ elsif ($blastmode eq "mmseqsn" || $blastmode eq "mmseqsp") {lock($threads_per_process); $command = $binpath."mmseqs search '$b.$blastmode' '$a.$blastmode' '../$bla.tmp' mmseqs$thread_id --threads $threads_per_process -e $evalue $blastOptions >\/dev\/null $printSTDERR";}
else {&Error("This should not happen! Please submit the FASTA file(s) and the parameter vector (above) to incoming+paulklemm-phd-proteinortho-7278443-issue-\@incoming.gitlab.com to help fixing this issue.");}
##MARK_FOR_NEW_BLAST_ALGORITHM
if($identical || $range>=0){ $command=~s/-?-outfmt 6 (?!qseqid)//g; } # remove additional outfmt 6 (not followed by qseqid)
# NOTE diamond fails on --outfmt '6 qseqid ...' but is ok with --outfmt 6 qseqid ...
+ my $plain_command=$command;
+ if ( $blastmode eq "mmseqsn" || $blastmode eq "mmseqsp" ){
+ $command = "(cd '$WD' && $command)\n";
+ }else{
+ $command = "cat << 'EOF'|sh;\n$proteinortho_exec; (cd '$WD' && proteinortho_exec \"($command)\")\nEOF\n";
+ }
+
# File does not exists yet or I am forced to rewrite it
- if (!(-e $bla) || $force) {
+ if (!(-e "$tmp_path/$bla") || $force) {
- if (-e $bla && $force) { unlink("$bla"); }
+ if (-e "$tmp_path/$bla" && $force) { unlink("$tmp_path/$bla"); }
if ($debug || $verbose==2) {print STDERR "$command\n";} # 5.16
if ($blastmode eq "diamond") {
- @data=`$command`;
- if ($? != 0) {
- $command = $binpath."diamond blastp --ignore-warnings --threads $threads_per_process --db '$a.$blastmode' --query '$b' -e $evalue --outfmt 6 --quiet $blastOptions $blastOptions_diamond $printSTDERR ".($keep ? " | tee '$bla.tmp'" : "");
- @data=`$command`;
- if ($? != 0) {&TypicalBlastError($blastmode,$command);}
- }
- if($debug eq "test_sort"){while (<"$bla.tmp">){if ($_ =~ /[^\t]+([,])[^\t]+[eE]/) {print "found forbidden symbol '$1' at $_ in $bla\n";&reset_locale();die;}}}
- if($keep){system("mv '$bla.tmp' '$bla'");}
+
+ system("$command");
+ if ($? != 0) {&TypicalBlastError($blastmode,$plain_command,$?,$!);}
+ #if ($? != 0) {
+ # $command = $binpath."diamond blastp --ignore-warnings --threads $threads_per_process --db '$a.$blastmode' --query '$b' -e $evalue --outfmt 6 --quiet $blastOptions $blastOptions_diamond $printSTDERR ".($keep ? " | tee '../$bla.tmp'" : "");
+ # @data=`(cd '$WD' && $command)`;
+ # if ($? != 0) {&TypicalBlastError($blastmode,$plain_command,$?,$!);}
+ #}
+ if($debug eq "test_sort"){while (<"$tmp_path/$bla.tmp">){if ($_ =~ /[^\t]+([,])[^\t]+[eE]/) {print "found forbidden symbol '$1' at $_ in $tmp_path/$bla\n";&reset_locale();die;}}}
+ system("mv '$tmp_path/$bla.tmp' '$tmp_path/$bla'");
}elsif ($blastmode eq "rapsearch") {
system("$command");
- if ($? != 0) {&TypicalBlastError($blastmode,$command);}
+ if ($? != 0) {&TypicalBlastError($blastmode,$plain_command,$?,$!);}
# NOTE -s f does not work everytime, therefore here is a conversion of -s t to f (ln -> evalues)
- open(OUTM,">>$bla.m82");open(INM,"<$bla.tmp.m8");
+ open(OUTM,">>$tmp_path/$bla.m82");open(INM,"<$tmp_path/$bla.tmp.m8");
while (<INM>){if(length($_)==0 || substr($_,0,1)eq"#"){next;}my @arr=split("\t",$_); print OUTM $arr[0]."\t".$arr[1]."\t".$arr[2]."\t".$arr[3]."\t".$arr[4]."\t".$arr[5]."\t".$arr[6]."\t".$arr[7]."\t".$arr[8]."\t".$arr[9]."\t".exp($arr[10])."\t".$arr[11];}
close(OUTM);close(INM);
- if($debug eq "test_sort"){while (<"$bla.tmp.m8">){if ($_ =~ /[^\t]+([,])[^\t]+[eE]/) {print "found forbidden symbol '$1' at $_ in $bla.m8\n";&reset_locale();die;}}}
- unlink "$bla.aln"; #remove aln file
- system("tail -n +6 '$bla.m82' >'$bla'"); # remove head/comment lines of rapsearch
- unlink '$bla.m82';
- unlink '$bla.tmp.m8';
+ if($debug eq "test_sort"){while (<"$tmp_path/$bla.tmp.m8">){if ($_ =~ /[^\t]+([,])[^\t]+[eE]/) {print "found forbidden symbol '$1' at $_ in $tmp_path/$bla.m8\n";&reset_locale();die;}}}
+ unlink "$tmp_path/$bla.aln"; #remove aln file
+ system("tail -n +6 '$tmp_path/$bla.m82' >'$tmp_path/$bla'"); # remove head/comment lines of rapsearch
+ unlink "$tmp_path/$bla.m82";
+ unlink "$tmp_path/$bla.tmp.m8";
}elsif ($blastmode eq "usearch" || $blastmode eq "ublast") {
system("$command");
- if ($? != 0) {&TypicalBlastError($blastmode,$command);}
- if($debug eq "test_sort"){while (<"$bla.tmp">){if ($_ =~ /[^\t]+([,])[^\t]+[eE]/) {print "found forbidden symbol '$1' at $_ in $bla\n";&reset_locale();die;}}}
- system("perl $po_path/proteinortho_formatUsearch.pl '$bla.tmp' >'$bla'"); # problem with ublast/usearch: gene names include the description.
- unlink "$bla.tmp";
+ if ($? != 0) {&TypicalBlastError($blastmode,$plain_command,$?,$!);}
+ if($debug eq "test_sort"){while (<"$tmp_path/$bla.tmp">){if ($_ =~ /[^\t]+([,])[^\t]+[eE]/) {print "found forbidden symbol '$1' at $_ in $tmp_path/$bla\n";&reset_locale();die;}}}
+ system("perl $po_path/proteinortho_formatUsearch.pl '$tmp_path/$bla.tmp' >'$tmp_path/$bla'"); # problem with ublast/usearch: gene names include the description.
+ unlink "$tmp_path/$bla.tmp";
}elsif ($blastmode eq "topaz") {
system("$command");
- if ($? != 0) {&TypicalBlastError($blastmode,$command);}
- if($debug eq "test_sort"){while (<"$bla.tmp">){if ($_ =~ /[^\t]+([,])[^\t]+[eE]/) {print "found forbidden symbol '$1' at $_ in $bla\n";&reset_locale();die;}}}
- system("tail -n +2 '$bla.tmp' > '$bla'");
+ if ($? != 0) {&TypicalBlastError($blastmode,$plain_command,$?,$!);}
+ if($debug eq "test_sort"){while (<"$tmp_path/$bla.tmp">){if ($_ =~ /[^\t]+([,])[^\t]+[eE]/) {print "found forbidden symbol '$1' at $_ in $tmp_path/$bla\n";&reset_locale();die;}}}
+ system("tail -n +2 '$tmp_path/$bla.tmp' > '$tmp_path/$bla'");
unlink "timing.txt";
- unlink "$bla.tmp";
+ unlink "$tmp_path/$bla.tmp";
}elsif ($blastmode eq "lastp" || $blastmode eq "lastn") {
system("$command");
- if ($? != 0) {&TypicalBlastError($blastmode,$command);}
- if($debug eq "test_sort"){while (<"$bla.tmp">){if ($_ =~ /[^\t]+([,])[^\t]+[eE]/) {print "found forbidden symbol '$1' at $_ in $bla\n";&reset_locale();die;}}}
- system("mv '$bla.tmp' '$bla'");
+ if ($? != 0) {&TypicalBlastError($blastmode,$plain_command,$?,$!);}
+ if($debug eq "test_sort"){while (<"$tmp_path/$bla.tmp">){if ($_ =~ /[^\t]+([,])[^\t]+[eE]/) {print "found forbidden symbol '$1' at $_ in $tmp_path/$bla\n";&reset_locale();die;}}}
+ system("mv '$tmp_path/$bla.tmp' '$tmp_path/$bla'");
}elsif ($blastmode =~ m/.*blat.*/) {
system("$command");
- if ($? != 0) {&TypicalBlastError($blastmode,$command);}
- if($debug eq "test_sort"){while (<"$bla.tmp">){if ($_ =~ /[^\t]+([,])[^\t]+[eE]/) {print "found forbidden symbol '$1' at $_ in $bla\n";&reset_locale();die;}}}
- system('awk -F\'\t\' \'{if($11<'.$evalue.')print $0}\' \''.$bla.'.tmp\' > \''.$bla.'\'');
- unlink "$bla.tmp";
+ if ($? != 0) {&TypicalBlastError($blastmode,$plain_command,$?,$!);}
+ if($debug eq "test_sort"){while (<"$tmp_path/$bla.tmp">){if ($_ =~ /[^\t]+([,])[^\t]+[eE]/) {print "found forbidden symbol '$1' at $_ in $tmp_path/$bla\n";&reset_locale();die;}}}
+ system('awk -F\'\t\' \'{if($11<'.$evalue.')print $0}\' \''."$tmp_path/$bla".'.tmp\' > \''."$tmp_path/$bla".'\'');
+ unlink "$tmp_path/$bla.tmp";
}elsif ($blastmode eq "mmseqsn" || $blastmode eq "mmseqsp") {
system("$command");
- if ($? != 0) {&TypicalBlastError($blastmode,$command);}
- system($binpath."mmseqs convertalis '$b.$blastmode' '$a.$blastmode' '$bla.tmp' '$bla.tmp2' >\/dev\/null 2>\/dev\/null");
+ if ($? != 0) {&TypicalBlastError($blastmode,$plain_command,$?,$!);}
+
+ if($threads_per_process>1){
+ system("mv '$tmp_path/${bla}'.tmp.index '$tmp_path/${bla}'.tmp-index");
+ system("mv '$tmp_path/${bla}'.tmp.dbtype '$tmp_path/${bla}'.tmp-dbtype");
+ system("if [ ! -e '$tmp_path/$bla' ]; then cat '$tmp_path/$bla'.tmp.* > '$tmp_path/$bla'.tmp; fi");
+ system("mv '$tmp_path/${bla}'.tmp-index '$tmp_path/${bla}'.tmp.index");
+ system("mv '$tmp_path/${bla}'.tmp-dbtype '$tmp_path/${bla}'.tmp.dbtype");
+ }
+
+ system($binpath."mmseqs convertalis '$b.$blastmode' '$a.$blastmode' '$tmp_path/$bla.tmp' '$tmp_path/$bla.tmp2' >\/dev\/null 2>\/dev\/null");
- if($debug eq "test_sort"){while (<"$bla.tmp">){if ($_ =~ /[^\t]+([,])[^\t]+[eE]/) {print "found forbidden symbol '$1' at $_ in $bla.tmp\n";&reset_locale();die;}}}
- system("rm '$bla'.*index*");
- system('sed -re \'s/([0-9\.]+)(E)(\-|\+)([0-9\.]+)/\1e\3\4/g\' \''.$bla.'.tmp2\' >\''.$bla.'\' 2>\/dev\/null');
- unlink "$bla.dbtype";
- unlink "$bla.tmp2";
- unlink "$bla.index";
- unlink "$bla.tmp";
+ if($debug eq "test_sort"){while (<"$tmp_path/$bla.tmp">){if ($_ =~ /[^\t]+([,])[^\t]+[eE]/) {print "found forbidden symbol '$1' at $_ in $tmp_path/$bla.tmp\n";&reset_locale();die;}}}
+ system('sed -re \'s/([0-9\.]+)(E)(\-|\+)([0-9\.]+)/\1e\3\4/g\' \''."$tmp_path/$bla".'.tmp2\' >\''."$tmp_path/$bla".'\' 2>\/dev\/null');
}else { # -p=blastp,blastn,autoblast, ...
- @data=`$command`;
- if ($? != 0) {&TypicalBlastError($blastmode,$command);}
- if($debug eq "test_sort"){while (<"$bla.tmp">){if ($_ =~ /[^\t]+([,])[^\t]+[eE]/) {print "found forbidden symbol '$1' at $_ in $bla.tmp\n";&reset_locale();die;}}}
- if($keep){system("mv '$bla.tmp' '$bla'");}
+ # @data=`$command`;
+ system("$command");
+ if ($? != 0) {&TypicalBlastError($blastmode,$plain_command,$?,$!);}
+ if($debug eq "test_sort"){while (<"$tmp_path/$bla.tmp">){if ($_ =~ /[^\t]+([,])[^\t]+[eE]/) {print "found forbidden symbol '$1' at $_ in $tmp_path/$bla.tmp\n";&reset_locale();die;}}}
+ system("mv '$tmp_path/$bla.tmp' '$tmp_path/$bla'");
}
##MARK_FOR_NEW_BLAST_ALGORITHM
- if(-e $bla && scalar @data == 0){
- @data = &readFile($bla);
- if(!$keep){unlink $bla}
+ if(-e "$tmp_path/$bla" && scalar @data == 0){
+ @data = &readFile("$tmp_path/$bla");
+ if(!$keep){unlink "$tmp_path/$bla"}
}
}else{
- if ($verbose) {print STDERR "\nNote: '$bla' exists, using pre-calculated data\n";}
+ if ($verbose) {print STDERR "\nNote: '$tmp_path/$bla' exists, using pre-calculated data\n";}
- @data = &readFile($bla);
- if(!$keep){unlink $bla}
+ @data = &readFile("$tmp_path/$bla");
+ if(!$keep){unlink "$tmp_path/$bla"}
}
if($selfblast){
@@ -2783,6 +2901,13 @@ sub readFile {
return @data;
}
+sub failed_to_detected_msg{
+ print STDERR ("\n!!!$ORANGE\n[WARNING]$NC Failed to detect '$blastmode'.$NC\nPlease install $blastmode in $binpath (or specify another binpath with -binpath=/home/...)\nI try now -p=blastp+ as fallback.\n");
+ print STDERR "\nPress 'strg+c' to prevent me from proceeding or wait 10 seconds to continue...\n!!!\n";
+ sleep 10;
+ print STDERR "\nWell then, proceeding...\n\n";
+}
+
sub check_bins {
if($blastmode eq "blast"){print STDERR "$ORANGE\n[WARNING]$NC There is no blastmode -p=$blastmode. I will choose the protein search blastp+ and continue...$NC"; $blastmode="blastp+";}
@@ -2862,10 +2987,7 @@ sub check_bins {
return;
}
}
- print STDERR ("\n!!!$ORANGE\n[WARNING]$NC Failed to detect '$blastmode'.$NC\nPlease install $blastmode in $binpath (or specify another binpath with -binpath=/home/...)\nI try now -p=blastp+ as fallback.\n");
- print STDERR "\nPress 'strg+c' to prevent me from proceeding or wait 10 seconds to continue...\n!!!\n";
- sleep 10;
- print STDERR "\nWell then, proceeding...\n\n";
+ &failed_to_detected_msg();
$blastmode="blastp+";
goto RESTART;
&Error("Failed to detect '$blastmode'! Tried to call '$binpath/$blastmode'\nPlease install $blastmode in $binpath (or specify another binpath with -binpath=/home/...).");
@@ -2879,10 +3001,7 @@ sub check_bins {
$makedb="";
return;
}else{
- print STDERR ("\n!!!$ORANGE\n[WARNING]$NC Failed to detect '$blastmode'.$NC\nPlease install $blastmode in $binpath (or specify another binpath with -binpath=/home/...)\nI try now -p=blastp+ as fallback.\n");
- print STDERR "\nPress 'strg+c' to prevent me from proceeding or wait 10 seconds to continue...\n!!!\n";
- sleep 10;
- print STDERR "\nWell then, proceeding...\n\n";
+ &failed_to_detected_msg();
$blastmode="blastp+";
goto RESTART;
}
@@ -2911,10 +3030,7 @@ sub check_bins {
if($verbose){print STDERR "Detected '$blastmode' version $versionnumber\n";$blastversion=$versionnumber;}
return;
}else{
- print STDERR ("\n!!!$ORANGE\n[WARNING]$NC Failed to detect '$blastmode'.$NC\nPlease install $blastmode in $binpath (or specify another binpath with -binpath=/home/...)\nI try now -p=blastp+ as fallback.\n");
- print STDERR "\nPress 'strg+c' to prevent me from proceeding or wait 10 seconds to continue...\n!!!\n";
- sleep 10;
- print STDERR "\nWell then, proceeding...\n\n";
+ &failed_to_detected_msg();
$blastmode="blastp+";
goto RESTART;
}
@@ -2947,7 +3063,7 @@ sub check_bins {
if (defined($out) && $out =~ /diamond\sversion\s(.+)\n/) {
my @version = split(/\s+/,$1);
my $versionnumber = pop @version;
- $makedb = $binpath."diamond makedb $makeBlastOptions --in";
+ $makedb = $binpath."diamond makedb -p $cpus $makeBlastOptions --in";
if($verbose){print STDERR "Detected '$blastmode' version $versionnumber\n";$blastversion=$versionnumber;}
if($versionnumber =~ m/^0\.9\.(\d+)/){ if($1 < 29){
print STDERR "\n!!!! \nWARNING '$blastmode' version $versionnumber has a known bug that incorrectly computes the length of an alignment, thus the coverage threshold can produce wrong results leading in false negatives. See https://gitlab.com/paulklemm_PHD/proteinortho/issues/24 for more details.\n\n >>> Please update diamond to 0.9.29 or higher <<<\n";
@@ -2956,10 +3072,7 @@ sub check_bins {
print STDERR "\nWell then, proceeding...\n\n";} }
return;
}else{
- print STDERR ("\n!!!$ORANGE\n[WARNING]$NC Failed to detect '$blastmode'.$NC\nPlease install $blastmode in $binpath (or specify another binpath with -binpath=/home/...)\nI try now -p=blastp+ as fallback.\n");
- print STDERR "\nPress 'strg+c' to prevent me from proceeding or wait 10 seconds to continue...\n!!!\n";
- sleep 10;
- print STDERR "\nWell then, proceeding...\n\n";
+ &failed_to_detected_msg();
$blastmode="blastp+";
goto RESTART;
}
@@ -2975,10 +3088,7 @@ sub check_bins {
if($verbose){print STDERR "Detected '$blastmode' version $versionnumber\n";$blastversion=$versionnumber;}
return;
}else{
- print STDERR ("\n!!!$ORANGE\n[WARNING]$NC Failed to detect '$blastmode'.$NC\nPlease install $blastmode in $binpath (or specify another binpath with -binpath=/home/...)\nI try now -p=blastp+ as fallback.\n");
- print STDERR "\nPress 'strg+c' to prevent me from proceeding or wait 10 seconds to continue...\n!!!\n";
- sleep 10;
- print STDERR "\nWell then, proceeding...\n\n";
+ &failed_to_detected_msg();
$blastmode="blastp+";
goto RESTART;
}
@@ -2994,10 +3104,7 @@ sub check_bins {
if($verbose){print STDERR "Detected '$blastmode' version $versionnumber\n";$blastversion=$versionnumber;}
return;
}else{
- print STDERR ("\n!!!$ORANGE\n[WARNING]$NC Failed to detect '$blastmode'.$NC\nPlease install $blastmode in $binpath (or specify another binpath with -binpath=/home/...)\nI try now -p=blastp+ as fallback.\n");
- print STDERR "\nPress 'strg+c' to prevent me from proceeding or wait 10 seconds to continue...\n!!!\n";
- sleep 10;
- print STDERR "\nWell then, proceeding...\n\n";
+ &failed_to_detected_msg();
$blastmode="blastp+";
goto RESTART;
}
@@ -3013,10 +3120,7 @@ sub check_bins {
if($verbose){print STDERR "Detected '$blastmode' version $versionnumber\n";$blastversion=$versionnumber;}
return;
}else{
- print STDERR ("\n!!!$ORANGE\n[WARNING]$NC Failed to detect '$blastmode'.$NC\nPlease install $blastmode in $binpath (or specify another binpath with -binpath=/home/...)\nI try now -p=blastp+ as fallback.\n");
- print STDERR "\nPress 'strg+c' to prevent me from proceeding or wait 10 seconds to continue...\n!!!\n";
- sleep 10;
- print STDERR "\nWell then, proceeding...\n\n";
+ &failed_to_detected_msg();
$blastmode="blastp+";
goto RESTART;
}
@@ -3041,6 +3145,7 @@ sub check_files {
@files = grep { !( $_ =~ /\.g[ft]f[34]?$/ ) } @files;
foreach my $file (@files) {
+ if($debug){print STDERR "->$file\n";}
if(-e $file && $file=~/\.gz$/){
if($verbose){print STDERR "ERROR input '$file' seems to be compressed...\n";}
exit 1;
@@ -3049,8 +3154,8 @@ sub check_files {
}elsif(-e $file){
$files_dedup{abs_path($file)}=$file;
#if ($verbose) {print STDERR "Checking $file... ";}
- #&read_details($file,0);
- $check_files_QUEUE->enqueue($file);
+ &read_details($file,0);
+ #$check_files_QUEUE->enqueue($file);
#if ($blastmode eq "autoblast") {print STDERR " (".$autoblast_fileis{$file}.") ";}
#if ($verbose) {print STDERR "ok\n";}
}elsif(!-e $file){
@@ -3063,13 +3168,6 @@ sub check_files {
if($debug){print STDERR "Checking input files queue:".($check_files_QUEUE->pending())."\n"}
- $do_log=1;
- $_->kill('KILL')->detach foreach threads->list();
- threads->create('log_Worker');
- if($debug){print STDERR "Creating ".(scalar @files > $cpus+10 || $debug ? $cpus : 1)." check_files_Worker (cpus=$cpus)\n"}
- for (my $i = 0; $i < ((scalar @files > 500 && scalar @files > $cpus+10) || $debug ? $cpus : 1) ; $i++) { my $tmp = threads->create('check_files_Worker') }
- $_->join() foreach threads->list();
-
my $all_only_one_entry=1;
foreach my $file (@files) {
if($gene_counter{$file}>1){$all_only_one_entry=0;}
@@ -3114,6 +3212,8 @@ sub convertUniprotAndNCBI {
sub read_details {
my $file = shift;
+ if($debug){print STDERR "read_details($file)\n"}
+ my $abs_path_file=abs_path "$file "; $abs_path_file=~s/ $//g;
my $from_synteny_call = shift;
my %ids; # local test for duplicated IDs
my %genes;
@@ -3135,9 +3235,9 @@ sub read_details {
if (!-e $file) {&Error("File '$file' not found!");}
open(FASTA,"<$file") || &Error("Could not open '$file': $!");
while (<FASTA>) {
- my $curLine=$_;
- if ($curLine =~ /^[^a-z#>]/i){$msg.="\nERROR found line with forbidden symbols in '$file':\n$_\n"; exit 1}
- if ($curLine =~/^$/){$did_found_emptyline=1;}
+ my $curLine=$_;$curLine=~s/\r//g;
+ if ($curLine =~/^ *$/){$did_found_emptyline=1;}
+ elsif ($curLine =~ /^[^a-z#>]/i){&Error("\nERROR found line with forbidden symbols in '$file':\n$_\n"); exit 1}
if ($curLine =~ />/) { #head line -> gene name and description
if($isoform eq "uniprot"){
lock(%isoform_mapping);
@@ -3189,11 +3289,11 @@ sub read_details {
if(!$force && $checkfasta && exists($allowedAlphabet->{$blastmode}) && ($genelength>50 && ( ($allowedAlphabet->{$blastmode} eq "n" && $ATCGNoccurences/$genelength < 0.5) || ($allowedAlphabet->{$blastmode} eq "a" && $ATCGNoccurences/$genelength > 0.8)))){$cur_gene_is_valid= -1;last;}
if( $blastmode eq "autoblast" &&
- !exists $autoblast_fileis{$file} && $genelength>50 ){
+ !exists $autoblast_fileis{$abs_path_file} && $genelength>50 ){
if( $genelength>50 && $ATCGNoccurences/$genelength < 0.5 ){
- $autoblast_fileis{$file} = "prot";
+ $autoblast_fileis{$abs_path_file} = "prot";
}elsif( $ATCGNoccurences/$genelength > 0.8 ){
- $autoblast_fileis{$file} = "nucl";
+ $autoblast_fileis{$abs_path_file} = "nucl";
}
}
@@ -3210,7 +3310,7 @@ sub read_details {
$ids{$curLine} = $file;
}
if(index($curLine, ",") != -1 && $file=~/_clean/){
- $msg.="\n$ORANGE [ERROR]$NC input '$file' contains a gene-name with a comma, this causes problems with the proteinortho.tsv output, I cannot clean the file, is the input gzipped?\nThe line is:\n$curLine\n";
+ &Error("\n$ORANGE [ERROR]$NC input '$file' contains a gene-name with a comma, this causes problems with the proteinortho.tsv output, I cannot clean the file, is the input gzipped?\nThe line is:\n$curLine\n");
exit 1;
}elsif(index($curLine, ",") != -1){
# 6.0.32 : check if gene name contains a comma -> this will cause problems with the proteinortho.tsv output (gene cluster speparator)
@@ -3277,7 +3377,7 @@ sub read_details {
close(FASTA_CLEAN);
if($step==2){
$msg.=("\n$ORANGE [WARNING]$NC Restarting the indices generation.$NC");
- $msg.= &generate_indices($file_clean);
+ &generate_indices($file_clean);
}
$_ eq $file and $_ = $file_clean for @files;
$_ eq $file and $_ = $file_clean for @files_cleanup;
@@ -3295,21 +3395,21 @@ sub read_details {
}
}
if($blastmode eq "autoblast" &&
- !exists $autoblast_fileis{$file} && $genelength>50 ){
+ !exists $autoblast_fileis{$abs_path_file} && $genelength>50 ){
if( $genelength>50 && $ATCGNoccurences/$genelength < 0.5 ){
- $autoblast_fileis{$file} = "prot";
+ $autoblast_fileis{$abs_path_file} = "prot";
}elsif( $ATCGNoccurences/$genelength > 0.8 ){
- $autoblast_fileis{$file} = "nucl";
+ $autoblast_fileis{$abs_path_file} = "nucl";
}
}
if($did_found_emptyline){
$msg.=("$ORANGE [WARNING]$NC Found empty line in $file, removing it with perl -lne.$NC");
- system('perl -lne \'if($_ ne ""){print "$_";}\' '.$file.' >'.$file.'.tmp');
+ system('perl -lne \'if($_ !~/^ *$/){print "$_";}\' '.$file.' >'.$file.'.tmp');
system('mv '.$file.'.tmp '.$file);
if($step==2){
$msg.=("$ORANGE [WARNING]$NC Restarting the indices generation.$NC");
- $msg.= &generate_indices($file);
+ &generate_indices($file);
}
}
@@ -3355,10 +3455,12 @@ sub read_details {
}
}
- if ($blastmode eq "autoblast") {$msg .= " (".$autoblast_fileis{$file}.") ";}
+ if ($blastmode eq "autoblast") {$msg .= " (".$autoblast_fileis{$abs_path_file}.") ";}
$msg.=" ok\n";
- $log_QUEUE->enqueue($msg);
+ if($verbose){print STDERR $msg}
+ if($debug){print STDERR "read_details($file):done\n"}
+ #$log_QUEUE->enqueue($msg);
unless ($synteny) {return;}
my %coordinates;
@@ -3407,7 +3509,11 @@ sub read_details {
sub TypicalBlastError {
my $blastmode=shift;
my $command=shift;
- &Error($blastmode." failed with code $?$NC ($command). (Please visit https://gitlab.com/paulklemm_PHD/proteinortho/wikis/Error%20Code)\nThe most common sources of this error are:\n- no space left on device error.\n- outdated $blastmode, please update $blastmode or consider another -p algorithm.\n- the databases are missing. Maybe you ran --step=1 and moved the databases afterwards? Please rerun 'proteinortho --step=1 --force /path/to/fastas'\n- maybe the fasta files contain mixed nucleotide and aminoacid sequences or are not suited for $blastmode? (For example diamond only processes protein sequences) Try 'proteinortho --step=1 --check --force /path/to/fastas'.")
+ $command=~s/2>\/dev\/null//g;
+ my $retcode=shift;
+ my $retmsg=shift;
+ &Error("$NC".$blastmode." failed\n${ORANGE}return error code$NC = $retcode\n${ORANGE}full error command$NC = $command\n${ORANGE}return error message$NC = $retmsg")
+ # The most common sources of this error are:\n- no space left on device error.\n- outdated $blastmode, please update $blastmode or consider another -p algorithm.\n- the databases are missing. Maybe you ran --step=1 and moved the databases afterwards? Please rerun 'proteinortho --step=1 --force /path/to/fastas'\n- maybe the fasta files contain mixed nucleotide and aminoacid sequences or are not suited for $blastmode? (For example diamond only processes protein sequences) Try 'proteinortho --step=1 --check --force /path/to/fastas'.
}
sub Error {
$debug=1;
@@ -3464,76 +3570,76 @@ sub gff4fasta {
}
sub get_po_path {
- my @tmppath = fileparse($0); # path to the C++-part of this program
+ my @fileparse0 = fileparse($0); # path to the C++-part of this program
my $uname=`uname -s`;
$uname=~s/[\r\n]+$//;
$uname.="_".`uname -m`;
$uname=~s/[\r\n]+$//;
- if(!-x $tmppath[1]."/proteinortho_clustering"){
- if(-x $tmppath[1]."/src/BUILD/$uname/proteinortho_clustering"){
- $tmppath[1]=$tmppath[1]."/src/BUILD/$uname";
- if($debug){print STDERR "Detected ".$tmppath[1]."\n";}
+ if(!-x $fileparse0[1]."/proteinortho_clustering"){
+ if(-x $fileparse0[1]."/src/BUILD/$uname/proteinortho_clustering"){
+ $fileparse0[1]=$fileparse0[1]."/src/BUILD/$uname";
+ if($debug){print STDERR "Detected ".$fileparse0[1]."\n";}
}elsif(-x "/usr/bin/proteinortho_clustering"){
- $tmppath[1]="/usr/bin/";
- if($debug){print STDERR "Detected ".$tmppath[1]."\n";}
+ $fileparse0[1]="/usr/bin/";
+ if($debug){print STDERR "Detected ".$fileparse0[1]."\n";}
}elsif(-x "/usr/local/bin/proteinortho_clustering"){
- $tmppath[1]="/usr/local/bin/";
- if($debug){print STDERR "Detected ".$tmppath[1]."\n";}
+ $fileparse0[1]="/usr/local/bin/";
+ if($debug){print STDERR "Detected ".$fileparse0[1]."\n";}
}elsif(-x "$binpath/proteinortho_clustering"){
- $tmppath[1]="$binpath/";
- if($debug){print STDERR "Detected ".$tmppath[1]."\n";}
+ $fileparse0[1]="$binpath/";
+ if($debug){print STDERR "Detected ".$fileparse0[1]."\n";}
}else{
my $p=`which proteinortho_clustering`;
$p=~s/^([^ ]+)\/proteinortho_clustering.*$/$1/;
chomp($p);
- $tmppath[1]=$p;
+ $fileparse0[1]=$p;
if($debug){print STDERR "Detected (PATH enviroment variable)\n";}
}
}
- if(!-x $tmppath[1]."/proteinortho_clustering"){
+ if(!-x $fileparse0[1]."/proteinortho_clustering"){
&Error("cannot find 'proteinortho_clustering'$NC in: the current directory '.', ./src/, ./src/BUILD/$uname , /usr/bin, /usr/local/bin, \$PATH, -binpath=$binpath.\nPlease do one of the following:\n A. recompile proteinortho (with 'make clean', 'make' and 'make install' or 'make install PREFIX=...') or consider a installation with conda/brew (see the README for more informations)\n B. execute from within the downloaded directory, there are precompiled binaries for Linux_x86_64\n C. specify the path to the binaries with -binpath=...\n");
exit 1;
}
- # if(!-x "$tmppath[1]/proteinortho_cleanupblastgraph"){
- # &Error("cannot find proteinortho_cleanupblastgraph in $tmppath[1].\nPlease do one of the following:\n A. recompile proteinortho (with 'make clean', 'make' and 'make install' or 'make install PREFIX=...') or consider a installation with conda/brew (see the README for more informations)\n B. execute from within the downloaded directory, there are precompiled binaries for Linux_x86_64\n C. specify the path to the binaries with -binpath=...\n";
+ # if(!-x "$fileparse0[1]/proteinortho_cleanupblastgraph"){
+ # &Error("cannot find proteinortho_cleanupblastgraph in $fileparse0[1].\nPlease do one of the following:\n A. recompile proteinortho (with 'make clean', 'make' and 'make install' or 'make install PREFIX=...') or consider a installation with conda/brew (see the README for more informations)\n B. execute from within the downloaded directory, there are precompiled binaries for Linux_x86_64\n C. specify the path to the binaries with -binpath=...\n";
# exit 0;
# }
- # if(!-x "$tmppath[1]/po_tree"){
- # &Error("cannot find po_tree in $tmppath[1].\nPlease do one of the following:\n A. recompile proteinortho (with 'make clean', 'make' and 'make install' or 'make install PREFIX=...') or consider a installation with conda/brew (see the README for more informations)\n B. execute from within the downloaded directory, there are precompiled binaries for Linux_x86_64\n C. specify the path to the binaries with -binpath=...\n";
+ # if(!-x "$fileparse0[1]/po_tree"){
+ # &Error("cannot find po_tree in $fileparse0[1].\nPlease do one of the following:\n A. recompile proteinortho (with 'make clean', 'make' and 'make install' or 'make install PREFIX=...') or consider a installation with conda/brew (see the README for more informations)\n B. execute from within the downloaded directory, there are precompiled binaries for Linux_x86_64\n C. specify the path to the binaries with -binpath=...\n";
# exit 0;
# }
- if(!-x $tmppath[1]."/proteinortho2html.pl"){
+ if(!-x $fileparse0[1]."/proteinortho2html.pl"){
&Error("cannot find 'proteinortho2html.pl'$NC in: the current directory '.', ./src/, ./src/BUILD/$uname, /usr/bin, /usr/local/bin, \$PATH, -binpath=$binpath.\nPlease do one of the following:\n A. recompile proteinortho (with 'make clean', 'make' and 'make install' or 'make install PREFIX=...') or consider a installation with conda/brew (see the README for more informations)\n B. execute from within the downloaded directory, there are precompiled binaries for Linux_x86_64\n C. specify the path to the binaries with -binpath=...\n");
exit 1;
}
- # if(!-x "$tmppath[1]/proteinortho2tree.pl"){
- # &Error("cannot find proteinortho2tree.pl in $tmppath[1].\nPlease do one of the following:\n A. recompile proteinortho (with 'make clean', 'make' and 'make install' or 'make install PREFIX=...') or consider a installation with conda/brew (see the README for more informations)\n B. execute from within the downloaded directory, there are precompiled binaries for Linux_x86_64\n C. specify the path to the binaries with -binpath=...\n";
+ # if(!-x "$fileparse0[1]/proteinortho2tree.pl"){
+ # &Error("cannot find proteinortho2tree.pl in $fileparse0[1].\nPlease do one of the following:\n A. recompile proteinortho (with 'make clean', 'make' and 'make install' or 'make install PREFIX=...') or consider a installation with conda/brew (see the README for more informations)\n B. execute from within the downloaded directory, there are precompiled binaries for Linux_x86_64\n C. specify the path to the binaries with -binpath=...\n";
# exit 0;
# }
- if(!-x $tmppath[1]."/proteinortho_ffadj_mcs.py" && $synteny){
+ if(!-x $fileparse0[1]."/proteinortho_ffadj_mcs.py" && $synteny){
&Error("cannot find 'proteinortho_ffadj_mcs.py'$NC in: the current directory '.', ./src/, ./src/BUILD/$uname, /usr/bin, /usr/local/bin, \$PATH, -binpath=$binpath.\nPlease do one of the following:\n A. recompile proteinortho (with 'make clean', 'make' and 'make install' or 'make install PREFIX=...') or consider a installation with conda/brew (see the README for more informations)\n B. execute from within the downloaded directory, there are precompiled binaries for Linux_x86_64\n C. specify the path to the binaries with -binpath=...\n");
exit 1;
}
- if(!-x $tmppath[1]."/proteinortho_singletons.pl"){
+ if(!-x $fileparse0[1]."/proteinortho_singletons.pl"){
&Error("cannot find 'proteinortho_singletons.pl'$NC in: the current directory '.', ./src/, ./src/BUILD/$uname, /usr/bin, /usr/local/bin, \$PATH, -binpath=$binpath.\nPlease do one of the following:\n A. recompile proteinortho (with 'make clean', 'make' and 'make install' or 'make install PREFIX=...') or consider a installation with conda/brew (see the README for more informations)\n B. execute from within the downloaded directory, there are precompiled binaries for Linux_x86_64\n C. specify the path to the binaries with -binpath=...\n");
exit 1;
}
- if(!-x $tmppath[1]."/proteinortho_graphMinusRemovegraph"){
+ if(!-x $fileparse0[1]."/proteinortho_graphMinusRemovegraph"){
&Error("cannot find 'proteinortho_graphMinusRemovegraph'$NC in: the current directory '.', ./src/, ./src/BUILD/$uname, /usr/bin, /usr/local/bin, \$PATH, -binpath=$binpath.\nPlease do one of the following:\n A. recompile proteinortho (with 'make clean', 'make' and 'make install' or 'make install PREFIX=...') or consider a installation with conda/brew (see the README for more informations)\n B. execute from within the downloaded directory, there are precompiled binaries for Linux_x86_64\n C. specify the path to the binaries with -binpath=...\n");
exit 1;
}
- if(!-x $tmppath[1]."/proteinortho_do_mcl.pl"){
+ if(!-x $fileparse0[1]."/proteinortho_do_mcl.pl"){
&Error("cannot find 'proteinortho_do_mcl.pl'$NC in: the current directory '.', ./src/, ./src/BUILD/$uname, /usr/bin, /usr/local/bin, \$PATH, -binpath=$binpath.\nPlease do one of the following:\n A. recompile proteinortho (with 'make clean', 'make' and 'make install' or 'make install PREFIX=...') or consider a installation with conda/brew (see the README for more informations)\n B. execute from within the downloaded directory, there are precompiled binaries for Linux_x86_64\n C. specify the path to the binaries with -binpath=...\n");
exit 1;
}
- if(!-x $tmppath[1]."/proteinortho2xml.pl"){
+ if(!-x $fileparse0[1]."/proteinortho2xml.pl"){
&Error("cannot find 'proteinortho2xml.pl'$NC in: the current directory '.', ./src/, ./src/BUILD/$uname, /usr/bin, /usr/local/bin, \$PATH, -binpath=$binpath.\nPlease do one of the following:\n A. recompile proteinortho (with 'make clean', 'make' and 'make install' or 'make install PREFIX=...') or consider a installation with conda/brew (see the README for more informations)\n B. execute from within the downloaded directory, there are precompiled binaries for Linux_x86_64\n C. specify the path to the binaries with -binpath=...\n");
exit 1;
}
- return $tmppath[1];
+ return $fileparse0[1];
}
sub edgeweight {
=====================================
src/.gitlab-ci.yml
=====================================
@@ -5,13 +5,14 @@ before_script:
- alias python=python3
- DEBIAN_FRONTEND=noninteractive apt-get update && DEBIAN_FRONTEND=noninteractive apt-get -y install cmake diffutils wget ncbi-blast+ time git python3 libblas-dev liblapack-dev
stages:
- - test
-# - test-precompiled-bins
-# - recompile-and-test
+# - test
+ - test-precompiled-bins
+ - recompile-and-test
gcc-latest-manyoptions-together:
+ retry: 2
image: gcc
- stage: test
+ stage: test-precompiled-bins
script:
- echo "installing diamond"
- wget http://github.com/bbuchfink/diamond/releases/download/v2.0.6/diamond-linux64.tar.gz 2>/dev/null
@@ -24,8 +25,9 @@ gcc-latest-manyoptions-together:
- rm *descriptions *html *xml;
gcc-latest-someoptions-one-by-one:
+ retry: 2
image: gcc
- stage: test
+ stage: test-precompiled-bins
script:
- echo "installing diamond"
- wget http://github.com/bbuchfink/diamond/releases/download/v2.0.6/diamond-linux64.tar.gz 2>/dev/null
@@ -65,8 +67,9 @@ gcc-latest-someoptions-one-by-one:
- for f in myproject.*; do if [ "$(grep -v '#' $f | wc -l | awk '{print $1}' | tr -d '\n')" -lt 1 ]; then echo "$f failed size check, output should be more than 10 lines..."; exit 1; fi; done
gcc-latest-all-p:
+ retry: 2
image: gcc
- stage: test
+ stage: recompile-and-test
script:
- export CWD=$(pwd)
- echo "installing last"
@@ -92,8 +95,9 @@ gcc-latest-all-p:
- for f in myproject.*; do if [ "$(grep -v '#' $f | wc -l | awk '{print $1}' | tr -d '\n')" -lt 10 ]; then echo "$f failed size check, output should be more than 10 lines..."; exit 1; fi; done
gcc-latest-diamond:
+ retry: 2
image: gcc
- stage: test
+ stage: test-precompiled-bins
script:
- echo "installing diamond"
- wget http://github.com/bbuchfink/diamond/releases/download/v2.0.6/diamond-linux64.tar.gz 2>/dev/null
@@ -107,27 +111,10 @@ gcc-latest-diamond:
- rm *info;
- for f in myproject.*; do if [ "$(grep -v '#' $f | wc -l | awk '{print $1}' | tr -d '\n')" -lt 10 ]; then echo "$f failed size check, output should be more than 10 lines..."; exit 1; fi; done
-#nolapack-gcc-latest:
-# image: gcc
-# stage: test
-# script:
-# - echo "installing diamond"
-# - wget http://github.com/bbuchfink/diamond/releases/download/v2.0.6/diamond-linux64.tar.gz 2>/dev/null
-# - tar xzf diamond-linux64.tar.gz
-# - cp diamond $HOME
-# - export PATH="$PATH:$HOME"
-# - echo "start proteinortho tests"
-# - gcc --version
-# - make clean
-# - make LAPACK=FALSE
-# - make test
-# - perl proteinortho*pl test/*faa || exit 1
-# - rm *info;
-# - for f in myproject.*; do if [ "$(grep -v '#' $f | wc -l | awk '{print $1}' | tr -d '\n')" -lt 10 ]; then echo "$f failed size check, output should be more than 10 lines..."; exit 1; fi; done
-
-#gcc5:
-# image: gcc:5
-# stage: test
+#gcc6:
+# retry: 2
+# image: gcc:6
+# stage: recompile-and-test
# script:
# - echo "installing topaz"
# - git clone https://github.com/ajm/topaz
@@ -146,9 +133,10 @@ gcc-latest-diamond:
# - make all
# - make test
-ubuntu-latest0:
+ubuntu-latest-precompiled:
+ retry: 2
image: ubuntu
- stage: test
+ stage: test-precompiled-bins
script:
- DEBIAN_FRONTEND=noninteractive apt-get -y update && DEBIAN_FRONTEND=noninteractive apt-get -y install python2.7-minimal gcc gfortran build-essential g++ python3
- echo "installing topaz"
@@ -169,8 +157,9 @@ ubuntu-latest0:
- for f in myproject.*; do if [ "$(grep -v '#' $f | wc -l | awk '{print $1}' | tr -d '\n')" -lt 10 ]; then echo "$f failed size check, output should be more than 10 lines..."; exit 1; fi; done
ubuntu-latest:
+ retry: 2
image: ubuntu
- stage: test
+ stage: recompile-and-test
script:
- apt-get -y update && apt-get -y install gcc && apt-get -y install gfortran && apt-get -y install build-essential g++ && apt-get -y install python3
- echo "installing topaz"
@@ -193,8 +182,9 @@ ubuntu-latest:
- for f in myproject.*; do if [ "$(grep -v '#' $f | wc -l | awk '{print $1}' | tr -d '\n')" -lt 10 ]; then echo "$f failed size check, output should be more than 10 lines..."; exit 1; fi; done
debian-latest:
+ retry: 2
image: debian
- stage: test
+ stage: recompile-and-test
script:
- apt-get -y update && apt-get -y install gcc && apt-get -y install gfortran && apt-get -y install build-essential g++ && apt-get -y install python3
- echo "installing topaz"
@@ -216,64 +206,67 @@ debian-latest:
- rm *info;
- for f in myproject.*; do if [ "$(grep -v '#' $f | wc -l | awk '{print $1}' | tr -d '\n')" -lt 10 ]; then echo "$f failed size check, output should be more than 10 lines..."; exit 1; fi; done
-#fedora-latest:
-# image: fedora
-# stage: test
-# script:
-# - yum -y groupinstall "Development Tools"
-# - yum -y install gcc-c++
-# - yum -y install cmake
-# - yum -y install make
-# - yum -y install tar
-# - yum -y install which
-# - yum -y install wget
-# - yum -y install libstdc++-static
-# - yum -y install lapack-static
-# - yum -y install cpan
-# - yum -y install python
-# - yum -y install ncbi-blast+
-# - cpan Thread::Queue
-# - wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast*-x64-linux.tar.gz 2>/dev/null
-# - tar -xzvf ncbi-blast*-x64-linux.tar.gz
-# - cp ncbi-blast*/bin/blastp $HOME
-# - cp ncbi-blast*/bin/makeblastdb $HOME
-# - echo "installing diamond"
-# - wget http://github.com/bbuchfink/diamond/releases/download/v2.0.6/diamond-linux64.tar.gz 2>/dev/null
-# - tar xzf diamond-linux64.tar.gz
-# - cp diamond $HOME
-# - export PATH="$PATH:$HOME"
-# - echo "start proteinortho tests"
+# fedora-latest:
+# retry: 2
+# image: fedora
+# stage: test
+# script:
+# - yum -y groupinstall "Development Tools"
+# - yum -y install gcc-c++
+# - yum -y install cmake
+# - yum -y install make
+# - yum -y install tar
+# - yum -y install which
+# - yum -y install wget
+# - yum -y install libstdc++-static
+# - yum -y install lapack-static
+# - yum -y install cpan
+# - yum -y install python
+# - yum -y install ncbi-blast+
+# - cpan Thread::Queue
+# - wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast*-x64-linux.tar.gz 2>/dev/null
+# - tar -xzvf ncbi-blast*-x64-linux.tar.gz
+# - cp ncbi-blast*/bin/blastp $HOME
+# - cp ncbi-blast*/bin/makeblastdb $HOME
+# - echo "installing diamond"
+# - wget http://github.com/bbuchfink/diamond/releases/download/v2.0.6/diamond-linux64.tar.gz 2>/dev/null
+# - tar xzf diamond-linux64.tar.gz
+# - cp diamond $HOME
+# - export PATH="$PATH:$HOME"
+# - echo "start proteinortho tests"
-#centos-latest:
-# image: centos
-# stage: test
-# script:
-# - yum -y groupinstall "Development Tools"
-# - yum -y install gcc-c++
-# - yum -y install cmake
-# - yum -y install make
-# - yum -y install tar
-# - yum -y install which
-# - yum -y install wget
-# - yum -y install gcc-gfortran python3 atlas atlas-devel lapack blas
-# - wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast*-x64-linux.tar.gz 2>/dev/null
-# - tar -xzvf ncbi-blast*-x64-linux.tar.gz
-# - cp ncbi-blast*/bin/blastp $HOME
-# - cp ncbi-blast*/bin/makeblastdb $HOME
-# - echo "installing diamond"
-# - wget http://github.com/bbuchfink/diamond/releases/download/v2.0.6/diamond-linux64.tar.gz 2>/dev/null
-# - tar xzf diamond-linux64.tar.gz
-# - cp diamond $HOME
-# - export PATH="$PATH:$HOME"
-# - echo "start proteinortho tests"
-# - make clean
-# - make
-# - make test
-# - perl proteinortho*pl test/*faa || exit 1
-# - rm *info;
-# - for f in myproject.*; do if [ "$(grep -v '#' $f | wc -l | awk '{print $1}' | tr -d '\n')" -lt 10 ]; then echo "$f failed size check, output should be more than 10 lines..."; exit 1; fi; done
+# centos-latest:
+# retry: 2
+# image: centos
+# stage: test
+# script:
+# - yum -y groupinstall "Development Tools"
+# - yum -y install gcc-c++
+# - yum -y install cmake
+# - yum -y install make
+# - yum -y install tar
+# - yum -y install which
+# - yum -y install wget
+# - yum -y install gcc-gfortran python3 atlas atlas-devel lapack blas
+# - wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast*-x64-linux.tar.gz 2>/dev/null
+# - tar -xzvf ncbi-blast*-x64-linux.tar.gz
+# - cp ncbi-blast*/bin/blastp $HOME
+# - cp ncbi-blast*/bin/makeblastdb $HOME
+# - echo "installing diamond"
+# - wget http://github.com/bbuchfink/diamond/releases/download/v2.0.6/diamond-linux64.tar.gz 2>/dev/null
+# - tar xzf diamond-linux64.tar.gz
+# - cp diamond $HOME
+# - export PATH="$PATH:$HOME"
+# - echo "start proteinortho tests"
+# - make clean
+# - make
+# - make test
+# - perl proteinortho*pl test/*faa || exit 1
+# - rm *info;
+# - for f in myproject.*; do if [ "$(grep -v '#' $f | wc -l | awk '{print $1}' | tr -d '\n')" -lt 10 ]; then echo "$f failed size check, output should be more than 10 lines..."; exit 1; fi; done
#code_quality:
+# retry: 2
# image: docker:stable
# stage: codequality
# variables:
=====================================
src/proteinortho_clustering.cpp
=====================================
@@ -65,13 +65,14 @@ extern "C" {
int* lwork, int* iwork, int* liwork, int* info );
}
-struct wedge {unsigned int edge; unsigned short weight;};
-struct protein {vector<wedge> edges; unsigned int species_id; string full_name;};
+struct wedge {unsigned int edge; unsigned short weight=0;};
+struct protein {vector<wedge> edges; unsigned int species_id; string full_name=""; bool is_done=false;};
// Functions
float string2float(string);
void tokenize(const string& , vector<string>& , const string&);
void parse_file(string);
+void parse_abc_file(string);
// void remove_edge_index(const unsigned int, const unsigned int);
float getConnectivity_float(vector<unsigned int>*,bool,vector<float>*);
double getConnectivity_double(vector<unsigned int>*,bool,vector<double>*);
@@ -82,10 +83,12 @@ void stats(unsigned int,unsigned int,bool);
string getTime(void);
bool param_verbose = true;
bool param_core = false;
+ bool param_abc = false;
float param_con_threshold = 0.1; // as a reference: a chain a-b-c-d has 0.25
unsigned int debug_level = 0;
float param_sep_purity = -1; // as a reference: a-b-c will give +/-0.707107 and 2.34857e-08
unsigned int param_max_nodes = 16777216; // 2^24
+ bool param_max_nodes_was_set=false;
float param_min_species = 1;
string param_rmgraph = "remove.graph";
bool param_useWeights = true;
@@ -256,6 +259,12 @@ size_t getCurrentRSS( )
#endif
}
+/*void debug_log(string log){
+ std::hash<std::thread::id>{}(std::this_thread::get_id());
+ if(!log_out.count(tid))
+ log_out[tid]=make_shared<ofstream>((param_rmgraph+to_string(tid)).c_str());
+}*/
+
class dispatch_queue{
// based on: https://github.com/embeddedartistry/embedded-resources/blob/master/examples/cpp/dispatch.cpp
@@ -425,7 +434,7 @@ void dispatch_queue::dispatch_thread_handler(size_t tid){
// Main
///////////////////////////////////////////////////////////
void printHelp() {
- cerr << "proteinortho_clustering - Spectral partitioning algorithm (last updated with proteinortho v6.1.2)" << "\n";
+ cerr << "proteinortho_clustering - Spectral partitioning algorithm (last updated with proteinortho v6.1.6)" << "\n";
cerr << "-----------------------------------------------------" << "\n";
cerr << "This tool is part of Proteinortho" << "\n";
cerr << "" << "\n";
@@ -440,9 +449,10 @@ void printHelp() {
cerr << " -cpus int the number of threads used for openMP ["<<num_cpus<<"]" << "\n";
cerr << " -coreMaxProts int the maximum number of proteins per species for -core ["<<param_coreMaxProteinsPerSpecies<<"]" << "\n";
cerr << " -coreMinSpecies int the minimum number of species for -core ["<<param_coreMinSpecies<<"]" << "\n";
+ cerr << " -abc flag to indicate the input is a abc formatted graph file instead of a blast-graph (tab-separated). Input is expected to be undirected. c is the similarity score (0-USHRT_MAX) e.g. the blast bitscore" << "\n";
cerr << "\ntechnical parameters:" << "\n";
cerr << " -test various test-functions are called first [not set]" << "\n";
- cerr << " -maxnodes int only consider connected component with up to maxnodes nodes. If exceeded, greedily remove the worst 10 percent of edges (by weight) until satisfied ["<<param_max_nodes<<"]" << "\n";
+ cerr << " -maxnodes int only consider connected component with up to maxnodes nodes. If exceeded, greedily remove the worst 10 percent of edges (by weight) until satisfied [number of species ** 2]" << "\n";
cerr << " -maxweight int only use the edge weights for connected components with maxweight nodes ["<<param_max_nodes_weight<<"]" << "\n";
cerr << " -epsilon float convergence threshold ["<<param_epsilon<<"]" << "\n";
cerr << " -weighted bool the spectral partition is calculated using the bitscores ["<<param_useWeights<<"]" << "\n";
@@ -488,6 +498,9 @@ int main(int argc, char *argv[]) {
else if (parameter == "-core") {
param_core = true;
}
+ else if (parameter == "-abc") {
+ param_abc = true;
+ }
else if (parameter == "-coreMaxProt" || parameter == "-coreMaxProts") {
param_core = true;
paras++;
@@ -530,6 +543,7 @@ int main(int argc, char *argv[]) {
else if (parameter == "-maxnodes") {
paras++;
param_max_nodes = string2float(string(argv[paras]));
+ param_max_nodes_was_set=true;
}
else if (parameter == "-maxweight") {
paras++;
@@ -614,7 +628,7 @@ int main(int argc, char *argv[]) {
if (debug_level > 0) cerr << getTime() << " [DEBUG] Debug level " << debug_level << "\n";
- if(param_core){param_con_threshold=999;}
+ if(param_core){param_con_threshold=999;} // for -core the -conn parameter is disabled
if(getEnvVar("OMP_NUM_THREADS") != "1"){
// restart with OMP_NUM_THREADS=1
@@ -627,7 +641,11 @@ int main(int argc, char *argv[]) {
// Parse files
for (vector<string>::iterator it=files.begin() ; it != files.end(); it++) {
if (debug_level > 0) cerr << getTime() << " [DEBUG] Parsing file " << *it << "\n";
- parse_file(*it);
+ if(param_abc){
+ parse_abc_file(*it);
+ }else{
+ parse_file(*it);
+ }
if (debug_level > 0) cerr << getTime() << " [DEBUG] I know " << species_counter << " species with " << protein_counter << " proteins and " << edges << " edges in sum" << "\n";
}
@@ -653,6 +671,8 @@ int main(int argc, char *argv[]) {
// Stats
if (param_verbose) cerr << species_counter << " species" << "\n" << protein_counter << " paired proteins" << "\n" << edges << " bidirectional edges" << "\n";
+ if(param_max_nodes_was_set==false){param_max_nodes = species_counter*species_counter; }
+
if (debug_level > 0) cerr << getTime() << " [DEBUG] Maximumum number of nodes for connectivity calculations is " << param_max_nodes << "\n";
// Prepare sort of output
@@ -828,7 +848,18 @@ void find_CCs_givenNodes(dispatch_queue*,vector<unsigned int>);
void print_group(ConnectedComponent& , float, size_t, bool);
float calc_group(vector<unsigned int>*);
-ConnectedComponent BFS(map<unsigned int, bool> * done, unsigned int cur_node ){
+ConnectedComponent BFS( map<unsigned int, bool> * done, unsigned int cur_node , float cut_off ){
+
+ /*
+ *
+ * Simple BFS implementation
+ * done map is used to identify allready visited nodes (can be set before call to indicate forbidden nodes)
+ * cut_off = ignore all edges below this cut off
+ *
+ */
+ size_t tid=std::hash<std::thread::id>{}(std::this_thread::get_id());
+ if(!graph_clean.count(tid))
+ graph_clean[tid]=make_shared<ofstream>((param_rmgraph+to_string(tid)).c_str());
ConnectedComponent ret; //return vector
list<unsigned int> q;
@@ -845,16 +876,26 @@ ConnectedComponent BFS(map<unsigned int, bool> * done, unsigned int cur_node ){
ret.d_sum+=graph[cur_node].edges.size();
(*done)[cur_node] = true;
- for (unsigned int i = 0; i < graph[cur_node].edges.size(); i++) {
+ if(graph[cur_node].is_done){continue;}
+
+ for (unsigned int j = 0; j < graph[cur_node].edges.size(); j++) {
- unsigned int adjacency_node = graph[cur_node].edges[i].edge;
+ if(graph[graph[cur_node].edges[j].edge].is_done){continue;}
+
+ if(graph[cur_node].edges[j].weight < cut_off){
+ protein node_i = graph[cur_node];
+ protein node_j = graph[node_i.edges[j].edge];
+ (*graph_clean[tid]) << node_i.full_name << "\t" << species[node_i.species_id] << "\t" << node_j.full_name << "\t" << species[node_j.species_id] << "\n";
+ continue;
+ } // ignore
+
+ unsigned int adjacency_node = graph[cur_node].edges[j].edge;
if(adjacency_node > graph.size()){
cerr << string("[ERROR] : Input graph is invalid. The node "+graph[cur_node].full_name +" is reporting an edge/adjacent node, that is not present in the graph.").c_str() << "\n";throw;
}
if( !done->count(adjacency_node) || !(*done)[adjacency_node] ){
-
(*done)[adjacency_node] = true;
q_new.push_back(adjacency_node);
}
@@ -863,10 +904,6 @@ ConnectedComponent BFS(map<unsigned int, bool> * done, unsigned int cur_node ){
q=q_new;
}
- //ret.calc_density();
- //if(ret.density > 1){
- // cerr << "[WARNING] : The input graph has duplicated edges, this lead to an invalid graph density of " << ret.density << " (should be <1). Please clean the .blast-graph with 'proteinortho.pl --cleanblast --step=3 --project=...' or use the cleanupblastgraph tool in src/ to remove the duplicated edges." << "\n";
- //}
return ret;
}
@@ -876,45 +913,107 @@ struct compare_ConnectedComponents { //sort from large to small
}
};
-void removeLowQualityEdges( ConnectedComponent &cur_cc ){
+void removeLowQualityEdges( ConnectedComponent cur_cc , dispatch_queue *q ){
+ /*
+ * remove low quality edges
+ *
+ * find the range of values of this CC -> define a cut-off = cut_off=min_w+0.1*(max_w-min_w);
+ * remove all edges below that value
+ * redo BFS and start over again for this cluster until -maxnodes are satisfied
+ *
+ */
+ if(debug_level>0) cerr << " [INFO] removeLowQualityEdges " << cur_cc.size() << " start" << "\n";
+
+ // find the lowest 10% of edge values (min+0.1*(max-min))
+ float min_w = -1;
+ float max_w = -1;
+ unsigned int original_number_nodes=cur_cc.size();
+ for (unsigned int i = 0; i < cur_cc.size(); i++) {
+ unsigned int from=cur_cc[i];
+ for(unsigned int j = 0 ; j < graph[from].edges.size() ; j++){
+ unsigned int to = graph[from].edges[j].edge;
+ unsigned int weight = graph[from].edges[j].weight;
+ if(min_w == -1 || weight < min_w){min_w=weight;}
+ if(min_w == -1 || weight > max_w){max_w=weight;}
+ }
+ }
+ float cut_off=min_w+0.1*(max_w-min_w);
- /*
- * remove low quality edges
- *
- * find the range of values of this CC -> define a cut-off = cut_off=min_w+0.1*(max_w-min_w);
- * remove all edges below that value
- * redo BFS and start over again for this cluster until -maxnodes are satisfied
- *
- */
+ if(debug_level>0) cerr << " [INFO] removeLowQualityEdges " << cur_cc.size() << " cut_off=" << cut_off << " min_w=" << min_w << " max_w=" << max_w << "\n";
- // find the lowest 10% of edge values (min+0.1*(max-min))
- float min_w = -1;
- float max_w = -1;
- for (unsigned int i = 0; i < cur_cc.size(); i++) {
- unsigned int from=cur_cc[i];
- for(unsigned int j = 0 ; j < graph[from].edges.size() ; j++){
- unsigned int to = graph[from].edges[j].edge;
- unsigned int weight = graph[from].edges[j].weight;
- if(min_w == -1 || weight < min_w){min_w=weight;}
- if(min_w == -1 || weight > max_w){max_w=weight;}
- }
- }
- float cut_off=min_w+0.1*(max_w-min_w);
- if(debug_level>0) cerr << "cut_off="<<cut_off << " min_w="<<min_w<<" max_w="<<max_w<< endl;
- for (unsigned int i = 0; i < cur_cc.size(); i++) {
- unsigned int from=cur_cc[i];
- auto it = std::remove_if(
- graph[from].edges.begin(),
- graph[from].edges.end(),
- [cut_off,from](wedge cur_el)->bool
- {
- if(debug_level>0 && cur_el.weight <= cut_off) cerr << "[WARNING] found bad-edge "<< from << "-"<< cur_el.edge << " weight=" << cur_el.weight << endl;
- return cur_el.weight <= cut_off;
- }
- );
- graph[from].edges.erase(it, graph[from].edges.end());
- }
+ map<unsigned int, bool> done; // Keep track on what was done (for each node)
+ bool allNodesAreDone = false;
+
+ vector<ConnectedComponent> CC; // vector of all connected components found
+
+ while( true ){ // CC.size() < num_cpus / gather up to num_cpus connected components
+
+ allNodesAreDone = true;
+
+ for (unsigned int id = 0 ; id < cur_cc.size() ; id++) {
+
+ unsigned int protein_id = cur_cc[id];
+
+ if (done.count(protein_id) && done[protein_id]){continue;}// We were here already
+ if(debug_level>0) cerr << "find_CCs:start at "<< protein_id << "<" << graph.size() << endl;
+ done[protein_id]=true; // mark this node
+ ConnectedComponent cur_cc = BFS(&done,protein_id,cut_off); // get the CC of the current node (protein_id)
+
+ if(cur_cc.size()==original_number_nodes){
+ // the cutoff is too low -> increase by 10% and restart
+ cut_off+=0.1*(max_w-min_w);
+ allNodesAreDone=false;
+ break;
+ }
+
+ // for -core, skip CC with less than param_coreMinSpecies species
+ if( param_core && cur_cc.species_num==0 ){
+ if(cur_cc.size()<param_coreMinSpecies){allNodesAreDone=false;break;} // if there are less nodes than wanted species -> skip
+ map<unsigned int,bool>cc_species;
+ for (int i = 0; i < cur_cc.size(); ++i){ cc_species[graph[cur_cc[i]].species_id]=true; }
+ cur_cc.species_num=cc_species.size();
+ if(cc_species.size()<param_coreMinSpecies){allNodesAreDone=false;break;} // less species as wanted -> skip
+ }
+
+ // Skip those that are too large (try heuristic)
+ if (cur_cc.size() > param_max_nodes) {
+
+ // reset done vector
+ //for (int i = 0; i < cur_cc.size(); ++i){ done[cur_cc[i]]=false; }
+ if(debug_level>0) cerr << " [WARNING] Found a very large connected component that contains " << cur_cc.size() << ">" << param_max_nodes << " (maxnodes) elements. This behavior can be adjusted using -maxnodes. Now using a slow heuristic: try to identify and remove edges." << "\n";
+ q->dispatch([cur_cc,q]{ removeLowQualityEdges(cur_cc,q); });
+ //protein_id--;
+ continue;
+ }
+
+ cur_cc.calc_dsum();
+ cur_cc.calc_density();
+ if(cur_cc.density > 1){ cerr << "[WARNING] : The input graph has duplicated edges, this lead to an invalid graph density of " << cur_cc.density << " (should be <1). Please clean the .blast-graph with 'proteinortho.pl --cleanblast --step=3 --project=...' or use the cleanupblastgraph tool in src/ to remove the duplicated edges." << "\n"; throw; }
+ if (debug_level > 0) cerr << getTime() << " [DEBUG:removeLowQualityEdges] Found connected component: " << cur_cc.size() << " proteins (ID: " << protein_id << "), graph density="<< cur_cc.density << ", sum of degrees="<< cur_cc.d_sum << " ini from " << graph[protein_id].full_name<< "\n";
+
+ q->dispatch([cur_cc,q]{ partition_CC(cur_cc,q,true,false); });
+ allNodesAreDone=false;
+ break;
+ }
+ if(allNodesAreDone)break; // no additional CC can be found -> done
+ }
+ if(debug_level>0) cerr << " [INFO] removeLowQualityEdges " << cur_cc.size() << " done\n";
+
+ if(debug_level>0) cerr << "cut_off="<<cut_off << " min_w="<<min_w<<" max_w="<<max_w<< endl;
+ // for (unsigned int i = 0; i < cur_cc.size(); i++) {
+ // unsigned int from=cur_cc[i];
+ // auto it = std::remove_if(
+ // graph[from].edges.begin(),
+ // graph[from].edges.end(),
+ // [cut_off,from](wedge cur_el)->bool
+ // {
+ // if(debug_level>0 && cur_el.weight <= cut_off) cerr << "[WARNING] found bad-edge "<< from << "-"<< cur_el.edge << " weight=" << cur_el.weight << endl;
+ // return cur_el.weight <= cut_off;
+ // }
+ // );
+ // graph[from].edges.erase(it, graph[from].edges.end());
+ // }
}
void find_CCs_givenNodes(dispatch_queue *q, vector<unsigned int> todo_work ){
@@ -923,8 +1022,8 @@ void find_CCs_givenNodes(dispatch_queue *q, vector<unsigned int> todo_work ){
bool allNodesAreDone = false;
vector<ConnectedComponent> CC; // vector of all connected components found
- //unsigned int min_i=0;
+ // find all nodes outside todo_work and remove them from the done vector !
for (unsigned int i = 0 ; i < todo_work.size() ; i++) {
unsigned int from = todo_work[i];
done[ from ] = 0;
@@ -944,12 +1043,14 @@ void find_CCs_givenNodes(dispatch_queue *q, vector<unsigned int> todo_work ){
for (unsigned int i = 0 ; i < todo_work.size() ; i++) {
unsigned int protein_id=todo_work[i];
+ if ( graph[protein_id].is_done ){continue;}
if ( done.count(protein_id) && done[protein_id] ){continue;}// We were here already
//min_i=protein_id;
+ if (debug_level > 0) cerr << getTime() << " [DEBUG:find_CCs_givenNodes] start="<< protein_id << "\n";
done[protein_id]=true; // mark this node
- ConnectedComponent cur_cc = BFS(&done,protein_id); // get the CC of the current node (protein_id)
+ ConnectedComponent cur_cc = BFS(&done,protein_id,0); // get the CC of the current node (protein_id)
// Do not report singles
if (cur_cc.size() < 2) {continue;} // singletons are from no interest
@@ -958,11 +1059,10 @@ void find_CCs_givenNodes(dispatch_queue *q, vector<unsigned int> todo_work ){
// Skip those that are too large (try heuristic)
if (cur_cc.size() > param_max_nodes) {
-
// reset done vector
- for (int i = 0; i < cur_cc.size(); ++i){ done[cur_cc[i]]=false; }
+ // for (int i = 0; i < cur_cc.size(); ++i){ done[cur_cc[i]]=false; }
if(debug_level>0) cerr << " [WARNING] Found a very large connected component that contains " << cur_cc.size() << ">" << param_max_nodes << " (maxnodes) elements. This behavior can be adjusted using -maxnodes. Now using a slow heuristic: try to identify and remove edges." << "\n";
- removeLowQualityEdges(cur_cc); i--;
+ removeLowQualityEdges(cur_cc,q);
continue;
}
@@ -977,12 +1077,12 @@ void find_CCs_givenNodes(dispatch_queue *q, vector<unsigned int> todo_work ){
cur_cc.calc_dsum();
cur_cc.calc_density();
if(cur_cc.density > 1){ cerr << "[WARNING] : The input graph has duplicated edges, this lead to an invalid graph density of " << cur_cc.density << " (should be <1). Please clean the .blast-graph with 'proteinortho.pl --cleanblast --step=3 --project=...' or use the cleanupblastgraph tool in src/ to remove the duplicated edges." << "\n"; throw; }
- if (debug_level > 0) cerr << getTime() << " [DEBUG] Found connected component: " << cur_cc.size() << " proteins (ID: " << protein_id << "), graph density="<< cur_cc.density << ", sum of degrees="<< cur_cc.d_sum << "\n";
+ if (debug_level > 0) cerr << getTime() << " [DEBUG:find_CCs_givenNodes] Found connected component: " << cur_cc.size() << " proteins (ID: " << protein_id << "), graph density="<< cur_cc.density << ", sum of degrees="<< cur_cc.d_sum << "\n";
q->dispatch([cur_cc,q]{ partition_CC(cur_cc,q,true,false); });
allNodesAreDone=false;
- break;
+ //break;
}
if(allNodesAreDone)break; // no additional CC can be found -> done
}
@@ -1109,8 +1209,10 @@ void partition_CC(ConnectedComponent cur_cc, dispatch_queue *q, bool do_lapack,
*/
float new_purity=0;
- float best_cut=0;
- float cut_value=0;
+ float best_cut=0; // stores the current best cut (sum of all edges removed by the given purity), first take a purity of 0 (no purity) and set this value (split + and - compartments of x_hat)
+ vector<float> x_hat_candidate_purity = x_hat;
+ for (unsigned int i = 0; i < x_hat_candidate_purity.size(); i++) { x_hat_candidate_purity[i]=abs(x_hat_candidate_purity[i]); } // purity is a positive threshold
+ sort(x_hat_candidate_purity.begin(),x_hat_candidate_purity.end()); // these values are used as purity tests. Later only subset is used of values not too similar to each other
for (unsigned int i = 0; i < x_hat.size(); i++) {
if(x_hat[i] < 0) {
unsigned int from=cur_cc[i];
@@ -1124,28 +1226,35 @@ void partition_CC(ConnectedComponent cur_cc, dispatch_queue *q, bool do_lapack,
}
}
if(debug_level>0) cerr << " initial purity=0 best_cut=" << best_cut << endl;
+ float last_test_purity = -1; // omit purity test that are very close to previous values, to reduce runtime
+ int num_purity_tests=0; // test all of the first 25 purity values anyway
for(unsigned int pi = 0; pi < (unsigned int)n; pi++){
- unsigned int from = cur_cc[pi];
- if(!mapping.count(from) || abs(x_hat[ mapping[from] ])>0.12 ){continue;}
- float test_purity = abs(x_hat[ mapping[from] ]);
- float cut_value=0;
- unsigned int num_nodes=0;
+ //if( x_hat_sort[ pi ] < -0.12 ){continue;} // skip large negative values
+ if( x_hat_candidate_purity[ pi ] > 0.12 ){break;} // there is no further valid value (sorted values)
+ float test_purity = x_hat_candidate_purity[ pi ];
+ if( test_purity == 0 ||
+ test_purity == last_test_purity ||
+ (num_purity_tests>25 && test_purity-last_test_purity<0.01) ){continue;} // current value is very similar to last purity -> omit
+ float cut_value = 0;
+ num_purity_tests++;
+ last_test_purity = test_purity;
+ unsigned int num_nodes = 0; // the number of nodes within the purity bounds
for (unsigned int i = 0; i < x_hat.size(); i++) {
- if(abs(x_hat[i]) <= test_purity) {
+ if(abs(x_hat[i]) <= test_purity) { // add all the edges between this node (below purity) and nodes above the purity threshold to the cut value
num_nodes++;
unsigned int from=cur_cc[i];
for(unsigned int j = 0 ; j < graph[from].edges.size() ; j++){
unsigned int to = graph[from].edges[j].edge;
- if(!mapping.count(to)){continue;}
+ if(!mapping.count(to)){continue;} // sanity check if the edge is in the current CC
if(abs(x_hat[mapping[to]]) > test_purity) {
cut_value+=graph[from].edges[j].weight;
}
}
}
}
- if( cut_value>0 && (best_cut==-1 || best_cut>cut_value) && num_nodes>1){
- new_purity=test_purity;
- best_cut=cut_value;
+ if( cut_value>0 && (best_cut==-1 || best_cut>cut_value) && num_nodes>1){ // update optimum
+ new_purity = test_purity;
+ best_cut = cut_value;
if(debug_level>0) cerr << " initial purity="<<new_purity<<" best_cut=" << best_cut << endl;
}
}
@@ -1328,6 +1437,7 @@ void partition_CC(ConnectedComponent cur_cc, dispatch_queue *q, bool do_lapack,
if (debug_level > 0) cerr << getTime() << " [DEBUG] splitting CC with "<<cur_cc.size()<<" nodes "<<(debug_level==15 ? "@"+id+"" : "")<< " into ("<<groupA.size()<<""<<(debug_level==15 ? " @"+idA+"" : "")<< ","<<groupB.size()<<""<<(debug_level==15 ? " @"+idB+"" : "")<< ","<<groupZero.size()<<""<<(debug_level==15 ? " @"+idZ+"" : "")<< ") sized groups!"<< "\n";
// Catch error in laplacien calcs
+ // all nodes are in one partition -> restart with an increased restart level
if ( (groupA.size() == 0 && groupB.size() == 0) ||
( (groupA.size() == 0 || groupB.size() == 0) && groupZero.size() == 0) ){
@@ -1346,9 +1456,39 @@ void partition_CC(ConnectedComponent cur_cc, dispatch_queue *q, bool do_lapack,
}else{
+ // for the -core algorithm, test if the number of species now decreased by too much or keep on splitting
+ if(param_core){
+ map<unsigned int,bool>cc_speciesA;
+ for (int i = 0; i < groupA.size(); ++i){ cc_speciesA[graph[groupA[i]].species_id]=true; }
+ groupA.species_num=cc_speciesA.size();
+ map<unsigned int,bool>cc_speciesB;
+ for (int i = 0; i < groupB.size(); ++i){ cc_speciesB[graph[groupB[i]].species_id]=true; }
+ groupB.species_num=cc_speciesB.size();
+ map<unsigned int, bool> done; // for zero additionally test if this group is connected -> if not
+
+ if(groupA.species_num!=cur_cc.species_num && // neither A, B or Zero meet the -core criteria -> test zero and if zero also do not hold -> print the original group cur_cc before the split
+ groupB.species_num!=cur_cc.species_num &&
+ param_coreMaxProteinsPerSpecies*cur_cc.species_num > cur_cc.size()){
+
+ map<unsigned int,bool>cc_speciesZero;
+ for (int i = 0; i < groupZero.size(); ++i){ cc_speciesZero[graph[groupZero[i]].species_id]=true; }
+ groupZero.species_num=cc_speciesZero.size();
+
+ if( groupZero.species_num!=cur_cc.species_num ){ // the original CC is still fine -> print
+ print_group(cur_cc,connectivity,tid,false);
+ return; // done with this group, no need to split into A and B by the definition of -core
+ }
+ }
+ }
+
+ // now add the edges between A and B (and Zero) to the output rmgraph
+ // (the rmgraph contains all edges that are removed by the clustering)
+
if(!graph_clean.count(tid))
graph_clean[tid]=make_shared<ofstream>((param_rmgraph+to_string(tid)).c_str());
+ if(debug_level>0) cerr << "[INFO] removing between A and B+Zero" << endl;
+ // remove edges between A and (B or Zero)
for(unsigned int i = 0 ; i < groupA.size() ; i++){
protein node_i = graph[groupA[i]];
for(unsigned int j = 0 ; j < graph[groupA[i]].edges.size() ; j++){
@@ -1358,7 +1498,20 @@ void partition_CC(ConnectedComponent cur_cc, dispatch_queue *q, bool do_lapack,
(*graph_clean[tid]) << node_i.full_name << "\t" << species[node_i.species_id] << "\t" << node_j.full_name << "\t" << species[node_j.species_id] << "\n";
}
}
+ // auto it = std::remove_if(
+ // node_i.edges.begin(),
+ // node_i.edges.end(),
+ // [node_i,groupB_set,groupZero_set](wedge cur_el)->bool
+ // {
+ // if(debug_level>0 && (groupB_set.count(cur_el.edge) || groupZero_set.count(cur_el.edge))) cerr << "[INFO] removing "<< node_i.full_name << "-"<< graph[cur_el.edge].full_name << " weight=" << cur_el.weight << endl;
+ // return groupB_set.count(cur_el.edge) || groupZero_set.count(cur_el.edge);
+ // }
+ // );
+ // node_i.edges.erase(it, node_i.edges.end());
}
+
+ if(debug_level>0) cerr << "[INFO] removing between B and Zero" << endl;
+ // remove edges between B and Zero
for(unsigned int i = 0 ; i < groupB.size() ; i++){
protein node_i = graph[groupB[i]];
for(unsigned int j = 0 ; j < graph[groupB[i]].edges.size() ; j++){
@@ -1367,37 +1520,36 @@ void partition_CC(ConnectedComponent cur_cc, dispatch_queue *q, bool do_lapack,
(*graph_clean[tid]) << node_i.full_name << "\t" << species[node_i.species_id] << "\t" << node_j.full_name << "\t" << species[node_j.species_id] << "\n";
}
}
+ // auto it = std::remove_if(
+ // node_i.edges.begin(),
+ // node_i.edges.end(),
+ // [node_i,groupZero_set](wedge cur_el)->bool
+ // {
+ // if(debug_level>0 && groupZero_set.count(cur_el.edge)) cerr << "[INFO] removing "<< node_i.full_name << "-"<< graph[cur_el.edge].full_name << " weight=" << cur_el.weight << endl;
+ // return groupZero_set.count(cur_el.edge);
+ // }
+ // );
+ // node_i.edges.erase(it, node_i.edges.end());
}
+ // finally add the new clusters back to the stack of CCs
+ // for -core only add those clusters which meet the criteria of number of species
+
if(param_core){
- map<unsigned int,bool>cc_speciesA;
- for (int i = 0; i < groupA.size(); ++i){ cc_speciesA[graph[groupA[i]].species_id]=true; }
- groupA.species_num=cc_speciesA.size();
- map<unsigned int,bool>cc_speciesB;
- for (int i = 0; i < groupB.size(); ++i){ cc_speciesB[graph[groupB[i]].species_id]=true; }
- groupB.species_num=cc_speciesB.size();
-
- if( groupA.species_num!=cur_cc.species_num &&
- groupB.species_num!=cur_cc.species_num &&
- param_coreMaxProteinsPerSpecies*cur_cc.species_num > cur_cc.size()){
- print_group(cur_cc,connectivity,tid,false);
- }else{
- if(groupA.species_num >= param_coreMinSpecies && groupA.size()>1)
- q->dispatch([groupA,q]{ partition_CC(groupA,q,true,0); });
- if(groupB.species_num >= param_coreMinSpecies && groupB.size()>1)
- q->dispatch([groupB,q]{ partition_CC(groupB,q,true,0); });
- }
- return;
+ if((groupA.species_num >= param_coreMinSpecies || param_coreMaxProteinsPerSpecies*cur_cc.species_num > cur_cc.size()) && groupA.size()>1)
+ q->dispatch([groupA,q]{ find_CCs_givenNodes(q,groupA.m_content_CC); });
+ if((groupB.species_num >= param_coreMinSpecies || param_coreMaxProteinsPerSpecies*cur_cc.species_num > cur_cc.size()) && groupB.size()>1)
+ q->dispatch([groupB,q]{ find_CCs_givenNodes(q,groupB.m_content_CC); });
+ if((groupZero.species_num >= param_coreMinSpecies || param_coreMaxProteinsPerSpecies*cur_cc.species_num > cur_cc.size()) && groupZero.size()>1) // only if zero is connected
+ q->dispatch([groupZero,q]{ find_CCs_givenNodes(q,groupZero.m_content_CC); });
+ }else{
+ if(groupA.size()>1)
+ q->dispatch([groupA,q]{ find_CCs_givenNodes(q,groupA.m_content_CC); });
+ if(groupB.size()>1)
+ q->dispatch([groupB,q]{ find_CCs_givenNodes(q,groupB.m_content_CC); });
+ if(groupZero.size()>1)
+ q->dispatch([groupZero,q]{ find_CCs_givenNodes(q,groupZero.m_content_CC); });
}
-
- if(groupA.size()>1)
- //q->dispatch([groupA,q]{ partition_CC(groupA,q,true,false); });
- q->dispatch([groupA,q]{ find_CCs_givenNodes(q,groupA.m_content_CC); });
- if(groupB.size()>1)
- //q->dispatch([groupB,q]{ partition_CC(groupB,q,true,false); });
- q->dispatch([groupB,q]{ find_CCs_givenNodes(q,groupB.m_content_CC); });
- if(groupZero.size()>1)
- q->dispatch([groupZero,q]{ find_CCs_givenNodes(q,groupZero.m_content_CC); });
}
}
@@ -1415,44 +1567,49 @@ void find_CCs(dispatch_queue *q){
for (unsigned int protein_id = 0 ; protein_id < graph.size() ; protein_id++) {
+ if ( graph[protein_id].is_done ){continue;}
+
if (done.count(protein_id) && done[protein_id]){continue;}// We were here already
- if(debug_level>0) cerr << "find_CCs:start at "<< protein_id << "<" << graph.size() << endl;
+ if(debug_level>0) cerr << getTime() << " [DEBUG:find_CCs] find_CCs:start at "<< protein_id << "<" << graph.size() << endl;
//min_i=protein_id;
done[protein_id]=true; // mark this node
- ConnectedComponent cur_cc = BFS(&done,protein_id); // get the CC of the current node (protein_id)
+ ConnectedComponent cur_cc = BFS(&done,protein_id,0); // get the CC of the current node (protein_id)
// Do not report singles
if (cur_cc.size() < 2) {continue;} // singletons are from no interest
+ // for -core, skip CC with less than param_coreMinSpecies species
+ if( param_core && cur_cc.species_num==0 ){
+ if(cur_cc.size()<param_coreMinSpecies){allNodesAreDone=false;break;} // if there are less nodes than wanted species -> skip
+ map<unsigned int,bool>cc_species;
+ for (int i = 0; i < cur_cc.size(); ++i){ cc_species[graph[cur_cc[i]].species_id]=true; }
+ cur_cc.species_num=cc_species.size();
+ if(cc_species.size()<param_coreMinSpecies){allNodesAreDone=false;break;} // less species as wanted -> skip
+ }
+
// Skip those that are too large (try heuristic)
if (cur_cc.size() > param_max_nodes) {
// reset done vector
- for (int i = 0; i < cur_cc.size(); ++i){ done[cur_cc[i]]=false; }
+ //for (int i = 0; i < cur_cc.size(); ++i){ done[cur_cc[i]]=false; }
if(debug_level>0) cerr << " [WARNING] Found a very large connected component that contains " << cur_cc.size() << ">" << param_max_nodes << " (maxnodes) elements. This behavior can be adjusted using -maxnodes. Now using a slow heuristic: try to identify and remove edges." << "\n";
- removeLowQualityEdges(cur_cc); protein_id--;
+ q->dispatch([cur_cc,q]{ removeLowQualityEdges(cur_cc,q); });
+ //protein_id--;
continue;
}
- if(param_core && cur_cc.species_num==0){
- if(cur_cc.size()<param_coreMinSpecies){allNodesAreDone=false;break;} // if there are less nodes than wanted species -> skip
- map<unsigned int,bool>cc_species;
- for (int i = 0; i < cur_cc.size(); ++i){ cc_species[graph[cur_cc[i]].species_id]=true; }
- cur_cc.species_num=cc_species.size();
- if(cc_species.size()<param_coreMinSpecies){allNodesAreDone=false;break;} // less species as wanted -> skip
- }
-
+ if (debug_level > 0) cerr << getTime() << " [DEBUG:find_CCs] Found connected component: " << cur_cc.size() << " proteins (ID: " << protein_id << "), ini from " << graph[protein_id].full_name<< "\n";
cur_cc.calc_dsum();
cur_cc.calc_density();
if(cur_cc.density > 1){ cerr << "[WARNING] : The input graph has duplicated edges, this lead to an invalid graph density of " << cur_cc.density << " (should be <1). Please clean the .blast-graph with 'proteinortho.pl --cleanblast --step=3 --project=...' or use the cleanupblastgraph tool in src/ to remove the duplicated edges." << "\n"; throw; }
- if (debug_level > 0) cerr << getTime() << " [DEBUG:find_CCs] Found connected component: " << cur_cc.size() << " proteins (ID: " << protein_id << "), graph density="<< cur_cc.density << ", sum of degrees="<< cur_cc.d_sum << " ini from " << graph[protein_id].full_name<< "\n";
+ if (debug_level > 0) cerr << getTime() << " [DEBUG:find_CCs] did calc dsum and density: " << cur_cc.size() << " proteins (ID: " << protein_id << "), graph density="<< cur_cc.density << ", sum of degrees="<< cur_cc.d_sum << " ini from " << graph[protein_id].full_name<< "\n";
q->dispatch([cur_cc,q]{ partition_CC(cur_cc,q,true,false); });
allNodesAreDone=false;
- break;
+ //break;
}
if(allNodesAreDone)break; // no additional CC can be found -> done
}
@@ -1461,6 +1618,7 @@ void find_CCs(dispatch_queue *q){
///////////////////////////////////////////////////////////
// Major partioning algorithm
///////////////////////////////////////////////////////////
+
void partition_graph() {
#ifdef timeAnalysis
auto t_tmp = std::chrono::steady_clock::now( );
@@ -1651,6 +1809,120 @@ void parse_file(string file) {
#endif
}
+void parse_abc_file(string file) {
+ #ifdef timeAnalysis
+ auto t_tmp = std::chrono::steady_clock::now( );
+ #endif
+
+
+ species.push_back("input");
+ species2id["input"] = species_counter++;
+
+ if (param_verbose) cerr << "Reading " << file << "\n";
+ string line;
+ ifstream graph_file(file.c_str());
+ if (graph_file.is_open()) {
+ while (!graph_file.eof()) {
+ getline(graph_file, line);
+ vector<string> fields;
+ tokenize(line, fields, "\t");
+ // Header line
+ if ((fields.size() == 3) && fields[0].substr(0, 1) != "#") {
+ // a b e1 b1 e2 b2 score
+
+ // 5.16 deal with duplicated IDs by adding file ID to protein ID
+ string ida = fields[0];
+ string idb = fields[1];
+ //fields[0] += " "; fields[0] += to_string(file_a_id);
+ //fields[1] += " "; fields[1] += to_string(file_b_id);
+
+ // 5.16 do not point to yourself
+ if (!fields[0].compare(fields[1])) {continue;}
+
+ //if(debug_level>0) cerr << "parse_file:(" << fields[0]<<","<<fields[1] << ") -> " << (protein2id.find(fields[0]) == protein2id.end()) << endl;
+
+ // A new protein
+ if (protein2id.find(fields[0]) == protein2id.end()) {
+ protein a;
+ a.full_name = ida;
+ // graph_ram_total+=a.full_name.size()*sizeof(string);
+ a.species_id = 0;
+ // graph_ram_total+=sizeof(unsigned int);
+ protein2id[fields[0]] = protein_counter++;
+ // graph_ram_total+=fields[0].size()*sizeof(string)+sizeof(unsigned int);
+
+ if( graph.size() >= 1073741824-1 ){
+ cerr << string("[CRITICAL ERROR] Overflow: number of nodes overflow the maximum number for vector allocation (2^30=1073741824).").c_str() << "\n";throw;
+ }
+ graph.push_back(a);
+ // graph_ram_total+=sizeof(protein);
+ }
+ if (protein2id.find(fields[1]) == protein2id.end()) {
+ protein b;
+ b.full_name = idb;
+ // graph_ram_total+=b.full_name.size()*sizeof(string);
+ b.species_id = 0;
+ // graph_ram_total+=sizeof(unsigned int);
+ protein2id[fields[1]] = protein_counter++;
+ // graph_ram_total+=fields[1].size()*sizeof(string)+sizeof(unsigned int);
+
+ if( graph.size() >= 1073741824-1 ){
+ cerr << string("[CRITICAL ERROR] Overflow: number of nodes overflow the maximum number for vector allocation (2^30=1073741824).").c_str() << "\n";throw;
+ }
+ graph.push_back(b);
+ // graph_ram_total+=sizeof(protein);
+ }
+
+ // Bitscores
+
+ float bit_a = string2float(fields[3]);
+
+ if(bit_a<1){bit_a=1;}
+
+ if(bit_a>USHRT_MAX){
+ cerr << " [WARNING] unsigned short overflow " << bit_a << ">USHRT_MAX (bitscore of "<< ida<< " adj. to "<< idb<< ") using "<< USHRT_MAX<< " (USHRT_MAX) instead." << "\n";
+ bit_a=(float)USHRT_MAX;
+ }
+
+ // Add link to graph (reciprocal)
+ unsigned int a_id = protein2id[fields[0]];
+ unsigned int b_id = protein2id[fields[1]];
+
+ // 5.17, add weight
+ wedge w;
+ w.edge=b_id;
+ // graph_ram_total+=sizeof(unsigned int);
+ w.weight=bit_a;
+ // graph_ram_total+=sizeof(unsigned int);
+ graph[a_id].edges.push_back(w);
+ // graph_ram_total+=sizeof(wedge);
+ w.edge=a_id;
+ // graph_ram_total+=sizeof(unsigned int);
+ w.weight=bit_a;
+ // graph_ram_total+=sizeof(unsigned int);
+ graph[b_id].edges.push_back(w);
+ // graph_ram_total+=sizeof(wedge);
+ edges++;
+ }
+ }
+ graph_file.close();
+ }
+ else {
+ cerr << string("Could not open file " + file).c_str() << "\n";throw;
+ }
+
+ // graph_ram_total_inKB += graph_ram_total/1e+3;
+ // if (debug_level > 0) cerr << getTime() << " [DEBUG] Expected Memory Usage of the current input graph: " << graph_ram_total_inKB << " KB = " << graph_ram_total_inKB/1e+3 << " MB. (current input graph are all the currently loaded files)" << "\n";
+
+ if(species_counter==0){species.push_back("0");species_counter++;}
+
+ #ifdef timeAnalysis
+ if(!t_master.count("parse_file")) t_master["parse_file"]=0;
+ t_master["parse_file"] += (float)std::chrono::duration_cast<std::chrono::nanoseconds>( std::chrono::steady_clock::now( ) - t_tmp ).count()/1e+9;
+ #endif
+}
+
+
///////////////////////////////////////////////////////////
// Output
///////////////////////////////////////////////////////////
@@ -1708,31 +1980,30 @@ struct protein_degree{
};
// Group formatting
-void print_group(ConnectedComponent& nodes, float connectivity, size_t tid, bool failed) {
+void print_group(ConnectedComponent& cur_cc, float connectivity, size_t tid, bool failed) {
#ifdef timeAnalysis
auto t_tmp = std::chrono::steady_clock::now( );
#endif
+ for (unsigned int i = 0; i < cur_cc.size(); i++) { graph[cur_cc[i]].is_done=true; }
+
if(debug_level>0) cerr << getTime()<< " [DEBUG] print_group start"<< endl;
- if(nodes.size()<2){return;}
+ if(cur_cc.size()<2){return;}
map<unsigned int, bool> done; // Keep track on what was done (for each node)
- for (unsigned int i = 0 ; i < nodes.size() ; i++) {
- unsigned int from = nodes[i];
+ for (unsigned int i = 0 ; i < cur_cc.size() ; i++) {
+ unsigned int from = cur_cc[i];
done[ from ] = 0;
}
- for (unsigned int i = 0 ; i < nodes.size() ; i++) {
- unsigned int from = nodes[i];
+ for (unsigned int i = 0 ; i < cur_cc.size() ; i++) {
+ unsigned int from = cur_cc[i];
for(unsigned int j = 0 ; j < graph[from].edges.size() ; j++){
if(!done.count(graph[from].edges[j].edge))
done[ graph[from].edges[j].edge ]=1;
}
}
- ConnectedComponent cur_cc = BFS(&done,nodes[0]); // get the CC of the current node (protein_id)#
- if((int)cur_cc.size() - (int)nodes.size() != 0){return;}
-
if(!proteinorthotmp_clean.count(tid)){
proteinorthotmp_clean[tid]=make_shared<ofstream>((param_rmgraph+"_proteinortho_tmp_"+to_string(tid)).c_str());
@@ -1743,16 +2014,16 @@ void print_group(ConnectedComponent& nodes, float connectivity, size_t tid, bool
unsigned int species_number = 0;
// For each protein in group
- for (unsigned int i = 0; i < nodes.size(); i++) {
- unsigned int current_protein = nodes[i];
+ for (unsigned int i = 0; i < cur_cc.size(); i++) {
+ unsigned int current_protein = cur_cc[i];
unsigned int current_species = graph[current_protein].species_id;
if (line[current_species].size() == 0)
species_number++;
line[current_species].push_back(make_pair( graph[current_protein].full_name , graph[current_protein].edges.size() ));
}
- (*proteinorthotmp_clean[tid]) << species_number << "\t" << nodes.size() << "\t" << (debug_level>0 ? "@"+getCCid(nodes.m_content_CC)+":":"") << setprecision (3) << (connectivity < 0 ? -connectivity : connectivity) << (failed ? "*": "");
- // cerr << species_number << "\t" << nodes.size() << "\t" << setprecision (3) << connectivity;
+ (*proteinorthotmp_clean[tid]) << species_number << "\t" << cur_cc.size() << "\t" << (debug_level>0 ? "@"+getCCid(cur_cc.m_content_CC)+":":"") << setprecision (3) << (connectivity < 0 ? -connectivity : connectivity) << (failed ? "*": "");
+ // cerr << species_number << "\t" << cur_cc.size() << "\t" << setprecision (3) << connectivity;
// List group data
for (unsigned int i = 0; i < species_counter; i++) {
@@ -1781,7 +2052,7 @@ void print_group(ConnectedComponent& nodes, float connectivity, size_t tid, bool
(*proteinorthotmp_clean[tid]) << "\n";
// cerr << "\n";
- if(debug_level>0) cerr << getTime()<< " [DEBUG] print_group @"<<getCCid(nodes.m_content_CC)<<" ERROR="<< ((int)cur_cc.size() - (int)nodes.size()) << endl;
+ if(debug_level>0) cerr << getTime()<< " [DEBUG] print_group @"<<getCCid(cur_cc.m_content_CC)<<" ERROR="<< ((int)cur_cc.size() - (int)cur_cc.size()) << endl;
//os.close();
View it on GitLab: https://salsa.debian.org/med-team/proteinortho/-/commit/42208790e378b027cdbc447f7a50dd4fb343b658
--
View it on GitLab: https://salsa.debian.org/med-team/proteinortho/-/commit/42208790e378b027cdbc447f7a50dd4fb343b658
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20230117/9fdff725/attachment-0001.htm>
More information about the debian-med-commit
mailing list