[med-svn] [condetri] 01/01: Imported Upstream version 2.3

Steffen Möller moeller at moszumanska.debian.org
Sun Mar 6 20:13:21 UTC 2016


This is an automated email from the git hooks/post-receive script.

moeller pushed a commit to branch master
in repository condetri.

commit 1c72c756a263d41037387c786bcc9c2b136dd024
Author: Steffen Moeller <moeller at debian.org>
Date:   Sun Mar 6 16:57:00 2016 +0100

    Imported Upstream version 2.3
---
 MANUAL_ConDeTri.pdf | Bin 0 -> 251428 bytes
 README.md           |  73 ++++++
 condetri.pl         | 664 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 filterPCRdupl.pl    | 237 +++++++++++++++++++
 wiki/README.wiki    |  71 ++++++
 5 files changed, 1045 insertions(+)

diff --git a/MANUAL_ConDeTri.pdf b/MANUAL_ConDeTri.pdf
new file mode 100644
index 0000000..47ba599
Binary files /dev/null and b/MANUAL_ConDeTri.pdf differ
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..343af71
--- /dev/null
+++ b/README.md
@@ -0,0 +1,73 @@
+# condetri
+Automatically exported from code.google.com/p/condetri
+
+## Summary
+Trim reads from the 3'-end and extract reads (or read pairs) of good quality. If the reads are paired, the filtering is done pairwise, and if one read in a pair has low quality, the remaining read is saved as single end.
+
+
+## Usage
+perl condetri.pl -fastq1=file1 [-fastq2=file2 -prefix=s -cutfirst=i -hq=i -lq=i -frac=[0,1] -minlen=i -mh=i -ml=i -sc=i -rmN]
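+
+For example, to trim a paired-end library with Sanger/Illumina 1.8+ quality encoding (the file names here are placeholders):
+<pre>
+$ perl condetri.pl -fastq1=sample_1.fastq -fastq2=sample_2.fastq -prefix=sample -sc=33
+</pre>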
+
+
+## Description
+**The trimming is performed in two steps:**
+
+**(1)** Trimming low quality bases from the 3'-end
+
+**(2)** Overall quality check of read/pair
+
+**Details:**
+
+**(1)** Bases are removed from the 3'-end if the quality score is lower than some threshold (*hq*). When a base with higher quality is reached, it is kept temporarily and the preceding bases are considered. After this point, bases with low quality can also be saved if they are surrounded by high-quality bases. Up to *ml* consecutive bases are saved temporarily, but if the following base also has low quality, all temporarily saved bases are removed and the trimming starts all over again [...]
+
+**(2)** After the trimming step, the quality scores of the remaining read are checked. A read is approved if a certain fraction (*frac*) of the bases have a quality score higher than *hq* and no base in the read has a quality score below some lower bound (*lq*). If the input is paired-end reads, both reads in a pair must be approved to keep the pair. If only one of the reads is approved, it is saved to an additional, unpaired file which can be used as single-end data.
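+
+As a minimal Perl sketch of this per-read check (assuming the default thresholds and ignoring *minlen* and the paired-end bookkeeping; the authoritative version is the readOK subroutine in condetri.pl):
+<pre>
+sub read_ok_sketch {
+	my ($qual, $hq, $lq, $frac, $sc) = @_;          # e.g. hq=25, lq=10, frac=0.8, sc=64
+	my @q = map { ord($_) - $sc } split //, $qual;  # decode the quality string
+	my $n_hq = grep { $_ >= $hq } @q;               # bases at or above hq
+	my $n_lq = grep { $_ <  $lq } @q;               # bases below lq
+	return @q && $n_hq / @q >= $frac && $n_lq == 0; # enough hq bases, no lq base
+}
+</pre>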
+
+
+### Comments
+The parameters -cutfirst and -rmN were added to ConDeTri in version 2.0 (for details, see the release notes). If -cutfirst is used, -rmN is somewhat redundant (it will still scan for non-ATCG characters from the 5' end, but the first i bases have already been removed). Also keep in mind that these choices affect the -minlen parameter: for example, with 75 bp reads, -minlen=50 and -cutfirst=6, the maximum possible number of bases that can be trimmed from the 3' end is 19 [...]
+
+
+### Input parameters
+(default values in brackets [])
+<pre>
+ -fastq1=file   Fastq file. If a second file is given, the files are trimmed as
+ -fastq2=file    a pair. The reads must have the same order in both files.
+ -prefix=string Prefix for the output file(s). The filtered fastq file(s) will be
+                 named prefix_trim1.fastq (and prefix_trim2.fastq if present).
+                 For pairs, a third file will be given with unpaired reads (reads 
+                 from pairs where one low quality read has been removed).
+ -cutfirst=i    Removes the first i bases from the 5' end before trimming [0].
+ -rmN           Removes N's from the 5' end before trimming [no].
+ -hq=i          High quality threshold [25].
+ -lq=i          Low quality threshold [10].
+ -frac=[0,1]    Fraction of read that must exceed hq [0.8].
+ -minlen=i      Min allowed read length [50].
+ -mh=i          When this no of consecutive hq bases is reached, the trimming stops [5].
+ -ml=i          Max no of lq bases allowed after a stretch of hq bases from 3'-end [1].
+ -sc=i          Illumina scoring table: Score=ASCII-sc. Illumina GAII (for which ConDeTri
+                 was written) used 64; newer Illumina data uses 33, which is the Sanger
+                 standard. Can be set to any other integer if wanted [64].
+ -q             Prints Illumina scoring table.
+ -h             Prints a help message.
+ </pre>
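+
+To illustrate the Score=ASCII-sc conversion: with -sc=64 the character 'h' (ASCII 104) encodes quality 104-64=40, while with -sc=33 the same quality 40 is written as 'I' (ASCII 73, 73-33=40). These values match the scoring tables the script can print.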
+
+### Output files
+<pre>
+ prefix_trim1.fastq             File(s) with the trimmed reads (one file for
+ prefix_trim2.fastq              single-end data, three for paired-end data, where
+ prefix_trim_unpaired.fastq      the last file includes reads from the two input
+                                 files whose read pair had too poor quality).
+ prefix.stats                   Includes basic statistics in columns.
+ </pre>
+**The columns for the .stats file are the following:**<BR>
+1) PREFIX <BR>
+2) NUMBER OF READS IN ORIGINAL FILE(S) <BR>
+3) NUMBER OF BASES IN ORIGINAL FILE(S) <BR>
+4) NO OF PAIRED READS AFTER TRIMMING <BR>
+5) NO OF BASES IN PAIRS AFTER TRIMMING  <BR>
+6) NO OF UNPAIRED READS AFTER TRIMMING <BR>
+7) NO OF UNPAIRED BASES AFTER TRIMMING <BR>
+
+The .stats files are suitable for concatenation to make a summary table for several fastq files, for example one file for each lane in a flowcell or a set of transcriptome samples. Just use:
+
+$ cat *.stats
diff --git a/condetri.pl b/condetri.pl
new file mode 100644
index 0000000..7d36af1
--- /dev/null
+++ b/condetri.pl
@@ -0,0 +1,664 @@
+#!/usr/bin/perl
+
+# condetri_v2.3.pl
+# September 2010, mod Oct 2011, Mar, Jun & Oct 2012 & Mar 2015
+# Author: Linnéa Smeds (linnea.smeds at ebc.uu.se)
+
+use strict;
+use warnings;
+use Getopt::Long;
+
+
+my $usage = "# condetri_v2.3.pl
+# September 2010
+# Version 2.3, October 2012
+# Author: Linnéa Smeds (linnea.smeds\@ebc.uu.se)
+# This version uses \"zcat\" instead of Zlib which means no library installation
+# is required. (Works on Linux; Windows users should continue using v2.2.)
+# Also added two parameters \"-cutlast\" and \"-notrim\" so that ConDeTri can be
+# used to trim a fixed number of base pairs and stop before the quality-based
+# trimming starts.
+# ---------------------------------------------------------------------------------
+# Description: Trim reads from the 3'-end and extract reads (or read pairs) of
+# good quality. If the reads are paired, the filtering is done pairwise, and 
+# if one read in a pair has low quality, the remaining read is saved as single end.
+# Usage: perl condetri.pl -fastq1=file1 [-fastq2=file2 -prefix=s -cutfirst=i 
+# -rmN -hq=i -lq=i -frac=i -lfrac=i -minlen=i -mh=i -ml=i -sc=i -pb=s]
+
+-fastq1=file \t Fastq(.gz) file. If a second file is given, the files are trimmed
+-fastq2=file \t as a pair. The reads must have the same order in both files.
+-prefix=string \t Prefix for the output file(s). The filtered fastq file(s) will
+ \t\t be named prefix_trim1.fastq (and prefix_trim2.fastq if present). For pairs,
+ \t\t a third file will be given with unpaired reads (reads from pairs where one 
+ \t\t low quality read has been removed).
+-cutfirst=i \t Remove the first i bases from the 5' end before any trimming [0].
+-cutlast=i \t Remove i bases from the 3' end before any trimming [0].
+-rmN \t\t Remove non-ATCG bases from the 5' end before any trimming [no].
+-notrim \t Stop script after removing 5' or 3' bases (skips the actual
+ \t\t trimming and filtering step) [no].
+-hq=i \t\t High quality threshold [25].
+-lq=i \t\t Low quality threshold [10].
+-frac=[0,1]\t Fraction of read that must exceed hq [0.8].
+-lfrac=[0,1]\t Maximum fraction of bases with qual<lq [0].
+-minlen=i \t Min allowed read length [50].
+-mh=i \t\t When this no of sequential hq bases is reached, the trimming stops [5].
+-ml=i \t\t Max no of lq bases allowed after a stretch of hq bases from 3'-end [1].
+-sc=i\t\t Scoring table, Score=ASCII-sc, 64 for Solexa and Illumina 1.3+ and
+ \t\t Illumina 1.5+, 33 for Sanger and newer Illumina data (from 1.8+). [64].
+-pb=fa|fq\t Print the removed low quality reads to a fasta file or a fastq
+\t\t file (for investigation/evaluation purposes) [no].
+-q \t\t Print Solexa/Illumina scoring table (64 offset).
+-q33 \t\t Print Sanger/new Illumina scoring table (33 offset).
+-h \t\t Print this help message.\n";
+		
+# Starting time
+my $time = time;
+
+# Input parameters
+my ($read1,$read2,$prefix,$cutfirst,$cutlast,$removeN,$notrim,$HQlim,$lowlim,$minfrac,$lfrac,$minReadLen,
+	$maxNoHQ,$maxNoLQ,$scoring,$printBad,$tbl,$tbl33,$help);
+GetOptions(
+  	"fastq1=s" => \$read1,
+   	"fastq2=s" => \$read2,
+  	"prefix=s" => \$prefix,
+	"cutfirst=i" => \$cutfirst,
+	"cutlast=i" => \$cutlast,
+	"rmN" => \$removeN,
+	"notrim" => \$notrim,
+  	"hq=i" => \$HQlim,
+	"lq=i" => \$lowlim,
+	"frac=s" => \$minfrac,
+	"lfrac=s" => \$lfrac,
+	"minlen=i" => \$minReadLen,
+	"mh=i" => \$maxNoHQ,
+   	"ml=i" => \$maxNoLQ,
+	"sc=i" => \$scoring,
+	"pb=s" => \$printBad,
+	"q" => \$tbl,
+	"q33" => \$tbl33,
+	"h" => \$help);
+	
+
+#--------------------------------------------------------------------------------
+#Checking input, set default if not given
+if($tbl) {
+	&table64();
+	exit;
+}
+if($tbl33) {
+	&table33();
+	exit;
+}
+unless($read1) {
+	die $usage . "\n";
+}
+if($help) {
+	die $usage . "\n";
+}
+unless($prefix) {
+	$prefix = $read1;
+	$prefix =~ s/\.\w+//;
+}
+unless($cutfirst) {
+	$cutfirst=0;
+}
+unless($cutlast) {
+	$cutlast=0;
+}
+unless($HQlim) {
+	$HQlim=25;
+}
+unless($lowlim) {
+	$lowlim=10;
+}
+if($minfrac) {
+	if($minfrac<0 || $minfrac>1) {
+		die "Error: frac must be between 0 and 1.\n";
+	}
+}
+else{
+	$minfrac=0.8;
+}
+if($lfrac) {
+	if($lfrac<0 || $lfrac>1) {
+		die "Error: lfrac must be between 0 and 1.\n";
+	}
+}
+else{
+	$lfrac=0;
+}
+unless($maxNoHQ) {
+	$maxNoHQ=5;
+}
+unless($maxNoLQ) {
+	$maxNoLQ=1;
+}
+unless($minReadLen) {
+	$minReadLen=50;
+}
+unless($scoring) {
+	$scoring=64;
+}
+if($printBad) {
+	unless($printBad eq "fa" || $printBad eq "fq") {
+		die "Error: pb must be either \"fa\" or \"fq\".\n";
+	}
+}
+else {
+	$printBad="";
+}
+unless(-e $read1) {
+	die "Error: File $read1 doesn't exist!\n";
+}
+print "\ncondetri_v2.3.pl started " . localtime() . "\n";
+print "------------------------------------------------------------------\n";
+print "Settings: ";
+if($cutfirst>0) {
+	print "Cutfirst=$cutfirst "; 
+}
+if($cutlast>0) {
+	print "Cutlast=$cutlast "; 
+}
+print "HQ=$HQlim LQ=$lowlim Frac=$minfrac Lfrac=$lfrac MH=$maxNoHQ ML=$maxNoLQ Minlen=$minReadLen Scoring=$scoring\n";
+if($removeN) {
+	print "Remove non-ATCG characters from 5'-end before trimming\n";
+}
+if($printBad ne "") {
+	print "Print removed reads to a $printBad file.\n";
+}
+
+
+my ($totNoReads, $pairReads, $unpairedReads, $badReads) = (0,0,0,0);
+my ($totNoBases, $pairBases, $unpairedBases, $firstbases, $lastbases) = (0,0,0,0,0);
+#--------------------------------------------------------------------------------
+# Trimming paired files
+if($read2) {
+	unless(-e $read2) {
+		die "Error: File $read2 doesn't exist!\n";
+	}
+
+	if ($read1 =~ /\.gz$/) {
+	   open(IN1, "zcat $read1 |");
+	}
+	else {
+	   open(IN1, $read1);
+	}
+
+	if ($read2 =~ /\.gz$/) {
+	   open(IN2, "zcat $read2 |");
+	}
+	else {
+	   open(IN2, $read2);
+	}
+	
+
+	my $out1 = $prefix . "_trim1.fastq";
+	my $out2 = $prefix . "_trim2.fastq";
+	my $out3 = $prefix . "_trim_unpaired.fastq";
+	my $out4 = $prefix . "_badreads.".$printBad;
+
+	open(OUT1, ">$out1");
+	open(OUT2, ">$out2");
+	unless($notrim) {
+		open(OUT3, ">$out3");
+	}
+	if($printBad ne "") {
+		open(OUT4, ">$out4");
+	}
+	my($head1,$seq1,$plus1,$qual1,$head2,$seq2,$plus2,$qual2);
+	print "Processing...\n";
+
+	while(my $line = <IN1>) {
+		$head1 = $line;
+		chomp($seq1 = <IN1>);
+		$plus1 = <IN1>;
+		chomp($qual1 = <IN1>);
+		
+		$head2 = <IN2>;
+		chomp($seq2 = <IN2>);
+		$plus2 = <IN2>;
+		chomp($qual2 = <IN2>);
+		
+		my($newseq1, $newscore1, $newseq2, $newscore2)=($seq1,$qual1,$seq2,$qual2);	
+
+		# Cut from 5'-end
+		if($cutfirst>0) {
+			$newseq1 = substr($seq1, $cutfirst, length($seq1)-$cutfirst);
+			$newscore1 = substr($qual1, $cutfirst, length($qual1)-$cutfirst);
+			$newseq2 = substr($seq2, $cutfirst, length($seq2)-$cutfirst);
+			$newscore2 = substr($qual2, $cutfirst, length($qual2)-$cutfirst);
+		}
+		
+		# Remove non-ATCG characters from 5'end
+		if($removeN) {
+			my ($tmp1, $tmp2);
+			($newseq1, $newscore1, $tmp1) = &removeNs($newseq1, $newscore1);
+			($newseq2, $newscore2, $tmp2) = &removeNs($newseq2, $newscore2);
+			$firstbases=$firstbases+$tmp1+$tmp2;
+		}
+		
+		#Cut from 3'-end
+		if($cutlast>0) {
+			$newseq1 = substr($newseq1, 0, length($newseq1)-$cutlast);
+			$newscore1 = substr($newscore1, 0, length($newscore1)-$cutlast);
+			$newseq2 = substr($newseq2, 0, length($newseq2)-$cutlast);
+			$newscore2 = substr($newscore2, 0, length($newscore2)-$cutlast);
+		}
+		
+		# If -notrim flag is used, the reads are not trimmed, just printed
+		if($notrim) {
+			print OUT1 $head1 . $newseq1 ."\n" . $plus1 . $newscore1 . "\n";
+			print OUT2 $head2 . $newseq2 ."\n" . $plus2 . $newscore2 . "\n";
+			$pairReads+=2;
+			$pairBases+=length($newseq1)+length($newseq2);
+		}
+		else {
+		
+			# Trim both reads
+			($newseq1, $newscore1) = &trimEnd($newseq1,$newscore1, $HQlim, $lowlim, $maxNoHQ, $maxNoLQ, $minReadLen, $scoring);
+			($newseq2, $newscore2) = &trimEnd($newseq2,$newscore2, $HQlim, $lowlim, $maxNoHQ, $maxNoLQ, $minReadLen, $scoring);
+
+			# Check if reads are ok, print good reads
+			if(readOK($newscore1, $HQlim, $lowlim, $minfrac, $lfrac, $scoring, $minReadLen)) {
+				if(&readOK($newscore2, $HQlim, $lowlim, $minfrac, $lfrac, $scoring, $minReadLen)) {
+					print OUT1 $head1 . $newseq1 ."\n" . $plus1 . $newscore1 . "\n";
+					print OUT2 $head2 . $newseq2 ."\n" . $plus2 . $newscore2 . "\n";
+					$pairReads+=2;
+					$pairBases+=length($newseq1)+length($newseq2);
+				}
+				else {
+					print OUT3 $head1 . $newseq1 ."\n" . $plus1 . $newscore1 . "\n";
+					$unpairedReads++;
+					$unpairedBases+=length($newseq1);
+					if($printBad eq "fa") {
+						print OUT4 ">".$head2.$seq2."\n";
+						$badReads++;
+					}
+					elsif($printBad eq "fq") {
+						print OUT4 $head2.$seq2."\n".$plus2.$qual2."\n";
+						$badReads++;
+					}
+				}
+			}
+			else {
+				if(&readOK($newscore2, $HQlim, $lowlim, $minfrac, $lfrac, $scoring, $minReadLen)) {
+					print OUT3 $head2 . $newseq2 ."\n" . $plus2 . $newscore2 . "\n";
+					$unpairedReads++;
+					$unpairedBases+=length($newseq2);
+					if($printBad eq "fa") {
+						print OUT4 ">".$head1.$seq1."\n";
+						$badReads++;
+					}
+					elsif($printBad eq "fq") {
+						print OUT4 $head1.$seq1."\n".$plus1.$qual1."\n";
+						$badReads++;
+					}
+				}
+				else {
+					if($printBad eq "fa" ) {
+						print OUT4 ">".$head1.$seq1."\n>".$head2.$seq2."\n";
+						$badReads+=2;
+					}
+					elsif($printBad eq "fq") {
+						print OUT4 $head1.$seq1."\n".$plus1.$qual1."\n".
+									$head2.$seq2."\n".$plus2.$qual2."\n";
+						$badReads+=2;
+					}
+				}
+			}
+		}
+		$totNoBases+=length($seq1)+length($seq2);
+		$totNoReads+=2;
+		if ($totNoReads%100000==0) {
+			print "$totNoReads reads processed\r";
+		}
+	}
+}
+# Trimming single end files
+else {
+	my $out1 = $prefix . "_trim.fastq";
+	my $out4 = $prefix . "_badreads.".$printBad;
+
+	if ($read1 =~ /\.gz$/) {
+	   open(IN1, "zcat $read1 |");	# use zcat as in the paired-end branch; IO::Zlib is not loaded by this script
+	}
+	else {
+	   open(IN1, $read1);
+	}
+
+	open(OUT1, ">$out1");
+	if($printBad ne "") {
+		open(OUT4, ">$out4");
+	}	
+
+	print "Processing...\n";
+	my($head1,$seq1,$plus1,$qual1);
+
+	while(my $line = <IN1>) {
+		$head1 = $line;
+		chomp($seq1 = <IN1>);
+		$plus1 = <IN1>;
+		chomp($qual1 = <IN1>);
+
+		my($newseq1, $newscore1) = ($seq1,$qual1);
+
+		# Cut from 5'-end
+		if($cutfirst>0) {
+			$newseq1 = substr($seq1, $cutfirst, length($seq1)-$cutfirst);
+			$newscore1 = substr($qual1, $cutfirst, length($qual1)-$cutfirst);
+		}
+		
+		# Remove non-ATCG characters from 5'end
+		if($removeN) {
+			my $tmp1;
+			($newseq1, $newscore1, $tmp1) = &removeNs($newseq1, $newscore1);
+			$firstbases+=$tmp1;
+		}
+
+		#Cut from 3'-end
+		if($cutlast>0) {
+			$newseq1 = substr($newseq1, 0, length($newseq1)-$cutlast);
+			$newscore1 = substr($newscore1, 0, length($newscore1)-$cutlast);
+		}
+		
+		# If -notrim flag is used, the reads are not trimmed, just printed
+		if($notrim) {
+			print OUT1 $head1 . $newseq1 ."\n" . $plus1 . $newscore1 . "\n";
+			$unpairedReads++;
+			$unpairedBases+=length($newseq1);
+		}
+		else {	
+			# Trim the read
+			($newseq1, $newscore1) = &trimEnd($newseq1, $newscore1, $HQlim, $lowlim, $maxNoHQ, $maxNoLQ, $minReadLen, $scoring);
+			if(readOK($newscore1, $HQlim, $lowlim, $minfrac, $lfrac, $scoring, $minReadLen)) {
+				print OUT1 $head1 . $newseq1 ."\n" . $plus1 . $newscore1 . "\n";
+				$unpairedReads++;
+				$unpairedBases+=length($newseq1);
+			}
+			else {
+				if($printBad eq "fa") {
+					print OUT4 ">".$head1.$seq1."\n";
+					$badReads++;
+				}
+				elsif($printBad eq "fq") {
+					print OUT4 $head1.$seq1."\n".$plus1.$qual1."\n";
+					$badReads++;
+				}
+			}
+		}
+		$totNoBases+=length($seq1);
+		$totNoReads+=1;
+		if ($totNoReads%100000==0) {
+			print "$totNoReads reads processed\r";
+		}
+	}
+	
+}
+$firstbases+=$cutfirst*$totNoReads;
+$lastbases+=$cutlast*$totNoReads;
+my $totCutBases = $firstbases+$lastbases;
+#--------------------------------------------------------------------------------
+# Print statistics to table
+open(STATS, ">$prefix".".stats");
+print STATS $prefix."\t".$totNoReads."\t".$totNoBases."\t".$pairReads."\t".$pairBases.
+		"\t".$unpairedReads."\t".$unpairedBases."\n";
+
+print "\nDone!\n";
+print "------------------------------------------------------------------\n";
+print "$totNoReads reads with $totNoBases bases in input files\n";
+if($cutfirst>0) {
+	print "$cutfirst bases were removed from the 5'-end\n";
+}
+if($cutlast>0) {
+	print "$cutlast bases were removed from the 3'-end\n";
+}
+if($totCutBases>0) {
+	print "In total, $totCutBases bases were removed before quality trimming\n";
+}
+if($notrim) {
+	print "Use of flag \"notrim\" means no further quality trimming was done\n";
+}
+if($read2) {
+	my $percent = 100*($pairReads/$totNoReads);
+	$percent = sprintf "%.2f", $percent;
+	print "$pairReads ($percent%) reads with $pairBases bases saved in pair files\n";
+	unless($notrim) {
+		$percent = 100*($unpairedReads/$totNoReads);
+		$percent = sprintf "%.2f", $percent;
+		print "$unpairedReads ($percent%) reads with $unpairedBases bases saved in unpaired file\n";
+		print "  due to low quality of the other read in the pair\n";
+	}
+}
+else {
+	my $percent = 100*($unpairedReads/$totNoReads);
+	$percent = sprintf "%.2f", $percent;
+	print "$unpairedReads ($percent%) reads with $unpairedBases bases saved\n";
+}
+if($printBad ne "") {
+	print "$badReads removed reads were printed to a $printBad file\n";
+}
+print "------------------------------------------------------------------\n";
+$time = time-$time;
+if($time<60) {
+	print "Total time elapsed: $time sec.\n";
+}
+else {
+	$time = int($time/60 + 0.5);
+	print "Total time elapsed: $time min.\n";
+}
+
+#--------------------------------------------------------------------------------
+# Subroutines
+
+sub removeNs {
+	my $seq = shift;
+	my $qual = shift;
+	
+#	print "seq is $seq\n";
+
+	my @s = split("", $seq);
+	my @t = split("", $qual);
+	my $cnt = 0;
+	
+	while(scalar(@s)>0 && $s[0] !~ m/[ATCG]/i) {
+		shift(@s);
+		shift(@t);
+		$cnt++;	
+	}
+	$seq = join("", @s);
+	$qual = join("", @t);	
+	return ($seq, $qual, $cnt);
+}
+
+sub trimEnd { 
+
+	my $seq = shift;
+	my $qual = shift;
+	my $HQ = shift;
+	my $LQ = shift;
+	my $maxHQ = shift;
+	my $maxLQ = shift;
+	my $len = shift;
+	my $sc = shift;
+
+	if($seq ne "") {
+		my $LQ_flag = 0;
+		my $HQ_warn = 0;
+		my $LQinHQ_flag = "no";
+		my ($qual_end, $seq_end) = ("","");
+
+		my @t = split("", $qual);
+		my @s = split("", $seq);
+
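+		# Walk in from the 3'-end: $seq_end/$qual_end temporarily hold bases that
+		# have been accepted, and $HQ_warn counts the high-quality bases in the
+		# current stretch. The loop stops when more than $maxHQ high-quality bases
+		# have been collected or the read shrinks to $len bases. A base below $HQ
+		# but above $LQ may be kept inside a high-quality stretch while $LQ_flag
+		# is within $maxLQ; any other low-quality base discards the temporarily
+		# kept tail and restarts the stretch count.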
+		while(scalar(@t)>$len && $HQ_warn<=$maxHQ) {
+
+			if (ord($t[scalar(@t)-1])-$sc < $HQ) {
+			
+				if ($HQ_warn > 0 && $LQ_flag <= $maxLQ && ord($t[scalar(@t)-1])-$sc > $LQ) {
+					$qual_end = pop(@t).$qual_end;
+					$seq_end = pop(@s).$seq_end;
+					$LQinHQ_flag = "yes";
+					$LQ_flag++;
+
+				}
+				else {
+					pop(@t);
+					pop(@s);
+					($qual_end, $seq_end) = ("","");
+					$HQ_warn = 0;
+				}
+				
+			}
+			else {
+				$qual_end = pop(@t).$qual_end;
+				$seq_end = pop(@s).$seq_end;
+				if($LQinHQ_flag eq "yes") {
+					$HQ_warn = 1;
+					$LQinHQ_flag = "no";
+				}
+				else {
+					$HQ_warn++;
+				}
+			}
+		}
+		$seq = join("", @s);
+		$qual = join("", @t);
+		$seq .= $seq_end;
+		$qual .= $qual_end;
+	
+	}
+	return ($seq, $qual);
+}
+sub readOK {
+
+	my $qual = shift;
+	my $HQ = shift;
+	my $LQ = shift;
+	my $frac = shift;
+	my $lfrac = shift;
+	my $sc = shift;
+	my $minlen = shift;
+		
+	if(length($qual)<$minlen) {
+		return 0;
+	}
+
+	my @t = split("", $qual);
+
+	my $score_cnt=0;
+	my $bad_cnt=0;
+
+	for (my $i=0; $i<scalar(@t); $i++) {
+		if(ord($t[$i])-$sc >= $HQ) {
+			$score_cnt++;
+		}
+		if(ord($t[$i])-$sc < $LQ) {
+			if($lfrac==0) {
+				return 0;
+			}
+			$bad_cnt++;
+		}
+	}
+	if($score_cnt/scalar(@t) >= $frac && $bad_cnt/scalar(@t)<=$lfrac) {
+		return 1;
+	}
+	else {
+		return 0;
+	}
+}
+sub table64 {
+print "Char	ASCII	Char-64	P(error)
+;	59	-5	0.7597
+<	60	-4	0.7153
+=	61	-3	0.6661
+>	62	-2	0.6131
+?	63	-1	0.5573
+@	64	0	0.5000
+A	65	1	0.4427
+B	66	2	0.3869
+C	67	3	0.3339
+D	68	4	0.2847
+E	69	5	0.2403
+F	70	6	0.2008
+G	71	7	0.1663
+H	72	8	0.1368
+I	73	9	0.1118
+J	74	10	0.0909
+K	75	11	0.0736
+L	76	12	0.0594
+M	77	13	0.0477
+N	78	14	0.0383
+O	79	15	0.0307
+P	80	16	0.0245
+Q	81	17	0.0196
+R	82	18	0.0156
+S	83	19	0.0124
+T	84	20	0.0099
+U	85	21	0.0079
+V	86	22	0.0063
+W	87	23	0.0050
+X	88	24	0.0040
+Y	89	25	0.0032
+Z	90	26	0.0025
+[	91	27	0.0020
+\\	92	28	0.0016
+]	93	29	0.0013
+^	94	30	0.0010
+_	95	31	0.0008
+`	96	32	0.0006
+a	97	33	0.0005
+b	98	34	0.0004
+c	99	35	0.0003
+d	100	36	0.0003
+e	101	37	0.0002
+f	102	38	0.0002
+g	103	39	0.0001
+h	104	40	0.0001
+";
+}
+sub table33 {
+print "Char	ASCII	Char-33	P(error)
+!	33	0	0.5000
+\"	34	1	0.4427
+#	35	2	0.3869
+\$	36	3	0.3339
+%	37	4	0.2847
+&	38	5	0.2403
+'	39	6	0.2008
+(	40	7	0.1663
+)	41	8	0.1368
+*	42	9	0.1118
++	43	10	0.0909
+,	44	11	0.0736
+-	45	12	0.0594
+.	46	13	0.0477
+/	47	14	0.0383
+0	48	15	0.0307
+1	49	16	0.0245
+2	50	17	0.0196
+3	51	18	0.0156
+4	52	19	0.0124
+5	53	20	0.0099
+6	54	21	0.0079
+7	55	22	0.0063
+8	56	23	0.0050
+9	57	24	0.0040
+:	58	25	0.0032
+;	59	26	0.0025
+<	60	27	0.0020
+=	61	28	0.0016
+>	62	29	0.0013
+?	63	30	0.0010
+@	64	31	0.0008
+A	65	32	0.0006
+B	66	33	0.0005
+C	67	34	0.0004
+D	68	35	0.0003
+E	69	36	0.0003
+F	70	37	0.0002
+G	71	38	0.0002
+H	72	39	0.0001
+I	73	40	0.0001
+J	74	41	0.0001
+";
+}
diff --git a/filterPCRdupl.pl b/filterPCRdupl.pl
new file mode 100644
index 0000000..0bd568c
--- /dev/null
+++ b/filterPCRdupl.pl
@@ -0,0 +1,237 @@
+#!/usr/bin/perl
+
+# filterPCRdupl.pl
+# November 2010	
+# Author: Linnéa Smeds (linnea.smeds at ebc.uu.se)
+
+use strict;
+use warnings;
+use Getopt::Long;
+use List::Util qw[min max];
+
+my $usage = "
+# filterPCRdupl.pl
+# November 2010	
+# Author: Linnéa Smeds (linnea.smeds\@ebc.uu.se)
+# This version does not fail on reads that are too short, but removes them and
+# prints a warning message. It can also open gz files the second time they are
+# read (so that something is actually printed to the output).
+# -----------------------------------------------------------------------------
+# Description: Extract all non-redundant read pairs in a fastq file pair, by 
+# comparing the first N bases in both reads between all pairs. If there are 
+# several copies of the same pair, only the pair with the highest quality 
+# score is kept. Also saves a histogram of the copy distribution.
+# Usage: perl filterPCRdupl.pl -fastq1=file1 -fastq2=file2 [-prefix=s -cmp=N]
+
+-fastq1=file \t Fastq files with the read pairs (pair1 in the first file and pair2
+-fastq2=file \t in the second). Reads must have the same order in both files.
+-prefix=string \t Prefix for the output files. The filtered fastq files will be
+ \t\t named prefix_uniq1.fastq and prefix_uniq2.fastq, and the 
+ \t\t histogram with copy distribution will be named prefix_copy.hist.
+ \t\t [same prefix as file1].
+-cmp=number \t The number of bases from the start of the reads in each pair
+ \t\t that will be used for comparison [50].
+-h \t\t Print this help message.\n";
+				
+
+# Starting time
+my $time = time;
+
+# Input parameters
+my ($read1, $read2, $prefix, $baseComp, $help);
+GetOptions(
+ 	"fastq1=s" => \$read1,
+ 	"fastq2=s" => \$read2,
+   	"cmp=i" => \$baseComp,
+   	"prefix=s" => \$prefix,
+	"h" => \$help);
+
+#Checking input, set default if not given
+unless($read1 && $read2) {
+	die $usage . "\n";
+}
+if($help) {
+	die $usage . "\n";
+}
+unless($baseComp) {
+	$baseComp=50;
+}
+unless($prefix) {
+	$prefix = $read1;
+	$prefix =~ s/\.\w+//;
+}
+unless(-e $read1) {
+	die "Error: File $read1 doesn't exist!\n";
+}
+unless(-e $read2) {
+	die "Error: File $read2 doesn't exist!\n";
+}
+
+my %reads =();
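+# %reads maps the first -cmp bases of read1 concatenated with read2 to the best
+# copy seen so far: {'id'} holds the header of the highest-scoring copy, {'scr'}
+# its summed quality score, and {'cpy'} the number of copies observed.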
+
+if ($read1 =~ /\.gz$/) {
+   open(RD1, "zcat $read1 |");
+}
+else {
+   open(RD1, $read1);
+}
+
+if ($read2 =~ /\.gz$/) {
+   open(RD2, "zcat $read2 |");
+}
+else {
+   open(RD2, $read2);
+}
+	
+print "filterPCRdupl.pl started " . localtime() . "\n";
+
+$| = 1;
+print "Going through all read pairs...\n";
+my($hd1, $hd2, $seq1, $seq2, $plus1, $plus2, $qual1, $qual2);
+my ($totCnt, $duplCnt)=(0,0);
+while(<RD1>) {
+	$hd1=$_;
+	chomp($seq1=<RD1>);
+	$plus1=<RD1>;
+	$qual1=<RD1>;
+
+	$hd2 =<RD2>;
+	chomp($seq2=<RD2>);
+	$plus2=<RD2>;
+	$qual2=<RD2>;
+	
+	if(min(length($seq1), length($seq2))<$baseComp) {
+		my $smallest = min(length($seq1), length($seq2));
+		print "WARNING: Tried to compare the first $baseComp bases, but the read length is only $smallest.\n".
+			"Read pair removed. Try setting -cmp to $smallest or lower.\n";
+		next;
+	}
+	my $key = substr($seq1, 0, $baseComp) . substr($seq2, 0, $baseComp);
+	my @t1 = split(//, $qual1);
+	my @t2 = split(//, $qual2);
+
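+	# Sum the quality values of both reads (with a fixed 64 offset); among
+	# duplicates sharing the same key, the copy with the highest total is kept.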
+	my $score=0;
+	for(my $i=0; $i<scalar(@t1); $i++) {
+		my $temp1 = ord($t1[$i])-64;
+		$score+=$temp1;	
+	}
+	for(my $i=0; $i<scalar(@t2); $i++) {
+		my $temp2 = ord($t2[$i])-64;
+		$score+=$temp2;	
+	}
+
+	if(defined $reads{$key}) {
+		if($score > $reads{$key}{'scr'}) {
+			 $reads{$key}{'id'}=$hd1;
+			 $reads{$key}{'scr'}=$score;
+		}
+		$reads{$key}{'cpy'}++;
+		$duplCnt++;
+	}
+	else {
+		$reads{$key}{'scr'}=$score;
+		$reads{$key}{'id'}=$hd1;
+		$reads{$key}{'cpy'}=1;
+	}
+	$totCnt++;
+}
+close(RD1);
+close(RD2);
+
+my $curr_time = time-$time;
+if($curr_time<60) {
+	print "\tdone in $curr_time seconds.\n";
+}
+else {
+	$curr_time=int($curr_time/60 + 0.5);
+	print "\tdone in $curr_time minutes.\n";
+}
+$curr_time = time;
+
+print "Writing unique read pairs to output...\n";
+
+my $out1 = $prefix . "_uniq1.fastq";
+my $out2 = $prefix . "_uniq2.fastq";
+my %copies = ();
+
+open(OUT1, ">$out1");
+open(OUT2, ">$out2");
+if ($read1 =~ /\.gz$/) {
+   open(RD1, "zcat $read1 |");
+}
+else {
+   open(RD1, $read1);
+}
+
+if ($read2 =~ /\.gz$/) {
+   open(RD2, "zcat $read2 |");
+}
+else {
+   open(RD2, $read2);
+}
+
+while(<RD1>) {
+	$hd1=$_;
+	chomp($seq1=<RD1>);
+	$plus1=<RD1>;
+	$qual1=<RD1>;
+
+	$hd2 =<RD2>;
+	chomp($seq2=<RD2>);
+	$plus2=<RD2>;
+	$qual2=<RD2>;
+
+	my $key = substr($seq1, 0, $baseComp) . substr($seq2, 0, $baseComp);
+
+
+	if(defined $reads{$key} && $reads{$key}{'id'} eq $hd1) {
+		print OUT1 $hd1 . $seq1 . "\n" . $plus1 . $qual1;
+		print OUT2 $hd2 . $seq2 . "\n" . $plus2 . $qual2;
+		if(defined $copies{$reads{$key}{'cpy'}}) {
+			$copies{$reads{$key}{'cpy'}}++;
+		}
+		else {
+			$copies{$reads{$key}{'cpy'}}=1;
+		}
+		delete $reads{$key};	
+	}
+}
+close(RD1);
+close(RD2);
+close(OUT1);
+close(OUT2);
+
+$curr_time = time-$curr_time;
+if($curr_time<60) {
+	print "\tdone in $curr_time seconds.\n";
+}
+else {
+	$curr_time=int($curr_time/60 + 0.5);
+	print "\tdone in $curr_time minutes.\n";
+}
+$curr_time = time;
+
+
+print "Writing copy distribution histogram...\n";
+my $distOut = $prefix ."_copy.hist";
+open(OUT, ">$distOut");
+foreach my $key (sort {$a<=>$b} keys %copies) {
+	print OUT $key ."\t". $copies{$key} ."\n";
+}
+close(OUT);
+print "\tdone!\n";
+
+my $frac = 100*($duplCnt/$totCnt);
+$frac = sprintf "%.2f", $frac;
+print "$totCnt pairs in total, $duplCnt ($frac%) duplicated pairs removed.\n";
+
+$time = (time-$time);
+if($time<60) {
+	print "Total time elapsed: $time seconds.\n";
+}
+else {
+	$time=int($time/60 + 0.5);
+	print "Total time elapsed: $time minutes.\n";
+}
+
+
diff --git a/wiki/README.wiki b/wiki/README.wiki
new file mode 100644
index 0000000..e0431c5
--- /dev/null
+++ b/wiki/README.wiki
@@ -0,0 +1,71 @@
+#labels Featured
+=README CONDETRI=
+------------------
+
+==Summary==
+Trim reads from the 3'-end and extract reads (or read pairs) of good quality. 
+If the reads are paired, the filtering is done pairwise, and if one read in a
+pair has low quality, the remaining read is saved as single end.
+
+<BR>
+==Usage==
+perl condetri.pl -fastq1=file1 `[`-fastq2=file2 -prefix=s -cutfirst=i -hq=i -lq=i -frac=`[`0,1`]` -minlen=i -mh=i -ml=i -sc=i -rmN`]`
+
+<BR>
+==Description==
+*The trimming is performed in two steps:*
+
+*(1)* Trimming low quality bases from the 3'-end
+
+*(2)* Overall quality check of read/pair
+
+*Details:*
+
+*(1)* Bases are removed from the 3'-end if the quality score is lower than some threshold (hq). When a base with higher quality is reached, it is kept temporarily and the preceding bases are considered. After this point, bases with low quality can also be saved if they are surrounded by high quality bases. Up to ml consecutive bases are saved temporarily, but if the following base also has low quality, all temporarily saved bases are removed, and the trimming starts all over again. The [...]
+
+*(2)* After the trimming step, the quality scores of the remaining read are checked. A read is approved if a certain fraction (frac) of the bases have a quality score higher than hq and no base in the read has a quality score below some lower bound (lq). If the input is paired-end reads, both reads in a pair must be approved to keep the pair. If only one of the reads is approved, it is saved to an additional, unpaired file which can be used as single-end data.
+
+<BR>
+==Comments==
+The parameters -cutfirst and -rmN were added to ConDeTri in version 2.0 (for details, see the release notes). If -cutfirst is used, -rmN is somewhat redundant (it will still scan for non-ATCG characters from the 5' end, but the first i bases have already been removed). Also keep in mind that these choices affect the -minlen parameter: for example, with 75 bp reads, -minlen=50 and -cutfirst=6, the maximum possible number of bases that can be trimmed from the 3' end is 19 [...]
+
+<BR>
+==Input parameters== 
+(default values in brackets `[``]`)
+<pre>
+-fastq1=file	Fastq file. If a second file is given, the files are trimmed as
+-fastq2=file	 a pair. The reads must have the same order in both files.
+-prefix=string	Prefix for the output file(s). The filtered fastq file(s) will be
+	         named prefix_trim1.fastq (and prefix_trim2.fastq if present).
+ 	  	 For pairs, a third file will be given with unpaired reads (reads 
+ 	  	 from pairs where one low quality read has been removed).
+-cutfirst=i	Removes the first i bases from the 5' end before trimming `[`0`]`.
+-rmN		Removes N's from the 5' end before trimming `[`no`]`.
+-hq=i		High quality threshold `[`25`]`.
+-lq=i		Low quality threshold `[`10`]`.
+-frac=`[`0,1`]`	Fraction of read that must exceed hq `[`0.8`]`.
+-minlen=i	Min allowed read length `[`50`]`.
+-mh=i		When this no of consecutive hq bases is reached, the trimming stops `[`5`]`.
+-ml=i		Max no of lq bases allowed after a stretch of hq bases from 3'-end `[`1`]`.
+-sc=i		Illumina scoring table: Score=ASCII-sc. Usually 64; newer Illumina
+  	  	 data uses 33, which is the Sanger standard. Can be set to any
+  	  	 other integer if wanted `[`64`]`.
+-q		Prints Illumina scoring table.
+-h		Prints a help message.
+</pre>
+<BR>
+==Output files==
+<pre>
+prefix_trim1.fastq		File(s) with the trimmed reads (one file for
+prefix_trim2.fastq		single-end data, three for paired-end data, where
+prefix_trim_unpaired.fastq	the last file includes reads from the two input
+ 	  	  	  	 files whose read pair had too poor quality).
+prefix.stats			Includes basic statistics in columns.
+</pre>
+The columns for the .stats file are the following:<BR>
+PREFIX, NUMBER OF READS IN ORIGINAL FILE(S), NUMBER OF BASES IN ORIGINAL FILE(S), NO OF PAIRED READS AFTER TRIMMING,<BR>NO OF BASES IN PAIRS AFTER TRIMMING, NO OF UNPAIRED READS AFTER TRIMMING, NO OF UNPAIRED BASES AFTER TRIMMING.
+
+The .stats files are suitable for concatenation to make a summary table for several fastq files, for example one file for each lane in a flowcell or a set of transcriptome samples. Just use:
+
+$ cat `*`.stats
+
+ 
\ No newline at end of file

-- 
Alioth's /usr/local/bin/git-commit-notice on /srv/git.debian.org/git/debian-med/condetri.git


