[med-svn] [Git][med-team/stacks][master] 7 commits: New upstream version 2.60+dfsg
Andreas Tille (@tille)
gitlab at salsa.debian.org
Tue Sep 27 11:42:34 BST 2022
Andreas Tille pushed to branch master at Debian Med / stacks
Commits:
cfc9f125 by Nilesh Patra at 2021-11-02T00:07:48+05:30
New upstream version 2.60+dfsg
- - - - -
cd73b73e by Andreas Tille at 2022-09-27T12:20:32+02:00
New upstream version 2.62+dfsg
- - - - -
afc93edc by Andreas Tille at 2022-09-27T12:20:32+02:00
routine-update: New upstream version
- - - - -
642ed1b4 by Andreas Tille at 2022-09-27T12:20:32+02:00
Update upstream source from tag 'upstream/2.62+dfsg'
Update to upstream version '2.62+dfsg'
with Debian dir b394fff46c780a855124eef32d13ab835ec15d1a
- - - - -
15d7ead4 by Andreas Tille at 2022-09-27T12:20:32+02:00
routine-update: Standards-Version: 4.6.1
- - - - -
80c72c78 by Andreas Tille at 2022-09-27T12:37:38+02:00
Refresh patches
- - - - -
3a737b16 by Andreas Tille at 2022-09-27T12:39:31+02:00
Upload to unstable
- - - - -
23 changed files:
- ChangeLog
- Makefile.am
- configure.ac
- debian/changelog
- debian/control
- − debian/patches/2to3.patch
- debian/patches/script-exe-paths
- debian/patches/series
- − scripts/count_fixed_catalog_snps.py
- scripts/denovo_map.pl
- − scripts/integrate_alignments.py
- scripts/stacks-dist-extract
- src/bootstrap.cc
- src/clean.h
- src/export_formats.cc
- src/export_formats.h
- src/ordered.h
- src/populations.cc
- src/populations.h
- src/process_radtags.cc
- src/process_radtags.h
- src/renz.cc
- src/ustacks.cc
Changes:
=====================================
ChangeLog
=====================================
@@ -1,3 +1,28 @@
+Stacks 2.62 - June 28, 2022
+---------------------------
+ Feature: added a '--vcf-all' export to populations which will export fixed
+ and variable sites to a VCF file. If '--ordered' is also specified, it
+ will only export non-overlapping sites.
+ Feature: improved ustacks logging to include final number of assembled
+ loci; modified denovo_map.pl to include this in its logging output.
+ Improved logging in populations program.
+ Feature: Added variant of PstI cutsite to process_radtags: pstishi only
+ leaves "GCAG" as the cutsite remnant, published in:
+ Shirasawa, Hirakawa, and Isobe, 2016, DNA Research; doi: 10.1093/dnares/dsw004
+ Bugfix: fixed assert() failure in populations when no population map was
+ specified.
+ Bugfix: updated stacks-dist-extract --pretty print to better handle
+ printing comments.
+
+Stacks 2.61 - April 19, 2022
+----------------------------
+ Feature: parallelized process_radtags. Can now run on multiple cores (max of 24 cores), resulting
+ in a speedup of 2-3x, depending on physical I/O and number of cores used. Minor improvements
+ to output status messages.
+ Feature: added '--pretty' print option to stacks-dist-extract script.
+ Bugfix: corrected bug in parsing of bootstrap archive file; long lines were not properly handled.
+ Feature: Added HhaI restriction enzyme.
+
Stacks 2.60 - October 26, 2021
------------------------------
Feature: memory usage reduction in populations. Some examples of memory savings:
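A usage sketch for the new 2.62 export (the input directory and popmap paths
are hypothetical; the --vcf-all and --ordered flags are taken from the
changelog entry above):

    # Write fixed AND variable sites to <out-prefix>.all.vcf via the new
    # --vcf-all export; adding --ordered restricts the output to
    # non-overlapping sites.
    populations -P ./stacks_output -M ./popmap.tsv --vcf-all --ordered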
=====================================
Makefile.am
=====================================
@@ -85,7 +85,6 @@ kmer_filter_LDADD = $(LDADD) libclean.a
populations_LDADD = $(LDADD) libpop.a
dist_bin_SCRIPTS = scripts/denovo_map.pl scripts/ref_map.pl \
- scripts/integrate_alignments.py scripts/count_fixed_catalog_snps.py \
scripts/stacks-integrate-alignments scripts/stacks-dist-extract scripts/stacks-gdb \
scripts/stacks-samtools-tview scripts/stacks-count-reads-per-sample-per-locus \
scripts/stacks-hist2d-loci-samples-coverage # scripts/denovo_map.py
=====================================
configure.ac
=====================================
@@ -2,7 +2,7 @@
# Process this file with autoconf to produce a configure script.
AC_PREREQ(2.59)
-AC_INIT([Stacks], [2.60])
+AC_INIT([Stacks], [2.62])
AC_CONFIG_AUX_DIR([config])
AM_INIT_AUTOMAKE([-Wall -Werror foreign parallel-tests subdir-objects])
AC_CONFIG_SRCDIR([src/ustacks.cc])
=====================================
debian/changelog
=====================================
@@ -1,3 +1,10 @@
+stacks (2.62+dfsg-1) unstable; urgency=medium
+
+ * New upstream version
+ * Standards-Version: 4.6.1 (routine-update)
+
+ -- Andreas Tille <tille at debian.org> Tue, 27 Sep 2022 12:38:01 +0200
+
stacks (2.60+dfsg-1) unstable; urgency=medium
* Team upload.
=====================================
debian/control
=====================================
@@ -10,7 +10,7 @@ Build-Depends: debhelper-compat (= 13),
libsparsehash-dev,
samtools,
libhts-dev
-Standards-Version: 4.6.0
+Standards-Version: 4.6.1
Vcs-Browser: https://salsa.debian.org/med-team/stacks
Vcs-Git: https://salsa.debian.org/med-team/stacks.git
Homepage: https://creskolab.uoregon.edu/stacks/
=====================================
debian/patches/2to3.patch deleted
=====================================
@@ -1,164 +0,0 @@
-Author: Andreas Tille <tille at debian.org>
-Last-Update: Wed, 10 Jul 2019 21:58:30 +0200
-Description: Result of 2to3 port to Python3
-
---- a/scripts/count_fixed_catalog_snps.py
-+++ b/scripts/count_fixed_catalog_snps.py
-@@ -1,4 +1,4 @@
--#!/usr/bin/env python
-+#!/usr/bin/python3
-
- import optparse
- import sys
-@@ -39,7 +39,7 @@
- batch_id = int(opts.batch_id)
-
- if len(path) == 0 or os.path.exists(path) == False:
-- print >> sys.stderr, "You must specify a valid path to Stacks input files."
-+ print("You must specify a valid path to Stacks input files.", file=sys.stderr)
- p.print_help()
- sys.exit()
-
-@@ -62,10 +62,10 @@
- pos = entry.find(".matches.tsv")
- if (pos != -1):
- files.append(entry[0:pos])
-- print >> sys.stderr, "Found", len(files), "Stacks samples."
-+ print("Found", len(files), "Stacks samples.", file=sys.stderr)
-
- except:
-- print >> sys.stderr, "Unable to read files from Stacks directory, '" + path + "'"
-+ print("Unable to read files from Stacks directory, '" + path + "'", file=sys.stderr)
-
-
- def parse_cigar(cigar, components):
-@@ -143,14 +143,14 @@
- matches[sample_locus] = cat_locus
- else:
- if cat_locus != matches[sample_locus]:
-- print >> sys.stderr, "Error: sample locus", sample_locus, "matches more than one catalog locus."
-+ print("Error: sample locus", sample_locus, "matches more than one catalog locus.", file=sys.stderr)
-
- if len(cigar) > 0:
- if sample_locus not in cigars:
- cigars[sample_locus] = cigar
- else:
- if cigar != cigars[sample_locus]:
-- print >> sys.stderr, "Error: sample locus", sample_locus, "has multiple cigar alignments."
-+ print("Error: sample locus", sample_locus, "has multiple cigar alignments.", file=sys.stderr)
-
- fh.close()
-
-@@ -279,23 +279,23 @@
- #
- i = 1
- for file in files:
-- print >> sys.stderr, "Processing file", str(i), "of", len(files), "['" + file + "']"
-+ print("Processing file", str(i), "of", len(files), "['" + file + "']", file=sys.stderr)
- cnt = count_sample_snps(path, file, sample_snps)
-- print >> sys.stderr, " Found", cnt, "heterozygous SNPs in sample."
-+ print(" Found", cnt, "heterozygous SNPs in sample.", file=sys.stderr)
- i += 1
-
- total_snps = 0
- for locus in sample_snps:
- for col in sample_snps[locus]:
- total_snps += 1
--print >> sys.stderr, "Found", total_snps, "variable sites across the population."
-+print("Found", total_snps, "variable sites across the population.", file=sys.stderr)
-
- #
- # Count all the SNPs found in the catalog.
- #
--print >> sys.stderr, "Processing the catalog"
-+print("Processing the catalog", file=sys.stderr)
- cnt = count_catalog_snps(path, batch_id, catalog_snps)
--print >> sys.stderr, " Found", cnt, "heterozygous SNPs in the catalog."
-+print(" Found", cnt, "heterozygous SNPs in the catalog.", file=sys.stderr)
-
- #
- # Count all the SNPs in the catalog but not in any sample: these are the fixed differences cstacks identified.
-@@ -315,7 +315,7 @@
- fixed_snps += 1
- for col in sample_snps[locus]:
- if col not in c:
-- print "Locus:", locus, "col:", col, "Catalog SNPs:", catalog_snps[locus], "Sample SNPs:", sample_snps[locus]
-+ print("Locus:", locus, "col:", col, "Catalog SNPs:", catalog_snps[locus], "Sample SNPs:", sample_snps[locus])
-
--print >> sys.stderr, "Found", total_snps, "SNPs across all samples and in the catalog."
--print >> sys.stderr, "Found", fixed_snps, "fixed SNPs only in the catalog."
-+print("Found", total_snps, "SNPs across all samples and in the catalog.", file=sys.stderr)
-+print("Found", fixed_snps, "fixed SNPs only in the catalog.", file=sys.stderr)
---- a/scripts/integrate_alignments.py
-+++ b/scripts/integrate_alignments.py
-@@ -1,4 +1,4 @@
--#!/usr/bin/env python
-+#!/usr/bin/python3
-
- import optparse
- import sys
-@@ -49,7 +49,7 @@
- batch_id = int(opts.batch_id)
-
- if len(out_path) == 0 or os.path.exists(out_path) == False:
-- print >> sys.stderr, "You must specify a valid path to write files to."
-+ print("You must specify a valid path to write files to.", file=sys.stderr)
- p.print_help()
- sys.exit()
-
-@@ -57,12 +57,12 @@
- out_path += "/"
-
- if len(aln_path) == 0 or os.path.exists(aln_path) == False:
-- print >> sys.stderr, "You must specify a valid path to a SAM file containing catalog locus alignments."
-+ print("You must specify a valid path to a SAM file containing catalog locus alignments.", file=sys.stderr)
- p.print_help()
- sys.exit()
-
- if len(path) == 0 or os.path.exists(path) == False:
-- print >> sys.stderr, "You must specify a valid path to Stacks input files."
-+ print("You must specify a valid path to Stacks input files.", file=sys.stderr)
- p.print_help()
- sys.exit()
-
-@@ -85,10 +85,10 @@
- pos = entry.find(".matches.tsv")
- if (pos != -1):
- files.append(entry[0:pos])
-- print >> sys.stderr, "Found", len(files), "Stacks samples."
-+ print("Found", len(files), "Stacks samples.", file=sys.stderr)
-
- except:
-- print >> sys.stderr, "Unable to read files from Stacks directory, '", path, "'"
-+ print("Unable to read files from Stacks directory, '", path, "'", file=sys.stderr)
-
-
- def parse_catalog_alignments(aln_path, alns):
-@@ -115,7 +115,7 @@
- alns[locus] = (chr, bp, "+");
-
- fh.close()
-- print >> sys.stderr, "Loaded", len(alns), "catalog locus alignments from '", aln_path, "'."
-+ print("Loaded", len(alns), "catalog locus alignments from '", aln_path, "'.", file=sys.stderr)
-
-
- def convert_sample(path, file, out_path, alns):
-@@ -414,14 +414,14 @@
-
- i = 1
- for file in files:
-- print >> sys.stderr, "Processing file", str(i), "of", len(files), "['" + file + "']"
-+ print("Processing file", str(i), "of", len(files), "['" + file + "']", file=sys.stderr)
- cnt = convert_sample(path, file, out_path, alns)
-- print >> sys.stderr, " Added alignments for", cnt, "loci."
-+ print(" Added alignments for", cnt, "loci.", file=sys.stderr)
- i += 1
-
- #
- # Now process the catalog.
- #
--print >> sys.stderr, "Processing the catalog"
-+print("Processing the catalog", file=sys.stderr)
- cnt = convert_catalog(path, batch_id, out_path, alns)
--print >> sys.stderr, " Added alignments for", cnt, "catalog loci."
-+print(" Added alignments for", cnt, "catalog loci.", file=sys.stderr)
=====================================
debian/patches/script-exe-paths
=====================================
@@ -9,7 +9,7 @@ Last-Update: 2016-10-21
This patch header follows DEP-3: http://dep.debian.net/deps/dep3/
--- a/Makefile.am
+++ b/Makefile.am
-@@ -104,10 +104,10 @@
+@@ -103,10 +103,10 @@ debug:
$(MAKE) all "CXXFLAGS=-g3 -Wall -DDEBUG -Og"
install-data-hook:
=====================================
debian/patches/series
=====================================
@@ -1,4 +1,3 @@
RisR.patch
script-exe-paths
use_debian_packaged_htslib.patch
-2to3.patch
=====================================
scripts/count_fixed_catalog_snps.py deleted
=====================================
@@ -1,321 +0,0 @@
-#!/usr/bin/env python
-
-import optparse
-import sys
-import os
-import gzip
-import re
-
-#
-# Global configuration variables.
-#
-path = ""
-aln_path = ""
-out_path = ""
-batch_id = -1
-
-def parse_command_line():
- global path
- global batch_id
-
- p = optparse.OptionParser()
-
- #
- # Add options.
- #
- p.add_option("-p", action="store", dest="path",
- help="path to Stacks directory.")
- p.add_option("-b", action="store", dest="batch_id",
- help="Stacks batch ID.")
-
- #
- # Parse the command line
- #
- (opts, args) = p.parse_args()
-
- if opts.path != None:
- path = opts.path
- if opts.batch_id != None:
- batch_id = int(opts.batch_id)
-
- if len(path) == 0 or os.path.exists(path) == False:
- print >> sys.stderr, "You must specify a valid path to Stacks input files."
- p.print_help()
- sys.exit()
-
- if batch_id < 0:
- pritn >> sys.stderr, "You must specify the batch ID that was supplied to Stacks."
- p.print_help()
- sys.exit()
-
- if path.endswith("/") == False:
- path += "/"
-
-
-def find_stacks_files(path, files):
- try:
- entries = os.listdir(path)
-
- for entry in entries:
- pos = entry.find(".matches.tsv.gz")
- if (pos == -1):
- pos = entry.find(".matches.tsv")
- if (pos != -1):
- files.append(entry[0:pos])
- print >> sys.stderr, "Found", len(files), "Stacks samples."
-
- except:
- print >> sys.stderr, "Unable to read files from Stacks directory, '" + path + "'"
-
-
-def parse_cigar(cigar, components):
- #
- # Parse a cigar string, e.g. 48M1D47M or 43M3D52M3D.
- #
- start = 0
- end = 0
- dist = ""
- for c in cigar:
- if c.isalpha():
- dist = int(cigar[start:end])
- op = c.upper();
- end += 1
- start = end
- components.append((op, dist))
- else:
- end += 1
-
-
-def adjust_snps(cigar, sample_snps):
- #
- # Adjust SNP positions according to how this sample was aligned to the catalog
- # (which is described by the CIGAR string).
- #
- if len(sample_snps) == 0:
- return
-
- comp = []
- parse_cigar(cigar, comp)
-
- index = 0
- offset = 0
- bp = 0
- for (op, dist) in comp:
- if op == 'M' or op == 'I':
- bp += dist
- while index < len(sample_snps) and sample_snps[index] < bp:
- sample_snps[index] += offset
- index += 1
- if op == 'D':
- offset += dist
-
-
-def count_sample_snps(path, file, sample_snps):
- matches = {}
- cigars = {}
- #
- # Open the matches file and load the matches to the catalog.
- #
- p = path + file + ".matches.tsv.gz"
- if os.path.exists(p):
- gzipped = True;
- fh = gzip.open(p, 'rb')
- else:
- gzipped = False;
- fh = open(path + file + ".matches.tsv", "r")
-
- for line in fh:
- line = line.strip("\n")
-
- if len(line) == 0 or line[0] == "#":
- continue
-
- parts = line.split("\t")
-
- cat_locus = int(parts[2])
- sample_locus = int(parts[4])
- if len(parts) == 9:
- cigar = parts[8] if len(parts[8]) > 0 else ""
- else:
- cigar = ""
-
- if sample_locus not in matches:
- matches[sample_locus] = cat_locus
- else:
- if cat_locus != matches[sample_locus]:
- print >> sys.stderr, "Error: sample locus", sample_locus, "matches more than one catalog locus."
-
- if len(cigar) > 0:
- if sample_locus not in cigars:
- cigars[sample_locus] = cigar
- else:
- if cigar != cigars[sample_locus]:
- print >> sys.stderr, "Error: sample locus", sample_locus, "has multiple cigar alignments."
-
- fh.close()
-
- #
- # Open the SNPs file and record all the SNP positions found in this sample.
- #
- if gzipped:
- fh = gzip.open(path + file + ".snps.tsv.gz", "rb")
- else:
- fh = open(path + file + ".snps.tsv", "r")
-
- snps = {}
-
- for line in fh:
- if line[0] == "#":
- continue
-
- if len(line) == 0:
- continue
-
- parts = line.split("\t")
- sample_locus = int(parts[2])
- col = int(parts[3])
- model = parts[4]
-
- if model != "E":
- continue
-
- if sample_locus not in matches:
- continue
-
- if sample_locus not in snps:
- snps[sample_locus] = []
- snps[sample_locus].append(col)
-
- fh.close()
-
- #
- # Adjust SNP positions according to the gapped alignments recorded in the CIGAR string.
- #
- for sample_locus in snps:
- if sample_locus in cigars:
- adjust_snps(cigars[sample_locus], snps[sample_locus])
-
- snp_cnt = 0
-
- #
- # Transfer this sample's SNPs to the catalog level dictionary.
- #
- for sample_locus in snps:
- cat_locus = matches[sample_locus]
-
- for col in snps[sample_locus]:
- if cat_locus in sample_snps:
- if col in sample_snps[cat_locus]:
- sample_snps[cat_locus][col] += 1
- # print >> sys.stderr, "CatLocus:", cat_locus, "; col:", col
- else:
- sample_snps[cat_locus][col] = 1
- snp_cnt += 1
- else:
- sample_snps[cat_locus] = {}
- sample_snps[cat_locus][col] = 1
- snp_cnt += 1
-
- return snp_cnt
-
-
-def count_catalog_snps(path, batch_id, catalog_snps):
- #
- # Open the tags file and rewrite it with the alignment coordinates.
- #
- p = path + "batch_" + str(batch_id) + ".catalog.snps.tsv.gz"
- if os.path.exists(p):
- gzipped = True;
- fh = gzip.open(path + "batch_" + str(batch_id) + ".catalog.snps.tsv.gz", "rb")
- else:
- gzipped = False;
- fh = open(path + "batch_" + str(batch_id) + ".catalog.snps.tsv", "r")
-
- snp_cnt = 0
-
- for line in fh:
- if line[0] == "#":
- continue
-
- if len(line) == 0:
- continue
-
- parts = line.split("\t")
- cat_locus = int(parts[2])
- col = int(parts[3])
- model = parts[4]
-
- if model != "E":
- continue
-
- snp_cnt += 1
-
- if cat_locus not in catalog_snps:
- catalog_snps[cat_locus] = []
-
- catalog_snps[cat_locus].append(col)
-
- fh.close()
-
- return snp_cnt
-
-# # # #
-# # ------------------------------------------------------------------------------------------- # #
-# # # #
-
-parse_command_line()
-
-files = []
-catalog_snps = {}
-sample_snps = {}
-
-#
-# Find all the sample files by looking for the matches files in the Stacks directory.
-#
-find_stacks_files(path, files)
-
-#
-# Count all the SNPs identified in the samples.
-#
-i = 1
-for file in files:
- print >> sys.stderr, "Processing file", str(i), "of", len(files), "['" + file + "']"
- cnt = count_sample_snps(path, file, sample_snps)
- print >> sys.stderr, " Found", cnt, "heterozygous SNPs in sample."
- i += 1
-
-total_snps = 0
-for locus in sample_snps:
- for col in sample_snps[locus]:
- total_snps += 1
-print >> sys.stderr, "Found", total_snps, "variable sites across the population."
-
-#
-# Count all the SNPs found in the catalog.
-#
-print >> sys.stderr, "Processing the catalog"
-cnt = count_catalog_snps(path, batch_id, catalog_snps)
-print >> sys.stderr, " Found", cnt, "heterozygous SNPs in the catalog."
-
-#
-# Count all the SNPs in the catalog but not in any sample: these are the fixed differences cstacks identified.
-#
-total_snps = 0
-fixed_snps = 0
-
-for locus in catalog_snps:
- if locus not in sample_snps:
- continue
- c = {}
- for col in catalog_snps[locus]:
- c[col] = 1
- total_snps += 1
- if col not in sample_snps[locus]:
- # print >> sys.stderr, "Locus:", locus, "Catalog SNPs:", catalog_snps[locus], "Sample SNPs:", sample_snps[locus]
- fixed_snps += 1
- for col in sample_snps[locus]:
- if col not in c:
- print "Locus:", locus, "col:", col, "Catalog SNPs:", catalog_snps[locus], "Sample SNPs:", sample_snps[locus]
-
-print >> sys.stderr, "Found", total_snps, "SNPs across all samples and in the catalog."
-print >> sys.stderr, "Found", fixed_snps, "fixed SNPs only in the catalog."
=====================================
scripts/denovo_map.pl
=====================================
@@ -184,11 +184,14 @@ sub execute_stacks {
check_return_value($?, $log_fh, $cmd);
#
- # Pull the depth of coverage from ustacks.
+ # Pull the depth of coverage and number of loci from ustacks.
#
my $depthline = (grep(/^Final coverage/, @results))[0];
my ($depth, $stdev, $max, $nreads, $pctreads) = ($depthline =~ /mean=([\d.]+); stdev=([\d.]+); max=(\d+); n_reads=(\d+)\(([\d.]+)%\)/);
- push(@depths_of_cov, [$sample->{'file'}, $depth, $stdev, $max, $nreads, $pctreads]);
+
+ $depthline = (grep(/^Final number of stacks built/, @results))[0];
+ my ($locicnt) = ($depthline =~ /Final number of stacks built: ([\d.]+)/);
+ push(@depths_of_cov, [$sample->{'file'}, $depth, $stdev, $max, $nreads, $pctreads, $locicnt]);
}
$i++;
@@ -733,11 +736,11 @@ sub write_depths_of_cov {
print $log_fh
"\nBEGIN cov_per_sample\n",
"# Depths of Coverage for Processed Samples\n",
- "sample\tdepth of cov\tmax cov\tnumber reads incorporated\t% reads incorporated\n";
+ "sample\tloci assembled\tdepth of cov\tmax cov\tnumber reads incorporated\t% reads incorporated\n";
foreach $a (@{$depths}) {
- print STDERR $a->[0], "; depth: ", $a->[1], "x; max: ", $a->[3], "x; number of reads: ", $a->[4], " (", $a->[5], "%)\n";
- print $log_fh $a->[0], "\t", $a->[1], "\t", $a->[3], "\t", $a->[4], "\t", $a->[5], "\n";
+ print STDERR $a->[0], "; loci assembled: ", $a->[6], "; depth: ", $a->[1], "x; max: ", $a->[3], "x; number of reads: ", $a->[4], " (", $a->[5], "%)\n";
+ print $log_fh $a->[0], "\t", $a->[6], "\t", $a->[1], "\t", $a->[3], "\t", $a->[4], "\t", $a->[5], "\n";
}
print $log_fh
@@ -891,7 +894,7 @@ denovo_map.pl --samples dir --popmap path --out-path dir [--paired [--rm-pcr-dup
Stack assembly options:
-M: number of mismatches allowed between stacks within individuals (for ustacks).
- -n: number of mismatches allowed between stacks between individuals (for cstacks; default 1; suggested: set to ustacks -M).
+ -n: number of mismatches allowed between stacks between individuals (for cstacks; default: set to ustacks -M).
SNP model options:
--var-alpha: significance level at which to call variant sites (for gstacks; default: 0.05).
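The per-sample locus count that denovo_map.pl now reports is scraped from the
ustacks log output with the regex above; a rough shell equivalent (the log
file name is an assumption):

    # Mirror the Perl capture of "Final number of stacks built: N"
    grep '^Final number of stacks built' ustacks.log \
        | sed -E 's/^Final number of stacks built: ([0-9]+).*/\1/'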
=====================================
scripts/integrate_alignments.py deleted
=====================================
@@ -1,427 +0,0 @@
-#!/usr/bin/env python
-
-import optparse
-import sys
-import os
-import gzip
-import re
-
-#
-# Global configuration variables.
-#
-path = ""
-aln_path = ""
-out_path = ""
-batch_id = -1
-
-def parse_command_line():
- global out_path
- global aln_path
- global path
- global batch_id
-
- p = optparse.OptionParser()
-
- #
- # Add options.
- #
- p.add_option("-o", action="store", dest="out_path",
- help="write modified Stacks files to this output path.")
- p.add_option("-a", action="store", dest="aln_path",
- help="SAM file containing catalog locus alignments.")
- p.add_option("-p", action="store", dest="path",
- help="path to Stacks directory.")
- p.add_option("-b", action="store", dest="batch_id",
- help="Stacks batch ID.")
-
- #
- # Parse the command line
- #
- (opts, args) = p.parse_args()
-
- if opts.out_path != None:
- out_path = opts.out_path
- if opts.aln_path != None:
- aln_path = opts.aln_path
- if opts.path != None:
- path = opts.path
- if opts.batch_id != None:
- batch_id = int(opts.batch_id)
-
- if len(out_path) == 0 or os.path.exists(out_path) == False:
- print >> sys.stderr, "You must specify a valid path to write files to."
- p.print_help()
- sys.exit()
-
- if out_path.endswith("/") == False:
- out_path += "/"
-
- if len(aln_path) == 0 or os.path.exists(aln_path) == False:
- print >> sys.stderr, "You must specify a valid path to a SAM file containing catalog locus alignments."
- p.print_help()
- sys.exit()
-
- if len(path) == 0 or os.path.exists(path) == False:
- print >> sys.stderr, "You must specify a valid path to Stacks input files."
- p.print_help()
- sys.exit()
-
- if batch_id < 0:
- pritn >> sys.stderr, "You must specify the batch ID that was supplied to Stacks."
- p.print_help()
- sys.exit()
-
- if path.endswith("/") == False:
- path += "/"
-
-
-def find_stacks_files(path, files):
- try:
- entries = os.listdir(path)
-
- for entry in entries:
- pos = entry.find(".matches.tsv.gz")
- if (pos == -1):
- pos = entry.find(".matches.tsv")
- if (pos != -1):
- files.append(entry[0:pos])
- print >> sys.stderr, "Found", len(files), "Stacks samples."
-
- except:
- print >> sys.stderr, "Unable to read files from Stacks directory, '", path, "'"
-
-
-def parse_catalog_alignments(aln_path, alns):
- fh = open(aln_path, "r")
-
- for line in fh:
- line = line.strip("\n")
-
- if len(line) == 0 or line[0] == "#" or line[0] == "@":
- continue
-
- parts = line.split("\t")
- locus = int(parts[0])
- chr = parts[2]
- bp = int(parts[3])
- flag = int(parts[1])
-
- #
- # Check which strand the read is aligned to.
- #
- if flag & 16 > 0:
- alns[locus] = (chr, bp, "-");
- else:
- alns[locus] = (chr, bp, "+");
-
- fh.close()
- print >> sys.stderr, "Loaded", len(alns), "catalog locus alignments from '", aln_path, "'."
-
-
-def convert_sample(path, file, out_path, alns):
- matches = {}
- #
- # Open the matches file and load the matches to the catalog.
- #
- p = path + file + ".matches.tsv.gz"
- if os.path.exists(p):
- gzipped = True;
- fh = gzip.open(p, 'rb')
- else:
- gzipped = False;
- fh = open(path + file + ".matches.tsv", "r")
-
- for line in fh:
- line = line.strip("\n")
-
- if len(line) == 0 or line[0] == "#":
- continue
-
- parts = line.split("\t")
-
- cat_locus = int(parts[2])
- sample_locus = int(parts[4])
- matches[sample_locus] = cat_locus
-
- fh.close()
-
- #
- # Open the tags file and rewrite it with the alignment coordinates.
- #
- if gzipped:
- fh = gzip.open(path + file + ".tags.tsv.gz", "rb")
- else:
- fh = open(path + file + ".tags.tsv", "r")
-
- out_fh = open(out_path + file + ".tags.tsv", "w")
-
- alns_written = {}
-
- for line in fh:
- if line[0] == "#":
- out_fh.write(line)
- continue
-
- if len(line) == 0:
- continue
-
- parts = line.split("\t")
- sample_locus = int(parts[2])
- read_type = parts[6]
-
- if read_type == "consensus":
- if sample_locus not in matches:
- continue;
- cat_locus = matches[sample_locus]
- if cat_locus not in alns:
- continue;
-
- (chr, bp, strand) = alns[cat_locus]
-
- if sample_locus in alns_written:
- alns_written[sample_locus] += 1
- else:
- alns_written[sample_locus] = 1;
-
- buf = "\t".join(parts[0:3]) + "\t" + chr + "\t" + str(bp) + "\t" + strand + "\t" + "\t".join(parts[6:])
- out_fh.write(buf)
-
- elif sample_locus in alns_written:
- out_fh.write(line)
-
- fh.close()
- out_fh.close()
-
- #
- # Open the SNPs, Alleles, and Matches files and rewrite those that had alignment coordinates.
- #
- if gzipped:
- fh = gzip.open(path + file + ".snps.tsv.gz", "rb")
- else:
- fh = open(path + file + ".snps.tsv", "r")
-
- out_fh = open(out_path + file + ".snps.tsv", "w")
-
- for line in fh:
- if line[0] == "#":
- out_fh.write(line)
- continue
-
- if len(line) == 0:
- continue
-
- parts = line.split("\t")
- sample_locus = int(parts[2])
-
- if sample_locus in alns_written:
- out_fh.write(line)
-
- fh.close()
- out_fh.close()
-
- if gzipped:
- fh = gzip.open(path + file + ".alleles.tsv.gz", "rb")
- else:
- fh = open(path + file + ".alleles.tsv", "r")
-
- out_fh = open(out_path + file + ".alleles.tsv", "w")
-
- for line in fh:
- if line[0] == "#":
- out_fh.write(line)
- continue
-
- if len(line) == 0:
- continue
-
- parts = line.split("\t")
- sample_locus = int(parts[2])
-
- if sample_locus in alns_written:
- out_fh.write(line)
-
- fh.close()
- out_fh.close()
-
- if gzipped:
- fh = gzip.open(path + file + ".matches.tsv.gz", "rb")
- else:
- fh = open(path + file + ".matches.tsv", "r")
-
- out_fh = open(out_path + file + ".matches.tsv", "w")
-
- for line in fh:
- if line[0] == "#":
- out_fh.write(line)
- continue
-
- if len(line) == 0:
- continue
-
- parts = line.split("\t")
- sample_locus = int(parts[2])
-
- if sample_locus in alns_written:
- out_fh.write(line)
-
- fh.close()
- out_fh.close()
-
- #
- # If it exists, open the model file and rewrite it with the alignment coordinates.
- #
- if gzipped:
- if os.path.exists(path + file + ".models.tsv.gz") == False:
- return len(alns_written)
- elif os.path.exists(path + file + ".models.tsv") == False:
- return len(alns_written)
-
- if gzipped:
- fh = gzip.open(path + file + ".models.tsv.gz", "rb")
- else:
- fh = open(path + file + ".models.tsv", "r")
-
- out_fh = open(out_path + file + ".models.tsv", "w")
-
- for line in fh:
- if line[0] == "#":
- out_fh.write(line)
- continue
-
- if len(line) == 0:
- continue
-
- parts = line.split("\t")
- sample_locus = int(parts[2])
- read_type = parts[6]
-
- if sample_locus in alns_written:
- if read_type == "consensus":
- cat_locus = matches[sample_locus]
- (chr, bp, strand) = alns[cat_locus]
- buf = "\t".join(parts[0:3]) + "\t" + chr + "\t" + str(bp) + "\t" + strand + "\t" + "\t".join(parts[6:])
- out_fh.write(buf)
- else:
- out_fh.write(line)
-
- fh.close()
- out_fh.close()
-
- return len(alns_written)
-
-
-def convert_catalog(path, batch_id, out_path, alns):
- #
- # Open the tags file and rewrite it with the alignment coordinates.
- #
- p = path + "batch_" + str(batch_id) + ".catalog.tags.tsv.gz"
- if os.path.exists(p):
- gzipped = True;
- fh = gzip.open(path + "batch_" + str(batch_id) + ".catalog.tags.tsv.gz", "rb")
- else:
- gzipped = False;
- fh = open(path + "batch_" + str(batch_id) + ".catalog.tags.tsv", "r")
-
- out_fh = open(out_path + "batch_" + str(batch_id) + ".catalog.tags.tsv", "w")
-
- alns_written = {}
-
- for line in fh:
- if line[0] == "#":
- out_fh.write(line)
- continue
-
- if len(line) == 0:
- continue
-
- parts = line.split("\t")
- cat_locus = int(parts[2])
-
- if cat_locus not in alns:
- continue;
-
- (chr, bp, strand) = alns[cat_locus]
-
- if cat_locus in alns_written:
- alns_written[cat_locus] += 1
- else:
- alns_written[cat_locus] = 1;
-
- buf = "\t".join(parts[0:3]) + "\t" + chr + "\t" + str(bp) + "\t" + strand + "\t" + "\t".join(parts[6:])
- out_fh.write(buf)
-
- fh.close()
- out_fh.close()
-
- if gzipped:
- fh = gzip.open(path + "batch_" + str(batch_id) + ".catalog.snps.tsv.gz", "rb")
- else:
- fh = open(path + "batch_" + str(batch_id) + ".catalog.snps.tsv", "r")
-
- out_fh = open(out_path + "batch_" + str(batch_id) + ".catalog.snps.tsv", "w")
-
- for line in fh:
- if line[0] == "#":
- out_fh.write(line)
- continue
-
- if len(line) == 0:
- continue
-
- parts = line.split("\t")
- cat_locus = int(parts[2])
-
- if cat_locus in alns_written:
- out_fh.write(line)
-
- fh.close()
- out_fh.close()
-
- if gzipped:
- fh = gzip.open(path + "batch_" + str(batch_id) + ".catalog.alleles.tsv.gz", "rb")
- else:
- fh = open(path + "batch_" + str(batch_id) + ".catalog.alleles.tsv", "r")
-
- out_fh = open(out_path + "batch_" + str(batch_id) + ".catalog.alleles.tsv", "w")
-
- for line in fh:
- if line[0] == "#":
- out_fh.write(line)
- continue
-
- if len(line) == 0:
- continue
-
- parts = line.split("\t")
- cat_locus = int(parts[2])
-
- if cat_locus in alns_written:
- out_fh.write(line)
-
- fh.close()
- out_fh.close()
-
- return len(alns_written)
-
-
-parse_command_line()
-
-files = []
-alns = {}
-
-find_stacks_files(path, files)
-parse_catalog_alignments(aln_path, alns)
-
-i = 1
-for file in files:
- print >> sys.stderr, "Processing file", str(i), "of", len(files), "['" + file + "']"
- cnt = convert_sample(path, file, out_path, alns)
- print >> sys.stderr, " Added alignments for", cnt, "loci."
- i += 1
-
-#
-# Now process the catalog.
-#
-print >> sys.stderr, "Processing the catalog"
-cnt = convert_catalog(path, batch_id, out_path, alns)
-print >> sys.stderr, " Added alignments for", cnt, "catalog loci."
=====================================
scripts/stacks-dist-extract
=====================================
@@ -3,27 +3,40 @@
usage="\
Usage:
$(basename "$0") DIST_FILE
- $(basename "$0") DIST_FILE SECTION_NAME
+ $(basename "$0") [--pretty] DIST_FILE SECTION_NAME
(1) Prints available sections/distributions for DIST_FILE.
(2) Extracts section SECTION_NAME from DIST_FILE.
"
-if { [[ $# -ne 1 ]] && [[ $# -ne 2 ]]; } || [[ "$1" =~ ^(-h|--help|--version)$ ]] ;then
+if { [[ $# -ne 1 ]] && [[ $# -ne 2 ]] && [[ $# -ne 3 ]]; } || [[ "$1" =~ ^(-h|--help|--version)$ ]] ;then
echo -n "$usage" >&2
exit 1
fi
-dist_f="$1"
+if [[ "$1" =~ ^--pretty$ ]]; then
+ dist_f="$2"
+ sect="$3"
+else
+ dist_f="$1"
+ sect="$2"
+fi
+
[[ -e "$dist_f" ]] || { ls -- "$dist_f"; exit 1; }
-if [[ "$2" ]] ;then
- section="$(echo "$2" | grep -oE '\w+' | tr -d '\n')"
+if [[ $# -ge 2 ]] ;then
+ section="$(echo "$sect" | grep -oE '\w+' | tr -d '\n')"
if ! grep --quiet "^BEGIN $section\\b" "$dist_f" ;then
echo "Error: Couldn't find section '$section' in '$dist_f'." >&1
exit 1
fi
- sed -nE "/^END $section($| )/ q; /^BEGIN $section($| )/,$ p" "$dist_f" | sed 1d
+
+ if [[ "$1" =~ ^--pretty$ ]]; then
+ sed -nE "/^END $section($| )/ q; /^BEGIN $section($| )/,$ p" "$dist_f" | sed 1d | sed -nE "/^#/p"
+ sed -nE "/^END $section($| )/ q; /^BEGIN $section($| )/,$ p" "$dist_f" | sed 1d | grep -v "^#" | column -s " " -t
+ else
+ sed -nE "/^END $section($| )/ q; /^BEGIN $section($| )/,$ p" "$dist_f" | sed 1d
+ fi
else
grep ^BEGIN "$dist_f" | cut -c7-
fi
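A usage sketch for the reworked script (the dist-file name is an assumption;
any section listed by invocation (1) below works):

    # (1) List the available sections of a distributions file:
    stacks-dist-extract populations.log.distribs
    # (2) Extract one section verbatim:
    stacks-dist-extract populations.log.distribs missing_samples_per_loc_prefilters
    # (3) New --pretty mode: print the section's comment lines first, then
    #     align the body into columns with column(1):
    stacks-dist-extract --pretty populations.log.distribs missing_samples_per_loc_prefilters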
=====================================
src/bootstrap.cc
=====================================
@@ -149,7 +149,8 @@ BSArchive::parse_archive()
cerr << "Parsing bootstrap archive file to obtain values for resampling...";
- char line[max_len];
+ char *line = (char *) malloc(sizeof(char) * max_len);
+ int size = max_len;
memset(line, '\0', max_len);
vector<string> parts;
@@ -159,7 +160,7 @@ BSArchive::parse_archive()
// line for each of those pops.
//
size_t lineno = 0;
- fh.getline(line, max_len);
+ read_line(fh, &line, &size);
lineno++;
int pop_cnt = is_integer(line);
if (pop_cnt < 1) {
@@ -168,7 +169,7 @@ BSArchive::parse_archive()
}
for (int i = 0; i < pop_cnt; i++) {
- fh.getline(line, max_len);
+ read_line(fh, &line, &size);
lineno++;
//
@@ -204,7 +205,7 @@ BSArchive::parse_archive()
//
// Parse the body of the file.
//
- while (fh.getline(line, max_len)) {
+ while (read_line(fh, &line, &size)) {
lineno++;
size_t len = strlen(line);
@@ -351,6 +352,9 @@ BSArchive::parse_archive()
<< this->_nuc_div_stats[this->_div_key[i][j]].size() << " sets of nucleotide-based divergence statistics read.\n";
}
cerr << "\n";
+
+ fh.close();
+ free(line);
return 0;
}
=====================================
src/clean.h
=====================================
@@ -131,7 +131,7 @@ public:
{
return (lhs.se == rhs.se && lhs.pe == rhs.pe);
}
- friend ofstream& operator<<(ofstream &out, const BarcodePair &bp)
+ friend ostream& operator<<(ostream &out, const BarcodePair &bp)
{
if (bp.pe.length() > 0)
out << bp.se << "-" << bp.pe;
@@ -286,6 +286,41 @@ public:
}
};
+class ReadComp {
+public:
+ Seq *s;
+ RawRead *r;
+ BarcodePair *bc;
+
+ ReadComp(Seq *s, RawRead *r, BarcodePair *bc) {
+ this->s = s;
+ this->r = r;
+ this->bc = bc;
+ }
+ void destroy() {
+ delete this->s;
+ delete this->r;
+ delete this->bc;
+ }
+};
+
+class ReadCompPair {
+public:
+ Seq *s_1;
+ Seq *s_2;
+ RawRead *r_1;
+ RawRead *r_2;
+ BarcodePair *bc;
+
+ ReadCompPair(Seq *s_1, Seq *s_2, RawRead *r_1, RawRead *r_2, BarcodePair *bc) {
+ this->s_1 = s_1;
+ this->s_2 = s_2;
+ this->r_1 = r_1;
+ this->r_2 = r_2;
+ this->bc = bc;
+ }
+};
+
int parse_illumina_v1(const char *);
int parse_illumina_v2(const char *);
int parse_input_record(Seq *, RawRead *);
=====================================
src/export_formats.cc
=====================================
@@ -1,6 +1,6 @@
// -*-mode:c++; c-style:k&r; c-basic-offset:4;-*-
//
-// Copyright 2012-2019, Julian Catchen <jcatchen at illinois.edu>
+// Copyright 2012-2022, Julian Catchen <jcatchen at illinois.edu>
//
// This file is part of Stacks.
//
@@ -1207,7 +1207,7 @@ OrderableExport::write_batch(const vector<LocBin*> &loci)
//
vector<NucTally *> sites;
OLocTally<NucTally> ord;
- ord.order(sites, loci);
+ ord.order(sites, loci, true);
map<size_t, size_t> key;
map<size_t, size_t>::iterator key_it = key.begin();
@@ -1246,6 +1246,48 @@ OrderableExport::write_batch(const vector<LocBin*> &loci)
return 0;
}
+
+int
+OrderableAllExport::write_batch(const vector<LocBin*> &loci)
+{
+ if (ordered_export) {
+ //
+ // We need to order nucleotide positions (fixed or variable) to take into account overlapping loci.
+ //
+ vector<NucTally *> sites;
+ OLocTally<NucTally> ord;
+ ord.order(sites, loci, false);
+
+ map<size_t, size_t> key;
+ map<size_t, size_t>::iterator key_it = key.begin();
+
+ for (size_t k = 0; k < loci.size(); k++)
+ key_it = key.insert(key_it, pair<size_t, size_t>(loci[k]->cloc->id, k));
+
+ for (uint pos = 0; pos < sites.size(); pos++) {
+ int loc_id = sites[pos]->loc_id;
+
+ const LocBin *loc = loci[key[loc_id]];
+ size_t col = sites[pos]->col;
+
+ this->write_site(loc->cloc, loc->s, loc->d, col);
+ }
+
+ } else {
+ for (uint k = 0; k < loci.size(); k++) {
+ const LocBin *loc = loci[k];
+ const CSLocus *cloc = loc->cloc;
+
+ for (uint col = 0; col < cloc->len; col++) {
+
+ this->write_site(cloc, loc->s, loc->d, col);
+ }
+ }
+ }
+
+ return 0;
+}
+
int
GenePopExport::open(const MetaPopInfo *mpopi)
{
@@ -2728,6 +2770,151 @@ VcfHapsExport::write_batch(const vector<LocBin*>& loci)
return 0;
}
+int
+VcfAllSitesExport::open(const MetaPopInfo *mpopi)
+{
+ this->_path = out_path + out_prefix + ".all.vcf";
+ cout << "All fixed and variable sites will be written in VCF format to '" << this->_path << "'\n";
+
+ this->_mpopi = mpopi;
+
+ VcfHeader header;
+ //
+ // Check for the existence of a contigs definition file, include them in the VCF header if found.
+ //
+ const vector<Contig> &contigs = this->_mpopi->contigs();
+ if (contigs.size() > 0)
+ header.add_std_meta_with_contigs(contigs);
+ else
+ header.add_std_meta();
+
+ header.add_meta(VcfMeta::predefs::info_loc_strand);
+ for(size_t s : this->_mpopi->sample_indexes_orig_order())
+ header.add_sample(this->_mpopi->samples()[s].name);
+
+ this->_writer = new VcfWriter(this->_path, move(header));
+
+ return 0;
+}
+
+int
+VcfAllSitesExport::write_site(const CSLocus* cloc,
+ const LocPopSum* psum,
+ Datum const*const* d,
+ size_t col)
+{
+ const LocTally* t = psum->meta_pop();
+
+ // Check that a call was made for this site before proceeding with output.
+ if (!Nt4(t->nucs[col].p_allele).is_acgt())
+ return 0;
+
+ VcfRecord rec;
+ string id = to_string(cloc->id);
+ id += ":";
+ id += to_string(col);
+ if (loci_ordered) {
+ rec.append_chrom(string(cloc->loc.chr()));
+ rec.append_pos(cloc->sort_bp(col));
+ id += ':';
+ id += (cloc->loc.strand == strand_plus ? '+' : '-');
+ } else {
+ rec.append_chrom(to_string(cloc->id));
+ rec.append_pos(col);
+ }
+ rec.append_id(id);
+
+ if (t->nucs[col].fixed == true) {
+ const char ref = t->nucs[col].p_allele;
+
+ rec.append_allele(Nt2(ref));
+ rec.append_qual();
+ rec.append_filters("PASS");
+ rec.append_info(string("NS=") + to_string(t->nucs[col].num_indv));
+ rec.append_info(string("AF=") + "1.0");
+ rec.append_format("GT");
+
+ for (size_t s : _mpopi->sample_indexes_orig_order()) {
+ stringstream sample;
+
+ if (d[s] == NULL || col >= uint(d[s]->len) || d[s]->model[col] == 'U') {
+ // Data does not exist.
+ sample << "./.";
+ } else {
+ sample << "0/0";
+ }
+
+ rec.append_sample(sample.str());
+ }
+
+ } else {
+ // What is our index into the haplotype for this column?
+ int index = -1;
+ for (size_t n = 0; n < cloc->snps.size(); n++)
+ if (cloc->snps[n]->col == col) {
+ index = (int) n;
+ break;
+ }
+ assert(index >= 0);
+
+ const char ref = t->nucs[col].p_allele;
+ const char alt = t->nucs[col].q_allele;
+ char freq_alt[32];
+ sprintf(freq_alt, "%0.3f", 1 - t->nucs[col].p_freq);
+
+ rec.append_allele(Nt2(cloc->loc.strand == strand_plus ? ref : reverse(ref)));
+ rec.append_allele(Nt2(cloc->loc.strand == strand_plus ? alt : reverse(alt)));
+ rec.append_qual();
+ rec.append_filters("PASS");
+ rec.append_info(string("NS=") + to_string(t->nucs[col].num_indv));
+ rec.append_info(string("AF=") + freq_alt);
+ rec.append_format("GT");
+ rec.append_format("DP");
+ rec.append_format("AD");
+ rec.append_format("GQ");
+ rec.append_format("GL");
+
+ const vector<Nt2> alleles {Nt2(ref), Nt2(alt)};
+ for (size_t s : _mpopi->sample_indexes_orig_order()) {
+ stringstream sample;
+
+ if (d[s] == NULL || col >= uint(d[s]->len) || d[s]->model[col] == 'U') {
+ // Data does not exist.
+ sample << "./.";
+ } else {
+ if (d[s]->model[col] == 'O') {
+ assert(d[s]->obshap.size() == 1
+ || (d[s]->obshap.size() == 2 && d[s]->obshap[0][index] == d[s]->obshap[1][index]));
+ sample << (d[s]->obshap[0][index] == ref ? "0/0" : "1/1");
+ } else {
+ assert(d[s]->model[col] == 'E');
+ assert(d[s]->obshap.size() == 2 && d[s]->obshap[0][index] != d[s]->obshap[1][index]);
+ sample << "0/1";
+ }
+
+ // DP.
+ sample << ":" << d[s]->snpdata[index].tot_depth;
+ // AD.
+ if (d[s]->snpdata[index].nt_depths.sum() > 0)
+ sample << ":" << d[s]->snpdata[index].nt_depths[Nt2(ref)]
+ << "," << d[s]->snpdata[index].nt_depths[Nt2(alt)];
+ else
+ sample << ":.";
+ // GQ.
+ assert(d[s]->snpdata[index].gq != UINT8_MAX);
+ sample << ':' << int(d[s]->snpdata[index].gq);
+ // // GL.
+ sample << ':' << VcfRecord::util::fmt_gt_gl(alleles, d[s]->snpdata[index].gtliks);
+ }
+ rec.append_sample(sample.str());
+ }
+ }
+
+ this->_writer->write_record(rec);
+
+ return 0;
+}
+
int
PlinkExport::open(const MetaPopInfo *mpopi)
{
@@ -2804,7 +2991,7 @@ PlinkExport::write_batch(const vector<LocBin*> &loci)
//
vector<NucTally *> sites;
OLocTally<NucTally> ord;
- ord.order(sites, loci);
+ ord.order(sites, loci, true);
map<size_t, size_t> key;
map<size_t, size_t>::iterator key_it = key.begin();
=====================================
src/export_formats.h
=====================================
@@ -1,6 +1,6 @@
// -*-mode:c++; c-style:k&r; c-basic-offset:4;-*-
//
-// Copyright 2012-2021, Julian Catchen <jcatchen at illinois.edu>
+// Copyright 2012-2022, Julian Catchen <jcatchen at illinois.edu>
//
// This file is part of Stacks.
//
@@ -60,6 +60,9 @@ class Export {
static int transpose(ifstream &ifh, vector<string> &transposed);
};
+//
+// For exporting ordered, variable sites.
+//
class OrderableExport : public Export {
public:
OrderableExport() {}
@@ -70,6 +73,19 @@ class OrderableExport : public Export {
virtual int write_site(const CSLocus* cloc, const LocPopSum* psum, Datum const*const* datums, size_t col, size_t index) = 0;
};
+//
+// For exporting ordered, variable AND fixed sites.
+//
+class OrderableAllExport : public Export {
+ public:
+ OrderableAllExport() {}
+ virtual ~OrderableAllExport() {}
+ int write_batch(const vector<LocBin*>& loci);
+
+ protected:
+ virtual int write_site(const CSLocus* cloc, const LocPopSum* psum, Datum const*const* datums, size_t col) = 0;
+};
+
class GenPos {
public:
uint id;
@@ -451,6 +467,21 @@ class VcfHapsExport: public Export {
int write_header() { return 0; }
};
+class VcfAllSitesExport: public OrderableAllExport {
+ const MetaPopInfo*_mpopi;
+ VcfWriter* _writer;
+
+ public:
+ VcfAllSitesExport() : _mpopi(NULL), _writer(NULL) {}
+ ~VcfAllSitesExport() { delete this->_writer; }
+ int open(const MetaPopInfo *mpopi);
+ int write_header() { return 0; }
+
+private:
+ int write_site(const CSLocus* cloc, const LocPopSum* psum, Datum const*const* datums, size_t col);
+
+};
+
class TreemixExport: public OrderableExport {
const MetaPopInfo*_mpopi;
ofstream _writer;
=====================================
src/ordered.h
=====================================
@@ -1,6 +1,6 @@
// -*-mode:c++; c-style:k&r; c-basic-offset:4;-*-
//
-// Copyright 2014-2021, Julian Catchen <jcatchen at illinois.edu>
+// Copyright 2014-2022, Julian Catchen <jcatchen at illinois.edu>
//
// This file is part of Stacks.
//
@@ -48,7 +48,7 @@ public:
Ordered(): incompatible_sites(0), total_sites(0), uniq_sites(0), multiple_sites(0) {}
virtual ~Ordered() {}
- int init_sites(vector<StatT *> &, map<uint, uint> &, const vector<LocBin *> &);
+ int init_sites(vector<StatT *> &, map<uint, uint> &, const vector<LocBin *> &, bool);
int init_sites(vector<StatT *> &, map<uint, uint> &, const vector<LocBin *> &, uint);
int init_sites(vector<StatT *> &, map<uint, uint> &, const vector<LocBin *> &, uint, uint);
int init_haplotypes(vector<StatT *> &, map<uint, uint> &, const vector<LocBin *> &);
@@ -56,7 +56,7 @@ public:
template<class StatT>
int
-Ordered<StatT>::init_sites(vector<StatT *> &sites, map<uint, uint> &sites_key, const vector<LocBin *> &sorted_loci)
+Ordered<StatT>::init_sites(vector<StatT *> &sites, map<uint, uint> &sites_key, const vector<LocBin *> &sorted_loci, bool het_sites_only)
{
const CSLocus *loc;
const LocTally *ltally;
@@ -71,8 +71,12 @@ Ordered<StatT>::init_sites(vector<StatT *> &sites, map<uint, uint> &sites_key, c
ltally = sorted_loci[pos]->s->meta_pop();
for (uint k = 0; k < loc->len; k++) {
- if (ltally->nucs[k].allele_cnt == 2)
- bps.push_back(ltally->nucs[k].bp);
+ if (het_sites_only) {
+ if (ltally->nucs[k].allele_cnt == 2)
+ bps.push_back(ltally->nucs[k].bp);
+ } else {
+ bps.push_back(ltally->nucs[k].bp);
+ }
}
}
@@ -390,12 +394,12 @@ public:
this->log_fh = NULL;
}
- int order(vector<StatT *> &, const vector<LocBin *> &);
+ int order(vector<StatT *> &, const vector<LocBin *> &, bool);
};
template<class StatT>
int
-OLocTally<StatT>::order(vector<StatT *> &sites, const vector<LocBin *> &sorted_loci)
+OLocTally<StatT>::order(vector<StatT *> &sites, const vector<LocBin *> &sorted_loci, bool het_sites_only)
{
this->incompatible_sites = 0;
this->total_sites = 0;
@@ -403,7 +407,7 @@ OLocTally<StatT>::order(vector<StatT *> &sites, const vector<LocBin *> &sorted_l
map<uint, uint> sites_key;
- this->init_sites(sites, sites_key, sorted_loci);
+ this->init_sites(sites, sites_key, sorted_loci, het_sites_only);
this->uniq_sites = sites_key.size();
@@ -419,7 +423,9 @@ OLocTally<StatT>::order(vector<StatT *> &sites, const vector<LocBin *> &sorted_l
ltally = sorted_loci[pos]->s->meta_pop();
for (uint k = 0; k < loc->len; k++) {
- if (ltally->nucs[k].allele_cnt != 2) continue;
+ if (het_sites_only &&
+ ltally->nucs[k].allele_cnt != 2)
+ continue;
assert(sites_key.count(ltally->nucs[k].bp) > 0);
this->total_sites++;
=====================================
src/populations.cc
=====================================
@@ -1,6 +1,6 @@
// -*-mode:c++; c-style:k&r; c-basic-offset:4;-*-
//
-// Copyright 2012-2021, Julian Catchen <jcatchen at illinois.edu>
+// Copyright 2012-2022, Julian Catchen <jcatchen at illinois.edu>
//
// This file is part of Stacks.
//
@@ -131,26 +131,6 @@ try {
//
parse_command_line(argc, argv);
- // NucTally *n = new NucTally();
- // n->optimize();
- // delete n;
- //exit(1);
-
- // SumStat *s = new SumStat();
- // s->optimize();
- // delete s;
- // exit(1);
-
- // LocSum *ls = new LocSum(144);
- // ls->optimize(144);
- // delete ls;
- // exit(1);
-
- // PopPair *p = new PopPair();
- // p->optimize();
- // delete p;
- // exit(1);
-
//
// Open and initialize the log file.
//
@@ -518,6 +498,11 @@ BatchLocusProcessor::init(string in_path, string pmap_path)
else
this->init_stacks_loci(in_path, pmap_path);
+ //
+ // Initialize the CatalogDists object to record the number of present/absent sample/locus/snp counts.
+ //
+ this->_dists.init(this->_mpopi);
+
//
// Check that the number of samples does not exceed the maximum we can handle
// without overflowing our internal data structures.
@@ -642,8 +627,8 @@ BatchLocusProcessor::init_stacks_loci(string in_path, string pmap_path)
// Create the population map or check that all samples have data.
if (pmap_path.empty()) {
- cout << "No population map specified, using all samples...\n";
this->_mpopi->init_names(this->cloc_reader().header().samples());
+
} else {
size_t n_samples_before = this->_mpopi->n_samples();
this->_mpopi->intersect_with(this->cloc_reader().header().samples());
@@ -744,7 +729,7 @@ BatchLocusProcessor::next_batch_stacks_loci(ostream &log_fh)
//
// Apply locus & SNP filters.
//
- this->_dists.accumulate_pre_filtering(loc->sample_cnt, loc->cloc);
+ this->_dists.accumulate_pre_filtering(loc);
this->_loc_filter.whitelist_snp_filter(*loc);
if (this->_loc_filter.apply_filters_stacks(*loc, log_fh, *this->_mpopi)) {
delete loc;
@@ -937,7 +922,7 @@ BatchLocusProcessor::next_batch_external_loci(ostream &log_fh)
//
// Apply filters.
//
- this->_dists.accumulate_pre_filtering(loc->sample_cnt, loc->cloc);
+ this->_dists.accumulate_pre_filtering(loc);
if (this->_loc_filter.apply_filters_external(*loc, log_fh, *this->_mpopi)) {
delete loc;
continue;
@@ -1623,27 +1608,93 @@ LocusFilter::filter_haps(LocBin& loc, const MetaPopInfo& mpopi, ostream &log_fh)
}
int
-CatalogDists::accumulate_pre_filtering(const size_t sample_cnt, const CSLocus *loc)
+CatalogDists::init(const MetaPopInfo *mpopi)
{
- size_t missing;
+ assert(mpopi != NULL);
+
+ this->_pop_cnt = mpopi->pops().size();
+ this->_sample_cnt = mpopi->n_samples();
+
+ assert(this->_pop_cnt > 0);
+ assert(this->_sample_cnt > 0);
+
+ this->_pop_order = new size_t [this->_pop_cnt];
+ this->_samples = new size_t [this->_sample_cnt];
+
+ this->_pre_loci_pres_per_sample = new size_t [this->_sample_cnt];
+ this->_pre_loci_miss_per_sample = new size_t [this->_sample_cnt];
+
+ this->_sites_pres_per_sample = new size_t [this->_sample_cnt];
+ this->_sites_miss_per_sample = new size_t [this->_sample_cnt];
+ this->_loci_pres_per_sample = new size_t [this->_sample_cnt];
+ this->_loci_miss_per_sample = new size_t [this->_sample_cnt];
+
+ memset(this->_pre_loci_pres_per_sample, 0, this->_sample_cnt * sizeof(size_t));
+ memset(this->_pre_loci_miss_per_sample, 0, this->_sample_cnt * sizeof(size_t));
+
+ memset(this->_sites_pres_per_sample, 0, this->_sample_cnt * sizeof(size_t));
+ memset(this->_sites_miss_per_sample, 0, this->_sample_cnt * sizeof(size_t));
+ memset(this->_loci_pres_per_sample, 0, this->_sample_cnt * sizeof(size_t));
+ memset(this->_loci_miss_per_sample, 0, this->_sample_cnt * sizeof(size_t));
+
+ this->_pre_total_loci = 0;
+ this->_total_loci = 0;
+ this->_total_sites = 0;
+
+ size_t pop_sthg = 0;
+
+ for (size_t i_pop = 0; i_pop < mpopi->pops().size(); ++i_pop) {
+ const Pop& pop = mpopi->pops()[i_pop];
+
+ for (uint i = pop.first_sample; i <= pop.last_sample; i++)
+ this->_samples[i] = pop_sthg;
+ this->_pop_order[pop_sthg] = i_pop;
- if (this->_pre_valid.count(loc->cnt) == 0)
- this->_pre_valid[loc->cnt] = 1;
+ pop_sthg++;
+ }
+ return 0;
+}
+
+int
+CatalogDists::accumulate_pre_filtering(const LocBin *locbin)
+{
+ Datum **d;
+ size_t missing;
+
+ const CSLocus *cloc = locbin->cloc;
+ const size_t sample_cnt = locbin->sample_cnt;
+
+ if (this->_pre_valid.count(cloc->cnt) == 0)
+ this->_pre_valid[cloc->cnt] = 1;
else
- this->_pre_valid[loc->cnt]++;
+ this->_pre_valid[cloc->cnt]++;
- missing = sample_cnt - loc->cnt;
+ missing = sample_cnt - cloc->cnt;
if (this->_pre_absent.count(missing) == 0)
this->_pre_absent[missing] = 1;
else
this->_pre_absent[missing]++;
- if (this->_pre_snps_per_loc.count(loc->snps.size()) == 0)
- this->_pre_snps_per_loc[loc->snps.size()] = 1;
+ if (this->_pre_snps_per_loc.count(cloc->snps.size()) == 0)
+ this->_pre_snps_per_loc[cloc->snps.size()] = 1;
else
- this->_pre_snps_per_loc[loc->snps.size()]++;
+ this->_pre_snps_per_loc[cloc->snps.size()]++;
+
+ //
+ // Record which individuals are missing loci and/or SNPs.
+ //
+ d = locbin->d;
+ for (uint j = 0; j < this->_sample_cnt; j++) {
+ if (d[j] == NULL) {
+ this->_pre_loci_miss_per_sample[j]++;
+ continue;
+ }
+ this->_pre_loci_pres_per_sample[j]++;
+ }
+ this->_pre_total_loci++;
+
return 0;
}
@@ -1735,7 +1786,7 @@ CatalogDists::write_results(ostream &log_fh, const MetaPopInfo *mpopi)
begin_section("missing_samples_per_loc_prefilters");
log_fh << "# Distribution of missing samples for each catalog locus prior to filtering.\n"
- << "# Absent samples at locus\tCount\n";
+ << "Absent samples at locus\tCount\n";
for (cnt_it = this->_pre_absent.begin(); cnt_it != this->_pre_absent.end(); cnt_it++)
log_fh << cnt_it->first << "\t" << cnt_it->second << "\n";
end_section();
@@ -1756,7 +1807,7 @@ CatalogDists::write_results(ostream &log_fh, const MetaPopInfo *mpopi)
begin_section("missing_samples_per_loc_postfilters");
log_fh << "# Distribution of missing samples for each catalog locus after filtering.\n"
- << "# Absent samples at locus\tCount\n";
+ << "Absent samples at locus\tCount\n";
for (cnt_it = this->_post_absent.begin(); cnt_it != this->_post_absent.end(); cnt_it++)
log_fh << cnt_it->first << "\t" << cnt_it->second << "\n";
end_section();
@@ -1769,7 +1820,19 @@ CatalogDists::write_results(ostream &log_fh, const MetaPopInfo *mpopi)
end_section();
const vector<Sample> &s = mpopi->samples();
-
+
+ begin_section("loci_per_sample_prefilters");
+ log_fh << "# Number of loci per individual sample (before filtering).\n"
+ << "sample\tn_loci\tpresent_loci\tmissing_loci\tfrequency_missing\n";
+ for (uint i = 0; i < this->_sample_cnt; i++)
+ log_fh << s[i].name << "\t"
+ << this->_pre_total_loci << "\t"
+ << this->_pre_loci_pres_per_sample[i] << "\t"
+ << this->_pre_loci_miss_per_sample[i] << "\t"
+ << fixed << setprecision(4)
+ << (double) this->_pre_loci_miss_per_sample[i] / (double) this->_pre_total_loci << "\n";
+ end_section();
+
begin_section("loci_per_sample");
log_fh << "# Number of loci per individual sample (after filtering).\n"
<< "sample\tn_loci\tpresent_loci\tmissing_loci\tfrequency_missing\n";
@@ -3574,6 +3637,7 @@ parse_command_line(int argc, char* argv[])
{"quiet", no_argument, NULL, 'q'},
{"verbose", no_argument, NULL, 'd'},
{"vcf", no_argument, NULL, 1004},
+ {"vcf-all", no_argument, NULL, 1009},
{"fasta-loci", no_argument, NULL, 1006},
{"fasta-samples", no_argument, NULL, 'J'},
{"fasta-samples-raw", no_argument, NULL, 'F'},
@@ -3829,6 +3893,9 @@ parse_command_line(int argc, char* argv[])
add_export<VcfExport>();
add_export<VcfHapsExport>();
break;
+ case 1009: // --vcf-all
+ add_export<VcfAllSitesExport>();
+ break;
case 1006: // --fasta-loci
add_export<FastaLociExport>();
break;
@@ -4135,11 +4202,10 @@ void help() {
<< " --fasta-loci: output locus consensus sequences in FASTA format.\n"
<< " --fasta-samples: output the sequences of the two haplotypes of each (diploid) sample, for each locus, in FASTA format.\n"
<< " --vcf: output SNPs and haplotypes in Variant Call Format (VCF).\n"
+ << " --vcf-all: output all sites in Variant Call Format (VCF).\n"
<< " --genepop: output SNPs and haplotypes in GenePop format.\n"
<< " --structure: output results in Structure format.\n"
<< " --radpainter: output results in fineRADstructure/RADpainter format.\n"
- // << " --phase*: output genotypes in PHASE format.\n"
- // << " --fastphase*: output genotypes in fastPHASE format.\n"
<< " --plink: output genotypes in PLINK format.\n"
<< " --hzar: output genotypes in Hybrid Zone Analysis using R (HZAR) format.\n"
<< " --phylip: output nucleotides that are fixed-within, and variant among populations in Phylip format for phylogenetic tree construction.\n"
=====================================
src/populations.h
=====================================
@@ -106,8 +106,10 @@ class CatalogDists {
map<size_t, size_t> _pre_snps_per_loc, _post_snps_per_loc;
size_t _pop_cnt, _sample_cnt;
+ size_t _pre_total_loci;
size_t _total_sites, _total_loci;
size_t *_pop_order, *_samples;
+ size_t *_pre_loci_pres_per_sample, *_pre_loci_miss_per_sample;
size_t *_sites_pres_per_sample, *_sites_miss_per_sample;
size_t *_loci_pres_per_sample, *_loci_miss_per_sample;
@@ -118,55 +120,27 @@ public:
this->_pop_order = NULL; // The array order of each population.
this->_samples = NULL; // Which population each sample belongs to.
+ this->_pre_loci_pres_per_sample = NULL; // Array of samples tallying # loci present
+ this->_pre_loci_miss_per_sample = NULL; // Array of samples tallying # loci missing
+
this->_sites_pres_per_sample = NULL; // Array of samples tallying # variant sites present
this->_sites_miss_per_sample = NULL; // Array of samples tallying # variant sites missing
this->_loci_pres_per_sample = NULL; // Array of samples tallying # loci present
this->_loci_miss_per_sample = NULL; // Array of samples tallying # loci missing
- this->_total_loci = 0;
- this->_total_sites = 0;
- }
- CatalogDists(const MetaPopInfo *mpopi) {
- assert(mpopi != NULL);
-
- this->_pop_cnt = mpopi->pops().size();
- this->_sample_cnt = mpopi->n_samples();
-
- assert(this->_pop_cnt > 0);
- assert(this->_sample_cnt > 0);
-
- this->_pop_order = new size_t [this->_pop_cnt];
- this->_samples = new size_t [this->_sample_cnt];
- this->_sites_pres_per_sample = new size_t [this->_sample_cnt];
- this->_sites_miss_per_sample = new size_t [this->_sample_cnt];
- this->_loci_pres_per_sample = new size_t [this->_sample_cnt];
- this->_loci_miss_per_sample = new size_t [this->_sample_cnt];
-
- memset(this->_sites_pres_per_sample, 0, this->_sample_cnt * sizeof(size_t));
- memset(this->_sites_miss_per_sample, 0, this->_sample_cnt * sizeof(size_t));
- memset(this->_loci_pres_per_sample, 0, this->_sample_cnt * sizeof(size_t));
- memset(this->_loci_miss_per_sample, 0, this->_sample_cnt * sizeof(size_t));
-
- this->_total_loci = 0;
- this->_total_sites = 0;
-
- size_t pop_sthg = 0;
-
- for (size_t i_pop = 0; i_pop < mpopi->pops().size(); ++i_pop) {
- const Pop& pop = mpopi->pops()[i_pop];
-
- for (uint i = pop.first_sample; i <= pop.last_sample; i++)
- this->_samples[i] = pop_sthg;
- this->_pop_order[pop_sthg] = i_pop;
-
- pop_sthg++;
- }
+ this->_pre_total_loci = 0;
+ this->_total_loci = 0;
+ this->_total_sites = 0;
}
~CatalogDists() {
if (this->_pop_order != NULL)
delete [] this->_pop_order;
if (this->_samples != NULL)
delete [] this->_samples;
+ if (this->_pre_loci_pres_per_sample != NULL)
+ delete [] this->_pre_loci_pres_per_sample;
+ if (this->_pre_loci_miss_per_sample != NULL)
+ delete [] this->_pre_loci_miss_per_sample;
if (this->_sites_pres_per_sample != NULL)
delete [] this->_sites_pres_per_sample;
if (this->_sites_miss_per_sample != NULL)
@@ -180,7 +154,8 @@ public:
CatalogDists(const CatalogDists &) = delete;
CatalogDists(CatalogDists &&) = delete;
- int accumulate_pre_filtering(const size_t, const CSLocus *);
+ int init(const MetaPopInfo *mpopi);
+ int accumulate_pre_filtering(const LocBin *);
int accumulate(const vector<LocBin *> &);
int write_results(ostream &log_fh, const MetaPopInfo *mpopi);
};
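
A minimal sketch of the call pattern this refactor implies, assuming the
two-phase construction suggested by the new init() declaration (the diff
shows the default-constructed _dists member in BatchLocusProcessor but
not the init() call site):

    CatalogDists dists;   // default-constructed: NULL arrays, zeroed tallies
    dists.init(mpopi);    // sizes the per-sample arrays from the MetaPopInfo
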
@@ -332,15 +307,15 @@ public:
BatchLocusProcessor():
_sig_hwe_dev(NULL),
_input_mode(InputMode::stacks2), _user_supplied_whitelist(false), _batch_size(0), _batch_num(0),
- _mpopi(NULL), _next_loc(NULL), _unordered_bp(0) {}
+ _mpopi(NULL), _next_loc(NULL), _dists(), _unordered_bp(0) {}
BatchLocusProcessor(InputMode mode, size_t batch_size, MetaPopInfo *popi):
_sig_hwe_dev(NULL),
_input_mode(mode), _user_supplied_whitelist(false), _batch_size(batch_size), _batch_num(0),
- _mpopi(popi), _next_loc(NULL), _dists(popi), _unordered_bp(0) {}
+ _mpopi(popi), _next_loc(NULL), _dists(), _unordered_bp(0) {}
BatchLocusProcessor(InputMode mode, size_t batch_size):
_sig_hwe_dev(NULL),
_input_mode(mode), _user_supplied_whitelist(false), _batch_size(batch_size), _batch_num(0),
- _mpopi(NULL), _next_loc(NULL), _unordered_bp(0) {}
+ _mpopi(NULL), _next_loc(NULL), _dists(), _unordered_bp(0) {}
~BatchLocusProcessor() {
for (uint i = 0; i < this->_loci.size(); i++)
delete this->_loci[i];
@@ -428,10 +403,8 @@ int tabulate_haplotypes(map<int, CSLocus *> &, PopMap<CSLocus> *);
int tabulate_locus_haplotypes(CSLocus *, Datum **, int);
int create_genotype_map(CSLocus *, Datum **, int);
int call_population_genotypes(CSLocus *, Datum **, int);
-int translate_genotypes(map<string, string> &, map<string, map<string, string> > &, map<int, CSLocus *> &, PopMap<CSLocus> *, map<int, string> &, set<int> &); // This function doesn't exist (March 24, 2016)
+
int correct_fst_bonferroni_win(vector<PopPair *> &);
-int bootstrap_fst_approximate_dist(vector<double> &, vector<int> &, double *, int *, map<int, vector<double> > &); // not used (March 23, 2016)
-int bootstrap_popstats_approximate_dist(vector<double> &, vector<double> &, vector<int> &, double *, int *, int, map<int, vector<double> > &, map<int, vector<double> > &); // not used (March 23, 2016)
double bootstrap_approximate_pval(int, double, map<int, vector<double> > &);
bool hap_compare(const pair<string,int>&, const pair<string,int>&);
=====================================
src/process_radtags.cc
=====================================
@@ -1,6 +1,6 @@
// -*-mode:c++; c-style:k&r; c-basic-offset:4;-*-
//
-// Copyright 2011-2021, Julian Catchen <jcatchen at illinois.edu>
+// Copyright 2011-2022, Julian Catchen <jcatchen at illinois.edu>
//
// This file is part of Stacks.
//
@@ -68,7 +68,12 @@ int barcode_dist_2 = -1;
double win_size = 0.15;
uint score_limit = 10;
uint len_limit = 0;
-uint num_threads = 1;
+
+uint num_threads = 1;
+int num_input_threads = 1;
+int num_output_threads = 0;
+int num_worker_threads = 0;
+int max_records_queued = 0;
//
// How to shift FASTQ-encoded quality scores from ASCII down to raw scores
@@ -159,44 +164,79 @@ int main (int argc, char* argv[]) {
else
open_files(files, barcodes, pair_1_fhs, pair_2_fhs, rem_1_fhs, rem_2_fhs, counters);
+ //
+ // Set the number of OpenMP parallel threads to execute -- minimum of 3 threads required.
+ //
+ #ifdef _OPENMP
+ if (num_threads >= 3) {
+ if (num_threads > max_threads) {
+ num_threads = max_threads;
+ cerr << "Warning: too many threads requested and performance will be negatively affected; reducing thread count to " << num_threads << ".\n\n";
+ }
+ omp_set_num_threads(num_threads);
+ allocate_threads();
+ } else if (num_threads > 1) {
+ cerr << "Warning: minimum number of threads required is 3; reverting to single-threaded mode.\n";
+ }
+ #endif
+
int result = 1;
for (uint i = 0; i < files.size(); i++) {
cerr << "Processing file " << i+1 << " of " << files.size() << " [" << files[i].first.c_str() << "]\n";
- counters[files[i].first]["total"] = 0;
- counters[files[i].first]["ill_filtered"] = 0;
- counters[files[i].first]["adapter"] = 0;
- counters[files[i].first]["low_quality"] = 0;
- counters[files[i].first]["noradtag"] = 0;
- counters[files[i].first]["ambiguous"] = 0;
- counters[files[i].first]["retained"] = 0;
- counters[files[i].first]["recovered"] = 0;
- counters[files[i].first]["transposed"] = 0;
+ initialize_counters(counters[files[i].first]);
+
+ if (num_threads >= 3) {
+ if (paired) {
+ if (out_file_type == FileT::gzfastq || out_file_type == FileT::gzfasta)
+ result = process_paired_reads_parallel(files[i].first, files[i].second,
+ se_bc, pe_bc,
+ pair_1_gzfhs, pair_2_gzfhs, rem_1_gzfhs, rem_2_gzfhs,
+ counters[files[i].first], barcode_log);
+ else
+ result = process_paired_reads_parallel(files[i].first, files[i].second,
+ se_bc, pe_bc,
+ pair_1_fhs, pair_2_fhs, rem_1_fhs, rem_2_fhs,
+ counters[files[i].first], barcode_log);
+ } else {
+ if (out_file_type == FileT::gzfastq || out_file_type == FileT::gzfasta)
+ result = process_reads_parallel(files[i].first,
+ se_bc, pe_bc,
+ pair_1_gzfhs,
+ counters[files[i].first], barcode_log);
+ else
+ result = process_reads_parallel(files[i].first,
+ se_bc, pe_bc,
+ pair_1_fhs,
+ counters[files[i].first], barcode_log);
+ }
- if (paired) {
- if (out_file_type == FileT::gzfastq || out_file_type == FileT::gzfasta)
+ } else {
+ if (paired) {
+ if (out_file_type == FileT::gzfastq || out_file_type == FileT::gzfasta)
result = process_paired_reads(files[i].first, files[i].second,
- se_bc, pe_bc,
- pair_1_gzfhs, pair_2_gzfhs, rem_1_gzfhs, rem_2_gzfhs,
- counters[files[i].first], barcode_log);
- else
- result = process_paired_reads(files[i].first, files[i].second,
- se_bc, pe_bc,
- pair_1_fhs, pair_2_fhs, rem_1_fhs, rem_2_fhs,
- counters[files[i].first], barcode_log);
- } else {
- if (out_file_type == FileT::gzfastq || out_file_type == FileT::gzfasta)
+ se_bc, pe_bc,
+ pair_1_gzfhs, pair_2_gzfhs, rem_1_gzfhs, rem_2_gzfhs,
+ counters[files[i].first], barcode_log);
+ else
+ result = process_paired_reads(files[i].first, files[i].second,
+ se_bc, pe_bc,
+ pair_1_fhs, pair_2_fhs, rem_1_fhs, rem_2_fhs,
+ counters[files[i].first], barcode_log);
+ } else {
+ if (out_file_type == FileT::gzfastq || out_file_type == FileT::gzfasta)
result = process_reads(files[i].first,
- se_bc, pe_bc,
- pair_1_gzfhs,
- counters[files[i].first], barcode_log);
- else
- result = process_reads(files[i].first,
- se_bc, pe_bc,
- pair_1_fhs,
- counters[files[i].first], barcode_log);
- }
+ se_bc, pe_bc,
+ pair_1_gzfhs,
+ counters[files[i].first], barcode_log);
+ else
+ result = process_reads(files[i].first,
+ se_bc, pe_bc,
+ pair_1_fhs,
+ counters[files[i].first], barcode_log);
+ }
+ }
cerr << " "
<< counters[files[i].first]["total"] << " total reads; ";
@@ -217,7 +257,7 @@ int main (int argc, char* argv[]) {
}
}
- cerr << "Closing files, flushing buffers...\n";
+ cerr << "Closing files, flushing buffers...";
if (out_file_type == FileT::gzfastq || out_file_type == FileT::gzfasta) {
close_file_handles(pair_1_gzfhs);
if (paired) {
@@ -233,13 +273,102 @@ int main (int argc, char* argv[]) {
close_file_handles(pair_2_fhs);
}
}
-
+ cerr << "done.\n";
+
print_results(argc, argv, barcodes, counters, barcode_log);
+ cerr << "process_radtags is done.\n";
+
return 0;
IF_NDEBUG_CATCH_ALL_EXCEPTIONS
}
+int
+allocate_threads()
+{
+ if (num_threads <= 4)
+ num_output_threads = 1;
+ else if (num_threads <= 6)
+ num_output_threads = 2;
+ else if (num_threads <= 8)
+ num_output_threads = 3;
+ else
+ num_output_threads = 4;
+
+ num_worker_threads = num_threads - num_input_threads - num_output_threads;
+
+ max_records_queued = 1000 * num_worker_threads;
+
+ cerr << "Setting number of threads to " << num_threads << "; "
+ << num_input_threads << " reader thread; "
+ << num_output_threads << " writer threads; "
+ << num_worker_threads << " worker threads.\n";
+
+ return 0;
+}
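
A worked example of the split computed by allocate_threads() above (the
reader count is fixed at num_input_threads = 1):

    --threads   readers   writers   workers   max_records_queued
        4          1         1         2            2000
        8          1         3         4            4000
       16          1         4        11           11000
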
+
+int
+initialize_counters(map<string, long> &counter)
+{
+ counter["total"] = 0;
+ counter["ill_filtered"] = 0;
+ counter["adapter"] = 0;
+ counter["low_quality"] = 0;
+ counter["noradtag"] = 0;
+ counter["ambiguous"] = 0;
+ counter["retained"] = 0;
+ counter["recovered"] = 0;
+ counter["transposed"] = 0;
+
+ return 0;
+}
+
+int
+consolidate_counters(map<string, long> &counter,
+ vector<map<string, long>> &local_counters)
+{
+ for (uint j = 0; j < local_counters.size(); j++) {
+ map<string, long> &cntr = local_counters[j];
+
+ counter["total"] += cntr["total"];
+ counter["ill_filtered"] += cntr["ill_filtered"];
+ counter["adapter"] += cntr["adapter"];
+ counter["low_quality"] += cntr["low_quality"];
+ counter["noradtag"] += cntr["noradtag"];
+ counter["ambiguous"] += cntr["ambiguous"];
+ counter["retained"] += cntr["retained"];
+ counter["recovered"] += cntr["recovered"];
+ counter["transposed"] += cntr["transposed"];
+ }
+
+ return 0;
+}
+
+int
+consolidate_barcode_logs(map<BarcodePair, map<string, long>> &barcode_log,
+ vector<map<BarcodePair, map<string, long>>> &local_barcode_logs)
+{
+ for (uint j = 0; j < local_barcode_logs.size(); j++) {
+ map<BarcodePair, map<string, long>> &blog = local_barcode_logs[j];
+
+ for (auto it = blog.begin(); it != blog.end(); it++) {
+ BarcodePair bc = it->first;
+ if (barcode_log.count(bc) == 0) {
+ barcode_log[bc]["noradtag"] = 0;
+ barcode_log[bc]["total"] = 0;
+ barcode_log[bc]["low_qual"] = 0;
+ barcode_log[bc]["retained"] = 0;
+ }
+ barcode_log[bc]["noradtag"] += it->second["noradtag"];
+ barcode_log[bc]["total"] += it->second["total"];
+ barcode_log[bc]["low_qual"] += it->second["low_qual"];
+ barcode_log[bc]["retained"] += it->second["retained"];
+ }
+ }
+
+ return 0;
+}
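
These two consolidate_*() helpers, together with initialize_counters(),
exist so that each worker thread can tally into its own local
map<string, long> without any locking; the per-thread maps are merged
into the global counters once per input file, after the parallel region
finishes (see the consolidate_*() calls at the end of the parallel
read-processing functions below).
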
+
template <typename fhType>
int
process_paired_reads(string prefix_1,
@@ -287,28 +416,28 @@ process_paired_reads(string prefix_1,
discard_fh_1 = new ofstream(path_1.c_str(), ifstream::out);
if (discard_fh_1->fail()) {
- cerr << "Error opening discard output file '" << path_1 << "'\n";
+ cerr << "Error opening single-end discard output file '" << path_1 << "'\n";
exit(1);
}
path_2 = out_path + prefix_2 + ".discards";
discard_fh_2 = new ofstream(path_2.c_str(), ifstream::out);
- if (discard_fh_1->fail()) {
- cerr << "Error opening discard output file '" << path_2 << "'\n";
+ if (discard_fh_2->fail()) {
+ cerr << "Error opening paired-end discard output file '" << path_2 << "'\n";
exit(1);
}
}
//
- // Read in the first record, initializing the Seq object s. Then
- // initialize the Read object r, then loop, using the same objects.
+ // Read in the first pair of records, initializing the Seq objects s_1 and s_2. Then
+ // initialize the Read objects r_1 and r_2, then loop, using the same objects.
//
Seq *s_1 = fh_1->next_seq();
Seq *s_2 = fh_2->next_seq();
if (s_1 == NULL || s_2 == NULL) {
- cerr << "Attempting to read first pair of input records, unable to allocate "
- << "Seq object (Was the correct input type specified?).\n";
+ cerr << "Error: a failure occurred while attempting to read first pair of input sequences "
+ << "(was the correct input type specified?).\n";
exit(1);
}
@@ -329,12 +458,21 @@ process_paired_reads(string prefix_1,
if (max_bc_size_1 == 0)
bc.set(prefix_1, prefix_2);
- long i = 1;
+ long i = 1;
+ int line_len = 33;
+ char numstr[id_len];
- cerr << " Processing RAD-Tags...";
+ cerr << " Processing pairs of RAD-Tags...";
do {
- if (i % 1000000 == 0)
- cerr << i/1000000 << "M...";
+ if (i % 1000000 == 0) {
+ sprintf(numstr, "%ldM...", i/1000000);
+ line_len += strlen(numstr);
+ cerr << numstr;
+ if (line_len > 80) {
+ cerr << "\n";
+ line_len = 0;
+ }
+ }
parse_input_record(s_1, r_1);
parse_input_record(s_2, r_2);
@@ -454,7 +592,7 @@ process_paired_reads(string prefix_1,
write_fasta(discard_fh_2, s_2);
if (!result_1 || !result_2) {
- cerr << "Error writing to discard file for '" << bc.str() << "'\n";
+ cerr << "Error writing to discard file\n";
return_val = -1;
break;
}
@@ -481,6 +619,449 @@ process_paired_reads(string prefix_1,
return return_val;
}
+template <typename fhType>
+int
+process_paired_reads_parallel(string prefix_1,
+ string prefix_2,
+ set<string> &se_bc, set<string> &pe_bc,
+ map<BarcodePair, fhType *> &pair_1_fhs,
+ map<BarcodePair, fhType *> &pair_2_fhs,
+ map<BarcodePair, fhType *> &rem_1_fhs,
+ map<BarcodePair, fhType *> &rem_2_fhs,
+ map<string, long> &counter,
+ map<BarcodePair, map<string, long> > &barcode_log) {
+ Input *fh_1=NULL, *fh_2=NULL;
+ RawRead *r_1=NULL, *r_2=NULL;
+ ofstream *discard_fh_1=NULL, *discard_fh_2=NULL;
+
+ int return_val = 1;
+
+ string path_1 = in_path_1 + prefix_1;
+ string path_2 = in_path_2 + prefix_2;
+
+ if (interleaved)
+ cerr << " Reading data from:\n " << path_1 << "\n";
+ else
+ cerr << " Reading data from:\n " << path_1 << " and\n " << path_2 << "\n";
+
+ if (in_file_type == FileT::fastq) {
+ fh_1 = new Fastq(path_1.c_str());
+ fh_2 = interleaved ? fh_1 : new Fastq(path_2.c_str());
+ } else if (in_file_type == FileT::gzfastq) {
+ fh_1 = new GzFastq(path_1.c_str());
+ fh_2 = interleaved ? fh_1 : new GzFastq(path_2.c_str());
+ } else if (in_file_type == FileT::bam) {
+ fh_1 = new BamUnAln(path_1.c_str());
+ fh_2 = fh_1;
+ } else if (in_file_type == FileT::bustard) {
+ fh_1 = new Bustard(path_1.c_str());
+ fh_2 = interleaved ? fh_1 : new Bustard(path_2.c_str());
+ }
+
+ //
+ // Open a file for recording discarded reads
+ //
+ if (discards) {
+ path_1 = out_path + prefix_1 + ".discards";
+ discard_fh_1 = new ofstream(path_1.c_str(), ifstream::out);
+
+ if (discard_fh_1->fail()) {
+ cerr << "Error opening single-end discard output file '" << path_1 << "'\n";
+ exit(1);
+ }
+
+ path_2 = out_path + prefix_2 + ".discards";
+ discard_fh_2 = new ofstream(path_2.c_str(), ifstream::out);
+
+ if (discard_fh_2->fail()) {
+ cerr << "Error opening paired-end discard output file '" << path_2 << "'\n";
+ exit(1);
+ }
+ }
+
+ //
+ // Set len_limit so that if we encounter reads already shorter than truncate_seq limit
+ // they will be discarded.
+ //
+ if (truncate_seq > 0)
+ len_limit = truncate_seq;
+
+ //
+ // The reader thread will read records from the current input file and store them in a set of input
+ // buffer queues (one per worker thread). The worker threads will process reads from their input
+ // queue and store them in a corresponding output buffer queue. The writer threads will
+ // pull reads from the output queues and write them to disk.
+ //
+ vector<queue<pair<Seq *, Seq *>>> input_buffer(num_worker_threads, queue<pair<Seq *, Seq *>>());
+ vector<queue<ReadCompPair>> output_buffer(num_worker_threads, queue<ReadCompPair>());
+
+ //
+ // Create a set of locks, to protect the input and output queues, one for each worker thread.
+ //
+ omp_lock_t inpbuf_lock[num_worker_threads];
+ omp_lock_t outbuf_lock[num_worker_threads];
+ for (int j = 0; j < num_worker_threads; j++) {
+ omp_init_lock(&inpbuf_lock[j]);
+ omp_init_lock(&outbuf_lock[j]);
+ }
+
+ map<BarcodePair, int> output_thread_map;
+ uint k = 0;
+ for (auto it = pair_1_fhs.begin(); it != pair_1_fhs.end(); it++) {
+ output_thread_map[it->first] = (k % num_output_threads) + num_input_threads;
+ k++;
+ }
+
+ //
+ // Create a set of local counters for each worker thread. These will have to be
+ // consolidated after processing the file.
+ //
+ vector<map<BarcodePair, map<string, long>>> local_barcode_log;
+ vector<map<string, long>> local_counters;
+ for (int j = 0; j < num_worker_threads; j++) {
+ local_barcode_log.push_back(map<BarcodePair, map<string, long>>());
+ local_counters.push_back(map<string, long>());
+ }
+
+ //
+ // Read in the first pair of records, initializing the Seq objects s_1 and s_2. Then
+ // initialize the Read objects r_1 and r_2, then loop, using the same objects.
+ //
+ Seq *s_1 = fh_1->next_seq();
+ Seq *s_2 = fh_2->next_seq();
+ if (s_1 == NULL || s_2 == NULL) {
+ cerr << "Error: a failure occurred while attempting to read first pair of input sequences "
+ << "(was the correct input type specified?).\n";
+ exit(1);
+ }
+
+ long records_read = 0;
+ long records_processed = 0;
+ long records_written = 0;
+ bool stop = false;
+ uint seq_init_len_1 = strlen(s_1->seq);
+ uint seq_init_len_2 = strlen(s_2->seq);
+
+ int logger_thread = num_input_threads;
+ long prev_log = 0;
+ int line_len = 33;
+ char numstr[id_len];
+
+ cerr << " Processing pairs of RAD-Tags...";
+
+ #pragma omp parallel shared(records_read, records_processed, records_written, stop, inpbuf_lock, outbuf_lock)
+ {
+ int thread_no = omp_get_thread_num();
+
+ if (thread_no == 0) {
+ //
+ // Thread 0 is our reader thread. It reads records and distributes them round-robin across the worker input queues.
+ //
+ uint worker_thread_index = 0;
+
+ do {
+ omp_set_lock(&inpbuf_lock[worker_thread_index]);
+ input_buffer[worker_thread_index].push(make_pair(s_1, s_2));
+ omp_unset_lock(&inpbuf_lock[worker_thread_index]);
+
+ worker_thread_index++;
+ worker_thread_index = worker_thread_index % num_worker_threads;
+ records_read++;
+
+ while (records_read - records_written > max_records_queued) {
+ #pragma omp taskyield
+ #pragma omp flush(records_written)
+ }
+ } while ((s_1 = fh_1->next_seq()) != NULL &&
+ (s_2 = fh_2->next_seq()) != NULL);
+
+ stop = true;
+
+ } else if (thread_no > 0 && thread_no < (num_input_threads + num_output_threads)) {
+ //
+ // These are our writer threads. They write each processed read to the appropriate output file(s).
+ //
+ uint worker_thread_index = 0;
+ RawRead *r_1, *r_2;
+ Seq *s_1, *s_2;
+ BarcodePair *bc;
+
+ while (stop == false || records_written < records_read) {
+
+ //
+ // Attempt to get the lock on this output buffer. If the lock is already held by another thread
+ // move to the next output buffer.
+ //
+ if (omp_test_lock(&outbuf_lock[worker_thread_index]) == false) {
+ worker_thread_index++;
+ worker_thread_index = worker_thread_index % num_worker_threads;
+ continue;
+ } else {
+ if (output_buffer[worker_thread_index].size() == 0) {
+ omp_unset_lock(&outbuf_lock[worker_thread_index]);
+ worker_thread_index++;
+ worker_thread_index = worker_thread_index % num_worker_threads;
+ continue;
+ }
+
+ ReadCompPair &rc = output_buffer[worker_thread_index].front();
+
+ //
+ // Does this barcode combination belong to this output thread? If not, move to the next output buffer.
+ //
+ if (output_thread_map.count(*rc.bc) > 0 &&
+ output_thread_map.at(*rc.bc) != thread_no) {
+ omp_unset_lock(&outbuf_lock[worker_thread_index]);
+ worker_thread_index++;
+ worker_thread_index = worker_thread_index % num_worker_threads;
+ continue;
+ }
+
+ //
+ // Found a record to output for this writer thread.
+ //
+ r_1 = rc.r_1;
+ r_2 = rc.r_2;
+ s_1 = rc.s_1;
+ s_2 = rc.s_2;
+ bc = rc.bc;
+ output_buffer[worker_thread_index].pop();
+
+ omp_unset_lock(&outbuf_lock[worker_thread_index]);
+ }
+
+ int result_1 = 1;
+ int result_2 = 1;
+
+ if (r_1->retain && r_2->retain) {
+ if (retain_header) {
+ result_1 = (out_file_type == FileT::fastq || out_file_type == FileT::gzfastq) ?
+ write_fastq(pair_1_fhs[*bc], s_1, r_1) :
+ write_fasta(pair_1_fhs[*bc], s_1, r_1);
+ result_2 = (out_file_type == FileT::fastq || out_file_type == FileT::gzfastq) ?
+ write_fastq(pair_2_fhs[*bc], s_2, r_2) :
+ write_fasta(pair_2_fhs[*bc], s_2, r_2);
+ } else {
+ result_1 = (out_file_type == FileT::fastq || out_file_type == FileT::gzfastq) ?
+ write_fastq(pair_1_fhs[*bc], r_1, overhang) :
+ write_fasta(pair_1_fhs[*bc], r_1, overhang);
+ result_2 = (out_file_type == FileT::fastq || out_file_type == FileT::gzfastq) ?
+ write_fastq(pair_2_fhs[*bc], r_2, overhang) :
+ write_fasta(pair_2_fhs[*bc], r_2, overhang);
+ }
+ } else if (r_1->retain && !r_2->retain) {
+ //
+ // Write to the remainder file.
+ //
+ if (retain_header)
+ result_1 = (out_file_type == FileT::fastq || out_file_type == FileT::gzfastq) ?
+ write_fastq(rem_1_fhs[*bc], s_1, r_1) :
+ write_fasta(rem_1_fhs[*bc], s_1, r_1);
+ else
+ result_1 = (out_file_type == FileT::fastq || out_file_type == FileT::gzfastq) ?
+ write_fastq(rem_1_fhs[*bc], r_1, overhang) :
+ write_fasta(rem_1_fhs[*bc], r_1, overhang);
+
+ } else if (!r_1->retain && r_2->retain) {
+ //
+ // Write to the remainder file.
+ //
+ if (retain_header)
+ result_2 = (out_file_type == FileT::fastq || out_file_type == FileT::gzfastq) ?
+ write_fastq(rem_2_fhs[*bc], s_2, r_2) :
+ write_fasta(rem_2_fhs[*bc], s_2, r_2);
+ else
+ result_2 = (out_file_type == FileT::fastq || out_file_type == FileT::gzfastq) ?
+ write_fastq(rem_2_fhs[*bc], r_2, overhang) :
+ write_fasta(rem_2_fhs[*bc], r_2, overhang);
+ }
+
+ if (!result_1 || !result_2) {
+ cerr << "Error writing to output file for '" << bc->str() << "'\n";
+ return_val = -1;
+ break;
+ }
+
+ if (discards && !r_1->retain)
+ result_1 = (out_file_type == FileT::fastq || out_file_type == FileT::gzfastq) ?
+ write_fastq(discard_fh_1, s_1) :
+ write_fasta(discard_fh_1, s_1);
+ if (discards && !r_2->retain)
+ result_2 = (out_file_type == FileT::fastq || out_file_type == FileT::gzfastq) ?
+ write_fastq(discard_fh_2, s_2) :
+ write_fasta(discard_fh_2, s_2);
+
+ if (!result_1 || !result_2) {
+ cerr << "Error writing to discard file for '" << bc->str() << "'\n";
+ return_val = -1;
+ break;
+ }
+
+ delete r_1;
+ delete r_2;
+ delete s_1;
+ delete s_2;
+ delete bc;
+
+ worker_thread_index++;
+ worker_thread_index = worker_thread_index % num_worker_threads;
+
+ #pragma omp atomic
+ records_written++;
+
+ if (thread_no == logger_thread) {
+ if (int(records_written / 1000000) > prev_log) {
+ prev_log = int(records_written/1000000);
+ sprintf(numstr, "%dM...", int(records_written/1000000));
+ line_len += strlen(numstr);
+ cerr << numstr;
+ if (line_len > 80) {
+ cerr << "\n";
+ line_len = 0;
+ }
+ }
+ }
+ }
+
+ } else {
+ //
+ // All other threads process individual reads.
+ //
+ Seq *s_1, *s_2;
+ RawRead *r_1, *r_2;
+ BarcodePair *bc;
+ uint thread_index = thread_no - num_input_threads - num_output_threads;
+ queue<pair<Seq *, Seq *>> &thread_input_buffer = input_buffer[thread_index];
+ omp_lock_t &thread_inpbuf_lock = inpbuf_lock[thread_index];
+ queue<ReadCompPair> &thread_output_buffer = output_buffer[thread_index];
+ omp_lock_t &thread_outbuf_lock = outbuf_lock[thread_index];
+
+ map<BarcodePair, map<string, long>> &blog = local_barcode_log[thread_index];
+ map<string, long> &cntr = local_counters[thread_index];
+
+ while (stop == false || records_processed < records_read) {
+
+ while (stop == false && thread_input_buffer.size() == 0) {
+ #pragma omp flush(stop)
+ }
+
+ if (stop && thread_input_buffer.size() == 0)
+ break;
+
+ omp_set_lock(&thread_inpbuf_lock);
+ s_1 = thread_input_buffer.front().first;
+ s_2 = thread_input_buffer.front().second;
+ thread_input_buffer.pop();
+ omp_unset_lock(&thread_inpbuf_lock);
+
+ r_1 = new RawRead(seq_init_len_1, 1, min_bc_size_1, win_size);
+ r_2 = new RawRead(seq_init_len_2, 2, min_bc_size_2, win_size);
+
+ parse_input_record(s_1, r_1);
+ parse_input_record(s_2, r_2);
+ cntr["total"] += 2;
+
+ //
+ // If a BestRAD protocol was used, check which read the barcode/restriction cutsite is on
+ // and transpose the reads if necessary.
+ //
+ if (bestrad)
+ if (check_for_transposed_reads(r_1, r_2, renz_1)) {
+ transpose_reads(&r_1, &r_2);
+ cntr["transposed"]++;
+ }
+
+ bc = new BarcodePair();
+ //
+ // If no barcodes were specified, set the barcode object to be the input file name so
+ // that reads are written to an output file of the same name as the input file.
+ //
+ if (max_bc_size_1 == 0)
+ bc->set(prefix_1, prefix_2);
+
+ switch (barcode_type) {
+ case inline_null:
+ case index_null:
+ case null_index:
+ bc->set(r_1->se_bc);
+ break;
+ case index_inline:
+ case inline_index:
+ case inline_inline:
+ case index_index:
+ bc->set(r_1->se_bc, r_2->pe_bc);
+ break;
+ case null_null:
+ default:
+ break;
+ }
+
+ process_barcode(r_1, r_2, *bc, pair_1_fhs, se_bc, pe_bc, blog, cntr);
+
+ //
+ // Adjust the size of the read to accommodate truncating the sequence and variable
+ // barcode lengths. With standard Illumina data we want to output constant length
+ // reads even as the barcode size may change. Other technologies, like IonTorrent,
+ // need to be truncated uniformly.
+ //
+ if (truncate_seq > 0) {
+ if (truncate_seq + r_1->inline_bc_len <= r_1->len)
+ r_1->set_len(truncate_seq + r_1->inline_bc_len);
+ if (truncate_seq + r_2->inline_bc_len <= r_2->len)
+ r_2->set_len(truncate_seq + r_2->inline_bc_len);
+ } else {
+ if (barcode_type == inline_null || barcode_type == inline_inline || barcode_type == inline_index)
+ r_1->set_len(r_1->len - (max_bc_size_1 - r_1->inline_bc_len));
+ if (barcode_type == index_inline || barcode_type == inline_inline)
+ r_2->set_len(r_2->len - (max_bc_size_2 - r_2->inline_bc_len));
+ }
+
+ if (r_1->retain)
+ process_singlet(r_1, renz_1, false, blog[*bc], cntr);
+ if (r_2->retain)
+ process_singlet(r_2, renz_2, true, blog[*bc], cntr);
+
+ //
+ // Store the result in the output buffer for writing.
+ //
+ omp_set_lock(&thread_outbuf_lock);
+ thread_output_buffer.push(ReadCompPair(s_1, s_2, r_1, r_2, bc));
+ omp_unset_lock(&thread_outbuf_lock);
+
+ #pragma omp atomic
+ records_processed++;
+
+ #pragma omp flush(records_read, stop)
+ }
+ }
+ }
+ cerr << "\n";
+
+ for (int j = 0; j < num_worker_threads; j++) {
+ omp_destroy_lock(&inpbuf_lock[j]);
+ omp_destroy_lock(&outbuf_lock[j]);
+ }
+
+ //
+ // Consolidate counters and barcode logs from the worker threads.
+ //
+ consolidate_barcode_logs(barcode_log, local_barcode_log);
+ consolidate_counters(counter, local_counters);
+
+ if (discards) {
+ delete discard_fh_1;
+ delete discard_fh_2;
+ }
+
+ delete fh_1;
+ if (interleaved == false) delete fh_2;
+
+ delete r_1;
+ delete r_2;
+
+ return return_val;
+}
+
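
The control flow of process_paired_reads_parallel() is easier to see in
isolation. Below is a minimal, self-contained sketch of the same
one-reader / N-worker / one-writer queue pattern. This is not Stacks
code: the int item type, the toy squaring workload, and all names are
invented for illustration, and the back-pressure throttle and error
handling are omitted.

    // Sketch of the reader/worker/writer pipeline used above.
    #include <omp.h>
    #include <cstdio>
    #include <queue>
    #include <vector>

    int main() {
        const int n_workers = 2;  // thread 0 reads, thread 1 writes
        std::vector<std::queue<int>> in_q(n_workers), out_q(n_workers);
        std::vector<omp_lock_t> in_l(n_workers), out_l(n_workers);
        for (int j = 0; j < n_workers; j++) {
            omp_init_lock(&in_l[j]);
            omp_init_lock(&out_l[j]);
        }

        long records_read = 0, records_written = 0;
        bool stop = false;

        omp_set_dynamic(0);
        omp_set_num_threads(n_workers + 2);
        #pragma omp parallel shared(records_read, records_written, stop)
        {
            int t = omp_get_thread_num();
            if (t == 0) {
                // Reader: distribute items round-robin across the worker input queues.
                for (int i = 1; i <= 100; i++) {
                    int w = i % n_workers;
                    omp_set_lock(&in_l[w]);
                    in_q[w].push(i);
                    omp_unset_lock(&in_l[w]);
                    #pragma omp atomic
                    records_read++;
                }
                stop = true;
                #pragma omp flush(stop)
            } else if (t == 1) {
                // Writer: drain every output queue until all read items are written.
                while (true) {
                    #pragma omp flush(stop, records_read)
                    if (stop && records_written == records_read)
                        break;
                    for (int w = 0; w < n_workers; w++) {
                        omp_set_lock(&out_l[w]);
                        while (!out_q[w].empty()) {
                            std::printf("%d\n", out_q[w].front());
                            out_q[w].pop();
                            records_written++;
                        }
                        omp_unset_lock(&out_l[w]);
                    }
                }
            } else {
                // Worker: pop from its own input queue, process, push to its output queue.
                int w = t - 2;
                while (true) {
                    omp_set_lock(&in_l[w]);
                    bool have = !in_q[w].empty();
                    int v = have ? in_q[w].front() : 0;
                    if (have) in_q[w].pop();
                    omp_unset_lock(&in_l[w]);
                    if (!have) {
                        #pragma omp flush(stop)
                        if (stop) break;  // reader finished and queue drained
                        continue;
                    }
                    omp_set_lock(&out_l[w]);
                    out_q[w].push(v * v);  // the toy "processing" step
                    omp_unset_lock(&out_l[w]);
                }
            }
        }
        for (int j = 0; j < n_workers; j++) {
            omp_destroy_lock(&in_l[j]);
            omp_destroy_lock(&out_l[j]);
        }
        return 0;
    }
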
template <typename fhType>
int
process_reads(string prefix,
@@ -488,7 +1069,7 @@ process_reads(string prefix,
map<BarcodePair, fhType *> &pair_1_fhs,
map<string, long> &counter,
map<BarcodePair, map<string, long> > &barcode_log) {
- Input *fh=NULL;
+ Input *fh=NULL;
RawRead *r;
ofstream *discard_fh=NULL;
@@ -518,24 +1099,24 @@ process_reads(string prefix,
}
}
+ //
+ // Set len_limit so that if we encounter reads already shorter than truncate_seq limit
+ // they will be discarded.
+ //
+ if (truncate_seq > 0)
+ len_limit = truncate_seq;
+
//
// Read in the first record, initializing the Seq object s. Then
// initialize the Read object r, then loop, using the same objects.
//
Seq *s = fh->next_seq();
if (s == NULL) {
- cerr << "Attempting to read first input record, unable to allocate "
- << "Seq object (Was the correct input type specified?).\n";
+ cerr << "Error: a failure occurred while attempting to read the first input sequence "
+ << "(was the correct input type specified?).\n";
exit(1);
}
- //
- // Set len_limit so that if we encounter reads already shorter than truncate_seq limit
- // they will be discarded.
- //
- if (truncate_seq > 0)
- len_limit = truncate_seq;
-
r = new RawRead(strlen(s->seq), 1, min_bc_size_1, win_size);
BarcodePair bc;
@@ -548,11 +1129,21 @@ process_reads(string prefix,
//cerr << "Length: " << r->len << "; Window length: " << r->win_len << "; Stop position: " << r->stop_pos << "\n";
- long i = 1;
+ long i = 1;
+ int line_len = 24;
+ char numstr[id_len];
+
cerr << " Processing RAD-Tags...";
do {
- if (i % 1000000 == 0)
- cerr << i/1000000 << "M...";
+ if (i % 1000000 == 0) {
+ sprintf(numstr, "%ldM...", i/1000000);
+ line_len += strlen(numstr);
+ cerr << numstr;
+ if (line_len > 80) {
+ cerr << "\n";
+ line_len = 0;
+ }
+ }
counter["total"]++;
@@ -577,7 +1168,7 @@ process_reads(string prefix,
if (truncate_seq + r->inline_bc_len <= r->len)
r->set_len(truncate_seq + r->inline_bc_len);
} else {
- if (barcode_type == inline_null || barcode_type == inline_inline || barcode_type == inline_index)
+ if (barcode_type == inline_null || barcode_type == inline_inline || barcode_type == inline_index)
r->set_len(r->len - (max_bc_size_1 - r->inline_bc_len));
}
@@ -632,6 +1223,354 @@ process_reads(string prefix,
return return_val;
}
+template <typename fhType>
+int
+process_reads_parallel(string prefix,
+ set<string> &se_bc, set<string> &pe_bc,
+ map<BarcodePair, fhType *> &pair_1_fhs,
+ map<string, long> &counter,
+ map<BarcodePair, map<string, long>> &barcode_log)
+{
+ Input *fh = NULL;
+ ofstream *discard_fh = NULL;
+
+ int return_val = 1;
+
+ string path = in_path_1 + prefix;
+
+ if (in_file_type == FileT::fastq)
+ fh = new Fastq(path.c_str());
+ else if (in_file_type == FileT::gzfastq)
+ fh = new GzFastq(path.c_str());
+ else if (in_file_type == FileT::bam)
+ fh = new BamUnAln(path.c_str());
+ else if (in_file_type == FileT::bustard)
+ fh = new Bustard(path.c_str());
+
+ //
+ // Open a file for recording discarded reads
+ //
+ if (discards) {
+ path = out_path + prefix + ".discards";
+ discard_fh = new ofstream(path.c_str(), ifstream::out);
+
+ if (discard_fh->fail()) {
+ cerr << "Error opening discard output file '" << path << "'\n";
+ exit(1);
+ }
+ }
+
+ //
+ // Set len_limit so that if we encounter reads already shorter than truncate_seq limit
+ // they will be discarded.
+ //
+ if (truncate_seq > 0)
+ len_limit = truncate_seq;
+
+ //
+ // The reader thread will read records from the current input file and store them in a set of input
+ // buffer queues (one per worker thread). The worker threads will process reads from their input
+ // queue and store them in a corresponding output buffer queue. The writer threads will
+ // pull reads from the output queues and write them to disk.
+ //
+ vector<queue<Seq *>> input_buffer(num_worker_threads, queue<Seq *>());
+ vector<queue<ReadComp>> output_buffer(num_worker_threads, queue<ReadComp>());
+
+ //
+ // Create a set of locks, to protect the input and output queues, one for each worker thread.
+ //
+ omp_lock_t inpbuf_lock[num_worker_threads];
+ omp_lock_t outbuf_lock[num_worker_threads];
+ for (int j = 0; j < num_worker_threads; j++) {
+ omp_init_lock(&inpbuf_lock[j]);
+ omp_init_lock(&outbuf_lock[j]);
+ }
+
+ map<BarcodePair, int> output_thread_map;
+ uint k = 0;
+ for (auto it = pair_1_fhs.begin(); it != pair_1_fhs.end(); it++) {
+ output_thread_map[it->first] = (k % num_output_threads) + num_input_threads;
+ k++;
+ }
+
+ //
+ // Create a set of local counters for each worker thread. These will have to be
+ // consolidated after processing the file.
+ //
+ vector<map<BarcodePair, map<string, long>>> local_barcode_log;
+ vector<map<string, long>> local_counters;
+ for (int j = 0; j < num_worker_threads; j++) {
+ local_barcode_log.push_back(map<BarcodePair, map<string, long>>());
+ local_counters.push_back(map<string, long>());
+ }
+
+ //
+ // Read in the first record, initializing the Seq object s, checking that we can parse the file.
+ //
+ Seq *s = fh->next_seq();
+ if (s == NULL) {
+ cerr << "Error: a failure occurred while attempting to read the first input sequence "
+ << "(was the correct input type specified?).\n";
+ exit(1);
+ }
+
+ long records_read = 0;
+ long records_processed = 0;
+ long records_written = 0;
+ bool stop = false;
+ uint seq_init_len = strlen(s->seq);
+
+ int logger_thread = num_input_threads;
+ long prev_log = 0;
+ int line_len = 24;
+ char numstr[id_len];
+
+ cerr << " Processing RAD-Tags...";
+
+ #pragma omp parallel shared(records_read, records_processed, records_written, stop, inpbuf_lock, outbuf_lock)
+ {
+ int thread_no = omp_get_thread_num();
+
+ if (thread_no == 0) {
+ //
+ // Thread 0 is our reader thread. It reads records and distributes them round-robin across the worker input queues.
+ //
+ uint worker_thread_index = 0;
+
+ do {
+ omp_set_lock(&inpbuf_lock[worker_thread_index]);
+ input_buffer[worker_thread_index].push(s);
+ omp_unset_lock(&inpbuf_lock[worker_thread_index]);
+
+ worker_thread_index++;
+ worker_thread_index = worker_thread_index % num_worker_threads;
+ records_read++;
+
+ //
+ // Throttle the input reading thread to control memory usage.
+ //
+ while (records_read - records_written > max_records_queued) {
+ #pragma omp taskyield
+ #pragma omp flush(records_written)
+ }
+ } while ((s = fh->next_seq()) != NULL);
+
+ stop = true;
+
+ } else if (thread_no > 0 && thread_no < (num_input_threads + num_output_threads)) {
+ //
+ // These are our writer threads. They write each processed read to the appropriate output file(s).
+ //
+ uint worker_thread_index = 0;
+ RawRead *r;
+ Seq *s;
+ BarcodePair *bc;
+
+ while (stop == false || records_written < records_read) {
+
+ //
+ // Attempt to get the lock on this output buffer. If the lock is already held by another thread
+ // move to the next output buffer.
+ //
+ if (omp_test_lock(&outbuf_lock[worker_thread_index]) == false) {
+ worker_thread_index++;
+ worker_thread_index = worker_thread_index % num_worker_threads;
+ continue;
+ } else {
+ if (output_buffer[worker_thread_index].size() == 0) {
+ omp_unset_lock(&outbuf_lock[worker_thread_index]);
+ worker_thread_index++;
+ worker_thread_index = worker_thread_index % num_worker_threads;
+ continue;
+ }
+
+ ReadComp &rc = output_buffer[worker_thread_index].front();
+
+ //
+ // Does this barcode combination belong to this output thread? If not, move to the next output buffer.
+ //
+ if (output_thread_map.count(*rc.bc) > 0 &&
+ output_thread_map.at(*rc.bc) != thread_no) {
+ omp_unset_lock(&outbuf_lock[worker_thread_index]);
+ worker_thread_index++;
+ worker_thread_index = worker_thread_index % num_worker_threads;
+ continue;
+ }
+
+ //
+ // Found a record to output for this writer thread.
+ //
+ r = rc.r;
+ s = rc.s;
+ bc = rc.bc;
+ output_buffer[worker_thread_index].pop();
+
+ omp_unset_lock(&outbuf_lock[worker_thread_index]);
+ }
+
+ int result = 1;
+
+ if (r->retain) {
+ if (retain_header)
+ result = (out_file_type == FileT::fastq || out_file_type == FileT::gzfastq) ?
+ write_fastq(pair_1_fhs[*bc], s, r) :
+ write_fasta(pair_1_fhs[*bc], s, r);
+ else
+ result = (out_file_type == FileT::fastq || out_file_type == FileT::gzfastq) ?
+ write_fastq(pair_1_fhs[*bc], r, overhang) :
+ write_fasta(pair_1_fhs[*bc], r, overhang);
+ }
+
+ if (!result) {
+ cerr << "Error writing to output file for '" << bc->str() << "'\n";
+ return_val = -1;
+ break;
+ }
+
+ if (discards && !r->retain) {
+ #pragma omp critical
+ result = out_file_type == FileT::fastq || out_file_type == FileT::gzfastq ?
+ write_fastq(discard_fh, s) :
+ write_fasta(discard_fh, s);
+ }
+
+ if (!result) {
+ cerr << "Error writing to discard file for '" << bc->str() << "'\n";
+ return_val = -1;
+ break;
+ }
+
+ delete r;
+ delete s;
+ delete bc;
+
+ worker_thread_index++;
+ worker_thread_index = worker_thread_index % num_worker_threads;
+
+ #pragma omp atomic
+ records_written++;
+
+ if (thread_no == logger_thread) {
+ if (int(records_written / 1000000) > prev_log) {
+ prev_log = int(records_written/1000000);
+ sprintf(numstr, "%dM...", int(records_written/1000000));
+ line_len += strlen(numstr);
+ cerr << numstr;
+ if (line_len > 80) {
+ cerr << "\n";
+ line_len = 0;
+ }
+ }
+ }
+ }
+
+ } else {
+ //
+ // All other threads process individual reads.
+ //
+ Seq *s;
+ RawRead *r;
+ BarcodePair *bc;
+ uint thread_index = thread_no - num_input_threads - num_output_threads;
+ queue<Seq *> &thread_input_buffer = input_buffer[thread_index];
+ omp_lock_t &thread_inpbuf_lock = inpbuf_lock[thread_index];
+ queue<ReadComp> &thread_output_buffer = output_buffer[thread_index];
+ omp_lock_t &thread_outbuf_lock = outbuf_lock[thread_index];
+
+ map<BarcodePair, map<string, long>> &blog = local_barcode_log[thread_index];
+ map<string, long> &cntr = local_counters[thread_index];
+
+ while (stop == false || records_processed < records_read) {
+
+ while (stop == false && thread_input_buffer.size() == 0) {
+ #pragma omp flush(stop)
+ }
+
+ if (stop && thread_input_buffer.size() == 0)
+ break;
+
+ omp_set_lock(&thread_inpbuf_lock);
+ s = thread_input_buffer.front();
+ thread_input_buffer.pop();
+ omp_unset_lock(&thread_inpbuf_lock);
+
+ r = new RawRead(seq_init_len, 1, min_bc_size_1, win_size);
+
+ cntr["total"]++;
+
+ parse_input_record(s, r);
+
+ bc = new BarcodePair();
+ //
+ // If no barcodes were specified, set the barcode object to be the input file name so
+ // that reads are written to an output file of the same name as the input file.
+ //
+ if (max_bc_size_1 == 0)
+ bc->set(prefix);
+
+ if (barcode_type == inline_null ||
+ barcode_type == index_null)
+ bc->set(r->se_bc);
+ else if (barcode_type == index_inline ||
+ barcode_type == inline_index)
+ bc->set(r->se_bc, r->pe_bc);
+
+ process_barcode(r, NULL, *bc, pair_1_fhs, se_bc, pe_bc, blog, cntr);
+
+ //
+ // Adjust the size of the read to accommodate truncating the sequence and variable
+ // barcode lengths. With standard Illumina data we want to output constant length
+ // reads even as the barcode size may change. Other technologies, like IonTorrent,
+ // need to be truncated uniformly.
+ //
+ if (truncate_seq > 0) {
+ if (truncate_seq + r->inline_bc_len <= r->len)
+ r->set_len(truncate_seq + r->inline_bc_len);
+ } else {
+ if (barcode_type == inline_null || barcode_type == inline_inline || barcode_type == inline_index)
+ r->set_len(r->len - (max_bc_size_1 - r->inline_bc_len));
+ }
+
+ if (r->retain)
+ process_singlet(r, renz_1, false, blog[*bc], cntr);
+
+ //
+ // Store the result in the output buffer for writing.
+ //
+ omp_set_lock(&thread_outbuf_lock);
+ thread_output_buffer.push(ReadComp(s, r, bc));
+ omp_unset_lock(&thread_outbuf_lock);
+
+ #pragma omp atomic
+ records_processed++;
+
+ #pragma omp flush(records_read, stop)
+ }
+ }
+ }
+ cerr << "\n";
+
+ for (int j = 0; j < num_worker_threads; j++) {
+ omp_destroy_lock(&inpbuf_lock[j]);
+ omp_destroy_lock(&outbuf_lock[j]);
+ }
+
+ //
+ // Consolidate counters and barcode logs from the worker threads.
+ //
+ consolidate_barcode_logs(barcode_log, local_barcode_log);
+ consolidate_counters(counter, local_counters);
+
+ if (discards) delete discard_fh;
+
+ //
+ // Close the file and delete the Input object.
+ //
+ delete fh;
+
+ return return_val;
+}
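
Note the back-pressure shared by both parallel variants: the reader
thread stalls whenever more than max_records_queued records are in
flight, which bounds queued memory to roughly 1000 records per worker
thread:

    while (records_read - records_written > max_records_queued) {
        #pragma omp taskyield
        #pragma omp flush(records_written)
    }
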
+
inline
int
process_singlet(RawRead *href,
@@ -925,8 +1864,6 @@ print_results(int argc, char **argv,
return 0;
}
- cerr << "Outputing details to log: '" << log_path << "'\n\n";
-
init_log(log, argc, argv);
log << "\n"
@@ -965,7 +1902,7 @@ print_results(int argc, char **argv,
<< it->second["total"] << "\n";
}
log << "END per_file_raw_read_counts" << "\n";
-
+
map<string, long> c;
c["total"] = 0;
c["low_quality"] = 0;
@@ -995,7 +1932,8 @@ print_results(int argc, char **argv,
<< " (" << as_percentage((double) n / c["total"]) << ")\n";
};
- cerr << c["total"] << " total sequences\n";
+ cerr << "\n"
+ << c["total"] << " total sequences\n";
if (bestrad)
print_nreads(c["transposed"], "reads were transposed [based on the BestRAD flag]");
if (filter_illumina)
@@ -1007,19 +1945,32 @@ print_results(int argc, char **argv,
print_nreads(c["noradtag"], "RAD cutsite not found drops");
print_nreads(c["retained"], "retained reads");
+ if (max_bc_size_1 > 0)
+ cerr << "\n"
+ << "Details logged: '" << log_path << "'\n"
+ << "For a summary, execute one or more:\n"
+ << " stacks-dist-extract " << log_path << " total_raw_read_counts\n"
+ << " stacks-dist-extract --pretty " << log_path << " per_barcode_raw_read_counts\n";
+ else
+ cerr << "\n"
+ << "Details logged: '" << log_path << "'\n"
+ << "For a summary, execute:\n"
+ << " stacks-dist-extract " << log_path << " total_raw_read_counts\n";
+
+
log << "\n"
<< "BEGIN total_raw_read_counts" << "\n"
<< "Total Sequences\t" << c["total"] << "\n";
if (bestrad)
- log << "Transposed reads\t" << c["transposed"] << "\n";
+ log << "Transposed reads\t" << c["transposed"] << "\t" << as_percentage((double) c["transposed"] / c["total"]) << "\n";
if (filter_illumina)
- log << "Failed Illumina filtered reads\t" << c["ill_filtered"] << "\n";
+ log << "Failed Illumina filtered reads\t" << c["ill_filtered"] << "\t" << as_percentage((double) c["ill_filtered"] / c["total"]) << "\n";
if (filter_adapter)
- log << "Reads containing adapter sequence\t" << c["adapter"] << "\n";
- log << "Barcode Not Found\t" << c["ambiguous"] << "\n"
- << "Low Quality\t" << c["low_quality"] << "\n"
- << "RAD Cutsite Not Found\t" << c["noradtag"] << "\n"
- << "Retained Reads\t" << c["retained"] << "\n"
+ log << "Reads containing adapter sequence\t" << c["adapter"] << "\t" << as_percentage((double) c["adapter"] / c["total"]) << "\n";
+ log << "Barcode Not Found\t" << c["ambiguous"] << "\t" << as_percentage((double) c["ambiguous"] / c["total"]) << "\n"
+ << "Low Quality\t" << c["low_quality"] << "\t" << as_percentage((double) c["low_quality"] / c["total"]) << "\n"
+ << "RAD Cutsite Not Found\t" << c["noradtag"] << "\t" << as_percentage((double) c["noradtag"] / c["total"]) << "\n"
+ << "Retained Reads\t" << c["retained"] << "\t" << as_percentage((double) c["retained"] / c["total"]) << "\n"
<< "END total_raw_read_counts" << "\n";
if (max_bc_size_1 == 0) return 0;
@@ -1045,7 +1996,9 @@ print_results(int argc, char **argv,
log << "Total\t"
<< "RAD Cutsite Not Found\t"
<< "Low Quality\t"
- << "Retained Reads\n";
+ << "Retained Reads\t"
+ << "Pct Retained\t"
+ << "Pct Total Reads\n";
set<BarcodePair> barcode_list;
@@ -1062,7 +2015,9 @@ print_results(int argc, char **argv,
log << barcode_log[barcodes[i]]["total"] << "\t"
<< barcode_log[barcodes[i]]["noradtag"] << "\t"
<< barcode_log[barcodes[i]]["low_qual"] << "\t"
- << barcode_log[barcodes[i]]["retained"] << "\n";
+ << barcode_log[barcodes[i]]["retained"] << "\t"
+ << as_percentage((double) barcode_log[barcodes[i]]["retained"] / barcode_log[barcodes[i]]["total"]) << "\t"
+ << as_percentage((double) barcode_log[barcodes[i]]["retained"] / c["total"]) << "\n";
}
log << "END per_barcode_raw_read_counts" << "\n"
@@ -1127,13 +2082,13 @@ int parse_command_line(int argc, char* argv[]) {
{"inline-index", no_argument, NULL, 'Z'}, {"inline_index", no_argument, NULL, 'Z'},
{"barcode-dist-1", required_argument, NULL, 'B'}, {"barcode_dist_1", required_argument, NULL, 'B'},
{"barcode-dist-2", required_argument, NULL, 'C'}, {"barcode_dist_2", required_argument, NULL, 'C'},
- {"infile-type", required_argument, NULL, 'i'}, {"infile_type", required_argument, NULL, 'i'},
- {"outfile-type", required_argument, NULL, 'y'}, {"outfile_type", required_argument, NULL, 'y'},
+ {"in-type", required_argument, NULL, 'i'}, {"infile_type", required_argument, NULL, 'i'},
+ {"out-type", required_argument, NULL, 'y'}, {"outfile_type", required_argument, NULL, 'y'},
{"file", required_argument, NULL, 'f'},
{"file-p1", required_argument, NULL, '1'}, {"file_p1", required_argument, NULL, '1'},
{"file-p2", required_argument, NULL, '2'}, {"file_p2", required_argument, NULL, '2'},
- {"path", required_argument, NULL, 'p'},
- {"outpath", required_argument, NULL, 'o'},
+ {"in-path", required_argument, NULL, 'p'},
+ {"out-path", required_argument, NULL, 'o'},
{"truncate", required_argument, NULL, 't'},
{"renz-1", required_argument, NULL, 'e'}, {"renz_1", required_argument, NULL, 'e'},
{"renz-2", required_argument, NULL, 'z'}, {"renz_2", required_argument, NULL, 'z'},
@@ -1145,6 +2100,7 @@ int parse_command_line(int argc, char* argv[]) {
{"adapter-1", required_argument, NULL, 'A'}, {"adapter_1", required_argument, NULL, 'A'},
{"adapter-2", required_argument, NULL, 'G'}, {"adapter_2", required_argument, NULL, 'G'},
{"adapter-mm", required_argument, NULL, 'T'}, {"adapter_mm", required_argument, NULL, 'T'},
+ {"threads", required_argument, NULL, 1001},
{0, 0, 0, 0}
};
@@ -1215,6 +2171,9 @@ int parse_command_line(int argc, char* argv[]) {
case 1000:
bestrad = true;
break;
+ case 1001:
+ num_threads = is_integer(optarg);
+ break;
case 'B':
barcode_dist_1 = is_integer(optarg);
break;
@@ -1434,28 +2393,27 @@ void version() {
void help() {
cerr << "process_radtags " << VERSION << "\n"
- << "process_radtags -p in_dir [--paired [--interleaved]] [-b barcode_file] -o out_dir -e enz [-c] [-q] [-r] [-t len]\n"
- << "process_radtags -f in_file [-b barcode_file] -o out_dir -e enz [-c] [-q] [-r] [-t len]\n"
- << "process_radtags -1 pair_1 -2 pair_2 [-b barcode_file] -o out_dir -e enz [-c] [-q] [-r] [-t len]\n"
+ << "process_radtags -p in_dir [-P] [-b barcode_file] -o out_dir -e enz [--threads num] [-c] [-q] [-r] [-t len]\n"
+ << "process_radtags -f in_file [-b barcode_file] -o out_dir -e enz [--threads num] [-c] [-q] [-r] [-t len]\n"
+ << "process_radtags --in-path in_dir [--paired] [--barcodes barcode_file] --out-path out_dir --renz-1 enz [--renz-2 enz] [--threads num] [-c] [-q] [-r] [-t len]\n"
+ << "process_radtags -1 pair_1 -2 pair_2 [-b barcode_file] -o out_dir -e enz [--threads num] [-c] [-q] [-r] [-t len]\n"
<< "\n"
- << " p: path to a directory of files.\n"
- << " P,--paired: files contained within the directory are paired.\n"
- << " I,--interleaved: specify that the paired-end reads are interleaved in single files.\n"
- << " i: input file type, either 'fastq', 'gzfastq' (gzipped fastq), 'bam', or 'bustard' (default: guess, or gzfastq if unable to).\n"
- << " b: path to a file containing barcodes for this run, omit to ignore any barcoding.\n"
- << " o: path to output the processed files.\n"
- << " f: path to the input file if processing single-end sequences.\n"
- << " 1: first input file in a set of paired-end sequences.\n"
- << " 2: second input file in a set of paired-end sequences.\n"
- << " c,--clean: clean data, remove any read with an uncalled base ('N').\n"
- << " q,--quality: discard reads with low quality (phred) scores.\n"
- << " r,--rescue: rescue barcodes and RAD-Tags.\n"
- << " t: truncate final read length to this value.\n"
- << " D: capture discarded reads to a file.\n"
- << " E: specify how quality scores are encoded, 'phred33' (Illumina 1.8+/Sanger, default) or 'phred64' (Illumina 1.3-1.5).\n"
- << " w: set the size of the sliding window as a fraction of the read length, between 0 and 1 (default 0.15).\n"
- << " s: set the score limit. If the average score within the sliding window drops below this value, the read is discarded (default 10).\n"
- << " y: output type, either 'fastq', 'gzfastq', 'fasta', or 'gzfasta' (default: match input type).\n"
+ << " -p,--in-path: path to a directory of files.\n"
+ << " -P,--paired: files contained within the directory are paired.\n"
+ << " -I,--interleaved: specify that the paired-end reads are interleaved in single files.\n"
+ << " -i,--in-type: input file type, either 'fastq', 'gzfastq' (gzipped fastq), 'bam', or 'bustard' (default: guess, or gzfastq if unable to).\n"
+ << " -b,--barcodes: path to a file containing barcodes for this run, omit to ignore any barcoding or for already demultiplexed data.\n"
+ << " -o,--out-path: path to output the processed files.\n"
+ << " -f: path to the input file if processing single-end sequences.\n"
+ << " -1: first input file in a set of paired-end sequences.\n"
+ << " -2: second input file in a set of paired-end sequences.\n"
+ << " --threads: number of threads to run.\n"
+ << " -c,--clean: clean data, remove any read with an uncalled base ('N').\n"
+ << " -q,--quality: discard reads with low quality (phred) scores.\n"
+ << " -r,--rescue: rescue barcodes and RAD-Tag cut sites.\n"
+ << " -t,--truncate: truncate final read length to this value.\n"
+ << " -D,--discards: capture discarded reads to a file.\n"
+ << " -y,--out-type: output type, either 'fastq', 'gzfastq', 'fasta', or 'gzfasta' (default: match input type).\n"
<< "\n"
<< " Barcode options:\n"
<< " --inline-null: barcode is inline with sequence, occurs only on single-end read (default).\n"
@@ -1467,8 +2425,8 @@ void help() {
<< " --index-inline: barcode occurs in FASTQ header (Illumina i5 or i7 read) and is inline with single-end sequence (for single-end data) on paired-end read (for paired-end data).\n"
<< "\n"
<< " Restriction enzyme options:\n"
- << " -e <enz>, --renz-1 <enz>: provide the restriction enzyme used (cut site occurs on single-end read)\n"
- << " --renz-2 <enz>: if a double digest was used, provide the second restriction enzyme used (cut site occurs on the paired-end read).\n"
+ << " -e [enz], --renz-1 [enz]: provide the restriction enzyme used (cut site occurs on single-end read)\n"
+ << " --renz-2 [enz]: if a double digest was used, provide the second restriction enzyme used (cut site occurs on the paired-end read).\n"
<< " Currently supported enzymes include:\n"
<< " ";
@@ -1502,6 +2460,9 @@ void help() {
<< " Advanced options:\n"
<< " --filter-illumina: discard reads that have been marked by Illumina's chastity/purity filter as failing.\n"
<< " --disable-rad-check: disable checking if the RAD cut site is intact.\n"
+ << " --encoding: specify how quality scores are encoded, 'phred33' (Illumina 1.8+/Sanger, default) or 'phred64' (Illumina 1.3-1.5).\n"
+ << " --window-size: set the size of the sliding window as a fraction of the read length, between 0 and 1 (default 0.15).\n"
+ << " --score-limit: set the phred score limit. If the average score within the sliding window drops below this value, the read is discarded (default 10).\n"
<< " --len-limit <limit>: specify a minimum sequence length (useful if your data has already been trimmed).\n"
<< " --barcode-dist-1: the number of allowed mismatches when rescuing single-end barcodes (default 1).\n"
<< " --barcode-dist-2: the number of allowed mismatches when rescuing paired-end barcodes (defaults to --barcode-dist-1).\n";
=====================================
src/process_radtags.h
=====================================
@@ -1,6 +1,6 @@
// -*-mode:c++; c-style:k&r; c-basic-offset:4;-*-
//
-// Copyright 2010-2018, Julian Catchen <jcatchen at illinois.edu>
+// Copyright 2010-2022, Julian Catchen <jcatchen at illinois.edu>
//
// This file is part of Stacks.
//
@@ -32,6 +32,8 @@
#include <sstream>
#include <vector>
+#include <deque>
+#include <queue>
#include <map>
#include <set>
#include <utility>
@@ -51,12 +53,22 @@
void help( void );
void version( void );
int parse_command_line(int, char **);
+int allocate_threads();
+int initialize_counters(map<string, long> &);
+int consolidate_counters(map<string, long> &, vector<map<string, long>> &);
+int consolidate_barcode_logs(map<BarcodePair, map<string, long>> &, vector<map<BarcodePair, map<string, long>>> &);
+
template<typename fhType>
int process_reads(string,
set<string> &, set<string> &,
map<BarcodePair, fhType *> &,
map<string, long> &, map<BarcodePair, map<string, long> > &);
template<typename fhType>
+int process_reads_parallel(string,
+ set<string> &, set<string> &,
+ map<BarcodePair, fhType *> &,
+ map<string, long> &, map<BarcodePair, map<string, long> > &);
+template<typename fhType>
int process_paired_reads(string, string,
set<string> &, set<string> &,
map<BarcodePair, fhType *> &,
@@ -64,6 +76,14 @@ int process_paired_reads(string, string,
map<BarcodePair, fhType *> &,
map<BarcodePair, fhType *> &,
map<string, long> &, map<BarcodePair, map<string, long> > &);
+template<typename fhType>
+int process_paired_reads_parallel(string, string,
+ set<string> &, set<string> &,
+ map<BarcodePair, fhType *> &,
+ map<BarcodePair, fhType *> &,
+ map<BarcodePair, fhType *> &,
+ map<BarcodePair, fhType *> &,
+ map<string, long> &, map<BarcodePair, map<string, long> > &);
int process_singlet(RawRead *,
string, bool,
map<string, long> &, map<string, long> &);
=====================================
src/renz.cc
=====================================
@@ -58,6 +58,8 @@ const char *ecoT22I[] = {"TGCAT", // A/TGCAT, EcoT22I
"ATGCA"};
const char *haeIII[] = {"CC", // GG/CC, HaeIII
"GG"};
+const char *hhaI[] = {"GCG", // GCG/C, HhaI
+ "CGC"};
const char *hinP1I[] = {"CGC", // G/CGC, HinP1I
"GCG"};
const char *hindIII[] = {"AGCTT", // A/AGCTT, HindIII
@@ -110,6 +112,9 @@ const char *pspXI[] = {"TCGAGC", "TCGAGG", "TCGAGT", // VC/TCGAGB, PspXI
"GCTCGA", "CCTCGA", "ACTCGA"};
const char *pstI[] = {"TGCAG", // CTGCA/G, PstI
"CTGCA"};
+const char *pstIshi[] = {"GCAG", // CTGCA/G, PstI;
+ "CTGC"}; // Modified so that "CT" is part of the Illumina sequencing primer and not in the final sequence.
+ // As published in Shirasawa, Hirakawa, and Isobe, 2016, DNA Research; doi: 10.1093/dnares/dsw004
const char *rsaI[] = {"AC", // GT/AC, RsaI
"GT"};
const char *sacI[] = {"AGCTC", // GAGCT/C, SacI
@@ -138,6 +143,7 @@ initialize_renz(map<string, const char **> &renz, map<string, int> &renz_cnt, ma
renz["sbfI"] = sbfI; // CCTGCA/GG, SbfI
renz["pstI"] = pstI; // CTGCA/G, PstI
+ renz["pstIshi"] = pstIshi; // CTGCA/G, PstI-SHI
renz["notI"] = notI; // GC/GGCCGC, NotI
renz["ecoRI"] = ecoRI; // G/AATTC, EcoRI
renz["nspI"] = nspI;
@@ -184,6 +190,7 @@ initialize_renz(map<string, const char **> &renz, map<string, int> &renz_cnt, ma
renz["csp6I"] = csp6I; // G/TAC, Csp6I
renz["bsaHI"] = bsaHI; // GR/CGYC, BsaHI
renz["hpaII"] = hpaII; // C/CGG, HpaII
+ renz["hhaI"] = hhaI; // GCG/C, HhaI
renz["ncoI"] = ncoI; // C/CATGG, NcoI
renz["apaLI"] = apaLI; // G/TGCAC, ApaLI
renz["hinP1I"] = hinP1I; // G/CGC, HinP1I
@@ -192,10 +199,11 @@ initialize_renz(map<string, const char **> &renz, map<string, int> &renz_cnt, ma
renz["pacI"] = pacI; // AAT/TAATT, PacI
renz["pspXI"] = pspXI; // VC/TCGAGB, PspXI
renz["hpyCH4IV"] = hpyCH4IV; // A/CGT, HpyCH4IV
- renz["ngoMIV"] = ngoMIV; // G/CCGGC, NgoMIV
+ renz["ngoMIV"] = ngoMIV; // G/CCGGC, NgoMIV
renz_cnt["sbfI"] = 1;
renz_cnt["pstI"] = 1;
+ renz_cnt["pstIshi"] = 1;
renz_cnt["notI"] = 1;
renz_cnt["ecoRI"] = 1;
renz_cnt["nspI"] = 2;
@@ -242,6 +250,7 @@ initialize_renz(map<string, const char **> &renz, map<string, int> &renz_cnt, ma
renz_cnt["csp6I"] = 1;
renz_cnt["bsaHI"] = 2;
renz_cnt["hpaII"] = 1;
+ renz_cnt["hhaI"] = 1;
renz_cnt["ncoI"] = 1;
renz_cnt["apaLI"] = 1;
renz_cnt["hinP1I"] = 1;
@@ -254,6 +263,7 @@ initialize_renz(map<string, const char **> &renz, map<string, int> &renz_cnt, ma
renz_len["sbfI"] = 6;
renz_len["pstI"] = 5;
+ renz_len["pstIshi"] = 4;
renz_len["notI"] = 6;
renz_len["ecoRI"] = 5;
renz_len["nspI"] = 5;
@@ -300,6 +310,7 @@ initialize_renz(map<string, const char **> &renz, map<string, int> &renz_cnt, ma
renz_len["csp6I"] = 3;
renz_len["bsaHI"] = 4;
renz_len["hpaII"] = 3;
+ renz_len["hhaI"] = 3;
renz_len["ncoI"] = 5;
renz_len["apaLI"] = 5;
renz_len["hinP1I"] = 3;
=====================================
src/ustacks.cc
=====================================
@@ -1,6 +1,6 @@
// -*-mode:c++; c-style:k&r; c-basic-offset:4;-*-
//
-// Copyright 2010-2021, Julian Catchen <jcatchen at illinois.edu>
+// Copyright 2010-2022, Julian Catchen <jcatchen at illinois.edu>
//
// This file is part of Stacks.
//
@@ -244,9 +244,10 @@ try{
}
//
- // Report how many reads were used.
+ // Report how many loci were built, how many reads were used, and final coverage.
//
- cerr << "Final coverage: ";
+ cerr << "Final number of stacks built: " << merged.size() << "\n"
+ << "Final coverage: ";
report_cov();
cerr << "\n";
@@ -259,7 +260,7 @@ try{
//
// Write output files.
//
- cerr << "Writing tags, SNPs, and alleles files...\n";
+ cerr << "Writing tags, SNPs, and alleles files to '" << prefix_path << "'...";
write_results(merged, unique, remainders);
cerr << "done.\n";
@@ -2753,8 +2754,6 @@ load_seq_ids(vector<char *> &seq_ids)
else if (in_file_type == FileT::gzfastq)
fh = new GzFastq(in_file.c_str());
- cerr << "Refetching read IDs...";
-
char *id;
Seq c;
c.id = new char[id_len];
View it on GitLab: https://salsa.debian.org/med-team/stacks/-/compare/3e0a1f35bd7d5ab2033ea7b3f9b51e9ffdd5eba8...3a737b16f64920e7f9c47ed9fa4fbd190a90231c