[med-svn] [Git][med-team/stacks][master] 5 commits: routine-update: New upstream version
Lance Lin (@linqigang)
gitlab at salsa.debian.org
Sat Aug 10 13:22:21 BST 2024
Lance Lin pushed to branch master at Debian Med / stacks
Commits:
e8b8c9b8 by Lance Lin at 2024-08-10T18:43:14+07:00
routine-update: New upstream version
- - - - -
fded9fa1 by Lance Lin at 2024-08-10T18:43:35+07:00
New upstream version 2.67+dfsg
- - - - -
6b5ada8f by Lance Lin at 2024-08-10T18:43:36+07:00
Update upstream source from tag 'upstream/2.67+dfsg'
Update to upstream version '2.67+dfsg'
with Debian dir 6d5ea5b09f39d5e60be37ee615cb787bd5a03220
- - - - -
014a80f7 by Lance Lin at 2024-08-10T18:43:36+07:00
routine-update: Standards-Version: 4.7.0
- - - - -
eb484bbc by Lance Lin at 2024-08-10T19:12:14+07:00
d/patches: Refresh patches (change_program_exit_to_success.patch, fix_spelling_errors.patch, script-exe-paths)
- - - - -
27 changed files:
- ChangeLog
- Makefile.am
- configure.ac
- debian/changelog
- debian/control
- debian/patches/change_program_exit_to_success.patch
- debian/patches/fix_spelling_errors.patch
- debian/patches/script-exe-paths
- scripts/denovo_map.pl
- scripts/stacks-dist-extract
- + scripts/stacks-private-alleles
- src/Seq.h
- src/clean.cc
- src/clean.h
- src/clone_filter.cc
- src/constants.h
- src/cstacks.cc
- src/file_io.cc
- src/file_io.h
- src/populations.cc
- src/populations.h
- src/process_radtags.cc
- src/process_radtags.h
- src/process_shortreads.cc
- src/sstacks.cc
- src/ustacks.cc
- src/write.cc
Changes:
=====================================
ChangeLog
=====================================
@@ -1,3 +1,31 @@
+Stacks 2.67 - July 18, 2024
+---------------------------
+ Feature: Added the --min-gt-depth filter to populations. Requires a called SNP to be supported
+ by a minimum number of reads, otherwise it is marked as missing data.
+ Feature: Added processing of UMI field to process_radtags. Raw Illumina data may include
+ an extra field in the FASTQ header which represents a unique molecular
+ identifier (UMI).
+ Feature: Added detection for runs of G nucleotides to process_radtags. In recent two-color
+ chemistries from Illumina, 'no signal' can be mistaken for high quality runs of 'G'.
+ This feature detects high quality of runs of G nucleotides at the 3' end of the read
+ and drops reads containing this loss of signal (if there are 10 or more Gs present).
+ Feature: Added the --basename option to process_radtags so a user can sepcify an output
+ filename when processing individual input files (i.e. -f, or -1/-2). (Useful,
+ e.g., when processing individual samples from SRA.)
+ Bugfix: Corrected a bug in process_radtags when run with multiple threads: if more than one
+ barcode mapped to the same output file it could crash due to a lack of thread
+ synchronization. Code now assigns all barcodes pointing to the same output files to
+ the same output thread.
+ Bugfix: Corrected a bug in filtering adapter sequence in process_radtags that occurred
+ when a barcode was part of the adaptor sequence, but near the end of that adaptor
+ sequence, it could cause a read to be inadvertantly discarded.
+ Bugfix: Corrected a regression in process_radtags where FASTQ custom
+ headers (e.g. from the Sequence Read Archive) were not properly parsed.
+ Also added a check to ensure "/1" and/or "/2" are removed from the header
+ if present so as not to be duplciated after processing.
+ Bugfix: Corrected stacks-private-alleles to work properly with denovo
+ data.
+
Stacks 2.66 - December 5, 2023
------------------------------
Feature: Rewrote stacks-dist-extract in Python including new support for partial
=====================================
Makefile.am
=====================================
@@ -86,7 +86,8 @@ kmer_filter_LDADD = $(LDADD) libclean.a
populations_LDADD = $(LDADD) libpop.a
dist_bin_SCRIPTS = scripts/denovo_map.pl scripts/ref_map.pl \
- scripts/stacks-integrate-alignments scripts/stacks-dist-extract scripts/stacks-gdb \
+ scripts/stacks-integrate-alignments scripts/stacks-dist-extract scripts/stacks-private-alleles \
+ scripts/stacks-gdb \
scripts/stacks-samtools-tview scripts/stacks-count-reads-per-sample-per-locus \
scripts/stacks-hist2d-loci-samples-coverage # scripts/denovo_map.py
=====================================
configure.ac
=====================================
@@ -2,7 +2,7 @@
# Process this file with autoconf to produce a configure script.
AC_PREREQ(2.59)
-AC_INIT([Stacks], [2.66])
+AC_INIT([Stacks], [2.67])
AC_CONFIG_AUX_DIR([config])
AM_INIT_AUTOMAKE([-Wall -Werror foreign parallel-tests subdir-objects])
AC_CONFIG_SRCDIR([src/ustacks.cc])
=====================================
debian/changelog
=====================================
@@ -1,3 +1,13 @@
+stacks (2.67+dfsg-1) UNRELEASED; urgency=medium
+
+ * Team upload.
+ * New upstream version
+ * Standards-Version: 4.7.0 (routine-update)
+ * d/patches: Refresh patches (change_program_exit_to_success.patch,
+ fix_spelling_errors.patch, script-exe-paths)
+
+ -- Lance Lin <lq27267 at gmail.com> Sat, 10 Aug 2024 18:43:14 +0700
+
stacks (2.66+dfsg-1) unstable; urgency=medium
* Team upload.
=====================================
debian/control
=====================================
@@ -10,7 +10,7 @@ Build-Depends: debhelper-compat (= 13),
libsparsehash-dev,
samtools,
libhts-dev
-Standards-Version: 4.6.2
+Standards-Version: 4.7.0
Vcs-Browser: https://salsa.debian.org/med-team/stacks
Vcs-Git: https://salsa.debian.org/med-team/stacks.git
Homepage: https://creskolab.uoregon.edu/stacks/
=====================================
debian/patches/change_program_exit_to_success.patch
=====================================
@@ -40,7 +40,7 @@ Last-Update: 2024-01-19
done
--- a/src/sstacks.cc
+++ b/src/sstacks.cc
-@@ -1428,13 +1428,13 @@
+@@ -1429,13 +1429,13 @@
}
void version() {
@@ -54,10 +54,10 @@ Last-Update: 2024-01-19
void help() {
- cerr << "sstacks " << VERSION << "\n"
+ cout << "sstacks " << VERSION << "\n"
- << "sstacks -P dir -M popmap [-p n_threads]" << "\n"
- << "sstacks -c catalog_path -s sample_path [-s sample_path ...] -o path [-p n_threads]" << "\n"
+ << "sstacks -P dir -M popmap [-t n_threads]" << "\n"
+ << "sstacks -c catalog_path -s sample_path [-s sample_path ...] -o path [-t n_threads]" << "\n"
<< " -P,--in-path: path to the directory containing Stacks files.\n"
-@@ -1450,10 +1450,10 @@
+@@ -1451,10 +1451,10 @@
;
#ifdef DEBUG
@@ -72,7 +72,7 @@ Last-Update: 2024-01-19
}
--- a/src/process_radtags.cc
+++ b/src/process_radtags.cc
-@@ -2462,13 +2462,13 @@
+@@ -2564,13 +2564,13 @@
}
void version() {
@@ -89,7 +89,7 @@ Last-Update: 2024-01-19
<< "process_radtags -p in_dir [-P] [-b barcode_file] -o out_dir -e enz [--threads num] [-c] [-q] [-r] [-t len]\n"
<< "process_radtags -f in_file [-b barcode_file] -o out_dir -e enz [--threads num] [-c] [-q] [-r] [-t len]\n"
<< "process_radtags --in-path in_dir [--paired] [--barcodes barcode_file] --out-path out_dir --renz-1 enz [--renz-2 enz] [--threads num] [-c] [-q] [-r] [-t len]\n"
-@@ -2510,20 +2510,20 @@
+@@ -2611,20 +2611,20 @@
uint cnt = renz_cnt.size();
it = renz_cnt.begin();
for (uint i = 1; i <= cnt; i++) {
@@ -116,7 +116,7 @@ Last-Update: 2024-01-19
<< " Protocol-specific options:\n"
<< " --bestrad: library was generated using BestRAD, check for restriction enzyme on either read and potentially tranpose reads.\n"
<< " --methylrad: library was generated using enzymatic methyl-seq (EM-seq) or bisulphite sequencing.\n\n"
-@@ -2544,5 +2544,5 @@
+@@ -2649,5 +2649,5 @@
<< " --barcode-dist-1: the number of allowed mismatches when rescuing single-end barcodes (default 1).\n"
<< " --barcode-dist-2: the number of allowed mismatches when rescuing paired-end barcodes (defaults to --barcode-dist-1).\n";
@@ -175,7 +175,7 @@ Last-Update: 2024-01-19
}
--- a/src/clone_filter.cc
+++ b/src/clone_filter.cc
-@@ -1309,13 +1309,13 @@
+@@ -1311,13 +1311,13 @@
}
void version() {
@@ -192,7 +192,7 @@ Last-Update: 2024-01-19
<< "clone_filter [-f in_file | -p in_dir [-P] [-I] | -1 pair_1 -2 pair_2] -o out_dir [-i type] [-y type] [-D] [-h]\n"
<< " f: path to the input file if processing single-end sequences.\n"
<< " p: path to a directory of files.\n"
-@@ -1340,5 +1340,5 @@
+@@ -1342,5 +1342,5 @@
<< " --inline-index: random oligo is inline with sequence on single-end read and second oligo occurs in FASTQ header.\n"
<< " --index-inline: random oligo occurs in FASTQ header (Illumina i5 or i7 read) and is inline with sequence on single-end read (if single read data) or paired-end read (if paired data).\n\n";
@@ -201,7 +201,7 @@ Last-Update: 2024-01-19
}
--- a/src/cstacks.cc
+++ b/src/cstacks.cc
-@@ -1998,13 +1998,13 @@
+@@ -1999,13 +1999,13 @@
}
void version() {
@@ -215,10 +215,10 @@ Last-Update: 2024-01-19
void help() {
- cerr << "cstacks " << VERSION << "\n"
+ cout << "cstacks " << VERSION << "\n"
- << "cstacks -P in_dir -M popmap [-n num_mismatches] [-p num_threads]" << "\n"
- << "cstacks -s sample1_path [-s sample2_path ...] -o path [-n num_mismatches] [-p num_threads]" << "\n"
+ << "cstacks -P in_dir -M popmap [-n num_mismatches] [-t num_threads]" << "\n"
+ << "cstacks -s sample1_path [-s sample2_path ...] -o path [-n num_mismatches] [-t num_threads]" << "\n"
<< "\n"
-@@ -2025,5 +2025,5 @@
+@@ -2026,5 +2026,5 @@
<< " --k-len <len>: specify k-mer size for matching between between catalog loci (automatically calculated by default).\n"
<< " --report-mmatches: report query loci that match more than one catalog locus.\n";
@@ -253,7 +253,7 @@ Last-Update: 2024-01-19
}
--- a/src/populations.cc
+++ b/src/populations.cc
-@@ -4347,5 +4347,5 @@
+@@ -4374,5 +4374,5 @@
#endif
;
@@ -262,7 +262,7 @@ Last-Update: 2024-01-19
}
--- a/src/process_shortreads.cc
+++ b/src/process_shortreads.cc
-@@ -1117,13 +1117,13 @@
+@@ -1119,13 +1119,13 @@
}
void version() {
@@ -279,7 +279,7 @@ Last-Update: 2024-01-19
<< "process_shortreads [-f in_file | -p in_dir [-P] [-I] | -1 pair_1 -2 pair_2] -b barcode_file -o out_dir [-i type] [-y type] [-c] [-q] [-r] [-E encoding] [-t len] [-D] [-w size] [-s lim] [-h]\n"
<< " f: path to the input file if processing single-end seqeunces.\n"
<< " i: input file type, either 'bustard' for the Illumina BUSTARD format, 'bam', 'fastq' (default), or 'gzfastq' for gzipped FASTQ.\n"
-@@ -1167,5 +1167,5 @@
+@@ -1169,5 +1169,5 @@
<< " --mate-pair: raw reads are circularized mate-pair data, first read will be reverse complemented.\n"
<< " --no-overhang: data does not contain an overhang nucleotide between barcode and seqeunce.\n";
@@ -288,7 +288,7 @@ Last-Update: 2024-01-19
}
--- a/src/ustacks.cc
+++ b/src/ustacks.cc
-@@ -3017,13 +3017,13 @@
+@@ -3013,13 +3013,13 @@
}
void version() {
@@ -302,10 +302,10 @@ Last-Update: 2024-01-19
void help() {
- cerr << "ustacks " << VERSION << "\n"
+ cout << "ustacks " << VERSION << "\n"
- << "ustacks -f file_path -o path [-M max_dist] [-m min_cov] [-p num_threads]" << "\n"
- << " f: input file path.\n"
- << " o: output path to write results.\n"
-@@ -3059,5 +3059,5 @@
+ << "ustacks -f in_path -o out_path [-M max_dist] [-m min_reads] [-t num_threads]" << "\n"
+ << " -f,--file: input file path.\n"
+ << " -o,--out-path: output path to write results.\n"
+@@ -3055,5 +3055,5 @@
<< "\n"
<< " h: display this help message.\n";
=====================================
debian/patches/fix_spelling_errors.patch
=====================================
@@ -41,7 +41,7 @@ Date: 2023-08-29
<< " --normalize <depth>: normalize read depth according to k-mer coverage.\n\n"
--- a/src/process_shortreads.cc
+++ b/src/process_shortreads.cc
-@@ -681,7 +681,7 @@
+@@ -683,7 +683,7 @@
return 0;
}
@@ -50,7 +50,7 @@ Date: 2023-08-29
init_log(log, argc, argv);
-@@ -1143,7 +1143,7 @@
+@@ -1145,7 +1145,7 @@
<< " D: capture discarded reads to a file.\n"
<< " w: set the size of the sliding window as a fraction of the read length, between 0 and 1 (default 0.15).\n"
<< " s: set the score limit. If the average score within the sliding window drops below this value, the read is discarded (default 10).\n"
@@ -72,7 +72,7 @@ Date: 2023-08-29
qloc = i->second;
--- a/src/clone_filter.cc
+++ b/src/clone_filter.cc
-@@ -1326,7 +1326,7 @@
+@@ -1328,7 +1328,7 @@
<< " o: path to output the processed files.\n"
<< " y: output type, either 'fastq', 'fasta', 'gzfasta', or 'gzfastq' (default same as input type).\n"
<< " D: capture discarded reads to a file.\n"
@@ -111,7 +111,7 @@ Date: 2023-08-29
<< " --dprime-threshold <val>: if D' values fall above <val>, set the D' to 1, otherwise set D' to 0.\n\n"
--- a/src/populations.cc
+++ b/src/populations.cc
-@@ -704,7 +704,7 @@
+@@ -710,7 +710,7 @@
do {
int rv = this->_fasta_reader.next_seq(seq);
if (rv == 0) {
@@ -120,7 +120,7 @@ Date: 2023-08-29
<< rv << "; cloc_id: " << cloc_id << "\n";
throw exception();
}
-@@ -4233,7 +4233,7 @@
+@@ -4260,7 +4260,7 @@
}
if (merge_sites == true && enz.length() == 0) {
@@ -129,7 +129,7 @@ Date: 2023-08-29
help();
}
-@@ -4260,7 +4260,7 @@
+@@ -4287,7 +4287,7 @@
<< "populations -P dir [-O dir] [-M popmap] (filters) [--fstats] [-k [--sigma=150000] [--bootstrap [-N 100]]] (output formats)\n"
<< "populations -V vcf -O dir [-M popmap] (filters) [--fstats] [-k [--sigma=150000] [--bootstrap [-N 100]]] (output formats)\n"
<< "\n"
@@ -138,7 +138,7 @@ Date: 2023-08-29
<< " -V,--in-vcf: path to a standalone input VCF file.\n"
<< " -O,--out-path: path to a directory where to write the output files. (Required by -V; otherwise defaults to value of -P.)\n"
<< " -M,--popmap: path to a population map. (Format is 'SAMPLE1 \\t POP1 \\n SAMPLE2 ...'.)\n"
-@@ -4335,7 +4335,7 @@
+@@ -4363,7 +4363,7 @@
<< " --map-format: mapping program format to write, 'joinmap', 'onemap', and 'rqtl' are currently supported.\n"
<< "\n"
<< "Additional options:\n"
@@ -169,7 +169,7 @@ Date: 2023-08-29
"Reference-based mode:\n"
--- a/src/ustacks.cc
+++ b/src/ustacks.cc
-@@ -2985,7 +2985,7 @@
+@@ -2981,7 +2981,7 @@
if (in_file_type == FileT::unknown) {
in_file_type = guess_file_type(in_file);
if (in_file_type == FileT::unknown) {
@@ -178,8 +178,8 @@ Date: 2023-08-29
help();
}
}
-@@ -3037,7 +3037,7 @@
- << " H: disable calling haplotypes from secondary reads.\n"
+@@ -3033,7 +3033,7 @@
+ << " -H: disable calling haplotypes from secondary reads.\n"
<< "\n"
<< " Stack assembly options:\n"
- << " --force-diff-len: allow raw input reads of different lengths, e.g. after trimming (default: ustacks perfers raw input reads of uniform length).\n"
@@ -187,7 +187,7 @@ Date: 2023-08-29
<< " --keep-high-cov: disable the algorithm that removes highly-repetitive stacks and nearby errors.\n"
<< " --high-cov-thres: highly-repetitive stacks threshold, in standard deviation units (default: 3.0).\n"
<< " --max-locus-stacks <num>: maximum number of stacks at a single de novo locus (default 3).\n"
-@@ -3057,7 +3057,7 @@
+@@ -3053,7 +3053,7 @@
<< " For the Fixed model:\n"
<< " --bc-err-freq <num>: specify the barcode error frequency, between 0 and 1.0.\n"
<< "\n"
=====================================
debian/patches/script-exe-paths
=====================================
@@ -9,7 +9,7 @@ Last-Update: 2016-10-21
This patch header follows DEP-3: http://dep.debian.net/deps/dep3/
--- a/Makefile.am
+++ b/Makefile.am
-@@ -104,10 +104,10 @@
+@@ -105,10 +105,10 @@
$(MAKE) all "CXXFLAGS=-g3 -Wall -DDEBUG -Og"
install-data-hook:
=====================================
scripts/denovo_map.pl
=====================================
@@ -163,7 +163,7 @@ sub execute_stacks {
next;
}
- $cmd = $exe_path . "ustacks -t $sample->{'fmt'} -f $sample->{'path'} -o $out_path" . $minc;
+ $cmd = $exe_path . "ustacks --in-type $sample->{'fmt'} --file $sample->{'path'} --out-path $out_path" . $minc;
if ($sample->{'path'} !~ /$sample->{'file'}.$sample->{'suffix'}$/) {
# Guessing the sample name from the input path won't work.
@@ -808,12 +808,12 @@ sub parse_command_line {
elsif ($_ =~ /^--catalog-popmap$/) { $cpopmap_path = shift @ARGV; }
elsif ($_ =~ /^-T$/ || $_ =~ /^--threads/) {
$arg = shift @ARGV;
- push(@_ustacks, "-p " . $arg);
- push(@_cstacks, "-p " . $arg);
- push(@_sstacks, "-p " . $arg);
- push(@_tsv2bam, "-t " . $arg);
- push(@_gstacks, "-t " . $arg);
- push(@_populations, "-t " . $arg);
+ push(@_ustacks, "--threads " . $arg);
+ push(@_cstacks, "--threads " . $arg);
+ push(@_sstacks, "--threads " . $arg);
+ push(@_tsv2bam, "--threads " . $arg);
+ push(@_gstacks, "--threads " . $arg);
+ push(@_populations, "--threads " . $arg);
} elsif ($_ =~ /^-M$/) {
$ustacks_mismatch = shift(@ARGV);
=====================================
scripts/stacks-dist-extract
=====================================
@@ -133,13 +133,16 @@ def find_section(sections, section):
def pretty_print(stream):
colcnts = []
for line in stream:
+ if line[0] == '#':
+ continue
+
cols = line.split('\t')
while len(colcnts) < len(cols):
colcnts.append(0)
-
+
for i in range(len(cols)):
- if cols[i][0] != "#" and colcnts[i] < len(cols[i]):
+ if colcnts[i] < len(cols[i]):
colcnts[i] = len(cols[i])
s = ""
=====================================
scripts/stacks-private-alleles
=====================================
@@ -0,0 +1,1150 @@
+#!/usr/bin/env python3
+#
+# Copyright 2021-2024, Julian Catchen <jcatchen at illinois.edu>
+#
+# This file is part of Stacks.
+#
+# Stacks is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# Stacks is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with Stacks. If not, see <http://www.gnu.org/licenses/>.
+#
+
+import argparse
+import sys
+import os
+import math
+from datetime import datetime
+from enum import Enum
+from operator import itemgetter, attrgetter
+
+#
+# Constant values
+#
+SUMSTATS_COLS = 21
+VCF_SAMP_STRT = 9
+
+#
+# Global configuration variables.
+#
+sumstats_path = ""
+vcf_path = ""
+popmap_path = ""
+popmap_used = False
+out_path = ""
+ref_aligned = True
+ctg_limit = 500000
+polarize = False
+plots = False
+smooth = False
+sigma = 150000.0
+verbose = False
+
+class Strand(Enum):
+ plus = 0
+ minus = 1
+
+class SmoothStat:
+ def __init__(self, loc_id, pop, col, chr, bp, allele_cnt, stat):
+ self.id = loc_id
+ self.pop = pop
+ self.col = col
+ self.chr = chr
+ self.bp = bp
+ self.cnt = allele_cnt
+ self.stat = stat
+ self.smoothed = 0.0
+
+class Pop:
+ def __init__(self, pop, p, q, cnt, priv):
+ self.pop = pop
+ self.p_freq = p
+ self.q_freq = q
+ self.cnt = cnt
+ self.private = priv
+ self.priv_all = None # If there is a private allele, which one is it?
+
+ def __str__(self):
+ s = "{} - p freq: {:.3f}; q freq: {:.3f}; sample cnt: {}; private? {}".format(
+ self.pop, self.p_freq, self.q_freq, self.cnt, self.private)
+ if self.private == True:
+ s += " [allele: '{}']".format(self.priv_all)
+
+ return s
+
+class SNP:
+ def __init__(self):
+ self.bp = 0
+ self.col = 0
+ self.p_allele = ""
+ self.q_allele = ""
+ self.pops = {} # Includes parental populations and 'vcf' for all VCF samples.
+ self.indvpops = {} # Indexed by population name; includes individual populations from VCF.
+ self.priv_pop = False # True if there is a private allele in one or more of the populations at this site.
+ self.priv_hyb = False # Is the private allele present in the hybrids?
+
+ def __str__(self):
+ s = "Column {}; P allele: '{}', Q allele '{}', private? {}".format(self.col, self.p_allele, self.q_allele, self.priv_pop)
+ return s
+
+class Locus:
+ def __init__(self, id):
+ self.id = id
+ self.chr = None
+ self.bp = 0
+ self.snps = {} # Indexed by column.
+ self.private = False # At least one SNP at this locus contains a private allele.
+ def __str__(self):
+ s = "Locus {}".format(self.id)
+ if self.chr != None:
+ s += " [{}:{}:{}]".format(self.chr, self.bp, self.strand)
+ s +="; {} snp(s)\n".format(len(self.snps))
+ for snp in self.snps:
+ s += " "
+ s += self.snps[snp].__str__() + "\n"
+ for pop in self.snps[snp].pops:
+ s += " "
+ s += self.snps[snp].pops[pop].__str__() + "\n"
+ return s
+
+
+def parse_command_line():
+ global sumstats_path
+ global vcf_path
+ global popmap_path, popmap_used
+ global out_path
+ global polarize
+ global plots, smooth, sigma
+ global verbose
+
+ desc = '''Supply the populations.sumstats.tsv file containing the 'parental' populations \
+ which were used to define private alleles. Supply a VCF file from the 'hybrid' \
+ populations which may contain those private alleles. Optionally export a set of VCF \
+ files, one per parental population, that list SNPs with the non-private allele as \
+ 'REF' and the private allele as 'ALT' for plotting.'''
+
+ p = argparse.ArgumentParser(description=desc)
+
+ #
+ # Add options.
+ #
+ p.add_argument("--sumstats", type=str, metavar="path", required=True,
+ help="path to parental populations sumstats file.")
+ p.add_argument("--vcf", type=str, metavar="path", required=True,
+ help="path to hybrids VCF file.")
+ p.add_argument("--popmap", type=str, metavar="path",
+ help="path to a population map used to generate the VCF file.")
+ p.add_argument("--polarize", action="store_true",
+ help="Export a set of VCF files, one per parent population, with private alleles polarized.")
+ p.add_argument("--plots", action="store_true",
+ help="Write a file for plotting private allele frequencies along the genome.")
+ p.add_argument("--smooth", action="store_true",
+ help="Smooth private allele values in plotting files, if data are referenced aligned.")
+ p.add_argument("--sigma", type=float, metavar="size",
+ help="Control the smoothing. Default 75000bp, increase to smooth more, decrease to smooth less.")
+ p.add_argument("-o", "--out-path", type=str, metavar="prefix", required=True,
+ help="prefix path for writing output files.")
+ p.add_argument("--verbose", action="store_true", dest="verbose",
+ help="Ouput a summary of each locus.")
+
+ #
+ # Parse the command line
+ #
+ args = p.parse_args()
+
+ if args.vcf != None:
+ vcf_path = args.vcf
+ if args.popmap != None:
+ popmap_path = args.popmap
+ popmap_used = True
+ if args.sumstats != None:
+ sumstats_path = args.sumstats
+ if args.out_path != None:
+ out_path = args.out_path
+ if args.verbose != None:
+ verbose = args.verbose
+ if args.polarize != None:
+ polarize = args.polarize
+ if args.plots != None:
+ plots = args.plots
+ if args.smooth != None:
+ smooth = args.smooth
+ if args.sigma != None:
+ sigma = args.sigma
+
+ if len(sumstats_path) == 0:
+ print >> sys.stderr, "You must specify the path to the sumstats file."
+ p.print_help()
+ sys.exit()
+ if len(vcf_path) == 0:
+ print >> sys.stderr, "You must specify the path to the hybrids VCF file."
+ p.print_help()
+ sys.exit()
+ if len(out_path) == 0:
+ print >> sys.stderr, "You must specify a path prefix to output files."
+ p.print_help()
+ sys.exit()
+
+
+def complement(nuc):
+ if nuc == 'A':
+ return 'T'
+ elif nuc == 'C':
+ return 'G'
+ elif nuc == 'G':
+ return 'C'
+ elif nuc == 'T':
+ return 'A'
+ elif nuc == 'a':
+ return 't'
+ elif nuc == 'c':
+ return 'g'
+ elif nuc == 'g':
+ return 'c'
+ elif nuc == 't':
+ return 'a'
+
+
+def parse_popmap_file(path, vcfpops, vcfpops_rev):
+ fh = open(path, "r")
+
+ lineno = 0
+ for line in fh:
+ lineno += 1
+ line = line.strip("\n")
+
+ if line[0] == "#":
+ continue
+
+ parts = line.split("\t")
+ if len(parts) < 2:
+ print("Error parsing popmap '{}' at line {}".format(path, lineno), file=sys.stderr)
+ exit()
+
+ if parts[1] not in vcfpops:
+ vcfpops[parts[1]] = []
+ vcfpops[parts[1]].append(parts[0])
+ vcfpops_rev[parts[0]] = parts[1]
+
+ print(" Found {} samples in {} populations.".format(len(vcfpops_rev), len(vcfpops)), file=sys.stderr)
+ return
+
+
+def parse_sumstats_file(path, loci, populations_key):
+ global ref_aligned, ctg_limit, smooth, plots
+
+ fh = open(path, "r")
+
+ #
+ # Parse the sumstats header and extract the population names/sample names
+ #
+ lineno = 0
+ for line in fh:
+ lineno += 1
+ line = line.strip("\n")
+
+ if line[0] != "#" or line[0:10] == "# Locus ID":
+ break
+
+ parts = line.split("\t")
+ pop_name = parts[0][2:]
+ populations_key[pop_name] = parts[1].split(",")
+
+ #
+ # Parse SNP records.
+ #
+ for line in fh:
+ lineno += 1
+ line = line.strip("\n")
+
+ if line[0] == "#":
+ continue
+
+ parts = line.split("\t")
+
+ if len(parts) != SUMSTATS_COLS:
+ print("Error parsing '{}' on line {}, incorrect number of columns ('{}'), file not in correct format.".format(
+ path, lineno, len(parts)), file=sys.stderr)
+ exit()
+
+ locus_id = int(parts[0])
+ column = int(parts[3])
+ popul = parts[4]
+ p_allele = parts[5]
+ q_allele = parts[6]
+ all_cnt = int(parts[7]) * 2
+ p_freq = float(parts[8])
+ q_freq = 1 - p_freq
+ private = True if parts[20] == "1" else False
+
+ pop = Pop(popul, p_freq, q_freq, all_cnt, private)
+
+ #
+ # Have we already seen this locus and SNP?
+ #
+ if locus_id in loci:
+ loc = loci[locus_id]
+ else:
+ loc = Locus(locus_id)
+ loci[locus_id] = loc
+
+ #if parts[1] != "un":
+ loc.chr = parts[1]
+ loc.bp = int(parts[2])
+
+ if private == True:
+ loc.private = True
+
+ if column in loc.snps:
+ snp = loc.snps[column]
+ else:
+ snp = SNP()
+ snp.bp = int(parts[2])
+ snp.col = column
+ loc.snps[column] = snp
+
+ if p_allele != "-":
+ snp.p_allele = p_allele
+ if q_allele != "-":
+ snp.q_allele = q_allele
+ if private:
+ snp.priv_pop = True
+
+ snp.pops[popul] = pop
+
+ fh.close()
+
+ snp_cnt = 0
+ for l in loci:
+ loc = loci[l]
+ snp_cnt += len(loc.snps)
+ #
+ # Find which allele, or alleles, was/were the private one.
+ # In most cases, a single allele will be private between the parent populations.
+ # However, there can be alternative, private alleles where each parent population
+ # has a unique allele.
+ #
+ if loc.private == False:
+ continue
+
+ for col in loc.snps:
+ snp = loc.snps[col]
+ alleles = {snp.p_allele: 0, snp.q_allele: 0}
+
+ for p in snp.pops:
+ pop = snp.pops[p]
+ if pop.p_freq > 0:
+ alleles[snp.p_allele] += 1
+ if pop.q_freq > 0:
+ alleles[snp.q_allele] += 1
+
+ for p in snp.pops:
+ pop = snp.pops[p]
+ if pop.private == True:
+ if alleles[snp.p_allele] == 1 and pop.p_freq > 0:
+ pop.priv_all = snp.p_allele
+ elif alleles[snp.q_allele] == 1 and pop.q_freq > 0:
+ pop.priv_all = snp.q_allele
+
+ print(" Read {} loci containing {} SNPs.".format(len(loci), snp_cnt), file=sys.stderr)
+
+ l = next(iter(loci))
+ if loci[l].chr == "un" or loci[l].chr == None:
+ ref_aligned = False
+ ctg_limit = 1
+ if smooth == True or plots == True:
+ smooth = False
+ plots = False
+ print("Warning: unable to generate plots or smooth them without reference-aligned data.", file=sys.stderr)
+
+
+def parse_hybrid_vcf(path, loci, populations_key, vcfpops, vcfpops_rev, chrs_list):
+ global ref_aligned
+
+ fh = open(path, "r")
+
+ not_found = 0
+ snp_cnt = 0
+ locus_cnt = set()
+
+ #
+ # Parse the chromosome names and lengths from the VCF header, if available.
+ #
+ lineno = 0
+ parts = []
+ for line in fh:
+ lineno += 1
+ line = line.strip("\n")
+
+ #
+ # Skip the file definition lines, but record the contig lengths.
+ #
+ if line[0:8] == "##contig":
+ line = line[13:-1]
+ chr, chrlen = line.split(",")
+ chrlen = chrlen[7:]
+ chrs_list.append( (chr, int(chrlen)) )
+
+ if line[0:6] == "#CHROM":
+ parts = line.split("\t")
+ break
+
+ if ref_aligned == False:
+ chrs_list.append( ("un", 1) )
+
+ #
+ # List of sample names, to be taken from the VCF header.
+ #
+ populations_key['vcf'] = []
+ sample_cols = {}
+ pop_cols = {}
+
+ if popmap_used == True:
+ for p in vcfpops:
+ pop_cols[p] = []
+
+ #
+ # Parse the sample names from the VCF header and record which columns the samples from
+ # each population occur in.
+ #
+ popmap_samples_found = 0
+ for i in range(9, len(parts)):
+ populations_key['vcf'].append(parts[i])
+
+ if popmap_used and parts[i] in vcfpops_rev:
+ popmap_samples_found += 1
+ sample_cols[i] = vcfpops_rev[parts[i]]
+ pop_cols[vcfpops_rev[parts[i]]].append(i)
+
+ if popmap_used == True:
+ print(" Found {} samples from population map in VCF file.".format(popmap_samples_found), file=sys.stderr)
+
+ for line in fh:
+ lineno += 1
+ line = line.strip("\n")
+
+ parts = line.split("\t")
+
+ #
+ # If referenced aligned, column 3 contains the locus ID, column of the SNP within the
+ # locus and the strand, e.g. '152:79:-'
+ # Otherwise, if contains just the locus ID and column of the SNP.
+ #
+ if ref_aligned == True:
+ locus_id, column, strand = parts[2].split(":")
+ else:
+ locus_id, column = parts[2].split(":")
+ strand = "+"
+ locus_id = int(locus_id)
+ column = int(column)
+ strand = Strand.minus if strand == "-" else Strand.plus
+
+ if locus_id not in loci:
+ not_found += 1
+ continue
+ if column not in loci[locus_id].snps:
+ not_found += 1
+ continue
+
+ locus_cnt.add(locus_id)
+ snp_cnt += 1
+
+ snp = loci[locus_id].snps[column]
+
+ vcfpop = Pop("vcf", 0.0, 0.0, 0, False)
+
+ if popmap_used:
+ indv_pops = {}
+ for p in pop_cols:
+ indv_pops[p] = Pop(p, 0.0, 0.0, 0, False)
+
+ ref_allele = parts[3]
+ alt_allele = parts[4]
+
+ if strand == Strand.minus:
+ ref_allele = complement(ref_allele)
+ alt_allele = complement(alt_allele)
+
+ if ((ref_allele != snp.p_allele and ref_allele != snp.q_allele) or
+ (alt_allele != snp.p_allele and alt_allele != snp.q_allele)):
+ print("Error parsing VCF file on line {}; ref/alt alleles {}/{} do not match sumstats file ({}/{}).\n".format(
+ lineno, ref_allele, alt_allele, snp.p_allele, snp.q_allele), file=sys.stderr)
+ exit(1)
+
+ genotypes = []
+
+ for i in range(VCF_SAMP_STRT, len(parts)):
+ fields = parts[i].split(",")
+ gtype = fields[0].split(":")[0]
+
+ if gtype == "./.":
+ sample_gtype = ( None, None )
+ elif gtype == "0/0":
+ sample_gtype = (ref_allele, ref_allele)
+ elif gtype == "0/1":
+ sample_gtype = (ref_allele, alt_allele)
+ elif gtype == "1/0":
+ sample_gtype = (ref_allele, alt_allele)
+ elif gtype == "1/1":
+ sample_gtype = (alt_allele, alt_allele)
+
+ genotypes.append(sample_gtype)
+
+ for i in range(0, len(genotypes)):
+ gtype = genotypes[i]
+
+ if gtype[0] != None:
+ vcfpop.cnt += 1
+ if gtype[0] == snp.p_allele:
+ vcfpop.p_freq += 1.0
+ else:
+ vcfpop.q_freq += 1.0
+ if gtype[1] != None:
+ vcfpop.cnt += 1
+ if gtype[1] == snp.p_allele:
+ vcfpop.p_freq += 1.0
+ else:
+ vcfpop.q_freq += 1.0
+
+ if popmap_used:
+ indvpop = indv_pops[sample_cols[i + VCF_SAMP_STRT]]
+ if gtype[0] != None:
+ indvpop.cnt += 1
+ if gtype[0] == snp.p_allele:
+ indvpop.p_freq += 1.0
+ else:
+ indvpop.q_freq += 1.0
+ if gtype[1] != None:
+ indvpop.cnt += 1
+ if gtype[1] == snp.p_allele:
+ indvpop.p_freq += 1.0
+ else:
+ indvpop.q_freq += 1.0
+
+ if vcfpop.cnt > 0:
+ vcfpop.p_freq = vcfpop.p_freq / vcfpop.cnt
+ vcfpop.q_freq = vcfpop.q_freq / vcfpop.cnt
+
+ if popmap_used:
+ for p in indv_pops:
+ indvpop = indv_pops[p]
+ if indvpop.cnt > 0:
+ indvpop.p_freq = indvpop.p_freq / indvpop.cnt
+ indvpop.q_freq = indvpop.q_freq / indvpop.cnt
+
+ if snp.priv_pop == True:
+ for p in snp.pops:
+ pop = snp.pops[p]
+ if pop.private == True:
+ if pop.priv_all == snp.p_allele and vcfpop.p_freq > 0:
+ snp.priv_hyb = True
+ elif pop.priv_all == snp.q_allele and vcfpop.q_freq > 0:
+ snp.priv_hyb = True
+
+ snp.pops[vcfpop.pop] = vcfpop
+
+ if popmap_used:
+ for p in indv_pops:
+ snp.indvpops[p] = indv_pops[p]
+
+ fh.close()
+
+ print(" Read {} loci, {} snps, from {} samples.".format(
+ len(locus_cnt), snp_cnt, len(populations_key['vcf'])), file=sys.stderr)
+
+
+def order_loci(chrs_list, loci, ordered_loci):
+ global ref_aligned
+
+ #
+ # Sort loci containing private alleles onto their respective chromosomes.
+ #
+ for loc_id in loci:
+ loc = loci[loc_id]
+
+ if loc.chr == None:
+ continue
+
+ if loc.private == False:
+ continue
+
+ if loc.chr not in ordered_loci:
+ ordered_loci[loc.chr] = []
+ ordered_loci[loc.chr].append(loc)
+
+ for chr in ordered_loci:
+ ordered_loci[chr].sort(key=attrgetter('bp'))
+
+ #
+ # Sort the chromosomes by length
+ #
+ chrs_list.sort(key=lambda x:x[1], reverse=True)
+
+
+def write_private_allele_summary(chrs_list, ordered_loci, populations_key, vcfpops, out_path):
+ datestamp = datetime.now().strftime("%Y%m%d")
+
+ #
+ # Open the output file.
+ #
+ out_fh = open(out_path + "_private_alleles.tsv", 'w')
+ out_fh.write("# Generated by stacks-private-alleles, date {}\n# {}\n".format(datestamp, " ".join(sys.argv)))
+ out_fh.write("# Chr\tBP\tLocus\tColumn\tPrivAllele\tParentPop\tParentFreq\tTotCnt\tPrivCnt\tVCF-Freq\tTotCnt\tPrivCnt")
+
+ if popmap_used:
+ for p in vcfpops:
+ out_fh.write("\t{}-Freq\tTotCnt\tPrivCnt".format(p))
+ out_fh.write("\n")
+ print("\nWriting a summary of private alleles: \n {}\n".format(out_path + "_private_alleles.tsv"), file=sys.stderr)
+
+ tot_snps_per_pop = {}
+ priv_pop = {}
+ priv_hyb = {}
+ chrno = 0
+
+ #
+ # Iterate over chromosomes/scaffolds, longest to shortest.
+ #
+ for chr, chrlen in chrs_list:
+ if chr not in ordered_loci:
+ continue
+ if chrlen < ctg_limit:
+ continue
+
+ chrno += 1
+ if chrno <= 3 or verbose == True:
+ print("{}: {} ({:.2f}Mbp): {} private loci".format(
+ chrno, chr, chrlen/1000000.0, len(ordered_loci[chr])), file=sys.stderr)
+
+ #
+ # Initialize counters.
+ #
+ for p in populations_key:
+ if p == "vcf":
+ continue
+ priv_pop[p] = 0
+ priv_hyb[p] = 0
+ tot_snps_per_pop[p] = 0
+
+ loc_cnt = 0
+ snp_cnt = 0
+ missing_hybrid_cnt = 0
+
+ priv_allele_hybrid_freq = []
+
+ for loc in ordered_loci[chr]:
+ loc_cnt += 1
+ snp_cnt += len(loc.snps)
+ for snpcol in loc.snps:
+ snp = loc.snps[snpcol]
+
+ if "vcf" not in snp.pops:
+ missing_hybrid_cnt += 1
+
+ #
+ # Record total number of SNPs found in each parent population
+ # as well as the number of populations that have private alleles from those SNPs.
+ #
+ for p in snp.pops:
+ if p == "vcf":
+ continue
+
+ tot_snps_per_pop[p] += 1
+
+ if snp.pops[p].private == True:
+ priv_pop[p] += 1
+
+ #
+ # Was this private allele found in the hybrid/vcf population? If so, print it.
+ #
+ if snp.priv_hyb == True:
+
+ for p in snp.pops:
+ pop = snp.pops[p]
+
+ if pop.private == True:
+ priv_hyb[p] += 1
+ if pop.priv_all == snp.p_allele:
+ priv_allele_hybrid_freq.append(snp.pops['vcf'].p_freq)
+ s = "{}\t{}\t{}\t{}\t{}\t{}\t{:.3f}\t{}\t{}\t{:.3f}\t{}\t{}".format(
+ loc.chr, snp.bp, loc.id, snpcol,
+ pop.priv_all, pop.pop, pop.p_freq, pop.cnt,
+ round(pop.p_freq * pop.cnt),
+ snp.pops['vcf'].p_freq, snp.pops['vcf'].cnt,
+ round(snp.pops['vcf'].p_freq * snp.pops['vcf'].cnt))
+ if popmap_used:
+ for p in vcfpops:
+ if p in snp.indvpops:
+ s += "\t{:.3f}\t{}\t{}".format(
+ snp.indvpops[p].p_freq, snp.indvpops[p].cnt,
+ round(snp.indvpops[p].p_freq * snp.indvpops[p].cnt))
+ else:
+ s += "\t{}\t{}\t{}".format('-','-','-','-')
+
+ elif pop.priv_all == snp.q_allele:
+ priv_allele_hybrid_freq.append(snp.pops['vcf'].q_freq)
+ s = "{}\t{}\t{}\t{}\t{}\t{}\t{:.3f}\t{}\t{}\t{:.3f}\t{}\t{}".format(
+ loc.chr, snp.bp, loc.id, snpcol,
+ pop.priv_all, pop.pop, pop.q_freq, pop.cnt,
+ round(pop.q_freq * pop.cnt),
+ snp.pops['vcf'].q_freq, snp.pops['vcf'].cnt,
+ round(snp.pops['vcf'].q_freq * snp.pops['vcf'].cnt))
+ if popmap_used:
+ for p in vcfpops:
+ if p in snp.indvpops:
+ s += "\t{:.3f}\t{}\t{}".format(
+ snp.indvpops[p].q_freq, snp.indvpops[p].cnt,
+ round(snp.indvpops[p].q_freq * snp.indvpops[p].cnt))
+ else:
+ s += "\t{}\t{}\t{}".format('-','-','-','-')
+
+ s += "\n"
+ out_fh.write(s)
+
+ if chrno <= 3 or verbose == True:
+ print("{} loci, including {} SNPs, analyzed; in {} cases SNP present in hybrid/VCF population.".format(
+ loc_cnt, snp_cnt, snp_cnt - missing_hybrid_cnt), file=sys.stderr)
+
+ for p in priv_pop:
+ print(" '{}' had {} SNPs present; private alleles occured within {} of those SNPs ({:.2f}%)".format(
+ p, tot_snps_per_pop[p], priv_pop[p], (priv_pop[p] / float(tot_snps_per_pop[p]) * 100)), file=sys.stderr)
+ print(" Private allele found in {} SNPs of hybrid/VCF population ({:.2f}%)".format(
+ priv_hyb[p], (priv_hyb[p] / float(tot_snps_per_pop[p]) * 100)), file=sys.stderr)
+
+ print(file=sys.stderr)
+
+ print("Specify --verbose for details on more chromosomes.", file=sys.stderr)
+ out_fh.close()
+
+
+def invert_refalt_fields(fields):
+ #
+ # Invert the REF and ALT alleles for this VCF line; alter the downstream genotypes to reflect this change.
+ #
+ # VCF is defined as: CHROM,POS,ID,REF,ALT,QUAL,FILTER,INFO,FORMAT,SAMPLEGENOTYPES...
+ invf = []
+ invf += fields[0:3] # CHROM,POS,ID
+ invf.append(fields[4]) # ALT -> REF
+ invf.append(fields[3]) # REF -> ALT
+ invf += fields[5:9] # QUAL,FILTER,INFO,FORMAT
+
+ for sampleno in range(9, len(fields)):
+ gt = fields[sampleno]
+ if gt[0:3] == "0/0":
+ gt = "1/1" + gt[3:]
+ elif gt[0:3] == "1/1":
+ gt = "0/0" + gt[3:]
+ invf.append(gt)
+
+ return invf
+
+
+def write_polarized_vcfs(chrs_list, loci, populations_key, vcf_path, out_path):
+ datestamp = datetime.now().strftime("%Y%m%d")
+
+ #
+ # Write out a VCF file for each 'parental' population
+ #
+ vcf_fhs = {}
+
+ for pop in populations_key:
+ if pop == 'vcf':
+ continue
+ path = out_path + "_" + pop + ".polarized.snps.vcf"
+ print(" {}".format(path), file=sys.stderr)
+ vcf_fhs[pop] = open(path, "w")
+
+ #
+ # Open the VCF input file again.
+ #
+ in_fh = open(vcf_path, "r")
+
+ lineno = 0
+ parts = []
+ #
+ # Write the VCF header
+ #
+ for line in in_fh:
+ lineno += 1
+
+ if line[0:2] == "##":
+ if line[0:10] == "##fileDate":
+ line = "##fileDate={}\n".format(datestamp)
+ if line[0:8] == "##source":
+ line = "##source={}\n".format("\"stacks-private-alleles\"")
+
+ for pop in vcf_fhs:
+ vcf_fhs[pop].write(line)
+
+ if line[0:6] == "#CHROM":
+ # Write the final line containint the column headings.
+ for pop in vcf_fhs:
+ vcf_fhs[pop].write(line)
+ break
+
+ for line in in_fh:
+ lineno += 1
+ line = line.strip("\n")
+ parts = line.split("\t")
+
+ if ref_aligned == True:
+ locus_id, column, strand = parts[2].split(":")
+ else:
+ locus_id, column = parts[2].split(":")
+ strand = "+"
+ locus_id = int(locus_id)
+ column = int(column)
+ strand = Strand.minus if strand == "-" else Strand.plus
+
+ ref_all = parts[3]
+ alt_all = parts[4]
+
+ # print("Looking at locus {}, snp {}, ref: {}, alt: {}".format(locus_id, column, ref_all, alt_all), file=sys.stderr)
+
+ if locus_id not in loci:
+ continue
+ loc = loci[locus_id]
+
+ if column not in loc.snps:
+ continue
+ snp = loc.snps[column]
+
+ if snp.priv_hyb == False:
+ continue
+
+ for popname in vcf_fhs:
+ #
+ # Does this population possess a private allele at this SNP?
+ #
+ if popname not in snp.pops:
+ continue
+
+ pop = snp.pops[popname]
+ if pop.private == True:
+ #
+ # If so, check if the private allele is the REF allele; if
+ # so, swap the REF/ALT alleles so that private alleles for this
+ # focal population are always ALT alleles.
+ #
+ if pop.priv_all == ref_all:
+ invparts = invert_refalt_fields(parts)
+ vcf_fhs[popname].write("\t".join(invparts) + "\n")
+ else:
+ vcf_fhs[popname].write("\t".join(parts) + "\n")
+
+ for pop in vcf_fhs:
+ vcf_fhs[pop].close()
+
+ in_fh.close()
+
+ return
+
+
+def calc_weights(sigma):
+ #
+ # Calculate weights for window smoothing operations.
+ #
+ limit = 3 * int(sigma)
+
+ weights = [0.0] * (limit + 1)
+ for i in range(0, limit):
+ weights[i] = math.exp((-1 * math.pow(i, 2)) / (2 * math.pow(sigma, 2)))
+
+ return weights
+
+
+def determine_window_limits(sites, limit, center_bp, pos_l, pos_u):
+ limit_l = center_bp - limit if center_bp - limit > 0 else 0
+ limit_u = center_bp + limit
+
+ while pos_l < len(sites):
+ if sites[pos_l].bp < limit_l:
+ pos_l += 1
+ else:
+ break
+
+ while pos_u < len(sites):
+ if sites[pos_u].bp < limit_u:
+ pos_u += 1
+ else:
+ break
+
+ return pos_l, pos_u
+
+
+def smooth_chromosome(sites):
+ #
+ # To generate smooth genome-wide distributions of Fst, we calculate a kernel-smoothing
+ # moving average of Fst values along each ordered chromosome.
+ #
+ # For each genomic region centered on a nucleotide position c, the contribution of the population
+ # genetic statistic at position p to the region average was weighted by the Gaussian function:
+ # exp( (-1 * (p - c)^2) / (2 * sigma^2))
+ #
+ # In addition, we weight each position according to (n_k - 1), where n_k is the number of alleles
+ # sampled at that location.
+ #
+ # By default, sigma = 150Kb, for computational efficiency, only calculate average out to 3sigma.
+ #
+ dist = 0
+ pos_l = 0
+ pos_u = 0
+ sum = 0.0
+ final_weight = 0.0
+ allele_cnt = 0.0
+
+ weights = calc_weights(sigma)
+
+ for pos_c in range(0, len(sites)):
+ c = sites[pos_c]
+
+ sum = 0.0
+
+ pos_l, pos_u = determine_window_limits(sites, int(3*sigma), c.bp, pos_l, pos_u)
+
+ # print("c.bp: {}, pos_c: {}, pos_l: {}, pos_u: {}; sites len: {}".format(c.bp, pos_c, pos_l, pos_u, len(sites)), file=sys.stderr)
+ for pos_p in range(pos_l, pos_u):
+ p = sites[pos_p]
+
+ if p.bp > c.bp:
+ dist = p.bp - c.bp
+ else:
+ dist = c.bp - p.bp
+
+ final_weight = (c.cnt - 1.0) * weights[dist]
+ c.smoothed = c.smoothed + (p.stat * final_weight)
+ sum += final_weight
+
+ c.smoothed = c.smoothed / sum;
+
+ return
+
+
+def write_plot_values(fh, pop_1, pop_2):
+ i = 0
+ j = 0
+ p1_len = len(pop_1)
+ p2_len = len(pop_2)
+
+ while i < p1_len or j < p2_len:
+
+ if j < p2_len:
+ while i < p1_len and pop_1[i].bp <= pop_2[j].bp:
+ if smooth:
+ s = "{}\t{}\t{}\t{:.3f}\t{:.5f}".format(pop_1[i].chr, pop_1[i].bp, pop_1[i].pop, pop_1[i].stat, pop_1[i].smoothed)
+ else:
+ s = "{}\t{}\t{}\t{:.3f}".format(pop_1[i].chr, pop_1[i].bp, pop_1[i].pop, pop_1[i].stat)
+ fh.write(s + "\n")
+ i += 1
+ else:
+ while i < p1_len:
+ if smooth:
+ s = "{}\t{}\t{}\t{:.3f}\t{:.5f}".format(pop_1[i].chr, pop_1[i].bp, pop_1[i].pop, pop_1[i].stat, pop_1[i].smoothed)
+ else:
+ s = "{}\t{}\t{}\t{:.3f}".format(pop_1[i].chr, pop_1[i].bp, pop_1[i].pop, pop_1[i].stat)
+ fh.write(s + "\n")
+ i += 1
+
+ if i < p1_len:
+ while j < p2_len and pop_2[j].bp < pop_1[i].bp:
+ if smooth:
+ s = "{}\t{}\t{}\t{:.3f}\t{:.5f}".format(pop_2[j].chr, pop_2[j].bp, pop_2[j].pop, (-1 * pop_2[j].stat), (-1 * pop_2[j].smoothed))
+ else:
+ s = "{}\t{}\t{}\t{:.3f}".format(pop_2[j].chr, pop_2[j].bp, pop_2[j].pop, (-1 * pop_2[j].stat))
+ fh.write(s + "\n")
+ j += 1
+ else:
+ while j < p2_len:
+ if smooth:
+ s = "{}\t{}\t{}\t{:.3f}\t{:.5f}".format(pop_2[j].chr, pop_2[j].bp, pop_2[j].pop, (-1 * pop_2[j].stat), (-1 * pop_2[j].smoothed))
+ else:
+ s = "{}\t{}\t{}\t{:.3f}".format(pop_2[j].chr, pop_2[j].bp, pop_2[j].pop, (-1 * pop_2[j].stat))
+ fh.write(s + "\n")
+ j += 1
+
+
+def write_plots(chrs_list, ordered_loci, populations_key, vcfpops, out_path):
+ datestamp = datetime.now().strftime("%Y%m%d")
+
+ #
+ # If requested, write a file to plot private allele frequencies in the VCF population.
+ # The y-axis scales from 1 to -1, the x-axis is the genome coordinates
+ # The first population is positive on the y-axis, the second negative;
+ # The frequency of the private allele in the VCF population is the y value.
+ #
+ plot_pops = list(populations_key)
+ plot_fhs = {}
+ plot_fhs['vcf'] = open(out_path + "_plot_by_chromosome-all.tsv", 'w')
+ print(" {}".format(out_path + "_plot_by_chromosome-all.tsv"), file=sys.stderr)
+ plot_fhs['vcf'].write("# Generated by stacks-private-alleles, {}\n# {}\n".format(datestamp, " ".join(sys.argv)))
+ plot_fhs['vcf'].write("# Population '{}': positive values\n# Population '{}': negative values\n".format(plot_pops[0], plot_pops[1]))
+ if smooth:
+ plot_fhs['vcf'].write("# Chr\tBP\tPopulation\tVCF-PrivateAlleleFreq\tVCF-Smoothed\n")
+ else:
+ plot_fhs['vcf'].write("# Chr\tBP\tPopulation\tVCF-PrivateAlleleFreq\n")
+
+ if popmap_used == True:
+ for p in vcfpops:
+ plot_fhs[p] = open(out_path + "_plot_by_chromosome-" + p + ".tsv", 'w')
+ print(" {}".format(out_path + "_plot_by_chromosome-" + p +".tsv"), file=sys.stderr)
+ plot_fhs[p].write("# Generated by stacks-private-alleles, date {}\n# {}\n".format(datestamp, " ".join(sys.argv)))
+ plot_fhs[p].write("# Population '{}': positive values\n# Population '{}': negative values\n".format(plot_pops[0], plot_pops[1]))
+ if smooth:
+ plot_fhs[p].write("# Chr\tBP\tPopulation\t{}-PrivateAlleleFreq\t{}-Smoothed\n".format(p,p))
+ else:
+ plot_fhs[p].write("# Chr\tBP\tPopulation\t{}-PrivateAlleleFreq\n".format(p))
+
+ if smooth:
+ print(" Smoothing values", file=sys.stderr, end='', flush=True)
+
+ #
+ # Iterate over chromosomes/scaffolds, longest to shortest.
+ #
+ for chr, chrlen in chrs_list:
+ if chr not in ordered_loci:
+ continue
+ if chrlen < ctg_limit:
+ continue
+
+ #
+ # Create arrays to hold SmoothStat objects to do the plot smoothing.
+ #
+ pop_1 = []
+ pop_2 = []
+ pop_1_vcfpops = {}
+ pop_2_vcfpops = {}
+ if popmap_used == True:
+ for p in vcfpops:
+ pop_1_vcfpops[p] = []
+ pop_2_vcfpops[p] = []
+
+ for loc in ordered_loci[chr]:
+
+ for snpcol in loc.snps:
+ snp = loc.snps[snpcol]
+
+ #
+ # Was this private allele found in the hybrid/vcf population? If so, print it.
+ #
+ if snp.priv_hyb == True:
+
+ for p in snp.pops:
+ pop = snp.pops[p]
+
+ if pop.private == True:
+
+ if pop.priv_all == snp.p_allele:
+ if pop.pop == plot_pops[0]:
+ pop_1.append(SmoothStat(loc.id, pop.pop, snp.col, loc.chr, snp.bp, snp.pops['vcf'].cnt, snp.pops['vcf'].p_freq))
+ if popmap_used == True:
+ for p in vcfpops:
+ if snp.indvpops[p].cnt > 0:
+ pop_1_vcfpops[p].append(SmoothStat(loc.id, pop.pop, snp.col, loc.chr, snp.bp, snp.indvpops[p].cnt, snp.indvpops[p].p_freq))
+
+ elif pop.pop == plot_pops[1]:
+ pop_2.append(SmoothStat(loc.id, pop.pop, snp.col, loc.chr, snp.bp, snp.pops['vcf'].cnt, snp.pops['vcf'].p_freq))
+ if popmap_used == True:
+ for p in vcfpops:
+ if snp.indvpops[p].cnt > 0:
+ pop_2_vcfpops[p].append(SmoothStat(loc.id, pop.pop, snp.col, loc.chr, snp.bp, snp.indvpops[p].cnt, snp.indvpops[p].p_freq))
+
+ elif pop.priv_all == snp.q_allele:
+ if pop.pop == plot_pops[0]:
+ pop_1.append(SmoothStat(loc.id, pop.pop, snp.col, loc.chr, snp.bp, snp.pops['vcf'].cnt, snp.pops['vcf'].q_freq))
+ if popmap_used == True:
+ for p in vcfpops:
+ if snp.indvpops[p].cnt > 0:
+ pop_1_vcfpops[p].append(SmoothStat(loc.id, pop.pop, snp.col, loc.chr, snp.bp, snp.indvpops[p].cnt, snp.indvpops[p].q_freq))
+
+ elif pop.pop == plot_pops[1]:
+ pop_2.append(SmoothStat(loc.id, pop.pop, snp.col, loc.chr, snp.bp, snp.pops['vcf'].cnt, snp.pops['vcf'].q_freq))
+ if popmap_used == True:
+ for p in vcfpops:
+ if snp.indvpops[p].cnt > 0:
+ pop_2_vcfpops[p].append(SmoothStat(loc.id, pop.pop, snp.col, loc.chr, snp.bp, snp.indvpops[p].cnt, snp.indvpops[p].q_freq))
+
+ pop_1.sort(key=attrgetter("bp"))
+ pop_2.sort(key=attrgetter("bp"))
+ if popmap_used == True:
+ for p in vcfpops:
+ pop_1_vcfpops[p].sort(key=attrgetter("bp"))
+ pop_2_vcfpops[p].sort(key=attrgetter("bp"))
+
+ #
+ # Smooth the values.
+ #
+ if smooth == True:
+ smooth_chromosome(pop_1)
+ smooth_chromosome(pop_2)
+ if popmap_used == True:
+ for p in vcfpops:
+ smooth_chromosome(pop_1_vcfpops[p])
+ smooth_chromosome(pop_2_vcfpops[p])
+ print(".", file=sys.stderr, end='', flush=True)
+
+ #
+ # Write the values.
+ #
+ write_plot_values(plot_fhs['vcf'], pop_1, pop_2)
+ if popmap_used == True:
+ for p in vcfpops:
+ write_plot_values(plot_fhs[p], pop_1_vcfpops[p], pop_2_vcfpops[p])
+
+ for f in plot_fhs:
+ plot_fhs[f].close()
+ if smooth == True:
+ print(file=sys.stderr)
+
+ return
+
+
+def main():
+ loci = {}
+ ordered_loci = {}
+ chrs_list = []
+ #
+ # List of populations, and ordered sample names for each.
+ #
+ populations_key = {}
+
+ parse_command_line()
+
+ vcfpops = {}
+ vcfpops_rev = {}
+ if popmap_used:
+ print("Parsing population map: '{}'...".format(popmap_path), file=sys.stderr)
+ parse_popmap_file(popmap_path, vcfpops, vcfpops_rev)
+
+ print("Parsing sumstats file: '{}'...".format(sumstats_path), file=sys.stderr)
+ parse_sumstats_file(sumstats_path, loci, populations_key)
+
+ print("Parsing VCF file: '{}'...".format(vcf_path), file=sys.stderr)
+ parse_hybrid_vcf(vcf_path, loci, populations_key, vcfpops, vcfpops_rev, chrs_list)
+
+ order_loci(chrs_list, loci, ordered_loci)
+
+ write_private_allele_summary(chrs_list, ordered_loci, populations_key, vcfpops, out_path)
+
+ if polarize == True:
+ print("\nWriting polarized VCF files:", file=sys.stderr)
+ write_polarized_vcfs(chrs_list, loci, populations_key, vcf_path, out_path)
+ print("done.", file=sys.stderr)
+
+ if plots == True:
+ print("\nGenerating plotting files:", file=sys.stderr)
+ write_plots(chrs_list, ordered_loci, populations_key, vcfpops, out_path)
+ print("done.", file=sys.stderr)
+
+# #
+#------------------------------------------------------------------------------#
+# #
+if __name__ == "__main__":
+ main()
=====================================
src/Seq.h
=====================================
@@ -148,12 +148,12 @@ void Seq::reserve(size_t n, bool with_qual) {
seq = new char[capacity + 1];
*seq = '\0';
if (with_qual) {
- delete qual;
+ delete [] qual;
qual = new char[capacity + 1];
*qual = '\0';
} else if (qual != NULL) {
// Delete it to keep the capacity predictable.
- delete qual;
+ delete [] qual;
qual = NULL;
}
}
=====================================
src/clean.cc
=====================================
@@ -1,6 +1,6 @@
// -*-mode:c++; c-style:k&r; c-basic-offset:4;-*-
//
-// Copyright 2011-2021, Julian Catchen <jcatchen at illinois.edu>
+// Copyright 2011-2024, Julian Catchen <jcatchen at illinois.edu>
//
// This file is part of Stacks.
//
@@ -107,11 +107,16 @@ parse_input_record(Seq *s, RawRead *r)
uint lim;
//
// Count the number of colons to differentiate Illumina version.
- // CASAVA 1.8+ has a FASTQ header like this:
+ //
+ // Illumina's BCL Convert program or CASAVA 1.8+ has a FASTQ header like this:
// @HWI-ST0747:155:C01WHABXX:8:1101:6455:26332 1:N:0:
+ //
// Or, with the embedded barcode:
// @HWI-ST1233:67:D12GNACXX:7:2307:14604:78978 1:N:0:ATCACG
- //
+ //
+ // Or, with a UMI (unique molecular identifier -- for PCR duplicate control):
+ // @LH00346:7:22CH5TLT3:5:1101:1407:1032:rTACCCAGT 1:N:0:ACATTGCG
+ // -the 'r' in the UMI field indicates it is reverse complemented.
//
// Or, parse FASTQ header from previous versions that looks like this:
// @HWI-ST0747_0141:4:1101:1240:2199#0/1
@@ -126,16 +131,16 @@ parse_input_record(Seq *s, RawRead *r)
hash_cnt += *q == '#' ? 1 : 0;
}
- if (colon_cnt == 9 && hash_cnt == 0) {
+ if ((colon_cnt >= 9 && colon_cnt <= 10) && hash_cnt == 0) {
r->fastq_type = illv2_fastq;
//
- // According to Illumina manual, "CASAVA v1.8 User Guide" page 41:
+ // Colon_cnt == 9; According to Illumina manual, "CASAVA v1.8 User Guide" page 41:
// @<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<index sequence>
//
for (p = s->id, q = p; *q != ':' && q < stop; q++);
if (q < stop) {
*q = '\0';
- strcpy(r->machine, p);
+ r->machine(p);
*q = ':';
}
@@ -172,12 +177,28 @@ parse_input_record(Seq *s, RawRead *r)
*q = ':';
}
- for (p = q+1, q = p; *q != ' ' && q < stop; q++);
- if (q < stop) {
- *q = '\0';
- r->y = atoi(p);
- *q = ' ';
- }
+ for (p = q+1, q = p; *q != ' ' && q < stop; q++);
+ if (q < stop) {
+ *q = '\0';
+ r->y = atoi(p);
+ }
+
+ if (colon_cnt == 10) {
+ //
+ // According to Illumina's BCL Convert program an extra field occurs containing the UMI:
+ // @<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos>:<UMI> <read>:<is filtered>:<control number>:<index sequence>
+ //
+ *q = ';';
+ r->fastq_type = illv3_fastq;
+
+ for (p = q + 1, q = p; *q != ':' && q < stop; q++);
+ if (q < stop) {
+ *q = '\0';
+ strncpy(r->umi, p, id_len);
+ }
+ }
+ if (q < stop)
+ *q = ' ';
for (p = q+1, q = p; *q != ':' && q < stop; q++);
if (q < stop) {
@@ -270,7 +291,7 @@ parse_input_record(Seq *s, RawRead *r)
for (p = s->id, q = p; *q != ':' && q < stop; q++);
if (q < stop) {
*q = '\0';
- strcpy(r->machine, p);
+ r->machine(p);
*q = ':';
}
@@ -315,8 +336,16 @@ parse_input_record(Seq *s, RawRead *r)
} else {
r->fastq_type = generic_fastq;
- strncpy(r->machine, s->id, id_len);
- r->machine[id_len] = '\0';
+ //
+ // Check if the header already contains a "/1" or "/2" notation at the end,
+ // from previous parsing of the file (e.g. Sequence Read Archve data). Remove
+ // if present.
+ //
+ char *p = s->id + strlen(s->id);
+ if (*(p - 2) == '/' && (*(p - 1) == '1' || *(p - 1) == '2'))
+ *(p - 2) = '\0';
+
+ r->machine(s->id);
}
uint len = strlen(s->seq);
@@ -451,6 +480,14 @@ check_quality_scores(RawRead *href, int qual_offset, int score_limit, int len_li
// | | | | |
// 64 74 84 94 104
// 0 10(90%) 20(99%) 30(99.9%) 40(99.99%)
+ //
+ // Illumina 1.8+ encodes phred scores between ASCII values 33 (0 quality) and 74 (41 quality)
+ //
+ // !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJK
+ // | | | | |
+ // 33 43 53 63 73
+ // 0 10(90%) 20(99%) 30(99.9%) 40(99.99%)
+
//
//
// Drop sequence if the average phred quality score drops below a threshold within a sliding window.
@@ -459,12 +496,9 @@ check_quality_scores(RawRead *href, int qual_offset, int score_limit, int len_li
//
// Convert the encoded quality scores to their integer values
//
- // cerr << "Integer scores: ";
for (uint j = 0; j < href->len; j++) {
href->int_scores[j] = href->phred[j] - qual_offset;
- // cerr << href->int_scores[j] << " ";
}
- // cerr << "\n";
const int *p, *q;
double mean = 0.0;
@@ -530,6 +564,71 @@ check_quality_scores(RawRead *href, int qual_offset, int score_limit, int len_li
return 1;
}
+int
+check_poly_g_run(RawRead *href, int offset, int len_limit)
+{
+ //
+ // Search for a run of G nucleotides at the 3' end of the read. In Illumina machines with
+ // two-color chemistry, no color signal is interpreted as a 'G'. However, when sequencing of
+ // a cluster has failed, and no signal is returned, that is also interpreted as a G and
+ // will receive a very high quality score. If we see a run of these Gs at the end of the
+ // sequence, drop the seqeunce.
+ //
+ const int run_limit = 10;
+ const int polyg_phred_lim = 35;
+ const int exception_lim = 2;
+ //
+ // Set a stop index to the first element of the sequence, not including any barcode.
+ //
+ int stop = offset;
+
+ //
+ // We will look for a run of a certain length (run_len) while allowing for a small number
+ // of exceptions to the run, either with a different base or a lowerer phred score.
+ int run_len = 0;
+ int exception_len = 0;
+
+ //
+ // Set j to be the last element in the sequence.
+ //
+ int j = href->len - 1;
+
+ // cerr << "Start: " << j << "; Stop: " << stop << "\n";
+
+ while (j >= stop) {
+
+ // cerr << " Col: " << j << "; Nuc: " << href->seq[j] << "; Phred: " << href->phred[j] << "; Int: " << href->int_scores[j] << "\n";
+
+ if (href->seq[j] == 'G' && href->int_scores[j] >= polyg_phred_lim)
+ run_len++;
+ else
+ exception_len++;
+
+ if (exception_len >= exception_lim) {
+ // cerr << "Returning 1; run_len: " << run_len << "; exception_len: " << exception_len << "\n";
+ // cerr << " " << href->seq << "\n " << href->phred << "\n";
+ return 1;
+ }
+
+ if (run_len >= run_limit) {
+ // cerr << "Returning -1; run_len: " << run_len << "; exception_len: " << exception_len << "\n";
+ // cerr << " " << href->seq << "\n " << href->phred << "\n";
+ if (j < len_limit) {
+ return 0;
+ } else {
+ href->len = j + 1;
+ href->seq[j] = '\0';
+ href->phred[j] = '\0';
+ return -1;
+ }
+ }
+
+ j--;
+ }
+
+ return 1;
+}
+
bool
correct_barcode(set<string> &bcs, RawRead *href, seqt type, int num_errs)
{
@@ -619,8 +718,9 @@ filter_adapter_seq(RawRead *href, char *adapter, int adp_len, AdapterHash &adp_k
int kmer_size, int distance, int len_limit)
{
vector<pair<int, int> > hits;
- int num_kmers = href->len - kmer_size + 1;
- const char *p = href->seq;
+ int query_stop = href->inline_bc_len;
+ int num_kmers = href->len - query_stop - kmer_size + 1;
+ const char *p = href->seq + query_stop;
string kmer;
//
@@ -651,13 +751,17 @@ filter_adapter_seq(RawRead *href, char *adapter, int adp_len, AdapterHash &adp_k
// cerr << "Starting comparison at i: "<< i << "; j: " << j << "\n";
- while (i >= 0 && j >= 0) {
+ while (i >= query_stop && j >= 0) {
if (href->seq[i] != adapter[j])
mismatches++;
i--;
j--;
}
-
+ while (j > 0) {
+ mismatches++;
+ j--;
+ }
+
if (mismatches > distance)
continue;
=====================================
src/clean.h
=====================================
@@ -1,6 +1,6 @@
// -*-mode:c++; c-style:k&r; c-basic-offset:4;-*-
//
-// Copyright 2011-2023, Julian Catchen <jcatchen at uoregon.edu>
+// Copyright 2011-2024, Julian Catchen <jcatchen at uoregon.edu>
//
// This file is part of Stacks.
//
@@ -30,14 +30,18 @@
#include "input.h"
#include "kmers.h"
-enum fastqt {generic_fastq, illv1_fastq, illv2_fastq};
+enum fastqt {generic_fastq, illv1_fastq, illv2_fastq, illv3_fastq};
enum barcodet {null_null,
null_inline, null_index,
inline_null, index_null,
inline_inline, index_index,
inline_index, index_inline};
-enum seqt {single_end, paired_end};
+
+enum seqt {single_end, paired_end};
+
+enum machinet {genome_analyzer_iix, miseq,
+ hiseq, nextseq, novaseq, miniseq, unknown};
typedef unordered_map<string, vector<int>, std::hash<string> > AdapterHash;
@@ -143,34 +147,37 @@ public:
class RawRead {
public:
- fastqt fastq_type;
- char *inline_bc;
- char *index_bc;
- char *se_bc;
- char *pe_bc;
- char *machine;
- int run;
- int lane;
- int tile;
- int x;
- int y;
- int index;
- int read;
- char *seq;
- char *phred;
- int *int_scores;
- bool filter;
- int inline_bc_len;
- int retain;
- uint size;
- uint len;
- double win_len;
- double stop_pos;
-
+ fastqt fastq_type;
+ char *inline_bc;
+ char *index_bc;
+ char *se_bc;
+ char *pe_bc;
+ char *_machine;
+ machinet machine_type;
+ char *umi;
+ int run;
+ int lane;
+ int tile;
+ int x;
+ int y;
+ int index;
+ int read;
+ char *seq;
+ char *phred;
+ int *int_scores;
+ bool filter;
+ int inline_bc_len;
+ int retain;
+ uint size;
+ uint len;
+ double win_len;
+ double stop_pos;
+
RawRead(uint buf_len, int read, int barcode_size, double win_size) {
this->inline_bc = new char[id_len + 1];
this->index_bc = new char[id_len + 1];
- this->machine = new char[id_len + 1];
+ this->_machine = new char[id_len + 1];
+ this->umi = new char[id_len + 1];
this->seq = new char[buf_len + 1];
this->phred = new char[buf_len + 1];
this->int_scores = new int[buf_len];
@@ -189,10 +196,18 @@ public:
this->inline_bc[0] = '\0';
this->index_bc[0] = '\0';
- this->machine[0] = '\0';
+ this->_machine[0] = '\0';
+ this->umi[0] = '\0';
this->seq[0] = '\0';
this->phred[0] = '\0';
+ this->inline_bc[id_len] = '\0';
+ this->index_bc[id_len] = '\0';
+ this->_machine[id_len] = '\0';
+ this->umi[id_len] = '\0';
+
+ this->detect_machine_type();
+
this->set_len(buf_len);
this->se_bc = NULL;
@@ -241,7 +256,8 @@ public:
~RawRead() {
delete [] this->inline_bc;
delete [] this->index_bc;
- delete [] this->machine;
+ delete [] this->_machine;
+ delete [] this->umi;
delete [] this->seq;
delete [] this->phred;
delete [] this->int_scores;
@@ -259,6 +275,63 @@ public:
return 0;
}
+ void detect_machine_type() {
+
+ const char *p = this->_machine;
+ switch(*p) {
+ case 'A':
+ if (*(p+1) == 'A')
+ this->machine_type = nextseq;
+ else
+ this->machine_type = novaseq;
+ break;
+ case 'N':
+ case 'V':
+ this->machine_type = nextseq;
+ break;
+ case 'C':
+ case 'D':
+ case 'J':
+ case 'K':
+ this->machine_type = hiseq;
+ break;
+ case 'H':
+ if (*(p+1) != 'W')
+ this->machine_type = novaseq;
+ else if (strncmp(p, "HWI-M", 5) == 0)
+ this->machine_type = miseq;
+ else if (strncmp(p, "HWUSI", 5) == 0)
+ this->machine_type = genome_analyzer_iix;
+ else if (strncmp(p, "HWI-C", 5) == 0)
+ this->machine_type = hiseq;
+ else if (strncmp(p, "HWI-D", 5) == 0)
+ this->machine_type = hiseq;
+ break;
+ case 'M':
+ if (*(p+1) == 'N')
+ this->machine_type = miniseq;
+ else
+ this->machine_type = miseq;
+ break;
+ default:
+ this->machine_type = unknown;
+ break;
+ }
+
+ return;
+ }
+ void machine(const char *p) {
+ strncpy(this->_machine, p, id_len);
+ this->detect_machine_type();
+ }
+ const char *machine() {
+ return this->_machine;
+ }
+ bool check_poly_g() {
+ if (this->machine_type == machinet::novaseq || this->machine_type == machinet::nextseq)
+ return true;
+ return false;
+ }
int set_len(uint buf_len) {
if (buf_len == this->len)
return 0;
@@ -333,6 +406,7 @@ int filter_adapter_seq(RawRead *, char *, int, AdapterHash &, int, int, int);
int init_adapter_seq(int, char *, int &, AdapterHash &);
int check_quality_scores(RawRead *, int, int, int, int);
+int check_poly_g_run(RawRead *, int, int);
//
// Templated function to process barcodes.
=====================================
src/clone_filter.cc
=====================================
@@ -31,11 +31,13 @@
//
FileT in_file_type = FileT::unknown;
FileT out_file_type = FileT::unknown;
+InputT in_path_type = InputT::files;
string in_file;
string in_file_p1;
string in_file_p2;
string in_path_1;
string in_path_2;
+string out_basename;
string out_path;
bool discards = false;
bool interleaved = false;
=====================================
src/constants.h
=====================================
@@ -177,6 +177,14 @@ enum class FileT {unknown,
vcf, gzvcf
};
+//
+// Supported file types
+//
+enum class InputT {unknown,
+ directory,
+ files
+};
+
int stacks_handle_exceptions(const exception& e);
string remove_suffix(FileT, const string&);
=====================================
src/cstacks.cc
=====================================
@@ -1,6 +1,6 @@
// -*-mode:c++; c-style:k&r; c-basic-offset:4;-*-
//
-// Copyright 2010-2023, Julian Catchen <jcatchen at illinois.edu>
+// Copyright 2010-2024, Julian Catchen <jcatchen at illinois.edu>
//
// This file is part of Stacks.
//
@@ -1869,24 +1869,24 @@ int parse_command_line(int argc, char* argv[]) {
static struct option long_options[] = {
{"help", no_argument, NULL, 'h'},
{"version", no_argument, NULL, 1000},
- {"uniq-haplotypes", no_argument, NULL, 'u'}, {"uniq_haplotypes", no_argument, NULL, 'u'},
- {"report-mmatches", no_argument, NULL, 'R'}, {"report_mmatches", no_argument, NULL, 'R'},
- {"disable-gapped", no_argument, NULL, 'G'}, {"disable_gapped", no_argument, NULL, 'G'},
- {"max-gaps", required_argument, NULL, 'X'}, {"max_gaps", required_argument, NULL, 'X'},
- {"min-aln-len", required_argument, NULL, 'x'}, {"min_aln_len", required_argument, NULL, 'x'},
- {"ctag-dist", required_argument, NULL, 'n'}, {"ctag_dist", required_argument, NULL, 'n'},
- {"k-len", required_argument, NULL, 'k'}, {"k_len", required_argument, NULL, 'k'},
- {"in-path", required_argument, NULL, 'P'}, {"in_path", required_argument, NULL, 'P'},
+ {"uniq-haplotypes", no_argument, NULL, 'u'},
+ {"report-mmatches", no_argument, NULL, 'R'},
+ {"disable-gapped", no_argument, NULL, 'G'},
+ {"max-gaps", required_argument, NULL, 'X'},
+ {"min-aln-len", required_argument, NULL, 'x'},
+ {"ctag-dist", required_argument, NULL, 'n'},
+ {"k-len", required_argument, NULL, 'k'},
+ {"in-path", required_argument, NULL, 'P'},
{"popmap", required_argument, NULL, 'M'},
{"catalog", required_argument, NULL, 'c'},
{"sample", required_argument, NULL, 's'},
- {"out-path", required_argument, NULL, 'o'}, {"out_path", required_argument, NULL, 'o'},
- {"threads", required_argument, NULL, 'p'},
+ {"out-path", required_argument, NULL, 'o'},
+ {"threads", required_argument, NULL, 't'},
{0, 0, 0, 0}
};
int option_index = 0;
- int c = getopt_long(argc, argv, "hvuRGX:x:o:s:c:p:n:k:P:M:", long_options, &option_index);
+ int c = getopt_long(argc, argv, "hvuRGX:x:o:s:c:p:t:n:k:P:M:", long_options, &option_index);
// Detect the end of the options.
if (c == -1)
@@ -1931,6 +1931,7 @@ int parse_command_line(int argc, char* argv[]) {
version();
break;
case 'p':
+ case 't':
num_threads = is_integer(optarg);
break;
case 'P':
@@ -2005,13 +2006,13 @@ void version() {
void help() {
cerr << "cstacks " << VERSION << "\n"
- << "cstacks -P in_dir -M popmap [-n num_mismatches] [-p num_threads]" << "\n"
- << "cstacks -s sample1_path [-s sample2_path ...] -o path [-n num_mismatches] [-p num_threads]" << "\n"
+ << "cstacks -P in_dir -M popmap [-n num_mismatches] [-t num_threads]" << "\n"
+ << "cstacks -s sample1_path [-s sample2_path ...] -o path [-n num_mismatches] [-t num_threads]" << "\n"
<< "\n"
<< " -P,--in-path: path to the directory containing Stacks files.\n"
<< " -M,--popmap: path to a population map file.\n"
<< " -n: number of mismatches allowed between sample loci when build the catalog (default 1; suggested: set to ustacks -M)." << "\n"
- << " -p,--threads: enable parallel execution with num_threads threads.\n"
+ << " -t,--threads: enable parallel execution with num_threads threads.\n"
<< " -s: sample prefix from which to load loci into the catalog." << "\n"
<< " -o,--outpath: output path to write results." << "\n"
<< " -c,--catalog <path>: add to an existing catalog.\n"
=====================================
src/file_io.cc
=====================================
@@ -1,6 +1,6 @@
// -*-mode:c++; c-style:k&r; c-basic-offset:4;-*-
//
-// Copyright 2012-2021, Julian Catchen <jcatchen at illinois.edu>
+// Copyright 2012-2024, Julian Catchen <jcatchen at illinois.edu>
//
// This file is part of Stacks.
//
@@ -21,10 +21,6 @@
//
// file_io.cc -- common routines for opening groups of files and processing barcode lists.
//
-// Julian Catchen
-// jcatchen at uoregon.edu
-// University of Oregon
-//
#include "file_io.h"
@@ -56,8 +52,10 @@ open_files(vector<pair<string, string> > &files,
BarcodePair bc;
//
// If the size of the barcodes vector is 0, then no barcodes
- // were submitted. In this case, we want to open output files
- // of the same name as input files, but in out_path.
+ // were submitted. In this case, we want to either:
+ // 1. open output files using out_basename, if supplied by the user, or
+ // 2. open output files of the same name as input files,
+ // in both cases opening the files in out_path.
//
if (barcodes.size() == 0 && merge == false) {
@@ -69,22 +67,26 @@ open_files(vector<pair<string, string> > &files,
if (paired)
bc.pe = files[i].second;
- path = out_path + files[i].first;
-
- //
- // Check that the file has a proper suffix for the output type.
- //
- pos = path.find_last_of(".");
- if (path.substr(pos) == ".bam") {
- path = path.substr(0, pos) + suffix_1;
- } else if (path.substr(pos) == ".gz") {
- path = path.substr(0, pos);
- pos = path.find_last_of(".");
- path = path.substr(0, pos) + suffix_1;
+ if (out_basename.length() > 0) {
+ path = out_path + out_basename + suffix_1;
+
} else {
- path = path.substr(0, pos) + suffix_1;
+ path = out_path + files[i].first;
+ //
+ // Check that the file has a proper suffix for the output type.
+ //
+ pos = path.find_last_of(".");
+ if (path.substr(pos) == ".bam") {
+ path = path.substr(0, pos) + suffix_1;
+ } else if (path.substr(pos) == ".gz") {
+ path = path.substr(0, pos);
+ pos = path.find_last_of(".");
+ path = path.substr(0, pos) + suffix_1;
+ } else {
+ path = path.substr(0, pos) + suffix_1;
+ }
}
-
+
if (stat((in_path_1 + files[i].first).c_str(), &sb_1) == -1) {
cerr << "Unable to stat input file '" << in_path_1 + files[i].first << "'\n";
exit(1);
@@ -107,19 +109,25 @@ open_files(vector<pair<string, string> > &files,
if (paired) {
file = interleaved ? files[i].first : files[i].second;
- path = out_path + file;
- pos = path.find_last_of(".");
- if (path.substr(pos) == ".bam") {
- path = path.substr(0, pos) + suffix_2;
- } else if (path.substr(pos) == ".gz") {
- path = path.substr(0, pos);
- pos = path.find_last_of(".");
- path = path.substr(0, pos) + suffix_2;
+ if (out_basename.length() > 0) {
+ path = out_path + out_basename + suffix_2;
+
} else {
- path = path.substr(0, pos) + suffix_2;
+ path = out_path + file;
+
+ pos = path.find_last_of(".");
+ if (path.substr(pos) == ".bam") {
+ path = path.substr(0, pos) + suffix_2;
+ } else if (path.substr(pos) == ".gz") {
+ path = path.substr(0, pos);
+ pos = path.find_last_of(".");
+ path = path.substr(0, pos) + suffix_2;
+ } else {
+ path = path.substr(0, pos) + suffix_2;
+ }
}
-
+
if (stat((in_path_2 + file).c_str(), &sb_1) == -1) {
cerr << "Unable to stat input file '" << in_path_2 + file << "'\n";
exit(1);
@@ -140,17 +148,22 @@ open_files(vector<pair<string, string> > &files,
exit(1);
}
- filepath = files[i].first;
- pos = filepath.find_last_of(".");
- if (filepath.substr(pos) == ".gz") {
- filepath = filepath.substr(0, pos);
+ if (out_basename.length() > 0) {
+ path = out_path + out_basename + ".rem" + suffix_1;
+
+ } else {
+ filepath = files[i].first;
pos = filepath.find_last_of(".");
- filepath = filepath.substr(0, pos);
- } else if (filepath.substr(pos) == ".bam") {
- filepath = filepath.substr(0, pos);
+ if (filepath.substr(pos) == ".gz") {
+ filepath = filepath.substr(0, pos);
+ pos = filepath.find_last_of(".");
+ filepath = filepath.substr(0, pos);
+ } else if (filepath.substr(pos) == ".bam") {
+ filepath = filepath.substr(0, pos);
+ }
+ path = out_path + filepath + ".rem" + suffix_1;
}
- path = out_path + filepath + ".rem" + suffix_1;
-
+
fh = new ofstream(path.c_str(), ifstream::out);
rem_1_fhs[bc] = fh;
@@ -159,17 +172,22 @@ open_files(vector<pair<string, string> > &files,
exit(1);
}
- filepath = file;
- pos = filepath.find_last_of(".");
- if (filepath.substr(pos) == ".gz") {
- filepath = filepath.substr(0, pos);
+ if (out_basename.length() > 0) {
+ path = out_path + out_basename + ".rem" + suffix_2;
+
+ } else {
+ filepath = file;
pos = filepath.find_last_of(".");
- filepath = filepath.substr(0, pos);
- } else if (filepath.substr(pos) == ".bam") {
- filepath = filepath.substr(0, pos);
+ if (filepath.substr(pos) == ".gz") {
+ filepath = filepath.substr(0, pos);
+ pos = filepath.find_last_of(".");
+ filepath = filepath.substr(0, pos);
+ } else if (filepath.substr(pos) == ".bam") {
+ filepath = filepath.substr(0, pos);
+ }
+ path = out_path + filepath + ".rem" + suffix_2;
}
- path = out_path + filepath + ".rem" + suffix_2;
-
+
fh = new ofstream(path.c_str(), ifstream::out);
rem_2_fhs[bc] = fh;
@@ -184,7 +202,12 @@ open_files(vector<pair<string, string> > &files,
} else if (barcodes.size() == 0 && merge == true) {
- path = out_path + "sample_unbarcoded" + suffix_1;
+ if (out_basename.length() > 0) {
+ path = out_path + out_basename + suffix_1;
+
+ } else {
+ path = out_path + "sample_unbarcoded" + suffix_1;
+ }
fh = new ofstream(path.c_str(), ifstream::out);
if (fh->fail()) {
@@ -200,7 +223,12 @@ open_files(vector<pair<string, string> > &files,
}
if (paired) {
- path = out_path + "sample_unbarcoded" + suffix_2;
+ if (out_basename.length() > 0) {
+ path = out_path + out_basename + suffix_2;
+
+ } else {
+ path = out_path + "sample_unbarcoded" + suffix_2;
+ }
fh = new ofstream(path.c_str(), ifstream::out);
if (fh->fail()) {
@@ -214,7 +242,12 @@ open_files(vector<pair<string, string> > &files,
pair_2_fhs[bc] = fh;
}
- path = out_path + "sample_unbarcoded.rem" + suffix_1;
+ if (out_basename.length() > 0) {
+ path = out_path + out_basename + ".rem" + suffix_1;
+
+ } else {
+ path = out_path + "sample_unbarcoded.rem" + suffix_1;
+ }
fh = new ofstream(path.c_str(), ifstream::out);
if (fh->fail()) {
@@ -228,7 +261,12 @@ open_files(vector<pair<string, string> > &files,
rem_1_fhs[bc] = fh;
}
- path = out_path + "sample_unbarcoded.rem" + suffix_2;
+ if (out_basename.length() > 0) {
+ path = out_path + out_basename + ".rem" + suffix_2;
+
+ } else {
+ path = out_path + "sample_unbarcoded.rem" + suffix_2;
+ }
fh = new ofstream(path.c_str(), ifstream::out);
if (fh->fail()) {
@@ -334,8 +372,10 @@ open_files(vector<pair<string, string> > &files,
BarcodePair bc;
//
// If the size of the barcodes vector is 0, then no barcodes
- // were submitted. In this case, we want to open output files
- // of the same name as input files, but in out_path.
+ // were submitted. In this case, we want to either:
+ // 1. open output files using out_basename, if supplied by the user, or
+ // 2. open output files of the same name as input files,
+ // in both cases opening the files in out_path.
//
if (barcodes.size() == 0 && merge == false) {
@@ -347,20 +387,24 @@ open_files(vector<pair<string, string> > &files,
if (paired)
bc.pe = files[i].second;
- path = out_path + files[i].first;
-
- //
- // Check that the file has a proper suffix for the output type.
- //
- pos = path.find_last_of(".");
- if (path.substr(pos) == ".bam") {
- path = path.substr(0, pos) + suffix_1;
- } else if (path.substr(pos) == ".gz") {
- path = path.substr(0, pos);
- pos = path.find_last_of(".");
- path = path.substr(0, pos) + suffix_1;
+ if (out_basename.length() > 0) {
+ path = out_path + out_basename + suffix_1;
+
} else {
- path = path.substr(0, pos) + suffix_1;
+ path = out_path + files[i].first;
+ //
+ // Check that the file has a proper suffix for the output type.
+ //
+ pos = path.find_last_of(".");
+ if (path.substr(pos) == ".bam") {
+ path = path.substr(0, pos) + suffix_1;
+ } else if (path.substr(pos) == ".gz") {
+ path = path.substr(0, pos);
+ pos = path.find_last_of(".");
+ path = path.substr(0, pos) + suffix_1;
+ } else {
+ path = path.substr(0, pos) + suffix_1;
+ }
}
if (stat((in_path_1 + files[i].first).c_str(), &sb_1) == -1) {
@@ -385,18 +429,24 @@ open_files(vector<pair<string, string> > &files,
}
if (paired) {
- file = interleaved ? files[i].first : files[i].second;
- path = out_path + file;
- pos = path.find_last_of(".");
- if (path.substr(pos) == ".bam") {
- path.replace(pos, 4, suffix_2);
- } else if (path.substr(pos) == ".gz") {
- path = path.substr(0, pos);
- pos = path.find_last_of(".");
- path = path.substr(0, pos) + suffix_2;
+ if (out_basename.length() > 0) {
+ path = out_path + out_basename + suffix_2;
+
} else {
- path = path.substr(0, pos) + suffix_2;
+ file = interleaved ? files[i].first : files[i].second;
+ path = out_path + file;
+
+ pos = path.find_last_of(".");
+ if (path.substr(pos) == ".bam") {
+ path.replace(pos, 4, suffix_2);
+ } else if (path.substr(pos) == ".gz") {
+ path = path.substr(0, pos);
+ pos = path.find_last_of(".");
+ path = path.substr(0, pos) + suffix_2;
+ } else {
+ path = path.substr(0, pos) + suffix_2;
+ }
}
if (stat((in_path_2 + file).c_str(), &sb_1) == -1) {
@@ -420,17 +470,22 @@ open_files(vector<pair<string, string> > &files,
exit(1);
}
- filepath = files[i].first;
- pos = filepath.find_last_of(".");
- if (filepath.substr(pos) == ".gz") {
- filepath = filepath.substr(0, pos);
+ if (out_basename.length() > 0) {
+ path = out_path + out_basename + ".rem" + suffix_1;
+
+ } else {
+ filepath = files[i].first;
pos = filepath.find_last_of(".");
- filepath = filepath.substr(0, pos);
- } else if (filepath.substr(pos) == ".bam") {
- filepath = filepath.substr(0, pos);
+ if (filepath.substr(pos) == ".gz") {
+ filepath = filepath.substr(0, pos);
+ pos = filepath.find_last_of(".");
+ filepath = filepath.substr(0, pos);
+ } else if (filepath.substr(pos) == ".bam") {
+ filepath = filepath.substr(0, pos);
+ }
+ path = out_path + filepath + ".rem" + suffix_1;
}
- path = out_path + filepath + ".rem" + suffix_1;
-
+
fh = new gzFile;
*fh = gzopen(path.c_str(), "wb");
rem_1_fhs[bc] = fh;
@@ -440,17 +495,22 @@ open_files(vector<pair<string, string> > &files,
exit(1);
}
- filepath = file;
- pos = filepath.find_last_of(".");
- if (filepath.substr(pos) == ".gz") {
- filepath = filepath.substr(0, pos);
+ if (out_basename.length() > 0) {
+ path = out_path + out_basename + ".rem" + suffix_2;
+
+ } else {
+ filepath = file;
pos = filepath.find_last_of(".");
- filepath = filepath.substr(0, pos);
- } else if (filepath.substr(pos) == ".bam") {
- filepath = filepath.substr(0, pos);
+ if (filepath.substr(pos) == ".gz") {
+ filepath = filepath.substr(0, pos);
+ pos = filepath.find_last_of(".");
+ filepath = filepath.substr(0, pos);
+ } else if (filepath.substr(pos) == ".bam") {
+ filepath = filepath.substr(0, pos);
+ }
+ path = out_path + filepath + ".rem" + suffix_2;
}
- path = out_path + filepath + ".rem" + suffix_2;
-
+
fh = new gzFile;
*fh = gzopen(path.c_str(), "wb");
rem_2_fhs[bc] = fh;
@@ -463,9 +523,14 @@ open_files(vector<pair<string, string> > &files,
}
return 0;
+
} else if (barcodes.size() == 0 && merge == true) {
- path = out_path + "sample_unbarcoded" + suffix_1;
+ if (out_basename.length() > 0) {
+ path = out_path + out_basename + suffix_1;
+ } else {
+ path = out_path + "sample_unbarcoded" + suffix_1;
+ }
fh = new gzFile;
*fh = gzopen(path.c_str(), "wb");
@@ -482,7 +547,11 @@ open_files(vector<pair<string, string> > &files,
}
if (paired) {
- path = out_path + "sample_unbarcoded" + suffix_2;
+ if (out_basename.length() > 0) {
+ path = out_path + out_basename + suffix_2;
+ } else {
+ path = out_path + "sample_unbarcoded" + suffix_2;
+ }
fh = new gzFile;
*fh = gzopen(path.c_str(), "wb");
@@ -497,7 +566,11 @@ open_files(vector<pair<string, string> > &files,
pair_2_fhs[bc] = fh;
}
- path = out_path + "sample_unbarcoded.rem" + suffix_1;
+ if (out_basename.length() > 0) {
+ path = out_path + out_basename + ".rem" + suffix_1;
+ } else {
+ path = out_path + "sample_unbarcoded.rem" + suffix_1;
+ }
fh = new gzFile;
*fh = gzopen(path.c_str(), "wb");
@@ -512,7 +585,11 @@ open_files(vector<pair<string, string> > &files,
rem_1_fhs[bc] = fh;
}
- path = out_path + "sample_unbarcoded.rem" + suffix_2;
+ if (out_basename.length() > 0) {
+ path = out_path + out_basename + ".rem" + suffix_2;
+ } else {
+ path = out_path + "sample_unbarcoded.rem" + suffix_2;
+ }
fh = new gzFile;
*fh = gzopen(path.c_str(), "wb");
@@ -909,10 +986,10 @@ load_barcodes(string barcode_file, vector<BarcodePair> &barcodes,
int
build_file_list(vector<pair<string, string> > &files)
{
- //
- // Scan a directory for a list of files.
- //
- if (in_path_1.length() > 0) {
+ if (in_path_type == InputT::directory) {
+ //
+ // Scan a directory for a list of files.
+ //
string file, paired_file;
const char *p, *q, *end;
struct dirent *direntry;
=====================================
src/file_io.h
=====================================
@@ -1,6 +1,6 @@
// -*-mode:c++; c-style:k&r; c-basic-offset:4;-*-
//
-// Copyright 2012, Julian Catchen <jcatchen at uoregon.edu>
+// Copyright 2012-2024, Julian Catchen <jcatchen at illinois.edu>
//
// This file is part of Stacks.
//
@@ -46,11 +46,13 @@
//
extern FileT in_file_type;
extern FileT out_file_type;
+extern InputT in_path_type;
extern barcodet barcode_type;
extern bool paired;
extern bool interleaved;
extern bool merge;
extern string out_path;
+extern string out_basename;
extern string in_file;
extern string in_file_p1;
extern string in_file_p2;
=====================================
src/populations.cc
=====================================
@@ -1,6 +1,6 @@
// -*-mode:c++; c-style:k&r; c-basic-offset:4;-*-
//
-// Copyright 2012-2023, Julian Catchen <jcatchen at illinois.edu>
+// Copyright 2012-2024, Julian Catchen <jcatchen at illinois.edu>
//
// This file is part of Stacks.
//
@@ -43,10 +43,13 @@ string bl_file;
string wl_file;
string bs_wl_file;
string enz;
-double sigma = 150000.0;
double min_samples_overall = 0.0;
double min_samples_per_pop = 0.0;
int min_populations = 1;
+size_t min_gt_depth = 0;
+double minor_allele_freq = 0.0;
+long minor_allele_cnt = 0;
+double max_obs_het = 1.0;
bool filter_haplotype_wise = false;
int batch_size = 10000;
bool calc_fstats = false;
@@ -70,12 +73,9 @@ bool mapping_cross = false;
CrossT mapcross_type = CrossT::unk;
FormatT mapcross_format = FormatT::unk;
bool verbose = false;
-size_t min_gt_depth = 0;
double merge_prune_lim = 1.0;
-double minor_allele_freq = 0.0;
-long minor_allele_cnt = 0;
-double max_obs_het = 1.0;
double p_value_cutoff = 0.05;
+double sigma = 150000.0;
corr_type fst_correction = no_correction;
set<string> debug_flags;
@@ -305,6 +305,10 @@ try {
Timer timer;
logger->x << "\nBEGIN batch_progress\n";
+ if (verbose)
+ logger->x << "# List of individual sites filtered.\n"
+ << "action\tlocus\tchr\tbp\tcol\treason\t[sample]\n";
+
while (true) {
//
// Read the next set of loci to process.
@@ -322,7 +326,7 @@ try {
logger->x << bloc.chr();
} else {
cout << "Batch " << bloc.next_batch_number() - 1 << " " << flush;
- logger->x << "Batch " << bloc.next_batch_number() - 1;
+ logger->x << "# Batch " << bloc.next_batch_number() - 1;
}
logger->x << ": analyzed "
<< filter.batch_total() << " loci; filtered "
@@ -427,6 +431,8 @@ try {
cout << " " << n_sites << " genomic sites, of which "
<< n_multiloci_sites << " were covered by multiple loci ("
<< as_percentage(n_multiloci_sites, n_sites) << ").\n";
+ if (min_gt_depth > 0)
+ cout << "Filtered " << filter.filtered_genotypes() << " genotypes that fell below the minimum genotype depth threshold.\n";
if (filter.total_sites() == 0) {
cerr << "Error: All data has been filtered out.\n";
throw exception();
@@ -1266,7 +1272,7 @@ LocusFilter::apply_filters_stacks(LocBin& loc, ostream& log_fh, const MetaPopInf
//
// Filter genotypes for depth.
//
- this->gt_depth_filter(loc.d, loc.cloc);
+ this->gt_depth_filter(loc.d, loc.cloc, log_fh);
//
// Identify individual SNPs that are below the -r threshold or the minor allele
@@ -1358,26 +1364,45 @@ LocusFilter::filter(const MetaPopInfo *mpopi, Datum **d, CSLocus *cloc)
}
void
-LocusFilter::gt_depth_filter(Datum** data, const CSLocus* cloc)
+LocusFilter::gt_depth_filter(Datum** data, const CSLocus* cloc, ostream &log_fh)
{
if (min_gt_depth == 0)
return;
else if (cloc->snps.empty())
return;
- for (size_t spl=0; spl<this->_sample_cnt; ++spl) {
- Datum* d = data[spl];
+
+ for (size_t s = 0; s < this->_sample_cnt; s++) {
+ Datum *d = data[s];
+
+ if (d == NULL) continue;
+
assert(d->obshap.size() == 2);
- assert(strlen(d->obshap[0]) == cloc->snps.size()
- && strlen(d->obshap[1]) == cloc->snps.size());
- for (size_t snp=0; snp<cloc->snps.size(); ++snp) {
+ assert(strlen(d->obshap[0]) == cloc->snps.size() &&
+ strlen(d->obshap[1]) == cloc->snps.size());
+
+ for (size_t snp = 0; snp < cloc->snps.size(); snp++) {
size_t col = cloc->snps[snp]->col;
+
if (d->model[col] == 'U')
continue;
+ //
// Check this genotype's depth.
+ //
if (d->snpdata[snp].tot_depth < min_gt_depth) {
d->model[col] = 'U';
- for(size_t i=0; i<2; ++i)
+ this->_filtered_genotypes++;
+ for(size_t i = 0; i < 2; i++)
d->obshap[i][snp] = 'N';
+
+ if (verbose) {
+ log_fh << "pruned_polymorphic_site\t"
+ << cloc->id << "\t"
+ << cloc->loc.chr() << "\t"
+ << cloc->sort_bp(col) +1 << "\t"
+ << col << "\t"
+ << "genotype_depth" << "\t"
+ << mpopi.samples()[s].name << "\n";
+ }
}
}
}
@@ -1403,10 +1428,11 @@ LocusFilter::init(MetaPopInfo *mpopi)
this->_samples = new size_t [this->_sample_cnt];
this->_pop_tot = new size_t [this->_pop_cnt];
- this->_filtered_loci = 0;
- this->_total_loci = 0;
- this->_filtered_sites = 0;
- this->_total_sites = 0;
+ this->_filtered_loci = 0;
+ this->_total_loci = 0;
+ this->_filtered_sites = 0;
+ this->_filtered_genotypes = 0;
+ this->_total_sites = 0;
size_t pop_sthg = 0;
@@ -3704,8 +3730,9 @@ output_parameters(ostream &fh)
fh << "populations parameters selected:\n";
if (input_mode == InputMode::vcf)
fh << " Input mode: VCF\n";
- fh
- << " Percent samples limit per population: " << min_samples_per_pop << "\n"
+ if (min_gt_depth > 0)
+ fh << " Minimum read depth to include genotype: " << min_gt_depth << "\n";
+ fh << " Percent samples limit per population: " << min_samples_per_pop << "\n"
<< " Locus Population limit: " << min_populations << "\n"
<< " Percent samples overall: " << min_samples_overall << "\n"
<< " Minor allele frequency cutoff: " << minor_allele_freq << "\n"
@@ -3785,7 +3812,7 @@ parse_command_line(int argc, char* argv[])
{"whitelist", required_argument, NULL, 'W'},
{"blacklist", required_argument, NULL, 'B'},
{"batch-size", required_argument, NULL, 1999},
- {"dbg-min-gt-depth", required_argument, NULL, 2001},
+ {"min-gt-depth", required_argument, NULL, 2001},
{"write-single-snp", no_argument, NULL, 'I'},
{"write-random-snp", no_argument, NULL, 'j'},
{"no-hap-exports", no_argument, NULL, 1012},
@@ -3851,7 +3878,7 @@ parse_command_line(int argc, char* argv[])
batch_size = is_integer(optarg);
break;
case 2001:
- min_gt_depth = stol(optarg);
+ min_gt_depth = is_integer(optarg);
break;
case 'O':
out_path = optarg;
@@ -4277,6 +4304,7 @@ void help() {
<< " --min-maf [float]: specify a minimum minor allele frequency required to process a SNP (0 < min_maf < 0.5; applied to the metapopulation).\n"
<< " --min-mac [int]: specify a minimum minor allele count required to process a SNP (applied to the metapopulation).\n"
<< " --max-obs-het [float]: specify a maximum observed heterozygosity required to process a SNP (applied ot the metapopulation).\n"
+ << " --min-gt-depth [int]: specify a minimum number of reads to include a called SNP (otherwise marked as missing data).\n"
<< "\n"
<< " --write-single-snp: restrict data analysis to only the first SNP per locus (implies --no-haps).\n"
<< " --write-random-snp: restrict data analysis to one random SNP per locus (implies --no-haps).\n"
@@ -4342,7 +4370,6 @@ void help() {
#ifdef DEBUG
<< "\n"
<< "DEBUG:\n"
- << " --dbg-min-gt-depth\n"
<< " --genepop-haps-3digits: Use 3-digit alleles in the genepop haps output (default is 2-digit).\n"
#endif
;
=====================================
src/populations.h
=====================================
@@ -223,6 +223,7 @@ public:
this->_batch_total_loci = 0;
this->_batch_seen_loci = 0;
this->_filtered_sites = 0;
+ this->_filtered_genotypes = 0;
this->_total_sites = 0;
this->_variant_sites = 0;
}
@@ -255,7 +256,7 @@ public:
void filter_snps(LocBin& loc, const MetaPopInfo& mpopi, ostream &log_fh);
void filter_haps(LocBin& loc, const MetaPopInfo& mpopi, ostream &log_fh);
void filter_fixed_sites(LocBin& loc, const MetaPopInfo& mpopi, ostream &log_fh);
- void gt_depth_filter(Datum **d, const CSLocus *cloc);
+ void gt_depth_filter(Datum **d, const CSLocus *cloc, ostream &log_fh);
void keep_single_snp(CSLocus* cloc, Datum** d, size_t n_samples, const LocTally* t) const;
void keep_random_snp(CSLocus* cloc, Datum** d, size_t n_samples, const LocTally* t) const;
@@ -266,6 +267,7 @@ public:
size_t batch_seen() const { return this->_batch_seen_loci; }
size_t batch_total() const { return this->_batch_total_loci; }
size_t filtered_sites() const { return this->_filtered_sites; }
+ size_t filtered_genotypes() const { return this->_filtered_genotypes; }
size_t total_sites() const { return this->_total_sites; }
size_t variant_sites() const { return this->_variant_sites; }
void locus_seen();
@@ -291,6 +293,7 @@ private:
size_t _batch_total_loci;
size_t _batch_seen_loci;
size_t _filtered_sites;
+ size_t _filtered_genotypes;
size_t _total_sites;
size_t _variant_sites;
=====================================
src/process_radtags.cc
=====================================
@@ -1,6 +1,6 @@
// -*-mode:c++; c-style:k&r; c-basic-offset:4;-*-
//
-// Copyright 2011-2023, Julian Catchen <jcatchen at illinois.edu>
+// Copyright 2011-2024, Julian Catchen <jcatchen at illinois.edu>
//
// This file is part of Stacks.
//
@@ -37,12 +37,14 @@ using namespace std;
//
FileT in_file_type = FileT::unknown;
FileT out_file_type = FileT::unknown;
+InputT in_path_type = InputT::files;
string in_file;
string in_file_p1;
string in_file_p2;
string in_path_1;
string in_path_2;
string out_path;
+string out_basename;
string barcode_file;
string renz_1;
string renz_2;
@@ -63,6 +65,8 @@ bool discards = false;
bool overhang = false;
bool filter_illumina = false;
bool check_radtag = true;
+bool check_poly_g = false;
+bool force_poly_g = false;
uint truncate_seq = 0;
int barcode_dist_1 = 1;
int barcode_dist_2 = -1;
@@ -147,6 +151,10 @@ int main (int argc, char* argv[]) {
else
cerr << "Processing single-end data.\n";
cerr << "Using Phred+" << qual_offset << " encoding for quality scores.\n";
+ if (force_poly_g)
+ check_poly_g ?
+ cerr << "Checks for poly-G runs have been forced.\n" :
+ cerr << "Checks for poly-G runs have been disabled.\n";
if (truncate_seq > 0)
cerr << "Reads will be truncated to " << truncate_seq << "bp\n";
if (filter_illumina)
@@ -206,7 +214,11 @@ int main (int argc, char* argv[]) {
int result = 1;
for (uint i = 0; i < files.size(); i++) {
- cerr << "Processing file " << i+1 << " of " << files.size() << " [" << files[i].first.c_str() << "]\n";
+
+ if (paired)
+ cerr << "Processing file-pair " << i+1 << " of " << files.size() << "\n";
+ else
+ cerr << "Processing file " << i+1 << " of " << files.size() << "\n";
initialize_counters(counters[files[i].first]);
@@ -268,8 +280,10 @@ int main (int argc, char* argv[]) {
cerr << "-" << counters[files[i].first]["ambiguous"] << " ambiguous barcodes; "
<< "-" << counters[files[i].first]["noradtag"] << " ambiguous RAD-Tags; "
<< "+" << counters[files[i].first]["recovered"] << " recovered; "
- << "-" << counters[files[i].first]["low_quality"] << " low quality reads; "
- << counters[files[i].first]["retained"] << " retained reads.\n";
+ << "-" << counters[files[i].first]["low_quality"] << " low quality reads; ";
+ if (check_poly_g)
+ cerr << "-" << counters[files[i].first]["poly_g"] << " reads with poly-G runs; ";
+ cerr << counters[files[i].first]["retained"] << " retained reads.\n";
if (filter_adapter)
cerr << " "
<< counters[files[i].first]["adapter"] << " reads with adapter sequence.\n";
@@ -298,7 +312,7 @@ int main (int argc, char* argv[]) {
}
cerr << "done.\n";
- print_results(argc, argv, barcodes, counters, barcode_log);
+ print_results(argc, argv, out_basename, barcodes, counters, barcode_log);
cerr << "process_radtags is done.\n";
@@ -337,6 +351,7 @@ initialize_counters(map<string, long> &counter)
counter["ill_filtered"] = 0;
counter["adapter"] = 0;
counter["low_quality"] = 0;
+ counter["poly_g"] = 0;
counter["noradtag"] = 0;
counter["ambiguous"] = 0;
counter["retained"] = 0;
@@ -357,6 +372,7 @@ consolidate_counters(map<string, long> &counter,
counter["ill_filtered"] += cntr["ill_filtered"];
counter["adapter"] += cntr["adapter"];
counter["low_quality"] += cntr["low_quality"];
+ counter["poly_g"] += cntr["poly_g"];
counter["noradtag"] += cntr["noradtag"];
counter["ambiguous"] += cntr["ambiguous"];
counter["retained"] += cntr["retained"];
@@ -381,12 +397,14 @@ consolidate_barcode_logs(map<BarcodePair, map<string, long>> &barcode_log,
barcode_log[bc]["noradtag"] = 0;
barcode_log[bc]["total"] = 0;
barcode_log[bc]["low_qual"] = 0;
+ barcode_log[bc]["poly_g"] = 0;
barcode_log[bc]["retained"] = 0;
barcode_log[bc]["proppair"] = 0;
}
barcode_log[bc]["noradtag"] += it->second["noradtag"];
barcode_log[bc]["total"] += it->second["total"];
barcode_log[bc]["low_qual"] += it->second["low_qual"];
+ barcode_log[bc]["poly_g"] += it->second["poly_g"];
barcode_log[bc]["retained"] += it->second["retained"];
barcode_log[bc]["proppair"] += it->second["proppair"];
}
@@ -473,7 +491,7 @@ process_paired_reads(string prefix_1,
r_1 = new RawRead(strlen(s_1->seq), 1, min_bc_size_1, win_size);
r_2 = new RawRead(strlen(s_2->seq), 2, min_bc_size_2, win_size);
-
+
//
// Set len_limit so that if we encounter reads already shorter than truncate_seq limit
// they will be discarded.
@@ -508,6 +526,8 @@ process_paired_reads(string prefix_1,
parse_input_record(s_2, r_2);
counter["total"] += 2;
+ if (force_poly_g == false && r_1->check_poly_g()) check_poly_g = true;
+
//
// If a BestRAD protocol was used, check which read the barcode/restriction cutsite is on
// and transpose the reads if necessary.
@@ -748,9 +768,18 @@ process_paired_reads_parallel(string prefix_1,
}
map<BarcodePair, int> output_thread_map;
- uint k = 0;
+ //
+ // We must take care because multiple barcodes can point to the same output file,
+ // so we must assign all barcodes for a single output file to the same thread for writing.
+ //
+ map<fhType *, vector<BarcodePair>> rev_fh_map;
for (auto it = pair_1_fhs.begin(); it != pair_1_fhs.end(); it++) {
- output_thread_map[it->first] = (k % num_output_threads) + num_input_threads;
+ rev_fh_map[it->second].push_back(it->first);
+ }
+ uint k = 0;
+ for (auto it = rev_fh_map.begin(); it != rev_fh_map.end(); it++) {
+ for (uint j = 0; j < it->second.size(); j++)
+ output_thread_map[it->second[j]] = (k % num_output_threads) + num_input_threads;
k++;
}
@@ -998,11 +1027,13 @@ process_paired_reads_parallel(string prefix_1,
r_1 = new RawRead(seq_init_len_1, 1, min_bc_size_1, win_size);
r_2 = new RawRead(seq_init_len_2, 2, min_bc_size_2, win_size);
-
+
parse_input_record(s_1, r_1);
parse_input_record(s_2, r_2);
cntr["total"] += 2;
+ if (force_poly_g == false && r_1->check_poly_g()) check_poly_g = true;
+
//
// If a BestRAD protocol was used, check which read the barcode/restriction cutsite is on
// and transpose the reads if necessary.
@@ -1170,7 +1201,7 @@ process_reads(string prefix,
}
r = new RawRead(strlen(s->seq), 1, min_bc_size_1, win_size);
-
+
BarcodePair bc;
//
// If no barcodes were specified, set the barcode object to be the input file name so
@@ -1201,6 +1232,8 @@ process_reads(string prefix,
parse_input_record(s, r);
+ if (force_poly_g == false && r->check_poly_g()) check_poly_g = true;
+
if (barcode_type == inline_null ||
barcode_type == index_null)
bc.set(r->se_bc);
@@ -1345,9 +1378,18 @@ process_reads_parallel(string prefix,
}
map<BarcodePair, int> output_thread_map;
- uint k = 0;
+ //
+ // We must take care because multiple barcodes can point to the same output file,
+ // so we must assign all barcodes for a single output file to the same thread for writing.
+ //
+ map<fhType *, vector<BarcodePair>> rev_fh_map;
for (auto it = pair_1_fhs.begin(); it != pair_1_fhs.end(); it++) {
- output_thread_map[it->first] = (k % num_output_threads) + num_input_threads;
+ rev_fh_map[it->second].push_back(it->first);
+ }
+ uint k = 0;
+ for (auto it = rev_fh_map.begin(); it != rev_fh_map.end(); it++) {
+ for (uint j = 0; j < it->second.size(); j++)
+ output_thread_map[it->second[j]] = (k % num_output_threads) + num_input_threads;
k++;
}
@@ -1557,6 +1599,8 @@ process_reads_parallel(string prefix,
parse_input_record(s, r);
+ if (force_poly_g == false && r->check_poly_g()) check_poly_g = true;
+
bc = new BarcodePair();
//
// If no barcodes were specified, set the barcode object to be the input file name so
@@ -1708,6 +1752,20 @@ process_singlet(RawRead *href,
return 0;
}
+ //
+ // Drop this sequence if it has a run of poly-Gs at the 3' end.
+ // Either the user has forced the check on, and check_poly_g is true, or we turned it on
+ // internally via href->check_poly_g because of the machine type that generated the data.
+ //
+ if ((check_poly_g || href->check_poly_g()) &&
+ check_poly_g_run(href, href->inline_bc_len, len_limit) <= 0) {
+ counter["poly_g"]++;
+ if (barcode_type != null_null)
+ bc_log["poly_g"]++;
+ href->retain = 0;
+ return 0;
+ }
+
//
// Drop this sequence if it contains adapter sequence.
//
@@ -1896,6 +1954,7 @@ int dist(const char *res_enz, char *seq) {
int
print_results(int argc, char **argv,
+ string out_basename,
vector<BarcodePair> &barcodes,
map<string, map<string, long> > &counters,
map<BarcodePair, map<string, long> > &barcode_log)
@@ -1903,9 +1962,12 @@ print_results(int argc, char **argv,
map<string, map<string, long> >::iterator it;
string log_path = out_path + "process_radtags.log";
- if (!in_path_1.empty()) {
+
+ if (in_path_type == InputT::directory) {
+ //
// In directory mode, use `$out_path/process_radtags.$(basename $in_path).log`.
// For consistency we always use realpath().
+ //
char abspath [PATH_MAX];
char* rv = realpath(in_path_1.c_str(), abspath);
if (rv == NULL) {
@@ -1917,6 +1979,13 @@ print_results(int argc, char **argv,
size_t p = abspath_s.find_last_of('/');
string in_dir_name = abspath_s.substr(p+1);
log_path = out_path + "process_radtags." + in_dir_name + ".log";
+
+ } else if (in_path_type == InputT::files && out_basename.length() > 0) {
+ //
+ // We are executing on a single (-f) or pair of files (-1/-2), and user has
+ // supplied an output basename.
+ //
+ log_path = out_path + out_basename + ".log";
}
ofstream log(log_path.c_str());
@@ -1943,8 +2012,10 @@ print_results(int argc, char **argv,
log << "Adapter Seq" << "\t";
if (bestrad)
log << "Transposed Reads" << "\t";
- log << "Low Quality\t"
- << "Barcode Not Found\t"
+ log << "Low Quality\t";
+ if (check_poly_g)
+ log << "Poly-G\t";
+ log << "Barcode Not Found\t"
<< "RAD cutsite Not Found\t"
<< "Total\n";
@@ -1957,8 +2028,10 @@ print_results(int argc, char **argv,
log << it->second["adapter"] << "\t";
if (bestrad)
log << it->second["transposed"] << "\t";
- log << it->second["low_quality"] << "\t"
- << it->second["ambiguous"] << "\t"
+ log << it->second["low_quality"] << "\t";
+ if (check_poly_g)
+ log << it->second["poly_g"] << "\t";
+ log << it->second["ambiguous"] << "\t"
<< it->second["noradtag"] << "\t"
<< it->second["total"] << "\n";
}
@@ -1967,6 +2040,7 @@ print_results(int argc, char **argv,
map<string, long> c;
c["total"] = 0;
c["low_quality"] = 0;
+ c["poly_g"] = 0;
c["adapter"] = 0;
c["ill_filtered"] = 0;
c["ambiguous"] = 0;
@@ -1981,6 +2055,7 @@ print_results(int argc, char **argv,
c["total"] += it->second["total"];
c["ill_filtered"] += it->second["ill_filtered"];
c["low_quality"] += it->second["low_quality"];
+ c["poly_g"] += it->second["poly_g"];
c["adapter"] += it->second["adapter"];
c["ambiguous"] += it->second["ambiguous"];
c["noradtag"] += it->second["noradtag"];
@@ -2005,6 +2080,7 @@ print_results(int argc, char **argv,
print_nreads(c["adapter"], "reads contained adapter sequence");
print_nreads(c["ambiguous"], "barcode not found drops");
print_nreads(c["low_quality"], "low quality read drops");
+ print_nreads(c["poly_g"], "poly-G run drops");
print_nreads(c["noradtag"], "RAD cutsite not found drops");
print_nreads(c["retained"], "retained reads");
@@ -2031,8 +2107,10 @@ print_results(int argc, char **argv,
if (filter_adapter)
log << "Reads containing adapter sequence\t" << c["adapter"] << "\t" << as_percentage((double) c["adapter"] / c["total"]) << "\n";
log << "Barcode Not Found\t" << c["ambiguous"] << "\t" << as_percentage((double) c["ambiguous"] / c["total"]) << "\n"
- << "Low Quality\t" << c["low_quality"] << "\t" << as_percentage((double) c["low_quality"] / c["total"]) << "\n"
- << "RAD Cutsite Not Found\t" << c["noradtag"] << "\t" << as_percentage((double) c["noradtag"] / c["total"]) << "\n"
+ << "Low Quality\t" << c["low_quality"] << "\t" << as_percentage((double) c["low_quality"] / c["total"]) << "\n";
+ if (check_poly_g)
+ log << "Poly-G Runs\t" << c["poly_g"] << "\t" << as_percentage((double) c["poly_g"] / c["total"]) << "\n";
+ log << "RAD Cutsite Not Found\t" << c["noradtag"] << "\t" << as_percentage((double) c["noradtag"] / c["total"]) << "\n"
<< "Retained Reads\t" << c["retained"] << "\t" << as_percentage((double) c["retained"] / c["total"]) << "\n";
if (paired)
log << "Properly Paired\t" << c["proppair"] << "\t" << as_percentage(((double) c["proppair"]*2.0) / c["total"]) << "\n";
@@ -2060,8 +2138,10 @@ print_results(int argc, char **argv,
log << "Filename\t";
log << "Total\t"
<< "RAD Cutsite Not Found\t"
- << "Low Quality\t"
- << "Retained Reads\t"
+ << "Low Quality\t";
+ if (check_poly_g)
+ log << "Poly-G Runs\t";
+ log << "Retained Reads\t"
<< "Pct Retained\t";
if (paired)
log << "Properly Paired\t"
@@ -2078,11 +2158,13 @@ print_results(int argc, char **argv,
log << barcodes[i].name << "\t";
if (barcode_log.count(barcodes[i]) == 0)
- log << "0\t" << "0\t" << "0\t" << "0\n";
+ log << "0\t" << "0\t" << "0\t" << "0\t" << "0\n";
else {
log << barcode_log[barcodes[i]]["total"] << "\t"
- << barcode_log[barcodes[i]]["noradtag"] << "\t"
- << barcode_log[barcodes[i]]["low_qual"] << "\t"
+ << barcode_log[barcodes[i]]["noradtag"] << "\t";
+ if (check_poly_g)
+ log << barcode_log[barcodes[i]]["poly_g"] << "\t";
+ log << barcode_log[barcodes[i]]["low_qual"] << "\t"
<< barcode_log[barcodes[i]]["retained"] << "\t"
<< as_percentage((double) barcode_log[barcodes[i]]["retained"] / barcode_log[barcodes[i]]["total"]) << "\t";
if (paired)
@@ -2142,6 +2224,8 @@ int parse_command_line(int argc, char* argv[]) {
{"interleaved", no_argument, NULL, 'I'},
{"merge", no_argument, NULL, 'm'},
{"disable-rad-check", no_argument, NULL, 'R'}, {"disable_rad_check", no_argument, NULL, 'R'},
+ {"disable-poly-g-check", no_argument, NULL, 1004},
+ {"force-poly-g-check", no_argument, NULL, 1005},
{"filter-illumina", no_argument, NULL, 'F'},
{"retain-header", no_argument, NULL, 'H'},
{"bestrad", no_argument, NULL, 1000},
@@ -2160,6 +2244,7 @@ int parse_command_line(int argc, char* argv[]) {
{"file", required_argument, NULL, 'f'},
{"file-p1", required_argument, NULL, '1'}, {"file_p1", required_argument, NULL, '1'},
{"file-p2", required_argument, NULL, '2'}, {"file_p2", required_argument, NULL, '2'},
+ {"basename", required_argument, NULL, 1003},
{"in-path", required_argument, NULL, 'p'},
{"out-path", required_argument, NULL, 'o'},
{"truncate", required_argument, NULL, 't'},
@@ -2224,8 +2309,9 @@ int parse_command_line(int argc, char* argv[]) {
in_file = optarg;
break;
case 'p':
- in_path_1 = optarg;
- in_path_2 = in_path_1;
+ in_path_1 = optarg;
+ in_path_2 = in_path_1;
+ in_path_type = InputT::directory;
break;
case '1':
paired = true;
@@ -2235,6 +2321,9 @@ int parse_command_line(int argc, char* argv[]) {
paired = true;
in_file_p2 = optarg;
break;
+ case 1003:
+ out_basename = optarg;
+ break;
case 'P':
paired = true;
break;
@@ -2293,6 +2382,14 @@ int parse_command_line(int argc, char* argv[]) {
case 'R':
check_radtag = false;
break;
+ case 1004:
+ force_poly_g = true;
+ check_poly_g = false;
+ break;
+ case 1005:
+ force_poly_g = true;
+ check_poly_g = true;
+ break;
case 'F':
filter_illumina = true;
break;
@@ -2388,6 +2485,11 @@ int parse_command_line(int argc, char* argv[]) {
if (in_path_2.length() > 0 && in_path_2.at(in_path_2.length() - 1) != '/')
in_path_2 += "/";
+ if (in_path_1.length() > 0 && out_basename.length() > 0) {
+ cerr << "You can only specify --basename with the -f or -1/-2 options.\n";
+ help();
+ }
+
if (out_path.length() == 0)
out_path = ".";
@@ -2477,19 +2579,18 @@ void help() {
<< " -p,--in-path: path to a directory of files.\n"
<< " -P,--paired: files contained within the directory are paired.\n"
<< " -I,--interleaved: specify that the paired-end reads are interleaved in single files.\n"
- << " -i,--in-type: input file type, either 'fastq', 'gzfastq' (gzipped fastq), 'bam', or 'bustard' (default: guess, or gzfastq if unable to).\n"
<< " -b,--barcodes: path to a file containing barcodes for this run, omit to ignore any barcoding or for already demultiplexed data.\n"
- << " -o,--out-path: path to output the processed files.\n"
<< " -f: path to the input file if processing single-end sequences.\n"
<< " -1: first input file in a set of paired-end sequences.\n"
<< " -2: second input file in a set of paired-end sequences.\n"
+ << " -o,--out-path: path to output the processed files.\n"
+ << " --basename: specify the prefix of the output files when using -f or -1/-2.\n\n"
<< " --threads: number of threads to run.\n"
<< " -c,--clean: clean data, remove any read with an uncalled base ('N').\n"
<< " -q,--quality: discard reads with low quality (phred) scores.\n"
<< " -r,--rescue: rescue barcodes and RAD-Tag cut sites.\n"
<< " -t,--truncate: truncate final read length to this value.\n"
<< " -D,--discards: capture discarded reads to a file.\n"
- << " -y,--out-type: output type, either 'fastq', 'gzfastq', 'fasta', or 'gzfasta' (default: match input type).\n"
<< "\n"
<< " Barcode options:\n"
<< " --inline-null: barcode is inline with sequence, occurs only on single-end read (default).\n"
@@ -2531,12 +2632,16 @@ void help() {
<< " --adapter-1 <sequence>: provide adaptor sequence that may occur on the single-end read for filtering.\n"
<< " --adapter-2 <sequence>: provide adaptor sequence that may occur on the paired-read for filtering.\n"
<< " --adapter-mm <mismatches>: number of mismatches allowed in the adapter sequence.\n\n"
- << " Output options:\n"
+ << " Input/Output options:\n"
+ << " --in-type: input file type, either 'fastq', 'gzfastq' (gzipped fastq), 'bam', or 'bustard' (default: guess, or gzfastq if unable to).\n"
+ << " --out-type: output type, either 'fastq', 'gzfastq', 'fasta', or 'gzfasta' (default: match input type).\n"
<< " --retain-header: retain unmodified FASTQ headers in the output.\n"
<< " --merge: if no barcodes are specified, merge all input files into a single output file.\n\n"
<< " Advanced options:\n"
<< " --filter-illumina: discard reads that have been marked by Illumina's chastity/purity filter as failing.\n"
<< " --disable-rad-check: disable checking if the RAD cut site is intact.\n"
+ << " --force-poly-g-check: force a check for runs of poly-Gs (default: autodetect based on machine type in FASTQ header).\n"
+ << " --disable-poly-g-check: disable checking for runs of poly-Gs (default: autodetect based on machine type in FASTQ header).\n"
<< " --encoding: specify how quality scores are encoded, 'phred33' (Illumina 1.8+/Sanger, default) or 'phred64' (Illumina 1.3-1.5).\n"
<< " --window-size: set the size of the sliding window as a fraction of the read length, between 0 and 1 (default 0.15).\n"
<< " --score-limit: set the phred score limit. If the average score within the sliding window drops below this value, the read is discarded (default 10).\n"
=====================================
src/process_radtags.h
=====================================
@@ -92,7 +92,7 @@ int check_for_transposed_reads(RawRead *, RawRead *, string);
int correct_radtag(RawRead *, string, map<string, long> &);
int check_quality_scores(RawRead *, bool);
int dist(const char *, char *);
-int print_results(int, char **, vector<BarcodePair> &, map<string, map<string, long> > &, map<BarcodePair, map<string, long> > &);
+int print_results(int, char **, string, vector<BarcodePair> &, map<string, map<string, long> > &, map<BarcodePair, map<string, long> > &);
int compare_barcodes(pair<BarcodePair, int>, pair<BarcodePair, int>);
=====================================
src/process_shortreads.cc
=====================================
@@ -31,12 +31,14 @@
//
FileT in_file_type = FileT::unknown;
FileT out_file_type = FileT::unknown;
+InputT in_path_type = InputT::files;
string in_file;
string in_file_p1;
string in_file_p2;
string in_path_1;
string in_path_2;
string out_path;
+string out_basename;
string barcode_file;
char *adapter_1;
char *adapter_2;
=====================================
src/sstacks.cc
=====================================
@@ -1,6 +1,6 @@
// -*-mode:c++; c-style:k&r; c-basic-offset:4;-*-
//
-// Copyright 2010-2023, Julian Catchen <jcatchen at illinois.edu>
+// Copyright 2010-2024, Julian Catchen <jcatchen at illinois.edu>
//
// This file is part of Stacks.
//
@@ -1298,21 +1298,21 @@ int parse_command_line(int argc, char* argv[]) {
{"help", no_argument, NULL, 'h'},
{"version", no_argument, NULL, 'v'},
{"aligned", no_argument, NULL, 'g'},
- {"verify-hap", no_argument, NULL, 'x'}, {"verify_hap", no_argument, NULL, 'x'},
- {"uniq-haplotypes", no_argument, NULL, 'u'}, {"uniq_haplotypes", no_argument, NULL, 'u'},
- {"disable-gapped", no_argument, NULL, 'G'}, {"disable_gapped", no_argument, NULL, 'G'},
- {"threads", required_argument, NULL, 'p'},
+ {"verify-hap", no_argument, NULL, 'x'},
+ {"uniq-haplotypes", no_argument, NULL, 'u'},
+ {"disable-gapped", no_argument, NULL, 'G'},
+ {"threads", required_argument, NULL, 't'},
{"catalog", required_argument, NULL, 'c'},
{"sample", required_argument, NULL, 's'},
- {"out-path", required_argument, NULL, 'o'}, {"out_path", required_argument, NULL, 'o'},
- {"in-path", required_argument, NULL, 'P'}, {"in_path", required_argument, NULL, 'P'},
+ {"out-path", required_argument, NULL, 'o'},
+ {"in-path", required_argument, NULL, 'P'},
{"popmap", required_argument, NULL, 'M'},
{"write-all-matches", no_argument, NULL, 2001},
{0, 0, 0, 0}
};
int option_index = 0;
- int c = getopt_long(argc, argv, "hgGxuvs:c:o:p:P:M:", long_options, &option_index);
+ int c = getopt_long(argc, argv, "hgGxuvs:c:o:p:t:P:M:", long_options, &option_index);
// Detect the end of the options.
if (c == -1)
@@ -1323,6 +1323,7 @@ int parse_command_line(int argc, char* argv[]) {
help();
break;
case 'p':
+ case 't':
num_threads = atoi(optarg);
break;
case 's':
@@ -1435,13 +1436,13 @@ void version() {
void help() {
cerr << "sstacks " << VERSION << "\n"
- << "sstacks -P dir -M popmap [-p n_threads]" << "\n"
- << "sstacks -c catalog_path -s sample_path [-s sample_path ...] -o path [-p n_threads]" << "\n"
+ << "sstacks -P dir -M popmap [-t n_threads]" << "\n"
+ << "sstacks -c catalog_path -s sample_path [-s sample_path ...] -o path [-t n_threads]" << "\n"
<< " -P,--in-path: path to the directory containing Stacks files.\n"
<< " -M,--popmap: path to a population map file from which to take sample names.\n"
<< " -s,--sample: filename prefix from which to load sample loci." << "\n"
<< " -c,--catalog: path to the catalog." << "\n"
- << " -p,--threads: enable parallel execution with n_threads threads.\n"
+ << " -t,--threads: enable parallel execution with n_threads threads.\n"
<< " -o,--out-path: output path to write results." << "\n"
<< " -x: don't verify haplotype of matching locus." << "\n"
<< "\n"
=====================================
src/ustacks.cc
=====================================
@@ -2293,7 +2293,7 @@ calc_coverage_distribution(map<int, MergedStack *> &merged,
tot_depth = 0.0;
size_t not_blacklisted = 0;
- for (const pair<int, MergedStack*>& mtag : merged) {
+ for (const pair<const int, MergedStack*>& mtag : merged) {
if (mtag.second->blacklisted)
continue;
++not_blacklisted;
@@ -2305,7 +2305,7 @@ calc_coverage_distribution(map<int, MergedStack *> &merged,
}
mean = tot_depth / not_blacklisted;
- for (const pair<int, MergedStack*>& mtag : merged)
+ for (const pair<const int, MergedStack*>& mtag : merged)
if (!mtag.second->blacklisted)
stdev += pow(mtag.second->count - mean, 2);
stdev /= not_blacklisted;
@@ -2785,30 +2785,30 @@ int parse_command_line(int argc, char* argv[]) {
static struct option long_options[] = {
{"help", no_argument, NULL, 'h'},
{"version", no_argument, NULL, 'v'},
- {"infile-type", required_argument, NULL, 't'}, {"infile_type", required_argument, NULL, 't'},
+ {"in-type", required_argument, NULL, 't'}, {"infile-type", required_argument, NULL, 't'},
{"file", required_argument, NULL, 'f'},
- {"outpath", required_argument, NULL, 'o'},
- {"name", required_argument, NULL, 1002},
+ {"out-path", required_argument, NULL, 'o'},
+ {"name", required_argument, NULL, 'n'},
{"id", required_argument, NULL, 'i'},
- {"min-cov", required_argument, NULL, 'm'}, {"min_cov", required_argument, NULL, 'm'},
- {"max-dist", required_argument, NULL, 'M'}, {"max_dist", required_argument, NULL, 'M'},
- {"max-sec-dist", required_argument, NULL, 'N'}, {"max_sec_dist", required_argument, NULL, 'N'},
- {"max-locus-stacks", required_argument, NULL, 'K'}, {"max_locus_stacks", required_argument, NULL, 'K'},
- {"k-len", required_argument, NULL, 'k'}, {"k_len", required_argument, NULL, 'k'},
- {"num-threads", required_argument, NULL, 'p'}, {"num_threads", required_argument, NULL, 'p'},
+ {"min-cov", required_argument, NULL, 'm'}, {"min_cov", required_argument, NULL, 'm'},
+ {"max-dist", required_argument, NULL, 'M'}, {"max_dist", required_argument, NULL, 'M'},
+ {"max-sec-dist", required_argument, NULL, 'N'}, {"max_sec_dist", required_argument, NULL, 'N'},
+ {"max-locus-stacks", required_argument, NULL, 'K'}, {"max_locus_stacks", required_argument, NULL, 'K'},
+ {"k-len", required_argument, NULL, 'k'}, {"k_len", required_argument, NULL, 'k'},
+ {"threads", required_argument, NULL, 'p'},
{"deleverage", no_argument, NULL, 'd'},
{"keep-high-cov", no_argument, NULL, 1000}, {"keep_high_cov", no_argument, NULL, 1000},
{"high-cov-thres", required_argument, NULL, 1001}, {"high_cov_thres", required_argument, NULL, 1001},
- {"retain-rem", no_argument, NULL, 'R'}, {"retain_rem", no_argument, NULL, 'R'},
+ {"retain-rem", no_argument, NULL, 'R'}, {"retain_rem", no_argument, NULL, 'R'},
{"graph", no_argument, NULL, 'g'},
- {"sec-hapl", no_argument, NULL, 'H'}, {"sec_hapl", no_argument, NULL, 'H'},
+ {"sec-hapl", no_argument, NULL, 'H'},
{"disable-gapped", no_argument, NULL, 'G'},
- {"max-gaps", required_argument, NULL, 'X'}, {"max_gaps", required_argument, NULL, 'X'},
- {"min-aln-len", required_argument, NULL, 'x'}, {"min_aln_len", required_argument, NULL, 'x'},
- {"model-type", required_argument, NULL, 'T'}, {"model_type", required_argument, NULL, 'T'},
- {"bc-err-freq", required_argument, NULL, 'e'}, {"bc_err_freq", required_argument, NULL, 'e'},
- {"bound-low", required_argument, NULL, 'L'}, {"bound_low", required_argument, NULL, 'L'},
- {"bound-high", required_argument, NULL, 'U'}, {"bound_high", required_argument, NULL, 'U'},
+ {"max-gaps", required_argument, NULL, 'X'},
+ {"min-aln-len", required_argument, NULL, 'x'},
+ {"model-type", required_argument, NULL, 'T'},
+ {"bc-err-freq", required_argument, NULL, 'e'},
+ {"bound-low", required_argument, NULL, 'L'},
+ {"bound-high", required_argument, NULL, 'U'},
{"alpha", required_argument, NULL, 'A'},
{"r-deprecated", no_argument, NULL, 'r'},
{"force-diff-len", no_argument, NULL, 1003}, {"force_diff_len", no_argument, NULL, 1003},
@@ -2818,7 +2818,7 @@ int parse_command_line(int argc, char* argv[]) {
// getopt_long stores the option index here.
int option_index = 0;
- c = getopt_long(argc, argv, "GhHvdrgRA:L:U:f:o:s:i:m:e:p:t:M:N:K:k:T:X:x:", long_options, &option_index);
+ c = getopt_long(argc, argv, "GhHvdrgRA:L:U:f:o:s:i:m:n:e:p:t:M:N:K:k:T:X:x:", long_options, &option_index);
// Detect the end of the options.
if (c == -1)
@@ -2828,7 +2828,7 @@ int parse_command_line(int argc, char* argv[]) {
case 'h':
help();
break;
- case 't':
+ case 'i':
if (strcmp(optarg, "tsv") == 0)
in_file_type = FileT::tsv;
else if (strcmp(optarg, "fasta") == 0)
@@ -2848,11 +2848,9 @@ int parse_command_line(int argc, char* argv[]) {
case 'o':
out_path = optarg;
break;
- case 1002:
+ case 'n':
sample_name = optarg;
break;
- case 'i':
- break;
case 'm':
min_merge_cov = is_integer(optarg);
break;
@@ -2925,6 +2923,7 @@ int parse_command_line(int argc, char* argv[]) {
call_sec_hapl = false;
break;
case 'p':
+ case 't':
num_threads = is_integer(optarg);
break;
case 'v':
@@ -2933,9 +2932,6 @@ int parse_command_line(int argc, char* argv[]) {
case 1003:
force_diff_len = true;
break;
- case 'r': // deprecated Dec 2016, v1.45
- cerr << "Warning: Ignoring deprecated option -r (this has become the default).\n";
- break;
case '?':
// getopt_long already printed an error message.
help();
@@ -3024,29 +3020,29 @@ void version() {
void help() {
cerr << "ustacks " << VERSION << "\n"
- << "ustacks -f file_path -o path [-M max_dist] [-m min_cov] [-p num_threads]" << "\n"
- << " f: input file path.\n"
- << " o: output path to write results.\n"
- << " M: Maximum distance (in nucleotides) allowed between stacks (default 2).\n"
- << " m: Minimum depth of coverage required to create a stack (default 3).\n"
- << " N: Maximum distance allowed to align secondary reads to primary stacks (default: M + 2).\n"
- << " p: enable parallel execution with num_threads threads.\n"
- << " t: input file type. Supported types: fasta, fastq, gzfasta, or gzfastq (default: guess).\n"
- << " --name: a name for the sample (default: input file name minus the suffix).\n"
- << " R: retain unused reads.\n"
- << " H: disable calling haplotypes from secondary reads.\n"
+ << "ustacks -f in_path -o out_path [-M max_dist] [-m min_reads] [-t num_threads]" << "\n"
+ << " -f,--file: input file path.\n"
+ << " -o,--out-path: output path to write results.\n"
+ << " -M: Maximum distance (in nucleotides) allowed between stacks (default 2).\n"
+ << " -m: Minimum number of reads to seed a new stack (default 3).\n"
+ << " -N: Maximum distance allowed to align secondary reads to primary stacks (default: M + 2).\n"
+ << " -t,--threads: enable parallel execution with num_threads threads.\n"
+ << " -i,--in-type: input file type. Supported types: fasta, fastq, gzfasta, or gzfastq (default: guess).\n"
+ << " -n,--name: a name for the sample (default: input file name minus the suffix).\n"
+ << " -R: retain unused reads.\n"
+ << " -H: disable calling haplotypes from secondary reads.\n"
<< "\n"
<< " Stack assembly options:\n"
<< " --force-diff-len: allow raw input reads of different lengths, e.g. after trimming (default: ustacks perfers raw input reads of uniform length).\n"
<< " --keep-high-cov: disable the algorithm that removes highly-repetitive stacks and nearby errors.\n"
<< " --high-cov-thres: highly-repetitive stacks threshold, in standard deviation units (default: 3.0).\n"
<< " --max-locus-stacks <num>: maximum number of stacks at a single de novo locus (default 3).\n"
- << " --k-len <len>: specify k-mer size for matching between alleles and loci (automatically calculated by default).\n"
+ << " --k-len <len>: specify k-mer size for matching between alleles and loci (automatically calculated by default).\n"
<< " --deleverage: enable the Deleveraging algorithm, used for resolving over merged tags.\n\n"
<< " Gapped assembly options:\n"
<< " --max-gaps: number of gaps allowed between stacks before merging (default: 2).\n"
- << " --min-aln-len: minimum length of aligned sequence in a gapped alignment (default: 0.80).\n\n"
- << " --disable-gapped: do not preform gapped alignments between stacks (default: gapped alignements enabled).\n"
+ << " --min-aln-len: minimum length of aligned sequence in a gapped alignment (default: 0.80).\n"
+ << " --disable-gapped: do not preform gapped alignments between stacks (default: gapped alignements enabled).\n\n"
<< " Model options:\n"
<< " --model-type: either 'snp' (default), 'bounded', or 'fixed'\n"
<< " For the SNP or Bounded SNP model:\n"
=====================================
src/write.cc
=====================================
@@ -47,7 +47,7 @@ write_fasta(ofstream *fh, RawRead *href, bool overhang) {
href->seq + offset << "\n";
else
*fh <<
- ">" << href->machine <<
+ ">" << href->machine() <<
"/" << href->read << "\n" <<
href->seq + offset << "\n";
@@ -76,7 +76,7 @@ write_fasta(gzFile *fh, RawRead *href, bool overhang) {
href->seq + offset << "\n";
else
sstr <<
- ">" << href->machine <<
+ ">" << href->machine() <<
"/" << href->read << "\n" <<
href->seq + offset << "\n";
@@ -135,7 +135,7 @@ write_fastq(ofstream *fh, RawRead *href, bool overhang) {
href->phred + offset << "\n";
else
*fh <<
- "@" << href->machine <<
+ "@" << href->machine() <<
"/" << href->read << "\n" <<
href->seq + offset << "\n" <<
"+\n" <<
@@ -171,7 +171,7 @@ write_fastq(gzFile *fh, RawRead *href, bool overhang) {
href->phred + offset << "\n";
else
sstr <<
- "@" << href->machine <<
+ "@" << href->machine() <<
"/" << href->read << "\n" <<
href->seq + offset << "\n" <<
"+\n" <<
View it on GitLab: https://salsa.debian.org/med-team/stacks/-/compare/bc25962eae53b019fab295e48549cd0c4514e9c9...eb484bbc6989a78e7e77fa19ec64d746407a72e9
--
View it on GitLab: https://salsa.debian.org/med-team/stacks/-/compare/bc25962eae53b019fab295e48549cd0c4514e9c9...eb484bbc6989a78e7e77fa19ec64d746407a72e9
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20240810/e9b6bd0e/attachment-0001.htm>
More information about the debian-med-commit
mailing list