[med-svn] [Git][med-team/bcbio][upstream] New upstream version 1.2.6

Steffen Möller gitlab at salsa.debian.org
Wed Feb 10 17:38:17 GMT 2021



Steffen Möller pushed to branch upstream at Debian Med / bcbio


Commits:
5c8848ab by Steffen Moeller at 2021-02-10T16:24:07+01:00
New upstream version 1.2.6
- - - - -


28 changed files:

- HISTORY.md
- MANIFEST.in
- README.md
- bcbio/dragen/dragen.py
- bcbio/pipeline/main.py
- bcbio/pipeline/rnaseq.py
- bcbio/pipeline/run_info.py
- bcbio/provenance/programs.py
- bcbio/variation/annotation.py
- bcbio/variation/vardict.py
- config/examples/rnaseq-seqc-getdata.sh
- config/examples/rnaseq-seqc.yaml
- config/examples/seqc.csv
- config/genomes/GRCh37-resources.yaml
- config/genomes/hg19-resources.yaml
- config/genomes/hg38-resources.yaml
- + config/templates/purecn_ton.yaml
- config/vcfanno/GRCh37-gemini.conf
- config/vcfanno/hg38-gemini.conf
- config/vcfanno/somatic.conf
- docs/contents/bulk_rnaseq.md
- docs/contents/configuration.md
- docs/contents/installation.md
- docs/contents/purecn.md
- requirements-conda.txt
- requirements.txt
- scripts/bcbio_setup_genome.py
- setup.py


Changes:

=====================================
HISTORY.md
=====================================
@@ -1,4 +1,9 @@
-## 1.2.6 (in progress)
+## 1.2.7 (in progress)
+
+## 1.2.6 (04 February 2021)
+- RNASeq: Fail more gracefully if SummarizedExperiment object cannot be created.
+- Fixes to handle DRAGEN BAM files from the first stage of UMI processing.
+- Fix issue with double-annotating with dbSNP. Separating out somatic variant annotation into it's own vcfanno configuration.
 
 ## 1.2.5 (01 January 2021)
 - Joint calling for RNA-seq variant calling requires setting `jointcaller` to bring it in line


=====================================
MANIFEST.in
=====================================
@@ -9,5 +9,7 @@ include tests/data/*.csv
 include tests/data/automated/*.yaml
 include tests/data/automated/*.ini
 include tests/data/automated/tool-data/*.loc
+include tests/data/variants/*.vcf
+include tests/data/variants/*.tbi
 include bcbio/scripts/R/*.R
 include bcbio/scripts/R/*.Rmd


=====================================
README.md
=====================================
@@ -8,11 +8,11 @@ Validated, scalable, community developed variant calling, RNA-seq and small RNA
 
 ## Features
 
-* Community developed: We welcome contributors with the goal of overcoming the biological, algorithmic and computational challenges that face individual developers working on complex pipelines in quickly changing research areas. See our [users page](https://bcbio-nextgen.readthedocs.io/en/latest/contents/about.html#users) for examples of bcbio-nextgen deployments, and the [developer documentation](https://bcbio-nextgen.readthedocs.io/en/latest/contents/development.html) for tips on contributing.
+* Community developed: We welcome contributors with the goal of overcoming the biological, algorithmic and computational challenges that face individual developers working on complex pipelines in quickly changing research areas. See our [users page](https://bcbio-nextgen.readthedocs.io/en/latest/contents/users.html) for examples of bcbio-nextgen deployments, and the [developer documentation](https://bcbio-nextgen.readthedocs.io/en/latest/contents/development.html) for tips on contributing.
 * Installation: [A single installer script](https://bcbio-nextgen.readthedocs.io/en/latest/contents/installation.html#automated) prepares all third party software, data libraries and system configuration files.
 * [Automated validation](https://bcb.io/2014/05/12/wgs-trio-variant-evaluation/): Compare variant calls against common reference materials or sample specific SNP arrays to ensure call correctness. Incorporation of multiple approaches for alignment, preparation and variant calling enable unbiased comparisons of algorithms.
 * Distributed: Focus on [parallel analysis and scaling](https://bcb.io/2013/05/22/scaling-variant-detection-pipelines-for-whole-genome-sequencing-analysis/) to handle large population studies and whole genome analysis. Runs on single multicore computers, in compute clusters using [IPython parallel](https://ipyparallel.readthedocs.io/en/latest/), or on the Amazon cloud. See the [parallel documentation](https://bcbio-nextgen.readthedocs.org/en/latest/contents/parallel.html) for full details.
-* Multiple analysis algorithms: bcbio-nextgen provides configurable [variant calling, RNA-seq and small RNA pipelines](https://bcbio-nextgen.readthedocs.io/en/latest/contents/pipelines.html).
+* Multiple analysis algorithms: bcbio-nextgen provides configurable [variant calling (small and copy number), RNA-seq, ATAC-seq, , BS-Seq, SC RNA-seq, and small RNA pipelines](https://bcbio-nextgen.readthedocs.io/en/latest/).
 
 ## Quick start
 
@@ -20,7 +20,7 @@ Validated, scalable, community developed variant calling, RNA-seq and small RNA
     ```shell script
     wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/scripts/bcbio_nextgen_install.py
     python bcbio_nextgen_install.py /usr/local/share/bcbio --tooldir=/usr/local \
-          --genomes GRCh37 --aligners bwa --aligners bowtie2
+          --genomes hg38 --aligners bwa --aligners bowtie2
     ```
    producing an editable [system configuration file](https://github.com/bcbio/bcbio-nextgen/blob/master/config/bcbio_system.yaml) referencing the installed software, data and system information.
 


=====================================
bcbio/dragen/dragen.py
=====================================
@@ -6,9 +6,10 @@ from bcbio.pipeline import config_utils
 from bcbio.provenance import do
 from bcbio.distributed.transaction import file_transaction
 from bcbio.utils import file_exists, splitext_plus
-from bcbio import bam
+from bcbio import bam, broad
 from bcbio.log import logger
 
+
 def fix_umi_dragen_bam(data, bam=None):
     """
     fixes the UMI BAM from DRAGEN. Accepts a pre UMI collapsed BAM file and
@@ -37,12 +38,18 @@ def add_fgbio_tags(in_bam, out_bam, data):
 
     samtools = config_utils.get_program("samtools", data)
     bamsormadup = config_utils.get_program("bamsormadup", data)
+    fgbio = config_utils.get_program("fgbio", data)
 
     with file_transaction(out_bam) as tx_out_bam:
+        tmpdir = os.path.dirname(tx_out_bam)
+        jvm_opts = _get_fgbio_jvm_opts(data, tmpdir, 1)
         tx_out_dup = "%s-markdup" % splitext_plus(tx_out_bam)[0]
         cmd = (f"{samtools} view -b {in_bam} | "
                f"{bamsormadup} outputformat=sam tmpfile={tx_out_dup} | "
                f"awk '/^@/ {{print;next}} {{N=split($1,n,\":\");print $0 \"\tRX:Z:\" n[N]}}' | "
+               f"{fgbio} {jvm_opts} SortBam -i /dev/stdin -o /dev/stdout -s Queryname |"
+               f"{fgbio} {jvm_opts} SetMateInformation | "
+               f"{fgbio} {jvm_opts} SortBam -i /dev/stdin -o /dev/stdout -s Coordinate |"
                f"{samtools} view -bh > {tx_out_bam}")
         message = f"Adding MC/MQ/RX tags to DRAGEN BAM file {in_bam}."
         do.run(cmd, message)
@@ -54,3 +61,26 @@ def has_fgbio_tags(in_bam):
     """
     FGBIO_TAGS = set(["MC", "MQ", "RX"])
     return bam.has_tags(in_bam, FGBIO_TAGS)
+
+def _get_fgbio_jvm_opts(data, tmpdir, scale_factor=None):
+    cores, mem = _get_cores_memory(data)
+    resources = config_utils.get_resources("fgbio", data["config"])
+    jvm_opts = resources.get("jvm_opts", ["-Xms750m", "-Xmx4g"])
+    if scale_factor and cores > scale_factor:
+        jvm_opts = config_utils.adjust_opts(jvm_opts, {"algorithm": {"memory_adjust":
+                                                                     {"direction": "increase",
+                                                                      "magnitude": cores // scale_factor}}})
+    jvm_opts += broad.get_default_jvm_opts()
+    jvm_opts = " ".join(jvm_opts)
+    return jvm_opts + " --tmp-dir %s" % tmpdir
+
+def _get_cores_memory(data, downscale=2):
+    """Retrieve cores and memory, using samtools as baseline.
+
+    For memory, scaling down because we share with alignment and de-duplication.
+    """
+    resources = config_utils.get_resources("samtools", data["config"])
+    num_cores = data["config"]["algorithm"].get("num_cores", 1)
+    max_mem = config_utils.adjust_memory(resources.get("memory", "2G"),
+                                         downscale, "decrease").upper()
+    return num_cores, max_mem


=====================================
bcbio/pipeline/main.py
=====================================
@@ -272,6 +272,8 @@ def rnaseqpipeline(config, run_info_yaml, parallel, dirs, samples):
                     samples, config, dirs, "qc") as run_parallel:
         with profile.report("quality control", dirs):
             samples = qcsummary.generate_parallel(samples, run_parallel)
+        with profile.report("create SummarizedExperiment object", dirs):
+            samples = rnaseq.load_summarizedexperiment(samples)
         with profile.report("upload", dirs):
             samples = run_parallel("upload_samples", samples)
             for sample in samples:


=====================================
bcbio/pipeline/rnaseq.py
=====================================
@@ -482,17 +482,6 @@ def combine_files(samples):
         dexseq_combined = None
     samples = spikein.combine_spikein(samples)
     tximport = load_tximport(data)
-    # don't fail runs while we get dependencies straightened out
-    try:
-        # we need metadata.csv to generate SE
-        work_dir = dd.get_work_dir(data)
-        metadata_file = os.path.join(work_dir, "metadata.csv")
-        if not file_exists(metadata_file):
-            qcsummary._merge_metadata(samples)
-        summarized_experiment = load_summarizedexperiment(data)
-        qc_report = generate_se_qc_report(data)
-    except Exception:
-        summarized_experiment = None
     updated_samples = []
     for data in dd.sample_data_iterator(samples):
         if combined:
@@ -513,7 +502,6 @@ def combine_files(samples):
         if gtf_file:
             data = dd.set_tx2gene(data, tx2gene_file)
         data = dd.set_tximport(data, tximport)
-        data = dd.set_summarized_experiment(data, summarized_experiment)
         updated_samples.append([data])
     return updated_samples
 
@@ -564,30 +552,44 @@ def load_tximport(data):
     return {"gene_tpm": tpm_file,
             "gene_counts": counts_file}
 
-def load_summarizedexperiment(data):
+def load_summarizedexperiment(samples):
     """ create summarizedexperiment rds object
     fails with n_samples = 1 """
     # using r36 (4.0) - will eventually drop R3.5
     rcmd = Rscript_cmd("r36")
     se_script = os.path.join(os.path.dirname(__file__), os.pardir, "scripts",
                              "R", "bcbio2se.R")
+    data = samples[0][0]
     work_dir = dd.get_work_dir(data)
     out_dir = os.path.join(work_dir, "salmon")
-    out_file = os.path.join(out_dir, "bcbio-se.rds")
-    if file_exists(out_file):
-        return out_file
-    with file_transaction(out_file) as tx_out_file:
-        cmd = f"{rcmd} --vanilla {se_script} {work_dir} {tx_out_file}"
-        message = f"Loading SummarizedExperiment."
-        do.run(cmd, message)
-    return out_file
+    summarized_experiment = os.path.join(out_dir, "bcbio-se.rds")
+    if not file_exists(summarized_experiment):
+        with file_transaction(summarized_experiment) as tx_out_file:
+            cmd = f"{rcmd} --vanilla {se_script} {work_dir} {tx_out_file}"
+            message = f"Loading SummarizedExperiment."
+            try:
+                do.run(cmd, message)
+            except Exception:
+                logger.error("SE creation failed")
+    if file_exists(summarized_experiment):
+        try:
+            se_qc_report = generate_se_qc_report(work_dir)
+        except Exception:
+            se_qc_report = None
+            logger.error("SE QC failed")
+        updated_samples = []
+        for data in dd.sample_data_iterator(samples):
+            data = dd.set_summarized_experiment(data, summarized_experiment)
+            updated_samples.append([data])
+        return updated_samples
+    else:
+        return samples
 
-def generate_se_qc_report(data):
+def generate_se_qc_report(work_dir):
     """ generate QC report based on SE RDS object"""
     rcmd = Rscript_cmd("r36")
     qc_script = os.path.join(os.path.dirname(__file__), os.pardir, "scripts",
                              "R", "se2qc.Rmd")
-    work_dir = dd.get_work_dir(data)
     out_file = os.path.join(work_dir, "qc", "bcbio-se.html")
     rds_file = os.path.join(work_dir, "salmon", "bcbio-se.rds")
     if file_exists(out_file):


=====================================
bcbio/pipeline/run_info.py
=====================================
@@ -726,12 +726,15 @@ def _check_hetcaller(item):
     """Ensure upstream SV callers requires to heterogeneity analysis are available.
     purecn has its own segmentation - no need in cnvkit or gatk-cnv
     """
-    svs = _get_as_list(item, "svcaller")
     hets = _get_as_list(item, "hetcaller")
-    if hets or any([x in svs for x in ["titancna"]]):
-        if not any([x in svs for x in ["cnvkit", "gatk-cnv"]]):
-            raise ValueError("Heterogeneity caller used but need CNV calls. Add `gatk-cnv` "
-                             "or `cnvkit` to `svcaller` in sample: %s" % item["description"])
+    svs = _get_as_list(item, "svcaller")
+    needing_cnv = hets.copy()
+    for s in svs:
+        if s in ["titancna"]:
+            needing_cnv.append(s)
+    if needing_cnv and not any([x in svs for x in ["cnvkit", "gatk-cnv"]]):
+            raise ValueError("Heterogeneity caller(s) %s used but need CNV calls. Add `gatk-cnv` "
+                             "or `cnvkit` to `svcaller` in sample: %s" % (", ".join(needing_cnv), item["description"]))
 
 def _check_jointcaller(data):
     """Ensure specified jointcaller is valid.


=====================================
bcbio/provenance/programs.py
=====================================
@@ -35,13 +35,13 @@ _cl_progs = [{"cmd": "bamtofastq", "name": "biobambam",
              {"cmd": "vcflib", "has_cl_version": False},
              {"cmd": "featureCounts", "args": "-v", "stdout_flag": "featureCounts"}]
 _manifest_progs = [
-    'bcbio-variation', 'bioconductor-bubbletree', 'cufflinks',
+    'atropos', 'bcbio-variation', 'bioconductor-bubbletree', 'cufflinks',
     'cnvkit', 'fgbio', 'gatk4', 'hisat2', 'sailfish', 'salmon', 'grabix',
     'htseq', 'lumpy-sv', 'manta', 'break-point-inspector', 'metasv', 'multiqc',
     'seq2c', 'mirdeep2', 'oncofuse', 'picard', 'phylowgs', 'platypus-variant',
     'rapmap', 'star', 'rtg-tools', 'sambamba', 'samblaster', 'scalpel',
-    'seqbuster', 'snpeff', 'vardict', 'vardict-java', 'varscan',
-    'ensembl-vep', 'vt', 'wham', 'umis']
+    'seqbuster', 'snpeff', 'vardict', 'vardict-java', 'varscan', 'ensembl-vep',
+    'vt', 'wham', 'umis']
 
 def _broad_versioner(type):
     def get_version(config):


=====================================
bcbio/variation/annotation.py
=====================================
@@ -46,7 +46,7 @@ def get_gatk_annotations(config, include_depth=True, include_baseqranksum=True,
     return anns
 
 def finalize_vcf(in_file, variantcaller, items):
-    """Perform cleanup and dbSNP annotation of the final VCF.
+    """Perform cleanup of the final VCF.
 
     - Adds contigs to header for bcftools compatibility
     - adds sample information for tumor/normal
@@ -60,9 +60,6 @@ def finalize_vcf(in_file, variantcaller, items):
             post_cl = " | ".join(cls) + " | "
         else:
             post_cl = None
-        dbsnp_file = tz.get_in(("genome_resources", "variation", "dbsnp"), items[0])
-        if dbsnp_file:
-            out_file = _add_dbsnp(in_file, dbsnp_file, items[0], out_file, post_cl)
     if utils.file_exists(out_file):
         return vcfutils.bgzip_and_index(out_file, items[0]["config"])
     else:
@@ -133,45 +130,6 @@ def _update_header(orig_vcf, base_file, new_lines, chrom_process_fn=None):
         out_handle.write(chrom_line)
     return new_header
 
-_DBSNP_TEMPLATE = """
-[[annotation]]
-file="%s"
-fields=["ID"]
-names=["rs_ids"]
-ops=["concat"]
-
-[[postannotation]]
-name="ID"
-fields=["rs_ids"]
-op="setid"
-type="String"
-
-[[postannotation]]
-fields=["rs_ids"]
-op="delete"
-"""
-
-def _add_dbsnp(orig_file, dbsnp_file, data, out_file=None, post_cl=None):
-    """Annotate a VCF file with dbSNP.
-
-    vcfanno has flexible matching for NON_REF gVCF positions, matching
-    at position and REF allele, matching ALT NON_REF as a wildcard.
-    """
-    orig_file = vcfutils.bgzip_and_index(orig_file, data["config"])
-    if out_file is None:
-        out_file = "%s-wdbsnp.vcf.gz" % utils.splitext_plus(orig_file)[0]
-    if not utils.file_uptodate(out_file, orig_file):
-        with file_transaction(data, out_file) as tx_out_file:
-            conf_file = os.path.join(os.path.dirname(out_file), "dbsnp.conf")
-            with open(conf_file, "w") as out_handle:
-                out_handle.write(_DBSNP_TEMPLATE % os.path.normpath(os.path.join(dd.get_work_dir(data), dbsnp_file)))
-            if not post_cl: post_cl = ""
-            cores = dd.get_num_cores(data)
-            cmd = ("vcfanno -p {cores} {conf_file} {orig_file} | {post_cl} "
-                   "bgzip -c > {tx_out_file}")
-            do.run(cmd.format(**locals()), "Annotate with dbSNP")
-    return vcfutils.bgzip_and_index(out_file, data["config"])
-
 def get_context_files(data):
     """Retrieve pre-installed annotation files for annotating genome context.
     """


=====================================
bcbio/variation/vardict.py
=====================================
@@ -46,9 +46,10 @@ def _vardict_options_from_config(items, config, out_file, target=None, is_rnaseq
     # SV calling will be worked on as a separate step
     vardict_cl = get_vardict_command(items[0])
     version = programs.get_version_manifest(vardict_cl)
+    # turn off structural variants
     if (vardict_cl and version and
         ((vardict_cl == "vardict-java" and LooseVersion(version) >= LooseVersion("1.5.5")) or
-         (vardict_cl == "vardict" and LooseVersion(version) >= LooseVersion("2018.07.25")))):
+         (vardict_cl == "vardict"))):
         opts += ["--nosv"]
     if (vardict_cl and version and
          (vardict_cl == "vardict-java" and LooseVersion(version) >= LooseVersion("1.5.6"))):


=====================================
config/examples/rnaseq-seqc-getdata.sh
=====================================
@@ -2,20 +2,22 @@
 set -eu -o pipefail
 
 # We need about 100Gb for the input files. Confirm we have the space.
-REQ_DISK_SPACE=100
+REQ_DISK_SPACE=100  
 df --block-size=G --output='avail' . | sed s/G//g | awk -v req_disk_space=${REQ_DISK_SPACE} '{ if ($1 !~ /Avail/ && $1 < req_disk_space ) printf("Warning: Not enough disk space.\n Warning: Requires %sGb in total but only has %sGb.\n", req_disk_space, $1)  }'
 
-mkdir -p input
+mkdir -p seqc/input
+
 wget --no-check-certificate https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/config/examples/rnaseq-seqc.yaml
 
-cd input
+cd seqc/input
 for SAMPLE in SRR950078 SRR950079 SRR950080 SRR950081 SRR950082 SRR950083
-do
-   wget -O ${SAMPLE}_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR950/${SAMPLE}/${SAMPLE}_1.fastq.gz
-   wget -O ${SAMPLE}_2.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR950/${SAMPLE}/${SAMPLE}_2.fastq.gz
-done
+do 
+   wget -c -O ${SAMPLE}_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR950/${SAMPLE}/${SAMPLE}_1.fastq.gz
+   wget -c -O ${SAMPLE}_2.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR950/${SAMPLE}/${SAMPLE}_2.fastq.gz
+done 
+
+cd ../../
 
-wget --no-check-certificate https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/config/examples/seqc.csv
+wget -c --no-check-certificate https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/config/examples/seqc.csv
 
-cd ../
-bcbio_nextgen.py -w template rnaseq-seqc.yaml input/seqc.csv input
+bcbio_nextgen.py -w template rnaseq-seqc.yaml seqc.csv seqc/input/*.gz


=====================================
config/examples/rnaseq-seqc.yaml
=====================================
@@ -4,8 +4,12 @@ details:
   - analysis: RNA-seq
     genome_build: hg38
     algorithm:
-      quality_format: Standard
+      quality_format: standard
       aligner: star
       strandedness: unstranded
 upload:
   dir: ../final
+resources:
+  star:
+    cores: 10
+    memory: 10G


=====================================
config/examples/seqc.csv
=====================================
@@ -1,8 +1,7 @@
-samplename,description,panel
+samplename,description,category
 SRR950078,UHRR_rep1,UHRR
 SRR950079,HBRR_rep1,HBRR
 SRR950080,UHRR_rep2,UHRR
 SRR950081,HBRR_rep2,HBRR
 SRR950082,UHRR_rep3,UHRR
 SRR950083,HBRR_rep3,HBRR
-SRR950084,UHRR_rep4,UHRR


=====================================
config/genomes/GRCh37-resources.yaml
=====================================
@@ -1,4 +1,4 @@
-version: 54
+version: 55
 
 aliases:
   human: true
@@ -6,7 +6,7 @@ aliases:
   ensembl: homo_sapiens_merged_vep_100_GRCh37
 
 variation:
-  dbsnp: ../variation/dbsnp-151.vcf.gz
+  dbsnp: ../variation/dbsnp.vcf.gz
   train_hapmap: ../variation/hapmap_3.3.vcf.gz
   train_omni: ../variation/1000G_omni2.5.vcf.gz
   train_1000g: ../variation/1000G_phase1.snps.high_confidence.vcf.gz


=====================================
config/genomes/hg19-resources.yaml
=====================================
@@ -1,4 +1,4 @@
-version: 56
+version: 57
 
 aliases:
   human: true
@@ -6,7 +6,7 @@ aliases:
   ensembl: homo_sapiens_merged_vep_100_GRCh37
 
 variation:
-  dbsnp: ../variation/dbsnp-151.vcf.gz
+  dbsnp: ../variation/dbsnp.vcf.gz
   train_hapmap: ../variation/hapmap_3.3.vcf.gz
   train_omni: ../variation/1000G_omni2.5.vcf.gz
   train_1000g: ../variation/1000G_phase1.snps.high_confidence.vcf.gz


=====================================
config/genomes/hg38-resources.yaml
=====================================
@@ -1,4 +1,4 @@
-version: 44
+version: 45
 
 aliases:
   human: true
@@ -6,7 +6,7 @@ aliases:
   ensembl: homo_sapiens_merged_vep_100_GRCh38
 
 variation:
-  dbsnp: ../variation/dbsnp-153.vcf.gz
+  dbsnp: ../variation/dbsnp.vcf.gz
   dbnsfp: ../variation/dbNSFP.txt.gz
   dbscsnv: ../variation/dbscSNV.txt.gz
   train_hapmap: ../variation/hapmap_3.3.vcf.gz


=====================================
config/templates/purecn_ton.yaml
=====================================
@@ -0,0 +1,16 @@
+# Tumor only PureCN analysis with a PON (panel of normals)
+details:
+  - analysis: variant2
+    genome_build: hg38
+    algorithm:
+      aligner: false
+      svcaller: purecn
+      variant_regions: /path/ton/config/panel.bed
+      variantcaller: mutect2
+      background:
+        variant: /path/ton/config/pon_build-mutect2-annotated.vcf.gz
+        cnv_reference:
+          purecn_normaldb: /path/ton/config/normalDB_hg38.rds
+          purecn_mapping_bias: /path/ton/config/mapping_bias_hg38.rds
+    metadata:
+        phenotype: tumor


=====================================
config/vcfanno/GRCh37-gemini.conf
=====================================
@@ -2,7 +2,7 @@
 file="variation/exac.vcf.gz"
 fields=["AC_Adj", "AN_Adj", "AC_AFR", "AN_AFR", "AC_AMR", "AN_AMR", "AC_EAS", "AN_EAS", "AC_FIN", "AN_FIN", "AC_NFE", "AN_NFE", "AC_OTH", "AN_OTH", "AC_SAS", "AN_SAS", "AC_Het", "AC_Hom"]
 names=["ac_exac_all", "an_exac_all", "ac_adj_exac_afr", "an_adj_exac_afr", "ac_adj_exac_amr", "an_adj_exac_amr", "ac_adj_exac_eas", "an_adj_exac_eas", "ac_adj_exac_fin", "an_adj_exac_fin", "ac_adj_exac_nfe", "an_adj_exac_nfe", "ac_adj_exac_oth", "an_adj_exac_oth", "ac_adj_exac_sas", "an_adj_exac_sas", "num_exac_Het", "num_exac_Hom"]
-ops=["max", "self", "max", "self", "max", "self", "max", "self", "max", "self", "max", "self", "max", "self", "max", "self", "self", "self"] 
+ops=["max", "max", "max", "max", "max", "max", "max", "max", "max", "max", "max", "max", "max", "max", "max", "max", "max", "max"]
 
 [[annotation]]
 file="variation/gnomad_exome.vcf.gz"
@@ -16,11 +16,12 @@ fields=["EA_AC", "AA_AC", "TAC"]
 names=["af_esp_ea_float", "af_esp_aa_float", "af_esp_all_float"]
 ops=["lua:ratio(vals)", "lua:ratio(vals)", "lua:ratio(vals)"]
 
+
 [[annotation]]
-file="variation/dbsnp-151.vcf.gz"
-fields=["ID"]
-names=["rs_ids"]
-ops=["concat"]
+file="variation/dbsnp.vcf.gz"
+fields=["ID", "CAF", "FREQ", "ID"]
+names=["rs_id", "CAF", "FREQ", "DB"]
+ops=["first", "self", "self", "flag"]
 
 [[annotation]]
 file="variation/1000g.vcf.gz"
@@ -28,12 +29,11 @@ fields=["AMR_AF", "EAS_AF", "SAS_AF", "AFR_AF", "EUR_AF", "AF"]
 names=["af_1kg_amr", "af_1kg_eas", "af_1kg_sas", "af_1kg_afr", "af_1kg_eur", "af_1kg_all"]
 ops=["max", "max", "max", "max", "max", "max"]
 
-
 [[annotation]]
 file="variation/clinvar.vcf.gz"
-fields=["CLNSIG", "CLNDBN"]
-names=["clinvar_sig", "clinvar_disease_name"]
-ops=["self", "self"]
+fields=["CLNSIG", "CLNDBN", "GENEINFO"]
+names=["clinvar_sig", "clinvar_disease_name", "clinvar_geneinfo"]
+ops=["self", "self", "self"]
 
 # calculate allele frequencies for all populations.
 [[postannotation]]
@@ -84,7 +84,6 @@ name="af_adj_exac_sas"
 op="div2"
 type="Float"
 
-
 [[postannotation]]
 fields=['af_adj_exac_afr', 'af_adj_exac_amr', 'af_adj_exac_eas', 'af_adj_exac_fin', 'af_adj_exac_nfe', 'af_adj_exac_oth', 'af_adj_exac_sas', "af_esp_ea", "af_esp_aa", "af_esp_all", "af_1kg_amr", "af_1kg_eas", "af_1kg_sas", "af_1kg_afr", "af_1kg_eur", "af_1kg_all"]
 op="max"
@@ -103,8 +102,8 @@ op="lua:check_population_aaf(max_aaf_all, 0.01)"
 name="DB"
 type="Flag"
 
-[[annotation]]
-file="variation/cosmic.vcf.gz"
-fields=["ID"]
-names=["cosmic_ids"]
-ops=["uniq"]
+[[postannotation]]
+name="ID"
+fields=["rs_id", "ID"]
+op="setid"
+type="String"


=====================================
config/vcfanno/hg38-gemini.conf
=====================================
@@ -20,17 +20,17 @@ ops=["lua:ratio(vals)", "lua:ratio(vals)", "lua:ratio(vals)"]
 
 
 [[annotation]]
-file="variation/dbsnp-153.vcf.gz"
-fields=["ID"]
-names=["rs_ids"]
-ops=["concat"]
+file="variation/dbsnp.vcf.gz"
+fields=["ID", "CAF", "FREQ", "ID"]
+names=["rs_id", "CAF", "FREQ", "ID"]
+ops=["first", "self", "self", "flag"]
 
 
 [[annotation]]
 file="variation/clinvar.vcf.gz"
-fields=["CLNSIG", "CLNDBN"]
-names=["clinvar_sig", "clinvar_disease_name"]
-ops=["self", "self"]
+fields=["CLNSIG", "CLNDBN", "GENEINFO"]
+names=["clinvar_sig", "clinvar_disease_name", "clinvar_geneinfo"]
+ops=["self", "self", "self"]
 
 # calculate allele frequencies for all populations.
 [[postannotation]]
@@ -99,3 +99,9 @@ fields=["max_aaf_all"]
 op="lua:check_population_aaf(max_aaf_all, 0.01)"
 name="DB"
 type="Flag"
+
+[[postannotation]]
+name="ID"
+fields=["rs_id", "ID"]
+op="setid"
+type="String"


=====================================
config/vcfanno/somatic.conf
=====================================
@@ -2,29 +2,14 @@
 # Requires a manual installation of COSMIC
 # https://bcbio-nextgen.readthedocs.io/en/latest/contents/installation.html?highlight=cosmic#customizing-data-installation 
 
-[[annotation]]
-file="variation/dbsnp-153.vcf.gz"
-fields=["ID", "CAF", "ID"]
-names=["rs_id", "CAF", "DB"]
-ops=["first", "first", "flag"]
-
 [[annotation]]
 file="variation/cosmic.vcf.gz"
-fields=["ID", "CNT", "AA", "CDS", "GENE"]
-names=["cosmic_id", "CNT", "AA", "CDS", "GENE"]
-ops=["first", "first", "first", "first", "first"]
-
-[[annotation]]
-file="variation/clinvar.vcf.gz"
-fields=["CLNSIG", "GENEINFO"]
-ops=["first", "first"]
+fields=["ID"]
+names=["cosmic_id"]
+ops=["uniq"]
 
 [[postannotation]]
 name="ID"
-fields=["rs_id", "cosmic_id", "ID"]
+fields=["cosmic_id", "rs_id", "ID"]
 op="setid"
 type="String"
-
-[[postannotation]]
-fields=["rs_id", "cosmic_id"]
-op="delete"


=====================================
docs/contents/bulk_rnaseq.md
=====================================
@@ -1,10 +1,20 @@
 # Bulk RNA-seq
 
+Bulk RNA-seq pipeline in bcbio:
+- aligns reads with STAR (2pass), or hisat2 vs genome and transcriptome references (human, mouse, custom references);
+- quantifies expression counts with salmon, kallisto;
+- runs quality control;
+- calculates TPM with tximport;
+- detects rusions with arriba, pizzly;
+- creates a SummarizedExperiment object for downstream analysis in R;
+- calls variants with gatk or with vardict;
+- supports [spike-in](https://en.wikipedia.org/wiki/RNA_spike-in) calibration
+
 ## Workflow
 
-This example aligns and creates count files for use with downstream analyses using a subset of the SEQC data from the [FDA's Sequencing Quality Control project](http://www.fdaseqc.org/).
+This example processes 6 RNA-seq samples from the [FDA's Sequencing Quality Control project](http://www.fdaseqc.org/).
 
-### 1. Install STAR index
+### 1. Install STAR index (if there is no STAR in the current bcbio installation)
 ```bash
 bcbio_nextgen.py upgrade -u skip --genomes hg38 --aligners star --cores 10
 ```
@@ -20,7 +30,7 @@ Step 2 in detail:
 
 #### 2.1 Create input directory and download fastq files
 ```bash
-mkdir -p input
+mkdir -p seqc/input
 ```
 
 #### 2.2 Download a template yaml file describing RNA-seq analysis
@@ -35,20 +45,26 @@ details:
   - analysis: RNA-seq
     genome_build: hg38
     algorithm:
+      quality_format: standard
       aligner: star
       strandedness: unstranded
 upload:
   dir: ../final
+resources:
+  star:
+    cores: 10
+    memory: 10G
 ```
 
 #### 2.3 Download fastq files into input dir
 ```bash
-cd input
+cd seqc/input
 for SAMPLE in SRR950078 SRR950079 SRR950080 SRR950081 SRR950082 SRR950083
 do 
    wget -c -O ${SAMPLE}_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR950/${SAMPLE}/${SAMPLE}_1.fastq.gz
    wget -c -O ${SAMPLE}_2.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR950/${SAMPLE}/${SAMPLE}_2.fastq.gz
 done
+cd ../../
 ```
 
 #### 2.4 Prepare a sample sheet
@@ -65,20 +81,19 @@ SRR950080,UHRR_rep2,UHRR
 SRR950081,HBRR_rep2,HBRR
 SRR950082,UHRR_rep3,UHRR
 SRR950083,HBRR_rep3,HBRR
-SRR950084,UHRR_rep4,UHRR
 ```
 
 #### 2.5 Generate yaml config file for analysis
 project.yaml = template.yaml x sample_sheet.csv
 ```
-cd ../
-bcbio_nextgen.py -w template rnaseq-seqc.yaml input/seqc.csv input
+bcbio_nextgen.py -w template rnaseq-seqc.yaml seqc.csv seqc/input/*.gz
 ```
 
 In the result you should see a folder structure:
 ```
 seqc
 |---config
+|---input
 |---final
 |---work
 ```
@@ -110,8 +125,6 @@ ipython: `sbatch bcbio.sh`:
 bcbio_nextgen.py ../config/seqc.yaml -n 72 -t ipython -s slurm -q medium -r t=0-72:00 -r conmem=20 --timeout 3000 --tag "rna"
 ```
 
-This will run a full scale RNAseq experiment using STAR as the aligner and will take a long time to finish on a single machine. At the end it will output counts, Cufflinks quantitation and a set of QC results about each lane. If you have a cluster you can [parallelize it](parallel) to speed it up considerably.
-
 ## Parameters
 
 * `transcript_assembler` If set, will assemble novel genes and transcripts and merge the results into the known annotation. Can have multiple values set in a list. Supports ['cufflinks', 'stringtie'].
@@ -119,9 +132,9 @@ This will run a full scale RNAseq experiment using STAR as the aligner and will
 * `expression_caller` A list of optional expression callers to turn on. Supports ['cufflinks', 'express', 'stringtie', 'dexseq', 'kallisto']. Salmon and count based expression estimation are run by default. Sailfish is deprecated.
 * `fusion_caller` A list of optional fusion callers to turn on. Supports [oncofuse, pizzly, ericscript, arriba].
 * `variantcaller` Variant calling algorithm to call variants on RNA-seq data. Supports [gatk-haplotype] or [vardict].
-* `spikein_fasta` A FASTA file of spike in sequences to quantitate.
+* `spikein_fasta` A FASTA file of spike in sequences to quantitate. There are quantitated separately, so should be things you are not expecting to match anywhere to the genome or trancriptome of the species you are working with.
 * `quantify_genome_alignments` If set to True, run Salmon quantification using the genome alignments from STAR, when available. If STAR alignments are not available, use Salmon's SA mode with decoys.
-
+* `transcriptome_gtf` A GTF file to use to specify transcripts which will override the bcbio-installed versions. This is used if you have an alternate transcriptome you want to quantitate. You can also use this option to add your own transcripts to a set to quantitate, just append them to your GTF file and sort it.
 ## QC and Basic DE analysis
 By default, bcbio runs simple [R scripts](https://github.com/bcbio/bcbio-nextgen/tree/master/bcbio/scripts/R).
 To make them work, you need >1 sample, and `category` in the metadata for each sample:
@@ -224,6 +237,26 @@ Our current recommendation is to run adapter trimming only if using the Tophat2
 Salmon, which is an extremely fast alignment-free method of quantitation, is run for all experiments. Salmon can accurately quantitate the expression of genes, even ones which are hard to quantitate with other methods (see [this paper](https://doi.org/10.1186/s13059-015-0734-x) for example for Sailfish, which performs similarly to Salmon). Salmon can also quantitate at the transcript level which can help gene-level analyses
 (see [this paper](https://doi.org/10.12688/f1000research.7563.1) for example). We recommend using the Salmon quantitation rather than the counts from featureCounts to perform downstream quantification.
 
+Still, we had at least two projects with initally low % mapped (STAR) at 50-70% which greatly benefited from trimming both in terms of % mapped and N mapped reads.
+We trimmed with 
+bcbio.yaml:
+```yaml
+algorithm:
+  trim_reads: read_through 
+  adapters: illumina
+```
+or with [fastp](https://github.com/OpenGene/fastp)
+```bash
+#!/bin/bash
+fastp -I {input.R1]} -I {input.R2} -o {output.R1_trim_fastq} -O {output.R2_trim_fastq} -h {output.fastp_html} -g -x -5 -3
+
+# Additional flag definitions:
+# -g trims polyG, common artifact in Illumina 2-dye sequencing chemistry (MiniSeq, NextSeq, Novaseq)
+# -x trims polyX, any other homopolymeric stretch
+# -5 sliding window 5’ quality trimming
+# -3 sliding window 3’ quality trimming
+```
+
 Although we do not recommend using the featureCounts based counts, the alignments are still useful because they give you many more quality metrics than the quasi-alignments from Salmon.
 
 After a bcbio RNA-seq run there will be in the `upload` directory a directory for each sample which contains a BAM file of the aligned and unaligned reads, a `salmon` directory with the output of Salmon, including TPM values, and a `qc` directory with plots from FastQC and qualimap.


=====================================
docs/contents/configuration.md
=====================================
@@ -181,7 +181,7 @@ sample1,sample1,batch1,normal,female,/path/to/regions.bed
   * `picard` -- Picard/GATK based cleaning. Includes read group changes, fixing of problematic reads and re-ordering chromosome order to match the reference genome. To fix misencoded input BAMs with non-standard scores, set `quality_format` to `illumina`.
 * `bam_sort` Allow sorting of input BAMs when skipping alignment step (`aligner` set to false). Options are coordinate or queryname. For additional processing through standard pipelines requires coordinate sorted inputs. The default is to not do additional sorting and assume pre-sorted BAMs..
 * `align_split_size`: Increase parallelization of alignment. As of 0.9.8, bcbio will try to determine a useful parameter and you don't need to set this. If you manually set it, bcbio will respect your specification. Set to false to avoid splitting entirely. If set, this defines the number of records to feed into each independent parallel step (for example, 5000000 = 5 million reads per chunk). It converts the original inputs into bgzip grabix indexed FASTQ files, and then retrieves chunks for parallel alignment. Following alignment, it combines all chunks back into the final merged alignment file. This allows parallelization at the cost of additional work of preparing inputs and combining split outputs. The tradeoff makes sense when you have large files and lots of distributed compute. When you have fewer large multicore machines this parameter may not help speed up processing.
-* `quality_format` Quality format of FASTQ or BAM inputs [standard, illumina]
+* `quality_format` Quality format of FASTQ or BAM inputs [standard, illumina]. `standard` means Sanger, `illumina` means Illumina [1.3, 1.8).
 * `strandedness` For RNA-seq libraries, if your library is strand specific, set the appropriate flag from [unstranded, firststrand, secondstrand]. Defaults to unstranded. For dUTP marked libraries, firststrand is correct; for Scriptseq prepared libraries, secondstrand is correct.
 * `save_diskspace` Remove align prepped bgzip and split BAM files after merging into final BAMs. Helps reduce space on limited filesystems during a run. `tools_off: [upload_alignment]` may also be useful in conjunction with this. `[false, true]`
 
@@ -513,6 +513,10 @@ To manually make genomes available to bcbio-nextgen, edit the individual `.loc`
 
 To remove a reference genome, delete its directory `bcbio/genomes/species/reference` and remove all the records corresponding to that genome from `bcbio/galaxy/tool-data/*.loc` files.
 
+`genomes/Hsapiens/hg38/seq/hg38-resources.yaml` specifies relative locations of the resources. To determine the absolute path, bcbio fetches a value
+from `bcbio/galaxy/tool-data/sam_fa_indices.loc` and uses it is a basedir for all resources. If there are several installations of bcbio `data`,
+it is important to have separate `tool-data` as well.
+
 ### Adding custom genomes
 
 [bcbio_setup_genome.py](https://github.com/bcbio/bcbio-nextgen/blob/master/scripts/bcbio_setup_genome.py) installs a custom genome for variant and bulk-RNA-seq analyses and updates the configuration files. 
@@ -576,7 +580,7 @@ For adding an organism not present in snpEff, please see this [mailing list disc
 
 ## Download data from SRA
 
-Use SRA toolkit [prefetch/fastq-dump](https://wiki.rc.hms.harvard.edu/display/O2/Aspera+to+download+NCBI+SRA+data)
+Use SRA toolkit [prefetch/fastq-dump](https://wiki.rc.hms.harvard.edu/pages/viewpage.action?pageId=42402938)
 
 ## Upload
 


=====================================
docs/contents/installation.md
=====================================
@@ -183,14 +183,14 @@ By default, bcbio will install data files for `variation`, `rnaseq` and `smallrn
 * `ericscript` -- Database for [EricScript](https://sites.google.com/site/bioericscript/), based gene fusion detection. Supports hg38, hg19 and GRCh37.
 * `TOPMed` -- [TOPMed](https://www.nhlbiwgs.org) Allele frequencies for whole genome variants from heart, lung, blood and sleep disorders. Supports hg38, hg19 and GRCh37.
 
-For somatic analyses, bcbio includes [COSMIC](https://cancer.sanger.ac.uk/cosmic) v68 for hg19 and GRCh37 only. Due to license restrictions, we cannot include updated versions of
-this dataset and hg38 support with the installer. To prepare these datasets yourself you can use [a utility script shipped with cloudbiolinux](https://github.com/chapmanb/cloudbiolinux/blob/master/utils/prepare_cosmic.py) that downloads, sorts and merges the VCFs, then copies into your bcbio installation:
+For somatic analyses, bcbio includes [COSMIC](https://cancer.sanger.ac.uk/cosmic) v68 for hg19 and GRCh37 only. Due to license restrictions, we cannot include updated versions of this dataset and hg38 support with the installer. To prepare these datasets yourself you can use [a utility script shipped with cloudbiolinux](https://github.com/chapmanb/cloudbiolinux/blob/master/utils/prepare_cosmic.py) that downloads, sorts and merges the VCFs, then copies into your bcbio installation:
 ```shell
 export COSMIC_USER="you at example.org"
 export COSMIC_PASS="your_cosmic_password"
 bcbio_python prepare_cosmic.py 89 /path/to/bcbio
 ```
-`/path/to/bcbio/` here is the directory one up from the `genomes` directory.
+`/path/to/bcbio/` here is the directory one up from the `genomes` directory. The script removes variants marked as `SNP` in COSMIC, i.e. leaving only somatic variants. From version a minor portion of variants gets re-classified, for example: 3,779 variants were SNPs in cosmic-90 and became mutations in cosmic-92,
+656 variants were mutations in cosmic-90 and became SNPs in cosmic-92.
 
 ## Extra software
 


=====================================
docs/contents/purecn.md
=====================================
@@ -90,7 +90,11 @@ mv sample1_T_FFPE.bam sample2_T_FFPE ton/input/
 ```
 
 ### 2. Create purecn_ton_template.yaml:
+```bash
+wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/config/templates/purecn_ton.yaml
 ```
+
+```yaml
 details:
   - analysis: variant2
     genome_build: hg38


=====================================
requirements-conda.txt
=====================================
@@ -1 +1 @@
-bcbio-nextgen=1.2.4
+bcbio-nextgen=1.2.5


=====================================
requirements.txt
=====================================
@@ -1 +1 @@
-bcbio-nextgen==1.2.4
+bcbio-nextgen==1.2.5


=====================================
scripts/bcbio_setup_genome.py
=====================================
@@ -221,25 +221,26 @@ if __name__ == "__main__":
 
     parser.add_argument("-c", "--cores", default=1,
                         help="number of cores to use")
-    parser.add_argument("-f", "--fasta", required=True,
-                        help="FASTA file of the genome.")
     parser.add_argument("--gff3", default=False, action='store_true',
                         help="File is a GFF3 file.")
-    parser.add_argument("-g", "--gtf", default=None,
-                        help="GTF file of the transcriptome")
-    parser.add_argument("-n", "--name", required=True,
-                        help="Name of organism, for example Hsapiens.")
-    parser.add_argument("-b", "--build", required=True,
-                        help="Build of genome, for example hg19.")
     parser.add_argument("-i", "--indexes", choices=SUPPORTED_INDEXES, nargs="*",
                         default=["seq"], help="Space separated list of indexes to make")
     parser.add_argument("--ercc", action='store_true', default=False,
                         help="Add ERCC spike-ins.")
     parser.add_argument("--mirbase", help="species in mirbase for smallRNAseq data.")
     parser.add_argument("--srna_gtf", help="gtf to use for smallRNAseq data.")
-    parser.add_argument("--buildversion", required=True, 
-	                help=("String describing build of genome used. Examples: "
-                              "Ensembl_94, EnsemblMetazoa_94, Flybase_21, etc"))
+    required = parser.add_argument_group('required named arguments')
+    required.add_argument("--buildversion", required=True,
+	                      help=("String describing build of genome used. Examples: "
+                                "Ensembl_94, EnsemblMetazoa_94, Flybase_21, etc"))
+    required.add_argument("-f", "--fasta", required=True,
+                        help="FASTA file of the genome.")
+    required.add_argument("-g", "--gtf", default=None,
+                        help="GTF file of the transcriptome")
+    required.add_argument("-n", "--name", required=True,
+                        help="Name of organism, for example Hsapiens.")
+    required.add_argument("-b", "--build", required=True,
+                        help="Build of genome, for example hg19.")
     args = parser.parse_args()
  #   if not all([args.mirbase, args.srna_gtf]) and any([args.mirbase, args.srna_gtf]):
  #       raise ValueError("--mirbase and --srna_gtf both need a value.")


=====================================
setup.py
=====================================
@@ -7,7 +7,7 @@ import subprocess
 
 import setuptools
 
-VERSION = '1.2.5'
+VERSION = '1.2.6'
 
 # add bcbio version number and git commit hash of the current revision to version.py
 try:



View it on GitLab: https://salsa.debian.org/med-team/bcbio/-/commit/5c8848abe4fd3d8b9864a6b79451c82111cfbb92

-- 
View it on GitLab: https://salsa.debian.org/med-team/bcbio/-/commit/5c8848abe4fd3d8b9864a6b79451c82111cfbb92
You're receiving this email because of your account on salsa.debian.org.


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20210210/62466ecf/attachment-0001.html>


More information about the debian-med-commit mailing list