[med-svn] [Git][med-team/freebayes][master] 5 commits: New upstream version 1.3.6

Andreas Tille (@tille) gitlab at salsa.debian.org
Fri Jan 21 11:15:38 GMT 2022



Andreas Tille pushed to branch master at Debian Med / freebayes


Commits:
7be9fd39 by Andreas Tille at 2022-01-21T12:08:58+01:00
New upstream version 1.3.6
- - - - -
4ab3acd4 by Andreas Tille at 2022-01-21T12:08:58+01:00
routine-update: New upstream version

- - - - -
80eaf995 by Andreas Tille at 2022-01-21T12:09:17+01:00
Update upstream source from tag 'upstream/1.3.6'

Update to upstream version '1.3.6'
with Debian dir e7a2500bd928cf9a1d650b22973a0bca85fa7f12
- - - - -
e8727b4c by Andreas Tille at 2022-01-21T12:09:27+01:00
Add missing build dependency on dh addon.

Changes-By: lintian-brush
Fixes: lintian: missing-build-dependency-for-dh_-command
See-also: https://lintian.debian.org/tags/missing-build-dependency-for-dh_-command.html

- - - - -
3a53c7fe by Andreas Tille at 2022-01-21T12:15:05+01:00
routine-update: Ready to upload to unstable

- - - - -


21 changed files:

- .github/workflows/ci_test.yml
- README.md
- RELEASE-NOTES.md
- + TODO.md
- − bin/.keep
- + contrib/htslib_config/config_vars.h
- contrib/htslib_config/version.h
- debian/changelog
- debian/control
- + examples/freebayes-env.yaml
- + examples/snakemake-freebayes-parallel.smk
- guix.scm
- meson.build
- + meson_options.txt
- + scripts/GenerateFreebayesRegions.R
- scripts/fasta_generate_regions.py
- + scripts/split_ref_by_bai_datasize.py
- src/Parameters.cpp
- src/version_git.h
- + test/performance/benchmark.md
- + test/runner.sh


Changes:

=====================================
.github/workflows/ci_test.yml
=====================================
@@ -9,8 +9,10 @@ jobs:
         os: [ubuntu-latest]
         python-version: [3.8]
     steps:
+    - name: Update ubuntu
+      run: sudo apt-get update
     - name: Install dependencies
-      run: sudo apt-get install samtools bc parallel libvcflib-tools vcftools
+      run: sudo apt-get -y -f install samtools bc parallel libvcflib-tools vcftools
     - uses: actions/checkout at v2
     - uses: actions/setup-python at v1
       with:


=====================================
README.md
=====================================
@@ -2,8 +2,7 @@
 ## user manual and guide
 
 
-[![Github-CI](https://github.com/freebayes/freebayes/workflows/CI/badge.svg)](https://github.com/freebayes/freebayes/actions) [![Travis-CI](https://travis-ci.com/freebayes/freebayes.svg?branch=master)](https://travis-ci.com/freebayes/freebayes) [![AnacondaBadge](https://anaconda.org/bioconda/freebayes/badges/installer/conda.svg)](https://anaconda.org/bioconda/freebayes) [![DL](https://anaconda.org/bioconda/freebayes/badges/downloads.svg)](https://anaconda.org/bioconda/freebayes) [![BrewBadge](https://img.shields.io/badge/%F0%9F%8D%BAbrew-freebayes-brightgreen.svg)](https://github.com/brewsci/homebrew-bio) [![GuixBadge](https://img.shields.io/badge/gnuguix-freebayes-brightgreen.svg)](https://www.gnu.org/software/guix/packages/F/) [![DebianBadge](https://badges.debian.net/badges/debian/testing/freebayes/version.svg)](https://packages.debian.org/testing/freebayes)
-
+[![Github-CI](https://github.com/freebayes/freebayes/workflows/CI/badge.svg)](https://github.com/freebayes/freebayes/actions) [![Travis-CI](https://travis-ci.com/freebayes/freebayes.svg?branch=master)](https://travis-ci.com/freebayes/freebayes) [![AnacondaBadge](https://anaconda.org/bioconda/freebayes/badges/installer/conda.svg)](https://anaconda.org/bioconda/freebayes) [![DL](https://anaconda.org/bioconda/freebayes/badges/downloads.svg)](https://anaconda.org/bioconda/freebayes) [![BrewBadge](https://img.shields.io/badge/%F0%9F%8D%BAbrew-freebayes-brightgreen.svg)](https://github.com/brewsci/homebrew-bio) [![GuixBadge](https://img.shields.io/badge/gnuguix-freebayes-brightgreen.svg)](https://www.gnu.org/software/guix/packages/F/) [![DebianBadge](https://badges.debian.net/badges/debian/testing/freebayes/version.svg)](https://packages.debian.org/testing/freebayes) [![Chat on Matrix](https://matrix.to/img/matrix-badge.svg)](https://matrix.to/#/#vcflib:matrix.org)
 --------
 
 ## Overview
@@ -39,6 +38,8 @@ format.  It can also use an input set of variants (VCF) as a source of prior
 information, and a copy number variant map (BED) to define non-uniform ploidy
 variation across the samples under analysis.
 
+freebayes is maintained by Erik Garrison and Pjotr Prins. See also [RELEASE-NOTES](./RELEASE-NOTES.md).
+
 ## Citing freebayes
 
 A preprint [Haplotype-based variant detection from short-read sequencing](http://arxiv.org/abs/1207.3907) provides an overview of the
@@ -82,8 +83,8 @@ For instance:
 Multiple BAM files may be given for joint calling.
 
 Typically, we might consider two additional parameters.
-GVCF output allows us to have coverage information about non-called sites, and we can enable it with `--gcvcf`.
-For performance reasons we may want to skip regions of extremely high coverage in the reference using the `--skip-limit` parameter `-g`.
+GVCF output allows us to have coverage information about non-called sites, and we can enable it with `--gvcf`.
+For performance reasons we may want to skip regions of extremely high coverage in the reference using the `--skip-coverage` parameter or `-g`.
 These can greatly increase runtime but do not produce meaningful results..
 For instance, if we wanted to exclude regions of 1000X coverage, we would run:
 
@@ -121,7 +122,7 @@ Use a different ploidy:
 
 Assume a pooled sample with a known number of genome copies.  Note that this
 means that each sample identified in the BAM file is assumed to have 32 genome
-copies.  When running with highh --ploidy settings, it may be required to set
+copies.  When running with high --ploidy settings, it may be required to set
 `--use-best-n-alleles` to a low number to limit memory usage.
 
     freebayes -f ref.fa -p 32 --use-best-n-alleles 4 --pooled-discrete aln.bam >var.vcf
@@ -145,16 +146,25 @@ Naive variant calling: simply annotate observation counts of SNPs and indels:
     freebayes -f ref.fa --haplotype-length 0 --min-alternate-count 1 \
         --min-alternate-fraction 0 --pooled-continuous --report-monomorphic >var.vcf
 
-Parallel operation (use 36 cores in this case):
+## Parallelisation
+
+In general, freebayes can be parallelised by running multiple instances of freebayes on separate regions of the genome, and then concatenating the resulting output.
+The wrapper, [freebayes-parallel](https://github.com/ekg/freebayes/blob/master/scripts/freebayes-parallel) will perform this, using [GNU parallel](https://www.gnu.org/software/parallel/).
+
+Example freebayes-parallel operation (use 36 cores in this case):
 
     freebayes-parallel <(fasta_generate_regions.py ref.fa.fai 100000) 36 \
-        -f ref.fa aln.bam >var.vcf
+        -f ref.fa aln.bam > var.vcf
 
 Note that any of the above examples can be made parallel by using the
 scripts/freebayes-parallel script.  If you find freebayes to be slow, you
 should probably be running it in parallel using this script to run on a single
 host, or generating a series of scripts, one per region, and run them on a
-cluster. Be aware that the freebayes-parallel script contains calls to other programs  using relative paths from the scripts subdirectory; the easiest way to ensure a successful run is to invoke the freebayes-parallel script from within the scripts subdirectory.
+cluster. Be aware that the freebayes-parallel script contains calls to other programs using relative paths from the scripts subdirectory; the easiest way to ensure a successful run is to invoke the freebayes-parallel script from within the scripts subdirectory.
+
+A current limitation of the freebayes-parallel wrapper, is that due to variance in job memory and runtimes, some cores can go unused for long periods, as they will not move onto the next job unless all cores in use have completed their respective genome chunk. This can be partly avoided by calculating coverage of the input bam file, and splitting the genome into regions of equal coverage using the [coverage_to_regions.py script](https://github.com/freebayes/freebayes/blob/master/scripts/coverage_to_regions.py). An alternative script [split_ref_by_bai_datasize.py](https://github.com/freebayes/freebayes/blob/master/scripts/split_ref_by_bai_datasize.py) will determine target regions based on the data within multiple bam files, with the option of choosing a target data size. This is useful when submitting to Slurm and other cluster job managers, where use of resources needs to be controlled.
+
+Alternatively, users may wish to parallelise freebayes within the workflow manager [snakemake](https://snakemake.readthedocs.io/en/stable/). As snakemake automatically dispatches jobs when a core becomes available, this avoids the above issue. An example [.smk file](https://github.com/freebayes/freebayes/blob/master/examples/snakemake-freebayes-parallel.smk), and associated [conda environment recipe](https://github.com/freebayes/freebayes/blob/master/examples/freebayes-env.yaml), can be found in the /examples directory.
 
 ## Calling variants: from fastq to VCF
 
@@ -446,6 +456,10 @@ To download freebayes, please use git to download the most recent development tr
 
     git clone --recursive https://github.com/freebayes/freebayes.git
 
+If you have a repo, update the submodules with
+
+    git submodule update --init --recursive --progress
+
 On Debian you'll need a gcc compiler and want packages:
 
 - bc
@@ -480,17 +494,22 @@ Next compile and test in the ~build~ directory
     ninja
     ninja test
 
+The freebayes binary should be in
+
+    build/freebayes
+
 Tests on ARM may be slow. If you get a TIMEOUT use a multiplier,
 e.g.
 
     meson test -t 4 -C build/
 
+See [meson.build](./meson.build) for more information.
 
 ### Compile in a Guix container
 
 After checking out the repo with git recursive create a Guix
 container with all the build tools with
 
-    guix environment -C -l guix.scm
+    guix shell -C -D -f guix.scm
 
 See also [guix.scm](./guix.scm).


=====================================
RELEASE-NOTES.md
=====================================
@@ -5,6 +5,26 @@ and
 [commits](https://github.com/freebayes/freebayes/commits/master).
 
 
+## ChangeLog v1.3.6 (20220121)
+
+This is a maintenance release of Freebayes:
+
++ Added parallel snakemake example and text (thanks @sanjaynagi)
++ Updated packaged htslib build to release 1.14
++ README fixes https://github.com/freebayes/freebayes/issues/726 (thanks @veghp)
++ Add option to force use of submodule deps (thanks @jnumm)
++ Using recent guix shell support for building freebayes
++ Added static build support with meson+ninja
++ Started tracking freebayes performance in ./test/performance/benchmark.md
++ Added matrix channel badge to README
++ Added Pjotr Prins as maintainer in README and banner
+
+The provided static binary is built against gcc 10.3.0 and htslib 1.12.
+
+Note that at this point the static binary builds with:
+
+    meson build -Dstatic=true -Dprefer_system_deps=false --buildtype release
+
 ## ChangeLog v1.3.5 (20210210)
 
 This is a maintenance and bug-fix release of Freebayes:


=====================================
TODO.md
=====================================
@@ -0,0 +1,6 @@
+# TODO for freebayes
+
+- [ ] Update Debian
+- [ ] Update Conda
+- [ ] Fix CI builds
+- [ ] Introduce metrics such as from rtg-tools https://hpc.nih.gov/apps/rtg-tools.html


=====================================
bin/.keep deleted
=====================================


=====================================
contrib/htslib_config/config_vars.h
=====================================
@@ -0,0 +1,5 @@
+#define HTS_CC "gcc"
+#define HTS_CPPFLAGS ""
+#define HTS_CFLAGS "-g -Wall -O2 -fvisibility=hidden"
+#define HTS_LDFLAGS "-fvisibility=hidden"
+#define HTS_LIBS "-lz -lm -lbz2 -llzma -lcurl"


=====================================
contrib/htslib_config/version.h
=====================================
@@ -1 +1,2 @@
 #define HTS_VERSION_TEXT "freebayes htslib"
+#define HTSCODECS_VERSION_TEXT ""


=====================================
debian/changelog
=====================================
@@ -1,3 +1,10 @@
+freebayes (1.3.6-1) unstable; urgency=medium
+
+  * New upstream version
+  * Add missing build dependency on dh addon.
+
+ -- Andreas Tille <tille at debian.org>  Fri, 21 Jan 2022 12:09:32 +0100
+
 freebayes (1.3.5-3) unstable; urgency=medium
 
   * Fix watchfile to detect new versions on github


=====================================
debian/control
=====================================
@@ -16,7 +16,8 @@ Build-Depends: debhelper-compat (= 13),
                bc,
                samtools,
                parallel <!nocheck>,
-               libvcflib-tools <!nocheck>
+               libvcflib-tools <!nocheck>,
+               debhelper
 Standards-Version: 4.6.0
 Vcs-Browser: https://salsa.debian.org/med-team/freebayes
 Vcs-Git: https://salsa.debian.org/med-team/freebayes.git


=====================================
examples/freebayes-env.yaml
=====================================
@@ -0,0 +1,7 @@
+channels:
+  - bioconda
+  - conda-forge
+dependencies:
+  - freebayes>=1.3.4
+  - bcftools=1.11
+  - vcflib=1.0.0


=====================================
examples/snakemake-freebayes-parallel.smk
=====================================
@@ -0,0 +1,77 @@
+## A simple example snakemake .smk file for parallelising freebayes
+## Uses a fasta_generate_regions to split the genome into regions of equal size based on the .fai index
+## As snakemake automatically moves each cpu core to the next genome chunk, this works out faster
+## than the freebayes-parallel wrapper.
+## This .smk file assumes we have a list of the bam files called bam.list
+## This .smk file splits the genome by chromosome, which of course, is not necessary.
+## One will want to edit the paths (for example, the path to bam files)
+
+
+# these parameters should usually be stored in the snakemake configuration file (config.yaml) and accessed e.g. config['ref']
+samples = ['SampleA', 'SampleB', 'SampleC']
+reference = "path/to/reference"
+chroms = [1,2,3]
+nchunks = 9
+
+bamlist = "path/to/bam.list"
+chunks = list(range(1,nchunks+1))
+
+rule GenomeIndex:
+    input:
+        ref = reference
+    output:
+        idx = reference + ".fai"
+    log: 
+        "logs/GenomeIndex.log"
+    wrapper: 
+        "v0.69.0/bio/samtools/faidx"
+
+
+rule GenerateFreebayesRegions:
+    input:
+        ref_idx = reference,
+        index = reference + ".fai",
+        bams = expand("resources/alignments/{sample}.bam", sample=samples)
+    output:
+        regions = expand("resources/regions/genome.{chrom}.region.{i}.bed", chrom=chroms, i = chunks)
+    log:
+        "logs/GenerateFreebayesRegions.log"
+    params:
+        chroms = chroms,
+        chunks = chunks
+    conda:
+        "../envs/freebayes-env.yaml"
+    script:
+        # "../scripts/GenerateFreebayesRegions.R" # This is located in the scripts/ directory of freebayes
+        "../scripts/fasta_generate_regions.py --chunks --bed resources/regions/genome --chromosome {params.chroms} {input.index} {params.chunks}"
+
+
+rule VariantCallingFreebayes:
+    input:
+        bams = expand("resources/alignments/{sample}.bam", sample=samples),
+        index = expand("resources/alignments/{sample}.bam.bai", sample=samples),
+        ref = reference,
+        samples = bamlist,
+        regions = "resources/regions/genome.{chrom}.region.{i}.bed"
+    output:
+        temp("results/variants/vcfs/{chrom}/variants.{i}.vcf")
+    log:
+        "logs/VariantCallingFreebayes/{chrom}.{i}.log"
+    conda:
+        "../envs/freebayes-env.yaml"
+    threads:1
+    shell:	"freebayes -f {input.ref} -t {input.regions} -L {input.samples} > {output} 2> {log}"
+
+
+rule ConcatVCFs:
+    input:
+        calls = expand("results/variants/vcfs/{{chrom}}/variants.{i}.vcf", i=chunks)
+    output:
+        "results/variants/vcfs/variants.{chrom}.vcf"
+    log:
+        "logs/ConcatVCFs/{chrom}.log"
+    conda:
+        "../envs/freebayes-env.yaml"
+    threads:4
+    shell:  
+        "bcftools concat {input.calls} | vcfuniq > {output} 2> {log}"
\ No newline at end of file


=====================================
guix.scm
=====================================
@@ -4,9 +4,9 @@
 ;;
 ;; To get a development container
 ;;
-;;   guix environment -C -l guix.scm
+;;   guix shell -C -D -f guix.scm
 ;;
-;; For the tests you need /usr/bin/env. In a container create it with
+;; For the tests you need /usr/bin/env. Inside the container:
 ;;
 ;;   mkdir -p /usr/bin ; ln -s $GUIX_ENVIRONMENT/bin/env /usr/bin/env
 
@@ -40,17 +40,20 @@
 (define-public freebayes-git
   (package
     (name "freebayes-git")
-    (version (git-version "1.3.4" "HEAD" %git-commit))
+    (version (git-version "1.3.6" "HEAD" %git-commit))
     (source (local-file %source-dir #:recursive? #t))
     (build-system meson-build-system)
     (propagated-inputs
-     `(("perl" ,perl)         ; for testing
-       ("grep" ,grep)         ; for testing
-       ("samtools" ,samtools) ; for testing
-       ("vcflib" ,vcflib)     ; for freebayes-parallel
-       ("which" ,which)       ; for version
-       ("htslib" ,htslib)     ; does work, but lacks codecs
-       ("tabixpp" ,tabixpp)))
+     `(("perl" ,perl)          ; for testing
+       ("grep" ,grep)          ; for testing
+       ("samtools" ,samtools)  ; for testing
+       ("vcflib" ,vcflib)      ; for freebayes-parallel
+       ("which" ,which)        ; for version
+       ("htslib" ,htslib)      ; does work, but lacks codecs
+       ("tabixpp" ,tabixpp)    ; for htslib
+       ("bzip2-static" ,bzip2 "static")    ; libz2 part of htslib for static builds
+       ("xz-static" ,xz "static")     ; for static builds
+       ("zlib-static" ,zlib "static")))
     (native-inputs
      `(
        ("meson" ,meson)


=====================================
meson.build
=====================================
@@ -1,25 +1,46 @@
 # Meson build definition
 #
-# meson --buildtype [debug|release] [--default-library static] build/
-# meson test -C build /
+#     meson --buildtype [debug|release] [--default-library static] build/
+#     meson test -C build /
 #
 # Meson builds with locally installed htslib, vcflib, tabixpp and seqlib..
 # If one is missing it should pick up the sources from git submodules.
+#
+# to compile htslib with local sources use
+#
+#     meson build -Dprefer_system_deps=false
+#
+# to build static binary for release
+#
+#     meson build -Dstatic=true -Dprefer_system_deps=false --buildtype release
 
 project('freebayes', ['cpp', 'c'],
-        version : '1.3.4',
+        version : '1.3.6',
         default_options : ['warning_level=1', 'cpp_std=c++14', 'optimization=3'])
+static_build = get_option('static')
 
 cc = meson.get_compiler('cpp')
 
-zlib_dep = dependency('zlib')
-lzma_dep = dependency('liblzma')
-thread_dep = dependency('threads')
+zlib_dep = dependency('zlib', static: static_build)
+lzma_dep = dependency('liblzma', static: static_build)
+thread_dep = dependency('threads', static: static_build)
+tabixpp_dep = cc.find_library('tabixpp', required: false, static: static_build)
 
-htslib_dep = dependency('htslib', required : false)
-tabixpp_dep = cc.find_library('tabixpp', required : false)
-vcflib_dep = dependency('libvcflib', required : false)
-seqlib_dep = dependency('libseqlib', required : false)
+# to compile htslib use
+#   meson build -Dprefer_system_deps=false
+# otherwise it tries to pick up the local system libs
+if get_option('prefer_system_deps')
+  htslib_dep = dependency('htslib', static: static_build, required: false)
+  tabixpp_dep = cc.find_library('tabixpp', static: static_build, required: false)
+  vcflib_dep = dependency('libvcflib', static: static_build, required: false)
+  seqlib_dep = dependency('libseqlib', static: static_build, required: false)
+else
+  # uses the local git submodules
+  htslib_dep = dependency('', required : false)
+  tabixpp_dep = dependency('', required : false)
+  vcflib_dep = dependency('', required : false)
+  seqlib_dep = dependency('', required : false)
+endif
 
 # for setting a warning_level on the external code in custom_* targets below
 warn_quiet = ['warning_level=0']
@@ -38,8 +59,8 @@ if not htslib_dep.found()
     'contrib/htslib/sam.c',
     'contrib/htslib/hts.c',
     'contrib/htslib/hfile.c',
-    'contrib/htslib/hfile_net.c',
-    'contrib/htslib/knetfile.c',
+    # 'contrib/htslib/hfile_net.c',
+    # 'contrib/htslib/knetfile.c',
     'contrib/htslib/textutils.c',
     'contrib/htslib/thread_pool.c',
     'contrib/htslib/region.c',
@@ -55,16 +76,25 @@ if not htslib_dep.found()
     'contrib/htslib/cram/cram_stats.c',
     'contrib/htslib/cram/cram_codecs.c',
     'contrib/htslib/cram/cram_decode.c',
-    'contrib/htslib/cram/cram_samtools.c',
+    # 'contrib/htslib/cram/cram_samtools.c',
     'contrib/htslib/kstring.c',
     'contrib/htslib/cram/cram_external.c',
     # 'contrib/htslib/cram/files.c',
     'contrib/htslib/cram/mFILE.c',
     'contrib/htslib/faidx.c',
-    'contrib/htslib/cram/rANS_static.c',
+    'contrib/htslib/hts_expr.c',
+    'contrib/htslib/htscodecs/htscodecs/rANS_static.c',
+    'contrib/htslib/htscodecs/htscodecs/arith_dynamic.c',
+    'contrib/htslib/htscodecs/htscodecs/fqzcomp_qual.c',
+    'contrib/htslib/htscodecs/htscodecs/htscodecs.c',
+    'contrib/htslib/htscodecs/htscodecs/rANS_static.c',
+    'contrib/htslib/htscodecs/htscodecs/tokenise_name3.c',
     'contrib/htslib/cram/open_trace_file.c',
     'contrib/htslib/multipart.c',
-    )
+    'contrib/htslib/htscodecs/htscodecs/rANS_static4x16pr.c',
+    'contrib/htslib/htscodecs/htscodecs/pack.c',
+    'contrib/htslib/htscodecs/htscodecs/rle.c',
+)
     htslib_lib = static_library('custom_htslib',
                                 htslib_src,
                                 include_directories : htslib_inc,
@@ -195,10 +225,17 @@ freebayes_lib = static_library(
     install : false,
     )
 
+if static_build
+  link_arguments = '-static'
+else
+  link_arguments = []
+endif
+
 executable('freebayes',
            freebayes_src,
            include_directories : incdir,
            cpp_args : extra_cpp_args,
+           link_args: link_arguments,
            dependencies: [zlib_dep, lzma_dep, thread_dep,
                           htslib_dep, tabixpp_dep, vcflib_dep, seqlib_dep],
            link_with : freebayes_lib,
@@ -209,6 +246,7 @@ executable('bamleftalign',
            bamleftalign_src,
            include_directories : incdir,
            cpp_args : extra_cpp_args,
+           link_args: link_arguments,
            dependencies: [zlib_dep, lzma_dep, thread_dep,
                           htslib_dep, vcflib_dep, seqlib_dep],
            link_with : freebayes_lib,


=====================================
meson_options.txt
=====================================
@@ -0,0 +1,2 @@
+option('prefer_system_deps', type : 'boolean', value : true)
+option('static', type : 'boolean', value : false)


=====================================
scripts/GenerateFreebayesRegions.R
=====================================
@@ -0,0 +1,34 @@
+## Generate freebayes params ##
+# make bed for parallelising freebayes
+
+#' This script reads a .fai index and a set of chroms and outputs a series of BED files
+#' which split the genome into separate regions.
+
+library(dplyr)
+library(data.table)
+library(glue)
+
+# read inputs
+chroms = snakemake at params[['chroms']]
+chunks = snakemake at params[['chunks']]
+fai = fread(snakemake at input[['index']])
+
+# select chroms we want, and start, end columns
+fai = fai[fai$V1 %in% chroms, c(1,2)]
+
+# for each chromsome
+for (chrom in chroms){
+   #subset index to desired chrom
+   f = fai[fai$V1 == chrom]
+   #get sequence of n chunks from 0 to length of chrom
+   bedseq = round(seq(0, f$V2, length.out = chunks))
+   
+   #for each chunk
+   for (i in 1:(chunks-1)){
+      #write bed file, one for each chunk/interval, which will be passed as input to freebayes
+      row = c(chrom, bedseq[i], bedseq[i+1])
+      data.frame(row) %>% t() %>% fwrite(., glue("resources/regions/genome.{chrom}.region.{i}.bed"), sep="\t", col.names = FALSE)
+   }
+}
+
+sessionInfo()
\ No newline at end of file


=====================================
scripts/fasta_generate_regions.py
=====================================
@@ -1,29 +1,57 @@
 #!/usr/bin/env python3
 
-import sys
-
-if len(sys.argv) == 1:
-    print("usage: {} <fasta file or index file> <region size>".format(sys.argv[0]))
-    print("generates a list of freebayes/bamtools region specifiers on stdout")
-    print("intended for use in creating cluster jobs")
-    exit(1)
-
-fasta_index_file = sys.argv[1]
-if not fasta_index_file.endswith(".fai"):
-    fasta_index_file = fasta_index_file + ".fai"
-
-fasta_index_file = open(fasta_index_file)
-region_size = int(sys.argv[2])
-
-for line in fasta_index_file:
-    fields = line.strip().split("\t")
-    chrom_name = fields[0]
-    chrom_length = int(fields[1])
-    region_start = 0
-    while region_start < chrom_length:
-        start = region_start
-        end = region_start + region_size
-        if end > chrom_length:
-            end = chrom_length
-        print(chrom_name + ":" + str(region_start) + "-" + str(end))
-        region_start = end
+import argparse
+from math import ceil
+
+
+def generate_regions(fasta_index_file, size, chunks=False, chromosomes=None, bed_files=None):
+
+    if not fasta_index_file.endswith(".fai"):
+        fasta_index_file = fasta_index_file + ".fai"
+
+    with open(fasta_index_file, "r") as fasta_index:
+        for line in fasta_index:
+            fields = line.strip().split("\t")
+            chrom_name = fields[0]
+            chrom_length = int(fields[1])
+            if chromosomes is not None and chrom_name not in chromosomes:
+                continue
+            region_start = 0
+            if chunks is True:
+                region_size = ceil(chrom_length / size)  # have to make sure this works
+            else:
+                region_size = size
+            while region_start < chrom_length:
+                region_end = region_start + region_size
+                if region_end > chrom_length:
+                    region_end = chrom_length
+                start = str(region_start)
+                end = str(region_end)
+                if bed_files is not None:
+                    region = str(ceil(region_end / region_size))
+                    file_path = f"{bed_files}.{chrom_name}.region.{region}.bed"
+                    with open(file_path, "w") as f:
+                        f.write("\t".join([chrom_name, start, end]))
+                else:
+                    print(f"{chrom_name}:{start}-{end}")
+                region_start = region_end
+
+
+if __name__ == "__main__":
+
+    parser = argparse.ArgumentParser(description="Generates a list of freebayes/bamtools region specifiers. Intended "
+                                                 "for parallelization or creating cluster jobs.")
+
+    parser.add_argument("--chunks", action="store_true",
+                        help="Split the fasta into N chunks rather than into N length pieces")
+    parser.add_argument("--chromosomes", nargs="+", default=None,
+                        help="List of chromosomes to create chunks for")
+    parser.add_argument("--bed", metavar="base name", type=str,
+                        help="Write chunks to individual bed files (for Snakemake) instead of stdout.")
+    parser.add_argument("fai", metavar="<fasta or fai file>",
+                        help="The fasta file to split. Must be indexed.")
+    parser.add_argument("region_size", metavar="<N>", type=int,
+                        help="Region size")
+
+    args = parser.parse_args()
+    generate_regions(args.fai, args.region_size, chunks=args.chunks, chromosomes=args.chromosomes, bed_files=args.bed)


=====================================
scripts/split_ref_by_bai_datasize.py
=====================================
@@ -0,0 +1,168 @@
+#!/usr/bin/env python3
+"""
+Create a list of regions by splitting a reference based on the amount of data in bam files.
+Uses the `bai` index of the bam files. Useful for submitting jobs of equal size to a cluster.
+"""
+
+import sys
+import os
+import argparse
+import time
+import logging
+
+import struct
+import numpy as np
+from scipy import interpolate
+
+
+DEFAULT_LOGGING_LEVEL = logging.INFO
+MAX_LOGGING_LEVEL = logging.CRITICAL
+
+def setup_logger(verbose_level):
+    fmt=('%(levelname)s %(asctime)s [%(module)s:%(lineno)s %(funcName)s] :: '
+            '%(message)s')
+    logging.basicConfig(format=fmt, level=max((0, min((MAX_LOGGING_LEVEL,
+                        DEFAULT_LOGGING_LEVEL-(verbose_level*10))))))
+
+
+def Main(argv):
+    tic_total = time.time()
+
+    # parse arguments
+    parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.ArgumentDefaultsHelpFormatter)
+    parser.add_argument('bamfiles', metavar='BAMFILE', nargs='*')
+    parser.add_argument('-L', '--bam-list', nargs='*')
+    parser.add_argument('-r', '--reference-fai', help="reference fasta index file", required=True)
+    parser.add_argument('-s', '--target-data-size', default='100e6', help="target combined data size of bam files in each region (MB)")
+    parser.add_argument('--bai-interval-size', default=16384, type=int, help="Size in baseparis of each interval in the bam index (bai).")
+    parser.add_argument('-v', '--verbose', action='count', default=0,
+                        help="increase logging verbosity")
+    parser.add_argument('-q', '--quiet', action='count', default=0,
+                        help="decrease logging verbosity")
+    args = parser.parse_args(argv)
+
+    # setup logger
+    setup_logger(verbose_level=args.verbose-args.quiet)
+    if argv is not None:
+        logging.warning('Using passed arguments: '+str(argv))
+    logging.info('args: '+str(args))
+
+    # additional argument parsing and datatype handling
+    if not args.bamfiles and not args.bam_list:
+        logging.error("Must provide an BAMFILE and/or --bam-list argument")
+        sys.exit(2)
+    args.target_data_size = int(float(args.target_data_size))*1000000
+    logging.info('target-data-size: '+str(args.target_data_size)+' bytes')
+
+    # read bam-lists if provided
+    if args.bam_list:
+        for bamlistfile in args.bam_list:
+            with open(bamlistfile,'r') as fh:
+                for x in fh:
+                    x = x.split('#')[0].strip()
+                    if x:
+                        args.bamfiles.append(x)
+    #logging.info('bam files: '+", ".join(args.bamfiles)) # output complete list of bam files being used
+
+    # read the reference fasta index
+    fai_chrom = []
+    fai_len = []
+    with open(args.reference_fai,'r') as fh:
+        for x in fh:
+            x = x.strip().split(sep='\t')
+            fai_chrom.append(str(x[0]))
+            fai_len.append(int(x[1]))
+
+
+    ## read bai indexes, skipping bin info
+    # list by chrom of number of intervals
+    n_intvs = np.array([int(np.ceil(x/args.bai_interval_size)) for x in fai_len])
+    # list by chrom of lists of interval offsets
+    icumsz = [] # cumulative size of data by interval
+    for i,n in enumerate(n_intvs):
+        icumsz.append(np.zeros((n,), dtype=np.int64))
+
+    for bamfn in args.bamfiles:
+        baifn = bamfn+'.bai'
+        with open(baifn,'rb') as fh:
+            logging.info("processing: "+baifn)
+            # filetype magic check
+            assert struct.unpack('4s', fh.read(4))[0] == b'BAI\x01'
+
+            # number of reference sequences (chroms)
+            n_ref = struct.unpack('i', fh.read(4))[0]
+            assert n_ref == len(fai_len), "fasta index and bam index have must have same number of chroms"
+
+            for ci in range(n_ref):
+                # skip over the binning index
+                n_bin = struct.unpack('i', fh.read(4))[0]
+                for bini in range(n_bin):
+                    bin_id = struct.unpack('I', fh.read(4))[0]
+                    n_chunk = struct.unpack('i', fh.read(4))[0]
+                    fh.seek(n_chunk*16, os.SEEK_CUR)
+                # read interval index
+                n_intv = struct.unpack('i', fh.read(4))[0]
+                if n_intv > 0:
+                    ioff = np.array(struct.unpack(str(n_intv)+'Q', fh.read(n_intv*8)), dtype=np.int64)
+                    while( len(ioff) < len(icumsz[ci]) ):
+                        ioff = np.append(ioff, ioff[-1]+1)
+                    icumsz[ci] += ioff-ioff[0]
+
+
+
+    ## make the list of regions
+    regions = []
+
+    for ci,chrom in enumerate(fai_chrom):
+
+        # sanity check last point if there are more than one
+        if len(icumsz[ci]) > 1:
+            assert icumsz[ci][-1] >= icumsz[ci][-2]
+
+        # tiny chroms just get 1 region
+        if len(icumsz[ci]) < 2:
+            regions.extend([ (fai_chrom[ci], 0, fai_len[ci]) ])
+            continue
+        ds = icumsz[ci]
+        pos = np.arange(0, ds.shape[0])*args.bai_interval_size
+
+        # estimate total data size for the chrom
+        f = interpolate.interp1d(pos, ds, fill_value='extrapolate', kind='linear')
+        ds_total = f([fai_len[ci]])[0]
+
+        num_regions = int(np.ceil(ds_total/args.target_data_size))
+
+        # approx equal LENGTH regions
+        # tmp = np.linspace(0, fai_len[ci], num=num_regions+1, endpoint=True, dtype=int)
+
+        # approx equal DATA SIZE regions
+        f = interpolate.interp1d(ds, pos, fill_value='extrapolate', kind='linear')
+        dsx = np.linspace(0, ds_total, num=num_regions+1, endpoint=True, dtype=int)
+        tmp = f(dsx).astype(int)
+        # ensure we exactly hit the endpoints
+        tmp[0] = 0
+        tmp[-1] = fai_len[ci]
+
+        regions.extend([ (fai_chrom[ci], tmp[i], tmp[i+1]) for i in range(len(tmp)-1) ])
+
+    ## Output regions file
+    for r in regions:
+        print(*r, sep='\t')
+
+    logging.info("Number of chroms: {}".format(len(fai_len)))
+    logging.info("Number of splits: {}".format(len(regions)-len(fai_len)))
+    logging.info("Number of regions: {}".format(len(regions)))
+
+    logging.info("Done: {:.2f} sec elapsed".format(time.time()-tic_total))
+    return 0
+
+
+
+#########################################################################
+# Main loop hook... if run as script run main, else this is just a module
+if __name__ == '__main__':
+    if 'TESTING_ARGS' in globals():
+        sys.exit(Main(argv=TESTING_ARGS))
+    else:
+        sys.exit(Main(argv=None))
+


=====================================
src/Parameters.cpp
=====================================
@@ -13,6 +13,8 @@ void Parameters::simpleUsage(char ** argv) {
         << endl
         << "   -h --help       For a complete description of options." << endl
         << endl
+        << "freebayes is maintained by Erik Garrison and Pjotr Prins." << endl
+        << endl
         << "citation: Erik Garrison, Gabor Marth" << endl
         << "          \"Haplotype-based variant detection from short-read sequencing\"" << endl
         << "          arXiv:1207.3907 (http://arxiv.org/abs/1207.3907)" << endl
@@ -23,14 +25,8 @@ void Parameters::simpleUsage(char ** argv) {
 }
 
 void Parameters::usage(char** argv) {
+    simpleUsage(argv);
     cout
-        << "usage: " << argv[0] << " [OPTION] ... [BAM FILE] ... " << endl
-        << endl
-        << "Bayesian haplotype-based polymorphism discovery." << endl
-        << endl
-        << "citation: Erik Garrison, Gabor Marth" << endl
-        << "          \"Haplotype-based variant detection from short-read sequencing\"" << endl
-        << "          arXiv:1207.3907 (http://arxiv.org/abs/1207.3907)" << endl
         << endl
         << "overview:" << endl
         << endl


=====================================
src/version_git.h
=====================================
@@ -1,4 +1,4 @@
 #ifndef VERSION_GIT_H
 #define VERSION_GIT_H
-#define VERSION_GIT "v1.3.5"
+#define VERSION_GIT "v1.3.6"
 #endif /* VERSION_GIT_H */


=====================================
test/performance/benchmark.md
=====================================
@@ -0,0 +1,52 @@
+# Benchmarking
+
+In this file we keep track of some benchmarking stats over time.
+
+Download the test BAM file from https://www.internationalgenome.org/
+
+    wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00100/alignment/HG00100.chrom20.ILLUMINA.bwa.GBR.low_coverage.20130415.bam
+
+And a reference genome. E.g.
+
+    wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr20.fa.gz
+
+and unpack. Note you need to change the first 'chr20' tag to '20' for this to work.
+
+To speed things up somewhat using a smaller ref
+
+    head -6000 chr20.fa > chr20-6K.fa
+
+## Penguin2 56x Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz, 256Gb
+
+First test an older 1.3.0 release:
+
+```sh
+time ./freebayes-v1.3.0-1 -f chr20-6K.fa HG00100.chrom20.ILLUMINA.bwa.GBR.low_coverage.20130415.bam > fb-test-1.3.0-1.vcf
+real    1m29.208s
+user    1m28.696s
+sys     0m0.508s
+```
+
+First test an older 1.2.0 release:
+
+```sh
+time ./.conda/pkgs/freebayes-1.2.0-htslib1.7_0/bin/freebayes -f chr20-6K.fa HG00100.chrom20.ILLUMINA.bwa.GBR.low_coverage.20130415.bam > fb-test-1.2.0-conda.vcf
+real    4m31.958s
+user    4m31.364s
+sys     0m0.584s
+```
+
+Found 49265 variants. The output for 1.2.0 is slightly different, so stick with later versions (after 1.3.0) of freebayes!
+
+Latest static build finds 49265 variants:
+
+```sh
+time ./freebayes-1.3.6 -f chr20-6K.fa HG00100.chrom20.ILLUMINA.bwa.GBR.low_coverage.20130415.bam > fb-test-1.3.6.vcf
+real    1m10.926s
+user    1m10.472s
+sys     0m0.444s
+```
+
+1.3.0 and 1.3.6 give the same results.  The speed difference between 1.3.0 to 1.3.6 is due to htslib improvements.
+
+The clang (9.0.1) + llvm (13.0.0) build has a similar runtime.


=====================================
test/runner.sh
=====================================
@@ -0,0 +1,13 @@
+#! /bin/sh
+#
+# Run tests directly from the test directory without perl-tap/prove
+
+# add meson builddir
+PATH=../builddir:$PATH
+
+bash ../test/t/00_region_and_target_handling.t
+echo $?
+exit
+bash ../test/t/01_call_variants.t
+bash ../test/t/02_multi_bam.t
+bash ../test/t/03_reference_bases.t



View it on GitLab: https://salsa.debian.org/med-team/freebayes/-/compare/c5d344e546d26a7c67cd45b5d74dc83d5be5166c...3a53c7fe8fa3e3ddd1d634be59a29f4b0fc5c533

-- 
View it on GitLab: https://salsa.debian.org/med-team/freebayes/-/compare/c5d344e546d26a7c67cd45b5d74dc83d5be5166c...3a53c7fe8fa3e3ddd1d634be59a29f4b0fc5c533
You're receiving this email because of your account on salsa.debian.org.


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20220121/45432e3f/attachment-0001.htm>


More information about the debian-med-commit mailing list