[med-svn] [Git][med-team/discosnp][upstream] New upstream version 4.4.4
Andreas Tille
gitlab at salsa.debian.org
Thu Oct 1 13:57:28 BST 2020
Andreas Tille pushed to branch upstream at Debian Med / discosnp
Commits:
07ef2a0f by Andreas Tille at 2020-10-01T13:48:03+02:00
New upstream version 4.4.4
- - - - -
15 changed files:
- README.md
- + discoSnpRAD/COOKBOOK.md
- discoSnpRAD/README.md
- + discoSnpRAD/clustering_scripts/README.md
- discoSnpRAD/clustering_scripts/discoRAD_clustering.sh
- discoSnpRAD/post-processing_scripts/README.md
- discoSnpRAD/post-processing_scripts/filter_paralogs.py
- discoSnpRAD/run_discoSnpRad.sh
- doc/release_note.txt
- scripts/k3000/K3000_enhance_gfa.py
- scripts/k3000/K3000_find_unitig_connected_pairs_of_facts.py
- scripts/k3000/K3000_gfa_to_dat.py
- scripts/k3000/K3000_node_ids_to_node_sequences.py
- + scripts/k3000/extract_DP_from_vcf.py
- scripts/k3000/run.sh
Changes:
=====================================
README.md
=====================================
@@ -32,12 +32,26 @@ c++ compiler; compilation was tested with gcc and g++ version>=4.5 (Linux) and c
## Instructions
- # get a local copy of DiscoSnp source code
- git clone --recursive https://github.com/GATB/DiscoSnp.git
-
- # compile the code an run a simple test on your computer
- cd DiscoSnp
- sh INSTALL
+### Install from source
+
+```bash
+# get a local copy of DiscoSnp source code
+git clone --recursive https://github.com/GATB/DiscoSnp.git
+
+# compile the code an run a simple test on your computer
+cd DiscoSnp
+sh INSTALL
+```
+
+### Install using Conda
+
+DiscoSnp++ and DiscoSnpRAD are also distributed as a [Bioconda package](https://anaconda.org/bioconda/discosnp):
+
+```
+conda install -c bioconda discosnp
+```
+
+The scripts `run_discoSnp++.sh` and `run_discoSnpRad.sh` are then executable.
# Getting a binary stable release
@@ -70,9 +84,9 @@ Run DiscoSnp WITH mapping results on a reference genome AND using this reference
See doc/discoSnp_user_guide.pdf or doc/discoSnp_user_guide.txt
# DiscoSnpRad
-While dealing with RAD-Seq data, `run_discoSnpRad.sh` script should be used. It uses options specific to RAD-Seq: branching strategy, kind of extensions, abundance threshold, and kind of bubbles to be found. Moreover, it clusters variants per locus by calling the `discoRAD_finalization.sh` pipeline. Cluster information is reported in the final provided VCF file.
+When dealing with RAD-Seq data, `run_discoSnpRad.sh` script should be used. It uses options specific to RAD-Seq data: branching strategy, kind of extensions, abundance threshold, and kind of bubbles to be found. Moreover, it clusters variants per locus, the cluster information being reported in the final provided VCF file.
-A README file describes all scripts and the `discoRAD_finalization.sh` pipeline.
+See the [discoSnpRAD README file](https://github.com/GATB/DiscoSnp/tree/master/discoSnpRAD).
# Contact
=====================================
discoSnpRAD/COOKBOOK.md
=====================================
@@ -0,0 +1,277 @@
+# DiscoSnp-RAD cookbook
+**Table of Contents**
+* [1. No reference genome - Using only reads 1 from pairs - No clustering](#1)
+* [2. No reference genome - Using only reads 1 from pairs - With clustering](#2)
+* [3. Using a reference genome - Using only reads 1 from pairs - With clustering](#3)
+* [4. Using forward and reverse reads.](#4)
+* [5. Post-processing](#5)
+
+- - - -
+**Prerequisite**
+
+Datasets `set1.fastq.gz`, `set2.fastq.gz`, ..., `set5.fastq.gz` for this cookbook can be downloaded as following:
+
+```bash
+for i in 1 2 3 4 5; do wget http://bioinformatique.rennes.inria.fr/data_cookbook_discoSnp-RAD/set$i.fastq.gz; done
+```
+
+**Full commands:** For each example, the full commands are proposed at the end of the section.
+
+**Computation time** is approximately 10 to 15 minutes per example (once dataset is dowloaded).
+
+***
+
+**Note about multiplexed data**
+
+To date, DiscoSnp-RAD is not able to consider multiplexed data. Hence, input files need to be demultiplexed to samples.
+
+- - - -
+
+## 1. <a name="1"> No reference genome - Using only reads 1 from pairs - No clustering</a>
+This is the most classical usage of DiscoSnp-RAD.
+
+Consider one has $n$ rad datasets composed only of reads 1. Suppose those sets are called `set1.fastq.gz`, `set2.fastq.gz`, ..., `setn.fastq.gz`.
+
+**First**, one need to create a *file of file* (fof), used as input for DiscoSnp-RAD. In this case, we want each data set to be considered individually in the displayed results. Hence, each line of the input fof contains exactly one dataset name. We may create this fof as following:
+
+```bash
+ls set*.fastq.gz > my_fof.txt
+```
+
+> **Reminder**DiscoSnp-RAD can analyse fastq or fasta files, gzipped or not.
+
+> **Note**If you wish to run DiscoSnp-RAD from a directory distinct from the one containing the read datasets, you have to indicate the absolute paths in the fof:```bash
+> ls -d $PWD/set*.fastq.gz > my_fof.txt
+>
+> ```
+>
+> ```
+
+**Second**, run discoSnp-RAD, using the input fof:
+
+```bash
+/my/discoSnp/path/discoSnpRAD/run_discoSnpRad.sh -r my_fof.txt
+```
+
+That's it! Results are available in files
+
+* `discoRad_k_31_c_3_D_0_P_5_m_5_raw.fa` that contains the raw bubble sequences detected by discoSnp. For instance the first variant of this file is :
+
+```
+>SNP_higher_path_9965|P_1:30_C/G|high|nb_pol_1|left_unitig_length_83|right_unitig_length_6|left_contig_length_83|right_contig_length_6|C1_0|C2_22|C3_0|C4_0|C5_0|Q1_0|Q2_71|Q3_0|Q4_0|Q5_0|G1_1/1:364,58,5|G2_0/0:5,70,444|G3_1/1:384,61,5|
+G4_1/1:424,67,5|G5_1/1:364,58,5|rank_1
+atgtggcctgccgaggtggaggcggtcatcgacgagctgccggaggtgaagcgggtgtgcgtgatcggggtttacgacgagacCCAGGGAGATGTGCCTGGTGCCCTGGTTGTCCGGGAGGATAATGCCACTCTGACCGCACAGcaggtg
+>SNP_lower_path_9965|P_1:30_C/G|high|nb_pol_1|left_unitig_length_83|right_unitig_length_6|left_contig_length_83|right_contig_length_6|C1_18|C2_0|C3_19|C4_21|C5_18|Q1_71|Q2_0|Q3_71|Q4_71|Q5_71|G1_1/1:364,58,5|G2_0/0:5,70,444|G3_1/1:384,
+61,5|G4_1/1:424,67,5|G5_1/1:364,58,5|rank_1
+atgtggcctgccgaggtggaggcggtcatcgacgagctgccggaggtgaagcgggtgtgcgtgatcggggtttacgacgagacCCAGGGAGATGTGCCTGGTGCCCTGGTTGTGCGGGAGGATAATGCCACTCTGACCGCACAGcaggtg
+```
+
+ It is a SNP for which one can find **1/** its id: `9965`, **2/** its variable nucleotides `A/C` **3/** its nucleotidic complexity `high` (this is not a low complexity sequence), **4/** the length of its left and right unitigs `83` and `6` (corresponding resp. to the leftmost and rightmost lowercase acgt chatacters). For the higher path (first sequence corresponding to the `C` allele - the lower path corresponding to the `G`allele) **5/** the abundance of the allele is provided in the Ci fields (`0` for first dataset, `22` for the second and so on), **6/** the average phred quality of mapped reads on the variable locus is provided (`0`for first dataset, `71` for the second and so on). For the both paths, **7/** the estimated genotype of the variant is provided for each dataset (`G1` to `G5`), and **8/** its rank (see manuscript for detailed explanation)..
+
+ Note1: information **1/**, **2/**, **3/**, **4/**, **7/** and **8/** are redundant for the two lines, while information **5/** and **6/** are specific to each allele, that is to say to each line.
+
+ Note2: higher case characters correspond to the sequences of the bubble, and lower case sequences correspond to the left and right unitigs surrounding the bubble.
+
+* `discoRad_k_31_c_3_D_0_P_5_m_5.vcf`. For instance the first variant is this file is
+
+```
+SNP_higher_path_9965 113 9965 C G . . Ty=SNP;Rk=1.0;UL=83;UR=6;CL=83;CR=6;Genome=.;Sd=.;Cluster=.;ClSize=. GT:DP:PL:AD:HQ 1/1:18:364,58,5:0,18:0,71 0/0:22:5,70,444:22,0:71,0 1/1:19:384,
+61,5:0,19:0,71 1/1:21:424,67,5:0,21:0,71 1/1:18:364,58,5:0,18:0,71
+```
+
+ This line contains the same information as the one existing in the comment of the fasta file previously presented. Additional fields are proposed. They are empty as they are related to the usage of a reference genome that we have not done in this example or to the clustering of the variants that has not been done neither.
+
+* `discoRad_read_files_correspondance.txt`. This files simply recall the correspondence between the `C1`, ... `C5` fields with their datasets names. For instance, first line is `C_1 set1.fastq.gz`
+
+**Full commands:**
+
+```bash
+disco_path=/my/discoSnp/path/
+```
+
+```
+for i in 1 2 3 4 5; do wget http://bioinformatique.rennes.inria.fr/data_cookbook_discoSnp-RAD/set$i.fastq.gz; done
+ls set*.fastq.gz > my_fof.txt
+${disco_path}/discoSnpRAD/run_discoSnpRad.sh -r my_fof.txt
+```
+
+## 2. <a name="2">No reference genome - Using only reads 1 from pairs - With clustering</a>
+Given the fof file created as previously, run discoSnp-RAD indicating the `short_read_connector` installation path.
+
+```bash
+/my/discoSnp/path/discoSnpRAD/run_discoSnpRad.sh -r my_fof.txt -S /my/SRC/path/
+```
+
+ In this case we retrieve the `discoRad_k_31_c_3_D_0_P_5_m_5_raw.fa` and the `discoRad_read_files_correspondance.txt` files as previously exposed.
+
+ The vcf file is now called `discoRad_k_31_c_3_D_0_P_5_m_5_clustered.vcf` as the `Cluster` field contains for each variant the cluster id of the variant and the `ClSize` field contains the size of the corresponding cluster.
+
+ In addition a second .fa file is provided: `discoRad_k_31_c_3_D_0_P_5_m_5_raw_filtered.fa`. In this file, variants with more than 0.4 missing data and rank<0.4 are filtered out.
+
+**Full commands:**
+
+```bash
+disco_path=/my/discoSnp/path/
+src_path=/my/short_short_read_connector/path
+```
+
+```
+for i in 1 2 3 4 5; do wget http://bioinformatique.rennes.inria.fr/data_cookbook_discoSnp-RAD/set$i.fastq.gz; done
+ls set*.fastq.gz > my_fof.txt
+${disco_path}/discoSnpRAD/run_discoSnpRad.sh -r my_fof.txt -S ${src_path}
+```
+
+## 3. <a name="3">Using a reference genome - Using only reads 1 from pairs - With clustering</a>
+If one disposes for a reference genome, it can be used for determine the position of each predicted variant (without using the reference genome for prediction) on the genome.
+
+We may use the proposed reference genome as following:
+
+```bash
+wget http://bioinformatique.rennes.inria.fr/data_cookbook_discoSnp-RAD/ref.fa
+```
+
+Variants from `discoRad_k_31_c_3_D_0_P_5_m_5_raw_filtered.fa` can be mapped to the reference genome as following:
+
+```bash
+/my/discoSnp/path/scripts/run_VCF_creator.sh -G ref.fa -p discoRad_k_31_c_3_D_0_P_5_m_5_raw_filtered.fa -e -o temp.vcf
+```
+
+Note: bwa must be installed and in the path.
+
+In order to retrieve the clustering information already computed, the `add_cluster_info_to_mapped_vcf.py`script may be used:
+
+```
+python /my/discoSnp/path/discoSnpRAD/post-processing_scripts/add_cluster_info_to_mapped_vcf.py -m temp.vcf -u discoRad_k_31_c_3_D_0_P_5_m_5_clustered.vcf -o discoRad_k_31_c_3_D_0_P_5_m_5_mapped.vcf
+```
+
+**Full commands:**
+
+```bash
+disco_path=/my/discoSnp/path/
+src_path=/my/short_short_read_connector/path
+```
+
+```
+for i in 1 2 3 4 5; do wget http://bioinformatique.rennes.inria.fr/data_cookbook_discoSnp-RAD/set$i.fastq.gz; done
+wget http://bioinformatique.rennes.inria.fr/data_cookbook_discoSnp-RAD/ref.fa
+ls set*.fastq.gz > my_fof.txt
+${disco_path}/discoSnpRAD/run_discoSnpRad.sh -r my_fof.txt -S ${src_path}
+${disco_path}/scripts/run_VCF_creator.sh -G ref.fa -p discoRad_k_31_c_3_D_0_P_5_m_5_raw_filtered.fa -e -o temp.vcf
+python ${disco_path}/discoSnpRAD/post-processing_scripts/add_cluster_info_to_mapped_vcf.py -m temp.vcf -u discoRad_k_31_c_3_D_0_P_5_m_5_clustered.vcf -o discoRad_k_31_c_3_D_0_P_5_m_5_mapped.vcf
+```
+
+## 4. <a name="4">Using forward and reverse reads.</a>
+
+Whatever the wanted usage (with or without reference genome, with or without clustering) one may use pairend data.
+
+Imagine one disposes from 5 pairend read sets
+
+* `set1_1.fastq.gz`, `set1_2.fastq.gz`
+* `set2_1.fastq.gz`, `set2_2.fastq.gz`
+* ...
+* `set5_1.fastq.gz`, `set5_2.fastq.gz`
+
+1. **If our aim is to consider individually each file** (hence considering each file as a set), then we can simply create a *file of files* (fof) in which each line is a .fastq.gz file:
+
+```bash
+ls *.fastq.gz > my_fof.txt
+```
+
+2. **If our aim is to virtually concatenate forward and reverse reads for each sample**, we have to create as many fof as samples:
+
+```bash
+for (( i=1; i<=5; i++ ));
+ do
+ ls set${i}_*.fastq.gz > my_fof_set${i}.txt
+ ;
+done
+```
+
+ Finally the fof provided to `run_discoSnpRad.sh` script (`-r` option) is a file in which each line is a fof file for one sample:
+
+```bash
+ls my_fof_set*.txt > my_fof.txt
+```
+
+## 5. <a name="5">Post-processing</a>
+
+A bench of post-processing scripts can be found in the dedicated [directory](https://github.com/GATB/DiscoSnp/tree/master/discoSnpRAD/post-processing_scripts).
+
+We consider here that [case 3](#3.%20Using%20a%20reference%20genome%20-%20Using%20only%20reads%201%20from%20pairs%20-%20With%20clustering) (Using a reference genome - Using only reads 1 from pairs - With clustering) was performed, hence obtaining the file: `discoRad_k_31_c_3_D_0_P_5_m_5_mapped.vcf`
+
+### 5.1 Filtering scripts
+
+#### **script** `filter_by_cluster_size_and_rank.py`
+
+* Filtering on cluster size.
+ Use case: Need clusters of size between 2 to 100:
+
+ ```bash
+ python filter_by_cluster_size_and_rank.py -i discoRad_k_31_c_3_D_0_P_5_m_5_mapped.vcf -o filtered_on_cluster_size.vcf -m 2 -M 100
+ ```
+
+* Filtering on rank.
+ Use case: Need variants with a rank higher than 0.8:
+
+ ```bash
+ python filter_by_cluster_size_and_rank.py -i discoRad_k_31_c_3_D_0_P_5_m_5_mapped.vcf -o filtered_on_rank.vcf -r 0.8
+ ```
+
+#### script `filter_vcf_by_indiv_cov_max_missing_and_maf.py`
+
+* Filtering on read coverage.
+ Use case: replace by `./.`original genotypes of variants whose total read coverage is below 20:
+
+ ```bash
+ python filter_vcf_by_indiv_cov_max_missing_and_maf.py -i discoRad_k_31_c_3_D_0_P_5_m_5_mapped.vcf -o non_genotyped_low_covered.vcf -c 20
+ ```
+
+* Filtering on missing genotypes.
+ Use case: Remove variants whose fraction of missing genotypes is greater than 70%:
+
+ ```bash
+ python filter_vcf_by_indiv_cov_max_missing_and_maf.py -i discoRad_k_31_c_3_D_0_P_5_m_5_mapped.vcf -o filtered_on_missing_geno.vcf -m 0.7
+ ```
+
+* Filtering on minor allele frequency.
+ Use case: Remove variants whose minor allele frequency is smaller than 0.2:
+
+ ```bash
+ python filter_vcf_by_indiv_cov_max_missing_and_maf.py -i discoRad_k_31_c_3_D_0_P_5_m_5_mapped.vcf -o filtered_on_maf.vcf -f 0.2
+ ```
+
+#### script `filter_paralogs.py`
+
+* Filtering on paralogs.
+ Use case: Remove variants that belong to a cluster such that more than 50% of its variants have each more than 10% of heterozygous genotypes:
+
+ ```bash
+ python filter_paralogs.py -i discoRad_k_31_c_3_D_0_P_5_m_5_mapped.vcf -o filtered_on_paralogs.vcf -x 0.1 -y 0.5
+ ```
+
+### 5.2 Scripts for STRUCTURE analyses
+
+#### script `1SNP_per_cluster.py`
+
+* Conserve one variant per cluster (the one with less missing genotypes)..
+ Use case: prepare the vcf file to be used by STRUCTURE:
+
+ ```bash
+ python 1SNP_per_cluster.py -i discoRad_k_31_c_3_D_0_P_5_m_5_mapped.vcf -o one_variant_per_cluster.vcf
+ ```
+
+#### script `vcf2structure.sh`
+
+* Changes the vcf format to a Structure format (input of the software Structure).
+ Use case: prepare the vcf file to be used by STRUCTURE:
+
+ ```bash
+ sh vcf2structure.sh discoRad_k_31_c_3_D_0_P_5_m_5_mapped.vcf > file.str
+ ```
+
+
+
+### 5.3 Mapping to a reference, and keeping the cluster information.
+
+The script `add_cluster_info_to_mapped_vcf.py` can be used in case one wants to map predicted variants to any available genome. This use case was described in [case 3](#3.%20Using%20a%20reference%20genome%20-%20Using%20only%20reads%201%20from%20pairs%20-%20With%20clustering) (Using a reference genome - Using only reads 1 from pairs - With clustering)
+
=====================================
discoSnpRAD/README.md
=====================================
@@ -7,11 +7,13 @@ DiscoSnpRAD is a pipeline based on discoSnp++ to discover small variants in RAD-
* clustering the called variants into loci
**Reference:**
-Gauthier, J., Mouden, C., Suchan, T., Alvarez, N., Arrigo, N., Riou, C., Lemaitre, C., Peterlongo, P. (2017). [DiscoSnp-RAD: de novo detection of small variants for population genomics](https://www.biorxiv.org/content/early/2017/11/09/216747). BioRxiv
+Gauthier, J., Mouden, C., Suchan, T., Alvarez, N., Arrigo, N., Riou, C., Lemaitre, C., Peterlongo, P. (2019). [DiscoSnp-RAD: de novo detection of small variants for population genomics](https://www.biorxiv.org/content/10.1101/216747v2). BioRxiv
+
+ Simulation and validation scripts: [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3722623.svg)](https://doi.org/10.5281/zenodo.3722623)
## Installation
-* discoSnp++ (see [../README.md](../README.md))
+* DiscoSnpRAD is installed via the discoSnp++ basic instal (from sources or conda install. see [../README.md](../README.md))
* `short_read_connector` must have been downloaded and installed (clustering task). [https://github.com/GATB/short_read_connector](https://github.com/GATB/short_read_connector)
@@ -70,6 +72,11 @@ Each variant is described with:
+## Cookbook
+
+This [cookbook](./COOKBOOK.md) presents classical usages.
+
+
## Content of this directory
=====================================
discoSnpRAD/clustering_scripts/README.md
=====================================
@@ -0,0 +1,25 @@
+# Directory containing scripts called by [run_discoSnpRad.sh](https://github.com/GATB/DiscoSnp/blob/master/discoSnpRAD/run_discoSnpRad.sh)
+
+
+
+These scripts are not intended to be used by a user. They are called by [run_discoSnpRad.sh](https://github.com/GATB/DiscoSnp/blob/master/discoSnpRAD/run_discoSnpRad.sh).
+
+1. **script** `discoRAD_clustering.sh`
+ * This script
+ * manages bubble clustering from a discofile.fa file,
+ * manages the integration of cluster informations in a vcf file
+ * remove variants with more than 95% missing genotypes (can be changed with -m option)
+ * remove low rank (<0.4) variants (can be changed with -r option)
+ * remove clusters with more than 150 variants (can be changed with -c option)
+ * This script is automatically used in [run_discoSnpRad.sh](https://github.com/GATB/DiscoSnp/blob/master/discoSnpRAD/run_discoSnpRad.sh)
+ * It presents no interest to be used alone
+2. **script** ` from_SRC_to_edges.py `
+ * Parses the short read connector output to a format readable by the clustering algorithm.
+ * This script is used in [discoRAD_clustering.sh](https://github.com/GATB/DiscoSnp/blob/master/discoSnpRAD/clustering_scripts/discoRAD_clustering.sh)
+ * It presents no interest to be used alone
+3. **script** `fasta_and_cluster_to_filtered_vcf`
+ * This script is used in [discoRAD_clustering.sh](https://github.com/GATB/DiscoSnp/blob/master/discoSnpRAD/clustering_scripts/discoRAD_clustering.sh)
+ * It presents no interest to be used alone
+
+
+
=====================================
discoSnpRAD/clustering_scripts/discoRAD_clustering.sh
=====================================
@@ -1,4 +1,4 @@
-#!/bin/sh
+#!/bin/bash
# REQUIRES:
## Python 3
@@ -25,10 +25,13 @@ echo " 3/ Format the variants in a vcf file with cluster information"
echo "Usage: ./discoRAD_clustering.sh -f discofile -s SRC_path -o output_file.vcf"
# echo "nb: all options are MANDATORY\n"
echo "OPTIONS:"
-echo "\t -f: DiscoSnp fasta output containing coherent predictions"
-echo "\t -s: Path to Short Read Connector"
-echo "\t -o: output file path (vcf)"
-echo "\t -w: Wraith mode: only show all discoSnpRad commands without running them"
+echo " -f: DiscoSnp fasta output containing coherent predictions"
+echo " -m: Max Missing value (default 0.95)"
+echo " -r: Min Rank (default 0.4)"
+echo " -c: Max cluster size (default 150)"
+echo " -s: Path to Short Read Connector"
+echo " -o: output file path (vcf)"
+echo " -w: Wraith mode: only show all discoSnpRad commands without running them"
}
wraith="false"
@@ -36,13 +39,30 @@ wraith="false"
# help
# exit
# fi
+# rank filter parameter
+min_rank=0.4
+
+# max cluster size parameter
+max_cluster_size=150
+
+percent_missing=0.95
-while getopts "f:s:o:hw" opt; do
+while getopts "f:s:o:m:r:c:hw" opt; do
case $opt in
f)
rawdiscofile=$OPTARG
;;
+ m)
+ percent_missing=$OPTARG
+ ;;
+ r)
+ min_rank=$OPTARG
+ ;;
+ c)
+ max_cluster_size=$OPTARG
+ ;;
+
s)
short_read_connector_path=$OPTARG
;;
@@ -96,13 +116,7 @@ then
usedk=31
fi
-# rank filter parameter
-min_rank=0.4
-# max cluster size parameter
-max_cluster_size=150
-
-percent_missing=0.95
echo "${yellow}############################################################"
echo "######### MISSING DATA AND LOW RANK FILTERING #############"
=====================================
discoSnpRAD/post-processing_scripts/README.md
=====================================
@@ -1,8 +1,8 @@
# Directory containing some scripts to post-process and filter discoSnpRad results
-
-
+
+
## Filtering scripts
-
+
1. **script** `filter_by_cluster_size_and_rank.py`:
* removes variants belonging to a cluster (locus) whose size (nb of variants) is outside the given size range (options `-m` and `-M`)
* removes variants with rank lower than a given threshold given by option `-r`
@@ -27,7 +27,7 @@
## Scripts for STRUCTURE analyses :
- 4. **script** `1SNP_by_cluster.py`
+ 4. **script** `1SNP_per_cluster.py`
* selects one SNP per cluster (the one with less missing genotypes)
* Usage :
`python 1SNP_per_cluster.py -i vcf_file -o new_vcf_file`
@@ -35,7 +35,8 @@
5. **script** `vcf2structure.sh`
* changes the vcf format to a Structure format (input of the software Structure)
* Usage:
- `vcf2structure.sh file.vcf > fle.str`
+ `sh vcf2structure.sh file.vcf`
+ (creates a file `file.str`)
## Mapping to a reference genome, and keeping the cluster information :
=====================================
discoSnpRAD/post-processing_scripts/filter_paralogs.py
=====================================
@@ -97,7 +97,7 @@ def check_format(vcf_file):
INFO_split = line.split("\t")[7].split(";")
checked = True
if len(INFO_split) < 10: return False
- tmp_cluster = INFO_split[8].split("Cluster=")
+ tmp_cluster = INFO_split[-2].split("Cluster=")
if len(tmp_cluster) < 2: return False
if tmp_cluster[1] == ".": return False
cl_id = int(tmp_cluster[1])
@@ -123,7 +123,7 @@ def store_info(vcf_file, x):
if not line: break
if line.startswith("#"): continue
- num_cluster = int(line.split("\t")[7].split(";")[8].split("Cluster=")[1])
+ num_cluster = int(line.split("\t")[7].split(";")[-2].split("Cluster=")[1])
if num_cluster not in dict :
dict[num_cluster] = [0,0]
@@ -175,7 +175,7 @@ def output_newvcf(vcf_file, out_file, clusters_to_keep) :
new_vcf.write(line)
continue
- cluster = int(line.split("\t")[7].split(";")[8].split("Cluster=")[1])
+ cluster = int(line.split("\t")[7].split(";")[-2].split("Cluster=")[1])
if cluster not in clusters_to_keep: continue
new_vcf.write(line)
=====================================
discoSnpRAD/run_discoSnpRad.sh
=====================================
@@ -75,6 +75,11 @@ verbose=1
clustering="false"
short_read_connector_path=""
option_phase_variants=""
+
+max_size_cluster=150
+max_missing=0.95
+min_rank=0.4
+
#EDIR=$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )
EDIR=$( python -c "import os.path; print(os.path.dirname(os.path.realpath(\"${BASH_SOURCE[0]}\")))" ) # as suggested by Philippe Bordron
@@ -106,63 +111,91 @@ function help {
echo " *** HELP ***"
echo " ************"
echo "run_discoSnpRad.sh, pipelining kissnp2 and kissreads and clustering per locus for calling SNPs and small indels from RAD-seq data without the need of a reference genome"
- echo "Version "$version
+ echo "Version: "$version
+ echo "Cookbook: You may find a Cookbook here https://github.com/GATB/DiscoSnp/blob/master/discoSnpRAD/COOKBOOK.md providing classical use cases."
echo "Usage: ./run_discoSnpRad.sh --fof read_file_of_files --src [src_path] [OPTIONS]"
- echo -e "MANDATORY"
- echo -e "\t -r|--fof <file name of a file of file(s)>"
- echo -e "\t\t The input read files indicated in a file of file(s)"
- echo -e "\t\t Example: -r bank.fof with bank.fof containing the two lines \n\t\t\t data_sample/reads_sequence1.fasta\n\t\t\t data_sample/reads_sequence2.fasta.gz"
+ echo "MANDATORY"
+ echo " -r|--fof <file name of a file of file(s)>"
+ echo " The input read files indicated in a file of file(s)"
+ echo " Example: -r bank.fof with bank.fof containing the two lines"
+ echo " data_sample/reads_sequence1.fasta"
+ echo " data_sample/reads_sequence2.fasta.gz"
+ echo " Note: DiscoSnp-RAD uses files demultiplexed to samples. Each sample corresponds to exactly one line in the input file of files."
+ echo ""
+ echo "PARAMETERS"
+ echo " -k | --k_size value <int value>"
+ echo " Set the length of used kmers. Must fit the compiled value."
+ echo " Default=31"
+ echo " -c | --min_coverage value <int value>"
+ echo " Set the minimal coverage per read set: Used by kissnp2 (don't use kmers with lower coverage) and kissreads (read coherency threshold)."
+ echo " This coverage can be automatically detected per read set (in this case use \"auto\" or specified per read set, see the documentation.)"
+ echo " Default=3"
+ echo " --high_precision | -R"
+ echo " lower recall / higher precision mode. With this parameter no symmetrical crossroads may be traversed during bubble detection (by default up to 5 symmetrical crossroads may be traversed during bubble detection)."
+
+ echo ""
+ echo "OPTIONS"
+ echo " -g | --graph <file name>"
+ echo " reuse a previously created graph (.h5 file) with same prefix and same k and c parameters."
+ echo " -p | --prefix <string>"
+ echo " All out files will start with this prefix. Default=\"discoRes\""
+ echo " -C | --max_coverage <int value>"
+ echo " Set the maximal coverage for each read set: Used by kissnp2 (don't use kmers with higher coverage)."
+ echo " Default=2^31-1"
+ echo " -l | --no_low_complexity"
+ echo " Remove low complexity bubbles"
+ echo " -D | --deletion_max_size <int value>"
+ echo " discoSnpRad will search for deletions of size from 1 to D included. Default=0"
- echo -e "\nOPTIONS"
- echo -e "\t -S|--src [src_path]"
- echo -e "\t\t performs clustering of variants with short_read_connector"
- echo -e "\t\t src_path: **absolute** path to short_read_connector directory, containing the \"short_read_connector.sh\" file. "
- echo -e "\t\t -Note1: short read connector must be compiled."
- echo -e "\t\t -Note2: if no value is given, it assumes short_read_connector.sh is in the PATH env variable."
- echo -e "\t\t -Note3: with this option, discoSnpRad outputs a vcf file containing the variants clustered by locus"
-
- echo -e "\t -k | --k_size value <int value>"
- echo -e "\t\t Set the length of used kmers. Must fit the compiled value."
- echo -e "\t\t Default=31"
- echo -e "\t -c | --min_coverage value <int value>"
- echo -e "\t\t Set the minimal coverage per read set: Used by kissnp2 (don't use kmers with lower coverage) and kissreads (read coherency threshold)."
- echo -e "\t\t This coverage can be automatically detected per read set (in this case use \"auto\" or specified per read set, see the documentation."
- echo -e "\t\t Default=3"
- echo -e "\t -C | --max_coverage value"
- echo -e "\t\t Set the maximal coverage for each read set: Used by kissnp2 (don't use kmers with higher coverage)."
- echo -e "\t\t Default=2^31-1"
+ echo ""
+ echo "CLUSTERING OPTION"
+ echo " -S|--src [src_path]"
+ echo " performs clustering of variants with short_read_connector"
+ echo " src_path: **absolute** path to short_read_connector directory, containing the \"short_read_connector.sh\" file. "
+ echo " -Note1: short read connector must be compiled."
+ echo " -Note2: if no value is given, it assumes short_read_connector.sh is in the PATH env variable."
+ echo " -Note3: with this option, discoSnpRad outputs a vcf file containing the variants clustered by locus"
+ echo " --max_size_cluster <int value> "
+ echo " Discards cluster containing more than this number of variants. (Default 150)"
+ echo " Requires the -S or --src option"
- echo -e "\t --high_precision | -R"
- echo -e "\t\t lower recall / higher precision mode. With this parameter no symmetrical crossroads may be traversed during bubble detection (by default up to 5 symmetrical crossroads may be traversed during bubble detection)."
+ echo ""
+ echo "FILTERING OPTION"
+ echo " --max_missing <float value> "
+ echo " Remove variants with more than max_missing % missing values. (Default 0.95, removes variants detected in 5% and less populations)"
+ echo " --min_rank <float value>"
+ echo " Remove variants whose rank is smaller than this threshold. (Default 0.4)"
- echo -e "\t -L | --max_diff_len <integer>"
- echo -e "\t\t Longest accepted difference length between two paths of a truncated bubble"
- echo -e "\t\t default 0"
+
+
+
+
+# echo " -L | --max_diff_len <integer>" # Hidden -
+# echo " Longest accepted difference length between two paths of a truncated bubble"
+# echo " default 0"
- echo -e "\t -g | --graph <file name>"
- echo -e "\t\t reuse a previously created graph (.h5 file) with same prefix and same k and c parameters."
- echo -e "\t -D | --deletion_max_size <int>"
- echo -e "\t\t discoSnpRad will search for deletions of size from 1 to D included. Default=0"
- echo -e "\t -a | --ambiguity_max_size <int>"
- echo -e "\t\t Maximal size of ambiguity of INDELs. INDELS whose ambiguity is higher than this value are not output [default '20']"
- echo -e "\t -P | --max_snp_per_bubble <int>"
- echo -e "\t\t discoSnpRad will search up to P SNPs in a unique bubble. Default=5"
- echo -e "\t -p | --prefix <string>"
- echo -e "\t\t All out files will start with this prefix. Default=\"discoRes\""
- echo -e "\t -l | --no_low_complexity"
- echo -e "\t\t Remove low complexity bubbles"
- echo -e "\t -d | --max_substitutions <int>"
- echo -e "\t\t Set the number of authorized substitutions used while mapping reads on found SNPs (kissreads). Default=1"
- echo -e "\t -u | --max_threads <int>"
- echo -e "\t\t Max number of used threads. 0 means all threads"
- echo -e "\t\t default 0"
+# echo " -a | --ambiguity_max_size <int>" # Hidden
+# echo " Maximal size of ambiguity of INDELs. INDELS whose ambiguity is higher than this value are not output [default '20']"
+ echo ""
+ echo "ADVANCED OPTIONS"
+ echo " -P | --max_snp_per_bubble <int>"
+ echo " discoSnpRad will search up to P SNPs in a unique bubble. Default=5"
+ echo " -d | --max_substitutions <int>"
+ echo " Set the number of authorized substitutions used while mapping reads on found SNPs (kissreads). Default=10"
+
- echo -e "\t -w\t Wraith mode: only show all discoSnpRad commands without running them"
- echo -e "\t -v <0 or 1>"
- echo -e "\t\t Verbose 0 (avoids progress output) or 1 (enables progress output) -- default=1."
- echo -e "\t -h | --help"
- echo -e "\t\t Prints this message and exist\n"
+ echo ""
+ echo "MISC."
+ echo " -u | --max_threads <int>"
+ echo " Max number of used threads. 0 means all threads"
+ echo " default 0"
+ echo " -w Wraith mode: only show all discoSnpRad commands without running them"
+ echo " -v <0 or 1>"
+ echo " Verbose 0 (avoids progress output) or 1 (enables progress output) -- default=1."
+ echo " -h | --help"
+ echo " Prints this message and exist"
+ echo ""
echo "Any further question: read the readme file or contact us via the Biostar forum: https://www.biostars.org/t/discosnp/"
}
@@ -179,6 +212,34 @@ echo "${yellow}"
while :; do
case $1 in
+ --max_missing)
+ if [ "$2" ] && [ ${2:0:1} != "-" ] ; then # checks that there exists a second value and its is not the start of the next option
+ max_missing=$2
+ shift 1
+ else
+ die 'ERROR: "'$1'" option requires a non-empty option argument.'
+ fi
+ ;;
+
+ --min_rank)
+ if [ "$2" ] && [ ${2:0:1} != "-" ] ; then # checks that there exists a second value and its is not the start of the next option
+ min_rank=$2
+ shift 1
+ echo "min_rank" ${min_rank}
+ else
+ die 'ERROR: "'$1'" option requires a non-empty option argument.'
+ fi
+ ;;
+
+ --max_size_cluster)
+ if [ "$2" ] && [ ${2:0:1} != "-" ] ; then # checks that there exists a second value and its is not the start of the next option
+ max_size_cluster=$2
+ shift 1
+ else
+ die 'ERROR: "'$1'" option requires a non-empty option argument.'
+ fi
+ ;;
+
-A)
option_phase_variants="-phasing"
echo "Will phase variants during kissreads process - WARNING this option is too experimental and thus not described in the help message"
@@ -373,9 +434,9 @@ echo $reset
#######################################################################
if [ -z "$read_sets" ]; then
- echo -e "${red}\t\t\t**************************************************************************"
- echo -e "\t\t\t** ERROR: You must provide at least one read set (-r) "
- echo -e "\t\t\t**************************************************************************"
+ echo "${red} **************************************************************************"
+ echo " ** ERROR: You must provide at least one read set (-r) "
+ echo " **************************************************************************"
echo $reset
exit 1
fi
@@ -389,11 +450,11 @@ if [[ "$clustering" == "true" ]]; then
if [ -f "$src_file" ]; then
echo "${yellow}short_read_connector path is $src_file$reset"
else
- echo -e "${red}\t\t\t**************************************************************************"
- echo -e "\t\t\t** WARNING: I cannot find short_read_connector (-S). "
- echo -e "\t\t\t** $src_file does not exist"
- echo -e "\t\t\t** I will not cluster variants per RAD locus"
- echo -e "\t\t\t**************************************************************************"
+ echo "${red} **************************************************************************"
+ echo " ** WARNING: I cannot find short_read_connector (-S). "
+ echo " ** $src_file does not exist"
+ echo " ** I will not cluster variants per RAD locus"
+ echo " **************************************************************************"
echo $reset
clustering="false"
fi
@@ -403,11 +464,11 @@ if [[ "$clustering" == "true" ]]; then
if [ -n "$src_file" ]; then
echo "${yellow}short_read_connector path is $src_file$reset"
else
- echo -e "${red}\t\t\t**************************************************************************"
- echo -e "\t\t\t** WARNING: I cannot find short_read_connector in PATH. "
- echo -e "\t\t\t** Try giving the absolute path of short_read_connector directory with option -S"
- echo -e "\t\t\t** I will not cluster variants per RAD locus"
- echo -e "\t\t\t**************************************************************************"
+ echo "${red} **************************************************************************"
+ echo " ** WARNING: I cannot find short_read_connector in PATH. "
+ echo " ** Try giving the absolute path of short_read_connector directory with option -S"
+ echo " ** I will not cluster variants per RAD locus"
+ echo " **************************************************************************"
echo $reset
clustering="false"
fi
@@ -453,18 +514,18 @@ fi
#######################################################################
if [[ "$wraith" == "false" ]]; then
- echo -e "${yellow}\tRunning discoSnpRad "$version", in directory "$EDIR" with following parameters:"
- echo -e "\t\t read_sets="$read_sets
- echo -e "\t\t short_read_connector path="$short_read_connector_path
- echo -e "\t\t prefix="$h5prefix
- echo -e "\t\t c="$c
- echo -e "\t\t C="$C
- echo -e "\t\t k="$k
- echo -e "\t\t b="$b
- echo -e "\t\t d="$d
- echo -e "\t\t D="$D
- echo -e "\t\t max_truncated_path_length_difference="$max_truncated_path_length_difference
- echo -e -n "\t starting date="
+ echo "${yellow} Running discoSnpRad "$version", in directory "$EDIR" with following parameters:"
+ echo " read_sets="$read_sets
+ echo " short_read_connector path="$short_read_connector_path
+ echo " prefix="$h5prefix
+ echo " c="$c
+ echo " C="$C
+ echo " k="$k
+ echo " b="$b
+ echo " d="$d
+ echo " D="$D
+ echo " max_truncated_path_length_difference="$max_truncated_path_length_difference
+ echo -n " starting date="
date
echo $reset
echo
@@ -497,11 +558,12 @@ if [ $remove -eq 1 ]; then
rm -f $h5prefix.h5
fi
+
if [ ! -e $h5prefix.h5 ]; then
T="$(date +%s)"
- echo -e "${yellow}\t############################################################"
- echo -e "\t#################### GRAPH CREATION #######################"
- echo -e "\t############################################################${reset}"
+ echo "${yellow} ############################################################"
+ echo " #################### GRAPH CREATION #######################"
+ echo " ############################################################${reset}"
graphCmd="${dbgh5_bin} -in ${read_sets}_${kissprefix}_removemeplease -out $h5prefix -kmer-size $k -abundance-min ${c_dbgh5} -abundance-max $C -solidity-kind one ${option_cores_gatb} -verbose $verbose -skip-bcalm -skip-bglue -no-mphf"
echo $green${graphCmd}$cyan
@@ -521,7 +583,7 @@ if [ ! -e $h5prefix.h5 ]; then
fi
else
if [[ "$wraith" == "false" ]]; then
- echo -e "${yellow}File $h5prefix.h5 exists. We use it as input graph${reset}"
+ echo "${yellow}File $h5prefix.h5 exists. We use it as input graph${reset}"
fi
fi
@@ -535,9 +597,9 @@ fi
#################### KISSNP2 #######################
######################################################
T="$(date +%s)"
-echo -e "${yellow}\t############################################################"
-echo -e "\t#################### KISSNP2 MODULE #######################"
-echo -e "\t############################################################${reset}"
+echo "${yellow} ############################################################"
+echo " #################### KISSNP2 MODULE #######################"
+echo " ############################################################${reset}"
kissnp2Cmd="${kissnp2_bin} -in $h5prefix.h5 -out ${kissprefix}_r -b $b $l $x -P $P -D $D $extend $option_cores_gatb $output_coverage_option -coverage_file ${h5prefix}_cov.h5 -max_ambigous_indel ${max_ambigous_indel} -max_symmetrical_crossroads ${option_max_symmetrical_crossroads} -verbose $verbose -max_truncated_path_length_difference ${max_truncated_path_length_difference}"
echo $green${kissnp2Cmd}$cyan
if [[ "$wraith" == "false" ]]; then
@@ -557,9 +619,9 @@ if [ ! -f ${kissprefix}_r.fa ]
then
if [[ "$wraith" == "false" ]]; then
echo "${red}No polymorphism predicted by discoSnpRad"
- echo -e -n "\t ending date="
+ echo -n " ending date="
date
- echo -e "${yellow}\t Thanks for using discoSnpRad - http://colibread.inria.fr/discoSnp/${reset}"
+ echo "${yellow} Thanks for using discoSnpRad - http://colibread.inria.fr/discoSnp/${reset}"
exit
fi
fi
@@ -567,9 +629,9 @@ fi
#######################################################################
#################### REDUNDANCY REMOVAL #######################
#######################################################################
-echo -e "${yellow}\t############################################################"
-echo -e "\t#################### REDUNDANCY REMOVAL ###################"
-echo -e "\t############################################################$reset"
+echo "${yellow} ############################################################"
+echo " #################### REDUNDANCY REMOVAL ###################"
+echo " ############################################################$reset"
redundancy_removal_cmd="python $EDIR/../scripts/redundancy_removal_discosnp.py ${kissprefix}_r.fa $k $kissprefix.fa"
echo $green${redundancy_removal_cmd}$cyan
if [[ "$wraith" == "false" ]]; then
@@ -586,9 +648,9 @@ fi
#######################################################################
T="$(date +%s)"
-echo -e "${yellow}\t#############################################################"
-echo -e "\t#################### KISSREADS MODULE #######################"
-echo -e "\t#############################################################$reset"
+echo "${yellow} #############################################################"
+echo " #################### KISSREADS MODULE #######################"
+echo " #############################################################$reset"
smallk=$k
if (( $smallk>31 )) ; then
@@ -617,9 +679,9 @@ T="$(($(date +%s)-T))"
#################### SORT AND FORMAT RESULTS #########################
#######################################################################
-echo -e "${yellow}\t###############################################################"
-echo -e "\t#################### SORT AND FORMAT RESULTS #################"
-echo -e "\t###############################################################$reset"
+echo "${yellow} ###############################################################"
+echo " #################### SORT AND FORMAT RESULTS #################"
+echo " ###############################################################$reset"
if [[ "$wraith" == "false" ]]; then
sort -rg ${kissprefix}_coherent | cut -d " " -f 2 | tr ';' '\n' > ${kissprefix}_raw.fa
fi
@@ -652,9 +714,9 @@ rm -f ${kissprefix}_uncoherent.fa
#################### Deal with Downstream analyses ###############################
##################################################################################
-echo -e "${yellow}\t###############################################################"
-echo -e "\t######## CLUSTERING PER LOCUS AND/OR FORMATTING ###############"
-echo -e "\t###############################################################$reset"
+echo "${yellow} ###############################################################"
+echo " ######## CLUSTERING PER LOCUS AND/OR FORMATTING ###############"
+echo " ###############################################################$reset"
T="$(date +%s)"
if [[ "$clustering" == "true" ]]; then
@@ -662,7 +724,7 @@ if [[ "$clustering" == "true" ]]; then
echo "${yellow}Clustering and vcf formmatting$reset"
fi
final_output="${kissprefix}_clustered.vcf"
- cmd="$EDIR/clustering_scripts/discoRAD_clustering.sh -f ${kissprefix}_raw.fa -s $src_file -o ${final_output}"
+ cmd="$EDIR/clustering_scripts/discoRAD_clustering.sh -f ${kissprefix}_raw.fa -s $src_file -o ${final_output} -m ${max_missing} -r ${min_rank} -c ${max_size_cluster}"
echo $green$cmd$cyan$reset
if [[ "$wraith" == "false" ]]; then
eval $cmd
@@ -681,7 +743,7 @@ else
echo "Filtering and vcf formatting$reset"
fi
final_output="${kissprefix}.vcf"
- cmd="python3 $EDIR/../scripts/create_filtered_vcf.py -i ${kissprefix}_raw.fa -o ${final_output} -m 0.95 -r 0.4"
+ cmd="python3 $EDIR/../scripts/create_filtered_vcf.py -i ${kissprefix}_raw.fa -o ${final_output} -m ${max_missing} -r ${min_rank}"
echo $green$cmd$cyan$reset
if [[ "$wraith" == "false" ]]; then
eval $cmd
@@ -693,16 +755,16 @@ else
fi
if [[ "$wraith" == "false" ]]; then
- echo -e "${yellow}\t###############################################################"
- echo -e "\t#################### DISCOSNPRAD FINISHED ######################"
- echo -e "\t###############################################################"
+ echo "${yellow} ###############################################################"
+ echo " #################### DISCOSNPRAD FINISHED ######################"
+ echo " ###############################################################"
Ttot="$(($(date +%s)-Ttot))"
echo "DiscoSnpRad total time in seconds: ${Ttot}"
- echo -e "\t################################################################################################################"
- echo -e "\t fasta of predicted variant is \""${kissprefix}_raw.fa"\""
- echo -e "\t Ghost VCF file (1-based) is \""${final_output}"\""
- echo -e "\t Thanks for using discoSnpRad - http://colibread.inria.fr/discoSnp/ - Forum: http://www.biostars.org/t/discoSnp/"
- echo -e "\t################################################################################################################${reset}"
+ echo " ################################################################################################################"
+ echo " fasta of predicted variant is \""${kissprefix}_raw.fa"\""
+ echo " Ghost VCF file (1-based) is \""${final_output}"\""
+ echo " Thanks for using discoSnpRad - http://colibread.inria.fr/discoSnp/ - Forum: http://www.biostars.org/t/discoSnp/"
+ echo " ################################################################################################################${reset}"
fi
=====================================
doc/release_note.txt
=====================================
@@ -1,3 +1,8 @@
+* Jan 2020
+ + Conda Install
+ + discoRad (independent / reorganized)
+ + phasing in test (hidden -A option)
+ + default values modified
* 9/6/2016 2.2.9
+ Fixed a VCF creator bug
+ Optimizing VCF creator
=====================================
scripts/k3000/K3000_enhance_gfa.py
=====================================
@@ -33,34 +33,44 @@ def compatibles(raw_fact,i,compacted_facts):
we already know that the 2 facts share at least a snp
This methods detects if there exists or not a snp, with distinct alleles in the two facts (eg 1000h in one fact, 1000l in the other).
Returns false in this case, True else.
+ When True is returned, a second boolean value is returned.
+ It is set to True of the raw fact and the compacted_facts[i] are in the same orientation, else it is set to false
"""
# print (compacted_facts[i])
# print(raw_fact)
compacted_fact_i_dict = {} # snps id -> h or l
+ query_snp_sign=''
for snp in compacted_facts[i]:
+ if snp[0]=='-': query_snp_sign='-'
+ else: query_snp_sign='+'
snp=snp.lstrip('-') # remove the sign
compacted_fact_i_dict[snp[:-1]]=snp[-1]
nb_shared=0
+
+ target_snp_sign=''
for snp in raw_fact:
+ if snp[0]=='-': target_snp_sign='-'
+ else: target_snp_sign='+'
snp=snp.lstrip('-') # remove the sign
if snp[:-1] in compacted_fact_i_dict:
if compacted_fact_i_dict[snp[:-1]]!=snp[-1]: # distinct alleles
# print (" INCOMPATIBLE ", snp, compacted_facts[i])
- return False
+ return False,query_snp_sign==target_snp_sign
else :
nb_shared+=1
# else: return False # uncomment if one needs the raw fact to be fully inlcuded in the compacted_facts[i]
if nb_shared>1: # at least two shared alleles to link two facts. # note that if test >1 one may avoid lot of computations upstream.
- return True
+ return True,query_snp_sign==target_snp_sign
if len(raw_fact)==nb_shared or len(compacted_facts[i])==nb_shared: # the raw fact or the compacted fact is fully mapped
- return True
- return False
+ return True,query_snp_sign==target_snp_sign
+ return False,query_snp_sign==target_snp_sign
def get_compatible_facts(text_raw_fact, compacted_facts, snp_to_fact_id):
"""
given the text of a raw fact, returns all compacted_facts ids in which this raw fact occurs
+ add the sign of each of the compacted_facts id in which this raw fact occurs. Negative if it is not the same orientation as the text_raw_fact, else positive
Example:
* text_raw_fact: -10000l_0;92837_h;
* compacted_facts: ... '29731': {'31938l', '499h'} ... (factid -> list of non oriented alleles)
@@ -80,9 +90,12 @@ def get_compatible_facts(text_raw_fact, compacted_facts, snp_to_fact_id):
raw_fact_snps.add(oriented_allele.split("_")[0])
# print ("raw_fact_snps", raw_fact_snps)
for j in subcompacedfacts:
- if compatibles(raw_fact_snps, j, compacted_facts):
- result.add(int(j))
-
+ cmpt,same_orientation = compatibles(raw_fact_snps, j, compacted_facts)
+ if cmpt:
+ if same_orientation: result.add(int(j))
+ else: result.add(-int(j))
+
+ # sys.stderr.write(f"{result}\n")
return result
@@ -137,7 +150,6 @@ def detects_allele_coverage(compacted_facts, raw_disco_file_name, read_set_id):
"""
Given the compacted facts indexed and the raw disco output: for each compacted fact, find all allele that belong to it and compute its estimated coverage (average, min, max)
Returns a dictionary: compacted_fact_id -> allele_weight
- TODO. In fact we use only the "min" value. Thus, no need to compute and to store the mean and max values.
"""
alleles_coverage = index_allele_coverage(raw_disco_file_name, read_set_id)
compacted_fact_allele_weight = {} # For each compacted fact id, stores its weight
@@ -173,6 +185,7 @@ def detects_facts_coverage(compacted_facts, snp_to_fact_id, raw_facts_file_name)
# detects all compacted_facts in which the rawfact occurs:
matching_compacted_fact_ids = get_compatible_facts(rawfact, compacted_facts, snp_to_fact_id)
for matching_compacted_fact_id in matching_compacted_fact_ids:
+ matching_compacted_fact_id = abs(matching_compacted_fact_id) # remove the useless sign here
if matching_compacted_fact_id not in compacted_fact_weight:
compacted_fact_weight[matching_compacted_fact_id]=0
compacted_fact_weight[matching_compacted_fact_id]+=coverage
@@ -193,7 +206,7 @@ def print_facts(phasing_file,compacted_fact_weight, compacted_fact_allele_weight
alleles_weight=0
if str(compacted_fact_id) in compacted_fact_allele_weight:
alleles_weight = compacted_fact_allele_weight[str(compacted_fact_id)]
- print(line+"\tFC:i:"+str(fact_weight)+"\tRC:i:"+str(alleles_weight[0]))
+ print(line+"\tFC:i:"+str(fact_weight)+"\tRC:i:"+str(alleles_weight[1])+"\tmin:i:"+str(alleles_weight[0])+"\tmax:i:"+str(alleles_weight[1])+"\tmean:i:"+str(alleles_weight[2])) ## RC is max ([1]) as we take the max (17/02/2020)
cpt+=1
sys.stderr.write(str(cpt)+" facts written\n")
mfile.close()
@@ -235,10 +248,22 @@ def detects_pairs_of_linked_compacted_paths(compacted_facts, snp_to_fact_id, raw
def print_pair_edges_gfa_style(pair_edges, occurrence_min=1):
cpt=0
for left_fact_id in pair_edges:
+ left_sign = ''
+ if left_fact_id<0: left_sign = '-'
+ else: left_sign = '+'
+ abs_left_fact_id = abs(left_fact_id)
for right_fact_id in pair_edges[left_fact_id]:
+ # the right sign is reversed as pairend reads map --> <--. Hence if right is forward on a fact we have to reverse this fact
+ right_sign = ''
+ if right_fact_id<0: right_sign = '+'
+ else: right_sign = '-'
+ abs_right_fact_id = abs(right_fact_id)
+
+
+
if left_fact_id < right_fact_id and pair_edges[left_fact_id][right_fact_id]>=occurrence_min:
cpt+=1
- print ("L\t"+str(left_fact_id)+"\t+\t"+str(right_fact_id)+"\t+\t0M\tFC:i:"+str(pair_edges[left_fact_id][right_fact_id]))
+ print ("L\t"+str(abs_left_fact_id)+"\t"+left_sign+"\t"+str(abs_right_fact_id)+"\t"+right_sign+"\t0M\tFC:i:"+str(pair_edges[left_fact_id][right_fact_id]))
sys.stderr.write(str(cpt)+" paired fact written\n")
def print_facts_overlaps(phasing_file):
=====================================
scripts/k3000/K3000_find_unitig_connected_pairs_of_facts.py
=====================================
@@ -213,7 +213,8 @@ def print_original_gfa(gfa_file_name):
print ("H\t# - first: the relative position of first nucleotide of the bubble with respect to the position of the last nucleotide of the bubble of the previous allele. This value is equal to zero for the first allele")
print ("H\t# - second: the length of the bubble of the allele")
print ("H\t# * field FC is the coverage of the fact, as provided by the total number of reads that phased at least two alleles of the fact")
- print ("H\t# * field RC is the coverage of the fact, as provided by the min of the read coverage of all alleles")
+ print ("H\t# * field RC is the coverage of the fact, as provided by the max of the read coverage from all independent allele computed coverages")
+ print ("H\t# * fields min, max, and mean stand resp. for the min, max and mean of the read coverage of all alleles")
print ("H\t# Four types of edges:")
print ("H\t# 1. Overlap between facts. These links have an overlap length >0. Eg, \"L 1 - 29384 + 3M\", with:")
print ("H\t# \"S 1 10011l;23229h;-21935l;-8929l;-24397l;10011h; RC:i:24\", and")
=====================================
scripts/k3000/K3000_gfa_to_dat.py
=====================================
@@ -4,120 +4,92 @@ import K3000_gfa_post_treatment as gpt # :)
"""
-Dat file:
-AcorusCalamus_multigraph.dat
-
-data;
-param Title := AcorusCalamus;
-param start := "3_0_F";
+#id of forward and reverse nodes
set V :=
-3_0_F
-1_0_R
-0_0_R
-0_0_F
-1_0_F
-1_1_F
-3_0_R
-1_1_R
+p0
+p1
+p2
+m0
+m1
+m2
+;
+param number_loci
+p0 3
+p1 2
+p2 2
+m0 3
+m1 2
+m2 2
;
+#id of nodes with their coverage. Here (3 sept 2019) coverage refers to the read coverage of the less covered allele of all alleles of the fact
param w :=
-3_0_F 83691
-1_0_R 26025
-0_0_R 18476
-0_0_F 18476
-1_0_F 26025
-1_1_F 26025
-3_0_R 83691
-1_1_R 26025
+p0 5626
+m0 5626
+p1 10152
+m1 10152
+p2 1142
+m2 1142
;
+#for each forward node, indicates the id of the reverse version
set reverse :=
-3_0_F 3_0_R
-1_0_R 1_0_F
-0_0_R 0_0_F
-0_0_F 0_0_R
-1_0_F 1_0_R
-1_1_F 1_1_R
-3_0_R 3_0_F
-1_1_R 1_1_F
+p0 m0
+p1 m1
+p2 m2
;
+#set of edges. Four types of edges, 1/ "overlaps" edges, that show an overlap between facts and 2/ "links" edges, that represent facts linked by paired reads (distanace unknown) and 3/ "successive" edges that represent two successive facts (without phasing) and 4/ "incompatible" edges, no path should contain two nodes linked by such an edge
set Edges :=
-3_0_F 1_1_F overlaps
-3_0_F 1_1_F links
-3_0_F 1_0_F overlaps
-3_0_F 1_0_F links
-1_0_R 3_0_F overlaps
-1_0_R 3_0_F links
-1_0_R 3_0_R overlaps
-1_0_R 3_0_R links
-0_0_R 1_0_R overlaps
-0_0_R 1_0_R links
-0_0_R 1_1_R overlaps
-0_0_R 1_1_R links
-0_0_F 1_0_R overlaps
-0_0_F 1_0_R links
-0_0_F 1_1_R overlaps
-0_0_F 1_1_R links
-1_0_F 0_0_R overlaps
-1_0_F 0_0_R links
-1_0_F 0_0_F overlaps
-1_0_F 0_0_F links
-1_1_F 0_0_R overlaps
-1_1_F 0_0_R links
-1_1_F 0_0_F overlaps
-1_1_F 0_0_F links
-3_0_R 1_1_F overlaps
-3_0_R 1_1_F links
-3_0_R 1_0_F overlaps
-3_0_R 1_0_F links
-1_1_R 3_0_F overlaps
-1_1_R 3_0_F links
-1_1_R 3_0_R overlaps
-1_1_R 3_0_R links
+p0 m55 overlaps #TODO (plus tard): il faut aussi doubler les arêtes et donc générer p55 m0 overlaps (pas d'ordre à respecter au sein de chaque champ)
+p0 m53 overlaps
+p1 m55 overlaps
+p303 p305 links
+p303 p310 links
+p303 p304 links
+p0 p54 incompatibles
+p0 p1 incompatibles
+p0 p2 incompatibles
+p243 p341 successive
+p244 p341 successive
+p320 m388 successive
;
+#overlap length of each edge. For an "overlaps" edge, it indicates the number of common variants. For any other edge type (links, successive, or incompatibles), this is set to zero
param l :=
-3_0_F 1_1_F overlaps -99
-3_0_F 1_1_F links 0
-3_0_F 1_0_F overlaps -99
-3_0_F 1_0_F links 0
-1_0_R 3_0_F overlaps -99
-1_0_R 3_0_F links 0
-1_0_R 3_0_R overlaps -99
-1_0_R 3_0_R links 0
-0_0_R 1_0_R overlaps -99
-0_0_R 1_0_R links 0
-0_0_R 1_1_R overlaps -99
-0_0_R 1_1_R links 0
-0_0_F 1_0_R overlaps -99
-0_0_F 1_0_R links 0
-0_0_F 1_1_R overlaps -99
-0_0_F 1_1_R links 0
-1_0_F 0_0_R overlaps -99
-1_0_F 0_0_R links 0
-1_0_F 0_0_F overlaps -99
-1_0_F 0_0_F links 0
-1_1_F 0_0_R overlaps -99
-1_1_F 0_0_R links 0
-1_1_F 0_0_F overlaps -99
-1_1_F 0_0_F links 0
-3_0_R 1_1_F overlaps -99
-3_0_R 1_1_F links 0
-3_0_R 1_0_F overlaps -99
-3_0_R 1_0_F links 0
-1_1_R 3_0_F overlaps -99
-1_1_R 3_0_F links 0
-1_1_R 3_0_R overlaps -99
-1_1_R 3_0_R links 0
+p0 m55 overlaps 3
+p0 m53 overlaps 3
+p1 m55 overlaps 3
+p303 p305 links 0
+p303 p310 links 0
+p303 p304 links 0
+p0 p54 incompatibles 0
+p0 p1 incompatibles 0
+p0 p2 incompatibles 0
+p243 p341 successive 0
+p244 p341 successive 0
+p320 m388 successive 0
;
+#Coverage of the pairend links. Note that overlap links do not have any coverage (just computed from fact overlaps). Also incompatible and successive links do not have coverage, by definition.
+param pairend_end_links_coverage :=
+p0 m55 overlaps 0
+p0 m53 overlaps 0
+p1 m55 overlaps 0
+p303 p305 links 12
+p303 p310 links 3
+p303 p304 links 7
+p0 p54 incompatibles 0
+p0 p1 incompatibles 0
+p0 p2 incompatibles 0
+p243 p341 successive 0
+p244 p341 successive 0
+p320 m388 successive 0
+
"""
def print_header():
- print("data;")
- #print("param Title := ???;")
- #print("param start := \"???\";")
+ print("#haplotype number")
+ print("param mult := XXX")
def print_nodes(gfa_file_name, DG=None):
- print("#id of forward nodes")
+ print("#id of plus (p) and minus (m) nodes")
print("set V :=")
gfa_file = open(gfa_file_name)
for line in gfa_file.readlines():
@@ -128,12 +100,28 @@ def print_nodes(gfa_file_name, DG=None):
node_id = line.split()[1]
if DG and node_id not in DG.nodes(): continue # this node was removed during gfa post treatment
print("p"+node_id) # 'p' stands for "plus strand"
+ print("m"+node_id) # 'p' stands for "minus strand"
print(";")
gfa_file.close()
+def print_nodes_number_loci(gfa_file_name, DG=None):
+ print("#id of nodes with their number of variants")
+ print("param length :=")
+ gfa_file = open(gfa_file_name)
+ for line in gfa_file.readlines():
+ line=line.strip()
+ if line[0]=="S":
+ node_id=line.split()[1]
+ if DG and node_id not in DG.nodes(): continue # this node was removed during gfa post treatment
+ "S 0 28175h;10031h;12786h;-41223l;-26670h; SP:0_426;383_541;427_586;542_661;587_731; BP:0_93;-17_61;54_61;-16_61;14_84; FC:i:64 RC:i:21"
+ print("p"+node_id+"\t"+str(len(line.split()[2].strip(";").split(";"))))
+ print("m"+node_id+"\t"+str(len(line.split()[2].strip(";").split(";"))))
+ print(";")
+ gfa_file.close()
+
def print_nodes_weight(gfa_file_name, DG=None):
- print("#id of forward nodes with their coverage. Here (3 sept 2019) coverage refers to the read coverage of the less covered allele of all alleles of the fact")
- print("param w :=")
+ print("#id of nodes with their coverage. Here coverage refers to the read coverage of the less covered allele of all alleles of the fact")
+ print("param ab :=")
gfa_file = open(gfa_file_name)
for line in gfa_file.readlines():
line=line.strip()
@@ -145,6 +133,21 @@ def print_nodes_weight(gfa_file_name, DG=None):
print(";")
gfa_file.close()
+
+def print_nodes_weight_phased_alleles(gfa_file_name, DG=None):
+ print("#id of nodes with their coverage. Here coverage refers to the total number of reads that phased at least two alleles of the fact ")
+ print("param ab_comapped :=")
+ gfa_file = open(gfa_file_name)
+ for line in gfa_file.readlines():
+ line=line.strip()
+ if line[0]=="S":
+ node_id=line.split()[1]
+ if DG and node_id not in DG.nodes(): continue # this node was removed during gfa post treatment
+ "S 0 28175h;10031h;12786h;-41223l;-26670h; SP:0_426;383_541;427_586;542_661;587_731; BP:0_93;-17_61;54_61;-16_61;14_84; FC:i:64 RC:i:21"
+ print("p"+node_id+"\t"+line.split()[5].split(":")[-1])
+ print(";")
+ gfa_file.close()
+
def print_reverse(gfa_file_name,DG=None):
print("#for each forward node, indicates the id of the reverse version")
@@ -163,7 +166,7 @@ def print_reverse(gfa_file_name,DG=None):
def print_nodes_connected_components(gfa_file_name, DG):
print("#id of forward nodes with their connected component id. ")
- print("param c :=")
+ print("param comp :=")
gfa_file = open(gfa_file_name)
for line in gfa_file.readlines():
line=line.strip()
@@ -178,10 +181,28 @@ def print_nodes_connected_components(gfa_file_name, DG):
gfa_file.close()
+def printable_successive(printed_successive, source_id, target_id):
+ """
+ 1/ checks of source_id <-> target_id not already in printed_successive. If already inside, do nothing and returns false
+ 2/ [else] add target_id in source_id key and return true
+ """
+ # source is always the smallest, avoid to test both directions.
+ if target_id < source_id:
+ tmp = source_id
+ source_id = target_id
+ target_id = tmp
+ if source_id in printed_successive and target_id in printed_successive[source_id]: return False
+
+ if source_id not in printed_successive: printed_successive[source_id] = []
+ printed_successive[source_id].append(target_id)
+ return True
+
+
def print_edges(gfa_file_name, DG=None):
print("#set of edges. Four types of edges, 1/ \"overlaps\" edges, that show an overlap between facts and 2/ \"links\" edges, that represent facts linked by paired reads (distanace unknown) and 3/ \"successive\" edges that represent two successive facts (without phasing) and 4/ \"incompatible\" edges, no path should contain two nodes linked by such an edge ")
print("set Edges :=")
gfa_file = open(gfa_file_name)
+ printed_successive = {} # key = id, target = [ids]. Used to retain which successive links ad been writen, avoiding redundancies
for line in gfa_file.readlines():
line=line.strip()
if line[0]=="L":
@@ -201,16 +222,107 @@ def print_edges(gfa_file_name, DG=None):
if overlap_len==0: type="links"
if overlap_len==-1: type="successive"
if overlap_len==-2: type="incompatibles"
- print(sign_source+source_id+"\t"+sign_target+target_id+"\t"+type)
+ if type == "overlaps" or type=="successive" or type=="links":
+ print(sign_source+source_id+"\t"+sign_target+target_id+"\t"+type)
+ # print the reverse of these nodes
+ if sign_source == "m": sign_source="p"
+ else: sign_source = "m"
+ if sign_target == "m": sign_target="p"
+ else: sign_target = "m"
+ print(sign_target+target_id+"\t"+sign_source+source_id+"\t"+type)
+
+
+
+
+ else: # All those nodes are non oriented - need all possible combinations
+ # if type=="successive" and not printable_successive(printed_successive, source_id, target_id): continue
+ # if type=="links":
+ # sign_source = "p"
+ # sign_target = "p"
+ # print(sign_target+target_id+"\t"+sign_source+source_id+"\t"+type)
+ # continue
+ for sign_source in "m","p":
+ for sign_target in "m", "p":
+ print(sign_target+target_id+"\t"+sign_source+source_id+"\t"+type)
+ # if type=="successive": # in case of successive edges, one needs to indicate all possibilities p/m and forward and reverse
+ # print(sign_source+source_id+"\t"+sign_target+target_id+"\t"+type)
+
+
+
+
# do not print other edges (unitig linked and snp linked)
print(";")
gfa_file.close()
+def print_edge_coverages(gfa_file_name, DG=None):
+ print ("#Coverage of the pairend links. Note that overlap links do not have any coverage (just computed from fact overlaps). Also incompatible and successive links do not have coverage, by definition. ")
+ print ("param pairend_end_links_coverage :=")
+ gfa_file = open(gfa_file_name)
+ printed_successive = {} # key = id, target = [ids]. Used to retain which successive links ad been writen, avoiding redundancies
+ for line in gfa_file.readlines():
+ line=line.strip()
+ if line[0]=="L":
+ "L 1 - 29384 + 8M"
+ "to"
+ "m1 p29384 overlaps 8"
+
+ source_id = line.split()[1]
+ target_id = line.split()[3]
+ if DG:
+ if source_id not in DG.nodes() or target_id not in DG.nodes(): continue # one of these nodes was removed during gfa post treatment
+ overlap_len = int(line.split()[5].rstrip("M"))
+ sign_source="p"
+ if line.split()[2]=='-': sign_source="m"
+ sign_target="p"
+ if line.split()[4]=='-': sign_target="m"
+ type="overlaps"
+ coverage=0
+ if overlap_len==0:
+ type="links"
+ coverage = int(line.split()[6].split(":")[-1])
+ if overlap_len==-1: type="successive"
+ if overlap_len==-2: type="incompatibles"
+ # print(sign_source+source_id+"\t"+sign_target+target_id+"\t"+type+"\t"+str(coverage))
+# # do not print other edges (unitig linked and snp linked)
+# # print the reverse of these nodes
+# if sign_source == "m": sign_source="p"
+# else: sign_source = "m"
+# if sign_target == "m": sign_target="p"
+# else: sign_target = "m"
+# print(sign_target+target_id+"\t"+sign_source+source_id+"\t"+type+"\t"+str(coverage))
+#
+ if type == "overlaps" or type=="successive" or type=="links":
+ print(sign_source+source_id+"\t"+sign_target+target_id+"\t"+type+"\t"+str(coverage))
+ # print the reverse of these nodes
+ if sign_source == "m": sign_source="p"
+ else: sign_source = "m"
+ if sign_target == "m": sign_target="p"
+ else: sign_target = "m"
+ print(sign_target+target_id+"\t"+sign_source+source_id+"\t"+type+"\t"+str(coverage))
+
+ else: # All those nodes are non oriented - need all possible combinations
+ # if type=="successive" and not printable_successive(printed_successive, source_id, target_id): continue
+ # if type=="links":
+ # sign_source = "p"
+ # sign_target = "p"
+ # print(sign_target+target_id+"\t"+sign_source+source_id+"\t"+type+"\t"+str(coverage))
+ # continue
+ for sign_source in "m","p":
+ for sign_target in "m", "p":
+ print(sign_target+target_id+"\t"+sign_source+source_id+"\t"+type+"\t"+str(coverage))
+ # if type=="successive": # in case of successive edges, one needs to indicate all possibilities p/m and forward and reverse
+ # print(sign_source+source_id+"\t"+sign_target+target_id+"\t"+type+"\t"+str(coverage))
+
+
+
+ print(";")
+ gfa_file.close()
def print_edges_content(gfa_file_name, DG=None):
print("#overlap length of each edge. For an \"overlaps\" edge, it indicates the number of common variants. For any other edge type (links, successive, or incompatibles), this is set to zero")
print("param l :=")
gfa_file = open(gfa_file_name)
+ printed_successive = {} # key = id, target = [ids]. Used to retain which successive links ad been writen, avoiding redundancies
for line in gfa_file.readlines():
line=line.strip()
if line[0]=="L":
@@ -231,8 +343,38 @@ def print_edges_content(gfa_file_name, DG=None):
if overlap_len==0: type="links"
if overlap_len==-1: type="successive"
if overlap_len==-2: type="incompatibles"
- print(sign_source+source_id+"\t"+sign_target+target_id+"\t"+type+"\t"+str(max(0,overlap_len)))
- # do not print other edges (unitig linked and snp linked)
+ # print(sign_source+source_id+"\t"+sign_target+target_id+"\t"+type+"\t"+str(max(0,overlap_len)))
+ # # do not print other edges (unitig linked and snp linked)
+ #
+ # # print the reverse of these nodes
+ # if sign_source == "m": sign_source="p"
+ # else: sign_source = "m"
+ # if sign_target == "m": sign_target="p"
+ # else: sign_target = "m"
+ # print(sign_target+target_id+"\t"+sign_source+source_id+"\t"+type+"\t"+str(max(0,overlap_len)))
+ #
+ if type == "overlaps" or type=="successive" or type=="links":
+ print(sign_source+source_id+"\t"+sign_target+target_id+"\t"+type+"\t"+str(max(0,overlap_len)))
+ # print the reverse of these nodes
+ if sign_source == "m": sign_source="p"
+ else: sign_source = "m"
+ if sign_target == "m": sign_target="p"
+ else: sign_target = "m"
+ print(sign_target+target_id+"\t"+sign_source+source_id+"\t"+type+"\t"+str(max(0,overlap_len)))
+
+ else: # All those nodes are non oriented - need all possible combinations
+ # if type=="successive" and not printable_successive(printed_successive, source_id, target_id): continue
+ # if type=="links":
+ # sign_source = "p"
+ # sign_target = "p"
+ # print(sign_target+target_id+"\t"+sign_source+source_id+"\t"+type+"\t"+str(max(0,overlap_len)))
+ # continue
+ for sign_source in "m","p":
+ for sign_target in "m", "p":
+ print(sign_target+target_id+"\t"+sign_source+source_id+"\t"+type+"\t"+str(max(0,overlap_len)))
+ # if type=="successive": # in case of successive edges, one needs to indicate all possibilities p/m and forward and reverse
+ # print(sign_source+source_id+"\t"+sign_target+target_id+"\t"+type+"\t"+str(max(0,overlap_len)))
+
print(";")
gfa_file.close()
@@ -258,11 +400,14 @@ def main(gfa_file_name):
print_header()
print_nodes(gfa_file_name,DG)
+ print_nodes_number_loci(gfa_file_name, DG=None)
print_nodes_weight(gfa_file_name,DG)
+ print_nodes_weight_phased_alleles(gfa_file_name, DG)
print_reverse(gfa_file_name,DG)
if DG:
print_nodes_connected_components(gfa_file_name, DG)
print_edges(gfa_file_name,DG)
+ print_edge_coverages(gfa_file_name, DG)
print_edges_content(gfa_file_name,DG)
=====================================
scripts/k3000/K3000_node_ids_to_node_sequences.py
=====================================
@@ -74,11 +74,12 @@ def modify_gfa_file(gfa_file_name, compacted_facts_fa_file_name, header_to_file_
print ("H\t#################")
print ("H\t# Nodes are (compacted) facts with their read mapping coverage. Eg. \"S 81 ccggcgttggcttcca[...]agttct FC:i:57 RC:i:26 SP:0_178;136_269;179_438; BP:0_61;-18_61;30_61; 20434h;-21604h;10436h;\".")
print ("H\t# * field FC is the coverage of the fact, as provided by the total number of reads that phased at least two alleles of the fact")
- print ("H\t# * field RC is the coverage of the fact, as provided by the min of the read coverage of all alleles")
+ print ("H\t# * field RC is the coverage of the fact, as provided by the max of the read coverage of all alleles")
print ("H\t# * field SP stands for \"Sequence Position\". It indicates for each allele of the fact its starting and ending position in the ACGT sequence")
print ("H\t# * field BP stands for \"Bubble Position\". For each allele of the fact it indicates:")
print ("H\t# - first: the relative position of first nucleotide of the bubble with respect to the position of the last nucleotide of the bubble of the previous allele. This value is equal to zero for the first allele")
print ("H\t# - second: the length of the bubble of the allele")
+ print ("H\t# * fields min, max, and mean stand resp. for the min, max and mean of the read coverage of all alleles")
print ("H\t# Four types of edges:")
print ("H\t# 1. Overlap between facts, Overlap length is >0. Eg, \"L 1 - 29384 + 8M\"")
@@ -107,7 +108,8 @@ def modify_gfa_file(gfa_file_name, compacted_facts_fa_file_name, header_to_file_
gfa_line.strip()
if gfa_line[0]=='H': continue #Header was changed
if gfa_line[0]=='S': #Deal with sequences
- #S 0 24824h;33997h;10000h; SP:0_166;126_261;178_444; BP:0_83;-20_72;23_61; FC:i:15 RC:i:26
+ #S 0 -1h;3h; SP:0_99;50_1899; BP:0_61;-11_61; FC:i:48 RC:i:322 min:i:85 max:i:322 mean:i:203.5
+
gfa_line=gfa_line.split()
assert gfa_line[2] in header_to_file_position, gfa_line[2]+" is not in header_to_file_position"
compacted_facts_fa_file.seek(header_to_file_position[gfa_line[2]])
@@ -115,7 +117,7 @@ def modify_gfa_file(gfa_file_name, compacted_facts_fa_file_name, header_to_file_
sequence_fa=compacted_facts_fa_file.readline().strip()
# assert gfa_line[2] == allele_header,gfa_line[2]+" is not "+allele_header+" "+header_fa[1:]
node_id_to_sequence[gfa_line[1]]=sequence_fa #TODO: optimize this to avoid memory usage. One may store the position of the node in the file and retreive the sequence latter
- print(gfa_line[0]+"\t"+gfa_line[1]+"\t"+sequence_fa+"\t"+gfa_line[5]+"\t"+gfa_line[6]+"\t"+gfa_line[3]+"\t"+gfa_line[4]+"\t"+gfa_line[2])
+ print(gfa_line[0]+"\t"+gfa_line[1]+"\t"+sequence_fa+"\t"+gfa_line[5]+"\t"+gfa_line[6]+"\t"+gfa_line[3]+"\t"+gfa_line[4]+"\t"+gfa_line[2]+"\t"+gfa_line[7]+"\t"+gfa_line[8]+"\t"+gfa_line[9])
continue
if gfa_line[0]=='L':
=====================================
scripts/k3000/extract_DP_from_vcf.py
=====================================
@@ -0,0 +1,62 @@
+import sys
+"""
+From a vcf file (arg1) and a valid dataset_id (arg2) this script extracts the max, min and average DP for the give dataset.
+Returns an error if the dataset_id does not exists in the vcf
+
+Input format (with one unique dataset, else lines would have one or more additional : GT:DP:PL:AD:HQ fields (eg. 0/1:1727:5117,821,20742:1255,472:73,73).
+# Comments
+ecoli2kb 599 5 A C . PASS Ty=SNP;Rk=0;UL=19;UR=19;CL=.;CR=.;Genome=A;Sd=-1 GT:DP:PL:AD:HQ 0/1:1727:5117,821,20742:1255,472:73,73
+ecoli2kb 749 4 A C . PASS Ty=SNP;Rk=0;UL=19;UR=1721;CL=.;CR=.;Genome=G;Sd=1 GT:DP:PL:AD:HQ 0/1:1717:23527,1401,3352:353,1364:73,73
+ecoli2kb 649 3 A C . PASS Ty=SNP;Rk=0;UL=19;UR=19;CL=.;CR=.;Genome=T;Sd=1 GT:DP:PL:AD:HQ 0/0:1770:1466,2566,28486:1562,208:73,73
+ecoli2kb 699 2 A C . PASS Ty=SNP;Rk=0;UL=19;UR=19;CL=.;CR=.;Genome=A;Sd=1 GT:DP:PL:AD:HQ 1/1:1712:30704,3553,410:97,1615:73,73
+ecoli2kb 549 1 A C . PASS Ty=SNP;Rk=0;UL=19;UR=518;CL=.;CR=.;Genome=A;Sd=-1 GT:DP:PL:AD:HQ 0/1:1657:22184,1236,3525:361,1296:73,73
+
+
+Output format (stdout):
+param DP_avg := 1716.6
+param DP_max := 1770;
+param DP_min := 1657;
+"""
+
+def extract_DP(vcf_file_name, dataset_id):
+ if dataset_id < 1:
+ sys.stderr.write("Error: dataset id must be > 0\n")
+ return
+ min=sys.maxsize
+ max=0
+ sum=0
+ nb=0
+ with open(vcf_file_name) as vcf_file:
+ for line in vcf_file:
+ line = line.strip()
+ if line[0] =="#": continue
+ s_line = line.split()
+ field_id = 8+dataset_id
+ try:
+ GT_DP = s_line[field_id]
+ DP = int(GT_DP.split(":")[1])
+ if DP > max: max=DP
+ if DP < min: min=DP
+ sum+=DP
+ nb+=1
+ except IndexError:
+ sys.stderr.write("Error: no field nb "+str(dataset_id)+" in file "+vcf_file_name+"\n")
+ return
+ print(f"param DP_avg := {int(sum/float(nb))};")
+ print(f"param DP_max := {max};")
+ print(f"param DP_min := {min};")
+
+
+
+def main(vcf_file_name, dataset_id):
+ '''
+ Extraction of DP min, max and average from a VCf file
+ Usage:
+ python ~/workspace/gatb-discosnp/scripts/k3000/extract_DP_from_vcf.py /discoRes_k_31_c_2_D_0_P_3_b_2_coherent.vcf >> graph_5haplotypes_filtered.dat #once graph_5haplotypes_filtered.dat was created using K3000_gfa_to_dat.py
+ '''
+ extract_DP(vcf_file_name, dataset_id)
+
+
+
+if __name__ == "__main__":
+ main(sys.argv[1], int(sys.argv[2]))
=====================================
scripts/k3000/run.sh
=====================================
@@ -143,9 +143,9 @@ if [[ "$wraith" == "false" ]]; then
fi
if [ $? -ne 0 ]
then
- echo "${red} ###Problem detected, check logs.${reset}"
+ echo "${red} ### Non critical problem detected, check logs.${reset}"
echo "${green} ### You may remove useless files: rm -f compacted_facts_int_${read_set_id}.txt compacted_facts_${read_set_id}.gfa graph_${read_set_id}.gfa compacted_facts_${read_set_id}.fa "
- echo "${green}${bold} ### EXPLOITATION OF PHASING INFORMATION OBTAINED FROM DISCOSNP ENDED, the final graph is $(tput blink)${underline}graph_final_${read_set_id}.gfa${no_underline}.${reset}. "
+ echo "${green}${bold} ### EXPLOITATION OF PHASING INFORMATION OBTAINED FROM DISCOSNP ENDED, the final graph is $(tput blink)${underline}graph_final_${read_set_id}.gfa${no_underline}${reset} "
exit 0
fi
View it on GitLab: https://salsa.debian.org/med-team/discosnp/-/commit/07ef2a0fc77d4e37df42d8ed04330a5c77fd4576
--
View it on GitLab: https://salsa.debian.org/med-team/discosnp/-/commit/07ef2a0fc77d4e37df42d8ed04330a5c77fd4576
You're receiving this email because of your account on salsa.debian.org.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20201001/8c9c7d96/attachment-0001.html>
More information about the debian-med-commit
mailing list