[med-svn] [Git][med-team/cutesv][upstream] New upstream version 1.0.12

Andreas Tille (@tille) gitlab at salsa.debian.org
Wed Nov 17 06:36:41 GMT 2021



Andreas Tille pushed to branch upstream at Debian Med / cutesv


Commits:
58057786 by Andreas Tille at 2021-11-17T07:31:24+01:00
New upstream version 1.0.12
- - - - -


8 changed files:

- README.md
- setup.py
- src/cuteSV/cuteSV
- src/cuteSV/cuteSV_Description.py
- src/cuteSV/cuteSV_forcecalling.py
- src/cuteSV/cuteSV_genotype.py
- src/cuteSV/cuteSV_resolveDUP.py
- src/cuteSV/cuteSV_resolveINV.py


Changes:

=====================================
README.md
=====================================
@@ -91,7 +91,7 @@ A new wiki page about diploid-assembly-based SV detection using cuteSV has been
 |--merge_ins_threshold|Maximum distance of insertion signals to be merged.|100|
 |--min_support|Minimum number of reads that support a SV to be reported.|10|
 |--min_size|Minimum length of SV to be reported.|30|
-|--max_size|Minimum length of SV to be reported.|100000|
+|--max_size|Maximum size of SV to be reported. Full length SVs are reported when using -1.|100000|
 |--genotype|Enable to generate genotypes.|False|
 |--gt_round|Maximum round of iteration for alignments searching if perform genotyping.|500|
 |-Ivcf|Optional given vcf file. Enable to perform force calling.|NULL|
@@ -116,7 +116,12 @@ Please cite the manuscript of cuteSV before using these callsets.
 ---
 ### Changelog
 
-	cutesv (v1.0.11):
+	cuteSV (v1.0.12):
+	1. Add Allele frequency (AF) info in the outputs.
+	2. Fix an index error when force calling BND variants.
+	3. Modify the parameter of --max_size and enable to report full length of SVs.
+
+	cuteSV (v1.0.11):
 	1. Add a script for post-processing typically cuteSV callsets from assembly-based alignments to generate the diploid-assembly-based SV calls.
 	2. Give a wiki page for helping uses to achieve assembly-based SV calling.
 	3. Improve acquirement of inserted sequence in a read whose primary alignment contains hardclips.
@@ -126,13 +131,13 @@ Please cite the manuscript of cuteSV before using these callsets.
 	7. Update the sort commands used in cuteSV.  
 	8. Update the parameter of --max_split_parts.
 
-	cutesv (v1.0.10):
+	cuteSV (v1.0.10):
 	1. Fix a bug leading to calculate wrong TRA positions.
 	2. Add a file format conversion script that enable to transfer the vcf file to bedpe file. 
 	3. Involve several clustering-and-refinement strategies in force calling function.
 	4. Assessed the performance of force calling with Giab HG002 sample datasets (including CLR, CCS, and ONT platforms).
 
-	cutesv (v1.0.9):
+	cuteSV (v1.0.9):
 	1. Change 0-based pos into 1-based pos in DUP in order to support bcftools conversion.
 	2. Correct REF and ALT fields. Adjust END value of INS to make it equal to the value of POS.
 	3. Improve the description of errors.
@@ -142,7 +147,7 @@ Please cite the manuscript of cuteSV before using these callsets.
 	7. Add force calling function and enable cuteSV to perform population-based SV calling.
 	8. Fix several minor bugs.
 
-	cutesv (v1.0.8):
+	cuteSV (v1.0.8):
 	1. Rewirte the function of ins/del signatures clustering.
 	2. Update the recommandation parameters for different sequencing datasets.
 	3. Replace <DEL>/<INS> with its variant allele sequence, which needs the reference genome sequence as input.
@@ -165,7 +170,6 @@ Please cite the manuscript of cuteSV before using these callsets.
 	3.Fix bugs in inversion and translocation calling.
 	4.Add new option for specificly setting the maximum size of SV to be discovered. The default value is 100,000 bp. 
 
-
 	cuteSV (v1.0.4):
 	1.Add a new option for specificly setting the threshold of SV signals merging in the same read. The default parameter is 500 bp. You can reduce it for high-quality sequencing datasets like PacBio HiFi (CCS).
 	2.Make the genotyping function optional.


=====================================
setup.py
=====================================
@@ -7,7 +7,7 @@ with open('README.md') as f:
 
 setup(
     name = "cuteSV",
-    version = "1.0.11",
+    version = "1.0.12",
     description = "Long-read-based human genomic structural variation detection with cuteSV",
     author = "Jiang Tao",
     author_email = "tjiang at hit.edu.cn",


=====================================
src/cuteSV/cuteSV
=====================================
@@ -4,8 +4,8 @@
  * All rights Reserved, Designed By HIT-Bioinformatics   
  * @Title: cuteSV 
  * @author: tjiang
- * @data: May 16th 2020
- * @version V1.0.11
+ * @date: Oct 4th 2021
+ * @version V1.0.12
 '''
 
 import pysam
@@ -202,7 +202,7 @@ def analysis_split_read(split_read, SV_size, RLength, read_name, candidate, MaxS
 						if ele_2[4] not in candidate["INS"]:
 							candidate["INS"][ele_2[4]] = list()
 
-						if ele_2[2] - ele_1[3] <= 100 and ele_2[0]+ele_1[3]-ele_2[2]-ele_1[1] <= MaxSize:
+						if ele_2[2] - ele_1[3] <= 100 and (ele_2[0]+ele_1[3]-ele_2[2]-ele_1[1] <= MaxSize or MaxSize == -1):
 							candidate["INS"][ele_2[4]].append([(ele_2[2]+ele_1[3])/2, 
 																ele_2[0]+ele_1[3]-ele_2[2]-ele_1[1], 
 																read_name,
@@ -211,7 +211,7 @@ def analysis_split_read(split_read, SV_size, RLength, read_name, candidate, MaxS
 						if ele_2[4] not in candidate["DEL"]:
 							candidate["DEL"][ele_2[4]] = list()
 
-						if ele_2[0] - ele_1[1] <= 100 and ele_2[2]-ele_2[0]+ele_1[1]-ele_1[3] <= MaxSize:
+						if ele_2[0] - ele_1[1] <= 100 and (ele_2[2]-ele_2[0]+ele_1[1]-ele_1[3] <= MaxSize or MaxSize == -1):
 							candidate["DEL"][ele_2[4]].append([ele_1[3], 
 																ele_2[2]-ele_2[0]+ele_1[1]-ele_1[3], 
 																read_name])
@@ -329,7 +329,7 @@ def analysis_split_read(split_read, SV_size, RLength, read_name, candidate, MaxS
 				# print(ele_2)
 				dis_ref = ele_2[2] - ele_1[3]
 				dis_read = ele_2[0] - ele_1[1]
-				if dis_ref < 100 and dis_read - dis_ref >= SV_size and dis_read - dis_ref <= MaxSize:
+				if dis_ref < 100 and dis_read - dis_ref >= SV_size and (dis_read - dis_ref <= MaxSize or MaxSize == -1):
 					# print(min(ele_2[2], ele_1[3]), dis_read - dis_ref, read_name)
 					if ele_1[4] not in candidate['INS']:
 						candidate['INS'][ele_1[4]] = list()


=====================================
src/cuteSV/cuteSV_Description.py
=====================================
@@ -2,12 +2,12 @@
  * All rights Reserved, Designed By HIT-Bioinformatics   
  * @Title:  cuteSV_Description.py
  * @author: tjiang
- * @date: May 16th 2021
- * @version V1.0.11
+ * @date: Oct 4th 2021
+ * @version V1.0.12
 '''
 import argparse
 
-VERSION = '1.0.11'
+VERSION = '1.0.12'
 
 class cuteSVdp(object):
 	'''
@@ -63,7 +63,7 @@ def parseArgs(argv):
 	parser.add_argument("input", 
 		metavar="[BAM]", 
 		type = str, 
-		help ="Sorted .bam file form NGMLR or Minimap2.")
+		help ="Sorted .bam file from NGMLR or Minimap2.")
 	parser.add_argument("reference",  
 		type = str, 
 		help ="The reference genome in fasta format.")
@@ -134,7 +134,7 @@ def parseArgs(argv):
 		default = 30, 
 		type = int)
 	GroupSVCluster.add_argument('-L', '--max_size', 
-		help = "Maximum size of SV to be reported.[%(default)s]", 
+		help = "Maximum size of SV to be reported. All SVs are reported when using -1. [%(default)s]", 
 		default = 100000, 
 		type = int)
 	GroupSVCluster.add_argument('-sl', '--min_siglength', 
@@ -271,6 +271,7 @@ def Generation_VCF_header(file, contiginfo, sample, argv):
 	file.write("##INFO=<ID=RE,Number=1,Type=Integer,Description=\"Number of read support this record\">\n")
 	file.write("##INFO=<ID=STRAND,Number=A,Type=String,Description=\"Strand orientation of the adjacency in BEDPE format (DEL:+-, DUP:-+, INV:++/--)\">\n")
 	file.write("##INFO=<ID=RNAMES,Number=.,Type=String,Description=\"Supporting read names of SVs (comma separated)\">\n")
+	file.write("##INFO=<ID=AF,Number=A,Type=Float,Description=\"Allele Frequency.\">\n")
 	file.write("##FILTER=<ID=q5,Description=\"Quality below 5\">\n")
 	# FORMAT
 	# file.write("\n")


=====================================
src/cuteSV/cuteSV_forcecalling.py
=====================================
@@ -76,30 +76,40 @@ def parse_record(record):
     sv_type = parse_svtype(record.info['SVTYPE'])
     chrom1 = record.chrom
     start = parse_to_int(record.pos)
+    chrom2 = ''
+    end = 0
     if (sv_type == 'INS' or sv_type == 'DEL') and 'SVLEN' in record.info:  ###
         end = abs(parse_to_int(record.info['SVLEN']))
     else:
         try:
             end = parse_to_int(record.stop)
         except:
-            pass   
+            pass
     if sv_type == 'TRA':
-        tra_alt = str(record.alts[0])
-        if tra_alt[0] == 'N':
-            if tra_alt[1] == '[':
-                tra_type = 'A'
+        if 'CHR2' in record.info:
+            chrom2 = record.info['CHR2']
+        if 'END' in record.info:
+            end = parse_to_int(record.info['END'])
+        try:
+            tra_alt = str(record.alts[0])
+            if tra_alt[0] == 'N':
+                if tra_alt[1] == '[':
+                    tra_type = 'A'
+                else:
+                    tra_type = 'B'
+            elif tra_alt[0] == '[':
+                tra_type = 'C'
             else:
-                tra_type = 'B'
-        elif tra_alt[0] == '[':
-            tra_type = 'C'
-        else:
-            tra_type = 'D'
-        if tra_alt[0] == 'N':
-            tra_alt = tra_alt[2:-1]
-        else:
-            tra_alt = tra_alt[1:-2]
-        chrom2 = tra_alt.split(':')[0]
-        end = int(tra_alt.split(':')[1])
+                tra_type = 'D'
+            if tra_alt[0] == 'N':
+                tra_alt = tra_alt[2:-1]
+            else:
+                tra_alt = tra_alt[1:-2]
+            if ':' in tra_alt:
+                chrom2 = tra_alt.split(':')[0]
+                end = int(tra_alt.split(':')[1])
+        except:
+            pass
     strand = '.'
     if sv_type != 'TRA':
         chrom2 = record.chrom
@@ -372,6 +382,8 @@ def call(call_gt_args, idx, row_count, para, strands, seq, var_type):
     if var_type == 'TRA':
         gt_re, DR, genotype, GL, GQ, QUAL = call_gt_tra(*call_gt_args)
         rname_list = call_gt_args[5]
+        if para.alts == '<TRA>' or para.alts == '<BND>':
+            seq = str(call_gt_args[4]) + ':' + str(call_gt_args[2]) # chr2:end
     rname = ','.join(rname_list)
     if rname == '':
         rname = 'NULL'


=====================================
src/cuteSV/cuteSV_genotype.py
=====================================
@@ -101,11 +101,13 @@ def generate_output(args, semi_result, contigINFO, argv, ref_g):
 	svid["INV"] = 0
 
 	file = open(args.output, 'w')
+	action = args.genotype
 	Generation_VCF_header(file, contigINFO, args.sample, argv)
 	file.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t%s\n"%(args.sample))
-
 	for i in semi_result:
 		if i[1] in ["DEL", "INS"]:
+			if abs(int(float(i[3]))) > args.max_size and args.max_size != -1:
+				continue
 			if i[1] == "INS":
 				cal_end = int(i[2]) + 1
 			else:
@@ -119,6 +121,11 @@ def generate_output(args, semi_result, contigINFO, argv, ref_g):
 				CILEN = i[6], 
 				RE = i[4],
 				RNAMES = i[12] if args.report_readid else "NULL")
+			if action:
+				try:
+					info_list += ";AF=" + str(round(int(i[4]) / (int(i[4]) + int(i[7])), 4))
+				except:
+					info_list += ";AF=."
 			if i[1] =="DEL":
 				info_list += ";STRAND=+-"
 			if i[11] == "." or i[11] == None:
@@ -142,6 +149,8 @@ def generate_output(args, semi_result, contigINFO, argv, ref_g):
 				PASS = filter_lable))
 			svid[i[1]] += 1
 		elif i[1] == "DUP":
+			if abs(int(float(i[3]))) > args.max_size and args.max_size != -1:
+				continue
 			cal_end = int(i[2]) + 1 + abs(int(float(i[3])))
 			info_list = "{PRECISION};SVTYPE={SVTYPE};SVLEN={SVLEN};END={END};RE={RE};STRAND=-+;RNAMES={RNAMES}".format(
 				PRECISION = "IMPRECISE" if i[6] == "0/0" else "PRECISE", 
@@ -150,6 +159,11 @@ def generate_output(args, semi_result, contigINFO, argv, ref_g):
 				END = str(cal_end), 
 				RE = i[4],
 				RNAMES = i[10] if args.report_readid else "NULL")
+			if action:
+				try:
+					info_list += ";AF=" + str(round(int(i[4]) / (int(i[4]) + int(i[5])), 4))
+				except:
+					info_list += ";AF=."
 			if i[9] == ".":
 				filter_lable = "PASS"
 			else:
@@ -171,6 +185,8 @@ def generate_output(args, semi_result, contigINFO, argv, ref_g):
 				PASS = filter_lable))
 			svid[i[1]] += 1
 		elif i[1] == "INV":
+			if abs(int(float(i[3]))) > args.max_size and args.max_size != -1:
+				continue
 			cal_end = int(i[2]) + 1 + abs(int(float(i[3])))
 			info_list = "{PRECISION};SVTYPE={SVTYPE};SVLEN={SVLEN};END={END};RE={RE};STRAND={STRAND};RNAMES={RNAMES}".format(
 				PRECISION = "IMPRECISE" if i[6] == "0/0" else "PRECISE", 
@@ -180,6 +196,11 @@ def generate_output(args, semi_result, contigINFO, argv, ref_g):
 				RE = i[4],
 				STRAND = i[7],
 				RNAMES = i[11] if args.report_readid else "NULL")
+			if action:
+				try:
+					info_list += ";AF=" + str(round(int(i[4]) / (int(i[4]) + int(i[5])), 4))
+				except:
+					info_list += ";AF=."
 			if i[10] == ".":
 				filter_lable = "PASS"
 			else:
@@ -210,6 +231,11 @@ def generate_output(args, semi_result, contigINFO, argv, ref_g):
 				# END = str(int(i[4]) + 1), 
 				RE = i[5],
 				RNAMES = i[11] if args.report_readid else "NULL")
+			if action:
+				try:
+					info_list += ";AF=" + str(round(int(i[5]) / (int(i[5]) + int(i[6])), 4))
+				except:
+					info_list += ";AF=."
 			if i[10] == ".":
 				filter_lable = "PASS"
 			else:
@@ -244,7 +270,7 @@ def generate_pvcf(args, result, contigINFO, argv, ref_g):
 		else:
 			filter_lable = "PASS" if float(i[13]) >= 5.0 else "q5"
 		if i[3] == 'INS':
-			if abs(i[4]) > args.max_size:
+			if abs(i[4]) > args.max_size and args.max_size != -1:
 				continue
 			elif i[12] == '<INS>':
 				ref = str(ref_g[i[0]].seq[max(i[1]-2, 0)])
@@ -261,6 +287,10 @@ def generate_pvcf(args, result, contigINFO, argv, ref_g):
 				CILEN = i[7], 
 				RE = i[8][0],
 				RNAMES = i[9] if args.report_readid else "NULL")
+			try:
+				info_list += ";AF=" + str(round(i[8][0] / (i[8][0] + i[8][1]), 4))
+			except:
+				info_list += ";AF=."
 			file.write("{CHR}\t{POS}\t{ID}\t{REF}\t{ALT}\t{QUAL}\t{PASS}\t{INFO}\t{FORMAT}\t{GT}:{DR}:{RE}:{PL}:{GQ}\n".format(
 				CHR = i[0], 
 				POS = str(i[1]), 
@@ -278,7 +308,7 @@ def generate_pvcf(args, result, contigINFO, argv, ref_g):
 				GQ = i[8][3]
 				))
 		elif i[3] == 'DEL':
-			if abs(i[4]) > args.max_size:
+			if abs(i[4]) > args.max_size and args.max_size != -1:
 				continue
 			elif i[12] == '<DEL>':
 				ref = str(ref_g[i[0]].seq[max(i[1]-2, 0):int(i[1])-1-int(i[4])])
@@ -295,6 +325,10 @@ def generate_pvcf(args, result, contigINFO, argv, ref_g):
 				CILEN = i[7], 
 				RE = i[8][0],
 				RNAMES = i[9] if args.report_readid else "NULL")
+			try:
+				info_list += ";AF=" + str(round(i[8][0] / (i[8][0] + i[8][1]), 4))
+			except:
+				info_list += ";AF=."
 			file.write("{CHR}\t{POS}\t{ID}\t{REF}\t{ALT}\t{QUAL}\t{PASS}\t{INFO}\t{FORMAT}\t{GT}:{DR}:{RE}:{PL}:{GQ}\n".format(
 				CHR = i[0], 
 				POS = str(i[1]), 
@@ -312,6 +346,8 @@ def generate_pvcf(args, result, contigINFO, argv, ref_g):
 				GQ = i[8][3]
 				))
 		elif i[3] == 'DUP':
+			if abs(i[4]) > args.max_size and args.max_size != -1:
+				continue
 			info_list = "{PRECISION};SVTYPE={SVTYPE};SVLEN={SVLEN};END={END};RE={RE};RNAMES={RNAMES};STRAND=-+".format(
 				PRECISION = "IMPRECISE" if i[2] == "0/0" else "PRECISE", 
 				SVTYPE = i[3], 
@@ -319,6 +355,10 @@ def generate_pvcf(args, result, contigINFO, argv, ref_g):
 				END = i[5], 
 				RE = i[8][0],
 				RNAMES = i[9] if args.report_readid else "NULL")
+			try:
+				info_list += ";AF=" + str(round(i[8][0] / (i[8][0] + i[8][1]), 4))
+			except:
+				info_list += ";AF=."
 			file.write("{CHR}\t{POS}\t{ID}\t{REF}\t{ALT}\t{QUAL}\t{PASS}\t{INFO}\t{FORMAT}\t{GT}:{DR}:{RE}:{PL}:{GQ}\n".format(
 				CHR = i[0], 
 				POS = str(i[1]), 
@@ -336,6 +376,8 @@ def generate_pvcf(args, result, contigINFO, argv, ref_g):
 				GQ = i[8][3]
 				))
 		elif i[3] == 'INV':
+			if abs(i[4]) > args.max_size and args.max_size != -1:
+				continue
 			info_list = "{PRECISION};SVTYPE={SVTYPE};SVLEN={SVLEN};END={END};RE={RE};RNAMES={RNAMES};STRAND={STRAND}".format(
 				PRECISION = "IMPRECISE" if i[2] == "0/0" else "PRECISE", 
 				SVTYPE = i[3], 
@@ -344,6 +386,10 @@ def generate_pvcf(args, result, contigINFO, argv, ref_g):
 				RE = i[8][0],
 				RNAMES = i[9] if args.report_readid else "NULL",
 				STRAND = i[14])
+			try:
+				info_list += ";AF=" + str(round(i[8][0] / (i[8][0] + i[8][1]), 4))
+			except:
+				info_list += ";AF=."
 			file.write("{CHR}\t{POS}\t{ID}\t{REF}\t{ALT}\t{QUAL}\t{PASS}\t{INFO}\t{FORMAT}\t{GT}:{DR}:{RE}:{PL}:{GQ}\n".format(
 				CHR = i[0], 
 				POS = str(i[1]), 
@@ -367,6 +413,14 @@ def generate_pvcf(args, result, contigINFO, argv, ref_g):
 					SVTYPE = i[3], 
 					RE = i[8][0],
 					RNAMES = i[9] if args.report_readid else "NULL")
+			try:
+				info_list += ";AF=" + str(round(i[8][0] / (i[8][0] + i[8][1]), 4))
+			except:
+				info_list += ";AF=."
+			if ':' in i[15]:
+				info_list += ";CHR2={CHR2};END={END}".format(
+					CHR2 = i[15].split(':')[0],
+					END = i[15].split(':')[1])
 			file.write("{CHR}\t{POS}\t{ID}\t{REF}\t{ALT}\t{QUAL}\t{PASS}\t{INFO}\t{FORMAT}\t{GT}:{DR}:{RE}:{PL}:{GQ}\n".format(
 				CHR = i[0], 
 				POS = str(i[1]), 


=====================================
src/cuteSV/cuteSV_resolveDUP.py
=====================================
@@ -87,7 +87,7 @@ def generate_dup_cluster(semi_dup_cluster, chr, read_count, max_cluster_bias,
 		breakpoint_2 = int(sum(breakpoint_2)/len(semi_dup_cluster[low_b:up_b]))
 
 
-	if sv_size <= breakpoint_2 - breakpoint_1 <= MaxSize:
+	if sv_size <= breakpoint_2 - breakpoint_1 <= MaxSize or (sv_size <= breakpoint_2 - breakpoint_1 and MaxSize == -1):
 		if action:
 			import time
 			# time_start = time.time()


=====================================
src/cuteSV/cuteSV_resolveINV.py
=====================================
@@ -121,7 +121,7 @@ def generate_semi_inv_cluster(semi_inv_cluster, chr, svtype, read_count, sv_size
 				inv_len = breakpoint_2 - breakpoint_1
 				if inv_len >= sv_size and max_count_id >= read_count:
 					# candidate_single_SV.append('%s\t%s\t%d\t%d\t%d\n'%(chr, svtype, breakpoint_1, breakpoint_2, max_count_id))
-					if inv_len <= MaxSize:
+					if inv_len <= MaxSize or MaxSize == -1:
 						if action:
 							import time
 							# time_start = time.time()
@@ -172,7 +172,7 @@ def generate_semi_inv_cluster(semi_inv_cluster, chr, svtype, read_count, sv_size
 		inv_len = breakpoint_2 - breakpoint_1
 		if inv_len >= sv_size and max_count_id >= read_count:
 			# candidate_single_SV.append('%s\t%s\t%d\t%d\t%d\n'%(chr, svtype, breakpoint_1, breakpoint_2, max_count_id))
-			if inv_len <= MaxSize:
+			if inv_len <= MaxSize or MaxSize == -1:
 				if action:
 					import time
 					# time_start = time.time()



View it on GitLab: https://salsa.debian.org/med-team/cutesv/-/commit/58057786f4b4dd1c2102b3a611dbb41bdec71b51

-- 
View it on GitLab: https://salsa.debian.org/med-team/cutesv/-/commit/58057786f4b4dd1c2102b3a611dbb41bdec71b51
You're receiving this email because of your account on salsa.debian.org.


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20211117/7541d62f/attachment-0001.htm>


More information about the debian-med-commit mailing list