[med-svn] [Git][med-team/python-loompy][upstream] New upstream version 3.0.7+dfsg

Andreas Tille (@tille) gitlab at salsa.debian.org
Sun Jul 17 15:52:37 BST 2022



Andreas Tille pushed to branch upstream at Debian Med / python-loompy


Commits:
534c2713 by Andreas Tille at 2022-07-17T16:10:10+02:00
New upstream version 3.0.7+dfsg
- - - - -


17 changed files:

- doc/apiwalkthrough/index.rst
- doc/format/index.rst
- + kallisto/README.md
- + kallisto/manifest.json
- + kallisto/mouse_build.py
- + kallisto/mouse_download.sh
- + kallisto/mouse_generate_fragments.py
- loompy/__init__.py
- loompy/_version.py
- loompy/attribute_manager.py
- loompy/bus_file.py
- loompy/layer_manager.py
- loompy/loompy.py
- + loompy/metadata_loaders.py
- loompy/normalize.py
- notebooks/build_index.ipynb
- tests/debug.py


Changes:

=====================================
doc/apiwalkthrough/index.rst
=====================================
@@ -381,7 +381,7 @@ Layers support the same pythonic API as attributes:
   a = ds.layers["spliced"][:, 10] # Assign the 10th column of layer "spliced" to the variable a
   del ds.layers["spliced"]     # Delete the "spliced" layer
 
-The main matrix is availabe as a layer named "" (the empty string). It cannot be deleted but
+The main matrix is available as a layer named "" (the empty string). It cannot be deleted but
 otherwise supports the same operations as any other layer.
 
 As a convenience, layers are also available directly on the connection object. The above


=====================================
doc/format/index.rst
=====================================
@@ -105,7 +105,7 @@ Global attributes
 ^^^^^^^^^^^^^^^^^
 
 -  There MUST be an HDF5 group ``/attrs`` containing global attributes.
--  There MUST be a HDF5 dataset ``/attrs/LOOM_SPEC_VERSION`` with the value ``v3.0.0``.
+-  There MUST be a HDF5 dataset ``/attrs/LOOM_SPEC_VERSION`` with the value ``3.0.0``.
 
 Global attributes apply semantically to the whole file, not any specific part of it. 
 Such attributes are stored in the HDF5 group ``/attrs`` and can be any valid scalar


=====================================
kallisto/README.md
=====================================
@@ -0,0 +1,64 @@
+# Building an annotated kallisto index
+
+For the human genome, see the notebooks subdirectory.
+
+## Building the mouse kallisto index
+
+These instructions work on Linux (tested on CentOS 7).
+
+1. Make sure the bedtools and kallisto packages are installed on the system.
+  - bedtools from https://bedtools.readthedocs.io/en/latest/content/installation.html
+  - kallisto from https://pachterlab.github.io/kallisto/download.html
+
+2. Create a working directory, `cd` into it, and copy the files above into it.
+
+3. Download and preprocess input files:
+
+`bash mouse_download.sh`
+
+  This will create a directory "inputs" and put some files there as well as in the current directory.
+
+4. Download "BrowseTF  TcoF-DB.xlsx" from https://tools.sschmeier.com/tcof/browse/?type=tcof&species=mouse&class=all# by clicking the "Excel" button. (Main page is https://tools.sschmeier.com/tcof/home/).
+  Open the file in Excel and save it tab-separated as "inputs/TF_TcoF-DB.tsv" (the filename expected by mouse_build.py).
+
+5. You need to download some annotations for Mouse GRCm38 from BioMart (https://m.ensembl.org/biomart). Open this link in a new browser tab:
+
+http://www.ensembl.org/biomart/martview/7c9b283e3eca26cb81449ec518f4fc14?VIRTUALSCHEMANAME=default&ATTRIBUTES=mmusculus_gene_ensembl.default.feature_page.ensembl_gene_id|mmusculus_gene_ensembl.default.feature_page.ensembl_gene_id_version|mmusculus_gene_ensembl.default.feature_page.ensembl_transcript_id|mmusculus_gene_ensembl.default.feature_page.ensembl_transcript_id_version|mmusculus_gene_ensembl.default.feature_page.ucsc|mmusculus_gene_ensembl.default.feature_page.vega_translation|mmusculus_gene_ensembl.default.feature_page.ccds&FILTERS=&VISIBLEPANEL=resultspanel
+
+  On this BioMart page, click the "Go" button, and save the downloaded "mart_export.txt" file as "inputs/mart_export.txt".
+  The file should contain the following columns in the header:
+      Gene stable ID        Gene stable ID version  Transcript stable ID    Transcript stable ID version    UCSC Stable ID  Vega translation ID     CCDS ID
+
+  If the link fails, you need to manually select the proper dataset and columns from the https://m.ensembl.org/biomart webpage and download:
+  * Select Dataset "Ensembl Genes 101"/"Mouse genes GRCm38".
+  * Select Attributes as in the columns above: the first 4 should be auto-selected. Select the remaining 3 from the "EXTERNAL" section, clicking the 3 boxes in the order above.
+  * Click "Results", export using the "Go" button, and save to "inputs/mart_export.txt".
+
+6. Run the annotation assembly script:
+
+`python mouse_build.py`
+
+7. Create the "manifest.json" file or use the one supplied above. It should contain:
+```
+{
+    "species": "Mus musculus",
+    "index_file": "gencode.vM23.fragments.idx",
+    "gene_metadata_file": "gencode.vM23.metadata.tab",
+    "gene_metadata_key": "AccessionVersion",
+    "fragments_to_genes_file": "fragments2genes.txt",
+    "layers": {
+        "unspliced": "unspliced_fragments.txt",
+        "spliced": "spliced_fragments.txt"
+    }
+}
+```
+
+8. Run the fragment generator script:
+
+`python mouse_generate_fragments.py`
+
+9. Build the kallisto index:
+
+`kallisto index -i gencode.vM23.fragments.idx -k 31 inputs/gencode.vM23.fragments.fa`
+
+10. Refer to the human notebook (see the notebooks subdirectory) for more information on the output files.
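+
+Once the index is built, this directory can be used with loompy to create a .loom file directly from 10x fastq files. A minimal sketch (paths, sample id, and the samples metadata file are placeholders; see `loompy.create_from_fastq`):
+
+```
+import loompy
+fastqs = ["10X123_1_R1.fastq.gz", "10X123_1_R2.fastq.gz"]
+loompy.create_from_fastq("10X123_1.loom", "10X123_1", fastqs, "path/to/this/index/dir", "samples.tab", n_threads=6)
+```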


=====================================
kallisto/manifest.json
=====================================
@@ -0,0 +1,11 @@
+{
+    "species": "Mus musculus",
+    "index_file": "gencode.vM23.fragments.idx",
+    "gene_metadata_file": "gencode.vM23.metadata.tab",
+    "gene_metadata_key": "AccessionVersion",
+    "fragments_to_genes_file": "fragments2genes.txt",
+    "layers": {
+        "unspliced": "unspliced_fragments.txt",
+        "spliced": "spliced_fragments.txt"
+    }
+}


=====================================
kallisto/mouse_build.py
=====================================
@@ -0,0 +1,128 @@
+from collections import defaultdict
+
+d = "./"
+
+mgiID2MRK_ENSE = {}
+enseID2mgiID = {}
+with open(d + "inputs/MRK_ENSEMBL.rpt") as f:
+	for line in f:
+		items = line[:-1].split("\t")
+		mgiID = items[0]
+		mgiID2MRK_ENSE[mgiID] = items
+		enseID = items[5]
+		enseID2mgiID[enseID] = mgiID
+
+enseID2MGI_GMC = {}
+with open(d + "inputs/MGI_Gene_Model_Coord.rpt") as f:
+	MGI_GMC_headers = f.readline()[:-1].split("\t")
+	for line in f:
+		items = line[:-1].split("\t")
+		enseID = items[10]
+		enseID2MGI_GMC[enseID] = items
+
+mgiID2MRK_Seq = {}
+with open(d + "inputs/MRK_Sequence.rpt") as f:
+	MRK_Seq_headers = f.readline()[:-1].split("\t")
+	for line in f:
+		items = line[:-1].split("\t")
+		mgiID = items[0]
+		mgiID2MRK_Seq[mgiID] = items
+
+enseID2mart = {}
+with open(d + "inputs/mart_export.txt") as f:
+	mart_headers = f.readline()[:-1].split("\t")
+	for line in f:
+		items = line[:-1].split("\t")
+		enseID = items[0]
+		enseID2mart[enseID] = items
+
+geneSymbol2TF = {}
+with open(d + "inputs/TF_TcoF-DB.tsv") as f:
+	TF_headers = f.readline()[:-1].split("\t")
+	for line in f:
+		items = line[:-1].split("\t")
+		geneSymbol = items[0]
+		geneSymbol2TF[geneSymbol] = items
+
+geneSymbol2Regulated = defaultdict(list)
+with open(d + "inputs/trrust_rawdata.mouse.tsv") as f:
+	for line in f:
+		items = line[:-1].split("\t")
+		TFSymbol = items[0]
+		geneSymbol2Regulated[TFSymbol].append(items[1])
+
+with open(d + "gencode.vM23.metadata.tab", "w") as fout:
+	fout.write("\t".join([
+		"Accession",
+		"AccessionVersion",
+		"Gene",
+		"FullName",
+		"GeneType",
+		"HgncID",
+		"Chromosome",
+		"Strand",
+		"ChromosomeStart",
+		"ChromosomeEnd",
+		"LocusGroup",
+		"LocusType",
+		"Location",
+		"LocationSortable",
+		"Aliases",
+		"VegaID",
+		"UcscID",
+		"RefseqID",
+		"CcdsID",
+		"UniprotID",
+		"PubmedID",
+		"MgdID",
+		"RgdID",
+		"CosmicID",
+		"OmimID",
+		"MirBaseID",
+		"IsTFi (TcoF-DB)",
+		"DnaBindingDomain",
+		"Regulates (TRRUST)"
+	]))
+	fout.write("\n")
+	with open(d + "inputs/gencode.vM23.primary_assembly.annotation.gtf") as f:
+		for line in f:
+			if line.startswith("##"):
+				continue
+			items = line[:-1].split("\t")
+			if items[2] != "gene":
+				continue
+			extra = {x.strip().split(" ")[0]: x.strip().split(" ")[1].strip('"') for x in items[8].split(";")[:-1]}
+			enseID = extra["gene_id"].split(".")[0]
+			mgiID = enseID2mgiID.get(enseID, "")  # MGI marker id for this Ensembl gene, if any
+			geneSymbol = extra.get("gene_name", "")
+			fout.write("\t".join([
+				enseID,
+				extra["gene_id"],
+				geneSymbol,
+				enseID2MGI_GMC[enseID][3] if enseID in enseID2MGI_GMC else "",  # full name
+				extra["gene_type"],  # gene type from gencode
+				"",  # HGNC id
+				items[0],  # Chromosome
+				items[6],
+				items[3],  # Start
+				items[4],  # End
+				"",  # Locus group
+				mgiID2MRK_ENSE[mgiID][8] if mgiID in mgiID2MRK_ENSE else "",  # Locus type
+				"",  # Location
+				"",  # Location, sortable
+				"",  # Aliases
+				enseID2mart[enseID][5] if enseID in enseID2mart else "",  # VEGA id
+				enseID2mart[enseID][4] if enseID in enseID2mart else "",  # UCSC id
+				mgiID2MRK_Seq[mgiID][12] if mgiID in mgiID2MRK_Seq else "",  # Refseq id
+				enseID2mart[enseID][6] if enseID in enseID2mart else "",  # CCDS id
+				mgiID2MRK_Seq[mgiID][14] if mgiID in mgiID2MRK_Seq else "",  # Uniprot id
+				"",  # Pubmed id
+				"",  # MGD id
+				"",  # RGD id
+				"",  # COSMIC id
+				"",  # OMIM id
+				"",  # MIRbase id
+				"True" if (geneSymbol in geneSymbol2TF) else "False", #  IsTF?
+				"", # DBD
+				",".join(geneSymbol2Regulated[geneSymbol])  #  TF regulated genes
+				]))
+			fout.write("\n")


=====================================
kallisto/mouse_download.sh
=====================================
@@ -0,0 +1,45 @@
+#!/bin/bash
+mkdir -p inputs
+
+# Make sure these manual steps have been done:
+# Download "BrowseTF  TcoF-DB.xlsx" from https://tools.sschmeier.com/tcof/browse/?type=tcof&species=mouse&class=all# (a button at https://tools.sschmeier.com/tcof/home/)
+#   Open the file in Excel and save tab-separated as "inputs/TF_TcoF-DB.tsv"
+#
+# You need to import data from BioMart using this link:
+# http://www.ensembl.org/biomart/martview/7c9b283e3eca26cb81449ec518f4fc14?VIRTUALSCHEMANAME=default&ATTRIBUTES=mmusculus_gene_ensembl.default.feature_page.ensembl_gene_id|mmusculus_gene_ensembl.default.feature_page.ensembl_gene_id_version|mmusculus_gene_ensembl.default.feature_page.ensembl_transcript_id|mmusculus_gene_ensembl.default.feature_page.ensembl_transcript_id_version|mmusculus_gene_ensembl.default.feature_page.ucsc|mmusculus_gene_ensembl.default.feature_page.vega_translation|mmusculus_gene_ensembl.default.feature_page.ccds&FILTERS=&VISIBLEPANEL=resultspanel
+# by clicking "Go" button, and saving the downloaded "mart_export.txt" file in "inputs/mart_export.txt".
+# The file should contain the following columns:
+# Gene stable ID	Gene stable ID version	Transcript stable ID	Transcript stable ID version	UCSC Stable ID	Vega translation ID	CCDS ID
+
+wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M23/GRCm38.primary_assembly.genome.fa.gz
+wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M23/gencode.vM23.primary_assembly.annotation.gtf.gz
+wget https://github.com/10XGenomics/cellranger/raw/master/lib/python/cellranger/barcodes/3M-february-2018.txt.gz
+wget https://github.com/10XGenomics/cellranger/raw/master/lib/python/cellranger/barcodes/737K-april-2014_rc.txt
+wget https://github.com/10XGenomics/cellranger/raw/master/lib/python/cellranger/barcodes/737K-august-2016.txt
+
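+# Extract per-gene spans from the GTF into BED (chrom, start, end, gene_id, score, strand),
+# sort and merge them, then pull the unspliced (pre-mRNA) gene sequences out of the genome fasta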
+zcat gencode.vM23.primary_assembly.annotation.gtf.gz | gawk 'OFS="\t" {if ($3=="gene") {print $1,$4-1,$5,$10,0,$7}}' | tr -d '";' > gencode.vM23.primary_assembly.annotation.bed
+bedtools sort -i gencode.vM23.primary_assembly.annotation.bed > gencode.vM23.primary_assembly.annotation.sorted.bed
+bedtools merge -i gencode.vM23.primary_assembly.annotation.sorted.bed -s -c 4 -o collapse > gencode.vM23.primary_assembly.annotation.merged.bed
+gunzip GRCm38.primary_assembly.genome.fa.gz 
+bedtools getfasta -name -fo gencode.vM23.unspliced.fa -fi GRCm38.primary_assembly.genome.fa -bed gencode.vM23.primary_assembly.annotation.sorted.bed
+
+mv 737K-april-2014_rc.txt 10xv1_whitelist.txt
+mv 737K-august-2016.txt 10xv2_whitelist.txt
+gunzip 3M-february-2018.txt.gz 
+mv 3M-february-2018.txt 10xv3_whitelist.txt 
+
+mv GRCm38.primary_assembly.genome.fa* inputs/
+mv gencode.vM23.unspliced.fa inputs/
+mv gencode.vM23.primary_assembly.annotation.bed inputs/
+mv gencode.vM23.primary_assembly.annotation.sorted.bed inputs/
+gunzip gencode.vM23.primary_assembly.annotation.gtf.gz 
+mv gencode.vM23.primary_assembly.annotation.gtf inputs/
+
+cd inputs
+wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M23/gencode.vM23.transcripts.fa.gz
+wget https://www.grnpedia.org/trrust/data/trrust_rawdata.mouse.tsv
+gunzip gencode.vM23.transcripts.fa.gz 
+wget http://www.informatics.jax.org/downloads/reports/MGI_Gene_Model_Coord.rpt
+wget http://www.informatics.jax.org/downloads/reports/MRK_ENSEMBL.rpt
+wget http://www.informatics.jax.org/downloads/reports/MRK_Sequence.rpt
+cd ..
+


=====================================
kallisto/mouse_generate_fragments.py
=====================================
@@ -0,0 +1,109 @@
+from typing import List, Tuple
+
+from Bio import SeqIO
+from Bio.SeqRecord import SeqRecord
+from Bio.Seq import Seq
+
+extent = 600  # how many bases away from the poly-A to include
+min_len = 90  # minimum number of non-repeat bases required to make a fragment
+
+def find_polys(seq: SeqRecord, c: str = "A", n: int = 15) -> List[Tuple[int, int]]:
+	found = []
+	count = seq[:n].count(c)  # Count occurrences of c in the first window of length n
+	if count >= n - 1:  # A match: at most one mismatch in the window
+		found.append(0)
+	ix = 0
+	while ix < len(seq) - n - 1:
+		if seq[ix] == c:  # Outgoing base
+			count -= 1
+		if seq[ix + n] == c:  # Incoming base
+			count += 1
+		ix += 1
+		if count >= n - 1:  # We have a match
+			found.append(ix)
+	
+	sorted_by_lower_bound = [(f, f + n) for f in found]
+	# merge intervals (https://codereview.stackexchange.com/questions/69242/merging-overlapping-intervals)
+	merged = []
+	for higher in sorted_by_lower_bound:
+		if not merged:
+			merged.append(higher)
+		else:
+			lower = merged[-1]
+			# test for intersection between lower and higher:
+			# we know via sorting that lower[0] <= higher[0]
+			if higher[0] <= lower[1]:
+				upper_bound = max(lower[1], higher[1])
+				merged[-1] = (lower[0], upper_bound)  # replace by merged interval
+			else:
+				merged.append(higher)
+	return merged
+
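+# Worked example (a sanity check, not used below): since one mismatch is tolerated,
+# find_polys("C" * 5 + "A" * 20 + "C" * 5, c="A", n=14) returns [(4, 26)] - a single
+# merged interval covering the poly-A run, extended by one base on each side.
+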
+polyAs = {}
+polyTs = {}
+for fasta in SeqIO.parse(open("inputs/gencode.vM23.unspliced.fa"),'fasta'):
+	gene_id = fasta.id
+	intervals = find_polys(fasta.seq, c="A", n=14)
+	if len(intervals) > 0:
+		polyAs[gene_id] = intervals
+	# Collect fragments on the opposite strand, downstream of poly-Ts (not sure if such reads really happen?)
+	intervals = find_polys(fasta.seq, c="T", n=14)
+	if len(intervals) > 0:
+		polyTs[gene_id] = intervals
+
+tr2g = {}
+with open("inputs/gencode.vM23.primary_assembly.annotation.gtf") as f:
+	for line in f:
+		if "\ttranscript\t" in line:
+			items = line.split("; ")
+			chrom, _, _, start, end, _, strand, _, gid = items[0].split("\t")
+			gene_id = gid.split('"')[1]
+			transcript_id = items[1].split('"')[1]
+			gene_type = items[2].split('"')[1]
+			gene_name = items[3].split('"')[1]
+			tr2g[transcript_id] = (chrom, start, end, strand, gene_id, gene_type, gene_name)
+
+count = 0
+with open("fragments2genes.txt", "w") as ftr2g:
+	with open("inputs/gencode.vM23.fragments.fa", "w") as fout:
+		# Write the nascent fragments, with one partial transcript per internal poly-A/T site
+		with open("unspliced_fragments.txt", "w") as fucapture:
+			for fasta in SeqIO.parse(open("inputs/gencode.vM23.unspliced.fa"),'fasta'):  # Iterate over the unspliced (pre-mRNA) gene sequences
+				gene_id = fasta.id
+				if gene_id in polyAs:
+					for interval in polyAs[gene_id]:
+						seq = str(fasta.seq[max(0, interval[0] - extent):interval[0]])
+						#seq = seq.translate(tr).strip("N")
+						if len(seq) >= min_len:
+							count += 1
+							transcript_id = f"{gene_id}.A{interval[0]}"
+							trseq = SeqRecord(Seq(seq), transcript_id, '', '')
+							fout.write(trseq.format("fasta"))
+							ftr2g.write(f"{transcript_id}\t{gene_id}\n")
+							fucapture.write(f"{transcript_id}\n")
+				if gene_id in polyTs:
+					for interval in polyTs[gene_id]:
+						seq = str(fasta.seq[interval[1]:interval[1] + extent].reverse_complement())
+						#seq = seq.translate(tr).strip("N")
+						if len(seq) >= min_len:
+							count += 1
+							transcript_id = f"{gene_id}.T{interval[0]}"
+							trseq = SeqRecord(Seq(seq), transcript_id, '', '')
+							fout.write(trseq.format("fasta"))
+							ftr2g.write(f"{transcript_id}\t{gene_id}\n")
+							fucapture.write(f"{transcript_id}\n")
+		# Write the mature fragments, covering the 3' end of each mature transcript
+		with open("spliced_fragments.txt", "w") as fscapture:
+			for fasta in SeqIO.parse(open("inputs/gencode.vM23.transcripts.fa"),'fasta'):  # Iterate over the mature (spliced) transcript sequences
+				transcript_id = fasta.id.split("|")[0]
+				gene_id = fasta.id.split("|")[1]
+				attrs = tr2g[transcript_id]
+				seq = str(fasta.seq[-extent:])
+				if len(seq) >= min_len:
+					count += 1
+					trseq = SeqRecord(Seq(seq), f"{transcript_id}.{count} gene_id:{attrs[4]} gene_name:{attrs[6]}", '', '')
+					fout.write(trseq.format("fasta"))
+					ftr2g.write(f"{transcript_id}.{count}\t{attrs[4]}\n")
+					fscapture.write(f"{transcript_id}.{count}\n")
+


=====================================
loompy/__init__.py
=====================================
@@ -8,8 +8,9 @@ from .loom_view import LoomView
 from .loom_layer import MemoryLoomLayer, LoomLayer
 from .to_html import to_html
 from .view_manager import ViewManager
-from .loompy import connect, create, create_append, combine, create_from_cellranger, LoomConnection, new, combine_faster, create_from_matrix_market
+from .loompy import connect, create, create_append, combine, create_from_cellranger, LoomConnection, new, combine_faster, create_from_matrix_market, create_from_star
 from .loom_validator import LoomValidator
 from ._version import __version__, loom_spec_version
 from .bus_file import create_from_fastq
 from .cell_calling import call_cells
+from .metadata_loaders import load_gene_metadata, make_row_attrs_from_gene_annotations, make_row_attrs_from_gene_metadata, load_sample_metadata


=====================================
loompy/_version.py
=====================================
@@ -1,2 +1,2 @@
-__version__ = '3.0.6'
+__version__ = '3.0.7'
 loom_spec_version = '3.0.0'


=====================================
loompy/attribute_manager.py
=====================================
@@ -152,11 +152,17 @@ class AttributeManager:
 					raise ValueError(f"Attribute '{name}' must have exactly {self.ds.shape[self.axis]} values but {len(values)} were given")
 				if self.ds._file[a].__contains__(name):
 					del self.ds._file[a + name]
-				
-				if values.dtype == np.object_:
-					self.ds._file.create_dataset(a + name, data=values, dtype=h5py.special_dtype(vlen=str))
-				else:
-					self.ds._file[a + name] = values
+
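+				# Store the attribute as a chunked, gzip-compressed dataset; 2-D attributes stay resizable along the second axis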
+				self.ds._file.create_dataset(
+					a + name,
+					data=values,
+					dtype=h5py.special_dtype(vlen=str) if values.dtype == np.object_ else values.dtype,
+					maxshape=(values.shape[0], ) if len(values.shape) == 1 else (values.shape[0], None),
+					fletcher32=False,
+					compression="gzip",
+					shuffle=False,
+					compression_opts=2
+				)
 				self.ds._file[a + name].attrs["last_modified"] = timestamp()
 				self.ds._file[a].attrs["last_modified"] = timestamp()
 				self.ds._file.attrs["last_modified"] = timestamp()


=====================================
loompy/bus_file.py
=====================================
@@ -117,6 +117,8 @@ def ixs_thatsort_a2b(a: np.ndarray, b: np.ndarray, check_content: bool = True) -
 	return np.argsort(a)[np.argsort(np.argsort(b))]
 
 
+# TODO: This function is a copy of the same function in loompy.metadata_loaders, call that one instead
+
 def load_sample_metadata(path: str, sample_id: str) -> Dict[str, str]:
 	if not os.path.exists(path):
 		raise ValueError(f"Samples metadata file '{path}' not found.")
@@ -325,11 +327,11 @@ class BusFile:
 		row_attrs = {}
 		# Transpose the gene metadata
 		for i, attr in enumerate(self.gene_metadata_attributes):
-			row_attrs[attr] = [v[i] for v in self.genes.values()]
+			row_attrs[attr] = np.array([v[i] for v in self.genes.values()])
 
 		# Create cell attributes
 		col_attrs = {
-			"CellID": [sample_id + "_" + twobit_to_dna(int(cid), 16) for cid in self.cell_ids],
+			"CellID": np.array([sample_id + "_" + twobit_to_dna(int(cid), 16) for cid in self.cell_ids]),
 			"TotalUMIs": self.total_umis[self.valid_cells]
 		}
 


=====================================
loompy/layer_manager.py
=====================================
@@ -147,7 +147,7 @@ class LayerManager:
 				# Fill the matrix with sparse data
 				if sparse.issparse(val):
 					m = val.tocsc()
-					window = 6400
+					window = max(1, 1024**3 // (8 * m.shape[0]))  # write at most ~1 GiB of float64 values at a time
 					ix = 0
 					while ix < val.shape[1]:
 						window = min(window, m.shape[1] - ix)
@@ -155,7 +155,7 @@ class LayerManager:
 							break
 						self.ds._file[path][:, ix:ix + window] = m[:, ix: ix + window].toarray()
 						ix += window
-					
+
 				self.ds._file.flush()
 			else:
 				self.__dict__["storage"][name] = val


=====================================
loompy/loompy.py
=====================================
@@ -25,6 +25,7 @@
 import gzip
 import logging
 import os.path
+import re
 from shutil import copyfile
 from typing import Tuple, Union, Any, Dict, List, Iterable, Callable, Optional
 
@@ -33,9 +34,12 @@ import numpy as np
 import numpy_groupies.aggregate_numpy as npg
 import scipy.sparse
 from scipy.io import mmread
+import scipy.sparse as sparse
 
 import loompy
 from loompy import deprecated, timestamp
+from loompy.metadata_loaders import make_row_attrs_from_gene_metadata, load_sample_metadata
+from .cell_calling import call_cells
 
 
 class LoomConnection:
@@ -79,7 +83,7 @@ class LoomConnection:
 		if validate:
 			lv = loompy.LoomValidator()
 			if not lv.validate(filename):
-				raise ValueError("\n".join(lv.errors) + f"\n{filename} does not appead to be a valid Loom file according to Loom spec version '{lv.version}'")
+				raise ValueError("\n".join(lv.errors) + f"\n{filename} does not appear to be a valid Loom file according to Loom spec version '{lv.version}'")
 
 		self._file = h5py.File(filename, mode)
 		self._closed = False
@@ -316,6 +320,7 @@ class LoomConnection:
 				self.shape = (self.ra[k].shape[0], self.shape[1])
 			if len(self.ca) == 0:
 				for k, v in col_attrs.items():
+					v = np.array(v)
 					self.ca[k] = np.zeros(0, v.dtype)
 
 		layers_dict: Dict[str, np.ndarray] = {}
@@ -388,6 +393,7 @@ class LoomConnection:
 			self.shape = (self.shape[0], n_cols)
 			todel = []
 			for key, vals in col_attrs.items():
+				vals = np.array(vals)
 				if vals.shape[1:] != self.col_attrs[key].shape[1:]:
 					logging.debug(f"Removing attribute {key} because shape {vals.shape} did not match existing shape {self.col_attrs[key].shape} beyond first dimension")
 					todel.append(key)
@@ -892,6 +898,8 @@ class LoomConnection:
 						if np.issubdtype(type(val), np.str_):
 							valnew = val.replace("/", "-")  # Slashes are not allowed in attribute names
 							valnew = valnew.replace(".", "_")  # Nor are periods
+						else:
+							valnew = val
 						ca[key + "_" + str(valnew)] = npg.aggregate(zero_strt_sort_noholes_lbls, (self.ca[key] == val).astype('int'), func="sum", fill_value=0)
 				elif func == "mode":
 					def mode(x):  # type: ignore
@@ -1048,10 +1056,10 @@ def create(filename: str, layers: Union[np.ndarray, Dict[str, np.ndarray], loomp
 			raise ValueError(f"Layer '{name}' is not the same shape as the main matrix")
 	for name, ra in row_attrs.items():
 		if len(ra) != shape[0]:
-			raise ValueError(f"Row attribute '{name}' is not the same length ({ra.shape[0]}) as number of rows in main matrix ({shape[0]})")
+			raise ValueError(f"Row attribute '{name}' is not the same length ({len(ra)}) as number of rows in main matrix ({shape[0]})")
 	for name, ca in col_attrs.items():
 		if len(ca) != shape[1]:
-			raise ValueError(f"Column attribute '{name}' is not the same length ({ca.shape[0]}) as number of columns in main matrix ({shape[1]})")
+			raise ValueError(f"Column attribute '{name}' is not the same length ({len(ca)}) as number of columns in main matrix ({shape[1]})")
 
 	try:
 		with new(filename, file_attrs=file_attrs) as ds:
@@ -1195,6 +1203,124 @@ def create_from_matrix_market(out_file: str, sample_id: str, layer_paths: Dict[s
 				ds[name] = layer
 
 
+def create_from_star(indir: str, outfile: str, sample_id: str,
+                     cell_filter: str = "star", expected_n_cells: int = 0, min_total_umis: int = 0, ambient_pthreshold: float = 1.0,
+                     sample_metadata_file: str = None, gtf_file: str = None, main_layer: str = "velosum", extra_layers: List[str] = None) -> None:
+	"""
+		Create a .loom file from STARsolo output
+		Args:
+			  indir (str):	              path to STARsolo output folder (the one that contains 'Solo.out')
+			  outfile (str):              path and name of the new loom file
+			  sample_id (str):            sample_id (typically 10Xxxx_x)
+			  cell_filter (str):          'emptydrops', 'star', 'none', 'combine', 'Gene', 'GeneFull', 'GeneFull_ExonOverIntron', 'GeneFull_Ex50pAS'
+			                              emptydrops: with parameters expected_n_cells, min_total_umis, and ambient_pthreshold
+			                              star: uses the barcodes from 'Solo.out/Velocyto/filtered' subdir of indir (Same as 'Gene')
+			                              combine: combines cells from all extra_layers and fills layer's non-valid cells with zeroes
+			                              Gene... : uses the cells from that Solo.out filtered subdir. Will be added to extra_layers.
+			                              none: all cells are kept - Lots of cells!
+			  sample_metadata_file (str): file of sample metadata or path to sqlite3 database. Used for gene annotation.
+			  gtf_file (str):             path to gtf file used by STARsolo. Used for cell annotations.
+			  main_layer(str):            The STAR output matrix to use for the main loom layer (e.g. 'Gene', or 'GeneFull').
+			                              'velosum' adds up spliced, unspliced, and ambiguous.
+			  extra_layers(list):         Names of additional layers to add in data (e.g. 'GeneFull').
+		Returns:
+		           nothing
+
+	"""
+	gtypes = extra_layers if extra_layers else []
+	if cell_filter in ("Gene", "GeneFull", "GeneFull_ExonOverIntron", "GeneFull_Ex50pAS") and cell_filter not in gtypes:
+		gtypes.append(cell_filter)
+	if main_layer in ("Gene", "GeneFull", "GeneFull_ExonOverIntron", "GeneFull_Ex50pAS") and main_layer not in gtypes:
+		gtypes.append(main_layer)
+	subdir = "raw" if cell_filter == "none" else "filtered"
+	velodir = os.path.join(indir, "Solo.out", "Velocyto", subdir)
+	features = [ l.split('\t') for l in open(os.path.join(velodir, "features.tsv")).readlines() ]
+	accessions = [ f[0] for f in features ]
+	genes = [ f[1] for f in features ]
+	gtype2bcs = {}
+	n_genes = len(accessions)
+	layers = {}
+	for gtype in gtypes:
+		mtxpath = os.path.join(indir, "Solo.out", gtype, subdir, "matrix.mtx")
+		bcspath = os.path.join(indir, "Solo.out", gtype, subdir, "barcodes.tsv")
+		if os.path.exists(mtxpath) and os.path.exists(bcspath):
+			bcs = [ l.rstrip() for l in open(bcspath).readlines() ]
+			gtype2bcs[gtype] = bcs
+			mtx = np.loadtxt(mtxpath, skiprows=3, delimiter=' ')
+			layers[gtype] = sparse.csc_matrix((mtx[:,2], (mtx[:,0]-1, mtx[:,1]-1)), shape = (n_genes, len(bcs))).todense()
+	velobcspath = os.path.join(velodir, "barcodes.tsv")
+	velobcs = [ l.rstrip() for l in open(velobcspath).readlines() ]
+	gtype2bcs['spliced'] = gtype2bcs['unspliced'] = gtype2bcs['ambiguous'] = velobcs
+	velomtxshape = (n_genes, len(velobcs))
+	common_mtx = os.path.join(velodir, "matrix.mtx")
+	if os.path.exists(common_mtx):
+		mtx = np.loadtxt(common_mtx, skiprows=3, delimiter=' ')
+		layers['spliced'] = allspliced = sparse.csc_matrix((mtx[:,2], (mtx[:,0]-1, mtx[:,1]-1)), shape = velomtxshape).todense()
+		layers['unspliced'] = sparse.csc_matrix((mtx[:,3], (mtx[:,0]-1, mtx[:,1]-1)), shape = velomtxshape).todense()
+		layers['ambiguous'] = sparse.csc_matrix((mtx[:,4], (mtx[:,0]-1, mtx[:,1]-1)), shape = velomtxshape).todense()
+	else: # STAR >= 2.7.9
+		for vtype in ('spliced', 'unspliced', 'ambiguous'):
+			mtx = np.loadtxt(os.path.join(velodir, vtype + ".mtx"), skiprows=3, delimiter=' ')
+			layers[vtype] = sparse.csc_matrix((mtx[:,2], (mtx[:,0]-1, mtx[:,1]-1)), shape = velomtxshape).todense()
+	if cell_filter == "emptydrops":
+		allspliced = layers['spliced']
+		spliced_total_umis = np.array(allspliced.sum(axis=0))[0]
+		ambient_umis, ambient_pvalue = call_cells(sparse.csc_matrix(allspliced), expected_n_cells)  # layers were densified above, so convert back to CSC
+		valid_cell_idxs = (ambient_pvalue <= ambient_pthreshold) | (spliced_total_umis >= min_total_umis)
+		valid_pvalue = ambient_pvalue[valid_cell_idxs]
+		valid_bcs = np.array(gtype2bcs['spliced'])[valid_cell_idxs]  # boolean-index the barcode list via numpy
+	elif cell_filter == "combine":
+		all_bcs = set()
+		for bcs in gtype2bcs.values():
+			all_bcs.update(bcs)
+		valid_bcs = all_bcs
+	elif cell_filter in gtypes:
+		valid_bcs = gtype2bcs[cell_filter]
+	elif cell_filter != "none":
+		if cell_filter != "star":
+			print(f"Unknown cell_filter type: {cell_filter} - using cells from Solo.out/Velocyto/filtered/")
+		valid_bcs = velobcs
+
+	valid_bcs = list(valid_bcs)
+	n_valid_cells = len(valid_bcs)
+	new_shape = (n_genes, n_valid_cells)
+	cellids = np.array([f"{sample_id}:{v_bc}" for v_bc in valid_bcs])
+	ca = { "CellID": cellids }
+	# Re-map every layer onto the common set of valid barcodes, zero-filling missing cells
+	for gtype in layers:
+		valid_cells_layer = np.full(new_shape, 0, layers[gtype].dtype)
+		ca_valid = np.full(n_valid_cells, 0)
+		for fromidx, bc in enumerate(gtype2bcs[gtype]):
+			try:
+				toidx = valid_bcs.index(bc)
+				valid_cells_layer[:, toidx] = layers[gtype][:, fromidx].reshape(n_genes,)
+				ca_valid[toidx] = 1
+			except ValueError:
+				pass  # barcode is not among the valid cells
+		layers[gtype] = valid_cells_layer
+		ca["Valid_" + gtype] = ca_valid
+	if main_layer == "velosum":
+		layers[''] = layers['spliced'] + layers['unspliced'] + layers['ambiguous']
+	else:
+		layers[''] = layers[main_layer]
+		del layers[main_layer]
+	total_umis = layers[''].sum(axis=0)
+	ca["BarcodeTotalUMIs"] = total_umis
+	if cell_filter == "emptydrops":
+		ca["AmbientPValue"] = valid_pvalue
+		ca["AmbientUMIs"] = np.full(n_valid_cells, ambient_umis)
+	if sample_metadata_file:
+		sample_metadata = load_sample_metadata(sample_metadata_file, sample_id)
+		for key, value in sample_metadata.items():
+			ca[key] = np.full(n_valid_cells, value)
+	if gtf_file:
+		ra = make_row_attrs_from_gene_metadata(gtf_file, accessions)
+	else:
+		ra = { "Gene": np.array(genes), "Accession": np.array(accessions) }
+	create(filename=outfile, layers=layers, row_attrs=ra, col_attrs=ca)
+
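+# Example usage (a sketch with hypothetical paths, not part of the library): build a loom from a
+# STARsolo run, keeping STAR's own cell calls and adding the GeneFull matrix as an extra layer:
+#   create_from_star("runs/10X123_1", "10X123_1.loom", "10X123_1", cell_filter="star",
+#                    gtf_file="refs/gencode.vM23.annotation.gtf", extra_layers=["GeneFull"])
+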
 def create_from_kallistobus(out_file: str, in_dir: str, tr2g_file: str, whitelist_file: str, file_attrs: Dict[str, str] = None, layers: Dict[str, str] = None):
 	"""
 	Create a loom file from a kallisto-bus output folder.
@@ -1387,3 +1513,4 @@ def connect(filename: str, mode: str = 'r+', *, validate: bool = True, spec_vers
 		Note: if validation is requested, an exception is raised if validation fails.
 	"""
 	return LoomConnection(filename, mode, validate=validate)
+


=====================================
loompy/metadata_loaders.py
=====================================
@@ -0,0 +1,126 @@
+from typing import Dict, Union, Iterable
+import os.path, re
+import sqlite3 as sqlite
+import numpy as np
+
+def load_gene_metadata(gtf_file: str) -> Dict[str, Dict[str, Union[int, str]]]:
+	"""
+	Read gene metadata from a GTF file.
+
+	Args:
+	  gtf_file (str):             path to GTF file
+
+	Returns:
+	  A Dict with each Accession (gtf "gene_id") pointing to a Dict of metadata keys -> values
+	"""
+	if not os.path.exists(gtf_file):
+		raise ValueError(f"Gene metadata file '{gtf_file}' not found.")
+	regex_genetype = re.compile('gene_biotype "([^"]+)"')
+	regex_geneid = re.compile('gene_id "([^"]+)"')
+	regex_genename = re.compile('gene_name "([^"]+)"')
+	geneid2annots = {}
+	for line in open(gtf_file).readlines():
+		if line.startswith('#'):
+			continue
+		fields = line.rstrip().split('\t')
+		chrom, source, feature_type, start_str, end_str, _, strand, _, tags = fields
+		if feature_type == "gene":
+			genename = geneid = regex_geneid.search(tags).group(1)
+			_genename_search = regex_genename.search(tags)
+			if _genename_search:
+				genename = _genename_search.group(1)
+			_genetype_search = regex_genetype.search(tags)
+			genetype = _genetype_search.group(1) if _genetype_search else "n/a"
+			attrs = { "Gene": genename, "Accession": geneid, "Biotype": genetype,
+				  "Chromosome": chrom, "Start": int(start_str), "End": int(end_str) }
+			geneid2annots[geneid] = attrs
+			geneid2annots[genename] = attrs
+	return geneid2annots
+
+def make_row_attrs_from_gene_annotations(acc2annots: Dict[str, Dict[str, Union[int, str]]], ordered_features: Iterable[str]) -> Dict[str, np.ndarray]:
+	"""
+	Construct loom row attributes corresponding to ordered_features from load_gene_metadata output.
+
+	Args:
+	  acc2annots (Dict):           output from load_gene_metadata
+	  ordered_features (Iterable): the accessions (should match the GTF "gene_id") in matrix row order
+
+	Returns:
+	  A row attribute dictionary of attr -> numpy array, ready to assign to a Loom object.
+	"""
+	ra = {}
+	first_annot = next(iter(acc2annots.values()))
+	ra_attrs = list(first_annot.keys())
+	n_genes = len(ordered_features)
+	for ra_attr in ra_attrs:
+		ra[ra_attr] = np.zeros((n_genes,), dtype = object)
+	for idx, geneid in enumerate(ordered_features):
+		try:
+			annots = acc2annots[geneid]
+		except KeyError:
+			annots = { "Gene": geneid, "Accession": geneid, "Biotype": "n/a", "Chromosome": "Un", "Start": "0", "End": "0" }
+		for ra_attr in ra_attrs:
+			ra[ra_attr][idx] = annots[ra_attr]
+	return ra
+
+def make_row_attrs_from_gene_metadata(gtf_file: str, ordered_features: Iterable[str]) -> Dict[str, np.ndarray]:
+	"""
+	Read gene metadata from a GTF file and construct loom row attributes corresponding to ordered_features.
+
+	Args:
+	  gtf_file (str):              path to GTF file
+	  ordered_features (Iterable): the accessions (should match the GTF "gene_id") in matrix row order
+
+	Returns:
+	  A row attribute dictionary ready to assign to a Loom object.
+	"""
+	acc2annots = load_gene_metadata(gtf_file)
+	return make_row_attrs_from_gene_annotations(acc2annots, ordered_features)
+
+def load_sample_metadata(path: str, sample_id: str) -> Dict[str, str]:
+	"""
+	Read sample metadata from either an sqlite '.db' database or a tab-delimited file.
+	The tab-delimited file must have one sample per line and a header with a column
+	labelled 'Name' or 'SampleID' that contains a match for the sample_id argument.
+
+	Args:
+	  path (str):             path to sqlite database (.db) or tab-delimited metadata file
+	  sample_id (str):        id of the sample to annotate
+
+	Returns:
+	  A Dict of metadata keys -> values
+	"""
+	if not os.path.exists(path):
+		raise ValueError(f"Samples metadata file '{path}' not found.")
+	if path.endswith(".db"):
+		# sqlite3
+		with sqlite.connect(path) as db:
+			cursor = db.cursor()
+			cursor.execute("SELECT * FROM sample WHERE name = ?", (sample_id,))
+			keys = [x[0] for x in cursor.description]
+			vals = cursor.fetchone()
+			if vals is not None:
+				return dict(zip(keys, vals))
+			raise ValueError(f"SampleID '{sample_id}' was not found in the samples database.")
+	else:
+		result = {}
+		with open(path) as f:
+			headers = [x.lower() for x in f.readline()[:-1].split("\t")]
+			if "sampleid" not in headers and "name" not in headers:
+				raise ValueError("Required column 'SampleID' or 'Name' not found in sample metadata file")
+			if "sampleid" in headers:
+				sample_metadata_key_idx = headers.index("sampleid")
+			else:
+				sample_metadata_key_idx = headers.index("name")
+			sample_found = False
+			for line in f:
+				items = line[:-1].split("\t")
+				if len(items) > sample_metadata_key_idx and items[sample_metadata_key_idx] == sample_id:
+					for i, item in enumerate(items):
+						result[headers[i]] = item
+					sample_found = True
+		if not sample_found:
+			raise ValueError(f"SampleID '{sample_id}' not found in sample metadata file")
+		return result
+
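+# Example (hypothetical file): a minimal tab-delimited samples.tab might contain a header line
+# "SampleID\tTissue\tAge" and a row "10X123_1\tCortex\tP7"; then
+# load_sample_metadata("samples.tab", "10X123_1") returns
+# {'sampleid': '10X123_1', 'tissue': 'Cortex', 'age': 'P7'} (header keys are lower-cased).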


=====================================
loompy/normalize.py
=====================================
@@ -87,7 +87,7 @@ def materialize_attr_values(a: np.ndarray) -> np.ndarray:
 	if np.isscalar(a):
 		scalar = True
 		a = np.array([a])
-	result: np.ndarray = None             # This second clause takes care of attributes stored as variable-length ascii, which can be generated by loomR or Seurat
+	result: np.ndarray = None   # This second clause takes care of attributes stored as variable-length ascii, which can be generated by loomR or Seurat
 	if np.issubdtype(a.dtype, np.string_) or np.issubdtype(a.dtype, np.object_):
 		# First ensure that what we load is valid ascii (i.e. ignore anything outside 7-bit range)
 		if hasattr(a, "decode"):  # This takes care of Loom files that store strings as UTF8, which comes in as str and doesn't have a decode method
@@ -95,7 +95,11 @@ def materialize_attr_values(a: np.ndarray) -> np.ndarray:
 		else:
 			temp = a
 		# Then unescape XML entities and convert to unicode
-		result = np.array([html.unescape(x) for x in temp.astype(str)], dtype=object)
+		try:
+			result = np.array([html.unescape(x) for x in temp.astype(str)], dtype=object)
+		except Exception:  # Dirty hack to handle UTF-8 non-break-space in scalar strings. TODO: Rewrite this whole method completely!
+			if type(a[0]) == np.bytes_:
+				result = [ a[0].replace(b'\xc2\xa0', b'') ]
 	elif np.issubdtype(a.dtype, np.str_) or np.issubdtype(a.dtype, np.unicode_):
 		result = np.array(a.astype(str), dtype=object)
 	else:


=====================================
notebooks/build_index.ipynb
=====================================
@@ -1,19 +1,5 @@
 {
  "cells": [
-  {
-   "cell_type": "code",
-   "execution_count": 4,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import os\n",
-    "import sys\n",
-    "from typing import *\n",
-    "from Bio import SeqIO\n",
-    "from Bio.SeqRecord import SeqRecord\n",
-    "from Bio.Seq import Seq"
-   ]
-  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -22,7 +8,7 @@
     "\n",
     "WARNING: these instruction will create a genome index that is *only* suitable for use with 10x Chromium data. We create a composite index, with separate sequence fragments representing unspliced and spliced transcripts. Unspliced fragments will be generated from genomic sequence upstream of every poly-A stretch located inside of a gene locus. Spliced fragments will be generated from transcriptomic sequence upstream of every polyadenylation site in every known spliced transcript. This reflects the locations of reads generated by Chromium, but may not be suitable for other methods.\n",
     "\n",
-    "The instructions below show how to build a genome index for the human genome. For other species, the code below will have to be modified to accomodate different sources and formats of metadata for those species.\n",
+    "The instructions below show how to build a genome index for the human genome. For other species, the code below will have to be modified to accomodate different sources and formats of metadata for those species. Instructions for mouse are available in the loompy/kallisto subdirectory.\n",
     "\n",
     "### Install prerequisites\n",
     "\n",
@@ -46,7 +32,9 @@
     "\n",
     "Get the \"Genome sequence, primary assembly (GRCh38)” file named [GRCh38.primary_assembly.genome.fa.gz](ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/GRCh38.primary_assembly.genome.fa.gz)\n",
     "\n",
-    "Download the comprehensive PRI gene annotation on the primary assembly GTF file named [gencode.v31.primary_assembly.annotation.gtf](ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/gencode.v31.primary_assembly.annotation.gtf.gz)\n",
+    "Download the transcript sequences as a fasta file, replacing XX with the version number you want [gencode.vXX.transcripts.fa.gz](ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_XX/gencode.vXX.transcripts.fa.gz) . XX = 31 and 38 have been tested.\n",
+    "\n",
+    "Download the comprehensive PRI gene annotation on the primary assembly GTF file, replacing XX with the version number you want [gencode.vXX.primary_assembly.annotation.gtf](ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_XX/gencode.vXX.primary_assembly.annotation.gtf.gz) . XX = 31 and 38 have been tested.\n",
     "\n",
     "#### Download additional gene metadata from [HGNC](https://www.genenames.org/download/statistics-and-files/)\n",
     "\n",
@@ -64,24 +52,31 @@
     "\n",
     "#### Download the human transcription factor motif database\n",
     "\n",
-    "Get the [human_tfs_consensus.txt]() file from our Google Cloud. These metadata will be used to determine which genes are transcription factors, and to assign them to families.\n",
+    "Get the [human_tfs_consensus.tab](https://storage.googleapis.com/linnarsson-lab-www-blobs/human_tfs_consensus.tab) file from our Google Cloud. These metadata will be used to determine which genes are transcription factors, and to assign them to families.\n",
     "\n",
     "### Preprocess the genome data\n",
     "\n",
+    "Unpack gzip:ed files:\n",
+    "\n",
+    "```\n",
+    "gunzip -c gencode.*.primary_assembly.annotation.gtf.gz > gencode.primary_assembly.annotation.gtf\n",
+    "gunzip -c gencode.*.transcripts.fa.gz > gencode.transcripts.fa\n",
+    "gunzip GRCh38.primary_assembly.genome.fa.gz\n",
+    "```\n",
     "Create a BED file with all the genes:\n",
     "\n",
     "```\n",
-    "cat gencode.v31.primary_assembly.annotation.gtf |  gawk 'OFS=\"\\t\" {if ($3==\"gene\") {print $1,$4-1,$5,$10,0,$7}}' | tr -d '\";' > gencode.v31.primary_assembly.annotation.bed\n",
+    "cat gencode.primary_assembly.annotation.gtf |  gawk 'OFS=\"\\t\" {if ($3==\"gene\") {print $1,$4-1,$5,$10,0,$7}}' | tr -d '\";' > gencode.primary_assembly.annotation.bed\n",
     "```\n",
     "\n",
     "Create a fasta file of pre-mRNA (exons + introns) transcripts:\n",
     "\n",
     "```\n",
-    "bedtools sort -i gencode.v31.primary_assembly.annotation.bed > gencode.v31.primary_assembly.annotation.sorted.bed\n",
+    "bedtools sort -i gencode.primary_assembly.annotation.bed > gencode.primary_assembly.annotation.sorted.bed\n",
     "\n",
-    "bedtools merge -i gencode.v31.primary_assembly.annotation.sorted.bed -s -c 4 -o collapse > gencode.v31.primary_assembly.annotation.merged.bed\n",
+    "bedtools merge -i gencode.primary_assembly.annotation.sorted.bed -s -c 4 -o collapse > gencode.primary_assembly.annotation.merged.bed\n",
     "\n",
-    "bedtools getfasta -name -fo gencode.v31.unspliced.fa -fi GRCh38.primary_assembly.genome.fa -bed gencode.v31.primary_assembly.annotation.sorted.bed\n",
+    "bedtools getfasta -name -fo gencode.unspliced.fa -fi GRCh38.primary_assembly.genome.fa -bed gencode.primary_assembly.annotation.sorted.bed\n",
     "```\n",
     "\n",
     "### Create the index manifest file, and an index directory\n",
@@ -91,8 +86,8 @@
     "```\n",
     "{\n",
     "    \"species\": \"Homo sapiens\",\n",
-    "    \"index_file\": \"gencode.v31.fragments.idx\",\n",
-    "    \"gene_metadata_file\": \"gencode.v31.metadata.tab\",\n",
+    "    \"index_file\": \"gencode.fragments.idx\",\n",
+    "    \"gene_metadata_file\": \"gencode.metadata.tab\",\n",
     "    \"gene_metadata_key\": \"AccessionVersion\",\n",
     "    \"fragments_to_genes_file\": \"fragments2genes.txt\",\n",
     "    \"layers\": {\n",
@@ -102,7 +97,7 @@
     "}\n",
     "```\n",
     "\n",
-    "The `gene_metadata_key` indicated which column in the metadata file (`gencode.v31.metadata.tab`) contains the gene IDs (as used in `fragments2genes.txt`).\n",
+    "The `gene_metadata_key` indicated which column in the metadata file (`gencode.metadata.tab`) contains the gene IDs (as used in `fragments2genes.txt`).\n",
     "\n",
     "Finally, organize your files in a directory and subdirectory as follows:\n",
     "\n",
@@ -112,14 +107,13 @@
     "10xv3_whitelist.txt\n",
     "inputs/\n",
     "    GRCh38.primary_assembly.genome.fa\n",
-    "    GRCh38.primary_assembly.genome.fa.fai\n",
-    "    gencode.v31.unspliced.fa\n",
-    "    gencode.v31.primary_assembly.annotation.bed\n",
-    "    gencode.v31.primary_assembly.annotation.gtf\n",
-    "    gencode.v31.primary_assembly.annotation.sorted.bed\n",
-    "    gencode.v31.transcripts.fa\n",
+    "    gencode.unspliced.fa\n",
+    "    gencode.primary_assembly.annotation.bed\n",
+    "    gencode.primary_assembly.annotation.gtf\n",
+    "    gencode.primary_assembly.annotation.sorted.bed\n",
+    "    gencode.transcripts.fa\n",
     "    hgnc_complete_set.txt\n",
-    "    human_tfs_consensus.txt\n",
+    "    human_tfs_consensus.tab\n",
     "```\n",
     "\n",
     "The `inputs` directory will only be used to generate the index, and can be discarded afterwards.\n",
@@ -136,7 +130,14 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "d = \"/Users/stelin/kallisto_GRCh38/human_GRCh38_gencode.v31/\"\n",
+    "import os\n",
+    "import sys\n",
+    "from typing import *\n",
+    "from Bio import SeqIO\n",
+    "from Bio.SeqRecord import SeqRecord\n",
+    "from Bio.Seq import Seq\n",
+    "\n",
+    "d = \"/Users/stelin/kallisto_GRCh38/human_GRCh38_gencode/\"\n",
     "extent = 600  # how many bases away from polya to include\n",
     "min_len = 90  # how many non-repeat bases required to make a transcript"
    ]
@@ -172,7 +173,7 @@
     "\t\ttfs[items[0]] = items\n",
     "\n",
     "# Next load the gencode GTF and create a genome annotation tsv\n",
-    "with open(d + \"gencode.v31.metadata.tab\", \"w\") as fout:\n",
+    "with open(d + \"gencode.metadata.tab\", \"w\") as fout:\n",
     "\tfout.write(\"\\t\".join([\n",
     "\t\t\"Accession\",\n",
     "\t\t\"AccessionVersion\",\n",
@@ -181,6 +182,7 @@
     "\t\t\"GeneType\",\n",
     "\t\t\"HgncID\",\n",
     "\t\t\"Chromosome\",\n",
+    "\t\t\"Strand\",\n",
     "\t\t\"ChromosomeStart\",\n",
     "\t\t\"ChromosomeEnd\",\n",
     "\t\t\"LocusGroup\",\n",
@@ -203,7 +205,7 @@
     "\t\t\"DnaBindingDomain\"\n",
     "\t]))\n",
     "\tfout.write(\"\\n\")\n",
-    "\twith open(d + \"inputs/gencode.v31.primary_assembly.annotation.gtf\") as f:\n",
+    "\twith open(d + \"inputs/gencode.primary_assembly.annotation.gtf\") as f:\n",
     "\t\tfor line in f:\n",
     "\t\t\tif line.startswith(\"##\"):\n",
     "\t\t\t\tcontinue\n",
@@ -253,6 +255,7 @@
     "\t\t\t\t\textra[\"gene_type\"],  # gene type from gencode\n",
     "\t\t\t\t\t\"\",  # HGNC id\n",
     "\t\t\t\t\titems[0],  # Chromosome\n",
+    "\t\t\t\t\titems[6],  # Strand\n",
     "\t\t\t\t\titems[3],  # Start\n",
     "\t\t\t\t\titems[4],  # End\n",
     "\t\t\t\t\t\"\",  # Locus group\n",
@@ -324,7 +327,7 @@
    "source": [
     "polyAs = {}\n",
     "polyTs = {}\n",
-    "for fasta in SeqIO.parse(open(d + \"inputs/gencode.v31.unspliced.fa\"),'fasta'):\n",
+    "for fasta in SeqIO.parse(open(d + \"inputs/gencode.unspliced.fa\"),'fasta'):\n",
     "\tgene_id = fasta.id\n",
     "\tintervals = find_polys(fasta.seq, c=\"A\", n=14)\n",
     "\tif len(intervals) > 0:\n",
@@ -342,7 +345,7 @@
    "outputs": [],
    "source": [
     "tr2g = {}\n",
-    "with open(d + \"inputs/gencode.v31.primary_assembly.annotation.gtf\") as f:\n",
+    "with open(d + \"inputs/gencode.primary_assembly.annotation.gtf\") as f:\n",
     "\tfor line in f:\n",
     "\t\tif \"\\ttranscript\\t\" in line:\n",
     "\t\t\titems = line.split(\"; \")\n",
@@ -362,10 +365,10 @@
    "source": [
     "count = 0\n",
     "with open(d + \"fragments2genes.txt\", \"w\") as ftr2g:\n",
-    "\twith open(d + \"inputs/gencode.v31.fragments.fa\", \"w\") as fout:\n",
+    "\twith open(d + \"inputs/gencode.fragments.fa\", \"w\") as fout:\n",
     "\t\t# Write the nascent fragments, with one partial transcript per internal poly-A/T site\n",
     "\t\twith open(d + \"unspliced_fragments.txt\", \"w\") as fucapture:\n",
-    "\t\t\tfor fasta in SeqIO.parse(open(d + \"inputs/gencode.v31.unspliced.fa\"),'fasta'):  # Note we're in the masked file now\n",
+    "\t\t\tfor fasta in SeqIO.parse(open(d + \"inputs/gencode.unspliced.fa\"),'fasta'):  # Note we're in the masked file now\n",
     "\t\t\t\tgene_id = fasta.id\n",
     "\t\t\t\tif gene_id in polyAs:\n",
     "\t\t\t\t\tfor interval in polyAs[gene_id]:\n",
@@ -391,7 +394,7 @@
     "\t\t\t\t\t\t\tfucapture.write(f\"{transcript_id}\\n\")\n",
     "\t\t# Write the mature fragments, covering the 3' end of each mature transcript\n",
     "\t\twith open(d + \"spliced_fragments.txt\", \"w\") as fscapture:\n",
-    "\t\t\tfor fasta in SeqIO.parse(open(d + \"inputs/gencode.v31.transcripts.fa\"),'fasta'):  # Note we're in the masked file now\n",
+    "\t\t\tfor fasta in SeqIO.parse(open(d + \"inputs/gencode.transcripts.fa\"),'fasta'):  # Note we're in the masked file now\n",
     "\t\t\t\ttranscript_id = fasta.id.split(\"|\")[0]\n",
     "\t\t\t\tgene_id = fasta.id.split(\"|\")[1]\n",
     "\t\t\t\tattrs = tr2g[transcript_id]\n",
@@ -413,7 +416,7 @@
     "Run the following on the command line (it might take half an hour):\n",
     "\n",
     "```\n",
-    "kallisto index -i gencode.v31.fragments.idx -k 31 inputs/gencode.v31.fragments.fa\n",
+    "kallisto index -i gencode.fragments.idx -k 31 inputs/gencode.fragments.fa\n",
     "```\n",
     "\n",
     "You should now have the following directory structure:\n",
@@ -423,8 +426,8 @@
     "10xv2_whitelist.txt\n",
     "10xv3_whitelist.txt\n",
     "fragments2genes.txt\n",
-    "gencode.v31.fragments.idx\n",
-    "gencode.v31.metadata.tab\n",
+    "gencode.fragments.idx\n",
+    "gencode.metadata.tab\n",
     "manifest.json\n",
     "spliced_fragments.txt\n",
     "unspliced_fragments.txt\n",
@@ -435,7 +438,7 @@
     "\n",
     "### What just happened?\n",
     "\n",
-    "We created a composite kallisto index (`gencode.v31.fragments.idx`), with separate sequence fragments for spliced and unspliced transcripts. The IDs of spliced and unspliced fragments are listed in the `spliced_fragments.txt` and `unspliced_fragments.txt` files, so that we can later count them separately. For example, here's the first few lines of `unspliced_fragments.txt` (the last part of each ID, e.g. `A14056`, indicates that this fragment is located upstream of a poly-A stretch at position 14056):\n",
+    "We created a composite kallisto index (`gencode.fragments.idx`), with separate sequence fragments for spliced and unspliced transcripts. The IDs of spliced and unspliced fragments are listed in the `spliced_fragments.txt` and `unspliced_fragments.txt` files, so that we can later count them separately. For example, here's the first few lines of `unspliced_fragments.txt` (the last part of each ID, e.g. `A14056`, indicates that this fragment is located upstream of a poly-A stretch at position 14056):\n",
     "\n",
     "```\n",
     "ENSG00000277400.1.A14056\n",
@@ -465,7 +468,7 @@
     "ENSG00000277400.1.A53231\tENSG00000277400.1\n",
     "```\n",
     "\n",
-    "Finally, we created a consolidated metadata file `gencode.v31.metadata.tab` which collects a lot of useful annotation about each gene:\n",
+    "Finally, we created a consolidated metadata file `gencode.metadata.tab` which collects a lot of useful annotation about each gene:\n",
     "\n",
     "```\n",
     "Accession          # ENSEMBL accession\n",
@@ -475,6 +478,7 @@
     "GeneType           # Like 'protein_coding', 'pseudogene', 'snRNA', 'rRNA', etc.\n",
     "HgncID\n",
     "Chromosome         # Like 'chr1', 'chr2', 'chrM', etc.\n",
+    "Strand             # '+' or '-'\n",
     "ChromosomeStart    # Integer start position of the gene\n",
     "ChromosomeEnd      # Integer end position of the gene\n",
     "LocusGroup\n",
@@ -499,13 +503,6 @@
     "\n",
     "Of these, only `Accession` and `Gene` are required by Cytograph.\n"
    ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
   }
  ],
  "metadata": {


=====================================
tests/debug.py
=====================================
@@ -17,13 +17,5 @@ handler.setFormatter(formatter)
 # Set STDERR handler as the only handler 
 logger.handlers = [handler]
 
-d = "/Users/stelin/kallisto_GRCh38/"
-
-fastqs = [
-	d + "HLFGJBCXY/10X_17_029_S2_L002_R1_001.fastq.gz",
-	d + "HLFGJBCXY/10X_17_029_S2_L002_R2_001.fastq.gz"
-#	d + "HL73JBCXY/10X_17_029_S2_L002_R1_001.fastq.gz",
-#	d + "HL73JBCXY/10X_17_029_S2_L002_R2_001.fastq.gz"
-]
-#loompy.create_from_fastq(d + "10X_17_029.loom", "10X_17_029", fastqs, d + "human_GRCh38_gencode.v31", d + "samples.tab", n_threads=6)
-loompy.create_from_fastq(d + "10X_17_029.loom", "10X_17_029", fastqs, d + "human_GRCh38_gencode.v31", "/Users/stelin/sqlite3_chromium.db", n_threads=6)
+with loompy.connect("/Users/stelin/Downloads/output-2.loom") as ds:
+	print(ds.shape)
\ No newline at end of file



View it on GitLab: https://salsa.debian.org/med-team/python-loompy/-/commit/534c27137b0ce83b714f8f6cf69dae63649e8564
