[med-svn] [Git][med-team/pyranges][upstream] New upstream version 0.0.111+ds
Nilesh Patra (@nilesh)
gitlab at salsa.debian.org
Tue Nov 2 18:50:55 GMT 2021
Nilesh Patra pushed to branch upstream at Debian Med / pyranges
Commits:
9afe05d2 by Nilesh Patra at 2021-11-02T23:22:21+05:30
New upstream version 0.0.111+ds
- - - - -
25 changed files:
- − .travis.yml
- CHANGELOG.txt
- README.md
- pyranges/__init__.py
- pyranges/genomicfeatures.py
- pyranges/methods/attr.py
- pyranges/methods/concat.py
- pyranges/methods/coverage.py
- pyranges/methods/init.py
- pyranges/methods/intersection.py
- pyranges/methods/join.py
- pyranges/methods/k_nearest.py
- + pyranges/methods/max_disjoint.py
- pyranges/methods/subtraction.py
- pyranges/multioverlap.py
- pyranges/multithreaded.py
- pyranges/pyranges.py
- pyranges/readers.py
- pyranges/version.py
- setup.py
- tests/helpers.py
- tests/test_binary.py
- + tests/test_concat.py
- + tests/test_count_overlaps.py
- tests/test_do_not_error.py
Changes:
=====================================
.travis.yml deleted
=====================================
@@ -1,32 +0,0 @@
-# Stolen from http://conda.pydata.org/docs/travis.html
-language: python
-python:
- # We don't actually use the Travis Python, but this keeps it organized..
- - "3.6"
-install:
- - sudo apt-get update
- # We do this conditionally because it saves us some downloading if the
- # version is the same.
- - wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh;
- - bash miniconda.sh -b -p $HOME/miniconda
- - export PATH="$HOME/miniconda/bin:$PATH"
- - hash -r
- - conda config --set always_yes yes --set changeps1 no
- # - conda update -q conda
- - conda config --add channels bioconda
- - conda config --add channels r
- # Useful for debugging any issues with conda
- - conda info -a
- - conda create -q -n test-environment python=$TRAVIS_PYTHON_VERSION numpy scipy pandas pytest pytest-cov cython tabulate hypothesis bedtools pybigwig pysam
- - source activate test-environment
- - python --version
- - pip --version
- - pip install sorted_nearest ncls pyrle
- # to build docs pip install sphinxcontrib-napoleon sphinx-autoapi
- - python setup.py install
- - python -c 'import pandas as pd; print(pd.__version__)'
- - ls tests
-
-script: python -m pytest -m "not explore" --doctest-modules --cov=pyranges tests
-
-after_success: coveralls
=====================================
CHANGELOG.txt
=====================================
@@ -1,3 +1,75 @@
+# 0.0.111 (01.10.2021)
+- require minimum version of NCLS
+
+# 0.0.110 (20.09.21)
+- fix count_overlaps with keep_nonoverlapping=False
+- fix subtract with more than 1024 intervals (new fix)
+
+# 0.0.109 (16.09.21)
+- fix overlap invert behavior
+- add intersect invert flag
+- fix subtract in cases where more than 1024 intervals overlapped a single interval
+
+# 0.0.106/107/108(hotfixes) (07/8.09.21)
+- fix join with slack mutating first arg
+- add flag use_other_strand in join, nearest, k_nearest
+- fix categorical-bug in newer versions of pandas
+- add function pr.version_info() to print relevant version flags for debugging
+
+# 0.0.105 (23.08.21)
+- require bamread 0.0.10 to fix #211
+
+# 0.0.104 (06/20.08.21)
+- fix broken three_end/five_end code
+
+# 0.0.102/103 (06.08.21)
+- fix bug in pr.count_overlaps
+- demand version 0.0.9 or greater from bamread
+
+# 0.0.100/0.0.101 (20/21.06.21)
+- add full-flag to read_gtf
+- fix bug in join with slack > 0 when result is empty
+
+# 0.0.99 (17.06.21)
+- add nb_cpu arg to overlap
+
+# 0.0.98 (07.06.21)
+- fix k-nearest how=None
+
+# 0.0.98 (20.05.21)
+- fix casting in tss/tes
+
+# 0.0.96/97 (07.05.21)
+- fixes to .tes and .tss methods (issue #182)
+
+# 0.0.95 (02.03.21)
+- teensy fix bedclip
+- add pretty-printing in jupyter notebooks (thanks to @rasi)
+
+# 0.0.94 (27.02.21)
+- print warning if start and end columns have different dtypes
+
+# 0.0.93 (25.02.21)
+- add max_disjoint for maximal disjoint set
+
+# 0.0.91-92 (15.01.21)
+- hotfix for 0.0.90
+
+# 0.0.90 (03.01.21)
+- fix #165 slow set operations on small files with many chromosomes (thanks ndukler)
+
+# 0.0.89 (16.11.20)
+- fix #159 (thanks cfriedline)
+
+# 0.0.88 (09.11.20)
+- fix bug when concatting stranded and unstranded pyranges (thanks cfriedline, issue #160)
+
+# 0.0.87 (23.10.20)
+- fix bug in join with left/right option
+
+# 0.0.86 (05.10.20)
+- add slack-option to merge
+
# 0.0.85 (17.09.20)
- fix error when parsing gtf-files with whitespace in value-tags
=====================================
README.md
=====================================
@@ -1,6 +1,6 @@
# pyranges
-[![Coverage Status](https://img.shields.io/coveralls/github/biocore-ntnu/pyranges.svg)](https://coveralls.io/github/biocore-ntnu/pyranges?branch=master) [![Codacy Badge](https://api.codacy.com/project/badge/Grade/b61a53346d764a8d8f0ab2a6afd7b100)](https://www.codacy.com/app/endrebak/pyranges?utm_source=github.com&utm_medium=referral&utm_content=biocore-ntnu/pyranges&utm_campaign=Badge_Grade) [![Build Status](https://travis-ci.org/biocore-ntnu/pyranges.svg?branch=master)](https://travis-ci.org/biocore-ntnu/pyranges) [![hypothesis tested](graphs/hypothesis-tested-brightgreen.svg)](http://hypothesis.readthedocs.io/) [![PyPI version](https://badge.fury.io/py/pyranges.svg)](https://badge.fury.io/py/pyranges) [![MIT](https://img.shields.io/pypi/l/pyranges.svg?color=green)](https://opensource.org/licenses/MIT) ![PyPI - Python Version](https://img.shields.io/pypi/pyversions/pyranges.svg) [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/pyranges/README.html)
+[![Coverage Status](https://img.shields.io/coveralls/github/biocore-ntnu/pyranges.svg)](https://coveralls.io/github/biocore-ntnu/pyranges?branch=master) [![Codacy Badge](https://api.codacy.com/project/badge/Grade/b61a53346d764a8d8f0ab2a6afd7b100)](https://www.codacy.com/app/endrebak/pyranges?utm_source=github.com&utm_medium=referral&utm_content=biocore-ntnu/pyranges&utm_campaign=Badge_Grade) [![Build Status](https://travis-ci.com/biocore-ntnu/pyranges.svg?branch=master)](https://travis-ci.com/biocore-ntnu/pyranges) [![hypothesis tested](graphs/hypothesis-tested-brightgreen.svg)](http://hypothesis.readthedocs.io/) [![PyPI version](https://badge.fury.io/py/pyranges.svg)](https://badge.fury.io/py/pyranges) [![MIT](https://img.shields.io/pypi/l/pyranges.svg?color=green)](https://opensource.org/licenses/MIT) ![PyPI - Python Version](https://img.shields.io/pypi/pyversions/pyranges.svg) [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/pyranges/README.html)
## Introduction
=====================================
pyranges/__init__.py
=====================================
@@ -549,4 +549,33 @@ def to_bigwig(gr, path, chromosome_sizes):
bw.addEntries(chromosomes, starts, ends=ends, values=values)
-__all__ = ["from_string", "from_dict", "to_bigwig", "count_overlaps", "random", "itergrs", "read_gtf", "read_bam", "read_bed", "read_gff3", "concat", "PyRanges"]
+def version_info():
+
+ import importlib
+ def update_version_info(version_info, library):
+ if importlib.util.find_spec(library):
+ version = importlib.import_module(library).__version__
+ else:
+ version = "not installed"
+
+ version_info[library] = version
+
+ version_info = {"pyranges version": pr.__version__,
+ "pandas version": pd.__version__,
+ "numpy version": np.__version__,
+ "python version": sys.version_info}
+
+ update_version_info(version_info, "ncls")
+ update_version_info(version_info, "sorted_nearest")
+ update_version_info(version_info, "pyrle")
+ update_version_info(version_info, "ray")
+ update_version_info(version_info, "bamread")
+ # update_version_info(version_info, "bwread") no version string yet!
+ update_version_info(version_info, "pyranges_db")
+ update_version_info(version_info, "pybigwig")
+ update_version_info(version_info, "hypothesis")
+
+ print(version_info)
+
+
+__all__ = ["from_string", "from_dict", "to_bigwig", "count_overlaps", "random", "itergrs", "read_gtf", "read_bam", "read_bed", "read_gff3", "concat", "PyRanges", "version_info"]
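Editorial note on the hunk above: the new `version_info()` helper looks up optional dependencies by name and records either their `__version__` or the string "not installed". The pattern can be sketched with the standard library alone. The function name `collect_versions` and the "unknown" fallback are assumptions made for illustration and are not part of pyranges; the sketch also imports `importlib.util` explicitly, so it does not rely on that submodule already having been loaded.

```python
import importlib
import importlib.util


def collect_versions(libraries):
    """Map each top-level library name to a version string or a placeholder."""
    versions = {}
    for name in libraries:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is None:
            versions[name] = "not installed"
            continue
        module = importlib.import_module(name)
        # Not every module defines __version__ (hypothetical fallback value).
        versions[name] = str(getattr(module, "__version__", "unknown"))
    return versions


report = collect_versions(["os", "surely_not_a_real_module_xyz"])
print(report)
```

In the packaged code the equivalent call is `pr.version_info()`, which prints the collected dictionary rather than returning it.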
=====================================
pyranges/genomicfeatures.py
=====================================
@@ -263,7 +263,7 @@ def _outside_bounds(df, **kwargs):
size_df = chromsizes.df
chromsizes = {k: v for k, v in zip(size_df.Chromosome, size_df.End)}
- size = chromsizes[df.Chromosome.iloc[0]]
+ size = int(chromsizes[df.Chromosome.iloc[0]])
clip = kwargs.get("clip", False)
ends_outside = df.End > size
@@ -473,12 +473,12 @@ def _tss(df, slack=0):
tss_neg = df.loc[df.Strand == "-"].copy()
# pd.options.mode.chained_assignment = None
- tss_neg.loc[:, "Start"] = tss_neg.End
+ tss_neg.loc[:, "Start"] = tss_neg.End - 1
# pd.options.mode.chained_assignment = "warn"
tss = pd.concat([tss_pos, tss_neg], sort=False)
- tss["End"] = tss.Start
- tss.End = tss.End + 1 + slack
+ tss["End"] = tss.Start + 1
+ tss.End = tss.End + slack
tss.Start = tss.Start - slack
tss.loc[tss.Start < 0, "Start"] = 0
@@ -499,12 +499,12 @@ def _tes(df, slack=0):
tes_neg = df.loc[df.Strand == "-"].copy()
# pd.options.mode.chained_assignment = None
- tes_neg.loc[:, "Start"] = tes_neg.End
+ tes_neg.loc[:, "End"] = tes_neg.Start + 1
# pd.options.mode.chained_assignment = "warn"
tes = pd.concat([tes_pos, tes_neg], sort=False)
- tes["Start"] = tes.End
- tes.End = tes.End + 1 + slack
+ tes["Start"] = tes.End - 1
+ tes.End = tes.End + slack
tes.Start = tes.Start - slack
tes.loc[tes.Start < 0, "Start"] = 0
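Editorial note on the two hunks above: after this change, `_tss` and `_tes` both produce a one-base, half-open interval around the relevant end of the feature, widened by `slack` on each side and clipped at zero. A minimal pure-Python sketch of that arithmetic follows; the names `tss_interval` and `tes_interval` are hypothetical, and the sketch works on single (start, end, strand) triples rather than DataFrames.

```python
def tss_interval(start, end, strand, slack=0):
    """Half-open interval covering the transcription start site."""
    # On "+" the first base is at `start`; on "-" it is the last base, `end - 1`.
    site = start if strand == "+" else end - 1
    return max(site - slack, 0), site + 1 + slack


def tes_interval(start, end, strand, slack=0):
    """Half-open interval covering the transcription end site."""
    # Mirror image of the TSS: last base on "+", first base on "-".
    site = end - 1 if strand == "+" else start
    return max(site - slack, 0), site + 1 + slack


print(tss_interval(10, 20, "+"), tss_interval(10, 20, "-"))
print(tes_interval(10, 20, "+"), tes_interval(10, 20, "-"))
print(tss_interval(2, 20, "+", slack=5))
```

For a feature spanning 10 to 20, this gives a TSS of (10, 11) on the plus strand and (19, 20) on the minus strand, with the TES being the opposite in each case.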
=====================================
pyranges/methods/attr.py
=====================================
@@ -55,7 +55,8 @@ def _setattr(self, column_name, column, pos=False):
self.__dict__["dfs"] = dfs
else:
int64 = True if self.dtypes["Start"] == np.int64 else False
- self.__dict__["dfs"] = pr.PyRanges(pr.PyRanges(dfs).df, int64=int64).dfs # will merge the dfs, then split on keys again to ensure they are correct
+ # will merge the dfs, then split on keys again to ensure they are correct
+ self.__dict__["dfs"] = pr.PyRanges(pr.PyRanges(dfs).df, int64=int64).dfs
def _getattr(self, name):
=====================================
pyranges/methods/concat.py
=====================================
@@ -9,14 +9,13 @@ def concat(pyranges, strand=None):
if not pyranges:
return None
- # from pydbg import dbg
pyranges = [pr for pr in pyranges if not pr.empty]
- # dbg(pyranges)
- # dbg([p.df.dtypes for p in pyranges])
grs_per_chromosome = defaultdict(list)
+ strand_info = [gr.stranded for gr in pyranges]
+
if strand is None:
- strand = all([gr.stranded for gr in pyranges])
+ strand = all(strand_info)
if strand:
assert all([
@@ -30,22 +29,25 @@ def concat(pyranges, strand=None):
else:
for gr in pyranges:
for chromosome in gr.chromosomes:
- # dbg(gr)
- # dbg(gr[chromosome])
df = gr[chromosome].df
- # dbg(df.dtypes)
grs_per_chromosome[chromosome].append(df)
new_pyrange = {}
for k, v in grs_per_chromosome.items():
- # dbg([_v.dtypes for _v in v])
new_pyrange[k] = pd.concat(v, sort=False)
- # dbg(new_pyrange[k].dtypes)
res = pr.multithreaded.process_results(new_pyrange.values(),
new_pyrange.keys())
- # dbg([r.dtypes for r in res.values()])
+ if any(strand_info) and not all(strand_info):
+ new_res = {}
+ for k, v in res.items():
+ v.loc[:, "Strand"] = v.Strand.cat.add_categories(["."])
+ new_res[k] = v.assign(Strand=v.Strand.fillna("."))
+ res = pr.PyRanges(new_res)
+ res.Strand = res.Strand
+ else:
+ res = pr.PyRanges(res)
- return pr.PyRanges(res)
+ return res
=====================================
pyranges/methods/coverage.py
=====================================
@@ -33,18 +33,18 @@ def _number_overlapping(scdf, ocdf, **kwargs):
df = scdf.copy()
- if keep_nonoverlapping:
- _missing_indexes = np.setdiff1d(scdf.index, _self_indexes)
- missing = pd.DataFrame(data={"Index": _missing_indexes, "Count": 0}, index=_missing_indexes)
- counts_per_read = pd.concat([counts_per_read, missing])
- else:
- df = df.loc[_self_indexes]
+ _missing_indexes = np.setdiff1d(scdf.index, _self_indexes)
+ missing = pd.DataFrame(data={"Index": _missing_indexes, "Count": 0}, index=_missing_indexes)
+ counts_per_read = pd.concat([counts_per_read, missing])
- counts_per_read = counts_per_read.set_index("Index")
+ counts_per_read = counts_per_read.set_index("Index").sort_index()
df.insert(df.shape[1], column_name, counts_per_read)
- return df
+ if keep_nonoverlapping:
+ return df
+ else:
+ return df[df[column_name] != 0]
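Editorial note on the hunk above: the fix computes a count for every row, including zero counts, and only afterwards drops the zero rows when `keep_nonoverlapping` is false. The control flow can be sketched in plain Python as below. The function name `count_overlaps_sketch` and the quadratic overlap scan are assumptions for illustration only; the real code uses index-based lookups on DataFrames.

```python
def count_overlaps_sketch(queries, targets, keep_nonoverlapping=True):
    """Pair each half-open query interval with its number of overlapping targets."""
    rows = []
    for q_start, q_end in queries:
        # Two half-open intervals overlap when each starts before the other ends.
        count = sum(1 for t_start, t_end in targets
                    if q_start < t_end and t_start < q_end)
        rows.append(((q_start, q_end), count))
    if keep_nonoverlapping:
        return rows
    # Filter last, so the counts stay aligned with their rows.
    return [row for row in rows if row[1] != 0]


queries = [(0, 5), (10, 15), (20, 25)]
targets = [(3, 12), (4, 6)]
print(count_overlaps_sketch(queries, targets))
print(count_overlaps_sketch(queries, targets, keep_nonoverlapping=False))
```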
=====================================
pyranges/methods/init.py
=====================================
@@ -19,37 +19,29 @@ def set_dtypes(df, int64):
"Start": np.int32,
"End": np.int32,
"Chromosome": "category",
- "Strand": "category"
+ "Strand": "category",
}
else:
dtypes = {
"Start": np.int64,
"End": np.int64,
"Chromosome": "category",
- "Strand": "category"
+ "Strand": "category",
}
- if not "Strand" in df:
+ if "Strand" not in df:
del dtypes["Strand"]
# need to ascertain that object columns do not consist of multiple types
# https://github.com/biocore-ntnu/epic2/issues/32
for column in "Chromosome Strand".split():
- if not column in df:
+ if column not in df:
continue
-
- if df[column].dtype == object and len(
- df[column].apply(type).drop_duplicates()) > 1:
- df[column] = df[column].astype(str)
- elif df[column].dtype != object:
- df[column] = df[column].astype(str)
+ df[column] = df[column].astype(str)
for col, dtype in dtypes.items():
-
if df[col].dtype.name != dtype:
-
df[col] = df[col].astype(dtype)
-
return df
@@ -80,18 +72,20 @@ def create_pyranges_df(chromosomes, starts, ends, strands=None):
columns = [chromosomes, starts, ends, strands]
lengths = list(str(len(s)) for s in columns)
- assert len(
- set(lengths)
- ) == 1, "chromosomes, starts, ends and strands must be of equal length. But are {}".format(
- ", ".join(lengths))
+ assert (
+ len(set(lengths)) == 1
+ ), "chromosomes, starts, ends and strands must be of equal length. But are {}".format(
+ ", ".join(lengths)
+ )
colnames = "Chromosome Start End Strand".split()
else:
columns = [chromosomes, starts, ends]
lengths = list(str(len(s)) for s in columns)
- assert len(
- set(lengths)
- ) == 1, "chromosomes, starts and ends must be of equal length. But are {}".format(
- ", ".join(lengths))
+ assert (
+ len(set(lengths)) == 1
+ ), "chromosomes, starts and ends must be of equal length. But are {}".format(
+ ", ".join(lengths)
+ )
colnames = "Chromosome Start End".split()
idx = range(len(starts))
@@ -119,7 +113,8 @@ def check_strandedness(df):
contains_more_than_plus_minus_in_strand_col = False
if str(df.Strand.dtype) == "category" and (
- set(df.Strand.cat.categories) - set("+-")):
+ set(df.Strand.cat.categories) - set("+-")
+ ):
contains_more_than_plus_minus_in_strand_col = True
elif not ((df.Strand == "+") | (df.Strand == "-")).all():
contains_more_than_plus_minus_in_strand_col = True
@@ -130,14 +125,16 @@ def check_strandedness(df):
return not contains_more_than_plus_minus_in_strand_col
-def _init(self,
- df=None,
- chromosomes=None,
- starts=None,
- ends=None,
- strands=None,
- int64=False,
- copy_df=True):
+def _init(
+ self,
+ df=None,
+ chromosomes=None,
+ starts=None,
+ ends=None,
+ strands=None,
+ int64=False,
+ copy_df=True,
+):
# TODO: add categorize argument with dict of args to categorize?
if isinstance(df, PyRanges):
@@ -183,7 +180,6 @@ def _init(self,
else:
_has_strand = False
-
if not all([_single_value_key, _key_same, _strand_valid]):
df = pd.concat(empty_removed.values()).reset_index(drop=True)
=====================================
pyranges/methods/intersection.py
=====================================
@@ -61,7 +61,6 @@ def _intersection(scdf, ocdf, **kwargs):
def _overlap(scdf, ocdf, **kwargs):
- invert = kwargs["invert"]
return_indexes = kwargs.get("return_indexes", False)
if scdf.empty or ocdf.empty:
@@ -83,9 +82,6 @@ def _overlap(scdf, ocdf, **kwargs):
else:
_indexes = it.has_overlaps(starts, ends, indexes)
- if invert:
- _indexes = scdf.index.difference(_indexes)
-
if return_indexes:
return _indexes
@@ -98,16 +94,14 @@ def _count_overlaps(scdf, ocdf, **kwargs):
idx = _overlap(scdf, ocdf, **kwargs)
sx = pd.DataFrame(np.zeros(len(scdf)), index=scdf.index)
- if idx is None:
- return sx
- vc = pd.Series(idx).value_counts(sort=False)
+ vc = pd.Series(idx, dtype=np.int64).value_counts(sort=False)
- sx.iloc[vc.index, 0] = vc.values
+ sx.loc[vc.index, 0] = vc.values
- sx.columns = ["__0__"]
+ scdf.insert(scdf.shape[1], kwargs["name"], sx)
- return sx
+ return scdf
# def _first_df(scdf, ocdf, kwargs):
=====================================
pyranges/methods/join.py
=====================================
@@ -63,7 +63,9 @@ def null_types(h):
elif d == "str" or d == "object":
null = "-1"
elif d == "category":
- h2.loc[:, n] = h2[:, n].cat.add_categories("-1")
+ tmp_cat = h2[n].copy()
+ tmp_cat = tmp_cat.cat.add_categories("-1")
+ h2[n] = tmp_cat
null = "-1"
h2.loc[:, n] = null
=====================================
pyranges/methods/k_nearest.py
=====================================
@@ -95,7 +95,7 @@ def nearest(d1, d2, **kwargs):
d2 = d2.reindex(xdf.RX)
d1.index = range(len(d1))
d2.index = range(len(d1))
- d2 = d2.drop("Chromosome", 1)
+ d2 = d2.drop("Chromosome", axis=1)
df = d1.join(d2, rsuffix=suffix)
df.insert(df.shape[1], "Distance", xdf.D.values)
@@ -119,7 +119,7 @@ def nearest_previous(d1, d2, **kwargs):
d1.index = range(len(d1))
d2.index = range(len(d1))
- d2 = d2.drop("Chromosome", 1)
+ d2 = d2.drop("Chromosome", axis=1)
df = d1.join(d2, rsuffix=suffix)
df.insert(df.shape[1], "Distance", dist)
@@ -141,7 +141,7 @@ def nearest_next(d1, d2, **kwargs):
d1 = d1.reindex(lidx)
d2 = d2.reindex(ridx)
- d2 = d2.drop("Chromosome", 1)
+ d2 = d2.drop("Chromosome", axis=1)
d1.index = range(len(d1))
d2.index = range(len(d1))
df = d1.join(d2, rsuffix=suffix)
=====================================
pyranges/methods/max_disjoint.py
=====================================
@@ -0,0 +1,17 @@
+#!/usr/bin/env python
+
+from sorted_nearest import max_disjoint
+
+
+def _max_disjoint(df, **kwargs):
+
+ if df.empty:
+ return None
+
+ slack = kwargs.get("slack", 0)
+
+ cdf = df.sort_values("End")
+
+ idx = max_disjoint(cdf.index.values, cdf.Start.values, cdf.End.values, slack)
+
+ return cdf.reindex(idx)
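Editorial note on the new file above: `_max_disjoint` sorts by `End` and delegates the selection to `sorted_nearest.max_disjoint`, whose implementation is not shown in this diff. The sketch below is the textbook greedy algorithm for a maximal set of non-overlapping intervals, offered as an assumption about the general technique rather than a description of that library. The name `greedy_max_disjoint` is hypothetical, and the interpretation of `slack` as a minimum gap between kept intervals is a guess.

```python
def greedy_max_disjoint(intervals, slack=0):
    """Pick a largest set of non-overlapping half-open (start, end) intervals."""
    chosen = []
    last_end = None
    # Visiting intervals by increasing end is the standard greedy order.
    for start, end in sorted(intervals, key=lambda iv: iv[1]):
        # Assumed slack semantics: require at least `slack` bases of gap.
        if last_end is None or start >= last_end + slack:
            chosen.append((start, end))
            last_end = end
    return chosen


example = [(1, 5), (3, 7), (6, 10), (5, 6)]
print(greedy_max_disjoint(example))
print(greedy_max_disjoint(example, slack=1))
```

Sorting by end position first is what makes the greedy choice safe: the interval that finishes earliest leaves the most room for the remaining ones.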
=====================================
pyranges/methods/subtraction.py
=====================================
@@ -1,7 +1,49 @@
import pandas as pd
+import numpy as np
+
from ncls import NCLS
+def add_rows_per_group(df):
+ last_rows = df.groupby("__ix__").last().reset_index()
+ last_rows.loc[:, "__last__"] = True
+ df = pd.concat([df, last_rows], ignore_index=True)
+ df = df.sort_values("__ix__", ascending=True)
+ return df
+
+# def _subtraction(scdf, **kwargs):
+# if scdf.empty:
+# return scdf
+
+# falses = np.zeros(len(scdf), dtype=bool)
+# scdf.insert(scdf.shape[1], "__first__", falses)
+# scdf.insert(scdf.shape[1], "__last__", falses)
+
+# scdf = add_rows_per_group(scdf)
+
+# scdf.insert(scdf.shape[1], "NewStart", scdf.End__deleteme__.shift(fill_value=-1))
+# scdf.insert(scdf.shape[1], "NewEnd", scdf.Start__deleteme__)
+# scdf.insert(scdf.shape[1], "__ix2__", np.arange(len(scdf)))
+
+# first_rows = scdf.groupby(scdf.__ix__, as_index=False).first()
+# scdf.loc[scdf.__ix2__.isin(first_rows.__ix2__), "__first__"] = True
+
+# scdf.loc[:, "NewStart"] = np.where(scdf.__first__, scdf.Start, scdf.NewStart)
+
+# scdf.loc[scdf.__first__ & ~(scdf.Start__deleteme__ >= scdf.Start), "NewStart"] = -1
+# scdf.loc[:, "NewEnd"] = np.where(scdf.__last__, scdf.End, scdf.NewEnd)
+# scdf.loc[:, "NewStart"] = np.where(scdf.__last__, scdf.End__deleteme__, scdf.NewStart)
+# scdf.loc[scdf.__last__ & ~(scdf.End__deleteme__ <= scdf.End), ["NewEnd", "NewStart"]] = -1
+
+
+# scdf = scdf[~((scdf.NewStart == -1) | (scdf.NewEnd == -1))]
+# scdf = scdf.drop(["Start", "End"], axis=1)
+# scdf.rename(columns={"NewStart": "Start", "NewEnd": "End"}, inplace=True)
+
+# remove_mask = scdf.Start >= scdf.End
+
+# return scdf[~remove_mask]
+
def _subtraction(scdf, ocdf, **kwargs):
if ocdf.empty or scdf.empty:
@@ -20,7 +62,7 @@ def _subtraction(scdf, ocdf, **kwargs):
o = NCLS(ocdf.Start.values, ocdf.End.values, ocdf.index.values)
idx_self, new_starts, new_ends = o.set_difference_helper(
- scdf.Start.values, scdf.End.values, scdf.index.values)
+ scdf.Start.values, scdf.End.values, scdf.index.values, scdf.__num__.values)
missing_idx = pd.Index(scdf.index).difference(idx_self)
=====================================
pyranges/multioverlap.py
=====================================
@@ -1,5 +1,4 @@
import pyranges as pr
-import pandas as pd
import numpy as np
@@ -20,7 +19,7 @@ def count_overlaps(grs, features=None, strandedness=None, how=None, nb_cpu=1):
strandedness : {None, "same", "opposite", False}, default None, i.e. auto
Whether to compare PyRanges on the same strand, the opposite or ignore strand
- information. The default, None, means use "same" if both PyRanges are strande,
+ information. The default, None, means use "same" if both PyRanges are stranded,
otherwise ignore the strand information.
how : {None, "all", "containment"}, default None, i.e. all
@@ -140,6 +139,8 @@ def count_overlaps(grs, features=None, strandedness=None, how=None, nb_cpu=1):
if features is None:
features = pr.concat(grs.values()).split(between=True)
+ else:
+ features = features.copy()
from pyranges.methods.intersection import _count_overlaps
@@ -147,13 +148,9 @@ def count_overlaps(grs, features=None, strandedness=None, how=None, nb_cpu=1):
gr = gr.drop()
+ kwargs["name"] = name
res = features.apply_pair(gr, _count_overlaps, **kwargs)
- setattr(features, name, res)
-
- setattr(features, name, getattr(features, name).fillna(0))
-
-
def to_int(df):
df.loc[:, names] = df[names].astype(np.int32)
return df
@@ -161,10 +158,3 @@ def count_overlaps(grs, features=None, strandedness=None, how=None, nb_cpu=1):
features = features.apply(to_int)
return features
-
-# if __name__
-
-
-# if __name__ == "__main__":
-
-# print(a)
=====================================
pyranges/multithreaded.py
=====================================
@@ -8,7 +8,6 @@ from natsort import natsorted
import os
-from collections import defaultdict
def get_n_args(f):
@@ -105,7 +104,6 @@ def process_results(results, keys):
for k in to_delete:
del results_dict[k]
-
return results_dict
@@ -183,7 +181,6 @@ def get_multithreaded_funcs(function, nb_cpu):
return function, get, _merge_dfs
-
def pyrange_apply(function, self, other, **kwargs):
nparams = get_n_args(function)
@@ -218,6 +215,11 @@ def pyrange_apply(function, self, other, **kwargs):
items = natsorted(self.dfs.items())
keys = natsorted(self.dfs.keys())
+ dummy = pd.DataFrame(columns="Chromosome Start End".split())
+
+ other_chromosomes = other.chromosomes
+ other_dfs = other.dfs
+
if strandedness:
for (c, s), df in items:
@@ -225,7 +227,7 @@ def pyrange_apply(function, self, other, **kwargs):
os = strand_dict[s]
if not (c, os) in other.keys() or len(other[c, os].values()) == 0:
- odf = pd.DataFrame(columns="Chromosome Start End".split())
+ odf = dummy
else:
odf = other[c, os].values()[0]
@@ -240,10 +242,10 @@ def pyrange_apply(function, self, other, **kwargs):
for (c, s), df in items:
- if not c in other.chromosomes:
- odf = pd.DataFrame(columns="Chromosome Start End".split())
+ if not c in other_chromosomes:
+ odf = dummy
else:
- odf = other.dfs[c]
+ odf = other_dfs[c]
df, odf = make_binary_sparse(kwargs, df, odf)
result = call_f(function, nparams, df, odf, kwargs)
@@ -253,11 +255,11 @@ def pyrange_apply(function, self, other, **kwargs):
for c, df in items:
- if not c in other.chromosomes:
- odf = pd.DataFrame(columns="Chromosome Start End".split())
+ if not c in other_chromosomes:
+ odf = dummy
else:
- odf1 = other[c, "+"].df
- odf2 = other[c, "-"].df
+ odf1 = other_dfs.get((c, "+"), dummy)
+ odf2 = other_dfs.get((c, "-"), dummy)
odf = _merge_dfs.remote(odf1, odf2)
@@ -270,21 +272,20 @@ def pyrange_apply(function, self, other, **kwargs):
for (c, s), df in self.items():
- if not c in other.chromosomes:
- odfs = pr.PyRanges(
- pd.DataFrame(columns="Chromosome Start End".split()))
+ if not c in other_chromosomes:
+ odfs = pr.PyRanges(dummy)
else:
- odfs = other[c].values()
+ odfp = other_dfs.get((c, "+"), dummy)
+ odfm = other_dfs.get((c, "-"), dummy)
- # from pydbg import dbg
- # dbg(odfs)
+ odfs = [odfp, odfm]
if len(odfs) == 2:
odf = _merge_dfs.remote(*odfs)
elif len(odfs) == 1:
odf = odfs[0]
else:
- odf = pd.DataFrame(columns="Chromosome Start End".split())
+ odf = dummy
df, odf = make_binary_sparse(kwargs, df, odf)
@@ -294,10 +295,10 @@ def pyrange_apply(function, self, other, **kwargs):
else:
for c, df in items:
- if not c in other.chromosomes:
- odf = pd.DataFrame(columns="Chromosome Start End".split())
+ if not c in other_chromosomes:
+ odf = dummy
else:
- odf = other.dfs[c]
+ odf = other_dfs[c]
df, odf = make_binary_sparse(kwargs, df, odf)
@@ -388,7 +389,6 @@ def pyrange_apply_single(function, self, **kwargs):
results = process_results(results, keys)
-
return results
@@ -402,40 +402,34 @@ def _lengths(df):
def _tss(df, **kwargs):
df = df.copy(deep=True)
-
+ dtype = df.dtypes["Start"]
slack = kwargs.get("slack", 0)
- tss_pos = df.loc[df.Strand == "+"]
-
- tss_neg = df.loc[df.Strand == "-"]
-
- # pd.options.mode.chained_assignment = None
- tss_neg.loc[:, "Start"] = tss_neg.End
+ starts = np.where(df.Strand == "+", df.Start, df.End - 1)
+ ends = starts + slack + 1
+ starts = starts - slack
+ starts = np.where(starts < 0, 0, starts)
- # pd.options.mode.chained_assignment = "warn"
- tss = pd.concat([tss_pos, tss_neg], sort=False)
- tss["End"] = tss.Start
- tss.End = tss.End + 1 + slack
- tss.Start = tss.Start - slack
- tss.loc[tss.Start < 0, "Start"] = 0
-
- return tss.reindex(df.index)
+ df.loc[:, "Start"] = starts.astype(dtype)
+ df.loc[:, "End"] = ends.astype(dtype)
+ return df
def _tes(df, **kwargs):
- df = df.copy()
- if df.Strand.iloc[0] == "+":
- df.loc[:, "Start"] = df.End
- else:
- df.loc[:, "End"] = df.Start
+ df = df.copy(deep=True)
+ dtype = df.dtypes["Start"]
+ slack = kwargs.get("slack", 0)
- df.loc[:, "Start"] = df.End
- df.loc[:, "End"] = df.End + 1
- df.loc[:, "Start"] = df.Start
- df.loc[df.Start < 0, "Start"] = 0
+ starts = np.where(df.Strand == "+", df.End - 1, df.Start)
+ ends = starts + 1 + slack
+ starts = starts - slack
+ starts = np.where(starts < 0, 0, starts)
- return df.reindex(df.index)
+ df.loc[:, "Start"] = starts.astype(dtype)
+ df.loc[:, "End"] = ends.astype(dtype)
+
+ return df
def _slack(df, **kwargs):
=====================================
pyranges/pyranges.py
=====================================
@@ -239,7 +239,6 @@ class PyRanges():
# self.apply()
-
def __getattr__(self, name):
"""Return column.
@@ -313,6 +312,11 @@ class PyRanges():
else:
_setattr(self, column_name, column)
+ if column_name in ["Start", "End"]:
+ if self.dtypes["Start"] != self.dtypes["End"]:
+ print("Warning! Start and End columns now have different dtypes: {} and {}".format(
+ self.dtypes["Start"], self.dtypes["End"]))
+
def __getitem__(self, val):
"""Fetch columns or subset on position.
@@ -487,7 +491,11 @@ class PyRanges():
return str(self)
+ def _repr_html_(self):
+ """Return REPL HTML representation for Jupyter Noteboooks."""
+
+ return self.df._repr_html_()
def apply(self, f, strand=None, as_pyranges=True, nb_cpu=1, **kwargs):
@@ -916,7 +924,7 @@ class PyRanges():
type(first_result))
# do a deepcopy of object
- new_self = pr.PyRanges({k: v.copy() for k, v in self.items()})
+ new_self = self.copy()
new_self.__setattr__(col, result)
return new_self
@@ -1846,7 +1854,7 @@ class PyRanges():
return self
- def intersect(self, other, strandedness=None, how=None, nb_cpu=1):
+ def intersect(self, other, strandedness=None, how=None, invert=False, nb_cpu=1):
"""Return overlapping subintervals.
@@ -1869,6 +1877,10 @@ class PyRanges():
What intervals to report. By default reports all overlapping intervals. "containment"
reports intervals where the overlapping is contained within it.
+ invert : bool, default False
+
+ Whether to return the intervals without overlaps.
+
nb_cpu: int, default 1
How many cpus to use. Can at most use 1 per chromosome or chromosome/strand tuple.
@@ -1954,9 +1966,21 @@ class PyRanges():
kwargs = fill_kwargs(kwargs)
kwargs["sparse"] = {"self": False, "other": True}
+ if len(self) == 0:
+ return self
+
+ if invert:
+ self.__ix__ = np.arange(len(self))
+
dfs = pyrange_apply(_intersection, self, other, **kwargs)
+ result = pr.PyRanges(dfs)
- return PyRanges(dfs)
+ if invert:
+ found_idxs = getattr(result, "__ix__", [])
+ result = self[~self.__ix__.isin(found_idxs)]
+ result = result.drop("__ix__")
+
+ return result
def items(self):
@@ -1988,7 +2012,7 @@ class PyRanges():
return natsorted([(k, df) for (k, df) in self.dfs.items()])
- def join(self, other, strandedness=None, how=None, report_overlap=False, slack=0, suffix="_b", nb_cpu=1):
+ def join(self, other, strandedness=None, how=None, report_overlap=False, slack=0, suffix="_b", nb_cpu=1, apply_strand_suffix=None):
"""Join PyRanges on genomic location.
@@ -2011,7 +2035,7 @@ class PyRanges():
report_overlap : bool, default False
- Report amount of overlap in base pairs.
+ Report amount of overlap in base pairs.
slack : int, default 0
@@ -2021,6 +2045,11 @@ class PyRanges():
Suffix to give overlapping columns in other.
+ apply_strand_suffix : bool, default None
+
+ If first pyranges is unstranded, but the second is not, the first will be given a strand column.
+ apply_strand_suffix makes the added strand column a regular data column instead by adding a suffix.
+
nb_cpu: int, default 1
How many cpus to use. Can at most use 1 per chromosome or chromosome/strand tuple.
@@ -2113,9 +2142,9 @@ class PyRanges():
from pyranges.methods.join import _write_both
- kwargs = {"strandedness": strandedness, "how": how, "report_overlap":report_overlap, "suffix": suffix, "nb_cpu": nb_cpu}
- # slack = kwargs.get("slack")
+ kwargs = {"strandedness": strandedness, "how": how, "report_overlap":report_overlap, "suffix": suffix, "nb_cpu": nb_cpu, "apply_strand_suffix": apply_strand_suffix}
if slack:
+ self = self.copy()
self.Start__slack = self.Start
self.End__slack = self.End
@@ -2127,13 +2156,6 @@ class PyRanges():
kwargs = fill_kwargs(kwargs)
- # if "new_pos" in kwargs:
- # if kwargs["new_pos"] in "intersection union".split():
- # suffixes = kwargs.get("suffixes")
- # assert suffixes is not None, "Must give two non-empty suffixes when using new_pos with intersection or union."
- # assert suffixes[0], "Must have nonempty first suffix when using new_pos with intersection or union."
- # assert suffixes[1], "Must have nonempty second suffix when using new_pos with intersection or union."
-
how = kwargs.get("how")
if how in ["left", "outer"]:
@@ -2144,14 +2166,17 @@ class PyRanges():
dfs = pyrange_apply(_write_both, self, other, **kwargs)
gr = PyRanges(dfs)
- if slack:
+ if slack and len(gr) > 0:
gr.Start = gr.Start__slack
gr.End = gr.End__slack
gr = gr.drop(like="(Start|End).*__slack")
- # new_position = kwargs.get("new_pos")
- # if new_position:
- # gr = gr.new_position(new_pos=new_position, suffixes=kwargs["suffixes"])
+ if not self.stranded and other.stranded:
+ if apply_strand_suffix is None:
+ import sys
+ print("join: Strand data from other will be added as strand data to self.\nIf this is undesired use the flag apply_strand_suffix=False.\nTo turn off the warning set apply_strand_suffix to True or False.", file=sys.stderr)
+ elif apply_strand_suffix:
+ gr.columns = gr.columns.str.replace("Strand", "Strand" + kwargs["suffix"])
return gr
@@ -2183,7 +2208,7 @@ class PyRanges():
return natsorted(self.dfs.keys())
- def k_nearest(self, other, k=1, ties=None, strandedness=None, overlap=True, how=None, suffix="_b", nb_cpu=1):
+ def k_nearest(self, other, k=1, ties=None, strandedness=None, overlap=True, how=None, suffix="_b", nb_cpu=1, apply_strand_suffix=None):
"""Find k nearest intervals.
@@ -2199,7 +2224,7 @@ class PyRanges():
ties : {None, "first", "last", "different"}, default None
- How to resolve ties, i.e. closest intervals with equal distance. None means that ...
+ How to resolve ties, i.e. closest intervals with equal distance. None means that the k nearest intervals are kept.
"first" means that the first tie is kept, "last" meanst that the last is kept.
"different" means that all nearest intervals with the k unique nearest distances are kept.
@@ -2222,6 +2247,12 @@ class PyRanges():
Suffix to give columns with shared name in other.
+ apply_strand_suffix : bool, default None
+
+ If first pyranges is unstranded, but the second is not, the first will be given a strand column.
+ apply_strand_suffix makes the added strand column a regular data column instead by adding a suffix.
+
+
nb_cpu: int, default 1
How many cpus to use. Can at most use 1 per chromosome or chromosome/strand tuple.
@@ -2404,86 +2435,49 @@ class PyRanges():
overlap = kwargs.get("overlap", True)
ties = kwargs.get("ties", False)
- self = pr.PyRanges({k: v.copy() for k, v in self.dfs.items()})
+ self = self.copy()
try: # if k is a Series
k = k.values
except:
pass
+ # how many to nearest to find; might be different for each
self.__k__ = k
+ # give each their own unique ID
self.__IX__ = np.arange(len(self))
-
- # from time import time
- # start = time()
dfs = pyrange_apply(_nearest, self, other, **kwargs)
- # end = time()
- # print("nearest", end - start)
-
nearest = PyRanges(dfs)
- # nearest.msp()
- # raise
- # print("nearest len", len(nearest))
if not overlap:
- # self = self.drop(like="__k__|__IX__")
- result = nearest#.drop(like="__k__|__IX__")
+ result = nearest
else:
from collections import defaultdict
- # overlap_kwargs = {k: v for k, v in kwargs.items()}
- # print("kwargs ties:", kwargs.get("ties"))
overlap_how = defaultdict(lambda: None, {"first": "first", "last": "last"})[kwargs.get("ties")]
- # start = time()
- overlaps = self.join(other, strandedness=strandedness, how=overlap_how, nb_cpu=nb_cpu)
- # end = time()
- # print("overlaps", end - start)
+ overlaps = self.join(other, strandedness=strandedness, how=overlap_how, nb_cpu=nb_cpu, apply_strand_suffix=apply_strand_suffix)
overlaps.Distance = 0
- # print("overlaps len", len(overlaps))
-
result = pr.concat([overlaps, nearest])
if not len(result):
return pr.PyRanges()
- # print(result)
- # print(overlaps.drop(like="__").df)
- # raise
-
- # start = time()
new_result = {}
if ties in ["first", "last"]:
- # method = "tail" if ties == "last" else "head"
- # keep = "last" if ties == "last" else "first"
-
for c, df in result:
- # start = time()
- # print(c)
- # print(df)
-
df = df.sort_values(["__IX__", "Distance"])
grpby = df.groupby("__k__", sort=False)
dfs = []
for k, kdf in grpby:
- # print("k", k)
- # print(kdf)
- # dist_bool = ~kdf.Distance.duplicated(keep=keep)
- # print(dist_bool)
- # kdf = kdf[dist_bool]
grpby2 = kdf.groupby("__IX__", sort=False)
- # f = getattr(grpby2, method)
_df = grpby2.head(k)
- # print(_df)
dfs.append(_df)
- # raise
if dfs:
new_result[c] = pd.concat(dfs)
- # print(new_result[c])
+
elif ties == "different" or not ties:
for c, df in result:
- # print(df)
-
if df.empty:
continue
dfs = []
@@ -2491,27 +2485,14 @@ class PyRanges():
df = df.sort_values(["__IX__", "Distance"])
grpby = df.groupby("__k__", sort=False)
- # for each index
- # want to keep until we have k
- # then keep all with same distance
for k, kdf in grpby:
- # print("kdf " * 10)
- # print("k " * 5, k)
- # print(kdf["__IX__ Distance".split()])
- # print(kdf.dtypes)
- # print(kdf.index.dtypes)
- # if ties:
if ties:
lx = get_different_ties(kdf.index.values, kdf.__IX__.values, kdf.Distance.astype(np.int64).values, k)
+ _df = kdf.reindex(lx)
else:
lx = get_all_ties(kdf.index.values, kdf.__IX__.values, kdf.Distance.astype(np.int64).values, k)
- # print(lx)
-
-
- # else:
- # lx = get_all_ties(kdf.index.values, kdf.__IX__.values, kdf.Distance.astype(np.int64).values, k)
- _df = kdf.reindex(lx)
- # print("_df", _df)
+ _df = kdf.reindex(lx)
+ _df = _df.groupby("__IX__").head(k)
dfs.append(_df)
if dfs:
@@ -2539,12 +2520,14 @@ class PyRanges():
df.loc[bools, "Distance"] = -df.loc[bools, "Distance"]
return df
- # print(result)
result = result.apply(prev_to_neg, suffix=kwargs["suffix"])
- # print(result)
- # end = time()
- # print("final stuff", end - start)
+ if not self.stranded and other.stranded:
+ if apply_strand_suffix is None:
+ import sys
+ print("join: Strand data from other will be added as strand data to self.\nIf this is undesired use the flag apply_strand_suffix=False.\nTo turn off the warning set apply_strand_suffix to True or False.", file=sys.stderr)
+ elif apply_strand_suffix:
+ result.columns = result.columns.str.replace("Strand", "Strand" + kwargs["suffix"])
return result
@@ -2664,8 +2647,64 @@ class PyRanges():
return pd.concat(_lengths).reset_index(drop=True)
+ def max_disjoint(self, strand=None, slack=0, **kwargs):
+
+ """Find the maximal disjoint set of intervals.
+
+ Parameters
+ ----------
+ strand : bool, default None, i.e. auto
+
+ Find the max disjoint set separately for each strand.
+
+ slack : int, default 0
+
+ Consider intervals within a distance of slack to be overlapping.
+
+ Returns
+ -------
+ PyRanges
+
+ PyRanges with maximal disjoint set of intervals.
+
+ Examples
+ --------
+ >>> gr = pr.data.f1()
+ +--------------+-----------+-----------+------------+-----------+--------------+
+ | Chromosome | Start | End | Name | Score | Strand |
+ | (category) | (int32) | (int32) | (object) | (int64) | (category) |
+ |--------------+-----------+-----------+------------+-----------+--------------|
+ | chr1 | 3 | 6 | interval1 | 0 | + |
+ | chr1 | 8 | 9 | interval3 | 0 | + |
+ | chr1 | 5 | 7 | interval2 | 0 | - |
+ +--------------+-----------+-----------+------------+-----------+--------------+
+ Stranded PyRanges object has 3 rows and 6 columns from 1 chromosomes.
+ For printing, the PyRanges was sorted on Chromosome and Strand.
+
+ >>> gr.max_disjoint(strand=False)
+ +--------------+-----------+-----------+------------+-----------+--------------+
+ | Chromosome | Start | End | Name | Score | Strand |
+ | (category) | (int32) | (int32) | (object) | (int64) | (category) |
+ |--------------+-----------+-----------+------------+-----------+--------------|
+ | chr1 | 3 | 6 | interval1 | 0 | + |
+ | chr1 | 8 | 9 | interval3 | 0 | + |
+ +--------------+-----------+-----------+------------+-----------+--------------+
+ Stranded PyRanges object has 2 rows and 6 columns from 1 chromosomes.
+ For printing, the PyRanges was sorted on Chromosome and Strand.
+ """
+
+ if strand is None:
+ strand = self.stranded
+
+ kwargs = {"strand": strand, "slack": slack}
+ kwargs = fill_kwargs(kwargs)
- def merge(self, strand=None, count=False, count_col="Count", by=None):
+ from pyranges.methods.max_disjoint import _max_disjoint
+ df = pyrange_apply_single(_max_disjoint, self, **kwargs)
+
+ return pr.PyRanges(df)
+
+ def merge(self, strand=None, count=False, count_col="Count", by=None, slack=0):
"""Merge overlapping intervals into one.
@@ -2687,6 +2726,10 @@ class PyRanges():
Only merge intervals with equal values in these columns.
+ slack : int, default 0
+
+ Allow this many nucleotides between each interval to merge.
+
Returns
-------
PyRanges
@@ -2784,7 +2827,7 @@ class PyRanges():
if strand is None:
strand = self.stranded
- kwargs = {"strand": strand, "count": count, "by": by, "count_col": count_col}
+ kwargs = {"strand": strand, "count": count, "by": by, "count_col": count_col, "slack": slack}
if not kwargs["by"]:
kwargs["sparse"] = {"self": True}
@@ -2859,7 +2902,7 @@ class PyRanges():
return self
- def nearest(self, other, strandedness=None, overlap=True, how=None, suffix="_b", nb_cpu=1):
+ def nearest(self, other, strandedness=None, overlap=True, how=None, suffix="_b", nb_cpu=1, apply_strand_suffix=None):
"""Find closest interval.
@@ -2888,6 +2931,11 @@ class PyRanges():
Suffix to give columns with shared name in other.
+ apply_strand_suffix : bool, default None
+
+ If first pyranges is unstranded, but the second is not, the first will be given the strand column of the second.
+ apply_strand_suffix makes the added strand column a regular data column instead by adding a suffix.
+
nb_cpu: int, default 1
How many cpus to use. Can at most use 1 per chromosome or chromosome/strand tuple.
@@ -2967,14 +3015,22 @@ class PyRanges():
from pyranges.methods.nearest import _nearest
- kwargs = {"strandedness": strandedness, "how": how, "overlap": overlap, "nb_cpu": nb_cpu, "suffix": suffix}
+ kwargs = {"strandedness": strandedness, "how": how, "overlap": overlap, "nb_cpu": nb_cpu, "suffix": suffix, "apply_strand_suffix": apply_strand_suffix}
kwargs = fill_kwargs(kwargs)
if kwargs.get("how") in "upstream downstream".split():
assert other.stranded, "If doing upstream or downstream nearest, other pyranges must be stranded"
dfs = pyrange_apply(_nearest, self, other, **kwargs)
+ gr = PyRanges(dfs)
- return PyRanges(dfs)
+ if not self.stranded and other.stranded:
+ if apply_strand_suffix is None:
+ import sys
+ print("join: Strand data from other will be added as strand data to self.\nIf this is undesired use the flag apply_strand_suffix=False.\nTo turn off the warning set apply_strand_suffix to True or False.", file=sys.stderr)
+ elif apply_strand_suffix:
+ gr.columns = gr.columns.str.replace("Strand", "Strand" + kwargs["suffix"])
+
+ return gr
def new_position(self, new_pos, columns=None):
@@ -3132,7 +3188,7 @@ class PyRanges():
return pr.PyRanges(dfs)
- def overlap(self, other, strandedness=None, how="first", invert=False):
+ def overlap(self, other, strandedness=None, how="first", invert=False, nb_cpu=1):
"""Return overlapping intervals.
@@ -3155,6 +3211,10 @@ class PyRanges():
What intervals to report. By default reports every interval in self with overlap once.
"containment" reports all intervals where the overlapping is contained within it.
+ invert : bool, default False
+
+ Whether to return the intervals without overlaps.
+
nb_cpu: int, default 1
How many cpus to use. Can at most use 1 per chromosome or chromosome/strand tuple.
@@ -3246,15 +3306,28 @@ class PyRanges():
For printing, the PyRanges was sorted on Chromosome.
"""
- kwargs = {"strandedness": strandedness}
+ kwargs = {"strandedness": strandedness, "nb_cpu": nb_cpu}
kwargs["sparse"] = {"self": False, "other": True}
kwargs["how"] = how
kwargs["invert"] = invert
kwargs = fill_kwargs(kwargs)
+ if len(self) == 0:
+ return self
+
+ if invert:
+ self = self.copy()
+ self.__ix__ = np.arange(len(self))
+
dfs = pyrange_apply(_overlap, self, other, **kwargs)
+ result = pr.PyRanges(dfs)
- return pr.PyRanges(dfs)
+ if invert:
+ found_idxs = getattr(result, "__ix__", [])
+ result = self[~self.__ix__.isin(found_idxs)]
+ result = result.drop("__ix__")
+
+ return result
def pc(self, n=8, formatting=None):
@@ -4310,9 +4383,14 @@ class PyRanges():
strand = True if strandedness else False
other_clusters = other.merge(strand=strand)
+ self = self.count_overlaps(other_clusters, strandedness=strandedness, overlap_col="__num__")
+
result = pyrange_apply(_subtraction, self, other_clusters, **kwargs)
- return PyRanges(result)
+ self = self.drop("__num__")
+
+ return PyRanges(result).drop("__num__")
+
def summary(self, to_stdout=True, return_df=False):
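Note on max_disjoint above: the new method delegates to _max_disjoint in pyranges/methods/max_disjoint.py, whose body is not shown in this part of the diff. As a rough illustration only, the classic greedy for picking a largest set of non-overlapping intervals is sketched here in plain Python. The function name greedy_max_disjoint and the exact slack rule (a start closer than slack to the last kept end counts as overlapping) are assumptions made for this sketch, not taken from upstream.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # half-open (start, end)


def greedy_max_disjoint(intervals: List[Interval], slack: int = 0) -> List[Interval]:
    """Pick a largest set of mutually non-overlapping intervals.

    Assumption for this sketch: an interval is treated as overlapping the
    last kept one when its start is less than that interval's end plus slack.
    """
    kept: List[Interval] = []
    last_end: Optional[int] = None
    # Sorting by end coordinate is the classic interval-scheduling greedy.
    for start, end in sorted(intervals, key=lambda iv: iv[1]):
        if last_end is None or start >= last_end + slack:
            kept.append((start, end))
            last_end = end
    return kept


# The three intervals from the docstring example, strands ignored.
example = [(3, 6), (8, 9), (5, 7)]
print(greedy_max_disjoint(example))           # [(3, 6), (8, 9)]
print(greedy_max_disjoint(example, slack=3))  # [(3, 6)]
```

With slack=0 this reproduces the two rows that the docstring shows for gr.max_disjoint(strand=False).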
=====================================
pyranges/readers.py
=====================================
@@ -186,6 +186,11 @@ def read_bam(f, sparse=True, as_df=False, mapq=0, required_flag=0, filter_flag=1
print("bamread must be installed to read bam. Use `conda install -c bioconda bamread` or `pip install bamread` to install it.")
sys.exit(1)
+ if bamread.__version__ in ["0.0.1", "0.0.2", "0.0.3", "0.0.4",
+ "0.0.5", "0.0.6", "0.0.7", "0.0.8", "0.0.9"]:
+ print("bamread not recent enough. Must be 0.0.10 or higher. Use `conda install -c bioconda 'bamread>=0.0.10'` or `pip install bamread>=0.0.10` to install it.")
+ sys.exit(1)
+
if sparse:
df = bamread.read_bam(f, mapq, required_flag, filter_flag)
else:
@@ -254,6 +259,10 @@ def read_gtf(f, full=True, as_df=False, nrows=None, duplicate_attr=False):
Path to GTF file.
+ full : bool, default True
+
+ Whether to read and interpret the annotation column.
+
as_df : bool, default False
Whether to return as pandas DataFrame instead of PyRanges.
@@ -298,7 +307,10 @@ def read_gtf(f, full=True, as_df=False, nrows=None, duplicate_attr=False):
_skiprows = skiprows(f)
- gr = read_gtf_full(f, as_df, nrows, _skiprows, duplicate_attr)
+ if full:
+ gr = read_gtf_full(f, as_df, nrows, _skiprows, duplicate_attr)
+ else:
+ gr = read_gtf_restricted(f, _skiprows, as_df=False, nrows=None)
return gr
@@ -357,10 +369,6 @@ def to_rows(anno):
# l[:-1] removes final ";" cheaply
for kv in l[:-1].split("; ")]})
- # for l in anno:
- # l = l.replace('"', '').replace(";", "").split()
- # rowdicts.append({k: v for k, v in zip(*([iter(l)] * 2))})
-
return pd.DataFrame.from_dict(rowdicts).set_index(anno.index)
@@ -388,8 +396,8 @@ def to_rows_keep_duplicates(anno):
def read_gtf_restricted(f,
+ skiprows,
as_df=False,
- skiprows=0,
nrows=None):
"""seqname - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix. Important note: the seqname must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly. See the example GFF output below.
# source - name of the program that generated this feature, or the data source (database or project name)
@@ -415,6 +423,7 @@ def read_gtf_restricted(f,
names="Chromosome Feature Start End Score Strand Attribute".split(),
dtype=dtypes,
chunksize=int(1e5),
+ skiprows=skiprows,
nrows=nrows)
dfs = []
@@ -457,7 +466,7 @@ def to_rows_gff3(anno):
return pd.DataFrame.from_dict(rowdicts).set_index(anno.index)
-def read_gff3(f, annotation=None, as_df=False, nrows=None, skiprows=0):
+def read_gff3(f, full=True, annotation=None, as_df=False, nrows=None):
"""Read files in the General Feature Format.
@@ -467,6 +476,10 @@ def read_gff3(f, annotation=None, as_df=False, nrows=None, skiprows=0):
Path to GFF file.
+ full : bool, default True
+
+ Whether to read and interpret the annotation column.
+
as_df : bool, default False
Whether to return as pandas DataFrame instead of PyRanges.
@@ -481,6 +494,10 @@ def read_gff3(f, annotation=None, as_df=False, nrows=None, skiprows=0):
pyranges.read_gtf : read files in the Gene Transfer Format
"""
+ _skiprows = skiprows(f)
+
+ if not full:
+ return read_gtf_restricted(f, _skiprows, as_df=as_df, nrows=nrows)
dtypes = {
"Chromosome": "category",
@@ -499,7 +516,7 @@ def read_gff3(f, annotation=None, as_df=False, nrows=None, skiprows=0):
names=names,
dtype=dtypes,
chunksize=int(1e5),
- skiprows=skiprows,
+ skiprows=_skiprows,
nrows=nrows)
dfs = []
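Note on the bamread check in read_bam above: the commit lists the too-old version strings one by one. A plain string comparison would not work here, because "0.0.9" sorts after "0.0.10" character by character. One possible alternative, shown only as a sketch, is to compare integer tuples. The helper names version_tuple and is_recent_enough are hypothetical, and the sketch assumes purely numeric dotted versions with no suffix.

```python
def version_tuple(version: str) -> tuple:
    """Turn a purely numeric dotted version such as "0.0.10" into (0, 0, 10)."""
    return tuple(int(part) for part in version.split("."))


def is_recent_enough(installed: str, minimum: str = "0.0.10") -> bool:
    """Compare versions numerically, component by component."""
    return version_tuple(installed) >= version_tuple(minimum)


# String comparison goes character by character, so "9" beats "1" here.
print("0.0.9" >= "0.0.10")         # True
print(is_recent_enough("0.0.9"))   # False
print(is_recent_enough("0.0.10"))  # True
```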
=====================================
pyranges/version.py
=====================================
@@ -1 +1 @@
-__version__ = "0.0.85"
+__version__ = "0.0.111"
=====================================
setup.py
=====================================
@@ -6,7 +6,7 @@ __version__ = open("pyranges/version.py").readline().split(" = ")[1].replace(
'"', '').strip()
install_requires = [
- "cython", "pandas", "ncls>=0.0.50", "tabulate", "sorted_nearest>=0.0.30", "pyrle",
+ "cython", "pandas", "ncls>=0.0.62", "tabulate", "sorted_nearest>=0.0.33", "pyrle",
"natsort"] #,
# optional_requires = ["bamread", "pybigwig", "ray"]
=====================================
tests/helpers.py
=====================================
@@ -3,6 +3,12 @@ import pandas as pd
def assert_df_equal(df1, df2):
+ print("-"*100)
+ print("df1")
+ print(df1)
+ print("df2")
+ print(df2)
+
# df1.loc[:, "Start"] = df1.Start.astype(np.int64)
# df2.loc[:, "Start"] = df1.Start.astype(np.int64)
# df1.loc[:, "End"] = df1.End.astype(np.int64)
=====================================
tests/test_binary.py
=====================================
@@ -100,9 +100,7 @@ def compare_results_nearest(bedtools_df, result):
result = result.df
-
if not len(result) == 0:
-
bedtools_df = bedtools_df.sort_values("Start End Distance".split())
result = result.sort_values("Start End Distance".split())
result_df = result["Chromosome Start End Strand Distance".split()]
@@ -239,6 +237,11 @@ def test_coverage(gr, gr2, strandedness):
result = gr.coverage(gr2, strandedness=strandedness)
+ print("pyranges")
+ print(result.df)
+ print("bedtools")
+ print(bedtools_df)
+
# assert len(result) > 0
assert np.all(
bedtools_df.NumberOverlaps.values == result.NumberOverlaps.values)
@@ -301,9 +304,12 @@ def test_subtraction(gr, gr2, strandedness):
names="Chromosome Start End Name Score Strand".split(),
sep="\t")
+ print("subtracting" * 50)
result = gr.subtract(gr2, strandedness=strandedness)
+ print("bedtools_result")
print(bedtools_df)
+ print("PyRanges result:")
print(result)
compare_results(bedtools_df, result)
@@ -551,3 +557,36 @@ def test_k_nearest(gr, gr2, nearest_how, overlap, strandedness, ties):
print(result)
compare_results_nearest(bedtools_df, result)
+
+
+# @settings(
+# max_examples=max_examples,
+# deadline=deadline,
+# print_blob=True,
+# suppress_health_check=HealthCheck.all())
+# @given(gr=dfs_min()) # pylint: disable=no-value-for-parameter
+# def test_k_nearest_nearest_self_same_size(gr):
+
+# result = gr.k_nearest(
+# gr, k=1, strandedness=None, overlap=True, how=None, ties="first")
+
+# assert len(result) == len(gr)
+
+@settings(
+ max_examples=max_examples,
+ deadline=deadline,
+ print_blob=True,
+ suppress_health_check=HealthCheck.all())
+@given(gr=dfs_min(), gr2=dfs_min()) # pylint: disable=no-value-for-parameter
+def test_k_nearest_1_vs_nearest(gr, gr2):
+
+ result_k = gr.k_nearest(gr2, k=1, strandedness=None, overlap=True, how=None)
+ if len(result_k) > 0:
+ result_k.Distance = result_k.Distance.abs()
+
+ result_n = gr.nearest(gr2, strandedness=None, overlap=True, how=None)
+
+ if len(result_k) == 0 and len(result_n) == 0:
+ pass
+ else:
+ assert (result_k.sort().Distance.abs() == result_n.sort().Distance).all()
=====================================
tests/test_concat.py
=====================================
@@ -0,0 +1,31 @@
+#!/usr/bin/env python3
+
+import pytest
+import pyranges as pr
+
+def assert_equal_length_before_after(gr1, gr2):
+
+ print("in test")
+ l1 = len(gr1)
+ l2 = len(gr2)
+ c = pr.concat([gr1, gr2])
+
+ if not gr1.stranded or not gr2.stranded:
+ assert not c.stranded
+
+ lc = len(c)
+ assert l1 + l2 == lc
+
+def test_concat_stranded_unstranded(f1, f2):
+
+ assert_equal_length_before_after(f1, f2)
+
+def test_concat_unstranded_unstranded(f1, f2):
+
+ assert_equal_length_before_after(f1.unstrand(), f2.unstrand())
+
+def test_concat_stranded_unstranded(f1, f2):
+ assert_equal_length_before_after(f1, f2.unstrand())
+
+def test_concat_unstranded_stranded(f1, f2):
+ assert_equal_length_before_after(f1.unstrand(), f2)
=====================================
tests/test_count_overlaps.py
=====================================
@@ -0,0 +1,62 @@
+import numpy as np
+import pyranges as pr
+from tests.helpers import assert_df_equal
+
+a = '''Chromosome Start End Strand
+chr1 6 12 +
+chr1 10 20 +
+chr1 22 27 -
+chr1 24 30 -'''
+b = '''Chromosome Start End Strand
+chr1 12 32 +
+chr1 14 30 +'''
+c = '''Chromosome Start End Strand
+chr1 8 15 +
+chr1 713800 714800 -
+chr1 32 34 -'''
+
+grs = {n: pr.from_string(s) for n, s in zip(["a", "b", "c"], [a, b, c])}
+unstranded_grs = {n: gr.unstrand() for n, gr in grs.items()}
+
+features = pr.PyRanges(chromosomes=["chr1"] * 4,
+ starts=[0, 10, 20, 30],
+ ends=[10, 20, 30, 40],
+ strands = ["+","+","+","-"])
+unstranded_features = features.unstrand()
+
+def test_strand_vs_strand_same():
+
+ expected_result = pr.from_string("""Chromosome Start End Strand a b c
+chr1 0 10 + 1 0 1
+chr1 10 20 + 2 2 1
+chr1 20 30 + 0 2 0
+chr1 30 40 - 0 0 1""")
+
+ res = pr.count_overlaps(grs, features, strandedness="same")
+ res = res.apply(lambda df: df.astype({"a": np.int64, "b": np.int64, "c": np.int64}))
+
+ res.print(merge_position=True)
+
+ assert_df_equal(res.df, expected_result.df)
+
+
+# def test_strand_vs_strand_opposite():
+
+# expected_result = pr.from_string("""Chromosome Start End Strand a b c
+# chr1 0 10 + 1 0 1
+# chr1 10 20 + 1 2 1
+# chr1 20 30 + 0 2 0
+# chr1 30 40 - 0 0 1""")
+
+# res = pr.count_overlaps(grs, features, strandedness="opposite")
+
+# print("features")
+# print(features)
+
+# for name, gr in grs.items():
+# print(name)
+# print(gr)
+
+# res.print(merge_position=True)
+
+# assert_df_equal(res.df, expected_result.df)
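Note on test_strand_vs_strand_same above: the expected table can be checked by hand with a brute-force count. The sketch below uses no pyranges at all. It assumes half-open intervals on a single chromosome and counts, for every feature, the intervals in each track that share its strand and overlap it. The function name count_same_strand_overlaps is hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int, str]  # half-open (start, end, strand), one chromosome


def count_same_strand_overlaps(
    features: List[Interval], tracks: Dict[str, List[Interval]]
) -> Dict[str, List[int]]:
    """For each track, count same-strand overlaps per feature (brute force)."""
    counts: Dict[str, List[int]] = {}
    for name, intervals in tracks.items():
        counts[name] = [
            sum(
                1
                for start, end, strand in intervals
                # Half-open overlap test, restricted to the feature's strand.
                if strand == f_strand and start < f_end and f_start < end
            )
            for f_start, f_end, f_strand in features
        ]
    return counts


features = [(0, 10, "+"), (10, 20, "+"), (20, 30, "+"), (30, 40, "-")]
tracks = {
    "a": [(6, 12, "+"), (10, 20, "+"), (22, 27, "-"), (24, 30, "-")],
    "b": [(12, 32, "+"), (14, 30, "+")],
    "c": [(8, 15, "+"), (713800, 714800, "-"), (32, 34, "-")],
}

result = count_same_strand_overlaps(features, tracks)
print(result)
# {'a': [1, 2, 0, 0], 'b': [0, 2, 2, 0], 'c': [1, 1, 0, 1]}
```

Read column-wise, these lists match the a, b and c columns of expected_result in the test.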
=====================================
tests/test_do_not_error.py
=====================================
@@ -50,6 +50,8 @@ strandedness_chain = list(product(["same", "opposite"], strandedness)) + list(
# @reproduce_failure('5.5.4', b'AXicY2RAA4xIJDY+AAC2AAY=') # test_three_in_a_row[strandedness_chain24-method_chain24]
def test_three_in_a_row(gr, gr2, gr3, strandedness_chain, method_chain):
+ print(method_chain)
+
s1, s2 = strandedness_chain
f1, f2 = method_chain
View it on GitLab: https://salsa.debian.org/med-team/pyranges/-/commit/9afe05d25e92fd318b45e99a9f7e75341c0ae911