[med-svn] [Git][med-team/pyranges][upstream] New upstream version 0.0.111+ds

Tue Nov 2 18:50:55 GMT 2021


Nilesh Patra pushed to branch upstream at Debian Med / pyranges


Commits:
9afe05d2 by Nilesh Patra at 2021-11-02T23:22:21+05:30
New upstream version 0.0.111+ds
- - - - -


25 changed files:

- − .travis.yml
- CHANGELOG.txt
- README.md
- pyranges/__init__.py
- pyranges/genomicfeatures.py
- pyranges/methods/attr.py
- pyranges/methods/concat.py
- pyranges/methods/coverage.py
- pyranges/methods/init.py
- pyranges/methods/intersection.py
- pyranges/methods/join.py
- pyranges/methods/k_nearest.py
- + pyranges/methods/max_disjoint.py
- pyranges/methods/subtraction.py
- pyranges/multioverlap.py
- pyranges/multithreaded.py
- pyranges/pyranges.py
- pyranges/readers.py
- pyranges/version.py
- setup.py
- tests/helpers.py
- tests/test_binary.py
- + tests/test_concat.py
- + tests/test_count_overlaps.py
- tests/test_do_not_error.py


Changes:

=====================================
.travis.yml deleted
=====================================
@@ -1,32 +0,0 @@
-# Stolen from http://conda.pydata.org/docs/travis.html
-language: python
-python:
-  # We don't actually use the Travis Python, but this keeps it organized..
-  - "3.6"
-install:
-  - sudo apt-get update
-  # We do this conditionally because it saves us some downloading if the
-  # version is the same.
-  - wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh;
-  - bash miniconda.sh -b -p $HOME/miniconda
-  - export PATH="$HOME/miniconda/bin:$PATH"
-  - hash -r
-  - conda config --set always_yes yes --set changeps1 no
-  # - conda update -q conda
-  - conda config --add channels bioconda
-  - conda config --add channels r
-  # Useful for debugging any issues with conda
-  - conda info -a
-  - conda create -q -n test-environment python=$TRAVIS_PYTHON_VERSION numpy scipy pandas pytest pytest-cov cython tabulate hypothesis bedtools pybigwig pysam
-  - source activate test-environment
-  - python --version
-  - pip --version
-  - pip install sorted_nearest ncls pyrle
-  # to build docs pip install sphinxcontrib-napoleon sphinx-autoapi
-  - python setup.py install
-  - python -c 'import pandas as pd; print(pd.__version__)'
-  - ls tests
-
-script: python -m pytest -m "not explore"  --doctest-modules --cov=pyranges tests
-
-after_success: coveralls


=====================================
CHANGELOG.txt
=====================================
@@ -1,3 +1,75 @@
+# 0.0.111 (01.10.2021)
+- require minimum version of NCLS
+
+# 0.0.110 (20.09.21)
+- fix count_overlaps with keep_nonoverlapping=False
+- fix subtract with more than 1024 intervals (new fix)
+
+# 0.0.109 (16.09.21)
+- fix overlap invert behavior
+- add intersect invert flag
+- fix subtract in cases where more than 1024 intervals overlapped a single interval
+
+# 0.0.106/107/108(hotfixes) (07/8.09.21)
+- fix join with slack mutating first arg
+- add flag use_other_strand in join, nearest, k_nearest
+- fix categorical-bug in newer versions of pandas
+- add function pr.version_info() to print relevant version flags for debugging
+
+# 0.0.105 (23.08.21)
+- require bamread 0.0.10 to fix #211
+
+# 0.0.104 (06/20.08.21)
+- fix broken three_end/five_end code
+
+# 0.0.102/103 (06.08.21)
+- fix bug in pr.count_overlaps
+- demand version 0.0.9 or greater from bamread
+
+# 0.0.100/0.0.101 (20/21.06.21)
+- add full-flag to read_gtf
+- fix bug in join with slack > 0 when result is empty
+
+# 0.0.99 (17.06.21)
+- add nb_cpu arg to overlap
+
+# 0.0.98 (07.06.21)
+- fix k-nearest how=None
+
+# 0.0.98 (20.05.21)
+- fix casting in tss/tes
+
+# 0.0.96/97 (07.05.21)
+- fixes to .tes and .tss methods (issue #182)
+
+# 0.0.95 (02.03.21)
+- teensy fix bedclip
+- add pretty-printing in jupyter notebooks (thanks to @rasi)
+
+# 0.0.94 (27.02.21)
+- print warning if start and end columns have different dtypes
+
+# 0.0.93 (25.02.21)
+- add max_disjoint for maximal disjoint set
+
+# 0.0.91-92 (15.01.21)
+- hotfix for 0.0.90
+
+# 0.0.90 (03.01.21)
+- fix #165 slow set operations on small files with many chromosomes (thanks ndukler)
+
+# 0.0.89 (16.11.20)
+- fix #159 (thanks cfriedline)
+
+# 0.0.88 (09.11.20)
+- fix bug when concatting stranded and unstranded pyranges (thanks cfriedline, issue #160)
+
+# 0.0.87 (23.10.20)
+- fix bug in join with left/right option
+
+# 0.0.86 (05.10.20)
+- add slack-option to merge
+
 # 0.0.85 (17.09.20)
 - fix error when parsing gtf-files with whitespace in value-tags
 


=====================================
README.md
=====================================
@@ -1,6 +1,6 @@
 # pyranges
 
-[![Coverage Status](https://img.shields.io/coveralls/github/biocore-ntnu/pyranges.svg)](https://coveralls.io/github/biocore-ntnu/pyranges?branch=master) [![Codacy Badge](https://api.codacy.com/project/badge/Grade/b61a53346d764a8d8f0ab2a6afd7b100)](https://www.codacy.com/app/endrebak/pyranges?utm_source=github.com&utm_medium=referral&utm_content=biocore-ntnu/pyranges&utm_campaign=Badge_Grade) [![Build Status](https://travis-ci.org/biocore-ntnu/pyranges.svg?branch=master)](https://travis-ci.org/biocore-ntnu/pyranges) [![hypothesis tested](graphs/hypothesis-tested-brightgreen.svg)](http://hypothesis.readthedocs.io/) [![PyPI version](https://badge.fury.io/py/pyranges.svg)](https://badge.fury.io/py/pyranges) [![MIT](https://img.shields.io/pypi/l/pyranges.svg?color=green)](https://opensource.org/licenses/MIT) ![PyPI - Python Version](https://img.shields.io/pypi/pyversions/pyranges.svg) [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/pyranges/README.html)
+[![Coverage Status](https://img.shields.io/coveralls/github/biocore-ntnu/pyranges.svg)](https://coveralls.io/github/biocore-ntnu/pyranges?branch=master) [![Codacy Badge](https://api.codacy.com/project/badge/Grade/b61a53346d764a8d8f0ab2a6afd7b100)](https://www.codacy.com/app/endrebak/pyranges?utm_source=github.com&utm_medium=referral&utm_content=biocore-ntnu/pyranges&utm_campaign=Badge_Grade) [![Build Status](https://travis-ci.com/biocore-ntnu/pyranges.svg?branch=master)](https://travis-ci.com/biocore-ntnu/pyranges) [![hypothesis tested](graphs/hypothesis-tested-brightgreen.svg)](http://hypothesis.readthedocs.io/) [![PyPI version](https://badge.fury.io/py/pyranges.svg)](https://badge.fury.io/py/pyranges) [![MIT](https://img.shields.io/pypi/l/pyranges.svg?color=green)](https://opensource.org/licenses/MIT) ![PyPI - Python Version](https://img.shields.io/pypi/pyversions/pyranges.svg) [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/pyranges/README.html)
 
 ## Introduction
 


=====================================
pyranges/__init__.py
=====================================
@@ -549,4 +549,33 @@ def to_bigwig(gr, path, chromosome_sizes):
 
     bw.addEntries(chromosomes, starts, ends=ends, values=values)
 
-__all__ = ["from_string", "from_dict", "to_bigwig", "count_overlaps", "random", "itergrs", "read_gtf", "read_bam", "read_bed", "read_gff3", "concat", "PyRanges"]
+def version_info():
+
+    import importlib
+    def update_version_info(version_info, library):
+        if importlib.util.find_spec(library):
+             version = importlib.import_module(library).__version__
+        else:
+            version = "not installed"
+
+        version_info[library] = version
+
+    version_info = {"pyranges version": pr.__version__,
+                    "pandas version": pd.__version__,
+                    "numpy version": np.__version__,
+                    "python version": sys.version_info}
+
+    update_version_info(version_info, "ncls")
+    update_version_info(version_info, "sorted_nearest")
+    update_version_info(version_info, "pyrle")
+    update_version_info(version_info, "ray")
+    update_version_info(version_info, "bamread")
+    # update_version_info(version_info, "bwread") no version string yet!
+    update_version_info(version_info, "pyranges_db")
+    update_version_info(version_info, "pybigwig")
+    update_version_info(version_info, "hypothesis")
+
+    print(version_info)
+
+
+__all__ = ["from_string", "from_dict", "to_bigwig", "count_overlaps", "random", "itergrs", "read_gtf", "read_bam", "read_bed", "read_gff3", "concat", "PyRanges", "version_info"]
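
Note on the hunk above: version_info() looks up each optional dependency and records either its version or "not installed". Below is a minimal standalone sketch of that lookup pattern. The name `collect_versions` and the `getattr` fallback are illustrative assumptions, not part of pyranges. The sketch imports `importlib.util` explicitly, because a bare `import importlib` does not guarantee that the `util` submodule is loaded.

```python
import importlib
import importlib.util
import sys


def collect_versions(libraries):
    """Map each library name to its version string, or "not installed"."""
    versions = {"python version": sys.version_info}
    for library in libraries:
        if importlib.util.find_spec(library) is None:
            versions[library] = "not installed"
        else:
            module = importlib.import_module(library)
            # Hypothetical fallback: some packages expose no __version__.
            versions[library] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    print(collect_versions(["ncls", "sorted_nearest", "pyrle"]))
```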


=====================================
pyranges/genomicfeatures.py
=====================================
@@ -263,7 +263,7 @@ def _outside_bounds(df, **kwargs):
         size_df = chromsizes.df
         chromsizes = {k: v for k, v in zip(size_df.Chromosome, size_df.End)}
 
-    size = chromsizes[df.Chromosome.iloc[0]]
+    size = int(chromsizes[df.Chromosome.iloc[0]])
     clip = kwargs.get("clip", False)
 
     ends_outside = df.End > size
@@ -473,12 +473,12 @@ def _tss(df, slack=0):
     tss_neg = df.loc[df.Strand == "-"].copy()
 
     # pd.options.mode.chained_assignment = None
-    tss_neg.loc[:, "Start"] = tss_neg.End
+    tss_neg.loc[:, "Start"] = tss_neg.End - 1
 
     # pd.options.mode.chained_assignment = "warn"
     tss = pd.concat([tss_pos, tss_neg], sort=False)
-    tss["End"] = tss.Start
-    tss.End = tss.End + 1 + slack
+    tss["End"] = tss.Start + 1
+    tss.End = tss.End + slack
     tss.Start = tss.Start - slack
     tss.loc[tss.Start < 0, "Start"] = 0
 
@@ -499,12 +499,12 @@ def _tes(df, slack=0):
     tes_neg = df.loc[df.Strand == "-"].copy()
 
     # pd.options.mode.chained_assignment = None
-    tes_neg.loc[:, "Start"] = tes_neg.End
+    tes_neg.loc[:, "End"] = tes_neg.Start + 1
 
     # pd.options.mode.chained_assignment = "warn"
     tes = pd.concat([tes_pos, tes_neg], sort=False)
-    tes["Start"] = tes.End
-    tes.End = tes.End + 1 + slack
+    tes["Start"] = tes.End - 1
+    tes.End = tes.End + slack
     tes.Start = tes.Start - slack
     tes.loc[tes.Start < 0, "Start"] = 0
 


=====================================
pyranges/methods/attr.py
=====================================
@@ -55,7 +55,8 @@ def _setattr(self, column_name, column, pos=False):
         self.__dict__["dfs"] = dfs
     else:
         int64 = True if self.dtypes["Start"] == np.int64 else False
-        self.__dict__["dfs"] = pr.PyRanges(pr.PyRanges(dfs).df, int64=int64).dfs # will merge the dfs, then split on keys again to ensure they are correct
+        # will merge the dfs, then split on keys again to ensure they are correct
+        self.__dict__["dfs"] = pr.PyRanges(pr.PyRanges(dfs).df, int64=int64).dfs 
 
 
 def _getattr(self, name):


=====================================
pyranges/methods/concat.py
=====================================
@@ -9,14 +9,13 @@ def concat(pyranges, strand=None):
     if not pyranges:
         return None
 
-    # from pydbg import dbg
     pyranges = [pr for pr in pyranges if not pr.empty]
-    # dbg(pyranges)
-    # dbg([p.df.dtypes for p in pyranges])
     grs_per_chromosome = defaultdict(list)
 
+    strand_info = [gr.stranded for gr in pyranges]
+
     if strand is None:
-        strand = all([gr.stranded for gr in pyranges])
+        strand = all(strand_info)
 
     if strand:
         assert all([
@@ -30,22 +29,25 @@ def concat(pyranges, strand=None):
     else:
         for gr in pyranges:
             for chromosome in gr.chromosomes:
-                # dbg(gr)
-                # dbg(gr[chromosome])
                 df = gr[chromosome].df
-                # dbg(df.dtypes)
                 grs_per_chromosome[chromosome].append(df)
 
     new_pyrange = {}
 
     for k, v in grs_per_chromosome.items():
-        # dbg([_v.dtypes for _v in v])
         new_pyrange[k] = pd.concat(v, sort=False)
-        # dbg(new_pyrange[k].dtypes)
 
     res = pr.multithreaded.process_results(new_pyrange.values(),
                                            new_pyrange.keys())
 
-    # dbg([r.dtypes for r in res.values()])
+    if any(strand_info) and not all(strand_info):
+        new_res = {}
+        for k, v in res.items():
+            v.loc[:, "Strand"] = v.Strand.cat.add_categories(["."])
+            new_res[k] = v.assign(Strand=v.Strand.fillna("."))
+        res = pr.PyRanges(new_res)
+        res.Strand = res.Strand
+    else:
+        res = pr.PyRanges(res)
 
-    return pr.PyRanges(res)
+    return  res
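
Note on the hunk above: when stranded and unstranded inputs are mixed, rows without strand information receive "." as their strand. The following pandas-only sketch illustrates the idea under simplified assumptions (two small hand-made frames, no PyRanges objects); it is not the code path pyranges itself takes. The "." category has to exist before fillna can use it on a categorical column.

```python
import pandas as pd

stranded = pd.DataFrame({
    "Chromosome": ["chr1"], "Start": [1], "End": [5],
    "Strand": pd.Categorical(["+"]),
})
unstranded = pd.DataFrame({"Chromosome": ["chr1"], "Start": [10], "End": [20]})

# Rows from the unstranded frame get a missing Strand value after concat.
combined = pd.concat([stranded, unstranded], sort=False, ignore_index=True)

# Make sure the column is categorical, then register "." before filling.
strand = combined["Strand"].astype("category")
if "." not in strand.cat.categories:
    strand = strand.cat.add_categories(["."])
combined["Strand"] = strand.fillna(".")

print(combined)
```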


=====================================
pyranges/methods/coverage.py
=====================================
@@ -33,18 +33,18 @@ def _number_overlapping(scdf, ocdf, **kwargs):
 
     df = scdf.copy()
 
-    if keep_nonoverlapping:
-        _missing_indexes = np.setdiff1d(scdf.index, _self_indexes)
-        missing = pd.DataFrame(data={"Index": _missing_indexes, "Count": 0}, index=_missing_indexes)
-        counts_per_read = pd.concat([counts_per_read, missing])
-    else:
-        df = df.loc[_self_indexes]
+    _missing_indexes = np.setdiff1d(scdf.index, _self_indexes)
+    missing = pd.DataFrame(data={"Index": _missing_indexes, "Count": 0}, index=_missing_indexes)
+    counts_per_read = pd.concat([counts_per_read, missing])
 
-    counts_per_read = counts_per_read.set_index("Index")
+    counts_per_read = counts_per_read.set_index("Index").sort_index()
 
     df.insert(df.shape[1], column_name, counts_per_read)
 
-    return df
+    if keep_nonoverlapping:
+        return df
+    else:
+        return df[df[column_name] != 0]
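
Note on the hunk above: the fix always builds a complete count column (zeros included) and only filters at the end. A rough sketch of the same idea follows. The function name `attach_counts` and the default column name are hypothetical, and `reindex(..., fill_value=0)` is swapped in for the setdiff1d/concat steps used in the diff.

```python
import numpy as np
import pandas as pd


def attach_counts(df, hit_indexes, column_name="NumberOverlaps",
                  keep_nonoverlapping=True):
    """Add a per-row overlap count; optionally drop rows with zero overlaps."""
    # An explicit dtype keeps an empty hit list from becoming an object Series.
    counts = pd.Series(hit_indexes, dtype=np.int64).value_counts(sort=False)
    # Rows that never appear among the hits get a count of 0.
    counts = counts.reindex(df.index, fill_value=0)

    out = df.copy()
    out.insert(out.shape[1], column_name, counts)

    if keep_nonoverlapping:
        return out
    return out[out[column_name] != 0]
```

Filtering after the counts are attached means both modes share one alignment step, so the counts cannot drift out of order relative to the rows.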
 
 
 


=====================================
pyranges/methods/init.py
=====================================
@@ -19,37 +19,29 @@ def set_dtypes(df, int64):
             "Start": np.int32,
             "End": np.int32,
             "Chromosome": "category",
-            "Strand": "category"
+            "Strand": "category",
         }
     else:
         dtypes = {
             "Start": np.int64,
             "End": np.int64,
             "Chromosome": "category",
-            "Strand": "category"
+            "Strand": "category",
         }
 
-    if not "Strand" in df:
+    if "Strand" not in df:
         del dtypes["Strand"]
 
     # need to ascertain that object columns do not consist of multiple types
     # https://github.com/biocore-ntnu/epic2/issues/32
     for column in "Chromosome Strand".split():
-        if not column in df:
+        if column not in df:
             continue
-
-        if df[column].dtype == object and len(
-                df[column].apply(type).drop_duplicates()) > 1:
-            df[column] = df[column].astype(str)
-        elif df[column].dtype != object:
-            df[column] = df[column].astype(str)
+        df[column] = df[column].astype(str)
 
     for col, dtype in dtypes.items():
-
         if df[col].dtype.name != dtype:
-
             df[col] = df[col].astype(dtype)
-
     return df
 
 
@@ -80,18 +72,20 @@ def create_pyranges_df(chromosomes, starts, ends, strands=None):
 
         columns = [chromosomes, starts, ends, strands]
         lengths = list(str(len(s)) for s in columns)
-        assert len(
-            set(lengths)
-        ) == 1, "chromosomes, starts, ends and strands must be of equal length. But are {}".format(
-            ", ".join(lengths))
+        assert (
+            len(set(lengths)) == 1
+        ), "chromosomes, starts, ends and strands must be of equal length. But are {}".format(
+            ", ".join(lengths)
+        )
         colnames = "Chromosome Start End Strand".split()
     else:
         columns = [chromosomes, starts, ends]
         lengths = list(str(len(s)) for s in columns)
-        assert len(
-            set(lengths)
-        ) == 1, "chromosomes, starts and ends must be of equal length. But are {}".format(
-            ", ".join(lengths))
+        assert (
+            len(set(lengths)) == 1
+        ), "chromosomes, starts and ends must be of equal length. But are {}".format(
+            ", ".join(lengths)
+        )
         colnames = "Chromosome Start End".split()
 
     idx = range(len(starts))
@@ -119,7 +113,8 @@ def check_strandedness(df):
     contains_more_than_plus_minus_in_strand_col = False
 
     if str(df.Strand.dtype) == "category" and (
-            set(df.Strand.cat.categories) - set("+-")):
+        set(df.Strand.cat.categories) - set("+-")
+    ):
         contains_more_than_plus_minus_in_strand_col = True
     elif not ((df.Strand == "+") | (df.Strand == "-")).all():
         contains_more_than_plus_minus_in_strand_col = True
@@ -130,14 +125,16 @@ def check_strandedness(df):
     return not contains_more_than_plus_minus_in_strand_col
 
 
-def _init(self,
-          df=None,
-          chromosomes=None,
-          starts=None,
-          ends=None,
-          strands=None,
-          int64=False,
-          copy_df=True):
+def _init(
+    self,
+    df=None,
+    chromosomes=None,
+    starts=None,
+    ends=None,
+    strands=None,
+    int64=False,
+    copy_df=True,
+):
     # TODO: add categorize argument with dict of args to categorize?
 
     if isinstance(df, PyRanges):
@@ -183,7 +180,6 @@ def _init(self,
             else:
                 _has_strand = False
 
-
         if not all([_single_value_key, _key_same, _strand_valid]):
             df = pd.concat(empty_removed.values()).reset_index(drop=True)
 


=====================================
pyranges/methods/intersection.py
=====================================
@@ -61,7 +61,6 @@ def _intersection(scdf, ocdf, **kwargs):
 
 def _overlap(scdf, ocdf, **kwargs):
 
-    invert = kwargs["invert"]
     return_indexes = kwargs.get("return_indexes", False)
 
     if scdf.empty or ocdf.empty:
@@ -83,9 +82,6 @@ def _overlap(scdf, ocdf, **kwargs):
     else:
         _indexes = it.has_overlaps(starts, ends, indexes)
 
-    if invert:
-        _indexes = scdf.index.difference(_indexes)
-
     if return_indexes:
         return _indexes
 
@@ -98,16 +94,14 @@ def _count_overlaps(scdf, ocdf, **kwargs):
     idx = _overlap(scdf, ocdf, **kwargs)
 
     sx = pd.DataFrame(np.zeros(len(scdf)), index=scdf.index)
-    if idx is None:
-        return sx
 
-    vc = pd.Series(idx).value_counts(sort=False)
+    vc = pd.Series(idx, dtype=np.int64).value_counts(sort=False)
 
-    sx.iloc[vc.index, 0] = vc.values
+    sx.loc[vc.index, 0] = vc.values
 
-    sx.columns = ["__0__"]
+    scdf.insert(scdf.shape[1], kwargs["name"], sx)
 
-    return sx
+    return scdf
 
 
 # def _first_df(scdf, ocdf, kwargs):


=====================================
pyranges/methods/join.py
=====================================
@@ -63,7 +63,9 @@ def null_types(h):
         elif d == "str" or d == "object":
             null = "-1"
         elif d == "category":
-            h2.loc[:, n] = h2[:, n].cat.add_categories("-1")
+            tmp_cat = h2[n].copy()
+            tmp_cat = tmp_cat.cat.add_categories("-1")
+            h2[n] = tmp_cat
             null = "-1"
 
         h2.loc[:, n] = null
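
Note on the hunk above: a categorical column rejects values that are not registered categories, which is why "-1" is added as a category before the null value is assigned. A small illustration with a hand-made Series (the names `genes` and `widened` are made up for this sketch):

```python
import pandas as pd

genes = pd.Series(pd.Categorical(["a", "b"]), name="Gene")

# Assigning an unknown category fails (TypeError in recent pandas,
# ValueError in some older releases).
try:
    genes.copy().iloc[0] = "-1"
    rejected = False
except (TypeError, ValueError):
    rejected = True

# Registering the category first makes the assignment legal.
widened = genes.cat.add_categories("-1")
widened.iloc[:] = "-1"

print(rejected, list(widened))
```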


=====================================
pyranges/methods/k_nearest.py
=====================================
@@ -95,7 +95,7 @@ def nearest(d1, d2, **kwargs):
     d2 = d2.reindex(xdf.RX)
     d1.index = range(len(d1))
     d2.index = range(len(d1))
-    d2 = d2.drop("Chromosome", 1)
+    d2 = d2.drop("Chromosome", axis=1)
     df = d1.join(d2, rsuffix=suffix)
     df.insert(df.shape[1], "Distance", xdf.D.values)
 
@@ -119,7 +119,7 @@ def nearest_previous(d1, d2, **kwargs):
     d1.index = range(len(d1))
     d2.index = range(len(d1))
 
-    d2 = d2.drop("Chromosome", 1)
+    d2 = d2.drop("Chromosome", axis=1)
     df = d1.join(d2, rsuffix=suffix)
     df.insert(df.shape[1], "Distance", dist)
 
@@ -141,7 +141,7 @@ def nearest_next(d1, d2, **kwargs):
 
     d1 = d1.reindex(lidx)
     d2 = d2.reindex(ridx)
-    d2 = d2.drop("Chromosome", 1)
+    d2 = d2.drop("Chromosome", axis=1)
     d1.index = range(len(d1))
     d2.index = range(len(d1))
     df = d1.join(d2, rsuffix=suffix)


=====================================
pyranges/methods/max_disjoint.py
=====================================
@@ -0,0 +1,17 @@
+#!/usr/bin/env python
+
+from sorted_nearest import max_disjoint
+
+
+def _max_disjoint(df, **kwargs):
+
+    if df.empty:
+        return None
+
+    slack = kwargs.get("slack", 0)
+
+    cdf = df.sort_values("End")
+
+    idx = max_disjoint(cdf.index.values, cdf.Start.values, cdf.End.values, slack)
+
+    return cdf.reindex(idx)
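
Note on the file above: _max_disjoint sorts by End and delegates the selection to sorted_nearest.max_disjoint. The pure-Python sketch below shows the classic greedy approach for picking a maximal set of non-overlapping intervals. It is an assumption that the compiled routine behaves this way; in particular, the treatment of `slack` (required gap between a chosen interval's end and the next start) and of half-open coordinates is a guess made for illustration.

```python
def max_disjoint_sketch(intervals, slack=0):
    """Greedily pick non-overlapping half-open intervals.

    intervals: iterable of (index, start, end) tuples.
    Returns the indexes of the chosen intervals, ordered by end position.
    """
    chosen = []
    last_end = None
    # Visiting intervals by increasing end leaves the most room for later ones.
    for idx, start, end in sorted(intervals, key=lambda iv: iv[2]):
        if last_end is None or start >= last_end + slack:
            chosen.append(idx)
            last_end = end
    return chosen


if __name__ == "__main__":
    print(max_disjoint_sketch([(0, 1, 5), (1, 3, 7), (2, 5, 9), (3, 8, 12)]))
```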


=====================================
pyranges/methods/subtraction.py
=====================================
@@ -1,7 +1,49 @@
 import pandas as pd
+import numpy as np
+
 from ncls import NCLS
 
 
+def add_rows_per_group(df):
+    last_rows = df.groupby("__ix__").last().reset_index()
+    last_rows.loc[:, "__last__"] = True
+    df = pd.concat([df, last_rows], ignore_index=True)
+    df = df.sort_values("__ix__", ascending=True)
+    return df
+
+# def _subtraction(scdf, **kwargs):
+#     if scdf.empty:
+#         return scdf
+
+#     falses = np.zeros(len(scdf), dtype=bool)
+#     scdf.insert(scdf.shape[1], "__first__", falses)
+#     scdf.insert(scdf.shape[1], "__last__", falses)
+
+#     scdf = add_rows_per_group(scdf)
+
+#     scdf.insert(scdf.shape[1], "NewStart", scdf.End__deleteme__.shift(fill_value=-1))
+#     scdf.insert(scdf.shape[1], "NewEnd", scdf.Start__deleteme__)
+#     scdf.insert(scdf.shape[1], "__ix2__", np.arange(len(scdf)))
+
+#     first_rows = scdf.groupby(scdf.__ix__, as_index=False).first()
+#     scdf.loc[scdf.__ix2__.isin(first_rows.__ix2__), "__first__"] = True
+
+#     scdf.loc[:, "NewStart"] = np.where(scdf.__first__, scdf.Start, scdf.NewStart)
+
+#     scdf.loc[scdf.__first__ & ~(scdf.Start__deleteme__ >= scdf.Start), "NewStart"] = -1
+#     scdf.loc[:, "NewEnd"] = np.where(scdf.__last__, scdf.End, scdf.NewEnd)
+#     scdf.loc[:, "NewStart"] = np.where(scdf.__last__, scdf.End__deleteme__, scdf.NewStart)
+#     scdf.loc[scdf.__last__ & ~(scdf.End__deleteme__ <= scdf.End), ["NewEnd", "NewStart"]] = -1
+
+
+#     scdf = scdf[~((scdf.NewStart == -1) | (scdf.NewEnd == -1))]
+#     scdf = scdf.drop(["Start", "End"], axis=1)
+#     scdf.rename(columns={"NewStart": "Start", "NewEnd": "End"}, inplace=True)
+
+#     remove_mask = scdf.Start >= scdf.End
+
+#     return scdf[~remove_mask]
+
 def _subtraction(scdf, ocdf, **kwargs):
 
     if ocdf.empty or scdf.empty:
@@ -20,7 +62,7 @@ def _subtraction(scdf, ocdf, **kwargs):
     o = NCLS(ocdf.Start.values, ocdf.End.values, ocdf.index.values)
 
     idx_self, new_starts, new_ends = o.set_difference_helper(
-        scdf.Start.values, scdf.End.values, scdf.index.values)
+        scdf.Start.values, scdf.End.values, scdf.index.values, scdf.__num__.values)
 
     missing_idx = pd.Index(scdf.index).difference(idx_self)
 


=====================================
pyranges/multioverlap.py
=====================================
@@ -1,5 +1,4 @@
 import pyranges as pr
-import pandas as pd
 import numpy as np
 
 
@@ -20,7 +19,7 @@ def count_overlaps(grs, features=None, strandedness=None, how=None,  nb_cpu=1):
     strandedness : {None, "same", "opposite", False}, default None, i.e. auto
 
         Whether to compare PyRanges on the same strand, the opposite or ignore strand
-        information. The default, None, means use "same" if both PyRanges are strande,
+        information. The default, None, means use "same" if both PyRanges are stranded,
         otherwise ignore the strand information.
 
      how : {None, "all", "containment"}, default None, i.e. all
@@ -140,6 +139,8 @@ def count_overlaps(grs, features=None, strandedness=None, how=None,  nb_cpu=1):
 
     if features is None:
         features = pr.concat(grs.values()).split(between=True)
+    else:
+        features = features.copy()
 
     from pyranges.methods.intersection import _count_overlaps
 
@@ -147,13 +148,9 @@ def count_overlaps(grs, features=None, strandedness=None, how=None,  nb_cpu=1):
 
         gr = gr.drop()
 
+        kwargs["name"] = name
         res = features.apply_pair(gr, _count_overlaps, **kwargs)
 
-        setattr(features, name, res)
-
-        setattr(features, name, getattr(features, name).fillna(0))
-
-
     def to_int(df):
         df.loc[:, names] = df[names].astype(np.int32)
         return df
@@ -161,10 +158,3 @@ def count_overlaps(grs, features=None, strandedness=None, how=None,  nb_cpu=1):
     features = features.apply(to_int)
 
     return features
-
-# if __name__
-
-
-# if __name__ == "__main__":
-
-#     print(a)


=====================================
pyranges/multithreaded.py
=====================================
@@ -8,7 +8,6 @@ from natsort import natsorted
 
 import os
 
-from collections import defaultdict
 
 def get_n_args(f):
 
@@ -105,7 +104,6 @@ def process_results(results, keys):
     for k in to_delete:
         del results_dict[k]
 
-
     return results_dict
 
 
@@ -183,7 +181,6 @@ def get_multithreaded_funcs(function, nb_cpu):
 
     return function, get, _merge_dfs
 
-
 def pyrange_apply(function, self, other, **kwargs):
 
     nparams = get_n_args(function)
@@ -218,6 +215,11 @@ def pyrange_apply(function, self, other, **kwargs):
     items = natsorted(self.dfs.items())
     keys = natsorted(self.dfs.keys())
 
+    dummy = pd.DataFrame(columns="Chromosome Start End".split())
+
+    other_chromosomes = other.chromosomes
+    other_dfs = other.dfs
+
     if strandedness:
 
         for (c, s), df in items:
@@ -225,7 +227,7 @@ def pyrange_apply(function, self, other, **kwargs):
             os = strand_dict[s]
 
             if not (c, os) in other.keys() or len(other[c, os].values()) == 0:
-                odf = pd.DataFrame(columns="Chromosome Start End".split())
+                odf = dummy
             else:
                 odf = other[c, os].values()[0]
 
@@ -240,10 +242,10 @@ def pyrange_apply(function, self, other, **kwargs):
 
             for (c, s), df in items:
 
-                if not c in other.chromosomes:
-                    odf = pd.DataFrame(columns="Chromosome Start End".split())
+                if not c in other_chromosomes:
+                    odf = dummy
                 else:
-                    odf = other.dfs[c]
+                    odf = other_dfs[c]
 
                 df, odf = make_binary_sparse(kwargs, df, odf)
                 result = call_f(function, nparams, df, odf, kwargs)
@@ -253,11 +255,11 @@ def pyrange_apply(function, self, other, **kwargs):
 
             for c, df in items:
 
-                if not c in other.chromosomes:
-                    odf = pd.DataFrame(columns="Chromosome Start End".split())
+                if not c in other_chromosomes:
+                    odf = dummy
                 else:
-                    odf1 = other[c, "+"].df
-                    odf2 = other[c, "-"].df
+                    odf1 = other_dfs.get((c, "+"), dummy)
+                    odf2 = other_dfs.get((c, "-"), dummy)
 
                     odf = _merge_dfs.remote(odf1, odf2)
 
@@ -270,21 +272,20 @@ def pyrange_apply(function, self, other, **kwargs):
 
             for (c, s), df in self.items():
 
-                if not c in other.chromosomes:
-                    odfs = pr.PyRanges(
-                        pd.DataFrame(columns="Chromosome Start End".split()))
+                if not c in other_chromosomes:
+                    odfs = pr.PyRanges(dummy)
                 else:
-                    odfs = other[c].values()
+                    odfp = other_dfs.get((c, "+"), dummy)
+                    odfm = other_dfs.get((c, "-"), dummy)
 
-                # from pydbg import dbg
-                # dbg(odfs)
+                    odfs = [odfp, odfm]
 
                 if len(odfs) == 2:
                     odf = _merge_dfs.remote(*odfs)
                 elif len(odfs) == 1:
                     odf = odfs[0]
                 else:
-                    odf = pd.DataFrame(columns="Chromosome Start End".split())
+                    odf = dummy
 
                 df, odf = make_binary_sparse(kwargs, df, odf)
 
@@ -294,10 +295,10 @@ def pyrange_apply(function, self, other, **kwargs):
         else:
 
             for c, df in items:
-                if not c in other.chromosomes:
-                    odf = pd.DataFrame(columns="Chromosome Start End".split())
+                if not c in other_chromosomes:
+                    odf = dummy
                 else:
-                    odf = other.dfs[c]
+                    odf = other_dfs[c]
 
                 df, odf = make_binary_sparse(kwargs, df, odf)
 
@@ -388,7 +389,6 @@ def pyrange_apply_single(function, self, **kwargs):
 
     results = process_results(results, keys)
 
-
     return results
 
 
@@ -402,40 +402,34 @@ def _lengths(df):
 def _tss(df, **kwargs):
 
     df = df.copy(deep=True)
-
+    dtype = df.dtypes["Start"]
     slack = kwargs.get("slack", 0)
 
-    tss_pos = df.loc[df.Strand == "+"]
-
-    tss_neg = df.loc[df.Strand == "-"]
-
-    # pd.options.mode.chained_assignment = None
-    tss_neg.loc[:, "Start"] = tss_neg.End
+    starts = np.where(df.Strand == "+", df.Start, df.End - 1)
+    ends = starts + slack + 1
+    starts = starts - slack
+    starts = np.where(starts < 0, 0, starts)
 
-    # pd.options.mode.chained_assignment = "warn"
-    tss = pd.concat([tss_pos, tss_neg], sort=False)
-    tss["End"] = tss.Start
-    tss.End = tss.End + 1 + slack
-    tss.Start = tss.Start - slack
-    tss.loc[tss.Start < 0, "Start"] = 0
-
-    return tss.reindex(df.index)
+    df.loc[:, "Start"] = starts.astype(dtype)
+    df.loc[:, "End"] = ends.astype(dtype)
 
+    return df
 
 def _tes(df, **kwargs):
 
-    df = df.copy()
-    if df.Strand.iloc[0] == "+":
-        df.loc[:, "Start"] = df.End
-    else:
-        df.loc[:, "End"] = df.Start
+    df = df.copy(deep=True)
+    dtype = df.dtypes["Start"]
+    slack = kwargs.get("slack", 0)
 
-    df.loc[:, "Start"] = df.End
-    df.loc[:, "End"] = df.End + 1
-    df.loc[:, "Start"] = df.Start
-    df.loc[df.Start < 0, "Start"] = 0
+    starts = np.where(df.Strand == "+", df.End - 1, df.Start)
+    ends = starts + 1 + slack
+    starts = starts - slack
+    starts = np.where(starts < 0, 0, starts)
 
-    return df.reindex(df.index)
+    df.loc[:, "Start"] = starts.astype(dtype)
+    df.loc[:, "End"] = ends.astype(dtype)
+
+    return df
 
 
 def _slack(df, **kwargs):
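
Note on the _tss/_tes hunk above: the rewritten code picks the 5' position per row with np.where, so the result is always a 1 bp interval (plus slack) inside the original feature: Start for "+" rows and End - 1 for "-" rows. A standalone sketch of the 5' case follows. The name `five_prime_sketch` is hypothetical, and `np.maximum` is used here in place of the second np.where to clamp at zero.

```python
import numpy as np
import pandas as pd


def five_prime_sketch(df, slack=0):
    """Return 1 bp intervals at the 5' end of each row, widened by slack."""
    out = df.copy()
    dtype = out["Start"].dtype

    # "+" features begin at Start; "-" features begin at their last base.
    starts = np.where(out["Strand"] == "+", out["Start"], out["End"] - 1)
    ends = starts + 1 + slack
    starts = np.maximum(starts - slack, 0)  # never extend below coordinate 0

    out["Start"] = starts.astype(dtype)
    out["End"] = ends.astype(dtype)
    return out
```

The 3' case mirrors this by swapping the two np.where branches (End - 1 for "+", Start for "-").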


=====================================
pyranges/pyranges.py
=====================================
@@ -239,7 +239,6 @@ class PyRanges():
 
         # self.apply()
 
-
     def __getattr__(self, name):
 
         """Return column.
@@ -313,6 +312,11 @@ class PyRanges():
         else:
             _setattr(self, column_name, column)
 
+            if column_name in ["Start", "End"]:
+                if self.dtypes["Start"] != self.dtypes["End"]:
+                    print("Warning! Start and End columns now have different dtypes: {} and {}".format(
+                        self.dtypes["Start"], self.dtypes["End"]))
+
     def __getitem__(self, val):
 
         """Fetch columns or subset on position.
@@ -487,7 +491,11 @@ class PyRanges():
 
         return str(self)
 
+    def _repr_html_(self):
 
+        """Return REPL HTML representation for Jupyter Noteboooks."""
+
+        return self.df._repr_html_()
 
     def apply(self, f, strand=None, as_pyranges=True, nb_cpu=1, **kwargs):
 
@@ -916,7 +924,7 @@ class PyRanges():
             type(first_result))
 
         # do a deepcopy of object
-        new_self = pr.PyRanges({k: v.copy() for k, v in self.items()})
+        new_self = self.copy()
         new_self.__setattr__(col, result)
 
         return new_self
@@ -1846,7 +1854,7 @@ class PyRanges():
         return self
 
 
-    def intersect(self, other, strandedness=None, how=None, nb_cpu=1):
+    def intersect(self, other, strandedness=None, how=None, invert=False, nb_cpu=1):
 
         """Return overlapping subintervals.
 
@@ -1869,6 +1877,10 @@ class PyRanges():
             What intervals to report. By default reports all overlapping intervals. "containment"
             reports intervals where the overlapping is contained within it.
 
+        invert : bool, default False
+
+            Whether to return the intervals without overlaps.
+
         nb_cpu: int, default 1
 
             How many cpus to use. Can at most use 1 per chromosome or chromosome/strand tuple.
@@ -1954,9 +1966,21 @@ class PyRanges():
         kwargs = fill_kwargs(kwargs)
         kwargs["sparse"] = {"self": False, "other": True}
 
+        if len(self) == 0:
+            return self
+
+        if invert:
+            self.__ix__ = np.arange(len(self))
+
         dfs = pyrange_apply(_intersection, self, other, **kwargs)
+        result = pr.PyRanges(dfs)
 
-        return PyRanges(dfs)
+        if invert:
+            found_idxs = getattr(result, "__ix__", [])
+            result = self[~self.__ix__.isin(found_idxs)]
+            result = result.drop("__ix__")
+
+        return result
 
     def items(self):
 
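The invert branch above tags every row of self with a temporary index, runs the ordinary overlap search, and returns the rows whose index was not found. A minimal pure-Python sketch of that pattern follows; the names overlaps and non_overlapping are hypothetical and not part of pyranges, and intervals are assumed to be half-open (start, end) tuples on a single chromosome.

```python
def overlaps(a, b):
    """Half-open intervals (start, end) overlap when each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]


def non_overlapping(intervals, others):
    """Return the intervals that overlap nothing in `others` (hypothetical helper)."""
    # Tag every row with its position, mirroring the temporary __ix__ column.
    tagged = list(enumerate(intervals))
    # Indices of rows that overlap at least one interval in `others`.
    found = {ix for ix, iv in tagged if any(overlaps(iv, o) for o in others)}
    # The inverted result is the complement; the tag is dropped on return.
    return [iv for ix, iv in tagged if ix not in found]
```

The quadratic scan is only for illustration; the diff itself delegates the overlap search to pyrange_apply.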
@@ -1988,7 +2012,7 @@ class PyRanges():
 
         return natsorted([(k, df) for (k, df) in self.dfs.items()])
 
-    def join(self, other, strandedness=None, how=None, report_overlap=False, slack=0, suffix="_b", nb_cpu=1):
+    def join(self, other, strandedness=None, how=None, report_overlap=False, slack=0, suffix="_b", nb_cpu=1, apply_strand_suffix=None):
 
         """Join PyRanges on genomic location.
 
@@ -2011,7 +2035,7 @@ class PyRanges():
 
         report_overlap : bool, default False
 
-            Report amount of overlap in base pairs. 
+            Report amount of overlap in base pairs.
 
         slack : int, default 0
 
@@ -2021,6 +2045,11 @@ class PyRanges():
 
             Suffix to give overlapping columns in other.
 
+        apply_strand_suffix : bool, default None
+
+            If first pyranges is unstranded, but the second is not, the first will be given a strand column.
+            apply_strand_suffix makes the added strand column a regular data column instead by adding a suffix.
+
         nb_cpu: int, default 1
 
             How many cpus to use. Can at most use 1 per chromosome or chromosome/strand tuple.
@@ -2113,9 +2142,9 @@ class PyRanges():
 
         from pyranges.methods.join import _write_both
 
-        kwargs = {"strandedness": strandedness, "how": how, "report_overlap":report_overlap, "suffix": suffix, "nb_cpu": nb_cpu}
-        # slack = kwargs.get("slack")
+        kwargs = {"strandedness": strandedness, "how": how, "report_overlap":report_overlap, "suffix": suffix, "nb_cpu": nb_cpu, "apply_strand_suffix": apply_strand_suffix}
         if slack:
+            self = self.copy()
             self.Start__slack = self.Start
             self.End__slack = self.End
 
@@ -2127,13 +2156,6 @@ class PyRanges():
 
         kwargs = fill_kwargs(kwargs)
 
-        # if "new_pos" in kwargs:
-        #     if kwargs["new_pos"] in "intersection union".split():
-        #         suffixes = kwargs.get("suffixes")
-        #         assert suffixes is not None, "Must give two non-empty suffixes when using new_pos with intersection or union."
-        #         assert suffixes[0], "Must have nonempty first suffix when using new_pos with intersection or union."
-        #         assert suffixes[1], "Must have nonempty second suffix when using new_pos with intersection or union."
-
         how = kwargs.get("how")
 
         if how in ["left", "outer"]:
@@ -2144,14 +2166,17 @@ class PyRanges():
         dfs = pyrange_apply(_write_both, self, other, **kwargs)
         gr = PyRanges(dfs)
 
-        if slack:
+        if slack and len(gr) > 0:
             gr.Start = gr.Start__slack
             gr.End = gr.End__slack
             gr = gr.drop(like="(Start|End).*__slack")
 
-        # new_position = kwargs.get("new_pos")
-        # if new_position:
-        #     gr = gr.new_position(new_pos=new_position, suffixes=kwargs["suffixes"])
+        if not self.stranded and other.stranded:
+            if apply_strand_suffix is None:
+                import sys
+                print("join: Strand data from other will be added as strand data to self.\nIf this is undesired use the flag apply_strand_suffix=False.\nTo turn off the warning set apply_strand_suffix to True or False.", file=sys.stderr)
+            elif apply_strand_suffix:
+                gr.columns = gr.columns.str.replace("Strand", "Strand" + kwargs["suffix"])
 
         return gr
 
@@ -2183,7 +2208,7 @@ class PyRanges():
 
         return natsorted(self.dfs.keys())
 
-    def k_nearest(self, other, k=1, ties=None, strandedness=None, overlap=True, how=None, suffix="_b", nb_cpu=1):
+    def k_nearest(self, other, k=1, ties=None, strandedness=None, overlap=True, how=None, suffix="_b", nb_cpu=1, apply_strand_suffix=None):
 
         """Find k nearest intervals.
 
@@ -2199,7 +2224,7 @@ class PyRanges():
 
         ties : {None, "first", "last", "different"}, default None
 
-            How to resolve ties, i.e. closest intervals with equal distance. None means that ...
+            How to resolve ties, i.e. closest intervals with equal distance. None means that the k nearest intervals are kept.
             "first" means that the first tie is kept, "last" means that the last is kept.
             "different" means that all nearest intervals with the k unique nearest distances are kept.
 
@@ -2222,6 +2247,12 @@ class PyRanges():
 
             Suffix to give columns with shared name in other.
 
+        apply_strand_suffix : bool, default None
+
+            If first pyranges is unstranded, but the second is not, the first will be given a strand column.
+            apply_strand_suffix makes the added strand column a regular data column instead by adding a suffix.
+
+
         nb_cpu: int, default 1
 
             How many cpus to use. Can at most use 1 per chromosome or chromosome/strand tuple.
@@ -2404,86 +2435,49 @@ class PyRanges():
         overlap = kwargs.get("overlap", True)
         ties = kwargs.get("ties", False)
 
-        self = pr.PyRanges({k: v.copy() for k, v in self.dfs.items()})
+        self = self.copy()
 
         try: # if k is a Series
             k = k.values
         except:
             pass
 
+        # how many nearest intervals to find; might be different for each
         self.__k__ = k
+        # give each their own unique ID
         self.__IX__ = np.arange(len(self))
 
-
-        # from time import time
-        # start = time()
         dfs = pyrange_apply(_nearest, self, other, **kwargs)
-        # end = time()
-        # print("nearest", end - start)
-
         nearest = PyRanges(dfs)
-        # nearest.msp()
-        # raise
-        # print("nearest len", len(nearest))
 
         if not overlap:
-            # self = self.drop(like="__k__|__IX__")
-            result = nearest#.drop(like="__k__|__IX__")
+            result = nearest
         else:
             from collections import defaultdict
-            # overlap_kwargs = {k: v for k, v in kwargs.items()}
-            # print("kwargs ties:", kwargs.get("ties"))
             overlap_how = defaultdict(lambda: None, {"first": "first", "last": "last"})[kwargs.get("ties")]
-            # start = time()
-            overlaps = self.join(other, strandedness=strandedness, how=overlap_how, nb_cpu=nb_cpu)
-            # end = time()
-            # print("overlaps", end - start)
+            overlaps = self.join(other, strandedness=strandedness, how=overlap_how, nb_cpu=nb_cpu, apply_strand_suffix=apply_strand_suffix)
             overlaps.Distance = 0
-            # print("overlaps len", len(overlaps))
-
             result = pr.concat([overlaps, nearest])
 
         if not len(result):
             return pr.PyRanges()
-        # print(result)
-        # print(overlaps.drop(like="__").df)
-        # raise
-
-        # start = time()
         new_result = {}
         if ties in ["first", "last"]:
-            # method = "tail" if ties == "last" else "head"
-            # keep = "last" if ties == "last" else "first"
-
             for c, df in result:
-                # start = time()
-                # print(c)
-                # print(df)
-
                 df = df.sort_values(["__IX__", "Distance"])
                 grpby = df.groupby("__k__", sort=False)
                 dfs = []
                 for k, kdf in grpby:
-                    # print("k", k)
-                    # print(kdf)
-                    # dist_bool = ~kdf.Distance.duplicated(keep=keep)
-                    # print(dist_bool)
-                    # kdf = kdf[dist_bool]
                     grpby2 = kdf.groupby("__IX__", sort=False)
-                    # f = getattr(grpby2, method)
                     _df = grpby2.head(k)
-                    # print(_df)
                     dfs.append(_df)
-                # raise
 
                 if dfs:
                     new_result[c] = pd.concat(dfs)
-                # print(new_result[c])
+
         elif ties == "different" or not ties:
             for c, df in result:
 
-                # print(df)
-
                 if df.empty:
                     continue
                 dfs = []
@@ -2491,27 +2485,14 @@ class PyRanges():
                 df = df.sort_values(["__IX__", "Distance"])
                 grpby = df.groupby("__k__", sort=False)
 
-                # for each index
-                # want to keep until we have k
-                # then keep all with same distance
                 for k, kdf in grpby:
-                    # print("kdf " * 10)
-                    # print("k " * 5, k)
-                    # print(kdf["__IX__ Distance".split()])
-                    # print(kdf.dtypes)
-                    # print(kdf.index.dtypes)
-                    # if ties:
                     if ties:
                         lx = get_different_ties(kdf.index.values, kdf.__IX__.values, kdf.Distance.astype(np.int64).values, k)
+                        _df = kdf.reindex(lx)
                     else:
                         lx = get_all_ties(kdf.index.values, kdf.__IX__.values, kdf.Distance.astype(np.int64).values, k)
-                    # print(lx)
-
-
-                    # else:
-                    #     lx = get_all_ties(kdf.index.values, kdf.__IX__.values, kdf.Distance.astype(np.int64).values, k)
-                    _df = kdf.reindex(lx)
-                    # print("_df", _df)
+                        _df = kdf.reindex(lx)
+                        _df = _df.groupby("__IX__").head(k)
                     dfs.append(_df)
 
                 if dfs:
@@ -2539,12 +2520,14 @@ class PyRanges():
             df.loc[bools, "Distance"] = -df.loc[bools, "Distance"]
             return df
 
-        # print(result)
         result = result.apply(prev_to_neg, suffix=kwargs["suffix"])
-        # print(result)
 
-        # end = time()
-        # print("final stuff", end - start)
+        if not self.stranded and other.stranded:
+            if apply_strand_suffix is None:
+                import sys
+                print("k_nearest: Strand data from other will be added as strand data to self.\nIf this is undesired use the flag apply_strand_suffix=False.\nTo turn off the warning set apply_strand_suffix to True or False.", file=sys.stderr)
+            elif apply_strand_suffix:
+                result.columns = result.columns.str.replace("Strand", "Strand" + kwargs["suffix"])
 
         return result
 
@@ -2664,8 +2647,64 @@ class PyRanges():
 
             return pd.concat(_lengths).reset_index(drop=True)
 
+    def max_disjoint(self, strand=None, slack=0, **kwargs):
+
+        """Find the maximal disjoint set of intervals.
+
+        Parameters
+        ----------
+        strand : bool, default None, i.e. auto
+
+            Find the max disjoint set separately for each strand.
+
+        slack : int, default 0
+
+            Consider intervals within a distance of slack to be overlapping.
+
+        Returns
+        -------
+        PyRanges
+
+            PyRanges with maximal disjoint set of intervals.
+
+        Examples
+        --------
+        >>> gr = pr.data.f1()
+        +--------------+-----------+-----------+------------+-----------+--------------+
+        | Chromosome   |     Start |       End | Name       |     Score | Strand       |
+        | (category)   |   (int32) |   (int32) | (object)   |   (int64) | (category)   |
+        |--------------+-----------+-----------+------------+-----------+--------------|
+        | chr1         |         3 |         6 | interval1  |         0 | +            |
+        | chr1         |         8 |         9 | interval3  |         0 | +            |
+        | chr1         |         5 |         7 | interval2  |         0 | -            |
+        +--------------+-----------+-----------+------------+-----------+--------------+
+        Stranded PyRanges object has 3 rows and 6 columns from 1 chromosomes.
+        For printing, the PyRanges was sorted on Chromosome and Strand.
+
+        >>> gr.max_disjoint(strand=False)
+        +--------------+-----------+-----------+------------+-----------+--------------+
+        | Chromosome   |     Start |       End | Name       |     Score | Strand       |
+        | (category)   |   (int32) |   (int32) | (object)   |   (int64) | (category)   |
+        |--------------+-----------+-----------+------------+-----------+--------------|
+        | chr1         |         3 |         6 | interval1  |         0 | +            |
+        | chr1         |         8 |         9 | interval3  |         0 | +            |
+        +--------------+-----------+-----------+------------+-----------+--------------+
+        Stranded PyRanges object has 2 rows and 6 columns from 1 chromosomes.
+        For printing, the PyRanges was sorted on Chromosome and Strand.
+        """
+
+        if strand is None:
+            strand = self.stranded
+
+        kwargs = {"strand": strand, "slack": slack}
+        kwargs = fill_kwargs(kwargs)
 
-    def merge(self, strand=None, count=False, count_col="Count", by=None):
+        from pyranges.methods.max_disjoint import _max_disjoint
+        df = pyrange_apply_single(_max_disjoint, self, **kwargs)
+
+        return pr.PyRanges(df)
+
+    def merge(self, strand=None, count=False, count_col="Count", by=None, slack=0):
 
         """Merge overlapping intervals into one.
 
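The max_disjoint docstring above keeps interval1 and interval3 and drops interval2. One way to reproduce that result is the classic greedy selection by earliest end, sketched below in pure Python. This is an illustration only: the diff does not show _max_disjoint, so the algorithm, the function name max_disjoint_sketch, and the boundary convention for slack (an interval is kept when it starts at least slack positions after the previously kept interval ends) are all assumptions.

```python
def max_disjoint_sketch(intervals, slack=0):
    """Greedy earliest-end selection of non-overlapping (start, end) tuples.

    Hypothetical illustration; not the pyranges implementation.
    """
    kept = []
    last_end = None
    # Visiting intervals by increasing end leaves the most room for later ones.
    for start, end in sorted(intervals, key=lambda iv: iv[1]):
        if last_end is None or start >= last_end + slack:
            kept.append((start, end))
            last_end = end
    return kept
```

With the three intervals from the docstring, (3, 6), (8, 9) and (5, 7), the sketch keeps (3, 6) and (8, 9), matching the printed output for strand=False.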
@@ -2687,6 +2726,10 @@ class PyRanges():
 
             Only merge intervals with equal values in these columns.
 
+        slack : int, default 0
+
+            Allow this many nucleotides between each interval to merge.
+
         Returns
         -------
         PyRanges
@@ -2784,7 +2827,7 @@ class PyRanges():
         if strand is None:
             strand = self.stranded
 
-        kwargs = {"strand": strand, "count": count, "by": by, "count_col": count_col}
+        kwargs = {"strand": strand, "count": count, "by": by, "count_col": count_col, "slack": slack}
 
         if not kwargs["by"]:
             kwargs["sparse"] = {"self": True}
@@ -2859,7 +2902,7 @@ class PyRanges():
         return self
 
 
-    def nearest(self, other, strandedness=None, overlap=True, how=None, suffix="_b", nb_cpu=1):
+    def nearest(self, other, strandedness=None, overlap=True, how=None, suffix="_b", nb_cpu=1, apply_strand_suffix=None):
 
         """Find closest interval.
 
@@ -2888,6 +2931,11 @@ class PyRanges():
 
             Suffix to give columns with shared name in other.
 
+        apply_strand_suffix : bool, default None
+
+            If first pyranges is unstranded, but the second is not, the first will be given the strand column of the second.
+            apply_strand_suffix makes the added strand column a regular data column instead by adding a suffix.
+
         nb_cpu: int, default 1
 
             How many cpus to use. Can at most use 1 per chromosome or chromosome/strand tuple.
@@ -2967,14 +3015,22 @@ class PyRanges():
 
         from pyranges.methods.nearest import _nearest
 
-        kwargs = {"strandedness": strandedness, "how": how, "overlap": overlap, "nb_cpu": nb_cpu, "suffix": suffix}
+        kwargs = {"strandedness": strandedness, "how": how, "overlap": overlap, "nb_cpu": nb_cpu, "suffix": suffix, "apply_strand_suffix": apply_strand_suffix}
         kwargs = fill_kwargs(kwargs)
         if kwargs.get("how") in "upstream downstream".split():
             assert other.stranded, "If doing upstream or downstream nearest, other pyranges must be stranded"
 
         dfs = pyrange_apply(_nearest, self, other, **kwargs)
+        gr = PyRanges(dfs)
 
-        return PyRanges(dfs)
+        if not self.stranded and other.stranded:
+            if apply_strand_suffix is None:
+                import sys
+                print("nearest: Strand data from other will be added as strand data to self.\nIf this is undesired use the flag apply_strand_suffix=False.\nTo turn off the warning set apply_strand_suffix to True or False.", file=sys.stderr)
+            elif apply_strand_suffix:
+                gr.columns = gr.columns.str.replace("Strand", "Strand" + kwargs["suffix"])
+
+        return gr
 
 
     def new_position(self, new_pos, columns=None):
@@ -3132,7 +3188,7 @@ class PyRanges():
         return pr.PyRanges(dfs)
 
 
-    def overlap(self, other, strandedness=None, how="first", invert=False):
+    def overlap(self, other, strandedness=None, how="first", invert=False, nb_cpu=1):
 
         """Return overlapping intervals.
 
@@ -3155,6 +3211,10 @@ class PyRanges():
             What intervals to report. By default reports every interval in self with overlap once.
             "containment" reports all intervals where the overlapping is contained within it.
 
+        invert : bool, default False
+
+            Whether to return the intervals without overlaps.
+
         nb_cpu: int, default 1
 
             How many cpus to use. Can at most use 1 per chromosome or chromosome/strand tuple.
@@ -3246,15 +3306,28 @@ class PyRanges():
         For printing, the PyRanges was sorted on Chromosome.
         """
 
-        kwargs = {"strandedness": strandedness}
+        kwargs = {"strandedness": strandedness, "nb_cpu": nb_cpu}
         kwargs["sparse"] = {"self": False, "other": True}
         kwargs["how"] = how
         kwargs["invert"] = invert
         kwargs = fill_kwargs(kwargs)
 
+        if len(self) == 0:
+            return self
+
+        if invert:
+            self = self.copy()
+            self.__ix__ = np.arange(len(self))
+
         dfs = pyrange_apply(_overlap, self, other, **kwargs)
+        result = pr.PyRanges(dfs)
 
-        return pr.PyRanges(dfs)
+        if invert:
+            found_idxs = getattr(result, "__ix__", [])
+            result = self[~self.__ix__.isin(found_idxs)]
+            result = result.drop("__ix__")
+
+        return result
 
     def pc(self, n=8, formatting=None):
 
@@ -4310,9 +4383,14 @@ class PyRanges():
         strand = True if strandedness else False
         other_clusters = other.merge(strand=strand)
 
+        self = self.count_overlaps(other_clusters, strandedness=strandedness, overlap_col="__num__")
+
         result = pyrange_apply(_subtraction, self, other_clusters, **kwargs)
 
-        return PyRanges(result)
+        self = self.drop("__num__")
+
+        return PyRanges(result).drop("__num__")
+
 
     def summary(self, to_stdout=True, return_df=False):
 


=====================================
pyranges/readers.py
=====================================
@@ -186,6 +186,11 @@ def read_bam(f, sparse=True, as_df=False, mapq=0, required_flag=0, filter_flag=1
         print("bamread must be installed to read bam. Use `conda install -c bioconda bamread` or `pip install bamread` to install it.")
         sys.exit(1)
 
+    if bamread.__version__ in ["0.0.1", "0.0.2", "0.0.3", "0.0.4",
+                               "0.0.5", "0.0.6", "0.0.7", "0.0.8", "0.0.9"]:
+        print("bamread not recent enough. Must be 0.0.10 or higher. Use `conda install -c bioconda 'bamread>=0.0.10'` or `pip install 'bamread>=0.0.10'` to install it.")
+        sys.exit(1)
+
     if sparse:
         df = bamread.read_bam(f, mapq, required_flag, filter_flag)
     else:
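The bamread check above enumerates every too-old version string, so the list has to be extended by hand for any release it misses. A hedged alternative is to compare numeric version tuples, sketched below with the standard library only. The helpers version_tuple and is_recent_enough are hypothetical names, not part of pyranges or bamread, and the parsing assumes dotted numeric versions such as "0.0.10".

```python
def version_tuple(version):
    """Turn a dotted version string into a tuple of ints (hypothetical helper).

    Only the leading digits of each dotted piece are used, so "10rc1" counts as 10.
    """
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def is_recent_enough(installed, minimum="0.0.10"):
    """True when `installed` is at least `minimum`, compared numerically."""
    return version_tuple(installed) >= version_tuple(minimum)
```

Comparing tuples avoids the string-ordering trap where "0.0.9" sorts after "0.0.10".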
@@ -254,6 +259,10 @@ def read_gtf(f, full=True, as_df=False, nrows=None, duplicate_attr=False):
 
         Path to GTF file.
 
+    full : bool, default True
+
+        Whether to read and interpret the annotation column.
+
     as_df : bool, default False
 
         Whether to return as pandas DataFrame instead of PyRanges.
@@ -298,7 +307,10 @@ def read_gtf(f, full=True, as_df=False, nrows=None, duplicate_attr=False):
 
     _skiprows = skiprows(f)
 
-    gr = read_gtf_full(f, as_df, nrows, _skiprows, duplicate_attr)
+    if full:
+        gr = read_gtf_full(f, as_df, nrows, _skiprows, duplicate_attr)
+    else:
+        gr = read_gtf_restricted(f, _skiprows, as_df=as_df, nrows=nrows)
 
     return gr
 
@@ -357,10 +369,6 @@ def to_rows(anno):
                                       # l[:-1] removes final ";" cheaply
                                       for kv in l[:-1].split("; ")]})
 
-    # for l in anno:
-    #     l = l.replace('"', '').replace(";", "").split()
-    #     rowdicts.append({k: v for k, v in zip(*([iter(l)] * 2))})
-
     return pd.DataFrame.from_dict(rowdicts).set_index(anno.index)
 
 
@@ -388,8 +396,8 @@ def to_rows_keep_duplicates(anno):
 
 
 def read_gtf_restricted(f,
+                        skiprows,
                         as_df=False,
-                        skiprows=0,
                         nrows=None):
     """seqname - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix. Important note: the seqname must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly. See the example GFF output below.
     # source - name of the program that generated this feature, or the data source (database or project name)
@@ -415,6 +423,7 @@ def read_gtf_restricted(f,
         names="Chromosome Feature Start End Score Strand Attribute".split(),
         dtype=dtypes,
         chunksize=int(1e5),
+        skiprows=skiprows,
         nrows=nrows)
 
     dfs = []
@@ -457,7 +466,7 @@ def to_rows_gff3(anno):
     return pd.DataFrame.from_dict(rowdicts).set_index(anno.index)
 
 
-def read_gff3(f, annotation=None, as_df=False, nrows=None, skiprows=0):
+def read_gff3(f, full=True, annotation=None, as_df=False, nrows=None):
 
     """Read files in the General Feature Format.
 
@@ -467,6 +476,10 @@ def read_gff3(f, annotation=None, as_df=False, nrows=None, skiprows=0):
 
         Path to GFF file.
 
+    full : bool, default True
+
+        Whether to read and interpret the annotation column.
+
     as_df : bool, default False
 
         Whether to return as pandas DataFrame instead of PyRanges.
@@ -481,6 +494,10 @@ def read_gff3(f, annotation=None, as_df=False, nrows=None, skiprows=0):
     pyranges.read_gtf : read files in the Gene Transfer Format
     """
 
+    _skiprows = skiprows(f)
+
+    if not full:
+        return read_gtf_restricted(f, _skiprows, as_df=as_df, nrows=nrows)
 
     dtypes = {
         "Chromosome": "category",
@@ -499,7 +516,7 @@ def read_gff3(f, annotation=None, as_df=False, nrows=None, skiprows=0):
         names=names,
         dtype=dtypes,
         chunksize=int(1e5),
-        skiprows=skiprows,
+        skiprows=_skiprows,
         nrows=nrows)
 
     dfs = []


=====================================
pyranges/version.py
=====================================
@@ -1 +1 @@
-__version__ = "0.0.85"
+__version__ = "0.0.111"


=====================================
setup.py
=====================================
@@ -6,7 +6,7 @@ __version__ = open("pyranges/version.py").readline().split(" = ")[1].replace(
     '"', '').strip()
 
 install_requires = [
-    "cython", "pandas", "ncls>=0.0.50", "tabulate", "sorted_nearest>=0.0.30", "pyrle",
+    "cython", "pandas", "ncls>=0.0.62", "tabulate", "sorted_nearest>=0.0.33", "pyrle",
     "natsort"] #,
 
 # optional_requires = ["bamread", "pybigwig", "ray"]


=====================================
tests/helpers.py
=====================================
@@ -3,6 +3,12 @@ import pandas as pd
 
 def assert_df_equal(df1, df2):
 
+    print("-"*100)
+    print("df1")
+    print(df1)
+    print("df2")
+    print(df2)
+
     # df1.loc[:, "Start"] = df1.Start.astype(np.int64)
     # df2.loc[:, "Start"] = df1.Start.astype(np.int64)
     # df1.loc[:, "End"] = df1.End.astype(np.int64)


=====================================
tests/test_binary.py
=====================================
@@ -100,9 +100,7 @@ def compare_results_nearest(bedtools_df, result):
 
     result = result.df
 
-
     if not len(result) == 0:
-
         bedtools_df = bedtools_df.sort_values("Start End Distance".split())
         result = result.sort_values("Start End Distance".split())
         result_df = result["Chromosome Start End Strand Distance".split()]
@@ -239,6 +237,11 @@ def test_coverage(gr, gr2, strandedness):
 
     result = gr.coverage(gr2, strandedness=strandedness)
 
+    print("pyranges")
+    print(result.df)
+    print("bedtools")
+    print(bedtools_df)
+
     # assert len(result) > 0
     assert np.all(
         bedtools_df.NumberOverlaps.values == result.NumberOverlaps.values)
@@ -301,9 +304,12 @@ def test_subtraction(gr, gr2, strandedness):
         names="Chromosome Start End Name Score Strand".split(),
         sep="\t")
 
+    print("subtracting" * 50)
     result = gr.subtract(gr2, strandedness=strandedness)
 
+    print("bedtools_result")
     print(bedtools_df)
+    print("PyRanges result:")
     print(result)
 
     compare_results(bedtools_df, result)
@@ -551,3 +557,36 @@ def test_k_nearest(gr, gr2, nearest_how, overlap, strandedness, ties):
     print(result)
 
     compare_results_nearest(bedtools_df, result)
+
+
+# @settings(
+#     max_examples=max_examples,
+#     deadline=deadline,
+#     print_blob=True,
+#     suppress_health_check=HealthCheck.all())
+# @given(gr=dfs_min())  # pylint: disable=no-value-for-parameter
+# def test_k_nearest_nearest_self_same_size(gr):
+
+#     result = gr.k_nearest(
+#         gr, k=1, strandedness=None, overlap=True, how=None, ties="first")
+
+#     assert len(result) == len(gr)
+
+@settings(
+    max_examples=max_examples,
+    deadline=deadline,
+    print_blob=True,
+    suppress_health_check=HealthCheck.all())
+@given(gr=dfs_min(), gr2=dfs_min())  # pylint: disable=no-value-for-parameter
+def test_k_nearest_1_vs_nearest(gr, gr2):
+
+    result_k = gr.k_nearest(gr2, k=1, strandedness=None, overlap=True, how=None)
+    if len(result_k) > 0:
+        result_k.Distance = result_k.Distance.abs()
+
+    result_n = gr.nearest(gr2, strandedness=None, overlap=True, how=None)
+
+    if len(result_k) == 0 and len(result_n) == 0:
+        pass
+    else:
+        assert (result_k.sort().Distance.abs() == result_n.sort().Distance).all()


=====================================
tests/test_concat.py
=====================================
@@ -0,0 +1,31 @@
+#!/usr/bin/env python3
+
+import pytest
+import pyranges as pr
+
+def assert_equal_length_before_after(gr1, gr2):
+
+    print("in test")
+    l1 = len(gr1)
+    l2 = len(gr2)
+    c = pr.concat([gr1, gr2])
+
+    if not gr1.stranded or not gr2.stranded:
+        assert not c.stranded
+
+    lc = len(c)
+    assert l1 + l2 == lc
+
+def test_concat_stranded_stranded(f1, f2):
+
+    assert_equal_length_before_after(f1, f2)
+
+def test_concat_unstranded_unstranded(f1, f2):
+
+    assert_equal_length_before_after(f1.unstrand(), f2.unstrand())
+
+def test_concat_stranded_unstranded(f1, f2):
+    assert_equal_length_before_after(f1, f2.unstrand())
+
+def test_concat_unstranded_stranded(f1, f2):
+    assert_equal_length_before_after(f1.unstrand(), f2)


=====================================
tests/test_count_overlaps.py
=====================================
@@ -0,0 +1,62 @@
+import numpy as np
+import pyranges as pr
+from tests.helpers import assert_df_equal
+
+a = '''Chromosome Start End Strand
+chr1    6    12  +
+chr1    10    20 +
+chr1    22    27 -
+chr1    24    30 -'''
+b = '''Chromosome Start End Strand
+chr1    12    32 +
+chr1    14    30 +'''
+c = '''Chromosome Start End Strand
+chr1    8    15 +
+chr1    713800    714800 -
+chr1    32    34 -'''
+
+grs = {n: pr.from_string(s) for n, s in zip(["a", "b", "c"], [a, b, c])}
+unstranded_grs = {n: gr.unstrand() for n, gr in grs.items()}
+
+features = pr.PyRanges(chromosomes=["chr1"] * 4,
+                       starts=[0, 10, 20, 30],
+                       ends=[10, 20, 30, 40],
+                       strands = ["+","+","+","-"])
+unstranded_features = features.unstrand()
+
+def test_strand_vs_strand_same():
+
+    expected_result = pr.from_string("""Chromosome Start End Strand a b c
+chr1  0 10  + 1 0 1
+chr1 10 20  + 2 2 1
+chr1 20 30  + 0 2 0
+chr1 30 40  - 0 0 1""")
+
+    res = pr.count_overlaps(grs, features, strandedness="same")
+    res = res.apply(lambda df: df.astype({"a": np.int64, "b": np.int64, "c": np.int64}))
+
+    res.print(merge_position=True)
+
+    assert_df_equal(res.df, expected_result.df)
+
+
+# def test_strand_vs_strand_opposite():
+
+#     expected_result = pr.from_string("""Chromosome Start End Strand a b c
+# chr1  0 10  + 1 0 1
+# chr1 10 20  + 1 2 1
+# chr1 20 30  + 0 2 0
+# chr1 30 40  - 0 0 1""")
+
+#     res = pr.count_overlaps(grs, features, strandedness="opposite")
+
+#     print("features")
+#     print(features)
+
+#     for name, gr in grs.items():
+#         print(name)
+#         print(gr)
+
+#     res.print(merge_position=True)
+
+#     assert_df_equal(res.df, expected_result.df)


=====================================
tests/test_do_not_error.py
=====================================
@@ -50,6 +50,8 @@ strandedness_chain = list(product(["same", "opposite"], strandedness)) + list(
 # @reproduce_failure('5.5.4', b'AXicY2RAA4xIJDY+AAC2AAY=') # test_three_in_a_row[strandedness_chain24-method_chain24]
 def test_three_in_a_row(gr, gr2, gr3, strandedness_chain, method_chain):
 
+    print(method_chain)
+
     s1, s2 = strandedness_chain
     f1, f2 = method_chain
 



View it on GitLab: https://salsa.debian.org/med-team/pyranges/-/commit/9afe05d25e92fd318b45e99a9f7e75341c0ae911


