[med-svn] [Git][med-team/pyensembl][upstream] New upstream version 2.6.7

Wed Apr 22 06:05:55 BST 2026


Karsten Schöke pushed to branch upstream at Debian Med / pyensembl


Commits:
b44d3559 by Karsten Schöke at 2026-04-22T06:26:54+02:00
New upstream version 2.6.7
- - - - -


13 changed files:

- .github/workflows/tests.yml
- + AGENTS.md
- pyensembl/genome.py
- pyensembl/shell.py
- pyensembl/transcript.py
- pyensembl/version.py
- pyproject.toml
- tests/test_build_system.py
- tests/test_gene_ids.py
- tests/test_gene_names.py
- tests/test_transcript_ids.py
- tests/test_transcript_sequences.py
- tests/test_ucsc_gtf.py


Changes:

=====================================
.github/workflows/tests.yml
=====================================
@@ -3,7 +3,6 @@
 
 # TODO:
 # - cache this directory $HOME/.cache/pyensembl/
-# - update coveralls
 # - get a badge for tests passing
 # - download binary dependencies from conda
 name: Tests
@@ -49,5 +48,18 @@ jobs:
       - name: Run unit tests
         run: |
           ./test.sh
-      - name: Publish coverage to Coveralls
+      - name: Upload coverage to Coveralls
         uses: coverallsapp/github-action@v2.2.3
+        with:
+          parallel: true
+          flag-name: python-${{ matrix.python-version }}
+
+  coveralls-finish:
+    needs: build
+    if: always()
+    runs-on: ubuntu-latest
+    steps:
+      - name: Finalize Coveralls
+        uses: coverallsapp/github-action@v2.2.3
+        with:
+          parallel-finished: true


=====================================
AGENTS.md
=====================================
@@ -0,0 +1,79 @@
+## Golden Rules
+
+1. **Never commit to `main`.** Always `git checkout -b <feature-branch>` before editing. Land via PR.
+2. **Every PR bumps the version.** Even doc-only PRs — at minimum a patch bump. The version lives in `pyensembl/version.py` and must be bumped in the PR itself.
+3. **"Done" means merged AND deployed to PyPI** — never stop at merge. After a PR merges, run `./deploy.sh` from a clean `main`. Skipping deploy = task not done.
+4. **File problems as issues, don't silently work around them.** If you hit a bug here or in a sibling openvax repo, open a GitHub issue on the correct repo and link it from the PR.
+5. **After a PR ships, look for the next block of work.** Read open issues across the relevant openvax repos, group by dependency + urgency. Prefer *foundational* changes that unblock multiple downstream improvements; otherwise chain the smallest independent improvements.
+
+---
+
+## Before Completing Any Task
+
+Before considering any code change complete, you MUST:
+
+1. **Run `./lint.sh`** — runs `ruff check pyensembl/`.
+2. **Run `./test.sh`** — runs the pytest suite.
+
+Do not tell the user you are "done" or that changes are "complete" until both pass. `./lint-and-test.sh` runs them together.
+
+## Scripts
+
+- `./lint.sh` — runs `ruff check pyensembl/`. **Always use this for linting.**
+- `./test.sh` — runs pytest with coverage (must pass).
+- `./lint-and-test.sh` — convenience wrapper that runs lint then tests.
+- `./deploy.sh` — deploys to PyPI. Gates on `lint.sh` + `test.sh`, builds sdist+wheel, uploads via twine, tags the commit with `pyensembl/version.py`, pushes tags. **Bump `pyensembl/version.py` before running** — deploy.sh does not bump the version for you.
+- `./develop.sh` — installs package in development mode.
+
+## Code Style
+
+- Use ruff for linting (there is no `format.sh`; formatting is not currently enforced in CI).
+- Configuration is in `pyproject.toml`.
+- Python support: 3.9+.
+
+---
+
+## Workflow Orchestration
+
+### 1. Upfront Planning
+- For ANY non-trivial task (3+ steps or architectural decisions): write a detailed spec before touching code
+- If something goes sideways, STOP and re-plan immediately — don't keep pushing
+- Use planning/verification steps, not just building
+- Write detailed specs upfront to reduce ambiguity
+
+### 2. Self-Improvement Loop
+- After ANY correction from the user: capture the pattern (in Claude Code memory or `tasks/lessons.md`)
+- Write rules for yourself that prevent the same mistake
+- Ruthlessly iterate on these lessons until mistake rate drops
+- Review lessons at session start for relevant project
+
+### 3. Verification Before Done
+- Never mark a task complete without proving it works
+- Diff behavior between the latest code and your changes when relevant
+- Ask yourself: "Would a staff engineer approve this?"
+- Run tests, check logs, demonstrate correctness
+
+### 4. Demand Elegance (Balanced)
+- For non-trivial changes: pause and ask "is there a more elegant way?"
+- If a fix feels hacky: "Knowing everything I know now, implement the elegant solution"
+- Skip this for simple, obvious fixes — don't over-engineer
+- Challenge your own work before presenting it
+
+### 5. Autonomous Bug Fixing
+- When given a bug report: just fix it. Don't ask for hand-holding
+- Point at logs, errors, failing tests — then resolve them
+- Zero context switching required from the user
+- Fix failing unit tests without being told how
+
+---
+
+## Core Principles
+
+- **Simplicity First**: Make every change as simple as possible. Impact minimal code.
+- **No Laziness**: Find root causes. No temporary fixes. Senior developer standards.
+- **Minimal Impact**: Changes should only touch what's necessary. Avoid introducing bugs.
+- **No tautological tests**: Don't write tests that reassert the contents of declarative config (e.g. a `pyproject.toml` dependency list against a hardcoded copy). They verify nothing and break on every legitimate bump.
+
+## Scientific Domain Knowledge
+- **Read the literature**: if some code involves scientific or biological concepts, feel free to search for review papers and read those before changing code that expresses scientific concepts.
+- **Flag inconsistencies**: if code expresses a scientific model that's at odds with your understanding, note that inconsistency and ask for clarification.


=====================================
pyensembl/genome.py
=====================================
@@ -589,7 +589,7 @@ class Genome(Serializable):
     def protein_ids_at_locus(self, contig, position, end=None, strand=None):
         return self.db.distinct_column_values_at_locus(
             column="protein_id",
-            feature="transcript",
+            feature="CDS",
             contig=contig,
             position=position,
             end=end,


=====================================
pyensembl/shell.py
=====================================
@@ -69,9 +69,7 @@ parser.add_argument(
 )
 
 
-root_group = parser.add_mutually_exclusive_group()
-
-release_group = root_group.add_argument_group()
+release_group = parser.add_argument_group("Ensembl release options")
 release_group.add_argument(
     "--release",
     type=int,
@@ -93,7 +91,7 @@ release_group.add_argument(
     help="URL and directory to use instead of the default Ensembl FTP server",
 )
 
-path_group = root_group.add_argument_group()
+path_group = parser.add_argument_group("Custom genome options")
 
 path_group.add_argument(
     "--reference-name",


=====================================
pyensembl/transcript.py
=====================================
@@ -13,9 +13,28 @@
 from memoized_property import memoized_property
 
 from .common import memoize
+from .exon import Exon
 from .locus_with_genome import LocusWithGenome
 
 
+def _merge_ranges(ranges):
+    """
+    Sort [(start, end)] inclusive-inclusive ranges and merge any that are
+    adjacent or overlapping (end+1 == next start).
+    """
+    if not ranges:
+        return []
+    ordered = sorted(ranges)
+    merged = [ordered[0]]
+    for start, end in ordered[1:]:
+        prev_start, prev_end = merged[-1]
+        if start <= prev_end + 1:
+            merged[-1] = (prev_start, max(prev_end, end))
+        else:
+            merged.append((start, end))
+    return merged
+
+
 class Transcript(LocusWithGenome):
     """
     Transcript encompasses the locus, exons, and sequence of a transcript.
@@ -123,23 +142,25 @@ class Transcript(LocusWithGenome):
     def exons(self):
         # need to look up exon_number alongside ID since each exon may
         # appear in multiple transcripts and have a different exon number
-        # in each transcript
-        columns = ["exon_number", "exon_id"]
-        exon_numbers_and_ids = self.db.query(
+        # in each transcript.
+        # Older or non-Ensembl GTFs may omit the exon_id attribute, in
+        # which case we build Exon objects directly from the exon row
+        # and synthesize a stable per-transcript ID.
+        has_exon_id = self.db.column_exists("exon", "exon_id")
+        if has_exon_id:
+            columns = ["exon_number", "exon_id"]
+        else:
+            columns = ["exon_number", "seqname", "start", "end", "strand"]
+        rows = self.db.query(
             columns, filter_column="transcript_id", filter_value=self.id, feature="exon"
         )
 
         # fill this list in its correct order (by exon_number) by using
         # the exon_number as a 1-based list offset
-        exons = [None] * len(exon_numbers_and_ids)
+        exons = [None] * len(rows)
 
-        for exon_number, exon_id in exon_numbers_and_ids:
-            exon = self.genome.exon_by_id(exon_id)
-            if exon is None:
-                raise ValueError(
-                    "Missing exon %s for transcript %s" % (exon_number, self.id)
-                )
-            exon_number = int(exon_number)
+        for row in rows:
+            exon_number = int(row[0])
             if exon_number < 1:
                 raise ValueError("Invalid exon number: %s" % exon_number)
             elif exon_number > len(exons):
@@ -148,9 +169,27 @@ class Transcript(LocusWithGenome):
                     % (exon_number, len(exons))
                 )
 
+            if has_exon_id:
+                exon_id = row[1]
+                exon = self.genome.exon_by_id(exon_id)
+                if exon is None:
+                    raise ValueError(
+                        "Missing exon %s for transcript %s" % (exon_number, self.id)
+                    )
+            else:
+                _, seqname, start, end, strand = row
+                exon = Exon(
+                    exon_id="%s_exon_%d" % (self.id, exon_number),
+                    contig=seqname,
+                    start=start,
+                    end=end,
+                    strand=strand,
+                    gene_name=self.gene_name,
+                    gene_id=self.gene_id,
+                )
+
             # exon_number is 1-based, convert to list index by subtracting 1
-            exon_idx = exon_number - 1
-            exons[exon_idx] = exon
+            exons[exon_number - 1] = exon
         return exons
 
     # possible annotations associated with transcripts
@@ -388,9 +427,15 @@ class Transcript(LocusWithGenome):
     def coding_sequence_position_ranges(self):
         """
         Return absolute chromosome position ranges for CDS fragments
-        of this transcript
+        of this transcript, including the stop codon (which Ensembl
+        encodes as a separate feature from the CDS).
         """
-        return self._transcript_feature_position_ranges("CDS")
+        ranges = list(self._transcript_feature_position_ranges("CDS"))
+        if self.contains_stop_codon:
+            ranges.extend(
+                self._transcript_feature_position_ranges("stop_codon", required=False)
+            )
+        return _merge_ranges(ranges)
 
     @memoized_property
     def complete(self):


=====================================
pyensembl/version.py
=====================================
@@ -1,4 +1,4 @@
-__version__ = "2.6.0"
+__version__ = "2.6.7"
 
 def print_version():
     print(f"v{__version__}")


=====================================
pyproject.toml
=====================================
@@ -25,8 +25,8 @@ dependencies = [
     "datacache>=1.4.0,<2.0.0",
     "memoized-property>=1.0.2",
     "tinytimer>=0.0.0,<1.0.0",
-    "gtfparse>=2.5.0,<3.0.0",
-    "serializable>=0.2.1,<1.0.0",
+    "gtfparse>=2.6.0,<3.0.0",
+    "serializable>=0.2.1,<2.0.0",
     "numpy>=2.0.0,<3.0.0",
 ]
 


=====================================
tests/test_build_system.py
=====================================
@@ -52,40 +52,6 @@ def test_build_system_backend():
     assert config["build-system"]["build-backend"] == "setuptools.build_meta"
 
 
-def test_dependencies_correct():
-    """Test that runtime dependencies match specification."""
-    try:
-        import tomllib
-    except ImportError:
-        import tomli as tomllib
-
-    project_root = Path(__file__).parent.parent
-    pyproject_path = project_root / "pyproject.toml"
-
-    with open(pyproject_path, "rb") as f:
-        config = tomllib.load(f)
-
-    expected_deps = {
-        "typechecks>=0.0.2,<1.0.0",
-        "datacache>=1.4.0,<2.0.0",
-        "memoized-property>=1.0.2",
-        "tinytimer>=0.0.0,<1.0.0",
-        "gtfparse>=2.5.0,<3.0.0",
-        "serializable>=0.2.1,<1.0.0",
-        "numpy<2",
-    }
-
-    actual_deps = set(config["project"]["dependencies"])
-
-    assert actual_deps == expected_deps, (
-        f"Dependencies mismatch.\n"
-        f"Expected: {expected_deps}\n"
-        f"Actual: {actual_deps}\n"
-        f"Missing: {expected_deps - actual_deps}\n"
-        f"Extra: {actual_deps - expected_deps}"
-    )
-
-
 def test_no_pylint_in_runtime_deps():
     """
     Test that pylint is not in runtime dependencies.


=====================================
tests/test_gene_ids.py
=====================================
@@ -15,17 +15,18 @@ ensembl77 = cached_release(77, "human")
 
 
 def test_gene_ids_grch38_hla_a():
-    # chr6:29,945,884  is a position for HLA-A
-    # Gene ID = ENSG00000206503
-    # based on:
+    # chr6:29,945,884 is a position for HLA-A (ENSG00000206503).
+    # Ensembl release 114 introduced overlapping gene POLR1HASP
+    # (ENSG00000293508) at the same locus, so accept either HLA-A alone
+    # or the HLA-A + POLR1HASP pair.
     # http://useast.ensembl.org/Homo_sapiens/Gene/
     # Summary?db=core;g=ENSG00000206503;r=6:29941260-29945884
-    ids = ensembl_grch38.gene_ids_at_locus(6, 29945884)
-    expected = "ENSG00000206503"
-    assert ids == ["ENSG00000206503"], "Expected HLA-A, gene ID = %s, got: %s" % (
-        expected,
-        ids,
-    )
+    ids = set(ensembl_grch38.gene_ids_at_locus(6, 29945884))
+    assert "ENSG00000206503" in ids, "Expected HLA-A (ENSG00000206503), got: %s" % (ids,)
+    if len(ids) > 1:
+        assert ids == {"ENSG00000206503", "ENSG00000293508"}, (
+            "Expected HLA-A alone or HLA-A + POLR1HASP, got: %s" % (ids,)
+        )
 
 
 def test_gene_ids_of_gene_name_hla_grch38():


=====================================
tests/test_gene_names.py
=====================================
@@ -34,12 +34,17 @@ def test_all_gene_names(genome):
 
 
 def test_gene_names_at_locus_grch38_hla_a():
-    # chr6:29,945,884  is a position for HLA-A
-    # based on:
+    # chr6:29,945,884 is a position for HLA-A. Ensembl release 114
+    # introduced overlapping gene POLR1HASP at the same locus, so accept
+    # either HLA-A alone or the HLA-A + POLR1HASP pair.
     # http://useast.ensembl.org/Homo_sapiens/Gene/
     # Summary?db=core;g=ENSG00000206503;r=6:29941260-29945884
-    names = grch38.gene_names_at_locus(6, 29945884)
-    assert names == ["HLA-A"], "Expected gene name HLA-A, got: %s" % (names,)
+    names = set(grch38.gene_names_at_locus(6, 29945884))
+    assert "HLA-A" in names, "Expected gene name HLA-A, got: %s" % (names,)
+    if len(names) > 1:
+        assert names == {"HLA-A", "POLR1HASP"}, (
+            "Expected HLA-A alone or HLA-A + POLR1HASP, got: %s" % (names,)
+        )
 
 
 @run_multiple_genomes()


=====================================
tests/test_transcript_ids.py
=====================================
@@ -60,3 +60,19 @@ def test_transcript_id_of_protein_id_CCR2():
     # Ensembl release 104, GRCh38.p13
     transcript_id = grch38.transcript_id_of_protein_id("ENSP00000399285")
     eq_("ENST00000445132", transcript_id)
+
+
+def test_protein_ids_at_locus_grch38_hla_a():
+    # Regression test for https://github.com/openvax/pyensembl/issues/286:
+    # protein_ids_at_locus previously queried the "transcript" feature,
+    # which stores an empty protein_id, so results were a list of empty
+    # strings instead of real protein IDs.
+    # chr6:29,942,555 falls inside the first CDS of HLA-A transcripts.
+    protein_ids = set(grch38.protein_ids_at_locus(6, 29942555))
+    assert protein_ids, "Expected non-empty protein IDs at HLA-A CDS locus"
+    assert "" not in protein_ids, (
+        "protein_ids_at_locus should not return empty strings: %s" % (protein_ids,)
+    )
+    assert "ENSP00000365998" in protein_ids, (
+        "Expected HLA-A protein ENSP00000365998 at locus, got: %s" % (protein_ids,)
+    )


=====================================
tests/test_transcript_sequences.py
=====================================
@@ -17,3 +17,21 @@ def test_transcript_sequence_ensembl_grch38():
     eq_(seq, expected)
     # now try via a Transcript object
     eq_(grch38.transcript_by_id("ENST00000448914").sequence, expected)
+
+
+def test_coding_sequence_matches_position_ranges():
+    # Regression test for https://github.com/openvax/pyensembl/issues/176:
+    # len(coding_sequence) used to exceed the total length of
+    # coding_sequence_position_ranges by exactly 3 because Ensembl's CDS
+    # feature excludes the stop codon while coding_sequence includes it.
+    for transcript_id in [
+        "ENST00000311936",
+        "ENST00000371085",
+        "ENST00000275493",
+    ]:
+        transcript = grch38.transcript_by_id(transcript_id)
+        cds_length = len(transcript.coding_sequence)
+        ranges_length = sum(
+            end - start + 1 for (start, end) in transcript.coding_sequence_position_ranges
+        )
+        eq_(cds_length, ranges_length)


=====================================
tests/test_ucsc_gtf.py
=====================================
@@ -53,6 +53,51 @@ def test_ucsc_gencode_genome():
         eq_(transcript_1_30564[0].id, "uc057aty.1")
 
 
+def test_transcript_exons_without_exon_id():
+    """
+    Regression test: older Ensembl releases (e.g. release 54) and other
+    GTFs omit the exon_id attribute while still providing exon_number.
+    Transcript.exons previously raised
+    sqlite3.OperationalError: no such column: exon_id; it now falls back
+    to building Exon objects directly from the exon row with a
+    synthesized per-transcript ID.
+    """
+    import os
+
+    with TemporaryDirectory() as tmpdir:
+        gtf_path = os.path.join(tmpdir, "no_exon_id.gtf")
+        with open(gtf_path, "w") as f:
+            # minimal Ensembl-style GTF: transcript + 2 ordered exons,
+            # with exon_number but no exon_id (as in Ensembl release 54)
+            f.write(
+                '1\ttest\ttranscript\t100\t500\t.\t+\t.\t'
+                'gene_id "G1"; transcript_id "T1"; gene_name "FN1";\n'
+                '1\ttest\texon\t100\t200\t.\t+\t.\t'
+                'gene_id "G1"; transcript_id "T1"; exon_number "1"; gene_name "FN1";\n'
+                '1\ttest\texon\t300\t500\t.\t+\t.\t'
+                'gene_id "G1"; transcript_id "T1"; exon_number "2"; gene_name "FN1";\n'
+            )
+        genome = Genome(
+            reference_name="GRCh38",
+            annotation_name="no_exon_id_test",
+            gtf_path_or_url=gtf_path,
+            cache_directory_path=tmpdir,
+        )
+        genome.index()
+        assert not genome.db.column_exists("exon", "exon_id"), (
+            "Test fixture unexpectedly has exon_id — update this test"
+        )
+        transcript = genome.transcript_by_id("T1")
+        exons = transcript.exons
+        eq_(len(exons), 2)
+        eq_(exons[0].id, "T1_exon_1")
+        eq_(exons[0].start, 100)
+        eq_(exons[0].end, 200)
+        eq_(exons[1].id, "T1_exon_2")
+        eq_(exons[1].start, 300)
+        eq_(exons[1].end, 500)
+
+
 def test_ucsc_refseq_gtf():
     """
     Test GTF object with a small RefSeq GTF file downloaded from



View it on GitLab: https://salsa.debian.org/med-team/pyensembl/-/commit/b44d3559fa471d8ba468856502580f8cdb98d411
