[med-svn] [Git][med-team/pyensembl][master] 8 commits: New upstream version 2.6.7
Karsten Schöke (@karso)
gitlab at salsa.debian.org
Wed Apr 22 06:05:48 BST 2026
Karsten Schöke pushed to branch master at Debian Med / pyensembl
Commits:
b44d3559 by Karsten Schöke at 2026-04-22T06:26:54+02:00
New upstream version 2.6.7
- - - - -
e12eb54d by Karsten Schöke at 2026-04-22T06:26:54+02:00
New upstream version
- - - - -
2aa5ac3e by Karsten Schöke at 2026-04-22T06:26:55+02:00
Update upstream source from tag 'upstream/2.6.7'
Update to upstream version '2.6.7'
with Debian dir a28f7b876dbcc1d657cd033cc47dc24a37a47776
- - - - -
751ef552 by Karsten Schöke at 2026-04-22T06:27:15+02:00
Remove trailing whitespace in debian/copyright (routine-update)
- - - - -
9cd8f3e1 by Karsten Schöke at 2026-04-22T06:27:28+02:00
debputy lint --auto-fix (routine-update)
- - - - -
947a7d5f by Karsten Schöke at 2026-04-22T06:43:52+02:00
remove obsolete Patches, fixed upstream
- - - - -
a1a35dff by Karsten Schöke at 2026-04-22T07:02:54+02:00
Proper cleanup after the build.
- - - - -
c62ca2f2 by Karsten Schöke at 2026-04-22T07:04:22+02:00
prepare 2.6.7-1 release.
- - - - -
21 changed files:
- .github/workflows/tests.yml
- + AGENTS.md
- debian/changelog
- debian/control
- debian/copyright
- − debian/patches/fix-argparse-nested-group.patch
- − debian/patches/no_tinytimer_in_requirements.txt
- debian/patches/series
- debian/rules
- debian/watch
- pyensembl/genome.py
- pyensembl/shell.py
- pyensembl/transcript.py
- pyensembl/version.py
- pyproject.toml
- tests/test_build_system.py
- tests/test_gene_ids.py
- tests/test_gene_names.py
- tests/test_transcript_ids.py
- tests/test_transcript_sequences.py
- tests/test_ucsc_gtf.py
Changes:
=====================================
.github/workflows/tests.yml
=====================================
@@ -3,7 +3,6 @@
# TODO:
# - cache this directory $HOME/.cache/pyensembl/
-# - update coveralls
# - get a badge for tests passing
# - download binary dependencies from conda
name: Tests
@@ -49,5 +48,18 @@ jobs:
- name: Run unit tests
run: |
./test.sh
- - name: Publish coverage to Coveralls
+ - name: Upload coverage to Coveralls
uses: coverallsapp/github-action@v2.2.3
+ with:
+ parallel: true
+ flag-name: python-${{ matrix.python-version }}
+
+ coveralls-finish:
+ needs: build
+ if: always()
+ runs-on: ubuntu-latest
+ steps:
+ - name: Finalize Coveralls
+ uses: coverallsapp/github-action@v2.2.3
+ with:
+ parallel-finished: true
=====================================
AGENTS.md
=====================================
@@ -0,0 +1,79 @@
+## Golden Rules
+
+1. **Never commit to `main`.** Always `git checkout -b <feature-branch>` before editing. Land via PR.
+2. **Every PR bumps the version.** Even doc-only PRs — at minimum a patch bump. The version lives in `pyensembl/version.py` and must be bumped in the PR itself.
+3. **"Done" means merged AND deployed to PyPI** — never stop at merge. After a PR merges, run `./deploy.sh` from a clean `main`. Skipping deploy = task not done.
+4. **File problems as issues, don't silently work around them.** If you hit a bug here or in a sibling openvax repo, open a GitHub issue on the correct repo and link it from the PR.
+5. **After a PR ships, look for the next block of work.** Read open issues across the relevant openvax repos, group by dependency + urgency. Prefer *foundational* changes that unblock multiple downstream improvements; otherwise chain the smallest independent improvements.
+
+---
+
+## Before Completing Any Task
+
+Before considering any code change complete, you MUST:
+
+1. **Run `./lint.sh`** — runs `ruff check pyensembl/`.
+2. **Run `./test.sh`** — runs the pytest suite.
+
+Do not tell the user you are "done" or that changes are "complete" until both pass. `./lint-and-test.sh` runs them together.
+
+## Scripts
+
+- `./lint.sh` — runs `ruff check pyensembl/`. **Always use this for linting.**
+- `./test.sh` — runs pytest with coverage (must pass).
+- `./lint-and-test.sh` — convenience wrapper that runs lint then tests.
+- `./deploy.sh` — deploys to PyPI. Gates on `lint.sh` + `test.sh`, builds sdist+wheel, uploads via twine, tags the commit with `pyensembl/version.py`, pushes tags. **Bump `pyensembl/version.py` before running** — deploy.sh does not bump the version for you.
+- `./develop.sh` — installs package in development mode.
+
+## Code Style
+
+- Use ruff for linting (there is no `format.sh`; formatting is not currently enforced in CI).
+- Configuration is in `pyproject.toml`.
+- Python support: 3.9+.
+
+---
+
+## Workflow Orchestration
+
+### 1. Upfront Planning
+- For ANY non-trivial task (3+ steps or architectural decisions): write a detailed spec before touching code
+- If something goes sideways, STOP and re-plan immediately — don't keep pushing
+- Use planning/verification steps, not just building
+- Write detailed specs upfront to reduce ambiguity
+
+### 2. Self-Improvement Loop
+- After ANY correction from the user: capture the pattern (in Claude Code memory or `tasks/lessons.md`)
+- Write rules for yourself that prevent the same mistake
+- Ruthlessly iterate on these lessons until mistake rate drops
+- Review lessons at session start for relevant project
+
+### 3. Verification Before Done
+- Never mark a task complete without proving it works
+- Diff behavior between the latest code and your changes when relevant
+- Ask yourself: "Would a staff engineer approve this?"
+- Run tests, check logs, demonstrate correctness
+
+### 4. Demand Elegance (Balanced)
+- For non-trivial changes: pause and ask "is there a more elegant way?"
+- If a fix feels hacky: "Knowing everything I know now, implement the elegant solution"
+- Skip this for simple, obvious fixes — don't over-engineer
+- Challenge your own work before presenting it
+
+### 5. Autonomous Bug Fixing
+- When given a bug report: just fix it. Don't ask for hand-holding
+- Point at logs, errors, failing tests — then resolve them
+- Zero context switching required from the user
+- Fix failing unit tests without being told how
+
+---
+
+## Core Principles
+
+- **Simplicity First**: Make every change as simple as possible. Impact minimal code.
+- **No Laziness**: Find root causes. No temporary fixes. Senior developer standards.
+- **Minimal Impact**: Changes should only touch what's necessary. Avoid introducing bugs.
+- **No tautological tests**: Don't write tests that reassert the contents of declarative config (e.g. a `pyproject.toml` dependency list against a hardcoded copy). They verify nothing and break on every legitimate bump.
+
+## Scientific Domain Knowledge
+- **Read the literature**: if some code involves scientific or biological concepts, feel free to search for review papers and read those before changing code that expresses scientific concepts.
+- **Flag inconsistencies**: if code expresses a scientific model that's at odds with your understanding, note that inconsistency and ask for clarification.
=====================================
debian/changelog
=====================================
@@ -1,9 +1,18 @@
-pyensembl (2.6.0-2) UNRELEASED; urgency=medium
+pyensembl (2.6.7-1) UNRELEASED; urgency=medium
* Team upload.
+
+ [ Andreas Tille ]
* d/copyright: Short names for licenses
- -- Andreas Tille <tille at debian.org> Mon, 13 Apr 2026 08:02:22 +0200
+ [ Karsten Schöke ]
+ * New upstream version
+ * Remove trailing whitespace in debian/copyright (routine-update)
+ * debputy lint --auto-fix (routine-update)
+ * Remove obsolete Patches, fixed upstream
+ * Proper cleanup after the build.
+
+ -- Karsten Schöke <karsten.schoeke at geobasis-bb.de> Wed, 22 Apr 2026 07:03:24 +0200
pyensembl (2.6.0-1) unstable; urgency=medium
=====================================
debian/control
=====================================
@@ -1,27 +1,29 @@
Source: pyensembl
-Section: science
-Maintainer: Debian Med Packaging Team <debian-med-packaging at lists.alioth.debian.org>
-Uploaders: Steffen Moeller <moeller at debian.org>,
- Étienne Mollier <emollier at debian.org>
-Build-Depends: debhelper-compat (= 13),
- dh-sequence-python3,
- pybuild-plugin-pyproject,
- python3-setuptools,
- python3-all,
- python3-pandas,
- python3-numpy,
- python3-datacache,
- python3-serializable,
- python3-memoized-property,
- python3-gtfparse,
- python3-pytest <!nocheck>,
# python3-tinytimer <!nocheck> # only run during tests that are not run because of
# demand for internet access
Standards-Version: 4.7.4
-Homepage: https://github.com/openvax/pyensembl
+Maintainer: Debian Med Packaging Team <debian-med-packaging at lists.alioth.debian.org>
+Uploaders:
+ Steffen Moeller <moeller at debian.org>,
+ Étienne Mollier <emollier at debian.org>,
+Section: science
+Testsuite: autopkgtest-pkg-pybuild
+Build-Depends:
+ debhelper-compat (= 13),
+ dh-sequence-python3,
+ pybuild-plugin-pyproject,
+ python3-setuptools,
+ python3-all,
+ python3-pandas,
+ python3-numpy,
+ python3-datacache,
+ python3-serializable,
+ python3-memoized-property,
+ python3-gtfparse,
+ python3-pytest <!nocheck>,
Vcs-Browser: https://salsa.debian.org/med-team/pyensembl
Vcs-Git: https://salsa.debian.org/med-team/pyensembl.git
-Testsuite: autopkgtest-pkg-pybuild
+Homepage: https://github.com/openvax/pyensembl
Package: pyensembl
Architecture: all
=====================================
debian/copyright
=====================================
@@ -2,16 +2,13 @@ Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
Upstream-Name: pyensembl
Upstream-Contact: Alex Rubinsteyn <alex.rubinsteyn at unc.edu>
Source: https://github.com/openvax/pyensembl
-Files-Excluded:
- pyensembl.egg-info/SOURCES.txt
- pyensembl.egg-info/requires.txt
Files: *
Copyright: 2015-2025 Mount Sinai School of Medicine
License: Apache-2.0
Files: tests/data/gencode.ucsc.small.gtf
-Copyright:
+Copyright:
2025 EBI
License: EMBL
General
@@ -74,7 +71,7 @@ License: EMBL
Wellcome Genome Campus
Hinxton CB10 1SD
.
- Data Resources and Tools
+ Data Resources and Tools
.
Users of EMBL-EBI Data Resources and Tools agree not to attempt to
use any EMBL-EBI computers, files or networks apart from through the
@@ -149,8 +146,8 @@ License: EMBL
to the user, including, without limitation, those related to privacy,
intellectual property rights, export controls and sanctions.
.
- Data
- .
+ Data
+ .
The Data Resources and Tools hosted by EMBL-EBI are generated in part
from data contributed by the community who remain the data owners.
.
@@ -195,10 +192,10 @@ License: EMBL
cookies page.
Comment:
This information is gathered from
- https://www.gencodegenes.org/human/releases.html
+ https://www.gencodegenes.org/human/releases.html
pointing to the EBI's data usage policies on
https://www.ebi.ac.uk/about/terms-of-use/.
-
+
Files: tests/data/mouse.ensembl.81.partial.ENSMUSG00000017167.pep
tests/data/mouse.ensembl.81.partial.ENSMUSG00000017167.fa
=====================================
debian/patches/fix-argparse-nested-group.patch deleted
=====================================
@@ -1,24 +0,0 @@
-From: =?utf-8?q?Karsten_Sch=C3=B6ke?= <karsten.schoeke at geobasis-bb.de>
-Date: Sun, 12 Apr 2026 12:32:43 +0200
-Subject: fix argparse nested group
-
----
- pyensembl/shell.py | 4 ++--
- 1 file changed, 2 insertions(+), 2 deletions(-)
-
-diff --git a/pyensembl/shell.py b/pyensembl/shell.py
-index f446afa..f0bebfa 100755
---- a/pyensembl/shell.py
-+++ b/pyensembl/shell.py
-@@ -69,9 +69,9 @@ parser.add_argument(
- )
-
-
--root_group = parser.add_mutually_exclusive_group()
-+root_group = parser # flatten group structure for Python 3.14 compatibility
-
--release_group = root_group.add_argument_group()
-+release_group = parser.add_argument_group()
- release_group.add_argument(
- "--release",
- type=int,
=====================================
debian/patches/no_tinytimer_in_requirements.txt deleted
=====================================
@@ -1,21 +0,0 @@
-From: Debian Med Packaging Team
- <debian-med-packaging at lists.alioth.debian.org>
-Date: Sun, 12 Apr 2026 11:44:03 +0200
-Subject: no_tinytimer_in_requirements.txt
-
----
- pyproject.toml | 1 -
- 1 file changed, 1 deletion(-)
-
-diff --git a/pyproject.toml b/pyproject.toml
-index 59739d1..8ac6dfc 100644
---- a/pyproject.toml
-+++ b/pyproject.toml
-@@ -24,7 +24,6 @@ dependencies = [
- "typechecks>=0.0.2,<1.0.0",
- "datacache>=1.4.0,<2.0.0",
- "memoized-property>=1.0.2",
-- "tinytimer>=0.0.0,<1.0.0",
- "gtfparse>=2.5.0,<3.0.0",
- "serializable>=0.2.1,<1.0.0",
- "numpy>=2.0.0,<3.0.0",
=====================================
debian/patches/series
=====================================
@@ -1,3 +1 @@
skip-benchmark.patch
-no_tinytimer_in_requirements.txt
-fix-argparse-nested-group.patch
=====================================
debian/rules
=====================================
@@ -1,6 +1,7 @@
#!/usr/bin/make -f
export DH_VERBOSE = 1
export PYBUILD_NAME=pyensembl
+export PYBUILD_AFTER_INSTALL=rm -fr {destdir}/usr/lib/python3*/dist-packages/pyensembl-*/top_level.txt
# See d/tests/control for why the following tests are ignored:
export PYBUILD_TEST_ARGS = -v \
@@ -24,9 +25,3 @@ export PYBUILD_TEST_ARGS = -v \
%:
dh $@ --buildsystem=pybuild
-
-execute_after_dh_auto_clean:
- rm -f pyensembl.egg-info/SOURCES.txt pyensembl.egg-info/requires.txt
-
-execute_after_dh_install:
- find debian -name requirements.txt -delete
=====================================
debian/watch
=====================================
@@ -2,7 +2,7 @@ Version: 5
Source: https://github.com/openvax/pyensembl.git
Matching-Pattern: refs/tags/v?@ANY_VERSION@
-Dversionmangle: auto
+Dversion-Mangle: auto
Mode: git
Repack: yes
-Repacksuffix: +ds
+Repack-Suffix: +ds
=====================================
pyensembl/genome.py
=====================================
@@ -589,7 +589,7 @@ class Genome(Serializable):
def protein_ids_at_locus(self, contig, position, end=None, strand=None):
return self.db.distinct_column_values_at_locus(
column="protein_id",
- feature="transcript",
+ feature="CDS",
contig=contig,
position=position,
end=end,
=====================================
pyensembl/shell.py
=====================================
@@ -69,9 +69,7 @@ parser.add_argument(
)
-root_group = parser.add_mutually_exclusive_group()
-
-release_group = root_group.add_argument_group()
+release_group = parser.add_argument_group("Ensembl release options")
release_group.add_argument(
"--release",
type=int,
@@ -93,7 +91,7 @@ release_group.add_argument(
help="URL and directory to use instead of the default Ensembl FTP server",
)
-path_group = root_group.add_argument_group()
+path_group = parser.add_argument_group("Custom genome options")
path_group.add_argument(
"--reference-name",
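
Note on the shell.py hunks above: a minimal sketch of the flattened group layout, assuming only the two option names visible in the diff (`--release` and `--reference-name`). The `build_parser` helper and the `prog` name are hypothetical, and the real parser defines more arguments than shown here. Titled argument groups only change how `--help` is organized; they do not enforce mutual exclusion.

```python
import argparse


def build_parser():
    # Hypothetical helper; mirrors only the group layout from the hunk above.
    parser = argparse.ArgumentParser(prog="example")

    # Both groups are attached directly to the parser instead of being
    # nested inside a mutually exclusive group.
    release_group = parser.add_argument_group("Ensembl release options")
    release_group.add_argument("--release", type=int, default=None)

    path_group = parser.add_argument_group("Custom genome options")
    path_group.add_argument("--reference-name", type=str, default=None)
    return parser


args = build_parser().parse_args(["--release", "77"])
print(args.release, args.reference_name)
```

Because the groups are no longer mutually exclusive at the argparse level, any check that release options and custom-genome options are not combined would have to happen after parsing; whether upstream does so is not visible in this diff.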
=====================================
pyensembl/transcript.py
=====================================
@@ -13,9 +13,28 @@
from memoized_property import memoized_property
from .common import memoize
+from .exon import Exon
from .locus_with_genome import LocusWithGenome
+def _merge_ranges(ranges):
+ """
+ Sort [(start, end)] inclusive-inclusive ranges and merge any that are
+ adjacent or overlapping (end+1 == next start).
+ """
+ if not ranges:
+ return []
+ ordered = sorted(ranges)
+ merged = [ordered[0]]
+ for start, end in ordered[1:]:
+ prev_start, prev_end = merged[-1]
+ if start <= prev_end + 1:
+ merged[-1] = (prev_start, max(prev_end, end))
+ else:
+ merged.append((start, end))
+ return merged
+
+
class Transcript(LocusWithGenome):
"""
Transcript encompasses the locus, exons, and sequence of a transcript.
@@ -123,23 +142,25 @@ class Transcript(LocusWithGenome):
def exons(self):
# need to look up exon_number alongside ID since each exon may
# appear in multiple transcripts and have a different exon number
- # in each transcript
- columns = ["exon_number", "exon_id"]
- exon_numbers_and_ids = self.db.query(
+ # in each transcript.
+ # Older or non-Ensembl GTFs may omit the exon_id attribute, in
+ # which case we build Exon objects directly from the exon row
+ # and synthesize a stable per-transcript ID.
+ has_exon_id = self.db.column_exists("exon", "exon_id")
+ if has_exon_id:
+ columns = ["exon_number", "exon_id"]
+ else:
+ columns = ["exon_number", "seqname", "start", "end", "strand"]
+ rows = self.db.query(
columns, filter_column="transcript_id", filter_value=self.id, feature="exon"
)
# fill this list in its correct order (by exon_number) by using
# the exon_number as a 1-based list offset
- exons = [None] * len(exon_numbers_and_ids)
+ exons = [None] * len(rows)
- for exon_number, exon_id in exon_numbers_and_ids:
- exon = self.genome.exon_by_id(exon_id)
- if exon is None:
- raise ValueError(
- "Missing exon %s for transcript %s" % (exon_number, self.id)
- )
- exon_number = int(exon_number)
+ for row in rows:
+ exon_number = int(row[0])
if exon_number < 1:
raise ValueError("Invalid exon number: %s" % exon_number)
elif exon_number > len(exons):
@@ -148,9 +169,27 @@ class Transcript(LocusWithGenome):
% (exon_number, len(exons))
)
+ if has_exon_id:
+ exon_id = row[1]
+ exon = self.genome.exon_by_id(exon_id)
+ if exon is None:
+ raise ValueError(
+ "Missing exon %s for transcript %s" % (exon_number, self.id)
+ )
+ else:
+ _, seqname, start, end, strand = row
+ exon = Exon(
+ exon_id="%s_exon_%d" % (self.id, exon_number),
+ contig=seqname,
+ start=start,
+ end=end,
+ strand=strand,
+ gene_name=self.gene_name,
+ gene_id=self.gene_id,
+ )
+
# exon_number is 1-based, convert to list index by subtracting 1
- exon_idx = exon_number - 1
- exons[exon_idx] = exon
+ exons[exon_number - 1] = exon
return exons
# possible annotations associated with transcripts
@@ -388,9 +427,15 @@ class Transcript(LocusWithGenome):
def coding_sequence_position_ranges(self):
"""
Return absolute chromosome position ranges for CDS fragments
- of this transcript
+ of this transcript, including the stop codon (which Ensembl
+ encodes as a separate feature from the CDS).
"""
- return self._transcript_feature_position_ranges("CDS")
+ ranges = list(self._transcript_feature_position_ranges("CDS"))
+ if self.contains_stop_codon:
+ ranges.extend(
+ self._transcript_feature_position_ranges("stop_codon", required=False)
+ )
+ return _merge_ranges(ranges)
@memoized_property
def complete(self):
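
Note on the `_merge_ranges` helper added above: the sketch below copies its logic into a standalone function (renamed `merge_ranges` here) so the merge can be tried outside pyensembl. The coordinates are made up for illustration and are not real Ensembl positions.

```python
def merge_ranges(ranges):
    """Sort inclusive (start, end) ranges and merge overlapping or adjacent ones."""
    if not ranges:
        return []
    ordered = sorted(ranges)
    merged = [ordered[0]]
    for start, end in ordered[1:]:
        prev_start, prev_end = merged[-1]
        if start <= prev_end + 1:
            # Overlapping or directly adjacent: extend the previous range.
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# Made-up coordinates: two CDS fragments plus a 3 bp stop codon that
# directly follows the last fragment.
cds_ranges = [(300, 400), (100, 200)]
stop_codon_ranges = [(401, 403)]
merged = merge_ranges(cds_ranges + stop_codon_ranges)
total_length = sum(end - start + 1 for start, end in merged)
print(merged, total_length)
```

In this toy input the stop codon is absorbed into the neighbouring CDS fragment, so the total covered length grows by 3 without adding a separate range. Since the ranges are sorted by start first, the same holds when the stop codon precedes the first fragment in genomic coordinates.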
=====================================
pyensembl/version.py
=====================================
@@ -1,4 +1,4 @@
-__version__ = "2.6.0"
+__version__ = "2.6.7"
def print_version():
print(f"v{__version__}")
=====================================
pyproject.toml
=====================================
@@ -25,8 +25,8 @@ dependencies = [
"datacache>=1.4.0,<2.0.0",
"memoized-property>=1.0.2",
"tinytimer>=0.0.0,<1.0.0",
- "gtfparse>=2.5.0,<3.0.0",
- "serializable>=0.2.1,<1.0.0",
+ "gtfparse>=2.6.0,<3.0.0",
+ "serializable>=0.2.1,<2.0.0",
"numpy>=2.0.0,<3.0.0",
]
=====================================
tests/test_build_system.py
=====================================
@@ -52,40 +52,6 @@ def test_build_system_backend():
assert config["build-system"]["build-backend"] == "setuptools.build_meta"
-def test_dependencies_correct():
- """Test that runtime dependencies match specification."""
- try:
- import tomllib
- except ImportError:
- import tomli as tomllib
-
- project_root = Path(__file__).parent.parent
- pyproject_path = project_root / "pyproject.toml"
-
- with open(pyproject_path, "rb") as f:
- config = tomllib.load(f)
-
- expected_deps = {
- "typechecks>=0.0.2,<1.0.0",
- "datacache>=1.4.0,<2.0.0",
- "memoized-property>=1.0.2",
- "tinytimer>=0.0.0,<1.0.0",
- "gtfparse>=2.5.0,<3.0.0",
- "serializable>=0.2.1,<1.0.0",
- "numpy<2",
- }
-
- actual_deps = set(config["project"]["dependencies"])
-
- assert actual_deps == expected_deps, (
- f"Dependencies mismatch.\n"
- f"Expected: {expected_deps}\n"
- f"Actual: {actual_deps}\n"
- f"Missing: {expected_deps - actual_deps}\n"
- f"Extra: {actual_deps - expected_deps}"
- )
-
-
def test_no_pylint_in_runtime_deps():
"""
Test that pylint is not in runtime dependencies.
=====================================
tests/test_gene_ids.py
=====================================
@@ -15,17 +15,18 @@ ensembl77 = cached_release(77, "human")
def test_gene_ids_grch38_hla_a():
- # chr6:29,945,884 is a position for HLA-A
- # Gene ID = ENSG00000206503
- # based on:
+ # chr6:29,945,884 is a position for HLA-A (ENSG00000206503).
+ # Ensembl release 114 introduced overlapping gene POLR1HASP
+ # (ENSG00000293508) at the same locus, so accept either HLA-A alone
+ # or the HLA-A + POLR1HASP pair.
# http://useast.ensembl.org/Homo_sapiens/Gene/
# Summary?db=core;g=ENSG00000206503;r=6:29941260-29945884
- ids = ensembl_grch38.gene_ids_at_locus(6, 29945884)
- expected = "ENSG00000206503"
- assert ids == ["ENSG00000206503"], "Expected HLA-A, gene ID = %s, got: %s" % (
- expected,
- ids,
- )
+ ids = set(ensembl_grch38.gene_ids_at_locus(6, 29945884))
+ assert "ENSG00000206503" in ids, "Expected HLA-A (ENSG00000206503), got: %s" % (ids,)
+ if len(ids) > 1:
+ assert ids == {"ENSG00000206503", "ENSG00000293508"}, (
+ "Expected HLA-A alone or HLA-A + POLR1HASP, got: %s" % (ids,)
+ )
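
Note on the relaxed assertion above: a small sketch of the same "required ID, plus at most one known neighbour" check, written as a standalone function so it can be exercised without downloading Ensembl data. The helper name `check_hla_a_locus_ids` is hypothetical; the two gene IDs are the ones named in the test comment.

```python
HLA_A = "ENSG00000206503"
POLR1HASP = "ENSG00000293508"


def check_hla_a_locus_ids(gene_ids):
    # Hypothetical helper: HLA-A must be present, and if anything else is
    # returned it may only be the overlapping POLR1HASP gene.
    ids = set(gene_ids)
    assert HLA_A in ids, "Expected HLA-A, got: %s" % (ids,)
    if len(ids) > 1:
        assert ids == {HLA_A, POLR1HASP}, "Unexpected genes: %s" % (ids,)
    return True


print(check_hla_a_locus_ids([HLA_A]))
print(check_hla_a_locus_ids([POLR1HASP, HLA_A]))
```

Comparing sets rather than lists keeps the check independent of the order in which the locus query returns its results.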
def test_gene_ids_of_gene_name_hla_grch38():
=====================================
tests/test_gene_names.py
=====================================
@@ -34,12 +34,17 @@ def test_all_gene_names(genome):
def test_gene_names_at_locus_grch38_hla_a():
- # chr6:29,945,884 is a position for HLA-A
- # based on:
+ # chr6:29,945,884 is a position for HLA-A. Ensembl release 114
+ # introduced overlapping gene POLR1HASP at the same locus, so accept
+ # either HLA-A alone or the HLA-A + POLR1HASP pair.
# http://useast.ensembl.org/Homo_sapiens/Gene/
# Summary?db=core;g=ENSG00000206503;r=6:29941260-29945884
- names = grch38.gene_names_at_locus(6, 29945884)
- assert names == ["HLA-A"], "Expected gene name HLA-A, got: %s" % (names,)
+ names = set(grch38.gene_names_at_locus(6, 29945884))
+ assert "HLA-A" in names, "Expected gene name HLA-A, got: %s" % (names,)
+ if len(names) > 1:
+ assert names == {"HLA-A", "POLR1HASP"}, (
+ "Expected HLA-A alone or HLA-A + POLR1HASP, got: %s" % (names,)
+ )
@run_multiple_genomes()
=====================================
tests/test_transcript_ids.py
=====================================
@@ -60,3 +60,19 @@ def test_transcript_id_of_protein_id_CCR2():
# Ensembl release 104, GRCh38.p13
transcript_id = grch38.transcript_id_of_protein_id("ENSP00000399285")
eq_("ENST00000445132", transcript_id)
+
+
+def test_protein_ids_at_locus_grch38_hla_a():
+ # Regression test for https://github.com/openvax/pyensembl/issues/286:
+ # protein_ids_at_locus previously queried the "transcript" feature,
+ # which stores an empty protein_id, so results were a list of empty
+ # strings instead of real protein IDs.
+ # chr6:29,942,555 falls inside the first CDS of HLA-A transcripts.
+ protein_ids = set(grch38.protein_ids_at_locus(6, 29942555))
+ assert protein_ids, "Expected non-empty protein IDs at HLA-A CDS locus"
+ assert "" not in protein_ids, (
+ "protein_ids_at_locus should not return empty strings: %s" % (protein_ids,)
+ )
+ assert "ENSP00000365998" in protein_ids, (
+ "Expected HLA-A protein ENSP00000365998 at locus, got: %s" % (protein_ids,)
+ )
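
Note on the regression described above: a toy sketch of why querying the "transcript" feature yields empty strings while querying "CDS" yields real protein IDs. The single `features` table, its column names, the `distinct_protein_ids` helper, and the rows are all assumptions made for illustration; pyensembl's actual database schema and query API are not shown in this diff.

```python
import sqlite3


def distinct_protein_ids(conn, feature, contig, position):
    # Hypothetical stand-in for a locus query; not pyensembl's real API.
    rows = conn.execute(
        "SELECT DISTINCT protein_id FROM features "
        "WHERE feature = ? AND seqname = ? AND start_pos <= ? AND end_pos >= ?",
        (feature, contig, position, position),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE features ("
    "feature TEXT, seqname TEXT, start_pos INTEGER, end_pos INTEGER, protein_id TEXT)"
)
conn.executemany(
    "INSERT INTO features VALUES (?, ?, ?, ?, ?)",
    [
        # Toy rows: the transcript row carries an empty protein_id,
        # the CDS row carries the real one.
        ("transcript", "1", 100, 500, ""),
        ("CDS", "1", 150, 250, "P1"),
    ],
)

print(distinct_protein_ids(conn, "transcript", "1", 200))
print(distinct_protein_ids(conn, "CDS", "1", 200))
```

One consequence of filtering on CDS rows is that a position inside a transcript but outside any coding fragment returns no protein IDs at all, which is why the regression test picks a position inside the first CDS.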
=====================================
tests/test_transcript_sequences.py
=====================================
@@ -17,3 +17,21 @@ def test_transcript_sequence_ensembl_grch38():
eq_(seq, expected)
# now try via a Transcript object
eq_(grch38.transcript_by_id("ENST00000448914").sequence, expected)
+
+
+def test_coding_sequence_matches_position_ranges():
+ # Regression test for https://github.com/openvax/pyensembl/issues/176:
+ # len(coding_sequence) used to exceed the total length of
+ # coding_sequence_position_ranges by exactly 3 because Ensembl's CDS
+ # feature excludes the stop codon while coding_sequence includes it.
+ for transcript_id in [
+ "ENST00000311936",
+ "ENST00000371085",
+ "ENST00000275493",
+ ]:
+ transcript = grch38.transcript_by_id(transcript_id)
+ cds_length = len(transcript.coding_sequence)
+ ranges_length = sum(
+ end - start + 1 for (start, end) in transcript.coding_sequence_position_ranges
+ )
+ eq_(cds_length, ranges_length)
=====================================
tests/test_ucsc_gtf.py
=====================================
@@ -53,6 +53,51 @@ def test_ucsc_gencode_genome():
eq_(transcript_1_30564[0].id, "uc057aty.1")
+def test_transcript_exons_without_exon_id():
+ """
+ Regression test: older Ensembl releases (e.g. release 54) and other
+ GTFs omit the exon_id attribute while still providing exon_number.
+ Transcript.exons previously raised
+ sqlite3.OperationalError: no such column: exon_id; it now falls back
+ to building Exon objects directly from the exon row with a
+ synthesized per-transcript ID.
+ """
+ import os
+
+ with TemporaryDirectory() as tmpdir:
+ gtf_path = os.path.join(tmpdir, "no_exon_id.gtf")
+ with open(gtf_path, "w") as f:
+ # minimal Ensembl-style GTF: transcript + 2 ordered exons,
+ # with exon_number but no exon_id (as in Ensembl release 54)
+ f.write(
+ '1\ttest\ttranscript\t100\t500\t.\t+\t.\t'
+ 'gene_id "G1"; transcript_id "T1"; gene_name "FN1";\n'
+ '1\ttest\texon\t100\t200\t.\t+\t.\t'
+ 'gene_id "G1"; transcript_id "T1"; exon_number "1"; gene_name "FN1";\n'
+ '1\ttest\texon\t300\t500\t.\t+\t.\t'
+ 'gene_id "G1"; transcript_id "T1"; exon_number "2"; gene_name "FN1";\n'
+ )
+ genome = Genome(
+ reference_name="GRCh38",
+ annotation_name="no_exon_id_test",
+ gtf_path_or_url=gtf_path,
+ cache_directory_path=tmpdir,
+ )
+ genome.index()
+ assert not genome.db.column_exists("exon", "exon_id"), (
+ "Test fixture unexpectedly has exon_id — update this test"
+ )
+ transcript = genome.transcript_by_id("T1")
+ exons = transcript.exons
+ eq_(len(exons), 2)
+ eq_(exons[0].id, "T1_exon_1")
+ eq_(exons[0].start, 100)
+ eq_(exons[0].end, 200)
+ eq_(exons[1].id, "T1_exon_2")
+ eq_(exons[1].start, 300)
+ eq_(exons[1].end, 500)
+
+
def test_ucsc_refseq_gtf():
"""
Test GTF object with a small RefSeq GTF file downloaded from
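
Note on the exon_id fallback exercised by `test_transcript_exons_without_exon_id` above: a reduced sketch of the ordering and ID-synthesis step, assuming rows of the form (exon_number, seqname, start, end, strand) as in the transcript.py hunk. `ToyExon` and `exons_from_rows` are hypothetical stand-ins, not pyensembl classes or functions.

```python
from collections import namedtuple

# Hypothetical stand-in for pyensembl's Exon, reduced to the fields used here.
ToyExon = namedtuple("ToyExon", ["exon_id", "contig", "start", "end", "strand"])


def exons_from_rows(transcript_id, rows):
    """Order exon rows by their 1-based exon_number and synthesize IDs."""
    exons = [None] * len(rows)
    for exon_number, seqname, start, end, strand in rows:
        exon_number = int(exon_number)
        if exon_number < 1 or exon_number > len(exons):
            raise ValueError("Invalid exon number: %s" % exon_number)
        # exon_number is 1-based, so subtract 1 for the list index.
        exons[exon_number - 1] = ToyExon(
            exon_id="%s_exon_%d" % (transcript_id, exon_number),
            contig=seqname,
            start=start,
            end=end,
            strand=strand,
        )
    return exons


# Rows deliberately out of order, matching the two exons in the test fixture.
rows = [("2", "1", 300, 500, "+"), ("1", "1", 100, 200, "+")]
exons = exons_from_rows("T1", rows)
print([exon.exon_id for exon in exons])
```

The synthesized IDs are only unique per transcript, so two transcripts sharing the same physical exon would get two different IDs under this scheme.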
View it on GitLab: https://salsa.debian.org/med-team/pyensembl/-/compare/5ad3f4880a2c7941cb6c7663095c420cc4c47d94...c62ca2f2f740892244ea987a178995de085d5255