[med-svn] [Git][med-team/pyensembl][upstream] New upstream version 2.6.7
Karsten Schöke (@karso)
gitlab at salsa.debian.org
Wed Apr 22 06:05:55 BST 2026
Karsten Schöke pushed to branch upstream at Debian Med / pyensembl
Commits:
b44d3559 by Karsten Schöke at 2026-04-22T06:26:54+02:00
New upstream version 2.6.7
- - - - -
13 changed files:
- .github/workflows/tests.yml
- + AGENTS.md
- pyensembl/genome.py
- pyensembl/shell.py
- pyensembl/transcript.py
- pyensembl/version.py
- pyproject.toml
- tests/test_build_system.py
- tests/test_gene_ids.py
- tests/test_gene_names.py
- tests/test_transcript_ids.py
- tests/test_transcript_sequences.py
- tests/test_ucsc_gtf.py
Changes:
=====================================
.github/workflows/tests.yml
=====================================
@@ -3,7 +3,6 @@
# TODO:
# - cache this directory $HOME/.cache/pyensembl/
-# - update coveralls
# - get a badge for tests passing
# - download binary dependencies from conda
name: Tests
@@ -49,5 +48,18 @@ jobs:
- name: Run unit tests
run: |
./test.sh
- - name: Publish coverage to Coveralls
+ - name: Upload coverage to Coveralls
uses: coverallsapp/github-action@v2.2.3
+ with:
+ parallel: true
+ flag-name: python-${{ matrix.python-version }}
+
+ coveralls-finish:
+ needs: build
+ if: always()
+ runs-on: ubuntu-latest
+ steps:
+ - name: Finalize Coveralls
+ uses: coverallsapp/github-action@v2.2.3
+ with:
+ parallel-finished: true
=====================================
AGENTS.md
=====================================
@@ -0,0 +1,79 @@
+## Golden Rules
+
+1. **Never commit to `main`.** Always `git checkout -b <feature-branch>` before editing. Land via PR.
+2. **Every PR bumps the version.** Even doc-only PRs — at minimum a patch bump. The version lives in `pyensembl/version.py` and must be bumped in the PR itself.
+3. **"Done" means merged AND deployed to PyPI** — never stop at merge. After a PR merges, run `./deploy.sh` from a clean `main`. Skipping deploy = task not done.
+4. **File problems as issues, don't silently work around them.** If you hit a bug here or in a sibling openvax repo, open a GitHub issue on the correct repo and link it from the PR.
+5. **After a PR ships, look for the next block of work.** Read open issues across the relevant openvax repos, group by dependency + urgency. Prefer *foundational* changes that unblock multiple downstream improvements; otherwise chain the smallest independent improvements.
+
+---
+
+## Before Completing Any Task
+
+Before considering any code change complete, you MUST:
+
+1. **Run `./lint.sh`** — runs `ruff check pyensembl/`.
+2. **Run `./test.sh`** — runs the pytest suite.
+
+Do not tell the user you are "done" or that changes are "complete" until both pass. `./lint-and-test.sh` runs them together.
+
+## Scripts
+
+- `./lint.sh` — runs `ruff check pyensembl/`. **Always use this for linting.**
+- `./test.sh` — runs pytest with coverage (must pass).
+- `./lint-and-test.sh` — convenience wrapper that runs lint then tests.
+- `./deploy.sh` — deploys to PyPI. Gates on `lint.sh` + `test.sh`, builds sdist+wheel, uploads via twine, tags the commit with `pyensembl/version.py`, pushes tags. **Bump `pyensembl/version.py` before running** — deploy.sh does not bump the version for you.
+- `./develop.sh` — installs package in development mode.
+
+## Code Style
+
+- Use ruff for linting (there is no `format.sh`; formatting is not currently enforced in CI).
+- Configuration is in `pyproject.toml`.
+- Python support: 3.9+.
+
+---
+
+## Workflow Orchestration
+
+### 1. Upfront Planning
+- For ANY non-trivial task (3+ steps or architectural decisions): write a detailed spec before touching code
+- If something goes sideways, STOP and re-plan immediately — don't keep pushing
+- Use planning/verification steps, not just building
+- Write detailed specs upfront to reduce ambiguity
+
+### 2. Self-Improvement Loop
+- After ANY correction from the user: capture the pattern (in Claude Code memory or `tasks/lessons.md`)
+- Write rules for yourself that prevent the same mistake
+- Ruthlessly iterate on these lessons until mistake rate drops
+- Review lessons at session start for relevant project
+
+### 3. Verification Before Done
+- Never mark a task complete without proving it works
+- Diff behavior between the latest code and your changes when relevant
+- Ask yourself: "Would a staff engineer approve this?"
+- Run tests, check logs, demonstrate correctness
+
+### 4. Demand Elegance (Balanced)
+- For non-trivial changes: pause and ask "is there a more elegant way?"
+- If a fix feels hacky: "Knowing everything I know now, implement the elegant solution"
+- Skip this for simple, obvious fixes — don't over-engineer
+- Challenge your own work before presenting it
+
+### 5. Autonomous Bug Fixing
+- When given a bug report: just fix it. Don't ask for hand-holding
+- Point at logs, errors, failing tests — then resolve them
+- Zero context switching required from the user
+- Fix failing unit tests without being told how
+
+---
+
+## Core Principles
+
+- **Simplicity First**: Make every change as simple as possible. Impact minimal code.
+- **No Laziness**: Find root causes. No temporary fixes. Senior developer standards.
+- **Minimal Impact**: Changes should only touch what's necessary. Avoid introducing bugs.
+- **No tautological tests**: Don't write tests that reassert the contents of declarative config (e.g. a `pyproject.toml` dependency list against a hardcoded copy). They verify nothing and break on every legitimate bump.
+
+## Scientific Domain Knowledge
+- **Read the literature**: if some code involves scientific or biological concepts, feel free to search for review papers and read those before changing code that expresses scientific concepts.
+- **Flag inconsistencies**: if code expresses a scientific model that's at odds with your understanding, note that inconsistency and ask for clarification.
=====================================
pyensembl/genome.py
=====================================
@@ -589,7 +589,7 @@ class Genome(Serializable):
def protein_ids_at_locus(self, contig, position, end=None, strand=None):
return self.db.distinct_column_values_at_locus(
column="protein_id",
- feature="transcript",
+ feature="CDS",
contig=contig,
position=position,
end=end,
=====================================
pyensembl/shell.py
=====================================
@@ -69,9 +69,7 @@ parser.add_argument(
)
-root_group = parser.add_mutually_exclusive_group()
-
-release_group = root_group.add_argument_group()
+release_group = parser.add_argument_group("Ensembl release options")
release_group.add_argument(
"--release",
type=int,
@@ -93,7 +91,7 @@ release_group.add_argument(
help="URL and directory to use instead of the default Ensembl FTP server",
)
-path_group = root_group.add_argument_group()
+path_group = parser.add_argument_group("Custom genome options")
path_group.add_argument(
"--reference-name",
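
A minimal sketch of the resulting layout, assuming only the standard-library argparse module: two titled argument groups attached directly to the parser. The option set is trimmed to two flags, and `build_parser` and the `prog` name are hypothetical names chosen for this example; the real parser in pyensembl/shell.py defines more options. The commit does not state a reason for the change. As background, the argparse documentation describes calling add_argument_group() on a mutually exclusive group as deprecated since Python 3.11.

```python
import argparse


def build_parser():
    # Hypothetical, trimmed-down parser; not the full pyensembl CLI.
    parser = argparse.ArgumentParser(prog="example")

    # Titled groups only affect how --help output is organized.
    release_group = parser.add_argument_group("Ensembl release options")
    release_group.add_argument("--release", type=int, default=None)

    path_group = parser.add_argument_group("Custom genome options")
    path_group.add_argument("--reference-name", default=None)
    return parser


args = build_parser().parse_args(["--release", "77"])
print(args.release, args.reference_name)
```

Titled groups do not enforce exclusivity between the two sets of options; any such check would have to be done after parsing.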
=====================================
pyensembl/transcript.py
=====================================
@@ -13,9 +13,28 @@
from memoized_property import memoized_property
from .common import memoize
+from .exon import Exon
from .locus_with_genome import LocusWithGenome
+def _merge_ranges(ranges):
+ """
+ Sort [(start, end)] inclusive-inclusive ranges and merge any that are
+ adjacent or overlapping (end+1 == next start).
+ """
+ if not ranges:
+ return []
+ ordered = sorted(ranges)
+ merged = [ordered[0]]
+ for start, end in ordered[1:]:
+ prev_start, prev_end = merged[-1]
+ if start <= prev_end + 1:
+ merged[-1] = (prev_start, max(prev_end, end))
+ else:
+ merged.append((start, end))
+ return merged
+
+
class Transcript(LocusWithGenome):
"""
Transcript encompasses the locus, exons, and sequence of a transcript.
@@ -123,23 +142,25 @@ class Transcript(LocusWithGenome):
def exons(self):
# need to look up exon_number alongside ID since each exon may
# appear in multiple transcripts and have a different exon number
- # in each transcript
- columns = ["exon_number", "exon_id"]
- exon_numbers_and_ids = self.db.query(
+ # in each transcript.
+ # Older or non-Ensembl GTFs may omit the exon_id attribute, in
+ # which case we build Exon objects directly from the exon row
+ # and synthesize a stable per-transcript ID.
+ has_exon_id = self.db.column_exists("exon", "exon_id")
+ if has_exon_id:
+ columns = ["exon_number", "exon_id"]
+ else:
+ columns = ["exon_number", "seqname", "start", "end", "strand"]
+ rows = self.db.query(
columns, filter_column="transcript_id", filter_value=self.id, feature="exon"
)
# fill this list in its correct order (by exon_number) by using
# the exon_number as a 1-based list offset
- exons = [None] * len(exon_numbers_and_ids)
+ exons = [None] * len(rows)
- for exon_number, exon_id in exon_numbers_and_ids:
- exon = self.genome.exon_by_id(exon_id)
- if exon is None:
- raise ValueError(
- "Missing exon %s for transcript %s" % (exon_number, self.id)
- )
- exon_number = int(exon_number)
+ for row in rows:
+ exon_number = int(row[0])
if exon_number < 1:
raise ValueError("Invalid exon number: %s" % exon_number)
elif exon_number > len(exons):
@@ -148,9 +169,27 @@ class Transcript(LocusWithGenome):
% (exon_number, len(exons))
)
+ if has_exon_id:
+ exon_id = row[1]
+ exon = self.genome.exon_by_id(exon_id)
+ if exon is None:
+ raise ValueError(
+ "Missing exon %s for transcript %s" % (exon_number, self.id)
+ )
+ else:
+ _, seqname, start, end, strand = row
+ exon = Exon(
+ exon_id="%s_exon_%d" % (self.id, exon_number),
+ contig=seqname,
+ start=start,
+ end=end,
+ strand=strand,
+ gene_name=self.gene_name,
+ gene_id=self.gene_id,
+ )
+
# exon_number is 1-based, convert to list index by subtracting 1
- exon_idx = exon_number - 1
- exons[exon_idx] = exon
+ exons[exon_number - 1] = exon
return exons
# possible annotations associated with transcripts
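
The fallback path in the hunk above can be sketched without a database. This is an illustration under stated assumptions, not the pyensembl implementation: `ExonRecord` is a hypothetical stand-in for pyensembl's Exon class, `order_exons` is a hypothetical helper, and the rows mimic the (exon_number, seqname, start, end, strand) tuples that the query is expected to return.

```python
from collections import namedtuple

# Hypothetical stand-in for pyensembl's Exon; only the fields used here.
ExonRecord = namedtuple("ExonRecord", ["exon_id", "contig", "start", "end", "strand"])


def order_exons(transcript_id, rows):
    """Place exon rows into a list by their 1-based exon_number,
    synthesizing an ID of the form <transcript_id>_exon_<n>."""
    exons = [None] * len(rows)
    for exon_number, seqname, start, end, strand in rows:
        exon_number = int(exon_number)
        if exon_number < 1 or exon_number > len(exons):
            raise ValueError("Invalid exon number: %s" % exon_number)
        # exon_number is 1-based, so subtract 1 for the list index
        exons[exon_number - 1] = ExonRecord(
            exon_id="%s_exon_%d" % (transcript_id, exon_number),
            contig=seqname,
            start=start,
            end=end,
            strand=strand,
        )
    return exons


# Rows deliberately out of order to show that exon_number drives placement.
rows = [("2", "1", 300, 500, "+"), ("1", "1", 100, 200, "+")]
exons = order_exons("T1", rows)
print([e.exon_id for e in exons])  # ['T1_exon_1', 'T1_exon_2']
```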
@@ -388,9 +427,15 @@ class Transcript(LocusWithGenome):
def coding_sequence_position_ranges(self):
"""
Return absolute chromosome position ranges for CDS fragments
- of this transcript
+ of this transcript, including the stop codon (which Ensembl
+ encodes as a separate feature from the CDS).
"""
- return self._transcript_feature_position_ranges("CDS")
+ ranges = list(self._transcript_feature_position_ranges("CDS"))
+ if self.contains_stop_codon:
+ ranges.extend(
+ self._transcript_feature_position_ranges("stop_codon", required=False)
+ )
+ return _merge_ranges(ranges)
@memoized_property
def complete(self):
=====================================
pyensembl/version.py
=====================================
@@ -1,4 +1,4 @@
-__version__ = "2.6.0"
+__version__ = "2.6.7"
def print_version():
print(f"v{__version__}")
=====================================
pyproject.toml
=====================================
@@ -25,8 +25,8 @@ dependencies = [
"datacache>=1.4.0,<2.0.0",
"memoized-property>=1.0.2",
"tinytimer>=0.0.0,<1.0.0",
- "gtfparse>=2.5.0,<3.0.0",
- "serializable>=0.2.1,<1.0.0",
+ "gtfparse>=2.6.0,<3.0.0",
+ "serializable>=0.2.1,<2.0.0",
"numpy>=2.0.0,<3.0.0",
]
=====================================
tests/test_build_system.py
=====================================
@@ -52,40 +52,6 @@ def test_build_system_backend():
assert config["build-system"]["build-backend"] == "setuptools.build_meta"
-def test_dependencies_correct():
- """Test that runtime dependencies match specification."""
- try:
- import tomllib
- except ImportError:
- import tomli as tomllib
-
- project_root = Path(__file__).parent.parent
- pyproject_path = project_root / "pyproject.toml"
-
- with open(pyproject_path, "rb") as f:
- config = tomllib.load(f)
-
- expected_deps = {
- "typechecks>=0.0.2,<1.0.0",
- "datacache>=1.4.0,<2.0.0",
- "memoized-property>=1.0.2",
- "tinytimer>=0.0.0,<1.0.0",
- "gtfparse>=2.5.0,<3.0.0",
- "serializable>=0.2.1,<1.0.0",
- "numpy<2",
- }
-
- actual_deps = set(config["project"]["dependencies"])
-
- assert actual_deps == expected_deps, (
- f"Dependencies mismatch.\n"
- f"Expected: {expected_deps}\n"
- f"Actual: {actual_deps}\n"
- f"Missing: {expected_deps - actual_deps}\n"
- f"Extra: {actual_deps - expected_deps}"
- )
-
-
def test_no_pylint_in_runtime_deps():
"""
Test that pylint is not in runtime dependencies.
=====================================
tests/test_gene_ids.py
=====================================
@@ -15,17 +15,18 @@ ensembl77 = cached_release(77, "human")
def test_gene_ids_grch38_hla_a():
- # chr6:29,945,884 is a position for HLA-A
- # Gene ID = ENSG00000206503
- # based on:
+ # chr6:29,945,884 is a position for HLA-A (ENSG00000206503).
+ # Ensembl release 114 introduced overlapping gene POLR1HASP
+ # (ENSG00000293508) at the same locus, so accept either HLA-A alone
+ # or the HLA-A + POLR1HASP pair.
# http://useast.ensembl.org/Homo_sapiens/Gene/
# Summary?db=core;g=ENSG00000206503;r=6:29941260-29945884
- ids = ensembl_grch38.gene_ids_at_locus(6, 29945884)
- expected = "ENSG00000206503"
- assert ids == ["ENSG00000206503"], "Expected HLA-A, gene ID = %s, got: %s" % (
- expected,
- ids,
- )
+ ids = set(ensembl_grch38.gene_ids_at_locus(6, 29945884))
+ assert "ENSG00000206503" in ids, "Expected HLA-A (ENSG00000206503), got: %s" % (ids,)
+ if len(ids) > 1:
+ assert ids == {"ENSG00000206503", "ENSG00000293508"}, (
+ "Expected HLA-A alone or HLA-A + POLR1HASP, got: %s" % (ids,)
+ )
def test_gene_ids_of_gene_name_hla_grch38():
=====================================
tests/test_gene_names.py
=====================================
@@ -34,12 +34,17 @@ def test_all_gene_names(genome):
def test_gene_names_at_locus_grch38_hla_a():
- # chr6:29,945,884 is a position for HLA-A
- # based on:
+ # chr6:29,945,884 is a position for HLA-A. Ensembl release 114
+ # introduced overlapping gene POLR1HASP at the same locus, so accept
+ # either HLA-A alone or the HLA-A + POLR1HASP pair.
# http://useast.ensembl.org/Homo_sapiens/Gene/
# Summary?db=core;g=ENSG00000206503;r=6:29941260-29945884
- names = grch38.gene_names_at_locus(6, 29945884)
- assert names == ["HLA-A"], "Expected gene name HLA-A, got: %s" % (names,)
+ names = set(grch38.gene_names_at_locus(6, 29945884))
+ assert "HLA-A" in names, "Expected gene name HLA-A, got: %s" % (names,)
+ if len(names) > 1:
+ assert names == {"HLA-A", "POLR1HASP"}, (
+ "Expected HLA-A alone or HLA-A + POLR1HASP, got: %s" % (names,)
+ )
@run_multiple_genomes()
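
The two HLA-A locus tests above (gene IDs and gene names) share one pattern: require a value, and tolerate exactly one known extra. A sketch of that pattern as a standalone helper, assuming nothing about Ensembl data; `check_locus_values` is a hypothetical name and is not part of the test suite.

```python
def check_locus_values(values, required, optional_extra):
    """Assert that `required` is present and that, if anything else is
    present, the full set is exactly {required, optional_extra}."""
    found = set(values)
    assert required in found, "Expected %s, got: %s" % (required, found)
    if len(found) > 1:
        assert found == {required, optional_extra}, (
            "Expected %s alone or %s + %s, got: %s"
            % (required, required, optional_extra, found)
        )
    return found


# Both shapes of result pass: the required value alone, or the known pair.
alone = check_locus_values(["HLA-A"], "HLA-A", "POLR1HASP")
pair = check_locus_values(["POLR1HASP", "HLA-A"], "HLA-A", "POLR1HASP")
print(sorted(alone), sorted(pair))
```

Any third, unexpected value still fails the check, so the tests stay strict about new overlaps while passing on both older and newer releases.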
=====================================
tests/test_transcript_ids.py
=====================================
@@ -60,3 +60,19 @@ def test_transcript_id_of_protein_id_CCR2():
# Ensembl release 104, GRCh38.p13
transcript_id = grch38.transcript_id_of_protein_id("ENSP00000399285")
eq_("ENST00000445132", transcript_id)
+
+
+def test_protein_ids_at_locus_grch38_hla_a():
+ # Regression test for https://github.com/openvax/pyensembl/issues/286:
+ # protein_ids_at_locus previously queried the "transcript" feature,
+ # which stores an empty protein_id, so results were a list of empty
+ # strings instead of real protein IDs.
+ # chr6:29,942,555 falls inside the first CDS of HLA-A transcripts.
+ protein_ids = set(grch38.protein_ids_at_locus(6, 29942555))
+ assert protein_ids, "Expected non-empty protein IDs at HLA-A CDS locus"
+ assert "" not in protein_ids, (
+ "protein_ids_at_locus should not return empty strings: %s" % (protein_ids,)
+ )
+ assert "ENSP00000365998" in protein_ids, (
+ "Expected HLA-A protein ENSP00000365998 at locus, got: %s" % (protein_ids,)
+ )
=====================================
tests/test_transcript_sequences.py
=====================================
@@ -17,3 +17,21 @@ def test_transcript_sequence_ensembl_grch38():
eq_(seq, expected)
# now try via a Transcript object
eq_(grch38.transcript_by_id("ENST00000448914").sequence, expected)
+
+
+def test_coding_sequence_matches_position_ranges():
+ # Regression test for https://github.com/openvax/pyensembl/issues/176:
+ # len(coding_sequence) used to exceed the total length of
+ # coding_sequence_position_ranges by exactly 3 because Ensembl's CDS
+ # feature excludes the stop codon while coding_sequence includes it.
+ for transcript_id in [
+ "ENST00000311936",
+ "ENST00000371085",
+ "ENST00000275493",
+ ]:
+ transcript = grch38.transcript_by_id(transcript_id)
+ cds_length = len(transcript.coding_sequence)
+ ranges_length = sum(
+ end - start + 1 for (start, end) in transcript.coding_sequence_position_ranges
+ )
+ eq_(cds_length, ranges_length)
=====================================
tests/test_ucsc_gtf.py
=====================================
@@ -53,6 +53,51 @@ def test_ucsc_gencode_genome():
eq_(transcript_1_30564[0].id, "uc057aty.1")
+def test_transcript_exons_without_exon_id():
+ """
+ Regression test: older Ensembl releases (e.g. release 54) and other
+ GTFs omit the exon_id attribute while still providing exon_number.
+ Transcript.exons previously raised
+ sqlite3.OperationalError: no such column: exon_id; it now falls back
+ to building Exon objects directly from the exon row with a
+ synthesized per-transcript ID.
+ """
+ import os
+
+ with TemporaryDirectory() as tmpdir:
+ gtf_path = os.path.join(tmpdir, "no_exon_id.gtf")
+ with open(gtf_path, "w") as f:
+ # minimal Ensembl-style GTF: transcript + 2 ordered exons,
+ # with exon_number but no exon_id (as in Ensembl release 54)
+ f.write(
+ '1\ttest\ttranscript\t100\t500\t.\t+\t.\t'
+ 'gene_id "G1"; transcript_id "T1"; gene_name "FN1";\n'
+ '1\ttest\texon\t100\t200\t.\t+\t.\t'
+ 'gene_id "G1"; transcript_id "T1"; exon_number "1"; gene_name "FN1";\n'
+ '1\ttest\texon\t300\t500\t.\t+\t.\t'
+ 'gene_id "G1"; transcript_id "T1"; exon_number "2"; gene_name "FN1";\n'
+ )
+ genome = Genome(
+ reference_name="GRCh38",
+ annotation_name="no_exon_id_test",
+ gtf_path_or_url=gtf_path,
+ cache_directory_path=tmpdir,
+ )
+ genome.index()
+ assert not genome.db.column_exists("exon", "exon_id"), (
+ "Test fixture unexpectedly has exon_id — update this test"
+ )
+ transcript = genome.transcript_by_id("T1")
+ exons = transcript.exons
+ eq_(len(exons), 2)
+ eq_(exons[0].id, "T1_exon_1")
+ eq_(exons[0].start, 100)
+ eq_(exons[0].end, 200)
+ eq_(exons[1].id, "T1_exon_2")
+ eq_(exons[1].start, 300)
+ eq_(exons[1].end, 500)
+
+
def test_ucsc_refseq_gtf():
"""
Test GTF object with a small RefSeq GTF file downloaded from
View it on GitLab: https://salsa.debian.org/med-team/pyensembl/-/commit/b44d3559fa471d8ba468856502580f8cdb98d411