[med-svn] [Git][med-team/xpore][master] 3 commits: New upstream version 2.1

Nilesh Patra (@nilesh) gitlab at salsa.debian.org
Tue Oct 19 12:12:15 BST 2021



Nilesh Patra pushed to branch master at Debian Med / xpore


Commits:
d07611b8 by Nilesh Patra at 2021-10-19T16:37:55+05:30
New upstream version 2.1
- - - - -
5af9ac1c by Nilesh Patra at 2021-10-19T16:37:55+05:30
Update upstream source from tag 'upstream/2.1'

Update to upstream version '2.1'
with Debian dir 20cb41edd838158d49404c3c347a82a69fa511b5
- - - - -
bea8751e by Nilesh Patra at 2021-10-19T16:38:52+05:30
Upload to unstable

- - - - -


18 changed files:

- .gitignore
- README.md
- debian/changelog
- + docs/source/citing.rst
- docs/source/scripts.rst → docs/source/cmd.rst
- docs/source/conf.py
- + docs/source/data.rst
- + docs/source/help.rst
- docs/source/index.rst
- docs/source/installation.rst
- docs/source/quickstart.rst
- − gtf
- setup.py
- xpore/__init__.py
- xpore/scripts/dataprep.py
- xpore/scripts/diffmod.py
- + xpore/scripts/postprocessing.py
- + xpore/scripts/xpore.py


Changes:

=====================================
.gitignore
=====================================
@@ -131,3 +131,6 @@ dmypy.json
 
 # Pyre type checker
 .pyre/
+
+# DS store
+.DS_Store


=====================================
README.md
=====================================
@@ -20,17 +20,20 @@ To install the latest release with PyPI (recommended), run
 
 ```sh
 $ pip install xpore 
-$ pyensembl install --release 91 --species homo_sapiens  # please specify the compatible Ensembl release with your data when you install it.
 ```
+---
+
 ### Documentation
 
-Please refer to the xPore documention ([https://xpore.readthedocs.io](https://xpore.readthedocs.io)) for additional information, a quick start guide, and details on the data processing and output file format.
+Please refer to our xPore documentation ([https://xpore.readthedocs.io](https://xpore.readthedocs.io)) for additional information, a quick start guide, and details on the data processing and output file format.
+
+xPore is described in details in [Pratanwanich et al. *Nat Biotechnol* (2021)](https://doi.org/10.1038/s41587-021-00949-w)
 
-xPore is described in detail in a preprint (https://www.biorxiv.org/content/10.1101/2020.06.18.160010v1)
+---
 
 ### Release History
 
-The current release is xPore v1.0.
+The current release is xPore v2.1. 
 
 Please refer to the github release history for previous releases: https://github.com/GoekeLab/xpore/releases
 
@@ -38,17 +41,19 @@ Please refer to the github release history for previous releases: https://github
 
 ### Citing xPore
 
-If you use xPore in your research, please cite
-
-[Ploy N. Pratanwanich, Ploy N. et al. "Detection of differential RNA modifications from direct RNA sequencing of human cell lines". *bioRxiv* (2020) doi: https://doi.org/10.1101/2020.06.18.160010](https://www.biorxiv.org/content/10.1101/2020.06.18.160010v1)
+If you use xPore in your research, please cite 
 
+Ploy N. Pratanwanich, et al. Identification of differential RNA modifications from nanopore direct RNA sequencing with xPore. *Nat Biotechnol* (2021), [https://doi.org/10.1038/s41587-021-00949-w](https://doi.org/10.1038/s41587-021-00949-w).
 
 ---
 
-### Contact
+### Getting Help
 
-xPore is maintained by [Ploy N. Pratanwanich](https://github.com/ploy-np) and [Jonathan Goeke](https://github.com/jonathangoeke) from the Genome Institute of Singapore, A*STAR. 
-
-If you want to contribute, please leave an issue. 
+We appreciate your feedback and questions! You can report any error or suggestion related to xPore as an [issue on github](https://github.com/GoekeLab/xpore/issues). If you have questions related to the manuscript, data, or any general comment or suggestion please use the [Discussions](https://github.com/GoekeLab/xpore/discussions).
 
 Thank you!
+
+---
+
+### Contact
+xPore is maintained by [Ploy N. Pratanwanich](https://github.com/ploy-np), [Yuk Kei Wan](https://github.com/yuukiiwa) and [Jonathan Goeke](https://github.com/jonathangoeke) from the Genome Institute of Singapore, A\*STAR. 


=====================================
debian/changelog
=====================================
@@ -1,3 +1,11 @@
+xpore (2.1-1) unstable; urgency=medium
+
+  * Team Upload.
+  * New upstream version 2.1
+  * Standards version: 4.6.0
+
+ -- Nilesh Patra <nilesh at debian.org>  Tue, 19 Oct 2021 16:38:22 +0530
+
 xpore (1.0-2) unstable; urgency=medium
 
   * Team Upload.


=====================================
docs/source/citing.rst
=====================================
@@ -0,0 +1,10 @@
+.. _citing:
+
+Citing xPore
+==================
+
+If you use xPore in your research, please cite
+
+Ploy N. Pratanwanich, et al., Identification of differential RNA modifications from nanopore direct RNA sequencing with xPore. *Nat Biotechnol* (2021), `https://doi.org/10.1038/s41587-021-00949-w <https://doi.org/10.1038/s41587-021-00949-w>`_.
+
+Thank you!


=====================================
docs/source/scripts.rst → docs/source/cmd.rst
=====================================
@@ -1,7 +1,7 @@
-.. _scripts:
+.. _cmd:
 
-Scripts
-==========
+Command line arguments
+=======================
 
 We provide 2 main scripts to run the analysis of differential RNA modifications as the following.
 
@@ -10,27 +10,20 @@ We provide 2 main scripts to run the analysis of differential RNA modifications
 
 * Input
 
-Output files from `nanopolish eventalgin`. Please refer to :ref:`Data preparation <preparation>` for the full Nanopolish command.
-
-* Usage example
+Output files from ``nanopolish eventalgin``. Please refer to :ref:`Data preparation <preparation>` for the full Nanopolish command.
 
 =================================   ==========  ===================  ============================================================================================================
-Argument name(s)                    Required    Default value         Description
+Argument name                       Required    Default value         Description
 =================================   ==========  ===================  ============================================================================================================
 --eventalign=FILE                   Yes         NA                    Eventalign filepath, the output from nanopolish.         
---summary=FILE                      Yes         NA                    Eventalign summary filepath, the output from nanopolish.
 --out_dir=DIR                       Yes         NA                    Output directory.
---ensembl=NUM                       No          91                    Ensembl version for gene-transcript mapping.
---species=STR                       No          homo_sapiens          Species for ensembl gene-transcript mapping.
---customised_genome                 No          False                 If customised genome provided.
---reference_name                    No          NA                    fasta reference name.
---annotation_name                   No          NA                    gtf annotation name.
---gtf_path_or_url                   No          NA                    gtf file path or url.
---transcript_fasta_paths_or_urls    No          NA                    Transcript fasta paths or urls.
+--gtf_path_or_url                   No          NA                    GTF file path or url used for mapping transcriptomic to genomic coordinates..
+--transcript_fasta_paths_or_urls    No          NA                    Transcript FASTA paths or urls used for mapping transcriptomic to genomic coordinates.
 --skip_eventalign_indexing          No          False                 To skip indexing the eventalign nanopolish output.
 --genome                            No          False                 To run on Genomic coordinates. Without this argument, the program will run on transcriptomic coordinates.
 --n_processes=NUM                   No          1                     Number of processes to run.
 --readcount_max=NUM                 No          1000                  Maximum read counts per gene.
+--readcount_min=NUM                 No          1                     Minimum read counts per gene.
 --resume                            No          False                 With this argument, the program will resume from the previous run.
 =================================   ==========  ===================  ============================================================================================================
 
@@ -51,14 +44,12 @@ data.readcount          csv             Summary of readcounts per gene.
 
 * Input
 
-Output files from `xpore-dataprep`.
-
-* Usage example
+Output files from ``xpore-dataprep``.
 
 ===================  ==========  ===============      ==============================================================================
-Argument name(s)      Required    Default value       Description
+Argument name         Required    Default value       Description
 ===================  ==========  ===============      ==============================================================================
---config=FILE           Yes         NA                Yaml configuraion filepath.
+--config=FILE           Yes         NA                YAML configurtaion filepath.
 --n_processes=NUM       No          1                 Number of processes to run.
 --save_models           No          False             With this argument, the program will save the model parameters for each id.
 --resume                No          False             With this argument, the program will resume from the previous run.
@@ -73,4 +64,17 @@ File name                File type           Description
 diffmod.table            csv                 Output table information of differential modification rates. Please refer to :ref:`Output table description <outputtable>` for the full description.   
 diffmod.log              txt                 Gene/Transcript ids being processed.
 ======================  ===============     =================================================================================================================================================
-   
+
+``xpore-postprocessing``
+**************************
+
+* Input
+
+The ``diffmod.table`` file  from ``xpore-diffmod``.
+
+======================  ===============     =======================================================================
+Argument name            Required           Description
+======================  ===============     =======================================================================
+--diffmod_dir            Yes                Path of the directory containing ``diffmod.table``.
+======================  ===============     =======================================================================
+


=====================================
docs/source/conf.py
=====================================
@@ -26,9 +26,9 @@ copyright = '2020, Ploy N. Pratanwanich'
 author = 'Ploy N. Pratanwanich'
 
 # The short X.Y version
-version = '1.0'
+version = '2.0'
 # The full version, including alpha/beta/rc tags
-release = '1.0'
+release = '2.0'
 
 
 # -- General configuration ---------------------------------------------------


=====================================
docs/source/data.rst
=====================================
@@ -0,0 +1,14 @@
+.. _data:
+
+Data
+======
+
+You can find the links to all preprocessed data used in our manuscript at Zenodo `for the SGNEx data <https://doi.org/10.5281/zenodo.4604945>`_ and `for the others samples <https://doi.org/10.5281/zenodo.4587661>`_.
+All the raw fast5 and fastq files are also available at `ENA <https://www.ebi.ac.uk/ena/browser/view/PRJEB40872>`_ and `SGNEx <https://github.com/GoekeLab/sg-nex-data>`_. Please refer to our Supplementary Table S7 in our manuscript for full details of data download.
+
+
+Note that all HEK293T-KO samples can be used as unmodified (m6A) controls for any other data set generated with the same RNA kit (SQK-RNA001).
+If the cells are genetically different, we recommend to perform variant calling before finalising the list of differentially modified sites in order to remove false positives.
+
+
+


=====================================
docs/source/help.rst
=====================================
@@ -0,0 +1,9 @@
+.. _help:
+
+Getting Help
+==================
+
+
+We appreciate your feedback and questions! You can report any error or suggestion related to xPore as an `issue on github <https://github.com/GoekeLab/xpore/issues>`_. If you have questions related to the manuscript, data, or any general comment or suggestion please use the `Discussions <https://github.com/GoekeLab/xpore/discussions>`_.
+
+Thank you!


=====================================
docs/source/index.rst
=====================================
@@ -5,15 +5,18 @@
 
 Welcome to xPore's documentation!
 =================================
-xPore is a Python package for detection of differentail RNA modifications from Nanopore sequencing data.
+xPore is a Python package for identification of differentail RNA modifications from Nanopore sequencing data.
 
 To install the latest release, run::
 
     pip install xpore
-    pyensembl install --release 91 --species homo_sapiens  # Please specify the compatible Ensembl release with your data when you install it.
 
 See our :ref:`Installation page <installation>` for details.
 
+To check the version of xPore, run::
+
+    xpore -v
+
 To detect differential modifications, you can follow the instructions in our :ref:`Quickstart page <quickstart>`. 
 
 Contents
@@ -26,11 +29,18 @@ Contents
    outputtable
    configuration
    preparation
-   scripts
+   data
+   cmd
+   citing
+   help
 
 Contacts
 --------
-xPore is maintained by `Ploy N. Pratanwanich <https://github.com/ploy-rukawa>`_ and `Jonathan Goeke <https://github.com/jonathangoeke>`_ from the Genome Institute of Singapore, A*STAR. 
+If you use xPore in your research, please cite
+
+Ploy N. Pratanwanich, et al.,Identification of differential RNA modifications from nanopore direct RNA sequencing with xPore. *Nat Biotechnol* (2021), `https://doi.org/10.1038/s41587-021-00949-w <https://doi.org/10.1038/s41587-021-00949-w>`_ 
+
+xPore is maintained by `Ploy N. Pratanwanich <https://github.com/ploy-np>`_, `Yuk Kei Wan <https://github.com/yuukiiwa>`_ and `Jonathan Goeke <https://github.com/jonathangoeke>`_ from the Genome Institute of Singapore, A*STAR. 
 
 If you want to contribute, please leave an issue in `our repo <https://github.com/GoekeLab/xpore>`_
 


=====================================
docs/source/installation.rst
=====================================
@@ -10,7 +10,6 @@ PyPI installation (recommended)
 ::
 
     pip install xpore
-    pyensembl install --release 91 --species homo_sapiens  # please specify the compatible Ensembl release with your data when you install it.
 
 Installation from our GitHub repository
 ---------------------------------------
@@ -19,4 +18,4 @@ Installation from our GitHub repository
     git clone https://github.com/GoekeLab/xpore.git
     cd xpore
     python setup.py install
-    pyensembl install --release 91 --species homo_sapiens  # please specify the compatible Ensembl release with your data when you install it.
+


=====================================
docs/source/quickstart.rst
=====================================
@@ -3,9 +3,9 @@
 Quickstart - Detection of differential RNA modifications
 =========================================================
 
-Download and extract the demo dataset from our `S3 bucket <http://s3.ap-southeast-1.amazonaws.com/all-public-data.store.genome.sg/xpore/demo.tar.gz>`_::
+Download and extract the demo dataset from our `zenodo <https://zenodo.org/record/5162402/files/demo.tar.gz>`_::
 
-    wget http://s3.ap-southeast-1.amazonaws.com/all-public-data.store.genome.sg/xpore/demo.tar.gz
+    wget https://zenodo.org/record/5162402/files/demo.tar.gz
     tar -xvf demo.tar.gz
 
 After extraction, you will find::
@@ -15,20 +15,25 @@ After extraction, you will find::
     |-- data
         |-- HEK293T-METTL3-KO-rep1  # dataset dir
         |-- HEK293T-WT-rep1 # dataset dir
+    |-- demo.gtf # GTF (general transfer format) file for transcript-to-gene mapping  
+    |-- demo.fa # transcriptome reference FASTA file for transcript-to-gene mapping
 
 Each dataset under the ``data`` directory contains the following directories:
 
-* ``fast5`` : Raw signal FAST5 files
-* ``fastq`` : Basecalled reads
-* ``bamtx`` : Transcriptome-aligned sequence
+* ``fast5`` : Raw signal, FAST5 files
+* ``fastq`` : Basecalled reads, FASTQ file
+* ``bamtx`` : Transcriptome-aligned sequence, BAM file
 * ``nanopolish``: Eventalign files obtained from `nanopolish eventalign <https://nanopolish.readthedocs.io/en/latest/quickstart_eventalign.html>`_
 
-1. Preprocess the data for each data set using ``xpore-dataprep``. (This step will take approximately 5h for 1 million reads)::
+Note that the FAST5, FASTQ and BAM files are required to obtain the eventalign file with Nanopolish, xPore only requires the eventalign file. See our :ref:`Data preparation page <preparation>` for details to obtain the eventalign file from raw reads.
+
+1. Preprocess the data for each data set using ``xpore dataprep``. Note that the ``--gtf_or_gff`` and ``--transcript_fasta`` arguments are required to map transcriptomic to genomic coordinates when the ``--genome`` option is chosen, so that xPore can run based on genome coordinates. However, GTF is the recommended option. If GFF is the only file available, please note that the GFF file works with GENCODE or ENSEMBL FASTA files, but not UCSC FASTA files. We plan to remove the requirement of FASTA files in a future release.(This step will take approximately 5h for 1 million reads)::
 
     # Within each dataset directory i.e. demo/data/HEK293T-METTL3-KO-rep1 and demo/data/HEK293T-WT-rep1, run
-    xpore-dataprep \
+    xpore dataprep \
     --eventalign nanopolish/eventalign.txt \
-    --summary nanopolish/summary.txt \
+    --gtf_or_gff ../../demo.gtf \
+    --transcript_fasta ../../demo.fa \
     --out_dir dataprep \
     --genome  
 
@@ -40,7 +45,7 @@ The output files are stored under ``dataprep`` in each  dataset directory:
 * ``data.readcount`` : Summary of readcounts per gene
 * ``data.log`` : Log file
 
-Run ``xpore-dataprep -h`` to explore the full usage.
+Run ``xpore dataprep -h`` or visit our :ref:`Command line arguments <cmd>` to explore the full usage description. 
 
 2. Prepare a ``.yml`` configuration file. With this YAML file, you can specify the information of your design experiment, the data directories, the output directory, and the method options.
 In the demo directory, there is an example configuration file ``Hek293T_config.yaml`` available that you can use as a starting template.
@@ -56,28 +61,22 @@ Below is how it looks like::
 
     out: ./out # output dir
 
-    method:
-        prefiltering:
-            method: t-test
-            threshold: 0.1
-        stopping_criteria: 0.0001
 
-Note that if high accuracy is required, the ``prefiltering`` section should be removed and the processing time is accordingly longer.
-See the :ref:`Configuration file page <configuration>` for details.
+See the :ref:`Configuration file page <configuration>` for more details.
 
 3. Now that we have the data and the configuration file ready for modelling differential modifications using ``xpore-diffmod``. 
 
 ::
 
     # At the demo directory where the configuration file is, run.
-    xpore-diffmod --config Hek293T_config.yml
+    xpore diffmod --config Hek293T_config.yml
 
 The output files are generated within the ``out`` directory:
 
 * ``diffmod.table`` : Result table of differential RNA modification across all tested positions
 * ``diffmod.log`` : Log file
 
-Run ``xpore-diffmod -h`` to explore the full usage.
+Run ``xpore diffmod -h`` or visit our :ref:`Command line arguments <cmd>` to explore the full usage description.
 
 We can rank the significantly differentially modified sites based on ``pval_HEK293T-KO_vs_HEK293T-WT``. The results are shown below.::
 
@@ -88,9 +87,15 @@ We can rank the significantly differentially modified sites based on ``pval_HEK2
     ENSG00000159111   47824137  GGACA               -0.604056   7.614675e-24        -10.068479  ...      7.164075    4.257725       0.553929     0.353160           lower  2.294337e-10
     ENSG00000114125  141745249  GGACT               -0.514980   2.779122e-19         -8.977134  ...      5.215243   20.598471       0.954968     0.347174           lower  1.304111e-06
 
-**Notes:** We can consider only one modification type per k-mer by finding the majority ``mod_assignment`` of each k-mer. 
+4. (Optional) We can consider only one modification type per k-mer by finding the majority ``mod_assignment`` of each k-mer. 
 For example, the majority of the modification means of ``GGACT`` (``mu_mod``) is lower than the non-modification counterpart (``mu_unmod``). 
-This can be achieved by simply running ``groupby`` on the ``kmer`` and ``mod_assignment`` columns in Python.
-We can then remove those positions with the ``mod_assigment`` not in line with the majority in order to restrict ourselves with one modification type per kmer in the analysis.
-You can find more details in our paper.
+We can filter out those positions whose ``mod_assigment`` values are not in line with those of the majority in order to restrict ourselves with one modification type per kmer in the analysis.
+This can be done by running ``xpore postprocessing``.
+
+::
+
+    xpore postprocessing --diffmod_dir out
+
+With this command, we will get the final file in which only kmers with their ``mod_assignment`` different from the majority assigment of the corresponding kmer are removed. The output file ``majority_direction_kmer_diffmod.table`` is generated in the ``out`` directtory. You can find more details in our paper.
 
+Run ``xpore postprocessing -h`` or visit our :ref:`Command line arguments <cmd>` to explore the full usage description.


=====================================
gtf deleted
=====================================


=====================================
setup.py
=====================================
@@ -14,7 +14,7 @@ setup(
     name=__pkg_name__,
     license="MIT",
     description='xpore is a python package for Nanopore data analysis of differential RNA modifications.',
-    version='v1.0',
+    version='v2.1',
     long_description=README,
     long_description_content_type='text/markdown',
     url='https://github.com/GoekeLab/xpore',
@@ -30,8 +30,7 @@ setup(
             'ujson>=4.0.1'
             ],
     python_requires=">=3.8",
-    entry_points={'console_scripts': ["xpore-dataprep={}.scripts.dataprep:main".format(__pkg_name__),
-                                      "xpore-diffmod={}.scripts.diffmod:main".format(__pkg_name__)]},
+    entry_points={'console_scripts': ["xpore={}.scripts.xpore:main".format(__pkg_name__)]},
     classifiers=[
         # Trove classifiers
         # (https://pypi.python.org/pypi?%3Aaction=list_classifiers)


=====================================
xpore/__init__.py
=====================================
@@ -1 +1 @@
-__version__ = "1.0"
\ No newline at end of file
+__version__ = "2.0"


=====================================
xpore/scripts/dataprep.py
=====================================
@@ -1,4 +1,3 @@
-import argparse
 import numpy as np
 import pandas as pd
 import os,re
@@ -6,8 +5,6 @@ import multiprocessing
 import h5py
 import csv
 import ujson
-from pyensembl import EnsemblRelease
-from pyensembl import Genome
 from operator import itemgetter
 from collections import defaultdict
 from io import StringIO
@@ -15,43 +12,6 @@ from io import StringIO
 from . import helper
 from ..utils import misc
 
-def get_args():
-    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
-
-    optional = parser._action_groups.pop()
-    required = parser.add_argument_group('required arguments')
-
-    # Required arguments
-    required.add_argument('--eventalign', dest='eventalign', help='eventalign filepath, the output from nanopolish.',required=True)
-    required.add_argument('--summary', dest='summary', help='eventalign summary filepath, the output from nanopolish.',required=True)
-    required.add_argument('--out_dir', dest='out_dir', help='output directory.',required=True)
-
-    # Optional
-    # Use ensembl db
-    optional.add_argument('--ensembl', dest='ensembl', help='ensembl version for gene-transcript mapping.',type=int, default=91)
-    optional.add_argument('--species', dest='species', help='species for ensembl gene-transcript mapping.', default='homo_sapiens')
-
-    # Use customised db
-    # These arguments will be passed to Genome from pyensembl
-    optional.add_argument('--customised_genome', dest='customised_genome', help='if customised genome provided.',default=False,action='store_true')
-    optional.add_argument('--reference_name', dest='reference_name', help='fasta reference name.',type=str)
-    optional.add_argument('--annotation_name', dest='annotation_name', help='gtf annotation name.',type=str)
-    optional.add_argument('--gtf_path_or_url', dest='gtf_path_or_url', help='gtf file path or url.',type=str)
-    optional.add_argument('--transcript_fasta_paths_or_urls', dest='transcript_fasta_paths_or_urls', help='transcript fasta paths or urls.',type=str)
-
-    optional.add_argument('--skip_eventalign_indexing', dest='skip_eventalign_indexing', help='skip indexing the eventalign nanopolish output.',default=False,action='store_true')
-
-    # parser.add_argument('--features', dest='features', help='Signal features to extract.',type=list,default=['norm_mean'])
-    optional.add_argument('--genome', dest='genome', help='to run on Genomic coordinates. Without this argument, the program will run on transcriptomic coordinates',default=False,action='store_true') 
-    optional.add_argument('--n_processes', dest='n_processes', help='number of processes to run.',type=int, default=1)
-    optional.add_argument('--chunk_size', dest='chunk_size', help='number of lines from nanopolish eventalign.txt for processing.',type=int, default=1000000)
-    optional.add_argument('--readcount_min', dest='readcount_min', help='minimum read counts per gene.',type=int, default=1)
-    optional.add_argument('--readcount_max', dest='readcount_max', help='maximum read counts per gene.',type=int, default=1000)
-    optional.add_argument('--resume', dest='resume', help='with this argument, the program will resume from the previous run.',default=False,action='store_true') #todo
-
-    parser._action_groups.append(optional)
-    return parser.parse_args()
-
 def index(eventalign_result,pos_start,out_paths,locks):
     eventalign_result = eventalign_result.set_index(['contig','read_index'])
     pos_end=pos_start
@@ -66,7 +26,7 @@ def index(eventalign_result,pos_start,out_paths,locks):
                 pass
             pos_start = pos_end
 
-def parallel_index(eventalign_filepath,summary_filepath,chunk_size,out_dir,n_processes,resume):
+def parallel_index(eventalign_filepath,chunk_size,out_dir,n_processes,resume):
     # Create output paths and locks.
     out_paths,locks = dict(),dict()
     for out_filetype in ['index']:
@@ -118,25 +78,32 @@ def parallel_index(eventalign_filepath,summary_filepath,chunk_size,out_dir,n_pro
     # Wait for all of the tasks to finish.
     task_queue.join()
     
-def t2g(gene_id,ensembl,g2t_mapping,df_eventalign_index,readcount_min):
+def t2g(gene_id,fasta_dict,annotation_dict,g2t_mapping,df_eventalign_index,readcount_min):
     tx_ids = []
     t2g_dict = {}
-    transcripts = [tx for tx in ensembl.gene_by_id(gene_id).transcripts if tx.id in g2t_mapping[gene_id]]
-    n_reads = sum([len(df_eventalign_index.loc[tx.id]) for tx in transcripts])
+    transcripts = [tx for tx in annotation_dict if tx in g2t_mapping[gene_id]]
+    n_reads = sum([len(df_eventalign_index.loc[tx]) for tx in transcripts])
     if n_reads >= readcount_min:
         for tx in transcripts:
-            tx_seq = ensembl.transcript_sequence(tx.id)
+            tx_seq = fasta_dict[tx][0]
+            tx_contig = annotation_dict[tx]['chr']
             if tx_seq is None:
                 continue
-            for interval in tx.exon_intervals:
-                for g_pos in range(interval[0],interval[1]+1): # Exclude the rims of exons.
-                    tx_pos = tx.spliced_offset(g_pos)
-                    if (interval[0] <= g_pos < interval[0]+2) or (interval[1]-2 < g_pos <= interval[1]): # Todo: To improve the mapping
+            for exon_num in range(len(annotation_dict[tx]['exon'])):
+                g_interval=annotation_dict[tx]['exon'][exon_num]
+                tx_interval=annotation_dict[tx]['tx_exon'][exon_num]
+                for g_pos in range(g_interval[0],g_interval[1]+1): # Exclude the rims of exons.
+                    dis_from_start = g_pos - g_interval[0]
+                    if annotation_dict[tx]['strand'] == "+":
+                        tx_pos = tx_interval[0] + dis_from_start
+                    elif annotation_dict[tx]['strand'] == "-":
+                        tx_pos = tx_interval[1] - dis_from_start
+                    if (g_interval[0] <= g_pos < g_interval[0]+2) or (g_interval[1]-2 < g_pos <= g_interval[1]): # Todo: To improve the mapping
                         kmer = 'XXXXX'
                     else:
                         kmer = tx_seq[tx_pos-2:tx_pos+3]
-                    t2g_dict[(tx.id,tx_pos)] = (tx.contig,gene_id,g_pos,kmer) # tx.contig is chromosome.
-            tx_ids += [tx.id]
+                    t2g_dict[(tx,tx_pos)] = (tx_contig,gene_id,g_pos,kmer) # tx.contig is chromosome.
+            tx_ids += [tx]
 
                 
     return n_reads, tx_ids, t2g_dict
@@ -168,7 +135,8 @@ def combine(events_str):
         eventalign_result.reset_index(inplace=True)
 
 
-        eventalign_result['transcript_id'] = [contig.split('.')[0] for contig in eventalign_result['contig']]    #### CHANGE MADE ####
+#        eventalign_result['transcript_id'] = [contig.split('.')[0] for contig in eventalign_result['contig']]    #### CHANGE MADE ####
+        eventalign_result['transcript_id'] = [contig for contig in eventalign_result['contig']]
         #eventalign_result['transcript_id'] = eventalign_result['contig']
 
         eventalign_result['transcriptomic_position'] = pd.to_numeric(eventalign_result['position']) + 2 # the middle position of 5-mers.
@@ -189,7 +157,106 @@ def combine(events_str):
         np_events = np.rec.fromrecords(df_events, names=[*df_events])
         return np_events
 
-def parallel_preprocess_gene(eventalign_filepath,ensembl,out_dir,n_processes,readcount_min,readcount_max,resume):
+def readFasta(transcript_fasta,is_gff):
+    fasta=open(transcript_fasta,"r")
+    entries,separate_by_pipe="",False
+    for ln in fasta:
+        entries+=ln
+    entries=entries.split(">")
+    if len(entries[1].split("|"))>1:
+        separate_by_pipe=True
+    dict={}
+    for entry in entries:
+        entry=entry.split("\n")
+        if len(entry[0].split()) > 0:
+            id=entry[0].split('.')[0]
+            seq="".join(entry[1:])
+            dict[id]=[seq]
+            if is_gff > 0:
+                if separate_by_pipe == True:
+                    g_id=info[1],split(".")[0]
+                else:
+                    g_id=entry[0].split("gene:")[1].split(".")[0]
+                dict[id].append(g_id)
+    return dict
+
+def readAnnotation(gtf_or_gff):
+    gtf=open(gtf_or_gff,"r")
+    dict,is_gff={},0
+    for ln in gtf:
+        if not ln.startswith("#"):
+            ln=ln.strip("\n").split("\t")
+            if is_gff == 0:
+                if ln[-1].startswith("ID") or ln[-1].startswith("Parent"):
+                    is_gff = 1
+                else:
+                    is_gff = -1
+            if is_gff < 0:
+                if ln[2] == "transcript" or ln[2] == "exon":
+                    chr,type,start,end=ln[0],ln[2],int(ln[3]),int(ln[4])
+                    attrList=ln[-1].split(";")
+                    attrDict={}
+                    for k in attrList:
+                        p=k.strip().split(" ")
+                        if len(p) == 2:
+                            attrDict[p[0]]=p[1].strip('\"')
+                    ##tx_id=ln[-1].split('; transcript_id "')[1].split('";')[0]
+                    ##g_id=ln[-1].split('gene_id "')[1].split('";')[0]
+                    tx_id = attrDict["transcript_id"]
+                    g_id = attrDict["gene_id"]
+                    if tx_id not in dict:
+                        dict[tx_id]={'chr':chr,'g_id':g_id,'strand':ln[6]}
+                        if type not in dict[tx_id]:
+                            if type == "transcript":
+                                dict[tx_id][type]=(start,end)
+                    else:
+                        if type == "exon":
+                            if type not in dict[tx_id]:
+                                dict[tx_id][type]=[(start,end)]
+                            else:
+                                dict[tx_id][type].append((start,end))
+            if is_gff > 0:
+                if ln[2] == "exon" or ln[2] == "mRNA":
+                    chr,type,start,end=ln[0],ln[2],int(ln[3]),int(ln[4])
+                    tx_id=ln[-1].split('transcript:')[1].split(';')[0]
+                    if ln[2] == "mRNA":
+                        type="transcript"
+                    if tx_id not in dict:
+                        dict[tx_id]={'chr':chr,'strand':ln[6]}
+                        if type == "transcript":
+                            dict[tx_id][type]=(start,end)
+                        if type == "exon":
+                            dict[tx_id][type]=[(start,end)]
+                    else:
+                        if type == "transcript" and type not in dict[tx_id]:
+                            dict[tx_id][type]=(start,end)
+                        if type == "exon":
+                            if type not in dict[tx_id]:
+                                dict[tx_id][type]=[(start,end)]
+                            else:
+                                dict[tx_id][type].append((start,end))
+    #convert genomic positions to tx positions
+    if is_gff < 0:
+        for id in dict:
+            tx_pos,tx_start=[],0
+            for pair in dict[id]["exon"]:
+                tx_end=pair[1]-pair[0]+tx_start
+                tx_pos.append((tx_start,tx_end))
+                tx_start=tx_end+1
+            dict[id]['tx_exon']=tx_pos
+    else:
+        for id in dict:
+            tx_pos,tx_start=[],0
+            if dict[id]["strand"] == "-":
+                dict[id]["exon"].sort(key=lambda tup: tup[0], reverse=True)
+            for pair in dict[id]["exon"]:
+                tx_end=pair[1]-pair[0]+tx_start
+                tx_pos.append((tx_start,tx_end))
+                tx_start=tx_end+1
+            dict[id]['tx_exon']=tx_pos
+    return (dict,is_gff)
+
+def parallel_preprocess_gene(eventalign_filepath,fasta_dict,annotation_dict,is_gff,out_dir,n_processes,readcount_min,readcount_max,resume):
     
     # Create output paths and locks.
     out_paths,locks = dict(),dict()
@@ -224,22 +291,29 @@ def parallel_preprocess_gene(eventalign_filepath,ensembl,out_dir,n_processes,rea
     # Get all gene ids and create a dict of eventalign.combine index.
 #     gene_ids = set()
 
-    
+    if is_gff > 0: ##add g_id from fasta dict entry if gff annotation is used
+        for tx_id in annotation_dict:
+            try:
+                annotation_dict[tx_id]['g_id']=fasta_dict[tx_id][1]
+            except KeyError:
+                continue
+
     df_eventalign_index = pd.read_csv(os.path.join(out_dir,'eventalign.index'))
     df_eventalign_index['transcript_id'] = [tx_id.split('.')[0] for tx_id in  df_eventalign_index['transcript_id']]
+#    df_eventalign_index['transcript_id'] = [tx_id for tx_id in  df_eventalign_index['transcript_id']]
     df_eventalign_index.set_index('transcript_id',inplace=True)
     g2t_mapping = defaultdict(list)
 
     for tx_id in set(df_eventalign_index.index):
         try:
-            g_id = ensembl.transcript_by_id(tx_id).gene_id 
-        except ValueError:
+##           g_id = ensembl.transcript_by_id(tx_id).gene_id 
+            g_id = annotation_dict[tx_id]['g_id'] 
+        except KeyError:
             continue
         else:
 #             gene_ids = gene_ids.union([g_id])
             g2t_mapping[g_id] += [tx_id]
 
-        
 #     f = open(os.path.join(out_dir,'eventalign.index'))
 #     for ln in f:
 #         tx_id,read_index,pos_start,pos_end = ln.split(',')
@@ -265,7 +339,7 @@ def parallel_preprocess_gene(eventalign_filepath,ensembl,out_dir,n_processes,rea
                 continue
             # mapping a gene <-> transcripts
 
-            n_reads, tx_ids, t2g_mapping = t2g(gene_id,ensembl,g2t_mapping,df_eventalign_index,readcount_min)
+            n_reads, tx_ids, t2g_mapping = t2g(gene_id,fasta_dict,annotation_dict,g2t_mapping,df_eventalign_index,readcount_min)
             #
             if n_reads >= readcount_min: 
                 data_dict = dict()
@@ -340,7 +414,7 @@ def preprocess_gene(gene_id,data_dict,t2g_mapping,out_paths,locks):
 #         if len(events_per_read) > 0:
         # ===== transcript to gene coordinates ===== # TODO: to use gtf.
 #        tx_ids = [tx_id.decode('UTF-8').split('.')[0] for tx_id in events_per_read['transcript_id']]
-        tx_ids = [tx_id for tx_id in events_per_read['transcript_id']] 
+        tx_ids = [tx_id.split('.')[0] for tx_id in events_per_read['transcript_id']] 
         tx_positions = events_per_read['transcriptomic_position']
         genomic_coordinate = list(itemgetter(*zip(tx_ids,tx_positions))(t2g_mapping)) # genomic_coordinates -- np structured array of 'chr','gene_id','genomic_position','kmer'
         genomic_coordinate = np.array(genomic_coordinate,dtype=np.dtype([('chr','<U2'),('gene_id','<U15'),('genomic_position','<i4'),('g_kmer','<U5')]))
@@ -396,6 +470,7 @@ def preprocess_gene(gene_id,data_dict,t2g_mapping,out_paths,locks):
             
         try:
             assert len(set(g_kmer_array)) == 1
+            assert list(set(g_kmer_array))[0].count('N') == 0 ##to weed out the mapped kmers from tx_seq that contain 'N', which is not in diffmod's model_kmer
             assert {position} == set(g_positions_array)
         except:
             asserted = False
@@ -459,7 +534,8 @@ def parallel_preprocess_tx(eventalign_filepath,out_dir,n_processes,readcount_min
     # Load tasks into task_queue.
     tx_ids_processed = []
     df_eventalign_index = pd.read_csv(os.path.join(out_dir,'eventalign.index'))
-    df_eventalign_index['transcript_id'] = [tx_id.split('.')[0] for tx_id in  df_eventalign_index['transcript_id']]
+#    df_eventalign_index['transcript_id'] = [tx_id.split('.')[0] for tx_id in  df_eventalign_index['transcript_id']]
+#    df_eventalign_index['transcript_id'] = [tx_id for tx_id in  df_eventalign_index['transcript_id']]
     tx_ids = df_eventalign_index['transcript_id'].values.tolist()
     tx_ids = list(dict.fromkeys(tx_ids))
     df_eventalign_index.set_index('transcript_id',inplace=True)
@@ -472,7 +548,7 @@ def parallel_preprocess_tx(eventalign_filepath,out_dir,n_processes,readcount_min
                 eventalign_result.seek(pos_start,0)
                 events_str = eventalign_result.read(pos_end-pos_start)
                 data = combine(events_str)
-                if data.size > 1:
+                if (data is not None) and (data.size > 1):
                     data_dict[read_index] = data
                 readcount += 1 
                 if readcount > readcount_max:
@@ -553,6 +629,7 @@ def preprocess_tx(tx_id,data_dict,out_paths,locks):
             
         try:
             assert len(set(reference_kmer_array)) == 1
+            assert list(set(reference_kmer_array))[0].count('N') == 0 ##to weed out the mapped kmers from tx_seq that contain 'N', which is not in diffmod's model_kmer
         except:
             asserted = False
             break
@@ -609,59 +686,70 @@ def preprocess_tx(tx_id,data_dict,out_paths,locks):
 #                     assert row_eventalign['read_index'] == read_index 
 #                     eventalign_per_read = [row_eventalign]
 
-
-
-def main():
-    args = get_args()
+#def check_gene_tx_id_version(gtf_or_gff):
+#    gtf=open(gtf_or_gff,"r")
+#    extra_version_fields=0
+#    for i in range(25):
+#        ln=gtf.readline().split('\t')
+#        if not ln[0].startswith('#'):
+#            if ln[2] == "transcript" or ln[2] == "exon":
+#                check_transcript_version = len(ln[-1].split('transcript_version')) == 2
+#                check_gene_version = len(ln[-1].split('gene_version')) == 2
+#                if check_transcript_version and check_gene_version:
+#                   extra_version_fields+=1
+#    if extra_version_fields > 0:
+#        return True
+#    else:
+#        return False
+ 
+#def mergeGTFtxIDversion(gtf_or_gff,out_dir):
+#    gtf=open(gtf_or_gff,"r")
+#    new_gtf_path=os.path.join(out_dir,'transcript_id_version_merged.gtf')
+#    new_gtf=open(new_gtf_path,"w")
+#    for ln in gtf:
+#        if not ln.startswith("#"):
+#            ln=ln.split("\t")
+#            if ln[2] == "transcript" or ln[2] == "exon":
+#                g_id=ln[-1].split('gene_id "')[1].split('";')[0]
+#                g_ver=ln[-1].split('; gene_version "')[1].split('";')[0]
+#                tx_id=ln[-1].split('; transcript_id "')[1].split('";')[0]
+#                tx_ver=ln[-1].split('; transcript_version "')[1].split('";')[0]
+#                new_gtf.write('\t'.join(ln[:-1])+'\t'+'gene_id "'+g_id+'.'+g_ver+'"; transcript_id "'+tx_id+'.'+tx_ver+'";'+'\n')
+#    new_gtf.close()
+#    return new_gtf_path
+
+def dataprep(args):
     #
     n_processes = args.n_processes        
     eventalign_filepath = args.eventalign
-    summary_filepath = args.summary
     chunk_size = args.chunk_size
     out_dir = args.out_dir
-    ensembl_version = args.ensembl
-    ensembl_species = args.species
     readcount_min = args.readcount_min    
     readcount_max = args.readcount_max
     resume = args.resume
     genome = args.genome
 
-    customised_genome = args.customised_genome
-    if customised_genome and (None in [args.reference_name,args.annotation_name,args.gtf_path_or_url,args.transcript_fasta_paths_or_urls]):
-        print('If you have your own customised genome not in Ensembl, please provide the following')
-        print('- reference_name')
-        print('- annotation_name')
-        print('- gtf_path_or_url')
-        print('- transcript_fasta_paths_or_urls')
+    if genome and (None in [args.gtf_or_gff,args.transcript_fasta]):
+        print('please provide the following')
+        print('- gtf_or_gff')
+        print('- transcript_fasta')
     else:
-        reference_name = args.reference_name
-        annotation_name = args.annotation_name
-        gtf_path_or_url = args.gtf_path_or_url
-        transcript_fasta_paths_or_urls = args.transcript_fasta_paths_or_urls
+        gtf_or_gff = args.gtf_or_gff
+        transcript_fasta = args.transcript_fasta
         
     misc.makedirs(out_dir) #todo: check every level.
     
     # (1) For each read, combine multiple events aligned to the same positions, the results from nanopolish eventalign, into a single event per position.
     if not args.skip_eventalign_indexing:
-        parallel_index(eventalign_filepath,summary_filepath,chunk_size,out_dir,n_processes,resume)
+        parallel_index(eventalign_filepath,chunk_size,out_dir,n_processes,resume)
     
     # (2) Create a .json file, where the info of all reads are stored per position, for modelling.
     if genome:
-        if customised_genome:
-            db = Genome(
-                reference_name=reference_name,
-                annotation_name=annotation_name,
-                gtf_path_or_url=gtf_path_or_url,
-                transcript_fasta_paths_or_urls=transcript_fasta_paths_or_urls
-            )
-            # parse GTF and construct database of genomic features
-            db.index()
-        else:
-            db = EnsemblRelease(ensembl_version,ensembl_species) # Default: human reference genome GRCh38 release 91 used in the ont mapping.    
-        parallel_preprocess_gene(eventalign_filepath,db,out_dir,n_processes,readcount_min,readcount_max,resume)
-
+#        merge_transcript_id_version = check_gene_tx_id_version(gtf_or_gff)
+#        if merge_transcript_id_version:
+#            gtf_or_gff = mergeGTFtxIDversion(gtf_or_gff,out_dir)
+        annotation_dict,is_gff = readAnnotation(gtf_or_gff)
+        fasta_dict = readFasta(transcript_fasta,is_gff)
+        parallel_preprocess_gene(eventalign_filepath,fasta_dict,annotation_dict,is_gff,out_dir,n_processes,readcount_min,readcount_max,resume)
     else:
         parallel_preprocess_tx(eventalign_filepath,out_dir,n_processes,readcount_min,readcount_max,resume)
-
-if __name__ == '__main__':
-    main()


=====================================
xpore/scripts/diffmod.py
=====================================
@@ -1,4 +1,3 @@
-import argparse
 import numpy as np
 import pandas
 import os
@@ -12,25 +11,6 @@ from ..diffmod.configurator import Configurator
 from ..diffmod.gmm import GMM
 from ..diffmod import io
 from ..diffmod.statstest import StatsTest
-
-def get_args():
-    parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
-
-    optional = parser._action_groups.pop()
-    required = parser.add_argument_group('required arguments')
-    
-    # Required arguments
-    required.add_argument('--config', dest='config', help='yaml configuraion filepath.',required=True)
-
-    # Optional arguments
-    optional.add_argument('--n_processes', dest='n_processes', help='number of processes to run.',type=int,default=1)
-    optional.add_argument('--save_models', dest='save_models', help='with this argument, the program will save the model parameters for each id.',default=False,action='store_true') # todo
-    optional.add_argument('--resume', dest='resume', help='with this argument, the program will resume from the previous run.',default=False,action='store_true') 
-    
-    optional.add_argument('--ids', dest='ids', help='gene / transcript ids to model.',default=[],nargs='*')
-
-    parser._action_groups.append(optional)
-    return parser.parse_args()
         
 def execute(idx, data_dict, data_info, method, criteria, model_kmer, prior_params, out_paths, save_models,locks):
     """
@@ -96,8 +76,7 @@ def execute(idx, data_dict, data_info, method, criteria, model_kmer, prior_param
     with locks['log'], open(out_paths['log'],'a') as f:
         f.write(idx + '\n')
                         
-def main():
-    args = get_args()
+def diffmod(args):
     
     n_processes = args.n_processes       
     config_filepath = args.config
@@ -205,16 +184,6 @@ def main():
     # Close data files
     for f in f_data.values():
         f.close()   
-        
+
     with open(out_paths['log'],'a+') as f:
         f.write(helper.decor_message('successfully finished'))
-        
-
-if __name__ == '__main__':
-    """
-    Usage:
-        xpore-diffmod --config CONFIG [--n_processes N_PROCESSES] \
-                     [--save_models] [--resume] \
-                     [--ids [IDS [IDS ...]]]
-    """
-    main()


=====================================
xpore/scripts/postprocessing.py
=====================================
@@ -0,0 +1,40 @@
+import os
+
+def run_postprocessing(diffmod_table_path,out_dir):
+    file=open(diffmod_table_path,"r")
+    header=file.readline()
+    entries=file.readlines()
+    outfile_path=os.path.join(out_dir,"majority_direction_kmer_diffmod.table")
+    outfile=open(outfile_path,"w")
+    outfile.write(header)
+    header=header.strip().split(',')
+    kmer_ind,dir_ind=header.index('kmer'),header.index('mod_assignment')    
+    dict={}
+    for ln in entries:
+        l=ln.strip().split(",")
+        if l[kmer_ind] not in dict:
+            dict[l[kmer_ind]]={l[dir_ind]:1}
+        else:
+            if l[dir_ind] not in dict[l[kmer_ind]]:
+                dict[l[kmer_ind]][l[dir_ind]]=1
+            else:
+                dict[l[kmer_ind]][l[dir_ind]]+=1
+    for k in dict:
+        if len(dict[k]) > 1:  ##consider one modification type per k-mer
+            if dict[k]['higher'] <= dict[k]['lower']: ##choose the majority
+                dict[k]['choose']='lower'
+            else:
+                dict[k]['choose']='higher'
+        else:
+            dict[k]['choose']=list(dict[k].keys())[0]
+    for ln in entries:
+        l=ln.strip().split(",")
+        if l[dir_ind] == dict[l[kmer_ind]]['choose']:
+            outfile.write(ln)
+    outfile.close()
+
+def postprocessing(args):
+    diffmod_dir = args.diffmod_dir
+    diffmod_table_path = os.path.join(diffmod_dir,"diffmod.table")
+    run_postprocessing(diffmod_table_path,diffmod_dir)
+


=====================================
xpore/scripts/xpore.py
=====================================
@@ -0,0 +1,70 @@
+import sys
+
+from .dataprep import dataprep
+from .diffmod import diffmod
+from .postprocessing import postprocessing
+
+def parse_options(argv):
+
+    """Parses options from the command line """
+
+    from argparse import ArgumentParser
+    from xpore import __version__
+
+    parser = ArgumentParser(prog='xpore')
+    subparsers = parser.add_subparsers(help='Running modes', metavar='{dataprep, diffmod, postprocessing}')
+    parser.add_argument('-v', '--version', action='version', version='%(prog)s {version}'.format(version=__version__))
+
+    ### RUN MODE "DATAPREP"
+    parser_dataprep = subparsers.add_parser('dataprep', help='run mode for preprocessing nanopolish eventalign.txt before differential modification analysis')
+    optional_dataprep = parser_dataprep._action_groups.pop()
+    required_dataprep = parser_dataprep.add_argument_group('required arguments')
+    # Required arguments
+    required_dataprep.add_argument('--eventalign', dest='eventalign', help='eventalign filepath, the output from nanopolish.',required=True)
+    ##required.add_argument('--summary', dest='summary', help='eventalign summary filepath, the output from nanopolish.',required=True)
+    required_dataprep.add_argument('--out_dir', dest='out_dir', help='output directory.',required=True)
+    optional_dataprep.add_argument('--gtf_or_gff', dest='gtf_or_gff', help='GTF or GFF file path.',type=str)
+    optional_dataprep.add_argument('--transcript_fasta', dest='transcript_fasta', help='transcript FASTA path.',type=str)
+    # Optional arguments
+    optional_dataprep.add_argument('--skip_eventalign_indexing', dest='skip_eventalign_indexing', help='skip indexing the eventalign nanopolish output.',default=False,action='store_true')
+    # parser.add_argument('--features', dest='features', help='Signal features to extract.',type=list,default=['norm_mean'])
+    optional_dataprep.add_argument('--genome', dest='genome', help='to run on Genomic coordinates. Without this argument, the program will run on transcriptomic coordinates',default=False,action='store_true') 
+    optional_dataprep.add_argument('--n_processes', dest='n_processes', help='number of processes to run.',type=int, default=1)
+    optional_dataprep.add_argument('--chunk_size', dest='chunk_size', help='number of lines from nanopolish eventalign.txt for processing.',type=int, default=1000000)
+    optional_dataprep.add_argument('--readcount_min', dest='readcount_min', help='minimum read counts per gene.',type=int, default=1)
+    optional_dataprep.add_argument('--readcount_max', dest='readcount_max', help='maximum read counts per gene.',type=int, default=1000)
+    optional_dataprep.add_argument('--resume', dest='resume', help='with this argument, the program will resume from the previous run.',default=False,action='store_true') #todo
+    parser_dataprep._action_groups.append(optional_dataprep)
+    parser_dataprep.set_defaults(func=dataprep)
+
+    ### RUN MODE "DIFFMOD"
+    parser_diffmod = subparsers.add_parser('diffmod', help='run mode for performing differential modification analysis')
+    optional_diffmod = parser_diffmod._action_groups.pop()
+    required_diffmod = parser_diffmod.add_argument_group('required arguments')
+    # Required arguments
+    required_diffmod.add_argument('--config', dest='config', help='YAML configuraion filepath.',required=True)
+    # Optional arguments
+    optional_diffmod.add_argument('--n_processes', dest='n_processes', help='number of processes to run.',type=int,default=1)
+    optional_diffmod.add_argument('--save_models', dest='save_models', help='with this argument, the program will save the model parameters for each id.',default=False,action='store_true') # todo
+    optional_diffmod.add_argument('--resume', dest='resume', help='with this argument, the program will resume from the previous run.',default=False,action='store_true') 
+    optional_diffmod.add_argument('--ids', dest='ids', help='gene or transcript ids to model.',default=[],nargs='*')
+    parser_diffmod._action_groups.append(optional_diffmod)
+    parser_diffmod.set_defaults(func=diffmod)
+
+    ### RUN MODE "POSTPROCESSING"
+    parser_postprocessing = subparsers.add_parser('postprocessing', help='run mode for post-processing diffmod.table, the result table from differential modification analysis.')
+    required_postprocessing = parser_postprocessing.add_argument_group('required arguments')
+    # Required arguments
+    required_postprocessing.add_argument('--diffmod_dir', dest='diffmod_dir', help='diffmod directory path, the output from xpore-diffmod.',required=True)
+    parser_postprocessing.set_defaults(func=postprocessing)
+
+    return parser.parse_args(argv[1:])
+
+def main(argv=sys.argv):
+
+    ### get command line options
+    options = parse_options(argv)
+    options.func(options)
+
+if __name__ == "__main__":
+    main(sys.argv)



View it on GitLab: https://salsa.debian.org/med-team/xpore/-/compare/45d62e12569b4d75d74eaae3000b4724c58b6dee...bea8751ee62ee65ba9720fbfce3f73d3ea59108c

-- 
View it on GitLab: https://salsa.debian.org/med-team/xpore/-/compare/45d62e12569b4d75d74eaae3000b4724c58b6dee...bea8751ee62ee65ba9720fbfce3f73d3ea59108c
You're receiving this email because of your account on salsa.debian.org.


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/debian-med-commit/attachments/20211019/d40c426a/attachment-0001.htm>


More information about the debian-med-commit mailing list