[Debian-med-packaging] Bug#950311: FastQC users attention: Please comment on the remaining failures! (Was: Bug#950311: fastqc: autopkgtest regression: debhelper bump moved files to different location)

tony mancill tmancill at debian.org
Mon Apr 6 20:19:17 BST 2020


On Sat, Feb 01, 2020 at 07:45:20AM +0100, Paul Gevers wrote:
> Hi,
> 
> On Fri, 31 Jan 2020 12:22:42 +0100 Andreas Tille <andreas at an3as.eu> wrote:
> 
> > I'd love if some users of fastqc could comment on the outcome of these
> > tests.
> 
> Me too. Because how it looks to me (Release Team member hat on) this
> test should actually not reward FastQC with the reduced age. Hence I
> have aged this version of fastqc to 10 days.

Hi Paul, hello Andreas,

I spent a while looking into this bug (and thereby took a crash course
in the Sequence Alignment/Map file format) and I am convinced that there
is no issue with fastqc itself.  The "FAIL" output in the summary indicates
that FastQC is finding issues with the data quality in the toy.sam and
toy.bam datasets.  And since the latter is merely the binary version of
the former, the quality issues detected are the same.

From the upstream README [1]:

> FastQC is an application which takes a FastQ file and runs a series
> of tests on it to generate a comprehensive QC report.  This will
> tell you if there is anything unusual about your sequence.  Each
> test is flagged as a pass, warning or fail depending on how far it
> departs from what you'd expect from a normal large dataset with no
> significant biases.  It's important to stress that warnings or even
> failures do not necessarily mean that there is a problem with your
> data, only that it is unusual.  It is possible that the biological
> nature of your sample means that you would expect this particular
> bias in your results.

In this tutorial [2], the warning is stronger:

> The output from FastQC, after analyzing a FASTQ file of sequence reads,
> is an html file that may be viewed in your browser.  The report contains
> one result section for each FastQC module.  In addition to the graphical
> or list data provided by each module, a flag of “Passed”, “Warn” or
> “Fail” is assigned.  Researchers should be very cautious about relying
> on these flags when assessing sequence data. The thresholds used to
> assign these flags are based on a very specific set of assumptions that
> are applicable to a very specific type of sequence data. Specifically,
> they are tuned for good quality whole genome shotgun DNA sequencing.
> They are less reliable with other types of sequencing, for example
> mRNA-Seq, small RNA-Seq, methyl-seq, targeted sequence capture and
> targeted amplicon sequencing.  Therefore, a module result that has a
> “Warn” or “Fail” flag does not necessarily mean that the sequence run
> failed.  “Warn” and “Fail” flags mean that the researcher must stop and
> consider what the results mean in the context of that particular sample
> and the type of sequencing that was run.

To prove this to myself, I ran FastQC against a number of different
datasets in the SAM and BAM formats that I found online (for example
in this Galaxy tutorial [3]) and against the SAM files found in the
picard-tools [4] sources, and it found issues with all of them.

So I think the tool is doing the right thing by outputting FAIL for these
files.  I propose that we update the test to ensure that a summary file is
produced, that it contains one line for each of the data quality tests,
and that each line starts with one of "PASS|WARN|FAIL" to indicate that
FastQC was able to run the test.  This will ensure that FastQC isn't
exiting unexpectedly while trying to read the file, etc.  Going further,
we could establish some known results and compare them to the output
during the build.
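
To make the proposal concrete, here is a minimal sketch of such a check.
It is only an illustration, not the actual autopkgtest: the function name
check_summary is made up, and it assumes that summary.txt holds one
tab-separated line per module in the form STATUS, module name, input file
name.  The expected line count is left as a parameter because the number
of modules may differ between FastQC versions and input types.

```python
import re

# Assumption: each summary line is "STATUS<TAB>module name<TAB>input file".
LINE_RE = re.compile(r"^(PASS|WARN|FAIL)\t[^\t]+\t[^\t]+$")


def check_summary(text, expected_lines):
    """Return a list of problems; an empty list means the summary looks sane."""
    lines = text.splitlines()
    problems = []
    # A crash while reading the input would leave a short or missing summary.
    if len(lines) != expected_lines:
        problems.append(f"expected {expected_lines} lines, found {len(lines)}")
    # Every line must carry one of the three flags; which one does not matter.
    for number, line in enumerate(lines, start=1):
        if not LINE_RE.match(line):
            problems.append(f"line {number} is malformed: {line!r}")
    return problems
```

The test would then fail only when the returned list is non-empty, so a
FAIL flag on a module no longer counts as a test failure.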

For reference, good output looks like this:

$ cat GSM461177_untreat_paired_chr4_fastqc/summary.txt
PASS	Basic Statistics	GSM461177_untreat_paired_chr4.bam
PASS	Per base sequence quality	GSM461177_untreat_paired_chr4.bam
PASS	Per sequence quality scores	GSM461177_untreat_paired_chr4.bam
FAIL	Per base sequence content	GSM461177_untreat_paired_chr4.bam
PASS	Per sequence GC content	GSM461177_untreat_paired_chr4.bam
PASS	Per base N content	GSM461177_untreat_paired_chr4.bam
PASS	Sequence Length Distribution	GSM461177_untreat_paired_chr4.bam
WARN	Sequence Duplication Levels	GSM461177_untreat_paired_chr4.bam
WARN	Overrepresented sequences	GSM461177_untreat_paired_chr4.bam
PASS	Adapter Content	GSM461177_untreat_paired_chr4.bam

Bad output looks like a Java exception stack trace... :)
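
For the idea of comparing against known results, a rough sketch could
look like the following.  The helper names parse_summary and
diff_against_baseline are hypothetical, and the code assumes the same
three-column, tab-separated layout as in the summary above.  Where the
stored baseline would live in the package is left open.

```python
def parse_summary(text):
    """Map each module name to its PASS/WARN/FAIL flag."""
    statuses = {}
    for line in text.splitlines():
        # Assumption: exactly three tab-separated fields per line.
        status, module, _filename = line.split("\t")
        statuses[module] = status
    return statuses


def diff_against_baseline(actual, baseline):
    """Return {module: (expected, observed)} for every module that differs."""
    modules = sorted(set(actual) | set(baseline))
    return {
        module: (baseline.get(module), actual.get(module))
        for module in modules
        if baseline.get(module) != actual.get(module)
    }
```

A module that is missing on one side shows up with None in its place, so
a change in the set of modules would be reported as well as a changed flag.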

Cheers,
tony

[1] https://raw.githubusercontent.com/s-andrews/FastQC/master/README.txt
[2] https://rtsf.natsci.msu.edu/genomics/tech-notes/fastqc-tutorial-and-faq/
[3] https://galaxyproject.github.io/training-material/topics/sequence-analysis/tutorials/quality-control/tutorial.html
[4] https://salsa.debian.org/med-team/picard-tools/-/tree/master/testdata%2Fpicard%2Fsam