
Blasted Bioinformatics!?

Bioinformatics lessons learned the hard way, bugs, gripes, and maybe topical paper reviews too...
Biological Sciences
Published
Author Peter Cock

Yesterday I attended the annual "Potatoes in Practice" meeting for the first time, mainly to see the finished display which I helped produce. Here it is, showing the twelve chromosomes of potato, drawn as stylized uniform green 'X' shapes, with differently coloured LEDs marking traits of interest for potato breeding.

Biological Sciences
Published
Author Peter Cock

Mark-up languages allow you to write a plain text input file which is then processed to produce nicely formatted output as HTML, PDF or similar. Because the input is plain text, it is perfect for tracking under version control software, something richer formats are poorly suited to. So what's the current best markup choice for a Python project?

Compression, Biological Sciences
Published
Author Peter Cock

I've written about random access to the blocked GZIP variant BGZF used in BAM, and looked at random access to BZIP2, but here I'm looking at XZ files which are based on LZMA compression. This was prompted by the release of Python 3.3 which includes the lzma module to support XZ files, which I then back-ported to offer lzma for Python 2.6 or later.
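As a rough illustration of the `lzma` module mentioned above, here is a minimal sketch that writes a small XZ file and reads it back line by line. The helper names (`write_xz`, `read_xz_lines`, `demo_round_trip`) and the example file name are made up for this sketch, and it only covers plain sequential reading, not random access.

```python
import lzma
import os
import tempfile


def write_xz(path, text):
    """Compress a text string into an XZ file (hypothetical helper)."""
    with lzma.open(path, "wt", encoding="utf-8") as handle:
        handle.write(text)


def read_xz_lines(path):
    """Yield decompressed lines one at a time (hypothetical helper)."""
    with lzma.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


def demo_round_trip():
    """Write a tiny FASTA file as XZ, then read it back sequentially."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.fasta.xz")  # illustrative name only
        write_xz(path, ">seq1\nACGT\n>seq2\nGGCC\n")
        return list(read_xz_lines(path))


if __name__ == "__main__":
    for line in demo_round_trip():
        print(line)
```

On Python 3.3 or later this uses only the standard library; with a back-port the import line may differ.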

BLAST, NCBI, Biological Sciences
Published
Author Peter Cock

The blastdbcmd tool in the BLAST+ suite (replacing fastacmd in the C 'legacy' BLAST suite) lets you do a lot of clever things with a BLAST database. As long as you follow the baroque NCBI FASTA naming scheme you can do this with local BLAST databases too. However, if you don't want to bow down to the NCBI naming (e.g. use FASTA files directly from your favourite assembler), then blastdbcmd seems needlessly crippled.

Biological Sciences
Published
Author Peter Cock

One of the first things a programmer dealing with 'Next Generation Sequencing' (NGS) aka 'High Throughput Sequencing' (HTSeq) data learns is to be very aware of memory limitations. You can't just go loading files into RAM when they are often gigabytes in size. Instead, where possible, you loop over a file (iterating over it record by record) or employ indexed random access. The authors of MrFast & MrsFast didn't do this.
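To make the "record by record" idea concrete, here is a minimal sketch of a FASTA iterator written as a Python generator, so only one record is held in memory at a time. The function names `iter_fasta` and `count_bases` are hypothetical, and the parser is deliberately simplistic (no validation, plain FASTA only); it says nothing about how MrFast or MrsFast actually read their input.

```python
import io


def iter_fasta(handle):
    """Yield (title, sequence) tuples one record at a time (hypothetical helper)."""
    title = None
    chunks = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header line closes off the previous record, if any
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:]
            chunks = []
        elif title is not None:
            chunks.append(line.strip())
    # Do not forget the final record in the file
    if title is not None:
        yield title, "".join(chunks)


def count_bases(handle):
    """Total sequence length, without ever holding the whole file in RAM."""
    total = 0
    for _title, seq in iter_fasta(handle):
        total += len(seq)
    return total


if __name__ == "__main__":
    example = io.StringIO(">r1\nACGT\nAC\n>r2\nGG\n")
    print(count_bases(example))
```

The same loop works unchanged on a real file handle from `open(...)`, since the handle is itself read lazily line by line.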

Compression, Biological Sciences
Published
Author Peter Cock

Today's release of Python 3.3 includes the lzma module in the standard library for LZMA/XZ files, but it didn't work 'out of the box' under Mac OS X 10.8 Mountain Lion - it requires XZ Utils. This is how I installed it. At the time of writing Python 3.3 doesn't come pre-compiled for Mac OS X 10.8 Mountain Lion (only older versions of OS X), so I built it from source (having installed Xcode via the App Store for free).

BLAST, NCBI, Biological Sciences
Published
Author Peter Cock

Have you ever tried to use a BLAST database of protein sequences containing stop codons? If you work on nice model organisms with solid gene annotation maybe not. However, with draft annotations, mutation studies, or read-through translation it is not unreasonable for the odd internal stop codon to appear in a protein sequence. And some translation pipelines do leave in a trailing * character.
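One way to see how common this is in your own data is to check the sequences before building a database. The sketch below uses a hypothetical helper, `clean_protein`, which trims any trailing `*` characters and counts the internal ones that remain. It is only a pre-processing illustration and makes no claim about how BLAST itself treats these characters.

```python
def clean_protein(seq):
    """Return (sequence without trailing '*', number of internal '*').

    Hypothetical helper: all trailing stop symbols are trimmed, and any
    stop symbols left inside the sequence are counted but not removed.
    """
    trimmed = seq.rstrip("*")
    return trimmed, trimmed.count("*")


if __name__ == "__main__":
    for protein in ["MKVLL*", "MKV*LL*", "MKVLL"]:
        trimmed, internal_stops = clean_protein(protein)
        print(protein, trimmed, internal_stops)
```

Leaving the internal stops in place is a deliberate choice here: whether to mask, split or discard such sequences depends on why the stop codon is there.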

BLAST, NCBI, Biological Sciences
Published
Author Peter Cock

This is an open letter to the NCBI BLAST+ team to request two simple enhancements which I think would be extremely useful - first and foremost the option to include BLAST result descriptions in the tabular output. Having the taxonomic identifiers (if available) would be great too - allowing downstream filtering of BLAST results by species etc.