
Blasted Bioinformatics!?

Bioinformatics lessons learned the hard way, bugs, gripes, and maybe topical paper reviews too...
BLAST, NCBI, Biology, English

We noticed a major memory problem running NCBI BLAST+ with XML output against a FASTA subject: it consumed large amounts of swap space and was then killed. This doesn't happen with tabular output, nor when using a BLAST database. [Figure: Memory and CPU usage (until killed by OS)] The screenshots are from the cluster monitoring tool Ganglia; the horizontal red lines show the machine's physical limits (here 4 CPUs and 8 GB of RAM). This example is using BLAST+
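To make the three configurations being compared concrete, here is a minimal sketch that only builds the command lines and does not run BLAST. The choice of blastn and the file names are assumptions made for illustration; the teaser doesn't say which BLAST+ program or inputs were used. The -query, -subject, -db, -outfmt and -out options are standard BLAST+ options, with -outfmt 5 giving XML and -outfmt 6 giving tabular output.

```python
from typing import List, Optional


def blast_command(query: str, out: str, outfmt: int,
                  subject: Optional[str] = None,
                  db: Optional[str] = None,
                  program: str = "blastn") -> List[str]:
    """Build a BLAST+ argument list (assumed program: blastn).

    Exactly one of subject (a FASTA file) or db (a BLAST database
    name) must be given.
    """
    if (subject is None) == (db is None):
        raise ValueError("Give exactly one of subject or db")
    cmd = [program, "-query", query, "-outfmt", str(outfmt), "-out", out]
    if subject is not None:
        cmd += ["-subject", subject]
    else:
        cmd += ["-db", db]
    return cmd


# Hypothetical file names, for illustration only.
xml_vs_fasta = blast_command("query.fasta", "hits.xml", 5, subject="subject.fasta")
tab_vs_fasta = blast_command("query.fasta", "hits.tsv", 6, subject="subject.fasta")
xml_vs_db = blast_command("query.fasta", "hits.xml", 5, db="subject_db")
```

Going by the description above, only the first of these combinations (XML output with a FASTA subject) would be expected to show the memory problem.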

NCBI, Biology, English

I recently stumbled on a problem in NCBI Entrez with the GenBank (with parts) return type. Some GenBank files don't actually contain a sequence at the end - instead they have a CONTIG section telling you how to construct the sequence from other referenced pieces. That's often inconvenient, so the NCBI have the handy option of downloading the record with all these parts pre-computed, which normally is great.
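As a rough illustration of the two return types, the sketch below builds (but does not fetch) EFetch URLs for the plain GenBank format and the "with parts" variant. The accession is a hypothetical placeholder, and picking the nuccore database is an assumption; rettype values "gb" and "gbwithparts" are the ones documented for NCBI E-utilities.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_genbank_url(accession: str, with_parts: bool = True) -> str:
    """Return an EFetch URL for a GenBank record.

    with_parts=True asks for the full sequence to be included even
    when the plain record only has a CONTIG section.
    """
    params = {
        "db": "nuccore",  # assumed database for nucleotide records
        "id": accession,
        "rettype": "gbwithparts" if with_parts else "gb",
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


# "EXAMPLE_ACCESSION" is a placeholder, not a real record.
print(efetch_genbank_url("EXAMPLE_ACCESSION"))
print(efetch_genbank_url("EXAMPLE_ACCESSION", with_parts=False))
```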

Compression, SAM/BAM, Biology, English

CRAM 0.7 was released earlier this month, and includes support for storing arbitrary read tags - a key requirement for it to be evaluated in existing pipelines as a BAM alternative. However, it doesn't preserve read names - which is a compression trick you can also do with plain BAM. Vadim kindly released some sample data with BAM -> CRAM 0.7 -> BAM.

Compression, SAM/BAM, Biology, English

In some respects the SAM/BAM specification is quite loose, in that there is more than one way to represent a given piece of information. We can take advantage of this to reduce the size on disk of mapped reads which match the reference sequence, while still maintaining conformance within the spec.
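One alternative representation the SAM specification does allow is writing "=" in the SEQ field for a base identical to the reference. Whether that is the exact trick the full post uses isn't clear from this teaser, so treat the following as a sketch under that assumption. It also assumes a simple ungapped alignment, where the read and the reference segment have the same length; the function name is made up for illustration.

```python
def mask_matching_bases(read_seq: str, ref_seq: str) -> str:
    """Replace read bases that equal the reference with '='.

    Assumes an ungapped alignment: read_seq[i] is aligned to
    ref_seq[i]. Mismatching bases are kept as they are.
    """
    if len(read_seq) != len(ref_seq):
        raise ValueError("Sketch only handles ungapped, equal-length alignments")
    return "".join(
        "=" if read_base.upper() == ref_base.upper() else read_base
        for read_base, ref_base in zip(read_seq, ref_seq)
    )


print(mask_matching_bases("ACGTAC", "ACGTAC"))  # perfect match
print(mask_matching_bases("ACGTAC", "ACCTAC"))  # one mismatch kept
```

A read that matches the reference perfectly turns into a run of identical characters, which a general purpose compressor should handle much better than four-letter sequence.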

Ion Torrent, Biology, English

Good news - the Ion Torrent Suite is now freely available open source software on GitHub under the GPL v2 licence, as promised late last year. There is now something more substantial behind talk of Ion Torrent "democratising sequencing", and a clear advantage over the closed source tools of rival companies. I commend them!

Ion Torrent, Biology, English

I'm a bit behind the curve here (see Lex's blog post from July 2011), but I was amused to find out that Ion Torrent call their current nucleotide flow order TACGTACGTCTGAGCATCGATCGATGTACAGC the "Samba". Apparently the idea is to avoid reads going out of phase, which could happen with the traditional repeated flow TACG (still used by Roche 454), by giving molecules which missed a base a chance to catch up; for Ion Torrent this works better.
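To see what a flow order means in practice, here is a toy, error-free simulation that turns a sequence into an idealised flowgram (the number of bases incorporated at each flow) under either flow order. It doesn't model phasing or missed incorporations at all, so it can't demonstrate the catch-up benefit; it only shows how the same sequence is read out under "TACG" versus the Samba order. The function name and the max_flows guard are made up for this sketch.

```python
from itertools import cycle
from typing import List

TRADITIONAL = "TACG"
SAMBA = "TACGTACGTCTGAGCATCGATCGATGTACAGC"


def ideal_flowgram(sequence: str, flow_order: str, max_flows: int = 10000) -> List[int]:
    """Return bases incorporated per flow for an error-free read.

    The flow order is repeated cyclically until the whole sequence
    has been consumed. max_flows guards against looping forever on
    characters (such as N) that no flow can ever match.
    """
    signals = []
    pos = 0
    for flow_index, base in enumerate(cycle(flow_order)):
        if pos >= len(sequence):
            break
        if flow_index >= max_flows:
            raise ValueError("Sequence not consumed within max_flows")
        run = 0
        # A single flow incorporates a whole homopolymer run.
        while pos < len(sequence) and sequence[pos] == base:
            run += 1
            pos += 1
        signals.append(run)
    return signals


print(ideal_flowgram("TTACG", TRADITIONAL))
print(len(ideal_flowgram("GATTACA", TRADITIONAL)), len(ideal_flowgram("GATTACA", SAMBA)))
```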

Biology, English

Most people will have seen a Gravatar user icon online, short for the rather grand sounding "Globally Recognized Avatar". For example GitHub.com and StackOverflow use them, and many blog platforms use them for user comments (sadly Blogger doesn't, yet). To get a user's icon, you construct a URL with the MD5 checksum of their email address - and if the user isn't registered you get a default image or a unique generated abstract icon.
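The URL construction can be sketched in a few lines of Python. Trimming and lower-casing the address before hashing follows Gravatar's documented convention; the size and the "identicon" fallback are just example choices, and the email address is a placeholder.

```python
import hashlib
from urllib.parse import urlencode


def gravatar_url(email: str, size: int = 80, default: str = "identicon") -> str:
    """Build a Gravatar image URL from an email address.

    The address is stripped of surrounding whitespace and lower-cased
    before taking the MD5 checksum. The 'd' parameter selects what is
    returned for unregistered addresses.
    """
    normalised = email.strip().lower()
    digest = hashlib.md5(normalised.encode("utf-8")).hexdigest()
    query = urlencode({"s": size, "d": default})
    return f"https://www.gravatar.com/avatar/{digest}?{query}"


# Placeholder address, for illustration only.
print(gravatar_url("someone@example.com"))
```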

Biology, English
Author: Peter Cock

In my last post I looked at how the GZIP variant BGZF (Blocked GNU Zip Format, used in BAM files) allowed efficient random access to large compressed files. This time I'm looking at bzip2 (bz2) which offers better compression than GZIP, but is also block based so in theory the same random access strategy can be employed.
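As a first step towards that kind of random access, the sketch below looks for where compressed blocks start inside a bzip2 stream. Each block begins with the 48-bit magic number 0x314159265359, but blocks are not byte aligned, so every bit offset has to be checked. This is a naive illustration meant for small inputs only (its cost grows quadratically with file size), and the magic pattern could in principle also occur by chance inside compressed data, so hits should be treated as candidates. The function name is made up for this sketch.

```python
import bz2
from typing import List

BLOCK_MAGIC = 0x314159265359  # 48-bit marker at the start of each block
MAGIC_BITS = 48


def find_block_bit_offsets(data: bytes) -> List[int]:
    """Return bit offsets where the bzip2 block magic appears.

    Treats the whole stream as one big integer and slides a 48-bit
    window over it one bit at a time. Simple, but slow on large data.
    """
    total_bits = len(data) * 8
    value = int.from_bytes(data, "big")
    mask = (1 << MAGIC_BITS) - 1
    offsets = []
    for bit in range(total_bits - MAGIC_BITS + 1):
        shift = total_bits - MAGIC_BITS - bit
        if (value >> shift) & mask == BLOCK_MAGIC:
            offsets.append(bit)
    return offsets


compressed = bz2.compress(b"Blasted Bioinformatics!? " * 200)
offsets = find_block_bit_offsets(compressed)
# The first block follows the 4-byte "BZh" + level header, i.e. bit 32.
print(offsets[0])
```

Actually decompressing from one of these offsets takes more work, because Python's bz2 module expects a complete, byte-aligned stream starting with its header.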