Sometimes using BLAST is frustrating. Today I'm writing about it returning different expectation values, and therefore different answers, depending on whether you use a FASTA subject file or a database made from that file.
We noticed a major memory problem running NCBI BLAST+ with XML output using a FASTA subject (consuming loads of swap space, then getting killed). This doesn't happen with tabular output, nor when using a BLAST database. [Figure: Memory and CPU usage (until killed by OS)] The screenshots are from the cluster monitoring tool Ganglia; the horizontal red lines show the machine's physical limits (here 4 CPUs and 8 GB of RAM). This example is using BLAST+
My last blog post was on a problem with missing NCBI GenBank feature location information for trans-spliced genes when using GenBank 'with parts'. That was fixed, but while using EFetch a colleague just stumbled over something possibly related where the location is given as just a question mark.
I recently stumbled on a problem in NCBI Entrez with the GenBank (with parts) return type. Some GenBank files don't actually contain a sequence at the end - instead they have a CONTIG section telling you how to construct the sequence from other referenced pieces. That's often inconvenient, so the NCBI have the handy option of downloading it with all these parts pre-computed, which normally is great.
CRAM 0.7 was released earlier this month, and includes support for storing arbitrary read tags - a key requirement for it to be evaluated in existing pipelines as a BAM alternative. However, it doesn't preserve read names - which is a compression trick you can also do with plain BAM. Vadim kindly released some sample data with BAM -> CRAM 0.7 -> BAM.
In some respects the SAM/BAM specification is quite loose, in that there is more than one way to represent a given piece of information. We can take advantage of this to reduce the size on disk of mapped reads which match the reference sequence, while still maintaining conformance within the spec.
Good news - the Ion Torrent Suite is now freely available open source software on GitHub under the GPL v2 licence, as promised late last year. There is now something more substantial behind talk of Ion Torrent "democratising sequencing", and a clear advantage over the closed source tools of rival companies. I commend them!
I'm a bit behind the curve here (see Lex's blog post from July 2011), but I was amused to find out Ion Torrent call their current nucleotide flow order TACGTACGTCTGAGCATCGATCGATGTACAGC the "Samba". Apparently the idea is to avoid reads going out of phase which could happen with the traditional repeated flow TACG (still used by Roche 454), by giving the molecules which missed a base a chance to catch up, and for IonTorrent this works better.
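To get a feel for how a flow order interacts with a template, here is a minimal sketch in Python. It is a toy model of my own, not Ion Torrent's software: it assumes perfect chemistry where each flow incorporates the whole homopolymer run of the matching base, and the function name `flows_needed` is hypothetical. It only counts how many flows an idealised read consumes under a given order; it does not model phasing errors.

```python
SAMBA = "TACGTACGTCTGAGCATCGATCGATGTACAGC"  # flow order quoted in the post
CLASSIC = "TACG"                            # traditional repeated flow


def flows_needed(seq, flow_order):
    """Count the flows needed to read seq under an idealised model.

    Assumption: every flow fully extends the template, incorporating the
    entire homopolymer run when the flowed base matches the next base.
    """
    seq = seq.upper()
    if set(seq) - set("ACGT"):
        raise ValueError("sequence must contain only A, C, G and T")
    if set("ACGT") - set(flow_order):
        raise ValueError("flow order must contain all four bases")
    pos = 0    # index of the next base waiting to be incorporated
    flows = 0  # number of flows used so far
    while pos < len(seq):
        base = flow_order[flows % len(flow_order)]
        # Incorporate the whole run of this base, if it is next
        while pos < len(seq) and seq[pos] == base:
            pos += 1
        flows += 1
    return flows
```

For example, `flows_needed("GT", CLASSIC)` has to wait through the T, A and C flows before the G is incorporated, and then uses one more flow for the T. Comparing `flows_needed(read, CLASSIC)` with `flows_needed(read, SAMBA)` for your own reads shows how the two orders differ in flow usage under these assumptions.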
Most people will have seen a Gravatar user icon online, short for the rather grand sounding "Globally Recognized Avatar". For example GitHub.com and StackOverflow use them, and many blog platforms use them for user comments (sadly Blogger doesn't, yet). To get a user's icon, you construct a URL with the MD5 checksum of their email address - and if the user isn't registered you get a default image or a unique generated abstract icon.
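The URL construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than an official client: the helper name `gravatar_url` and its default arguments are my own choices, and it assumes the usual Gravatar conventions of hashing the trimmed, lower-cased address, with `s` for the pixel size and `d` for the fallback image (here `identicon`, one of the generated abstract icons).

```python
import hashlib
from urllib.parse import urlencode


def gravatar_url(email, size=80, default="identicon"):
    """Build a Gravatar image URL for an email address (hypothetical helper)."""
    # Normalise first, so " Alice@Example.COM " and "alice@example.com" agree
    normalised = email.strip().lower()
    digest = hashlib.md5(normalised.encode("utf-8")).hexdigest()
    # s = requested size in pixels, d = what to show for unregistered users
    query = urlencode({"s": size, "d": default})
    return "https://www.gravatar.com/avatar/%s?%s" % (digest, query)


if __name__ == "__main__":
    print(gravatar_url("someone@example.com"))
```

The email address itself never appears in the URL, only its checksum, which is why the normalisation step matters: any difference in case or stray whitespace would give a different hash and therefore a different icon.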
It seems Ion Torrent are trying to present themselves as the open, democratising platform for high throughput sequencing, with their Ion Community, sample datasets and (in theory) open source software.
In my last post I looked at how the GZIP variant BGZF (Blocked GNU Zip Format, used in BAM files) allowed efficient random access to large compressed files. This time I'm looking at bzip2 (bz2) which offers better compression than GZIP, but is also block based so in theory the same random access strategy can be employed.