Rogue Scholar

BLAST, NCBI, Biology, English

BLAST tabular output missing descriptions

Published 25 May 2012

This is an open letter to the NCBI BLAST+ team to request two simple enhancements which I think would be extremely useful - first and foremost the option to include BLAST result descriptions in the tabular output. Having the taxonomic identifiers (if available) would be great too - allowing downstream filtering of BLAST results by species etc.

BLAST, NCBI, Biology, English

BLAST+ ignoring search space size for e-values

https://doi.org/10.59350/t7hq6-k5888

Published 18 May 2012

Author Peter Cock

Sometimes using BLAST is frustrating. Today I'm writing about it returning different expectation values, and therefore different answers, depending on whether you use a FASTA subject file or a database made from that file.
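As background, the way an expectation value scales with search space can be sketched using the standard Karlin-Altschul relation E = m × n × 2^(-S'), where S' is the bit score, m the effective query length and n the effective subject or database length. The lengths below are made-up numbers for illustration, not figures from the post, and real BLAST applies length corrections that this sketch ignores.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Karlin-Altschul expectation: E = m * n * 2 ** (-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical sizes, for illustration only
query_len = 500
single_subject_len = 2_000        # search space if one subject sequence is scored alone
whole_database_len = 2_000_000    # search space if the whole database is counted

e_small = expect_value(50.0, query_len, single_subject_len)
e_large = expect_value(50.0, query_len, whole_database_len)

# Same alignment, same bit score, but the E-value grows with the search space
print(e_small, e_large)
```

With a fixed E-value threshold, the same hit could pass under the smaller search space and fail under the larger one, which is one way the two runs could give different answers.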

BLAST, NCBI, Biology, English

BLAST+ memory hog with subject FASTA and XML output

https://doi.org/10.59350/0xfck-e5229

Published 18 April 2012

Author Peter Cock

We noticed a major memory problem running NCBI BLAST+ with XML output using a FASTA subject (consuming loads of swap space then getting killed). This doesn't happen with tabular output, nor if using a BLAST database.

Figure: Memory and CPU usage (until killed by OS)

The screenshots are from the cluster monitoring tool Ganglia; the horizontal red lines show the machine's physical limits (here 4 CPUs and 8 GB of RAM). This example is using BLAST+

NCBI, Biology, English

Missing feature locations in GenBank "with parts"

https://doi.org/10.59350/bmf21-1fz95

Published 3 April 2012

Author Peter Cock

My last blog post was on a problem with missing NCBI GenBank feature location information for trans-spliced genes when using GenBank 'with parts'. That was fixed, but while using EFetch a colleague just stumbled over something possibly related where the location is given as just a question mark.

NCBI, Biology, English

Missing external exons in GenBank "with parts"

https://doi.org/10.59350/rmx4h-60g74

Published 16 March 2012

Author Peter Cock

I recently stumbled on a problem in NCBI Entrez with the GenBank (with parts) return type. Some GenBank files don't actually contain a sequence at the end; instead they have a CONTIG section telling you how to construct the sequence from other referenced pieces. That's often inconvenient, so the NCBI have the handy option of downloading it with all these parts pre-computed, which normally is great.
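For readers who have not used it, the "with parts" option corresponds to the `gbwithparts` return type in the Entrez EFetch utility. The sketch below only builds the request URL and makes no network call; the helper name `efetch_url` and the placeholder accession are hypothetical, not taken from the post.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession: str, with_parts: bool = True) -> str:
    """Build an EFetch URL for a nucleotide record (hypothetical helper)."""
    params = {
        "db": "nuccore",
        "id": accession,
        # "gb" may return only a CONTIG section; "gbwithparts" asks for the full sequence
        "rettype": "gbwithparts" if with_parts else "gb",
        "retmode": "text",
    }
    return f"{EFETCH}?{urlencode(params)}"


# Placeholder identifier: substitute a real accession before fetching
print(efetch_url("YOUR_ACCESSION"))
```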

Compression, SAM/BAM, Biology, English

BAM versus CRAM v0.7

https://doi.org/10.59350/hzb30-jfr61

Published 9 March 2012

Author Peter Cock

CRAM 0.7 was released earlier this month, and includes support for storing arbitrary read tags - a key requirement for it to be evaluated in existing pipelines as a BAM alternative. However, it doesn't preserve read names - which is a compression trick you can also do with plain BAM. Vadim kindly released some sample data with BAM -> CRAM 0.7 -> BAM.

Compression, SAM/BAM, Biology, English

Reference based SAM/BAM compression

https://doi.org/10.59350/9fe72-dwb66

Published 14 February 2012

Author Peter Cock

In some respects the SAM/BAM specification is quite loose, in that there is more than one way to represent a given piece of information. We can take advantage of this to reduce the size on disk of mapped reads which match the reference sequence, while still maintaining conformance within the spec.
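One example of this looseness, assumed here since the excerpt does not name a specific trick, is that the SAM specification allows `=` in the SEQ field for a base identical to the reference. The minimal sketch below handles only ungapped alignments and uses hypothetical helper names; long runs of `=` tend to compress better than mixed bases.

```python
def mask_matches(read_seq: str, ref_seq: str) -> str:
    """Replace read bases identical to the reference with '='.

    Assumes an ungapped alignment, so both strings have equal length.
    """
    if len(read_seq) != len(ref_seq):
        raise ValueError("ungapped sketch: read and reference must be the same length")
    return "".join(
        "=" if read.upper() == ref.upper() else read
        for read, ref in zip(read_seq, ref_seq)
    )


def unmask(masked_seq: str, ref_seq: str) -> str:
    """Recover the read bases by filling each '=' from the reference."""
    return "".join(ref if base == "=" else base for base, ref in zip(masked_seq, ref_seq))


masked = mask_matches("ACGTTGCA", "ACGATGCA")
print(masked)                      # ===T====
print(unmask(masked, "ACGATGCA"))  # ACGTTGCA
```

The round trip needs the reference sequence to be available when decoding, which is the trade-off behind any reference-based scheme.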

Ion Torrent, Biology, English

Ion Torrent Suite on GitHub

https://doi.org/10.59350/b6xmg-n2w68

Published 23 January 2012

Author Peter Cock

Good news - the Ion Torrent Suite is now freely available open source software on GitHub under the GPL v2 licence, as promised late last year. There is now something more substantial behind talk of Ion Torrent "democratising sequencing", and a clear advantage over the closed source tools of rival companies. I commend them!

Ion Torrent, Biology, English

Ion Torrent does the Samba

https://doi.org/10.59350/kny56-72062

Published 16 January 2012

Author Peter Cock

I'm a bit behind the curve here (see Lex's blog post from July 2011), but I was amused to find out Ion Torrent call their current nucleotide flow order TACGTACGTCTGAGCATCGATCGATGTACAGC the "Samba". Apparently the idea is to avoid reads going out of phase which could happen with the traditional repeated flow TACG (still used by Roche 454), by giving the molecules which missed a base a chance to catch up, and for IonTorrent this works better.
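To make the idea of a flow order concrete, here is a small sketch that computes an ideal, noise-free flowgram: for each nucleotide flowed in turn, how many bases of the read are incorporated. It does not model phasing errors or the catch-up effect described above; the function name and the `max_flows` guard are illustrative assumptions.

```python
from itertools import cycle

TACG = "TACG"
SAMBA = "TACGTACGTCTGAGCATCGATCGATGTACAGC"


def flowgram(read_seq: str, flow_order: str, max_flows: int = 10_000) -> list:
    """Ideal homopolymer count per flow until the whole read is consumed."""
    counts = []
    pos = 0
    for nucleotide in cycle(flow_order):
        # Stop once the read is finished (max_flows guards against unexpected letters)
        if pos >= len(read_seq) or len(counts) >= max_flows:
            break
        run = 0
        while pos < len(read_seq) and read_seq[pos] == nucleotide:
            run += 1
            pos += 1
        counts.append(run)
    return counts


print(flowgram("TTACG", TACG))   # [2, 1, 1, 1]
print(flowgram("GT", TACG))      # [0, 0, 0, 1, 1]
print(len(flowgram("GATTACA", SAMBA)))
```

Swapping `TACG` for `SAMBA` changes which flows are empty for a given read, which is the lever the flow order design works with.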

Biology, English

Validating ID via Gravatar

https://doi.org/10.59350/d15dn-p7032

Published 13 December 2011

Author Peter Cock

Most people will have seen a Gravatar user icon online, short for the rather grand sounding "Globally Recognized Avatar". For example GitHub.com and StackOverflow use them, and many blog platforms use them for user comments (sadly Blogger doesn't, yet). To get a user's icon, you construct a URL with the MD5 checksum of their email address - and if the user isn't registered you get a default image or a unique generated abstract icon.
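The URL construction can be sketched in a few lines. This assumes Gravatar's usual convention of trimming and lower-casing the address before hashing, and uses its `s` (size) and `d` (default image) query parameters; the function name and the example address are placeholders.

```python
import hashlib


def gravatar_url(email: str, size: int = 80, default: str = "identicon") -> str:
    """Build a Gravatar image URL from an email address."""
    # Normalise first so the same address always yields the same checksum
    normalised = email.strip().lower()
    digest = hashlib.md5(normalised.encode("utf-8")).hexdigest()
    # d=identicon requests a generated abstract icon for unregistered addresses
    return f"https://www.gravatar.com/avatar/{digest}?s={size}&d={default}"


print(gravatar_url("someone@example.com"))
```

Because only the checksum appears in the URL, two sites can show the same icon for a user without sharing the address itself in plain text.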

Ion Torrent, Biology, English

Is IonTorrent open or not?

https://doi.org/10.59350/q0fa0-z2g64

Published 12 December 2011

Author Peter Cock

It seems IonTorrent are trying to present themselves as the open democratising sequencing platform for high throughput sequencing, with their Ion Community, sample datasets and (in theory) open source software.

Blasted Bioinformatics!?
