Rogue Scholar

Dark Taxa, DNA Barcoding, NCBI, Computer and Information Sciences, English

Dark taxa even darker: NCBI pulls (some) DNA barcodes from GenBank (updated)

Published 24 April 2012

Dark taxa have become even darker. NCBI has pulled the plug on large numbers of DNA barcode sequences that lack scientific names. For example, taxon Cyclopoida sp. BOLD:AAG9771 (tax_id 818059) now has a sparse page that has no associated sequences. From an earlier download of EMBL I know that this taxon is associated with at least 5 sequences, such as GU679674. But if you go to that sequence you get this: So the sequence is hidden.

Crossref, DataCite, DOI, Identifiers, Specimen Codes, Computer and Information Sciences, English

Quick thoughts on specimen identifiers

https://doi.org/10.59350/8y7v3-6jc97

Published 20 April 2012

Author Roderic Page

Based on recent discussions, my sense is that our community will continue to thrash the issue of identifiers to death, repeating many of the debates that have gone on (and will go on) in other areas. To be trite, it seems to me we have three criteria: cheap, resolvable, and persistent. We get to pick two. Cheap and resolvable means URLs, which everybody is nervous about because they break.

Challenge, Data, EOL, Computer and Information Sciences, English

EOL Computable Data Challenge community

https://doi.org/10.59350/qxr4p-pdq88

Published 5 April 2012

Author Roderic Page

Now we are awash in challenges! EOL has announced its Computable Data Challenge. Some US$50,000 is on offer. "Challenge" is perhaps a misnomer, as EOL is offering this money not as a prize at the end, but rather to fund one or more accepted proposals (to be submitted by 22 May).

BHL, Biomedical, GBIF, Linking, Mekong River Schistosomiasis, Computer and Information Sciences, English

BHL and GBIF as biomedical databases

https://doi.org/10.59350/8pp2p-9dh09

Published 27 March 2012

Author Roderic Page

When I think of the Biodiversity Heritage Library (BHL) or GBIF I tend to think of taxonomy and biodiversity. Folk wisdom has it that BHL is full of old books, mostly pre-1923. Great for finding old taxonomic names, or nice artwork, but not exactly "modern" biology. GBIF is mainly about displaying organism distributions based on museum specimens, the primary data of taxonomic research.

Computer and Information Sciences, English

iEvoBio 2012 Challenge: Synthesizing phylogenies

https://doi.org/10.59350/vpyvh-gjw63

Published 21 March 2012

Author Roderic Page

The iEvoBio 2012 Challenge has been announced, and the topic is synthesizing phylogenies. The announcement sets out the task and the rules of the challenge, including: the set of trees you use must have at least 10,000 leaves in total.

Annotation, Error, GBIF, GenBank, Identifiers, Computer and Information Sciences, English

Yet more reasons to have specimen identifiers: annotating GenBank sequences

https://doi.org/10.59350/k46hh-dz648

Published 1 March 2012

Author Roderic Page

One reason I'm pursuing the theme of specimen identifiers (and identifiers in general) is the central role they play in annotating databases. To give a concrete example, I (among others) have argued for a wiki-style annotation layer on top of GenBank to capture things such as sequencing errors, updated species names, etc. Annotation is a lot easier if we have consistent identifiers for the things being annotated.

Integration, Links, Velcro, Computer and Information Sciences, English

Making biodiversity data sticky: it's all about links

https://doi.org/10.59350/kdgp2-er494

Published 29 February 2012

Author Roderic Page

Sometimes I need to remind myself just why I'm spending so much time trying to make sense of other people's data, and why I go on (and on) about identifiers. One reason for my obsession is I want data to be "sticky", like the burrs shown in the photo above (Who invented velcro? by A-dep). Shared identifiers are like the hooks on the burrs: if two pieces of data have the same identifier, they will stick together.
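The "sticky" idea can be illustrated with a minimal sketch, assuming two unrelated record sets that happen to share an identifier field. The records, field names, and identifier values below are all invented for illustration and do not come from any real database.

```python
from collections import defaultdict

# Hypothetical records from two unrelated sources. Field names and values
# are invented for illustration only.
sequences = [
    {"accession": "SEQ0001", "specimen_id": "MUSEUM:Herp:1001"},
    {"accession": "SEQ0002", "specimen_id": "MUSEUM:Herp:1002"},
]
occurrences = [
    {"occurrence": "OCC-A", "specimen_id": "MUSEUM:Herp:1001"},
    {"occurrence": "OCC-B", "specimen_id": "MUSEUM:Herp:9999"},
]


def stick_together(*sources, key="specimen_id"):
    """Group records from any number of sources by a shared identifier."""
    linked = defaultdict(list)
    for source in sources:
        for record in source:
            linked[record[key]].append(record)
    return dict(linked)


linked = stick_together(sequences, occurrences)

# Identifiers seen in more than one record are the ones that "stuck".
stuck = {ident: recs for ident, recs in linked.items() if len(recs) > 1}
```

Here only `MUSEUM:Herp:1001` appears in both sources, so it is the only identifier that joins a sequence to an occurrence. Records whose identifiers differ, even trivially, stay apart.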

BioStor, Digitisation, GBIF, Host, Lice, Computer and Information Sciences, English

GBIF specimens in BioStor: who are the top ten museums with citable specimens?

https://doi.org/10.59350/d97rd-ea309

Published 28 February 2012

Author Roderic Page

Brief update on yesterday's post about finding specimens in BioStor. BioStor has some 66,000 articles from BHL, from which I've extracted 143,000 cases of a specimen code being cited in the text.
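As a rough illustration of what extracting specimen codes from article text might involve, here is a minimal sketch using a regular expression. The pattern is an assumption (an upper-case acronym followed by a catalogue number), not the method BioStor uses, and the sample sentence and catalogue numbers are invented.

```python
import re

# Assumed pattern: an upper-case institution acronym of 2 to 6 letters, an
# optional space, then a catalogue number of at least 3 digits. Real
# specimen codes are far more varied than this.
SPECIMEN_CODE = re.compile(r"\b([A-Z]{2,6})\s?(\d{3,})\b")


def find_specimen_codes(text):
    """Return candidate specimen codes, normalised to 'ACRONYM number'."""
    return [f"{acronym} {number}" for acronym, number in SPECIMEN_CODE.findall(text)]


# Invented sample text; the catalogue numbers are not real specimens.
sample = "Holotype USNM 123456, paratypes MCZ 7890 and AMNH12345."
codes = find_specimen_codes(sample)
```

A pattern this loose will also match things that are not specimen codes (grant numbers, for example), so in practice the acronym would probably need checking against a list of known institutions.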

BHL, BioStor, GBIF, Identifiers, Linking, Computer and Information Sciences, English

Linking GBIF and the Biodiversity Heritage Library

https://doi.org/10.59350/ehbwx-fjv34

Published 27 February 2012

Author Roderic Page

Following on from exploring links between GBIF and GenBank, here I'm going to look at links between GBIF and the primary literature, in this case articles scanned by the Biodiversity Heritage Library (BHL). The OCR text in BHL can be mined for a variety of entities. BHL itself has used uBio's tools to identify taxonomic names in the OCR text, and in my BioStor project I've extracted article-level metadata and geographic coordinates.

Darwin Core Triplet, Duplicates, GBIF, Identifiers, Specimen Codes, Computer and Information Sciences, English

How many specimens does GBIF really have?

https://doi.org/10.59350/2d3dv-8q010

Published 23 February 2012

Author Roderic Page

Duplicate records are the bane of any project that aggregates data from multiple sources.

Clustering, Data Cleaning, Graphviz, Taxonomy, Computer and Information Sciences, English

Clustering strings

https://doi.org/10.59350/wfhyy-qt220

Published 22 February 2012

Author Roderic Page

Revisiting an old idea (Clustering taxonomic names), I've added code to cluster strings into sets of similar strings to the phyloinformatics course site. This service (available at http://iphylo.org/~rpage/phyloinformatics/services/clusterstrings.php) takes a list of strings, one per line, and returns a list of clusters.

iPhylo
