Rogue Scholar

BHL, Biomedical, GBIF, Linking, Mekong River Schistosomiasis; Computer and Information Sciences; English

BHL and GBIF as biomedical databases

Published 27 March 2012

When I think of the Biodiversity Heritage Library (BHL) or GBIF I tend to think of taxonomy and biodiversity. Folk wisdom has it that BHL is full of old books, mostly pre-1923. Great for finding old taxonomic names, or nice artwork, but not exactly "modern" biology. GBIF is mainly about displaying organism distributions based on museum specimens, the primary data of taxonomic research.

Computer and Information Sciences; English

iEvoBio 2012 Challenge: Synthesizing phylogenies

https://doi.org/10.59350/vpyvh-gjw63

Published 21 March 2012

Author: Roderic Page

The iEvoBio 2012 Challenge has been announced, and the topic is synthesizing phylogenies. The task: The rules of this challenge are: The set of trees you use must have at least 10,000 leaves in total.

Annotation, Error, GBIF, Genbank, Identifiers; Computer and Information Sciences; English

Yet more reasons to have specimen identifiers: annotating GenBank sequences

https://doi.org/10.59350/k46hh-dz648

Published 1 March 2012

Author: Roderic Page

One reason I'm pursuing the theme of specimen identifiers (and identifiers in general) is the central role they play in annotating databases. To give a concrete example, I (among others) have argued for a wiki-style annotation layer on top of GenBank to capture things such as sequencing errors, updated species names, etc. Annotation is a lot easier if we have consistent identifiers for the things being annotated.

Integration, Links, Velcro; Computer and Information Sciences; English

Making biodiversity data sticky: it's all about links

https://doi.org/10.59350/kdgp2-er494

Published 29 February 2012

Author: Roderic Page

Sometimes I need to remind myself just why I'm spending so much time trying to make sense of other people's data, and why I go on (and on) about identifiers. One reason for my obsession is that I want data to be "sticky", like the burrs shown in the photo above (Who invented velcro? by A-dep). Shared identifiers are like the hooks on the burrs: if two pieces of data have the same identifier, they will stick together.
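To make the "sticky" idea concrete, here is a minimal Python sketch, not taken from the post itself. It assumes each record is a dictionary carrying an "id" field; the function name `stick_together`, the record layout, and the identifier values are all invented for illustration.

```python
from collections import defaultdict


def stick_together(*sources):
    """Group records from several sources by their shared "id" value."""
    by_id = defaultdict(list)
    for records in sources:
        for record in records:
            by_id[record["id"]].append(record)
    # Keep only identifiers that occur in more than one record: these are the links.
    return {key: recs for key, recs in by_id.items() if len(recs) > 1}


# Hypothetical records; the identifier values are invented for illustration.
specimens = [{"id": "EX:Herp:1", "kind": "specimen"},
             {"id": "EX:Herp:2", "kind": "specimen"}]
sequences = [{"id": "EX:Herp:1", "kind": "sequence"}]
articles = [{"id": "EX:Herp:1", "kind": "article"},
            {"id": "EX:Herp:3", "kind": "article"}]

linked = stick_together(specimens, sequences, articles)
print(sorted(linked))
```

Only the identifier used by more than one record produces a link; records whose identifiers nobody else uses stay isolated, which is the point of the burr analogy.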

BioStor, Digitisation, GBIF, Host, Lice; Computer and Information Sciences; English

GBIF specimens in BioStor: who are the top ten museums with citable specimens?

https://doi.org/10.59350/d97rd-ea309

Published 28 February 2012

Author: Roderic Page

Brief update on yesterday's post about finding specimens in BioStor. BioStor has some 66,000 articles from BHL, from which I've extracted 143,000 cases of a specimen code being cited in the text.

BHL, BioStor, GBIF, Identifiers, Linking; Computer and Information Sciences; English

Linking GBIF and the Biodiversity Heritage Library

https://doi.org/10.59350/ehbwx-fjv34

Published 27 February 2012

Author: Roderic Page

Following on from exploring links between GBIF and GenBank, here I'm going to look at links between GBIF and the primary literature, in this case articles scanned by the Biodiversity Heritage Library (BHL). The OCR text in BHL can be mined for a variety of entities. BHL itself has used uBio's tools to identify taxonomic names in the OCR text, and in my BioStor project I've extracted article-level metadata and geographic co-ordinates.

Darwin Core Triplet, Duplicates, GBIF, Identifiers, Specimen Codes; Computer and Information Sciences; English

How many specimens does GBIF really have?

https://doi.org/10.59350/2d3dv-8q010

Published 23 February 2012

Author: Roderic Page

Duplicate records are the bane of any project that aggregates data from multiple sources.

Clustering, Data Cleaning, Graphviz, Taxonomy; Computer and Information Sciences; English

Clustering strings

https://doi.org/10.59350/wfhyy-qt220

Published 22 February 2012

Author: Roderic Page

Revisiting an old idea (Clustering taxonomic names), I've added code to cluster strings into sets of similar strings to the phyloinformatics course site. This service (available at http://iphylo.org/~rpage/phyloinformatics/services/clusterstrings.php) takes a list of strings, one per line, and returns a list of clusters.
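The excerpt does not say how the service measures similarity, so the following Python sketch is only one plausible way to get the same input/output behaviour, not the service's actual code. It assumes that two strings are "similar" when their `difflib.SequenceMatcher` ratio reaches a threshold, and that clusters are the connected components of the resulting similarity graph. The function name `cluster_strings` and the 0.8 threshold are illustrative choices.

```python
from difflib import SequenceMatcher
from itertools import combinations


def cluster_strings(text, threshold=0.8):
    """Group the non-empty lines of `text` into clusters of similar strings."""
    strings = [line.strip() for line in text.splitlines() if line.strip()]
    parent = list(range(len(strings)))  # union-find forest, one node per string

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair whose case-insensitive similarity reaches the threshold.
    for i, j in combinations(range(len(strings)), 2):
        ratio = SequenceMatcher(None, strings[i].lower(), strings[j].lower()).ratio()
        if ratio >= threshold:
            parent[find(j)] = find(i)

    # Collect strings by the root of their component.
    clusters = {}
    for i, s in enumerate(strings):
        clusters.setdefault(find(i), []).append(s)
    return list(clusters.values())


names = """Homo sapiens
Homo sapeins
Panthera leo
Panthera leo L.
Drosophila melanogaster"""

for cluster in cluster_strings(names):
    print(cluster)
```

Comparing all pairs is quadratic in the number of strings, which is fine for a short pasted list but would need blocking or indexing for anything large.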

LSID, Rant; Computer and Information Sciences; English

Why LSIDs suck

https://doi.org/10.59350/s2ttc-w0z21

Published 22 February 2012

Author: Roderic Page

I'll keep this short: LSIDs suck because they are so hard to set up that many LSIDs don't actually work. Because of this there seems to be no shame in publishing "fake" LSIDs (LSIDs that look like LSIDs but which don't resolve using the LSID protocol). Hey, it's hard work, so let's just stick them on a web page but not actually make them resolvable.

Frogs, GBIF, Genbank, Geophylogeny, KML; Computer and Information Sciences; English

Linking GBIF and Genbank

https://doi.org/10.59350/hj161-hh554

Published 21 February 2012

Author: Roderic Page

As part of my mantra that it's not about the data, it's all about the links between the data, I've started exploring matching GenBank sequences to GBIF occurrences using the specimen_voucher codes recorded in GenBank sequences. It's quickly becoming apparent that this is not going to be easy.
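One likely source of difficulty is that the same specimen code can be written in several ways. The Python sketch below is not the method used in the post. It assumes a voucher string starts with an institution acronym and ends with a catalogue number, and it ignores anything in between, such as a collection code. The function names, accession numbers, and occurrence identifiers are all hypothetical.

```python
import re


def voucher_key(code):
    """Reduce a voucher string to (institution, number), or None if unparseable."""
    parts = re.findall(r"[A-Za-z]+|\d+", code)
    if len(parts) < 2 or not parts[0].isalpha() or not parts[-1].isdigit():
        return None
    return parts[0].upper(), int(parts[-1])


def match_vouchers(genbank, gbif):
    """Map each GenBank accession to GBIF occurrence ids with the same voucher key."""
    index = {}
    for occurrence_id, code in gbif.items():
        key = voucher_key(code)
        if key is not None:
            index.setdefault(key, []).append(occurrence_id)
    matches = {}
    for accession, code in genbank.items():
        key = voucher_key(code)
        if key in index:
            matches[accession] = index[key]
    return matches


# Invented example data: accession -> specimen_voucher, occurrence id -> specimen code.
genbank = {"XX000001": "MVZ 12345", "XX000002": "personal collection"}
gbif = {"occ-1": "MVZ:Herp:12345", "occ-2": "mvz-099999"}

print(match_vouchers(genbank, gbif))
```

Dropping the collection code makes matching more forgiving but risks false matches when an institution reuses catalogue numbers across collections, so any real matching would need further checks.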

Challenge, EOL, Tree Of Life; Computer and Information Sciences; English

EOL Phylogenetic Tree Challenge

https://doi.org/10.59350/8x2cb-5h677

Published 16 February 2012

Author: Roderic Page

The Encyclopedia of Life have announced the EOL Phylogenetic Tree Challenge. The contest has two purposes: First prize is a trip to iEvoBio 2012, this year in Ottawa, Canada. For more details visit the challenge website. There is also an EOL community devoted to this challenge. Challenges are great things, especially ones with worthwhile tasks and decent prizes.

iPhylo
