Rogue Scholar

Published February 28, 2012

Brief update on yesterday's post about finding specimens in BioStor. BioStor has some 66,000 articles from BHL, from which I've extracted 143,000 cases of a specimen code being cited in the text. Of these 143,000 occurrences, 81,000 have been matched to an occurrence in GBIF.
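A minimal sketch of what pulling specimen codes out of article text might look like. The regular expression is an assumption made for illustration (an uppercase acronym followed by a catalogue number), not the pattern actually used for BioStor, and the sample sentence is invented. The last line works out the match rate implied by the figures above.

```python
import re

# Hypothetical pattern: an uppercase institution acronym (2-6 letters),
# an optional space or hyphen, then a catalogue number of 3 or more digits.
SPECIMEN_CODE = re.compile(r"\b([A-Z]{2,6})[ -]?(\d{3,})\b")


def extract_specimen_codes(text):
    """Return (acronym, number) pairs for every candidate code in the text."""
    return SPECIMEN_CODE.findall(text)


def match_rate(matched, total):
    """Fraction of extracted codes that were matched to an occurrence."""
    return matched / total


# Invented sample sentence, for illustration only.
sample = "Holotype MCZ 12345 and paratype USNM-67890 were examined."
print(extract_specimen_codes(sample))  # [('MCZ', '12345'), ('USNM', '67890')]

# 81,000 of 143,000 extracted codes matched to GBIF.
print(f"{match_rate(81_000, 143_000):.1%}")  # 56.6%
```

A pattern this loose will also pick up strings that are not specimen codes, so in practice the candidates would need checking against known institution acronyms.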

BHL, BioStor, GBIF, Identifiers, Linking, Computer and Information Sciences

Linking GBIF and the Biodiversity Heritage Library

https://doi.org/10.59350/ehbwx-fjv34

Published February 27, 2012

Author Roderic Page

Following on from exploring links between GBIF and GenBank, here I'm going to look at links between GBIF and the primary literature, in this case articles scanned by the Biodiversity Heritage Library (BHL). The OCR text in BHL can be mined for a variety of entities. BHL itself has used uBio's tools to identify taxonomic names in the OCR text, and in my BioStor project I've extracted article-level metadata and geographic co-ordinates.

Darwin Core Triplet, Duplicates, GBIF, Identifiers, Specimen Codes, Computer and Information Sciences

How many specimens does GBIF really have?

https://doi.org/10.59350/2d3dv-8q010

Published February 23, 2012

Author Roderic Page

Duplicate records are the bane of any project that aggregates data from multiple sources.

Clustering, Data Cleaning, Graphviz, Taxonomy, Computer and Information Sciences

Clustering strings

https://doi.org/10.59350/wfhyy-qt220

Published February 22, 2012

Author Roderic Page

Revisiting an old idea (Clustering taxonomic names), I've added code to the phyloinformatics course site that clusters strings into sets of similar strings. This service (available at http://iphylo.org/~rpage/phyloinformatics/services/clusterstrings.php) takes a list of strings, one per line, and returns a list of clusters. For example, given the names Ferrusac 1821, Bonavita 1965, Ferussa 1821, Fer.
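One way to cluster similar strings can be sketched as follows: treat each string as a node, join two nodes when their similarity passes a threshold, and report the connected components as clusters. This is an illustration only. The similarity measure (Python's difflib ratio) and the 0.8 threshold are assumptions, and the service itself may work differently. The sample uses the three complete names from the example.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.8):
    """True if the case-insensitive difflib ratio reaches the threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def cluster_strings(strings, threshold=0.8):
    """Group strings into connected components of the similarity graph."""
    parent = list(range(len(strings)))

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair and merge the components of similar strings.
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            if similar(strings[i], strings[j], threshold):
                parent[find(j)] = find(i)

    clusters = {}
    for i, s in enumerate(strings):
        clusters.setdefault(find(i), []).append(s)
    return list(clusters.values())


names = ["Ferrusac 1821", "Bonavita 1965", "Ferussa 1821"]
print(cluster_strings(names))
# [['Ferrusac 1821', 'Ferussa 1821'], ['Bonavita 1965']]
```

Comparing every pair is quadratic in the number of strings, which is fine for short lists but would need a smarter candidate search for large ones.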

LSID, Rant, Computer and Information Sciences

Why LSIDs suck

https://doi.org/10.59350/s2ttc-w0z21

Published February 22, 2012

Author Roderic Page

I'll keep this short: LSIDs suck because they are so hard to set up that many LSIDs don't actually work. Because of this there seems to be no shame in publishing "fake" LSIDs (LSIDs that look like LSIDs but which don't resolve using the LSID protocol). Hey, it's hard work, so let's just stick them on a web page but not actually make them resolvable.

Frogs, GBIF, Genbank, Geophylogeny, KML, Computer and Information Sciences

Linking GBIF and Genbank

https://doi.org/10.59350/hj161-hh554

Published February 21, 2012

Author Roderic Page

As part of my mantra that it's not about the data, it's all about the links between the data, I've started exploring matching GenBank sequences to GBIF occurrences using the specimen_voucher codes recorded in GenBank sequences. It's quickly becoming apparent that this is not going to be easy.
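To see why the matching is awkward, here is a rough sketch of the naive approach: normalise both the GenBank specimen_voucher string and the GBIF catalogue code, then look for exact matches. The function names, the normalisation rule, and all of the sample codes and occurrence ids are hypothetical, made up for this illustration.

```python
import re


def normalise_code(code):
    """Upper-case the code and drop everything except letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", code.upper())


def match_vouchers(vouchers, occurrences):
    """Map each voucher string to the occurrence ids sharing its normalised code.

    `occurrences` maps an occurrence id to its catalogue code.
    """
    index = {}
    for occurrence_id, catalogue_code in occurrences.items():
        index.setdefault(normalise_code(catalogue_code), []).append(occurrence_id)
    return {v: index.get(normalise_code(v), []) for v in vouchers}


# Invented data, for illustration only.
vouchers = ["KU 204425", "MVZ:Herp:12345"]
occurrences = {"occ-1": "KU204425", "occ-2": "MVZ 12345"}

print(match_vouchers(vouchers, occurrences))
# {'KU 204425': ['occ-1'], 'MVZ:Herp:12345': []}
```

The second voucher goes unmatched because one source includes a collection code and the other does not, which is the kind of inconsistency that makes exact matching unreliable.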

Challenge, EOL, Tree Of Life, Computer and Information Sciences

EOL Phylogenetic Tree Challenge

https://doi.org/10.59350/8x2cb-5h677

Published February 16, 2012

Author Roderic Page

The Encyclopedia of Life have announced the EOL Phylogenetic Tree Challenge. The contest has two purposes: First prize is a trip to iEvoBio 2012, this year in Ottawa, Canada. For more details visit the challenge website. There is also an EOL community devoted to this challenge. Challenges are great things, especially ones with worthwhile tasks and decent prizes. EOL badly needs a phylogenetic perspective, so this is a welcome development.

BLAST, Dark Taxa, Phyloinformatics, Computer and Information Sciences

BLAST a sequence and get a tree and a map

https://doi.org/10.59350/jpph3-ztv21

Published February 10, 2012

Author Roderic Page

I've updated the BLAST a sequence and get a tree tool described in a previous post to output additional details, such as a list of the sequences used to build the tree and some basic metadata (such as the taxon name, name of any associated host, publication, and geographic coordinates). If the sequences are geotagged, then you will also see a little map showing the localities.

Geophylogeny, Google Earth, KML, Matching, Computer and Information Sciences

Automating the creation of geophylogenies: NEXUS + delimited text = KML

https://doi.org/10.59350/7f9yn-gaw27

Published February 8, 2012

Author Roderic Page

One thing which has always frustrated me about geophylogenies is how tedious they are to create. In theory, they should be pretty straightforward to generate. We take a tree, get point localities for each leaf in the tree, and generate the KML to display on Google Earth. The tedious part is getting the latitude and longitude data in the right format, and linking the leaves in the tree to the locality data.
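The linking step can be sketched roughly as below, assuming the localities arrive as tab-delimited rows of leaf name, latitude and longitude. The function names and sample rows are hypothetical, and the sketch only places a point for each leaf. It does not parse NEXUS or draw the tree itself. Note that KML expects longitude before latitude.

```python
import csv
import io
from xml.sax.saxutils import escape


def read_localities(delimited_text):
    """Parse tab-delimited rows of name, latitude, longitude into a dict."""
    localities = {}
    for row in csv.reader(io.StringIO(delimited_text), delimiter="\t"):
        if len(row) == 3:  # skip blank or malformed rows
            name, lat, lon = row
            localities[name] = (float(lat), float(lon))
    return localities


def leaves_to_kml(leaves, localities):
    """Build a KML document with one placemark per leaf that has a locality."""
    placemarks = []
    for leaf in leaves:
        if leaf not in localities:
            continue  # leaf could not be linked to a locality row
        lat, lon = localities[leaf]
        placemarks.append(
            "<Placemark><name>{}</name>"
            "<Point><coordinates>{},{}</coordinates></Point>"
            "</Placemark>".format(escape(leaf), lon, lat)
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>\n'
        + "\n".join(placemarks)
        + "\n</Document></kml>"
    )


# Invented leaf names and coordinates, for illustration only.
table = "Frog_A\t-12.5\t130.8\nFrog_B\t-33.9\t151.2\n"
print(leaves_to_kml(["Frog_A", "Frog_C"], read_localities(table)))
```

Leaves with no matching row are silently skipped here; a real tool would probably want to report them, since mismatched names are where most of the tedium lies.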

Data Cleaning, Google Refine, Taxonomic Name, Computer and Information Sciences

Using Google Refine and taxonomic databases (EOL, NCBI, uBio, WORMS) to clean messy data

https://doi.org/10.59350/jyyjb-ppf17

Published February 6, 2012

Author Roderic Page

Google Refine is an elegant tool for data cleaning. One of its most powerful features is the ability to call "Reconciliation Services" to help clean data, for example by matching names to external identifiers. Google Refine comes with the ability to use Freebase reconciliation services, but you can also add external services. Inspired by this I've started to implement services to reconcile taxonomic names.
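As a rough illustration of what a reconciliation service does, the sketch below takes a batch of queries as JSON and returns candidate matches for each one. The request and response shapes (keys such as q0, and candidates with id, name, type, score and match) follow the reconciliation protocol as I understand it, so treat them as an assumption. The lookup table and its identifier are invented; a real service would query a taxonomic database instead.

```python
import json


def reconcile(queries_json, known_names):
    """Answer a batch of reconciliation queries from a name-to-id lookup.

    `queries_json` looks like {"q0": {"query": "Some name"}, ...}.
    """
    queries = json.loads(queries_json)
    response = {}
    for key, query in queries.items():
        name = query["query"].strip()
        candidates = []
        if name in known_names:
            candidates.append(
                {
                    "id": known_names[name],
                    "name": name,
                    "type": [],
                    "score": 100,
                    "match": True,  # exact match, so no manual review needed
                }
            )
        response[key] = {"result": candidates}
    return json.dumps(response)


# Hypothetical lookup table with a made-up identifier.
known = {"Homo sapiens": "example:1"}
batch = json.dumps({"q0": {"query": "Homo sapiens"}, "q1": {"query": "Unknown name"}})
print(reconcile(batch, known))
```

Exact lookup is the simplest possible strategy; the value of a taxonomic service lies in returning scored near matches for misspelt or variant names.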

iPhylo

GBIF specimens in BioStor: who are the top ten museums with citable specimens?
