Scienze informatiche e dell'informazioneIngleseBlogger

iPhylo

Rants, raves (and occasionally considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For more ranty and less considered opinions, see my Twitter feed.ISSN 2051-8188. Written content on this site is licensed under a Creative Commons Attribution 4.0 International license.
Pagina inizialeAtom ForaggioMastodonISSN 2051-8188
language
CrossrefDataCiteDOIIdentifiersSpecimen CodesScienze informatiche e dell'informazioneInglese
Pubblicato

Based on recent discussions my sense is that our community will continue to thrash the issue of identifiers to death, repeating many of the debates that have gone on (and will go on) in other areas. To be trite, it seems to me we have three criteria: cheap , resolvable , and persistent . We get to pick two. Cheap and resolvable means URLs, which everybody is nervous about because they break.

ChallengeDataEOLScienze informatiche e dell'informazioneInglese
Pubblicato

Now we are awash in challenges! EOL has announced its Computable Data Challenge:Some $US 50,000 is on offer. "Challenge" is perhaps a misnomer, as EOL is offering this money not as a prize at the end, but rather to fund one or more proposals (submitted by 22 May) that are accepted.

BHLBiomedicalGBIFLinkingMekong River SchistosomiasisScienze informatiche e dell'informazioneInglese
Pubblicato

When I think of the Biodiversity Heritage Library (BHL) or GBIF I tend to think of taxonomy and biodiversity. Folk wisdom has it that BHL is full of old books, mostly pre-1923. Great for finding old taxonomic names, or nice artwork, but not exactly "modern" biology. GBIF is mainly about displaying organism distributions based on museum specimens, the primary data of taxonomic research.

AnnotationErrorGBIFGenbankIdentifiersScienze informatiche e dell'informazioneInglese
Pubblicato

One reason I'm pursuing the theme of specimen identifiers (and identifiers in general) is the central role they play in annotating databases. To give a concrete example, I (among others) have argued for a wiki-style annotation layer on top of GenBank to capture things such as sequencing errors, updated species names, etc. Annotation is a lot easier if we have consistent identifiers for the things being annotated.

IntegrationLinksVelcroScienze informatiche e dell'informazioneInglese
Pubblicato

Sometimes I need to remind myself just why I'm spending so much time trying to make sense of other people's data, and why I go on (and on) about identifiers. One reason for my obsession is I want data to be "sticky", like the burrs shown in the photo above (Who invented velcro? by A-dep). Shared identifiers are like the hooks on the burrs, if two pieces of data have the same identifier they will stick together.

BHLBioStorGBIFIdentifiersLinkingScienze informatiche e dell'informazioneInglese
Pubblicato

Following on from exploring links between GBIF and GenBank here I'm going to look at links between GBIF and the primary literature, in this case articles scanned by the Biodiversity Heritage Library (BHL). The OCR text in BHL can be mined for a variety of entities. BHL itself has used uBio's tools to identity taxonomic names in the OCR text, and in my BioStor project I've extracted article-level metadata and geographic co-ordinates.

ClusteringData CleaningGraphvizTaxonomyScienze informatiche e dell'informazioneInglese
Pubblicato

Revisiting an old idea (Clustering taxonomic names) I've added code to cluster strings into sets of similar strings to the phyloinformatics course site.This service (available at http://iphylo.org/~rpage/phyloinformatics/services/clusterstrings.php) takes a list of strings, one per line, and returns a list of clusters.

LSIDRantScienze informatiche e dell'informazioneInglese
Pubblicato

I'll keep this short: LSIDs suck because they are so hard to set up that many LSIDs don't actually work. Because of this there seems to be no shame in publishing "fake" LSIDs (LSIDs that look like LSIDs but which don't resolve using the LSID protocol). Hey, it's hard work, so let's just stick them on a web page but not actually make them resolvable.