Natural Sciences · English · Hugo

Donny Winston

Made as simple as possible, but not simpler.

In a collaboration, data objects are produced at many sites. To make the data objects findable, you may steward a central, searchable index for their metadata. How then do you make the data objects accessible for download? One common solution is to centralize the custodianship – have all sites upload copies of their data objects to a central store. The central store may partition storage across several physical servers behind the scenes (e.g.


One powerful mechanism of robustness is exploratory behavior, for which the desired outcome is produced by a generate-and-test mechanism. This organization allows the generator to work and be developed independently of the tester that accepts or rejects a particular result. One can make an analogy to biological evolution, where the generator is random mutation and the tester is natural selection.
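A minimal sketch of that generate-and-test organization, assuming a toy problem of my own choosing (finding the first perfect square that is a three-digit palindrome). The names `generate_and_test`, `squares`, and `is_three_digit_palindrome` are hypothetical and only illustrate that the generator and the tester know nothing about each other.

```python
import itertools
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def generate_and_test(candidates: Iterable[T], accept: Callable[[T], bool]) -> Optional[T]:
    """Return the first candidate the tester accepts, or None if none pass."""
    for candidate in candidates:
        if accept(candidate):
            return candidate
    return None


def squares() -> Iterator[int]:
    """Generator: proposes 0, 1, 4, 9, ... with no knowledge of the goal."""
    return (n * n for n in itertools.count())


def is_three_digit_palindrome(value: int) -> bool:
    """Tester: accepts or rejects a result without knowing how it was made."""
    text = str(value)
    return len(text) == 3 and text == text[::-1]


result = generate_and_test(squares(), is_three_digit_palindrome)
print(result)  # 121
```

Either half can be swapped out on its own: a different generator (say, random proposals) works with the same tester, and a stricter tester works with the same generator.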


Resource description refers to defining concepts and relationships that represent the content and structure of some subject matter (ontology) or a database (schema) in a formal language. The relationship between ontology and database schema is nuanced – Uschold provides a nice comparison [1]. You can formally describe resources using the Resource Description Framework (RDF), SQL’s data definition language (DDL), etc.


In the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), a repository is a means of exposing metadata to harvesters. The OAI-PMH spec goes into great detail about how a data provider should implement a repository so that a harvester can simply be a client application that issues one of six possible HTTP requests.
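To make the "six possible HTTP requests" concrete, here is a rough sketch of how a harvester might build them. The six verb names and the `metadataPrefix=oai_dc` argument come from the OAI-PMH 2.0 spec; the base URL and the helper name `build_request_url` are assumptions for illustration, and the sketch only constructs URLs rather than sending requests.

```python
from urllib.parse import urlencode

# The six request types (verbs) defined by OAI-PMH 2.0.
OAI_PMH_VERBS = (
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
)

# Hypothetical repository endpoint, not a real service.
BASE_URL = "https://repository.example.org/oai"


def build_request_url(base_url: str, verb: str, **arguments: str) -> str:
    """Build the GET URL for one OAI-PMH request (hypothetical helper)."""
    if verb not in OAI_PMH_VERBS:
        raise ValueError(f"not an OAI-PMH verb: {verb!r}")
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


url = build_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc")
print(url)
# https://repository.example.org/oai?verb=ListRecords&metadataPrefix=oai_dc
```

A real harvester would also need to parse the XML responses and follow resumption tokens for long lists, which this sketch leaves out.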


Do your programs only compute pure functions of data, or do they also perform effects such as dynamically reading input, writing output, transitioning database state, making network requests, etc.? One sense of the term “logic” is as a general subject, i.e. the study of how to draw valid conclusions.
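As a small illustration of the distinction, the sketch below keeps the computation in a pure function and confines the effects (reading input, writing output) to a thin wrapper. The function names are hypothetical, and in-memory streams stand in for real files so that the example is self-contained.

```python
import io
from typing import TextIO


def word_count(text: str) -> int:
    """Pure function of data: same input, same output, no side effects."""
    return len(text.split())


def report_word_count(source: TextIO, sink: TextIO) -> None:
    """Effectful wrapper: reads input, calls the pure core, writes output."""
    sink.write(f"{word_count(source.read())}\n")


# In-memory streams stand in for a file or socket in this sketch.
source = io.StringIO("pure functions of data")
sink = io.StringIO()
report_word_count(source, sink)
print(sink.getvalue(), end="")  # 4
```

The pure part can be tested with plain values, while only the wrapper needs streams or other stand-ins for the outside world.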


Shared datasets often have column/field names that are ambiguous in their meaning, or contain identical/related concepts with different names, hindering reuse. This ambiguity happens regardless of the method of sharing – via files, web pages, or APIs. The traditional solution for this is to provide documentation.


When you have several different applications (e.g. to perform simulations and analyses) that each have their own data model, it’s typical for each to also maintain its own siloed data store. Then, in order to use all the applications in concert to complete a research project, or to support an ongoing research program, you need to run extract-transform-load (ETL) pipelines to sync the data.


I was reading about hidden costs of “packaged” software solutions – that is, using existing software to solve problems – and came across this sentence: [1] Huh? I typically do not distinguish development from implementation. What McComb is calling “implementation” I just call “installation”. Weird.


Earlier this week, I wrote that As luck would have it, the U.S. Department of Energy (DOE) posted a funding opportunity announcement (FOA) yesterday on Data Reduction for Science: There have been efforts for decades to identify and deal with this issue, with cute acronyms for relevant data like ROT (Redundant, Obsolete, and Trivial), WORN (Write Once Read Never), and WORSE (Write Once Read Seldom if Ever). However, the DOE FOA highlights that it


If you share data on the web as delimiter-separated values – that is, as spreadsheets – there is a world of power-ups available to you. The term “sidecar” is used for a functional addition. A motorcycle sidecar can carry things and people. A Kubernetes sidecar container has access to the namespace and storage volumes of its pod’s main container, and so supports auxiliary work.


In response to this note, a reader asked I re-read the Stonebraker whitepaper I had linked to in that note, and what it describes does not seem meaningfully different from data fusion. I propose that data unification is more conservative than data fusion – it stops short of the lossy reduction often required for decision-support systems.