Rogue Scholar

Natural Sciences, English

How Is Data Unification Different Than Data Fusion?

Published 12 April 2021

In response to this note, a reader asked how data unification is different than data fusion. I re-read the Stonebraker whitepaper I had linked to in that note, and what it describes does not seem meaningfully different than data fusion. I propose that data unification is more conservative than data fusion – it stops short of the lossy reduction often required for decision-support systems.

Natural Sciences, English

Tenets of (Scalable) Data Unification

https://doi.org/10.59350/yyte4-wym74

Published 9 April 2021

Author Donny Winston

Unification is a process of combining partial-information structures. First used in computing for theorem proving,¹ it is now widely used for type inference in programming-language compilers and in logic-programming systems. Data unification is described well in this whitepaper by Stonebraker.
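As a minimal sketch of "combining partial-information structures", the Python below merges two nested dictionaries and refuses to proceed when they disagree. The `unify` function and the sample records are hypothetical illustrations, not taken from the post or the whitepaper.

```python
def unify(a, b):
    """Combine two partial-information structures, or raise on conflict."""
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for key, value in b.items():
            # Shared keys must themselves unify; new keys are simply added.
            merged[key] = unify(merged[key], value) if key in merged else value
        return merged
    if a == b:
        return a
    raise ValueError(f"cannot unify {a!r} with {b!r}")


# Two partial descriptions of the same (hypothetical) sample.
left = {"name": "sample-1", "mass": {"value": 3}}
right = {"name": "sample-1", "mass": {"unit": "g"}}
unified = unify(left, right)
```

Nothing is discarded here: the result carries everything both inputs asserted, and a disagreement surfaces as an error rather than being resolved silently.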

Natural Sciences, English

Value Lattices for Data Collaboration

https://doi.org/10.59350/r5t81-yfq09

Published 7 April 2021

Author Donny Winston

Values like “3” are typically considered separately from constraints like “< 10” or types like “number”. What if these were all on equal footing as values in a hierarchy?
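One way to picture this, as a rough sketch only: treat the type, the constraint, and the concrete value as elements of one ordering, with a `meet` operation that returns the most general element at least as specific as both inputs. The class names, `BOTTOM`, and `meet` are assumptions made for illustration, not the post's design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Number:
    """The type 'number': any numeric value."""


@dataclass(frozen=True)
class LessThan:
    """The constraint '< bound'."""
    bound: float


@dataclass(frozen=True)
class Exactly:
    """A concrete value such as 3."""
    value: float


BOTTOM = None  # hypothetical marker: nothing satisfies both inputs


def meet(a, b):
    """Return the most general element that is as specific as both a and b."""
    if a is BOTTOM or b is BOTTOM:
        return BOTTOM
    if isinstance(a, Number):
        return b
    if isinstance(b, Number):
        return a
    if isinstance(a, LessThan) and isinstance(b, LessThan):
        return LessThan(min(a.bound, b.bound))
    if isinstance(a, Exactly) and isinstance(b, Exactly):
        return a if a.value == b.value else BOTTOM
    # One constraint and one concrete value remain.
    constraint, exact = (a, b) if isinstance(a, LessThan) else (b, a)
    return exact if exact.value < constraint.bound else BOTTOM
```

Under these assumptions, `meet(Number(), LessThan(10))` narrows to the constraint, `meet(LessThan(10), Exactly(3))` narrows to the value, and `meet(LessThan(10), Exactly(12))` yields `BOTTOM`.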

Natural Sciences, English

Reusable Data Attributes

https://doi.org/10.59350/w3c1y-94169

Published 5 April 2021

Author Donny Winston

Do you repeatedly define the same field/attribute across different classes / entity types? For example, you may have many different entities with an “id”, a “name”, etc. When an attribute “belongs to” an entity, you need to repeatedly register specifications for each (re)definition: it’s a string, it needs to pass these tests to be considered valid, etc. What if attributes were top-level?
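A minimal sketch of what top-level attributes could look like: each attribute's specification is registered once, and entity types only list the attributes they use. `ATTRIBUTES`, `ENTITIES`, `validate`, and the particular checks are all hypothetical, chosen only to illustrate the idea.

```python
# Each attribute is specified once, independent of any entity type.
ATTRIBUTES = {
    "id": {"type": str, "check": lambda v: v != ""},
    "name": {"type": str, "check": lambda v: v.strip() == v},
}

# Entity types reference attributes by name instead of redefining them.
ENTITIES = {
    "Sample": ["id", "name"],
    "Instrument": ["id", "name"],
}


def validate(entity_type, record):
    """Return the names of attributes that are missing or invalid."""
    errors = []
    for attr in ENTITIES[entity_type]:
        spec = ATTRIBUTES[attr]
        value = record.get(attr)
        if not isinstance(value, spec["type"]) or not spec["check"](value):
            errors.append(attr)
    return errors
```

With this layout, tightening the rule for "id" is a single change that applies to every entity type that uses it.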

Natural Sciences, English

FAIR Components of Scientific Models

https://doi.org/10.59350/s84ka-rmb41

Published 2 April 2021

Author Donny Winston

Consider “basic theories” that are particularly simple in two ways. First, they describe selected aspects of material objects, abstracting from all other properties – homogeneous samples, thermally isolated containers, points, rigid solids, infinitely thin layers, etc. Second, they provide particularly simple expressions and means of combination for their simple objects.

Natural Sciences, English

The QUDT System for Dimensional Analysis and Unit Conversions

https://doi.org/10.59350/8vd4h-sms37

Published 31 March 2021

Author Donny Winston

To integrate quantitative data, you need to know (a) whether the units are commensurate and (b) if so, how to convert between them. The Quantities, Units, Dimensions, and Types (QUDT) ontology serves three major purposes. First, it provides a global reference for units via URIs; this helps avoid tacit conventions that are prone to misinterpretation. Second, it provides for dimensional analysis via so-called “quantity kind” dimensional vectors;
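To make the two questions concrete, here is a small sketch of a commensurability check based on dimension vectors, followed by a conversion. It does not use QUDT's actual vocabulary or URIs; the `UNITS` table, its keys, and the helper functions are assumptions for illustration only.

```python
# Hypothetical unit table: exponents over (length, mass, time),
# plus a multiplier that converts the unit to its SI base equivalent.
UNITS = {
    "m": {"dims": (1, 0, 0), "to_si": 1.0},
    "km": {"dims": (1, 0, 0), "to_si": 1000.0},
    "s": {"dims": (0, 0, 1), "to_si": 1.0},
}


def commensurate(a, b):
    """Two units are commensurate when their dimension vectors match."""
    return UNITS[a]["dims"] == UNITS[b]["dims"]


def convert(value, src, dst):
    """Convert value from src to dst, refusing incommensurate units."""
    if not commensurate(src, dst):
        raise ValueError(f"{src} and {dst} are not commensurate")
    return value * UNITS[src]["to_si"] / UNITS[dst]["to_si"]
```

In this sketch, `convert(2.5, "km", "m")` goes through the shared SI multiplier, while `convert(1, "km", "s")` is rejected before any arithmetic happens.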

Natural Sciences, English

Go-to-Declaration for Data

https://doi.org/10.59350/eqz3q-fg638

Published 30 March 2021

Author Donny Winston

One of my favorite features of the PyCharm code editor is go-to-declaration: you can hold the control key and hover your mouse over a usage of a symbol, and you’ll see a tooltip with a preview of the declaration/definition of the symbol. Click it, and you’ll jump to the definition, perhaps in another file. After you’ve reviewed the definition, a keyboard shortcut gets you back to the usage point.

Natural Sciences, English

A Continuant-Event Model for Recording Scientific Data

https://doi.org/10.59350/svmhc-40j47

Published 26 March 2021

Author Donny Winston

The RDF data model is quite flexible: Anybody can say Anything about Any topic (aka the “AAA slogan”). However, I recommend – and describe here – a particular modeling strategy when it comes to entering new facts about research activities into a data management system. Once entered this way, workflows may add additional derived facts to suit the needs of downstream applications.

Natural Sciences, English

Data Collaboration Is Hard Like Distributed Computing Is Hard

https://doi.org/10.59350/8ta9e-4jg92

Published 24 March 2021

Author Donny Winston

It is hard to mediate among concrete representations, that is, among data structures with differing schemas. There are certainly valiant efforts to replicate shared data structures without conflict and to facilitate distributed schema evolution, i.e., to sync “under the hood” concerns.

Natural Sciences, English

A CSV File That Knows Its Schema and Context

https://doi.org/10.59350/fws6z-f1c79

Published 23 March 2021

Author Donny Winston

Have you ever given or gotten data as CSV? Are the meanings of the columns always clear? How are they made clear? Are the given column labels/names and the given file/sheet names always enough? If additional information beyond the CSV file is needed, how is that facilitated? A separate README file that travels with the CSV as part of a zipped archive file?

Natural Sciences, English

A JSON File That Knows Its Schema and Context

https://doi.org/10.59350/rbjh3-e8t46

Published 19 March 2021

Author Donny Winston

If you provide JSON, either as files or as API responses, you might be one step away from ensuring that anyone encountering that JSON has a portal to what it means. This step is to provide a single extra key-value pair in each JSON document – the key is “@context”, and the value is a URL. JSON-LD is “a JSON-based format to serialize Linked Data.
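The extra key-value pair can be sketched in a few lines of Python. The `with_context` helper, the sample document, and the `https://example.org/context.jsonld` URL are placeholders assumed for illustration; a real context URL would point at a document that defines the keys.

```python
import json


def with_context(document, context_url):
    """Return a copy of the JSON document with an "@context" key added first."""
    return {"@context": context_url, **document}


# Placeholder document and context URL, for illustration only.
doc = with_context({"name": "sample-1"}, "https://example.org/context.jsonld")
print(json.dumps(doc, indent=2))
```

The original keys are left untouched, so existing consumers that ignore "@context" keep working as before.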