
Depth-First

Recent content on Depth-First
Chemical Sciences
Published

The CTfile family includes the widely-used serialization formats Molfile and SDfile. Users rarely need to consider the many details of these formats because they’re handled by software. But data corruption can result when one utility misreads the output of another. The problem is especially hard to diagnose when the root cause lies close to the base layer.

Chemical Sciences
Published

The valence bond (VB) model is a central abstraction in cheminformatics, underpinning everything from serialization formats to toolkit data structures. Like any approximation, the VB model makes certain simplifying assumptions. When one fails, trouble isn’t far behind. This article describes the VB model’s weakest assumption, and a conservative approach to avoiding the fallout from its near-certain invalidation.

Chemical Sciences
Published

Stereoisomerism is an essential component of molecular structure. But chemistry’s ability to devise new forms of stereoisomerism long ago outstripped the ability of mainstream cheminformatics tools to represent them. Anything more complicated than tetrahedral or alkene configuration often presents a challenge. This problem will ultimately be solved by generalized approaches to stereochemical representation.

Chemical Sciences
Published

InChI is widely viewed as merely an identifier, not a serialization format, but a recent article challenged that notion. Although the idea may be discouraged in some quarters, it is possible to use InChI to read and write molecular structure. Tools for reliably doing so could lead to new applications. One missing component is a formal description of InChI syntax. This article takes the first few steps toward its creation.

Chemical Sciences
Published

InChI is a molecular identifier developed by IUPAC. A previous article discussed the possibility of using InChI as something else: a molecular serialization format. Rather than treating InChI strings as mere hash keys, what if InChI could be used to encode and decode molecular structures in the same way as the SMILES or CTfile formats? Doing so requires a way to map elements to their respective atoms. This article describes a way to do that.

Chemical Sciences
Published

The CTfile (aka “molfile”) format can be found throughout cheminformatics and computational chemistry. Two versions are available: V2000 and V3000 (aka “V2K” and “V3K,” respectively). The former is widely supported but lacks certain features, which can make it difficult to use in modern settings. The latter supports these features but lacks widespread, robust tooling, due in part to its more complex syntax and semantics.

Chemical Sciences
Published

When molecular structures are encoded and decoded as short strings, SMILES is more likely than not to be used. In the 34 years since Weininger’s landmark publication, SMILES has been widely adopted by both software vendors and database maintainers. But despite its crucial importance to the fields of both chemistry and cheminformatics, SMILES remains underspecified. This article outlines the problem and describes a solution.

Chemical Sciences
Published

SMILES is a de facto standard for chemical data exchange. It’s routinely found in public-facing databases, supported by most widely-used cheminformatics toolkits, and for the last few years has even appeared in the context of machine learning. The problem is that the language was never completely specified. The claim that one of chemistry’s most important data formats is incompletely specified may seem extraordinary.

Chemical Sciences
Published

SMILES is a de facto standard for chemical information exchange. Although there may be broad agreement on the technical underpinnings of SMILES, many important details have been left to individual interpretation. This lack of specificity offers a shaky foundation on which to build desperately needed data standardization efforts. This article describes the problem and offers one possible path forward.

Chemical Sciences
Published

The CTfile specification defines a suite of popular cheminformatics serialization formats including SDfile, Molfile, and RGfile. CTfile currently comes in two varieties: V2000 and its successor, V3000 (aka “V3K”). V3K may not be as widely used as its older sibling, but many of its features are unique. Even so, V3K is at least as complex as what came before. Chalk this up to pairing those new features with partial backward compatibility.

Chemical Sciences
Published

Molecular identifiers, also known as “chemical names,” underpin modern chemistry. A recent paper introduced TUCAN, a new molecular identifier. As noted in my overview, TUCAN could one day play a role in molecular identification similar to that of canonical SMILES and IUPAC nomenclature. An important step along the way is canonicalization, or the selection of one representation out of the many possible for a given molecule.