Depth-First

Recent content on Depth-First
Chemical Sciences
Published

Life-changing events are easy to miss. This is especially true for the ones that come from nowhere. They don’t fit into any of the compartments we make to hold the things that experience chucks at us, and so they just disappear. Only on reflection do these otherwise forgettable episodes take on the landmark status we remember them for.

Warning Shot

An event like this happened to me recently.

Changes to the state of a value can set the stage for later errors. Sometimes these errors occur during subsequent method calls. Other times, the method call itself is the source of the error. Good type systems can make it possible to move these kinds of state bugs from run time to compile time. Indeed, the typestate pattern is often cited as a solution, especially for Rust.
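
As a rough illustration of the idea, here is a minimal typestate sketch in Rust. The `Connection`, `Closed`, and `Open` names are hypothetical and chosen only for this example; they do not come from the article. The state lives in a type parameter, so calling `send` on a closed connection is rejected by the compiler rather than failing at run time.

```rust
use std::marker::PhantomData;

// Hypothetical marker types, one per state. They are never instantiated.
struct Closed;
struct Open;

// Hypothetical resource whose current state is encoded in its type parameter.
struct Connection<State> {
    address: String,
    _state: PhantomData<State>,
}

// Methods available in every state.
impl<State> Connection<State> {
    fn address(&self) -> &str {
        &self.address
    }
}

impl Connection<Closed> {
    fn new(address: &str) -> Self {
        Connection { address: address.to_string(), _state: PhantomData }
    }

    // Consumes the closed connection, so the old value cannot be reused.
    fn open(self) -> Connection<Open> {
        Connection { address: self.address, _state: PhantomData }
    }
}

impl Connection<Open> {
    // Only defined for the Open state.
    fn send(&self, message: &str) -> String {
        format!("{} <- {}", self.address, message)
    }

    // Consumes the open connection and returns it to the Closed state.
    fn close(self) -> Connection<Closed> {
        Connection { address: self.address, _state: PhantomData }
    }
}

// Small walk through the allowed state transitions.
fn demo() -> String {
    let closed = Connection::new("localhost");
    // closed.send("hello"); // would not compile: no `send` on Connection<Closed>
    let open = closed.open();
    let sent = open.send("hello");
    let closed_again = open.close();
    format!("{} (now closed: {})", sent, closed_again.address())
}
```

Because `open` and `close` take `self` by value, a stale handle in the wrong state cannot be used after a transition.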

The valence bond model (“VB model”) remains widely used in chemistry, and serves as the basis for most molecular representation schemes in cheminformatics. Useful though it may be, the VB model carries some important liabilities. One relates to electron delocalization of the type found in benzene and its analogs. Here, the VB model’s exclusive focus on two-atom bonding leads to asymmetry artifacts.

Enumeration of all cycles is a useful graph operation. This is especially true in cheminformatics, where the set of all cycles is required in tasks ranging from 2D structure layout to electronic characterization and descriptor calculation. Although restricted cycle sets are suitable for some applications, others require exhaustive enumeration. This article introduces a Rust crate for doing that.
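
To make "exhaustive enumeration" concrete, the sketch below lists every simple cycle of a small undirected graph using a plain depth-first search. It is not the crate introduced in the article, and the function names `all_cycles` and `extend` are hypothetical. It assumes a simple graph (no self-loops or parallel edges) stored as an adjacency list over node indices, and it makes no attempt at efficiency.

```rust
// Enumerates every simple cycle of an undirected simple graph.
// Each cycle is reported once, starting at its lowest-numbered node.
fn all_cycles(adjacency: &[Vec<usize>]) -> Vec<Vec<usize>> {
    let mut cycles = Vec::new();
    let mut path = Vec::new();
    let mut on_path = vec![false; adjacency.len()];
    for start in 0..adjacency.len() {
        path.push(start);
        on_path[start] = true;
        extend(adjacency, start, &mut path, &mut on_path, &mut cycles);
        on_path[start] = false;
        path.pop();
    }
    cycles
}

// Grows the current path one node at a time, recording a cycle whenever
// an edge leads back to the start node.
fn extend(
    adjacency: &[Vec<usize>],
    start: usize,
    path: &mut Vec<usize>,
    on_path: &mut [bool],
    cycles: &mut Vec<Vec<usize>>,
) {
    let current = *path.last().expect("path always holds the start node");
    for &next in &adjacency[current] {
        if next == start {
            // A cycle needs at least three nodes. Comparing the two
            // neighbors of the start node keeps one traversal direction.
            if path.len() >= 3 && path[1] < current {
                cycles.push(path.to_vec());
            }
        } else if next > start && !on_path[next] {
            // Only visit nodes numbered above the start node, so each
            // cycle is found from its lowest-numbered node alone.
            path.push(next);
            on_path[next] = true;
            extend(adjacency, start, path, on_path, cycles);
            on_path[next] = false;
            path.pop();
        }
    }
}
```

For a four-membered ring with one diagonal, this reports the two triangles and the outer square. The number of cycles can grow exponentially with graph size, which is one reason restricted cycle sets are often preferred when they suffice.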

Two versions of the widely-used molfile format exist: V2000 and V3000. As noted previously, the V3000 format introduces several capabilities not present in its predecessor. One of them allows the arbitrary grouping and tagging of certain kinds of features. The documentation hints at this grouping and tagging as a V3000 extension mechanism, but how would that work in practice? This article takes a closer look at the question.

If ever there was a perennial problem in cheminformatics, it would be tautomerism. Dealing with it consistently and in a chemically relevant way is much harder than appearances might suggest. The problem is multi-faceted, requiring not just a rigorous definition of a slippery concept, but methods to pinpoint molecular features leading to tautomerism.

The comparison of molecules for equivalence is a computationally complex process whose efficiency can be improved through canonicalization. Canonicalization deterministically chooses one molecular representation from all candidates. The technique applies whenever collections of unique molecules are assembled or generated, which turns out to be a lot of situations.
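
The core idea, picking one representation deterministically from all equivalent candidates, can be shown with a toy example. The sketch below is not a molecular canonicalization algorithm and does not come from the article. It treats a ring as a cyclic sequence of atom labels and picks the lexicographically smallest of its rotations and reflections. The function name `canonical_ring` is hypothetical.

```rust
// Toy canonicalization: a ring written from any starting atom, in either
// direction, maps to the same lexicographically smallest sequence.
fn canonical_ring(ring: &[&str]) -> Vec<String> {
    let n = ring.len();
    let reversed: Vec<&str> = ring.iter().rev().copied().collect();

    // Generate every candidate: all rotations of both traversal directions.
    let mut candidates: Vec<Vec<&str>> = Vec::new();
    for source in [ring, reversed.as_slice()] {
        for shift in 0..n {
            candidates.push((0..n).map(|i| source[(shift + i) % n]).collect());
        }
    }

    // Deterministically choose one candidate: the smallest in lexicographic order.
    candidates
        .into_iter()
        .min()
        .unwrap_or_default()
        .into_iter()
        .map(String::from)
        .collect()
}
```

Two inputs describe the same ring exactly when their canonical forms are equal, so an equivalence test reduces to a plain comparison, and a set of unique rings can be kept by hashing the canonical form.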

SMILES is a compact molecular serialization format used widely in cheminformatics and computational chemistry. Even so, the published documentation incompletely describes SMILES, making the implementation of software impossible without some degree of improvisation. Balsa is a fully-specified language subset created to solve this problem. To bridge theory and practice, work on a Balsa reference implementation is well underway.

Several molecular serialization formats are used in cheminformatics. Examples include SMILES, molfile, CDXML, and InChI, among others. Each format has a preferred context, making translation a perennial problem. To avoid generating molecular garbage, translations must occur cleanly, neither adding artifacts nor dropping features unnecessarily. This article offers some perspective on two approaches to this problem.

CTfile is a widely-used family of file formats in cheminformatics and computational chemistry. CTfiles are most commonly processed through a cheminformatics toolkit. But sometimes that kind of power is overkill. You might, for example, want to pull out just certain pieces of information from a file without the overhead of building high-level data structures. In other situations, a toolkit does too little.

The CTfile family includes the widely-used serialization formats Molfile and SDfile. Users rarely need to consider the many details of these formats because they’re handled by software. But data corruption can result when one utility misreads the output of another. The problem is especially hard to diagnose when the root cause lies close to the base layer.