Rogue Scholar

Chemical Sciences

TUCAN Canonicalization Revisited

Published May 4, 2022

Molecular identifiers, also known as “chemical names,” underpin modern chemistry. A recent paper introduced TUCAN, a new molecular identifier. As noted in my overview, TUCAN could one day play a role in molecular identification similar to that of canonical SMILES and IUPAC nomenclature. An important step along the way is canonicalization: the selection of one representation out of the many possible for a given molecule.
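To make the idea of canonicalization concrete, here is a minimal sketch in Python. It is not TUCAN's algorithm. It assumes a molecule reduced to a list of element labels plus a list of bonds, tries every atom ordering by brute force, and keeps the smallest serialization. The function name `canonical_form` and the tuple-based serialization are hypothetical choices made only for illustration, and the approach is practical only for very small graphs.

```python
from itertools import permutations


def canonical_form(labels, edges):
    """Return the smallest (labels, edges) serialization over all atom orderings.

    Illustrative brute force only: cost grows factorially with atom count.
    """
    best = None
    for order in permutations(range(len(labels))):
        # order[new] = old, so invert it to map old indices to new ones
        new_index = {old: new for new, old in enumerate(order)}
        relabeled = tuple(labels[old] for old in order)
        renumbered = tuple(sorted(
            tuple(sorted((new_index[a], new_index[b]))) for a, b in edges
        ))
        candidate = (relabeled, renumbered)
        if best is None or candidate < best:
            best = candidate
    return best


# Two different numberings of the same heavy-atom graph, C-C-O
first = canonical_form(["C", "C", "O"], [(0, 1), (1, 2)])
second = canonical_form(["O", "C", "C"], [(0, 1), (1, 2)])
print(first == second)
```

Both numberings collapse to the same serialization, which is the property a canonical identifier needs. Real canonicalization schemes avoid the factorial search by pruning candidate orderings with graph invariants.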

Chemical Sciences

TUCAN Canonicalization

https://doi.org/10.59350/waf5k-gea58

Published April 20, 2022

Author Richard L. Apodaca

Note: This article requires revisions on one or more important points. In particular, the algorithm does not in its current form enumerate every candidate indexing as claimed. See the revised article for details.

Molecular identifiers (or “chemical names”) are everywhere in chemistry. A recent post discussed a new kind of molecular identifier called TUCAN.

Chemical Sciences

Molecular Identification with TUCAN

https://doi.org/10.59350/ebrmb-kce84

Published April 6, 2022

Author Richard L. Apodaca

It’s hard to overstate the importance of molecular identifiers to chemistry. Also called “chemical names,” molecular identifiers enable individuals, laboratories, organizations, and countries to efficiently exchange information about molecules. Given this foundational role, you might expect to find a well-organized and lavishly funded effort to develop and improve molecular identifiers. This is, of course, not the case.

Chemical Sciences

An Introduction to DataWarrior

https://doi.org/10.59350/megqg-9wh40

Published March 23, 2022

Author Richard L. Apodaca

There’s a rather large category of important chemistry software that doesn’t get a lot of attention in journal articles. It inhabits the twilight zone between chemist and programmer. I don’t mean “chemist” in the sense of cheminformatician or computational chemist. I mean “chemist” in the sense of a trained experimentalist who gathers and uses experimental data.

Chemical Sciences

Python Extensions in Pure Rust with Rust-CPython

https://doi.org/10.59350/q5s8v-r1905

Published March 9, 2022

Author Richard L. Apodaca

Python’s many advantages come at a cost: execution speed on the “traditional” runtime lags that of other languages by a considerable margin. Python’s solution is to expose the runtime to more efficient extensions written in C and C++. As noted previously here, Python extensions can also be written in pure Rust through PyO3. But some projects call for greater control.

Chemical Sciences

Big Reaction Data

https://doi.org/10.59350/gaydb-ysj11

Published February 23, 2022

Author Richard L. Apodaca

Broad access to chemical information remains a largely unrealized goal. Back in 2006 the second post on this blog noted the recent introduction of PubChem and ZINC as game-changing developments. But despite progress in the creation of open structure collections, repositories linking molecular structure with properties are much less well-developed. An important frontier in this area is open reaction data.

Chemical Sciences

V3000 Molfile Enhanced Stereochemistry Representation

https://doi.org/10.59350/xc3sd-qkm25

Published February 9, 2022

Author Richard L. Apodaca

Stereoisomerism plays a crucial role in the science and technology of chemistry, but this is a relatively new development. Analytical and synthetic techniques have not yet advanced to the stage that allows configuration to be assigned with the same ease as other aspects of molecular structure. Depending on the context, it’s still not unusual for configuration to remain partially or completely unknown indefinitely.

Chemical Sciences

Graphs from Scratch in Python

https://doi.org/10.59350/pmsv7-0fw31

Published January 26, 2022

Author Richard L. Apodaca

Graphs are central to many areas of programming, so it’s not surprising to find many general-purpose graph libraries. But these ready-made solutions sometimes lack the focus needed to solve a specific problem well. Having hit this problem several times, I recently proposed a solution in the form of a minimal graph API with a Rust implementation.

Chemical Sciences

Penny Codes

https://doi.org/10.59350/ttpdm-t1n57

Published January 12, 2022

Author Richard L. Apodaca

A fingerprint is a molecular representation that omits certain kinds of structural information with the goal of increasing computational speed. The success of this approach is evidenced by numerous modern applications ranging from structure search to property prediction. A good fingerprint trades just enough structural information to achieve the desired computational goal, so flexibility matters.
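As a rough illustration of the trade-off, the sketch below builds a generic hashed bit fingerprint in Python. It is not the Penny Code scheme. It assumes each molecule has already been reduced to a set of feature strings (the strings shown are made up), folds them into a fixed-width integer bitmask with `zlib.crc32`, and compares two fingerprints with the Tanimoto coefficient. The names `fingerprint` and `tanimoto` and the 64-bit width are hypothetical choices.

```python
import zlib


def fingerprint(features, n_bits=64):
    """Fold feature strings into an n_bits-wide integer bitmask."""
    bits = 0
    for feature in features:
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        bits |= 1 << (zlib.crc32(feature.encode("utf-8")) % n_bits)
    return bits


def tanimoto(a, b):
    """Shared set bits divided by total set bits (1.0 when both are empty)."""
    union = bin(a | b).count("1")
    return bin(a & b).count("1") / union if union else 1.0


# Hypothetical feature strings standing in for structural fragments
ethanol = fingerprint(["C-C", "C-O", "O-H"])
methanol = fingerprint(["C-O", "O-H"])
print(tanimoto(ethanol, methanol))
```

Comparing two bitmasks takes a handful of integer operations regardless of molecule size, which is where the speed comes from. The price is that distinct features can collide on the same bit, so a narrower fingerprint is faster to store and compare but discards more structural information.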

Chemical Sciences

Stereochemistry and the V2000 Molfile Format

https://doi.org/10.59350/5vf94-5bx41

Published December 29, 2021

Author Richard L. Apodaca

A previous article offered some reasons to adopt the V3000 molfile format. Although there are several, the one that gets the most attention is “enhanced stereochemistry” support. It should come as no surprise that the cost of this enhancement is increased complexity. Fortunately, V3000’s stereochemistry model extends the one used in V2000. Unfortunately, the V2000 stereochemistry model is not exactly simple.

Chemical Sciences

A Beginner's Guide to Parsing in Rust

https://doi.org/10.59350/c5psp-e6j72

Published December 16, 2021

Author Richard L. Apodaca

Parsers are crucial for many data processing tasks. Contrary to what appearances might imply, writing a parser from scratch is not difficult given the right starting point. This article presents a flexible system for writing custom parsers for a wide range of languages. It assumes some experience with Rust, but no experience with language theory. More experienced readers might want to skip directly to the Lyn crate.

Depth-First
