IUPAC name to SMILES: how the conversion works
Turning a systematic IUPAC name into a SMILES string is a parsing problem with one correct answer. Here is how OPSIN does it and how to verify the result.
What is an IUPAC-name-to-SMILES conversion?
An IUPAC-name-to-SMILES conversion translates a systematic chemical name (for example, 2-acetoxybenzoic acid) into a SMILES string (CC(=O)Oc1ccccc1C(=O)O), a compact text encoding of the molecule's atoms and bonds. Because both formats describe one exact structure, the conversion has a single correct answer: one molecule, even though that molecule can be written as several equivalent SMILES strings. It is a parsing problem, not a prediction problem.
What SMILES encodes
SMILES (Simplified Molecular Input Line Entry System) writes a molecule as a line of text: atoms as element symbols, bonds as symbols such as = (double) and # (triple) with single bonds usually left implicit, rings as matched digits, and branches in parentheses. c1ccccc1 is benzene; lowercase letters mark aromatic atoms. Its compactness is why databases, machine-learning models, and chemistry APIs all speak SMILES.
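To make the notation concrete, here is a minimal sketch of a SMILES tokenizer that labels each piece of a string as an atom, bond, ring closure, or branch. It is an illustration only: it covers the common organic subset and treats bracket atoms as opaque, and the names tokenize_smiles and TOKEN_KINDS are made up for this example rather than taken from any library.

```python
import re

# Token patterns for a simplified subset of SMILES (illustrative, not complete).
TOKEN_KINDS = [
    ("bracket_atom", r"\[[^\]]+\]"),            # e.g. [Na+] or [C@@H]
    ("atom", r"Cl|Br|[BCNOPSFI]|[bcnops]"),     # lowercase symbols are aromatic
    ("bond", r"[-=#:/\\]"),                     # single bonds are usually implicit
    ("ring_closure", r"%\d{2}|\d"),             # matched digits open and close a ring
    ("branch", r"[()]"),
    ("dot", r"\."),                             # separates disconnected components
]
SMILES_TOKEN = re.compile(
    "|".join(f"(?P<{kind}>{pattern})" for kind, pattern in TOKEN_KINDS)
)


def tokenize_smiles(smiles):
    """Split a SMILES string into (kind, text) pairs, left to right."""
    tokens, pos = [], 0
    while pos < len(smiles):
        match = SMILES_TOKEN.match(smiles, pos)
        if match is None:
            raise ValueError(f"unexpected character {smiles[pos]!r} at index {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


if __name__ == "__main__":
    for kind, text in tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"):
        print(f"{kind:13s} {text}")
```

Tokenizing only shows how the string is put together; it does not check valence or confirm that every ring digit is matched, which a real SMILES parser has to do.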
How OPSIN parses a name
- Tokenize: Split the name into recognized morphemes — prefixes, parent hydride, suffixes, and locants.
- Parse the grammar: Apply IUPAC nomenclature rules as a formal grammar to assemble a parse tree of substituents and the parent chain.
- Build the structure: Convert the parse tree into an atom-and-bond graph, placing substituents at their numbered positions.
- Emit SMILES: Serialize the graph to a SMILES (or InChI) string.
OPSIN (Open Parser for Systematic IUPAC Nomenclature) performs these steps deterministically: the same name always yields the same structure.
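The four steps can be sketched on a toy scale. The code below is not OPSIN's implementation and does not use its API. It is a hypothetical miniature that handles only unbranched alkanes with a few single-atom groups (chloro, bromo, methyl, -ol), and every name in it (STEMS, tokenize, parse, build, emit_smiles, name_to_smiles) is invented for illustration.

```python
import re

# Hypothetical mini-vocabulary; a real nomenclature grammar is vastly larger.
STEMS = {"meth": 1, "eth": 2, "prop": 3, "but": 4, "pent": 5, "hex": 6}
PREFIXES = {"chloro": "Cl", "bromo": "Br", "methyl": "C"}
SUFFIXES = {"ol": "O"}
SATURATION = {"an": None, "ane": None}
VOCAB = [("prefix", PREFIXES), ("stem", STEMS),
         ("suffix", SUFFIXES), ("saturation", SATURATION)]


def tokenize(name):
    """Step 1: split the name into (kind, text) tokens, longest match first."""
    tokens, pos = [], 0
    name = name.lower()
    while pos < len(name):
        if name[pos] == "-":
            pos += 1
            continue
        digits = re.match(r"\d+", name[pos:])
        if digits:
            tokens.append(("locant", digits.group()))
            pos += digits.end()
            continue
        candidates = [(kind, word) for kind, table in VOCAB
                      for word in table if name.startswith(word, pos)]
        if not candidates:
            raise ValueError(f"unrecognized text: {name[pos:]!r}")
        kind, word = max(candidates, key=lambda pair: len(pair[1]))
        tokens.append((kind, word))
        pos += len(word)
    return tokens


def parse(tokens):
    """Step 2: find the parent chain and attach each locant to the next group."""
    chain_length, groups, pending_locant = None, [], None
    for kind, text in tokens:
        if kind == "locant":
            pending_locant = int(text)
        elif kind == "stem":
            chain_length = STEMS[text]
        elif kind in ("prefix", "suffix"):
            atom = PREFIXES[text] if kind == "prefix" else SUFFIXES[text]
            groups.append((pending_locant or 1, atom))  # no locant: assume position 1
            pending_locant = None
    if chain_length is None:
        raise ValueError("no parent chain found")
    return chain_length, groups


def build(chain_length, groups):
    """Step 3: place each substituent on its numbered carbon."""
    substituents = [[] for _ in range(chain_length)]
    for locant, atom in groups:
        if not 1 <= locant <= chain_length:
            raise ValueError(f"locant {locant} is outside the chain")
        substituents[locant - 1].append(atom)
    return substituents


def emit_smiles(substituents):
    """Step 4: write the chain left to right, substituents as branches."""
    parts = []
    for index, groups in enumerate(substituents):
        is_last = index == len(substituents) - 1
        branches = groups[:-1] if is_last and groups else groups
        parts.append("C" + "".join(f"({atom})" for atom in branches))
        if is_last and groups:
            parts.append(groups[-1])  # the final group needs no parentheses
    return "".join(parts)


def name_to_smiles(name):
    return emit_smiles(build(*parse(tokenize(name))))


if __name__ == "__main__":
    for example in ("propan-2-ol", "2-chloropropane", "2-methylpropan-2-ol"):
        print(example, "->", name_to_smiles(example))
```

The output is valid but not canonicalized SMILES, and anything outside the tiny vocabulary (benzene, for instance) raises a ValueError instead of being guessed at. Failing loudly on an unparseable name, rather than producing a plausible string, is the behavior that makes a parser-based approach checkable.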
How to verify the conversion is correct
The reliable check is a round-trip: take the SMILES output, generate a name or canonical identifier from it, and confirm it matches the input. Cheemly's Critic Gate does exactly this — it parses the SMILES with RDKit and runs an OPSIN name-to-structure round-trip, so a mistranslated name is caught before the answer reaches you. Generic LLMs skip this step, which is why they confidently return SMILES that do not match the named compound.
Common pitfalls
- Ambiguous or non-standard names: Trivial or outdated names may not parse; use the current IUPAC name.
- Stereochemistry: E/Z and R/S descriptors must survive the conversion; verify chirality is preserved in the output.
- Salts and mixtures: Multi-component names map to dot-separated SMILES ([Na+].[Cl-]); confirm every component is present.
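The stereochemistry and multi-component pitfalls lend themselves to a quick surface check before any deeper validation. The sketch below is a hypothetical helper (check_output and stereo_descriptors are illustrative names) that only looks for the presence of stereo markers and counts dot-separated components. It cannot tell whether a centre has the right configuration, so it complements a structure-level round-trip and does not replace one.

```python
import re

# Matches descriptor groups such as (2R,3S), (E) or (2E,4Z) in a name.
_DESCRIPTOR_GROUP = re.compile(r"\((\d*[RSEZ](?:,\d*[RSEZ])*)\)")


def stereo_descriptors(name):
    """Collect the R/S/E/Z letters found in parenthesised descriptor groups."""
    letters = []
    for group in _DESCRIPTOR_GROUP.findall(name):
        letters.extend(re.findall(r"[RSEZ]", group))
    return letters


def check_output(name, smiles, expected_components=1):
    """Return a list of warnings; an empty list means no surface problem was seen."""
    problems = []
    descriptors = stereo_descriptors(name)
    if any(d in "RS" for d in descriptors) and "@" not in smiles:
        problems.append("name has R/S descriptors but the SMILES has no @ centre")
    if any(d in "EZ" for d in descriptors) and not ("/" in smiles or "\\" in smiles):
        problems.append("name has E/Z descriptors but the SMILES has no / or \\ bond")
    components = [part for part in smiles.split(".") if part]
    if len(components) != expected_components:
        problems.append(
            f"expected {expected_components} component(s), found {len(components)}"
        )
    return problems


if __name__ == "__main__":
    print(check_output("(2S)-2-aminopropanoic acid", "CC(N)C(=O)O"))
    print(check_output("(E)-but-2-ene", "C/C=C/C"))
    print(check_output("sodium chloride", "[Na+].[Cl-]", expected_components=2))
```

The expected component count is passed in by the caller here because inferring it from a name (counter-ions, hydrates, solvates) is itself a parsing task.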
Frequently asked questions
- Is converting an IUPAC name to SMILES deterministic?
- Yes. A correct systematic IUPAC name describes exactly one structure, so a proper parser like OPSIN always returns the same SMILES. It is a parsing task with a single right answer, unlike a probabilistic LLM guess.
- What tool converts IUPAC names to SMILES?
- OPSIN, an open-source parser, is the standard tool for converting systematic IUPAC names to SMILES or InChI. Cheemly wraps OPSIN and verifies the output with RDKit before returning it.
- Why do AI chatbots get SMILES wrong?
- General-purpose LLMs generate SMILES probabilistically and skip verification, so they can return a valid-looking string that encodes the wrong molecule. Deterministic parsing plus a round-trip check eliminates this class of error.