Dante Smiles: SMILES Validation, Canonicalization, and Property Calculation
#worksona #atomic47 #pharmaceutical #cheminformatics #rdkit #smiles #validation
David Olsson
Dante Smiles is the validation and enrichment layer at the end of the Dante cheminformatics pipeline. It accepts SMILES strings (from Dante Decimer, from a CSV import, or from any other source), validates them using RDKit, canonicalizes them to a deterministic standard form, calculates a core set of molecular properties, and supports format conversion between SMILES, InChI, and MOL. Batch processing handles thousands of structures in a single run. Invalid structures are flagged with structured error records rather than silently discarded.
The service operates on a simple contract: every structure that passes through it either emerges as a fully enriched, canonicalized compound record, or is returned as an explicit error with a machine-readable failure reason.
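As a rough illustration of the format conversion mentioned above, the sketch below turns one SMILES string into canonical SMILES, InChI, and a MOL block, and reads a MOL block back. It assumes RDKit is installed with InChI support; the helper names `to_formats` and `from_molblock` are hypothetical and are not part of Dante Smiles.

```python
from typing import Dict, Optional

from rdkit import Chem
from rdkit.Chem import inchi


def to_formats(smiles: str) -> Optional[Dict[str, str]]:
    """Return canonical SMILES, InChI, and a MOL block, or None if parsing fails."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return {
        "smiles": Chem.MolToSmiles(mol),
        "inchi": inchi.MolToInchi(mol),
        "molblock": Chem.MolToMolBlock(mol),
    }


def from_molblock(molblock: str) -> Optional[str]:
    """Read a MOL block and return its canonical SMILES, or None if it is unreadable."""
    mol = Chem.MolFromMolBlock(molblock)
    return Chem.MolToSmiles(mol) if mol is not None else None


formats = to_formats("OCC")                       # ethanol, written oxygen-first
round_trip = from_molblock(formats["molblock"])   # back to canonical SMILES
```

In this sketch a failed parse returns None for brevity; a service following the contract described above would return a structured error record instead.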
Why is it useful?
SMILES strings from optical recognition are noisy. A string can look plausible and still fail to describe a real molecule: it may contain impossible valences, ring bonds that are opened but never closed, or spurious radicals. Different tools also produce different SMILES strings for the same molecule. Without canonicalization, a compound library accumulates duplicates that look distinct to string comparison but are chemically identical. Deduplication becomes unreliable, and any downstream analysis that depends on consistent compound identifiers breaks.
RDKit canonicalization resolves this. By running every structure through RDKit's canonical SMILES generator as the final step before storage, we guarantee that every record in the compound library is in a single, deterministic form. Two SMILES strings that represent the same molecule will produce the same canonical output; duplicates become detectable by straightforward string comparison.
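A minimal sketch of that deduplication step follows, assuming RDKit is installed. The helper names `canonical_key` and `group_duplicates` are illustrative and not part of Dante Smiles. Inputs that fail to parse are collected rather than dropped, in keeping with the service's contract.

```python
from typing import Dict, List, Optional, Tuple

from rdkit import Chem


def canonical_key(smiles: str) -> Optional[str]:
    """Return RDKit canonical SMILES, or None if the input does not parse."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def group_duplicates(smiles_list: List[str]) -> Tuple[Dict[str, List[str]], List[str]]:
    """Group inputs by canonical SMILES and collect the inputs that failed to parse."""
    groups: Dict[str, List[str]] = {}
    rejected: List[str] = []
    for s in smiles_list:
        key = canonical_key(s)
        if key is None:
            rejected.append(s)
        else:
            groups.setdefault(key, []).append(s)
    return groups, rejected


# Three spellings of ethanol, one of benzene, and one unclosed ring.
groups, rejected = group_duplicates(["CCO", "OCC", "C(O)C", "c1ccccc1", "C1CC"])
```

Any group holding more than one input string is a set of duplicates. Canonical forms are only comparable when generated by the same toolkit, so a sketch like this assumes all records were canonicalized by the same RDKit installation.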
Property calculation at ingestion time — rather than on demand — means downstream tools always have molecular weight, LogP, formula, and ring count available without recalculating. It also surfaces problems early: a structure with an implausible molecular weight or LogP value is a signal worth reviewing before it enters a compound library.
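One way to surface such signals is a simple range check on each enriched record. The sketch below assumes a record dictionary with `valid`, `mw`, and `logp` keys. The function name `review_flags`, the flag strings, and the default thresholds are all hypothetical placeholders, not values used by Dante Smiles; a real deployment would choose limits that suit its own library.

```python
from typing import Dict, List, Tuple


def review_flags(
    record: Dict,
    mw_range: Tuple[float, float] = (50.0, 1500.0),   # illustrative limits only
    logp_range: Tuple[float, float] = (-5.0, 10.0),   # illustrative limits only
) -> List[str]:
    """Return a list of review flags for a valid record; invalid records get none."""
    flags: List[str] = []
    if not record.get("valid"):
        return flags
    if not mw_range[0] <= record["mw"] <= mw_range[1]:
        flags.append("mw_out_of_range")
    if not logp_range[0] <= record["logp"] <= logp_range[1]:
        flags.append("logp_out_of_range")
    return flags


# Hand-written example records, not computed output.
small = {"valid": True, "mw": 16.0, "logp": 0.6}
typical = {"valid": True, "mw": 300.0, "logp": 2.5}
small_flags = review_flags(small)
typical_flags = review_flags(typical)
```

Flags mark a record for human review; they do not reject it.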
How and where does it apply?
Dante Smiles is the final stage in the Dante pipeline, receiving output from Dante Decimer after optical recognition and batch queue processing.
The service is also usable standalone, decoupled from the rest of the Dante pipeline. A pharma team with an existing compound library in CSV format can import it directly, validate and canonicalize in batch, and export enriched records. We have applied this to legacy library audits: identifying invalid structures and duplicate entries in compound collections before migration to new registration systems. The structured error output gives the audit team a reviewable list of failures with specific failure reasons rather than a count of dropped rows.
```python
from rdkit import Chem
from rdkit.Chem import Descriptors, inchi, rdMolDescriptors


def enrich_smiles(smiles: str) -> dict:
    # MolFromSmiles returns None (it does not raise) when the string
    # cannot be parsed or the molecule fails sanitization.
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return {"valid": False, "smiles": smiles, "error": "parse_failed"}

    # Canonical SMILES is the deterministic form used for storage and deduplication.
    canonical = Chem.MolToSmiles(mol)
    return {
        "valid": True,
        "smiles": canonical,
        "inchi": inchi.MolToInchi(mol),
        "mw": round(Descriptors.MolWt(mol), 3),
        "logp": round(Descriptors.MolLogP(mol), 3),
        "formula": rdMolDescriptors.CalcMolFormula(mol),
    }
```
The function returns a dictionary in both cases, and every record carries the valid and smiles keys. Failed records add an error code; successful records add the computed properties. Callers do not need to catch exceptions for unparseable input: they check the valid flag and proceed. In batch runs, this makes aggregation and reporting straightforward: filter by valid, group errors by their error value, count, and export.
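The aggregation step can be sketched with the standard library alone. The example below assumes records shaped like the function's output (`valid`, `smiles`, and `error` on failures). The `summarize` helper and its report keys are illustrative names, and the sample records are written by hand with property fields omitted for brevity.

```python
from collections import Counter
from typing import Dict, List


def summarize(records: List[Dict]) -> Dict:
    """Count valid and invalid records, group errors by reason, and list duplicates."""
    valid = [r for r in records if r["valid"]]
    invalid = [r for r in records if not r["valid"]]
    # Canonical SMILES makes duplicates detectable by plain string comparison.
    smiles_counts = Counter(r["smiles"] for r in valid)
    return {
        "total": len(records),
        "valid": len(valid),
        "invalid": len(invalid),
        "errors_by_reason": dict(Counter(r["error"] for r in invalid)),
        "duplicate_smiles": sorted(s for s, n in smiles_counts.items() if n > 1),
    }


# Hand-written sample records for illustration.
records = [
    {"valid": True, "smiles": "CCO"},
    {"valid": True, "smiles": "CCO"},
    {"valid": True, "smiles": "c1ccccc1"},
    {"valid": False, "smiles": "C1CC", "error": "parse_failed"},
]
report = summarize(records)
```

The resulting report is a plain dictionary, so it can be written to JSON or CSV for the audit team without further conversion.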