Dante Decimer: SMILES Generation via Optical Chemical Recognition
#worksona #atomic47 #pharmaceutical #cheminformatics #smiles #decimer #microservice
David Olsson

Dante Decimer is the SMILES generation layer in the Dante pharmaceutical pipeline. It receives cropped chemical structure images from Dante Bronte, passes them through the DECIMER optical chemical structure recognition engine, and returns canonical SMILES strings to the Bronte database. A batch processing queue handles multi-structure documents. The service has a single, bounded responsibility: image in, SMILES out.
DECIMER (Deep lEarning for Chemical IMagE Recognition) is a neural network trained specifically on chemical structure diagrams. It is not a general-purpose vision model adapted to chemistry; it was built for this domain from the ground up. Dante Decimer wraps the DECIMER inference step in a microservice boundary so it can be called, monitored, upgraded, and replaced independently of the surrounding pipeline.
Why is it useful?
General-purpose vision models perform poorly on pharmaceutical structure drawings. The bond angles, ring notations, stereo descriptors, and abbreviation conventions in chemical diagrams require domain-specific training data and architecture decisions that DECIMER provides. Using it as the recognition engine gives substantially better accuracy on the structure types that appear in pharmaceutical literature and regulatory filings.
The microservice boundary matters for a second reason: model improvement. DECIMER is an active research project. By isolating the inference step behind a clean interface, we can swap in a newer model version without modifying Dante Bronte's extraction logic or Dante Smiles' validation layer. The batch queue adds a third benefit: it prevents API saturation when a single document contains dozens of structures, throttling submissions to the inference engine at a controlled rate rather than flooding it with concurrent requests.
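One way to picture the throttling behaviour is a semaphore that caps how many crops are in flight at once. The sketch below is an illustration under stated assumptions, not the service's actual queue: `throttled_batch`, the `predict` callable, and the `limit` value are hypothetical names, and the article does not say which concurrency limit is used in practice.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical signature for whatever coroutine performs the recognition call.
PredictFn = Callable[[bytes], Awaitable[str]]


async def throttled_batch(
    crops: list[bytes],
    predict: PredictFn,
    limit: int = 2,  # assumed cap; the real limit is not stated in the article
) -> list[str]:
    """Run `predict` over all crops with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def worker(crop: bytes) -> str:
        # Each worker waits here until one of the `limit` slots is free.
        async with semaphore:
            return await predict(crop)

    # gather() returns results in the order the awaitables were passed,
    # so output positions still correspond to input positions.
    return list(await asyncio.gather(*(worker(c) for c in crops)))
```

Setting `limit=1` makes the calls strictly sequential, while a larger value trades load on the inference engine for throughput.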
How and where does it apply?
Dante Decimer is the second stage in the Dante pipeline: Bronte extracts and crops structure images, Decimer converts them to SMILES, Smiles validates and enriches them.
The service is also callable by any external system that produces cropped chemical structure images and needs canonical SMILES in return. The interface contract is simple: submit a list of image bytes, receive a list of SMILES strings with positional correspondence to the inputs.
```python
async def process_batch(image_crops: list[bytes]) -> list[str]:
    results = []
    for crop in image_crops:
        smiles = await decimer_predict(crop)      # DECIMER inference
        canonical = canonicalize_smiles(smiles)   # RDKit round-trip check
        results.append(canonical or "PARSE_ERROR")
    return results
```
Each result position maps directly to its input crop. A failed recognition returns PARSE_ERROR rather than None or an empty string, which makes downstream error aggregation explicit. The RDKit round-trip check on the raw DECIMER output catches syntactically plausible but chemically invalid strings before they propagate to the validation layer.
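To illustrate why an explicit sentinel helps downstream, here is a minimal sketch of how a consumer might aggregate failures from a returned batch. The `summarize_batch` helper and the shape of its report are hypothetical; only the `PARSE_ERROR` value and the positional correspondence come from the contract described above.

```python
PARSE_ERROR = "PARSE_ERROR"  # sentinel value from the interface contract


def summarize_batch(results: list[str]) -> dict:
    """Count successes and record which input positions failed recognition."""
    failed_positions = [
        index for index, smiles in enumerate(results) if smiles == PARSE_ERROR
    ]
    return {
        "total": len(results),
        "succeeded": len(results) - len(failed_positions),
        # Positions index back into the original list of image crops.
        "failed_positions": failed_positions,
    }
```

Because each failure keeps its slot in the list, the failed positions can be used directly to look up the crops that need manual review or a retry.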
The pipeline is modular by design. Bronte and Decimer can each be replaced with alternative implementations (a different extraction UI, a different recognition model) without changes to Dante Smiles. The SMILES string is the handoff contract, and it is stable across all three components.
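The replaceability described here can be sketched with a structural interface: any recognizer that turns image bytes into a SMILES string can stand in for another. Everything in the sketch is hypothetical, including the `Recognizer` protocol, the two stub classes, and their hard-coded outputs; it shows the shape of the contract, not the project's actual class layout.

```python
import asyncio
from typing import Protocol


class Recognizer(Protocol):
    """Hypothetical contract: image bytes in, SMILES string out."""

    async def predict(self, crop: bytes) -> str: ...


class StubRecognizerA:
    """Stand-in for one recognition model; always returns ethanol."""

    async def predict(self, crop: bytes) -> str:
        return "CCO"


class StubRecognizerB:
    """Stand-in for an alternative model; always returns benzene."""

    async def predict(self, crop: bytes) -> str:
        return "c1ccccc1"


async def recognize_all(recognizer: Recognizer, crops: list[bytes]) -> list[str]:
    """Downstream code depends only on the protocol, not on a concrete model."""
    return [await recognizer.predict(crop) for crop in crops]


if __name__ == "__main__":
    crops = [b"first-crop", b"second-crop"]
    print(asyncio.run(recognize_all(StubRecognizerA(), crops)))
    print(asyncio.run(recognize_all(StubRecognizerB(), crops)))
```

Swapping the recognizer changes only the object passed in; the caller and the list-of-strings output stay the same.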