Smirk: A Tokenizer for OpenSMILES

GitHub License arXiv:2409.15370

Smirk is a chemistry-specific tokenizer that provides complete coverage of the OpenSMILES specification. It is built with Rust 🦀 and HuggingFace’s tokenizers 🤗, is easy to install, and works out of the box with the HuggingFace ecosystem.

Check out Getting Started to see smirk in action, or read the paper to learn about tokenization for molecular foundation models.

Why Smirk?

Molecular foundation models are demonstrating impressive performance, but current models use tokenizers that fail to represent all of chemistry, which inherently limits their performance. Smirk fixes this by tokenizing SMILES encodings all the way down to their constituent elements, enabling complete coverage of OpenSMILES with a vocabulary of 167 tokens.
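
To make the idea concrete, here is a toy Python sketch of what tokenizing "all the way down" can look like. It is not smirk's implementation or API: the names `ATOMWISE`, `INSIDE`, and `glyph_tokenize` are hypothetical, the regular expressions are a simplified assumption about SMILES syntax (chirality classes such as `@TH1` are not handled properly), and the exact token boundaries smirk uses may differ. The point is only the contrast: an atom-level tokenizer needs one vocabulary entry per distinct bracket atom, while splitting bracket atoms into their parts keeps the vocabulary small and closed.

```python
import re

# Assumed, simplified atom-level pattern: a whole bracket atom is one token.
ATOMWISE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[()=#$:/\\.+\-*@]")

# Assumed, simplified pattern for the contents of a bracket atom.
INSIDE = re.compile(r"@@|@|[A-Z][a-z]?|[a-z]{1,2}|\d|[+\-:*]")


def glyph_tokenize(smiles):
    """Split a SMILES string, breaking bracket atoms into their parts."""
    tokens = []
    for atom in ATOMWISE.findall(smiles):
        if atom.startswith("["):
            # Decompose e.g. "[NH4+]" instead of keeping it as one token.
            tokens.append("[")
            tokens.extend(INSIDE.findall(atom[1:-1]))
            tokens.append("]")
        else:
            tokens.append(atom)
    return tokens


print(ATOMWISE.findall("[NH4+]"))      # ['[NH4+]']
print(glyph_tokenize("[NH4+]"))        # ['[', 'N', 'H', '4', '+', ']']
print(glyph_tokenize("OC[C@@H][OH]"))  # ['O', 'C', '[', 'C', '@@', 'H', ']', '[', 'O', 'H', ']']
```

With the atom-level split, every new combination of element, isotope, hydrogen count, and charge inside brackets would need its own token, and anything outside the vocabulary becomes unknown. With the decomposed split, the same handful of symbols covers all of them.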