Smirk: A Tokenizer for OpenSMILES
Smirk is a chemistry-specific tokenizer that provides complete coverage of the OpenSMILES specification. It is built using Rust 🦀 and HuggingFace’s tokenizers 🤗. Installation is easy, and Smirk works out of the box with the HuggingFace ecosystem.
Check out Getting Started to see smirk in action, or read the paper to learn about tokenization for molecular foundation models.
Why Smirk?
Molecular foundation models are demonstrating impressive performance, but current models use tokenizers that fail to represent all of chemistry, which inherently limits their performance. smirk fixes this by tokenizing SMILES encodings all the way down to their constituent elements, enabling complete coverage of OpenSMILES with a vocabulary of 167 tokens.
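To make the idea of tokenizing "all the way down" concrete, here is a minimal pure-Python sketch. It is a toy illustration only: it does not use smirk, and the names `split_glyphs`, `_scan`, `OUTER`, `INNER`, and `BRACKET` are hypothetical. Its regular expressions are simplifying assumptions and do not reproduce smirk's actual 167-token vocabulary. The point it shows is that a bracketed atom such as `[C@@H]` is split into its parts rather than kept as one opaque token.

```python
import re

# Toy patterns for illustration; NOT smirk's real vocabulary or implementation.
# Outside brackets, OpenSMILES allows only the organic subset plus bonds,
# branches, ring-closure digits, the wildcard, and the dot.
OUTER = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]|%|\d|[-=#$:/\\.()*]")
# Inside brackets: element symbols, chirality marks, charges, digits.
INNER = re.compile(r"[A-Z][a-z]?|se|as|[bcnops]|@@|@|[+-]|\d|:|\*")
# A bracketed atom: everything between one "[" and the next "]".
BRACKET = re.compile(r"\[([^\[\]]*)\]")


def _scan(pattern: re.Pattern, text: str) -> list[str]:
    """Split text into consecutive matches of pattern, failing on anything else."""
    out, i = [], 0
    while i < len(text):
        m = pattern.match(text, i)
        if m is None:
            raise ValueError(f"unexpected character {text[i]!r}")
        out.append(m.group())
        i = m.end()
    return out


def split_glyphs(smiles: str) -> list[str]:
    """Split a SMILES string into fine-grained pieces, opening up bracket atoms."""
    tokens: list[str] = []
    pos = 0
    for m in BRACKET.finditer(smiles):
        tokens += _scan(OUTER, smiles[pos:m.start()])     # text before the bracket
        tokens += ["[", *_scan(INNER, m.group(1)), "]"]   # bracket contents, split up
        pos = m.end()
    tokens += _scan(OUTER, smiles[pos:])                  # trailing text
    return tokens


print(split_glyphs("OC[C@@H](N)C"))
# ['O', 'C', '[', 'C', '@@', 'H', ']', '(', 'N', ')', 'C']
```

Because every bracketed atom is built from a small, fixed set of pieces, a compact vocabulary can in principle cover any valid bracket expression, whereas a tokenizer that treats each bracketed atom as a single token needs one vocabulary entry per distinct bracket.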