API Reference¶
- class smirk.SmirkTokenizerFast(tokenizer_file: PathLike | None = None, **kwargs)[source]¶
- __init__(tokenizer_file: PathLike | None = None, **kwargs)[source]¶
A Chemically-Complete Tokenizer for OpenSMILES
- Parameters:
vocab_file (os.PathLike) – Path to a JSON-encoded vocabulary mapping tokens to ids.
tokenizer_file – Path to a JSON-serialized SmirkTokenizerFast tokenizer.
template (str) – A post-processing template. Defaults to no post-processing. For a BERT-like template, use: [CLS] $0 [SEP].
add_special_tokens (bool) – If true, adds the default special tokens to the vocabulary. Defaults to true.
kwargs – Additional kwargs are passed to transformers.PreTrainedTokenizerFast.
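For orientation, a minimal usage sketch (the serialized-tokenizer path and the SMILES string are illustrative, not shipped with the package):

    from smirk import SmirkTokenizerFast

    # Instantiate with the default chemically-complete vocabulary
    tokenizer = SmirkTokenizerFast()

    # Or restore a previously serialized tokenizer (path is hypothetical)
    # tokenizer = SmirkTokenizerFast(tokenizer_file="tokenizer.json")

    encoding = tokenizer("CCO")  # ethanol
    print(encoding["input_ids"])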
- property added_tokens_decoder: dict[int, AddedToken]¶
Mapping from id to AddedToken for the added tokens
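A short inspection sketch; reading the content attribute follows the 🤗 tokenizers AddedToken API:

    tokenizer = SmirkTokenizerFast()
    for token_id, token in tokenizer.added_tokens_decoder.items():
        print(token_id, token.content)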
- convert_ids_to_tokens(index: int | List[int]) → str | List[str][source]¶
Decode a token id, or a list of ids, into their token(s)
- convert_tokens_to_ids(token: str | List[str]) → int | List[int][source]¶
Convert a token, or list of tokens, into their token id(s)
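These two methods are inverses; a minimal round-trip sketch, reusing the tokenizer from the sketch above (the SMILES string is illustrative):

    tokens = tokenizer.tokenize("CCO")
    ids = tokenizer.convert_tokens_to_ids(tokens)
    assert tokenizer.convert_ids_to_tokens(ids) == tokens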
- property post_processor¶
Returns the JSON serialization of the post-processor
Tip
This is not part of transformers.PreTrainedTokenizerFast’s API.
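Since the property returns a JSON string, it can be parsed for inspection; a sketch reusing the tokenizer above, assuming a post-processor has been configured:

    import json

    spec = json.loads(tokenizer.post_processor)
    print(spec.get("type"))  # schema follows the 🤗 tokenizers serialization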
- set_truncation_and_padding(padding_strategy: PaddingStrategy, truncation_strategy: TruncationStrategy, max_length: int | None = None, stride: int = 0, pad_to_multiple_of: int | None = None)[source]¶
Configure truncation and padding options for the tokenizer
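A configuration sketch reusing the tokenizer above; the enums come from 🤗 transformers, and the specific settings are illustrative:

    from transformers.tokenization_utils_base import PaddingStrategy, TruncationStrategy

    tokenizer.set_truncation_and_padding(
        padding_strategy=PaddingStrategy.MAX_LENGTH,
        truncation_strategy=TruncationStrategy.LONGEST_FIRST,
        max_length=128,
        stride=0,
        pad_to_multiple_of=None,
    )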
- tokenize(text: str, add_special_tokens=False) → list[str][source]¶
Converts a string into a sequence of tokens
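A brief sketch reusing the tokenizer above (the exact split depends on the vocabulary, so no output is shown):

    tokens = tokenizer.tokenize("C1=CC=CC=C1")  # benzene
    print(tokens)  # expect atom-level tokens; the exact split is vocabulary-dependent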
- property vocab_size¶
The size of the vocabulary without the added tokens
- smirk.SmirkTokenizerFast.__call__(text: str, ...)¶
Primary method for tokenizing text
See also
transformers.PreTrainedTokenizerBase.__call__() for the 🤗 documentation
Attention
The following features are not supported:
- Passing in pairs of text, i.e. text_pair or related arguments
- split_special_tokens
- Pre-tokenized inputs, i.e. is_split_into_words
- Returning overflowing tokens (i.e. return_overflowing_tokens)
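Within those limits it follows the standard 🤗 calling convention; a minimal sketch reusing the tokenizer above (batch input and keyword values are illustrative assumptions based on the 🤗 API):

    batch = tokenizer(
        ["CCO", "c1ccccc1"],
        padding=True,
        return_tensors="pt",  # requires torch; omit to get plain Python lists
    )
    print(batch["input_ids"].shape)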
- smirk.SmirkTokenizerFast.decode(list[int] | Tensor, skip_special_tokens: bool = False, ...) → str¶
Primary method for decoding token ids back into text.
See also
transformers.PreTrainedTokenizerBase.decode() for the 🤗 documentation
- smirk.SmirkTokenizerFast.batch_decode(list[list[int]] | Tensor, skip_special_tokens: bool = False, ...) → list[str]¶
Primary method for decoding batches of token ids back into text.
See also
transformers.PreTrainedTokenizerBase.batch_decode() for the 🤗 documentation
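A round-trip sketch combining encoding with both decoders, reusing the tokenizer above (the SMILES strings are illustrative):

    ids = tokenizer("CCO")["input_ids"]
    print(tokenizer.decode(ids, skip_special_tokens=True))

    batch_ids = tokenizer(["CCO", "CC(=O)O"])["input_ids"]
    print(tokenizer.batch_decode(batch_ids, skip_special_tokens=True))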
- smirk.SmirkSelfiesFast(vocab_file: PathLike | None = None, unk_token='[UNK]', **kwargs) → PreTrainedTokenizerFast[source]¶
Instantiate a Chemically-Consistent tokenizer for SELFIES
- Parameters:
vocab_file (os.PathLike) – The vocabulary for the tokenizers.models.WordLevel model. The default vocabulary includes all possible SELFIES tokens.
unk_token (str) – The unknown token to be used.
add_special_tokens (bool) – If true, adds the default special tokens to the vocabulary. Defaults to true.
kwargs – Additional kwargs are passed to transformers.PreTrainedTokenizerFast.
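A minimal sketch; note the factory returns a transformers.PreTrainedTokenizerFast rather than a SmirkTokenizerFast (the SELFIES string is illustrative):

    from smirk import SmirkSelfiesFast

    selfies_tokenizer = SmirkSelfiesFast()
    print(selfies_tokenizer.tokenize("[C][C][O]"))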
- smirk.SPECIAL_TOKENS = {'bos_token': '[BOS]', 'cls_token': '[CLS]', 'eos_token': '[EOS]', 'mask_token': '[MASK]', 'pad_token': '[PAD]', 'sep_token': '[SEP]', 'unk_token': '[UNK]'}¶
Default special tokens used by the SmirkTokenizerFast and SmirkSelfiesFast tokenizers.
- smirk.train_gpe(files: list[str], ref: SmirkTokenizerFast | None = None, min_frequency: int = 0, vocab_size: int = 1024, merge_brackets: bool = False, split_structure: bool = True) → SmirkTokenizerFast[source]¶
Train a Smirk-GPE Tokenizer from a corpus of SMILES encodings.
- Parameters:
files – List of files containing the corpus to train the tokenizer on
ref – The initial tokenizer to start from when training. Defaults to SmirkTokenizerFast(). This determines the initial vocabulary and the identity of the unknown token
min_frequency – Minimum count for a pair to be considered for a merge
vocab_size – The target size of the final vocabulary
merge_brackets – If true, merges with brackets ([ or ]) are allowed
split_structure – If true, splits SMILES encodings on structural elements before considering merges (i.e. merges across structural elements are not allowed)
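A training sketch (the corpus file name and settings are illustrative):

    from smirk import train_gpe

    tokenizer = train_gpe(
        files=["corpus.smi"],  # hypothetical file of SMILES encodings
        vocab_size=2048,
        min_frequency=2,
    )
    # Returns a SmirkTokenizerFast trained with GPE merges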
Command Line Utility¶
Train a smirk-gpe tokenizer from a corpus of SMILES encodings
usage: python -m smirk.cli [-h] [--vocab-size VOCAB_SIZE]
[--merge-brackets | --no-merge-brackets]
[--split-structure | --no-split-structure]
[-o OUTPUT]
files [files ...]
Positional Arguments¶
- files
Files containing the corpus of SMILES encodings to train on
Named Arguments¶
- --vocab-size
Default: 1024
- --merge-brackets, --no-merge-brackets
Allow merges with bracket ([ or ]) tokens (default: False)
Default: False
- --split-structure, --no-split-structure
Split SMILES on structure before training (default: True)
Default: True
- -o, --output
Directory where the trained smirk-gpe model is saved
Default: “.”
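An example invocation (file and directory names are illustrative):

    python -m smirk.cli --vocab-size 2048 -o trained-tokenizer corpus.smi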