API Reference

class smirk.SmirkTokenizerFast(tokenizer_file: PathLike | None = None, **kwargs)[source]
__init__(tokenizer_file: PathLike | None = None, **kwargs)[source]

A Chemically-Complete Tokenizer for OpenSMILES

Parameters:
  • tokenizer_file – Optional path to a saved tokenizer file to load the tokenizer from
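A minimal usage sketch (the token split shown is illustrative, and assumes the default vocabulary is used when no tokenizer_file is given):

>>> from smirk import SmirkTokenizerFast
>>> tokenizer = SmirkTokenizerFast()  # default vocabulary
>>> tokenizer.tokenize("CCO")  # ethanol; illustrative output
['C', 'C', 'O']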
__len__() → int[source]

Size of the full vocabulary including added tokens

property added_tokens_decoder: dict[int, AddedToken]

Mapping from id to AddedToken for the added tokens

property added_tokens_encoder: dict[str, int]

Mapping from added token to token id

convert_ids_to_tokens(index: int | List[int]) → str | List[str][source]

Decode a token id, or a list of ids, into their token(s)

convert_tokens_to_ids(token: str | List[str]) → int | List[int][source]

Convert a token, or list of tokens, into their token id(s)

convert_tokens_to_string(tokens: List[str]) → str[source]

Joins a list of tokens into a string.
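These three methods compose into a round trip, as sketched below (assumes every token is already in the vocabulary):

>>> tokens = tokenizer.tokenize("c1ccccc1")  # benzene
>>> ids = tokenizer.convert_tokens_to_ids(tokens)
>>> tokenizer.convert_ids_to_tokens(ids) == tokens
True
>>> tokenizer.convert_tokens_to_string(tokens)
'c1ccccc1'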

get_vocab() → dict[str, int][source]

Returns the vocabulary of the tokenizer as a dictionary
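For example:

>>> vocab = tokenizer.get_vocab()
>>> "C" in vocab  # aliphatic carbon is part of the vocabulary
True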

property post_processor

Returns the JSON serialization of the post-processor

Tip

This is not part of transformers.PreTrainedTokenizerFast’s API.

set_truncation_and_padding(padding_strategy: PaddingStrategy, truncation_strategy: TruncationStrategy, max_length: int | None = None, stride: int = 0, pad_to_multiple_of: int | None = None)[source]

Configure truncation and padding options for the tokenizer
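A sketch using the strategy enums from transformers (the values shown are arbitrary; pad_to_multiple_of=8 is a common choice for tensor-core efficiency):

>>> from transformers.tokenization_utils_base import TruncationStrategy
>>> from transformers.utils import PaddingStrategy
>>> tokenizer.set_truncation_and_padding(
...     padding_strategy=PaddingStrategy.MAX_LENGTH,
...     truncation_strategy=TruncationStrategy.LONGEST_FIRST,
...     max_length=128,
...     stride=0,
...     pad_to_multiple_of=8,
... )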

to_str() → str[source]

Serialize the tokenizer into a JSON string

tokenize(text: str, add_special_tokens=False) → list[str][source]

Converts a string into a sequence of tokens
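For example (the exact split is illustrative):

>>> tokenizer.tokenize("CC(=O)O")  # acetic acid
['C', 'C', '(', '=', 'O', ')', 'O']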

property vocab: dict[str, int]

The tokenizer's vocabulary

property vocab_size

The size of the vocabulary without the added tokens
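In contrast to __len__(), which also counts added tokens:

>>> tokenizer.vocab_size <= len(tokenizer)  # len() includes added tokens
True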

smirk.SmirkTokenizerFast.__call__(text: str, ...)

Primary method for tokenizing text

See also

transformers.PreTrainedTokenizerBase.__call__() for the 🤗 documentation

Attention

The following features are not supported:

  • Passing pairs of text, i.e. text_pair or related arguments

  • split_special_tokens

  • Pre-tokenized inputs, i.e. is_split_into_words

  • Returning overflowing tokens, i.e. return_overflowing_tokens
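A batching sketch (the encoding follows the standard 🤗 interface; padding=True pads to the longest sequence in the batch):

>>> batch = tokenizer(["CCO", "c1ccccc1"], padding=True)
>>> len(batch["input_ids"][0]) == len(batch["input_ids"][1])  # padded to a common length
True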

smirk.SmirkTokenizerFast.decode(list[int] | Tensor, skip_special_tokens: bool = False, ...) → str

Primary method for decoding token ids back into text.

See also

transformers.PreTrainedTokenizerBase.decode() for the 🤗 documentation
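For example (assumes the encoding round-trips losslessly):

>>> ids = tokenizer("CCO")["input_ids"]
>>> tokenizer.decode(ids, skip_special_tokens=True)
'CCO'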

smirk.SmirkTokenizerFast.batch_decode(list[list[int]] | Tensor, skip_special_tokens: bool = False, ...) → list[str]

Primary method for decoding batches of token ids back into text.

See also

transformers.PreTrainedTokenizerBase.batch_decode() for the 🤗 documentation
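The batched counterpart of decode() (output shown is illustrative):

>>> batch = tokenizer(["CCO", "CC(=O)O"])
>>> tokenizer.batch_decode(batch["input_ids"], skip_special_tokens=True)
['CCO', 'CC(=O)O']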

smirk.SmirkSelfiesFast(vocab_file: PathLike | None = None, unk_token='[UNK]', **kwargs) → PreTrainedTokenizerFast[source]

Instantiate a Chemically-Consistent tokenizer for SELFIES

Parameters:
  • vocab_file – Optional path to a vocabulary file to load the SELFIES vocabulary from

  • unk_token – Token to substitute for unknown symbols. Defaults to '[UNK]'
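A usage sketch (the symbol-level split shown is illustrative):

>>> import smirk
>>> selfies_tokenizer = smirk.SmirkSelfiesFast()
>>> selfies_tokenizer.tokenize("[C][C][O]")  # ethanol in SELFIES; illustrative
['[C]', '[C]', '[O]']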
smirk.SPECIAL_TOKENS = {'bos_token': '[BOS]', 'cls_token': '[CLS]', 'eos_token': '[EOS]', 'mask_token': '[MASK]', 'pad_token': '[PAD]', 'sep_token': '[SEP]', 'unk_token': '[UNK]'}

Default special tokens used by the SmirkTokenizerFast and SmirkSelfiesFast() tokenizers.

smirk.train_gpe(files: list[str], ref: SmirkTokenizerFast | None = None, min_frequency: int = 0, vocab_size: int = 1024, merge_brackets: bool = False, split_structure: bool = True) → SmirkTokenizerFast[source]

Train a Smirk-GPE Tokenizer from a corpus of SMILES encodings.

Parameters:
  • files – List of files containing the corpus to train the tokenizer on

  • ref – The initial tokenizer to start from when training. Defaults to SmirkTokenizerFast(). This determines the initial vocabulary and the identity of the unknown token

  • min_frequency – Minimum count for a pair to be considered for a merge

  • vocab_size – The target size of the final vocabulary

  • merge_brackets – If true, merges with brackets ([ or ]) are allowed

  • split_structure – If true, splits the SMILES encoding on structural elements before considering merges (i.e. merges across structural elements are not allowed)
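A training sketch (corpus.smi is a hypothetical file with one SMILES string per line; serialization uses to_str() documented above):

>>> from smirk import train_gpe
>>> tokenizer = train_gpe(
...     ["corpus.smi"],  # hypothetical corpus file
...     vocab_size=2048,
...     min_frequency=10,
...     merge_brackets=False,
...     split_structure=True,
... )
>>> with open("tokenizer.json", "w") as f:
...     _ = f.write(tokenizer.to_str())  # serialize the trained tokenizer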

Command Line Utility

Train a smirk-gpe tokenizer from a corpus of SMILES encodings

usage: python -m smirk.cli [-h] [--vocab-size VOCAB_SIZE]
                           [--merge-brackets | --no-merge-brackets]
                           [--split-structure | --no-split-structure]
                           [-o OUTPUT]
                           files [files ...]

Positional Arguments

files

Corpus files to train the tokenizer on

Named Arguments

--vocab-size

Target size of the final vocabulary

Default: 1024

--merge-brackets, --no-merge-brackets

Allow merges with bracket ([ or ]) tokens (default: False)

Default: False

--split-structure, --no-split-structure

Split SMILES on structure before training (default: True)

Default: True

-o, --output

Directory where the trained smirk-gpe model is saved

Default: “.”
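An illustrative invocation (corpus.smi is a hypothetical input file):

python -m smirk.cli --vocab-size 2048 --no-merge-brackets -o ./smirk-gpe corpus.smi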