API Reference¶
- class smirk.SmirkTokenizerFast(tokenizer_file: PathLike | None = None, **kwargs)[source]¶
- __init__(tokenizer_file: PathLike | None = None, **kwargs)[source]¶
A Chemically-Complete Tokenizer for OpenSMILES
- Parameters:
vocab_file (os.PathLike) – Path to a JSON-encoded vocabulary mapping tokens to ids.
tokenizer_file – Path to a JSON-serialized SmirkTokenizerFast tokenizer.
template (str) – A post-processing template. Defaults to no post-processing. For a BERT-like template, use: [CLS] $0 [SEP].
add_special_tokens (bool) – If true, adds the default special tokens to the vocabulary. Defaults to true.
kwargs – Additional kwargs are passed to transformers.PreTrainedTokenizerFast.
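For orientation, a minimal usage sketch (the serialized-tokenizer path and the SMILES string are illustrative, not shipped with the package):

    from smirk import SmirkTokenizerFast

    # Instantiate with the default chemically-complete vocabulary
    tokenizer = SmirkTokenizerFast()

    # Or restore a previously serialized tokenizer (path is hypothetical)
    # tokenizer = SmirkTokenizerFast(tokenizer_file="tokenizer.json")

    encoding = tokenizer("CCO")  # ethanol
    print(encoding["input_ids"])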
- property added_tokens_decoder: dict[int, AddedToken]¶
Mapping from id to AddedToken for the added tokens
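A short inspection sketch; reading the content attribute follows the 🤗 tokenizers AddedToken API:

    tokenizer = SmirkTokenizerFast()
    for token_id, token in tokenizer.added_tokens_decoder.items():
        print(token_id, token.content)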
- convert_ids_to_tokens(index: int | List[int]) → str | List[str][source]¶
Decode a token id, or a list of ids, into their token(s)
- convert_tokens_to_ids(token: str | List[str]) → int | List[int][source]¶
Convert a token, or list of tokens, into their token id(s)
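These two methods are inverses; a minimal round-trip sketch, reusing the tokenizer from the sketch above (the SMILES string is illustrative):

    tokens = tokenizer.tokenize("CCO")
    ids = tokenizer.convert_tokens_to_ids(tokens)
    assert tokenizer.convert_ids_to_tokens(ids) == tokens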
- property post_processor¶
Returns the JSON serialization of the post-processor
Tip
This is not part of transformers.PreTrainedTokenizerFast’s API.
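Since the property returns a JSON string, it can be parsed for inspection; a sketch reusing the tokenizer above, assuming a post-processor has been configured:

    import json

    spec = json.loads(tokenizer.post_processor)
    print(spec.get("type"))  # schema follows the 🤗 tokenizers serialization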
- set_truncation_and_padding(padding_strategy: PaddingStrategy, truncation_strategy: TruncationStrategy, max_length: int | None = None, stride: int = 0, pad_to_multiple_of: int | None = None)[source]¶
Configure truncation and padding options for the tokenizer
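A configuration sketch reusing the tokenizer above; the enums come from 🤗 transformers, and the specific settings are illustrative:

    from transformers.tokenization_utils_base import PaddingStrategy, TruncationStrategy

    tokenizer.set_truncation_and_padding(
        padding_strategy=PaddingStrategy.MAX_LENGTH,
        truncation_strategy=TruncationStrategy.LONGEST_FIRST,
        max_length=128,
        stride=0,
        pad_to_multiple_of=None,
    )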
- tokenize(text: str, add_special_tokens=False) → list[str][source]¶
Converts a string into a sequence of tokens
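A brief sketch reusing the tokenizer above (the exact split depends on the vocabulary, so no output is shown):

    tokens = tokenizer.tokenize("C1=CC=CC=C1")  # benzene
    print(tokens)  # expect atom-level tokens; the exact split is vocabulary-dependent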
- property vocab_size¶
The size of the vocabulary without the added tokens
- smirk.SmirkTokenizerFast.__call__(text: str, ...)¶
Primary method for tokenizing text
See also
transformers.PreTrainedTokenizerBase.__call__() for the 🤗 documentation
Attention
The following features are not supported:
- Passing in pairs of text, i.e. text_pair or related arguments
- split_special_tokens
- Pre-tokenized inputs, i.e. is_split_into_words
- Returning overflowing tokens (i.e. return_overflowing_tokens)
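Within those limits it follows the standard 🤗 calling convention; a minimal sketch reusing the tokenizer above (batch input and keyword values are illustrative assumptions based on the 🤗 API):

    batch = tokenizer(
        ["CCO", "c1ccccc1"],
        padding=True,
        return_tensors="pt",  # requires torch; omit to get plain Python lists
    )
    print(batch["input_ids"].shape)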
- smirk.SmirkTokenizerFast.decode(list[int] | Tensor, skip_special_tokens: bool = False, ...) → str¶
Primary method for decoding token ids back into text.
See also
transformers.PreTrainedTokenizerBase.decode() for the 🤗 documentation
- smirk.SmirkTokenizerFast.batch_decode(list[list[int]] | Tensor, skip_special_tokens: bool = False, ...) → list[str]¶
Primary method for decoding batches of token ids back into text.
See also
transformers.PreTrainedTokenizerBase.batch_decode() for the 🤗 documentation
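A round-trip sketch combining encoding with both decoders, reusing the tokenizer above (the SMILES strings are illustrative):

    ids = tokenizer("CCO")["input_ids"]
    print(tokenizer.decode(ids, skip_special_tokens=True))

    batch_ids = tokenizer(["CCO", "CC(=O)O"])["input_ids"]
    print(tokenizer.batch_decode(batch_ids, skip_special_tokens=True))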
- smirk.SmirkSelfiesFast(vocab_file: PathLike | None = None, unk_token='[UNK]', **kwargs) → PreTrainedTokenizerFast[source]¶
Instantiate a Chemically-Consistent tokenizer for SELFIES
- Parameters:
vocab_file (os.PathLike) – The vocabulary for the tokenizers.models.WordLevel model. The default vocabulary includes all possible SELFIES tokens.
unk_token (str) – The unknown token to be used.
add_special_tokens (bool) – If true, adds the default special tokens to the vocabulary. Defaults to true.
kwargs – Additional kwargs are passed to transformers.PreTrainedTokenizerFast.
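A minimal sketch; note the factory returns a transformers.PreTrainedTokenizerFast rather than a SmirkTokenizerFast (the SELFIES string is illustrative):

    from smirk import SmirkSelfiesFast

    selfies_tokenizer = SmirkSelfiesFast()
    print(selfies_tokenizer.tokenize("[C][C][O]"))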
- smirk.SPECIAL_TOKENS = {'bos_token': '[BOS]', 'cls_token': '[CLS]', 'eos_token': '[EOS]', 'mask_token': '[MASK]', 'pad_token': '[PAD]', 'sep_token': '[SEP]', 'unk_token': '[UNK]'}¶
Default special tokens used by the SmirkTokenizerFast and SmirkSelfiesFast tokenizers.
- smirk.train_gpe(files: list[str], ref: SmirkTokenizerFast | None = None, min_frequency: int = 0, vocab_size: int = 1024, merge_brackets: bool = False, split_structure: bool = True) → SmirkTokenizerFast[source]¶
Train a Smirk-GPE Tokenizer from a corpus of SMILES encodings.
- Parameters:
files – List of files containing the corpus to train the tokenizer on
ref – The initial tokenizer to start from when training. Defaults to SmirkTokenizerFast(). This determines the initial vocabulary and the identity of the unknown token
min_frequency – Minimum count for a pair to be considered for a merge
vocab_size – The target size of the final vocabulary
merge_brackets – If true, merges with brackets ([ or ]) are allowed
split_structure – If true, splits SMILES encodings on structural elements before considering merges (i.e. merges across structural elements are not allowed)
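A training sketch (the corpus file name and settings are illustrative):

    from smirk import train_gpe

    tokenizer = train_gpe(
        files=["corpus.smi"],  # hypothetical file of SMILES encodings
        vocab_size=2048,
        min_frequency=2,
    )
    # Returns a SmirkTokenizerFast trained with GPE merges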
Command Line Utility¶
Train a smirk-gpe tokenizer from a corpus of SMILES encodings
usage: python -m smirk.cli [-h] [--vocab-size VOCAB_SIZE]
[--merge-brackets | --no-merge-brackets]
[--split-structure | --no-split-structure]
[-o OUTPUT]
files [files ...]
Positional Arguments¶
- files
Files containing the corpus of SMILES encodings to train on
Named Arguments¶
- --vocab-size
Default: 1024
- --merge-brackets, --no-merge-brackets
Allow merges with bracket ([ or ]) tokens (default: False)
Default: False
- --split-structure, --no-split-structure
Split SMILES on structure before training (default: True)
Default: True
- -o, --output
Directory where the trained smirk-gpe model is saved
Default: “.”
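An example invocation (file and directory names are illustrative):

    python -m smirk.cli --vocab-size 2048 -o trained-tokenizer corpus.smi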