tokenizer#
- class CodeGenTokenizer(vocab_file, merges_file, errors='replace', max_len=None, pad_token='<|endoftext|>', eos_token='<|endoftext|>', unk_token='<|endoftext|>', eol_token='Ċ', **kwargs)[source]#
Bases: GPTTokenizer
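A minimal loading and encoding sketch (the checkpoint name `Salesforce/codegen-350M-mono` is an assumption; any available CodeGen checkpoint, or explicit `vocab_file`/`merges_file` paths, would work the same way):

```python
from paddlenlp.transformers import CodeGenTokenizer

# Load a pretrained tokenizer; "Salesforce/codegen-350M-mono" is an assumed
# checkpoint name -- substitute any CodeGen checkpoint you have available.
tokenizer = CodeGenTokenizer.from_pretrained("Salesforce/codegen-350M-mono")

# Encode a snippet of source code into token ids via __call__.
encoded = tokenizer("def hello_world():")
print(encoded["input_ids"])
```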
- decode(token_ids, skip_special_tokens=False, clean_up_tokenization_spaces=True, truncate_before_pattern=None, **kwargs)[source]#
Converts a sequence of token ids into a string, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces.
Similar to doing `self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids))`.
- Args:
  - token_ids (`Union[int, List[int], np.ndarray, paddle.Tensor]`): List of tokenized input ids. Can be obtained using the `__call__` method.
  - skip_special_tokens (`bool`, optional, defaults to `False`): Whether or not to remove special tokens in the decoding.
  - clean_up_tokenization_spaces (`bool`, optional, defaults to `True`): Whether or not to clean up the tokenization spaces.
  - truncate_before_pattern (`List[str]`, optional, defaults to `None`): A list of regular expression strings used to truncate the returned string. This can be used to remove extra pieces of code (e.g. truncate if a comment symbol "#" is observed at the beginning of a new line). An example pattern could be `["^#", re.escape("<|endoftext|>"), "^'''", "\n\n\n"]`.
  - kwargs (additional keyword arguments, optional): Will be passed to the underlying model-specific decode method.
- Returns:
  `str`: The decoded sentence.
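A hedged usage sketch of `decode` with `truncate_before_pattern` (the checkpoint name is again an assumption, and the token ids would normally come from a CodeGen model's generation output rather than a round-trip through the tokenizer):

```python
import re

from paddlenlp.transformers import CodeGenTokenizer

tokenizer = CodeGenTokenizer.from_pretrained("Salesforce/codegen-350M-mono")

# Token ids, e.g. produced by a CodeGen model; here we simply round-trip a string.
token_ids = tokenizer("def add(a, b):\n    return a + b\n# trailing comment")["input_ids"]

# Decode back to text, truncating before the first line that starts with "#"
# and before any <|endoftext|> marker.
text = tokenizer.decode(
    token_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
    truncate_before_pattern=["^#", re.escape("<|endoftext|>")],
)
print(text)
```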