tokenizer

class CodeGenTokenizer(vocab_file, merges_file, errors='replace', max_len=None, pad_token='<|endoftext|>', eos_token='<|endoftext|>', unk_token='<|endoftext|>', eol_token='Ċ', **kwargs)[source]

Bases: GPTTokenizer
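A minimal loading and encoding sketch. The checkpoint name `Salesforce/codegen-350M-mono` and the exact shape of the `__call__` output are assumptions for illustration; substitute a checkpoint available in your environment.

```python
from paddlenlp.transformers import CodeGenTokenizer

# Assumed checkpoint name; replace with one available in your setup.
tokenizer = CodeGenTokenizer.from_pretrained("Salesforce/codegen-350M-mono")

# __call__ returns a dict of encoded features; "input_ids" holds the token ids.
ids = tokenizer("def hello_world():")["input_ids"]
```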

decode(token_ids, skip_special_tokens=False, clean_up_tokenization_spaces=True, truncate_before_pattern=None, **kwargs)[source]

Converts a sequence of ids into a string using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces.

Similar to doing `self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids))`.

Args:
token_ids (Union[int, List[int], np.ndarray, paddle.Tensor]):

List of tokenized input ids. Can be obtained using the __call__ method.

skip_special_tokens (bool, optional, defaults to False):

Whether or not to remove special tokens in the decoding.

clean_up_tokenization_spaces (bool, optional, defaults to True):

Whether or not to clean up the tokenization spaces.

truncate_before_pattern (List[str], optional, defaults to None):

A list of regular expression strings that will be used to truncate the returned string. This can be used to remove extra pieces of code (e.g. truncate if observing a comment symbol "#" at the beginning of a new line). An example pattern could be `["^#", re.escape("<|endoftext|>"), "^'''", "\n\n\n"]`.

kwargs (additional keyword arguments, optional):

Will be passed to the underlying model specific decode method.

Returns:

str: The decoded sentence.
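A hedged decoding sketch, continuing from the loading example above. Here `ids` is simply the encoded prompt rather than model output, and the truncation list mirrors the example pattern given for truncate_before_pattern:

```python
import re

# Decode ids back to text, dropping special tokens and truncating at the
# first comment line, end-of-text marker, docstring fence, or blank run.
text = tokenizer.decode(
    ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
    truncate_before_pattern=["^#", re.escape("<|endoftext|>"), "^'''", "\n\n\n"],
)
print(text)
```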