tokenizer#
- class CodeGenTokenizer(vocab_file, merges_file, errors='replace', max_len=None, pad_token='<|endoftext|>', eos_token='<|endoftext|>', unk_token='<|endoftext|>', eol_token='Ċ', **kwargs)[source]#
Bases: GPTTokenizer
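A minimal loading and encoding sketch (the checkpoint name `Salesforce/codegen-350M-mono` is an assumption; any available CodeGen checkpoint, or explicit `vocab_file`/`merges_file` paths, would work the same way):

```python
from paddlenlp.transformers import CodeGenTokenizer

# Load a pretrained tokenizer; "Salesforce/codegen-350M-mono" is an assumed
# checkpoint name -- substitute any CodeGen checkpoint you have available.
tokenizer = CodeGenTokenizer.from_pretrained("Salesforce/codegen-350M-mono")

# Encode a snippet of source code into token ids via __call__.
encoded = tokenizer("def hello_world():")
print(encoded["input_ids"])
```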
- decode(token_ids, skip_special_tokens=False, clean_up_tokenization_spaces=True, truncate_before_pattern=None, **kwargs)[source]#
Converts a sequence of token ids into a string, using the tokenizer and vocabulary, with options to remove special tokens and clean up tokenization spaces.
Similar to doing `self.convert_tokens_to_string(self.convert_ids_to_tokens(token_ids))`.
- Args:
  - token_ids (`Union[int, List[int], np.ndarray, paddle.Tensor]`): List of tokenized input ids. Can be obtained using the `__call__` method.
  - skip_special_tokens (`bool`, optional, defaults to `False`): Whether or not to remove special tokens in the decoding.
  - clean_up_tokenization_spaces (`bool`, optional, defaults to `True`): Whether or not to clean up the tokenization spaces.
  - truncate_before_pattern (`List[str]`, optional, defaults to `None`): A list of regular expression strings used to truncate the returned string. This can be used to remove extra pieces of code (e.g. truncate if a comment symbol "#" is observed at the beginning of a new line). An example pattern could be `["^#", re.escape("<|endoftext|>"), "^'''", "\n\n\n"]`.
  - kwargs (additional keyword arguments, optional): Will be passed to the underlying model-specific decode method.
- Returns:
  `str`: The decoded sentence.
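A hedged usage sketch of `decode` with `truncate_before_pattern` (the checkpoint name is again an assumption, and the token ids would normally come from a CodeGen model's generation output rather than a round-trip through the tokenizer):

```python
import re

from paddlenlp.transformers import CodeGenTokenizer

tokenizer = CodeGenTokenizer.from_pretrained("Salesforce/codegen-350M-mono")

# Token ids, e.g. produced by a CodeGen model; here we simply round-trip a string.
token_ids = tokenizer("def add(a, b):\n    return a + b\n# trailing comment")["input_ids"]

# Decode back to text, truncating before the first line that starts with "#"
# and before any <|endoftext|> marker.
text = tokenizer.decode(
    token_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
    truncate_before_pattern=["^#", re.escape("<|endoftext|>")],
)
print(text)
```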