tokenizer#

class JiebaTokenizer(vocab)[source]#

Bases: BaseTokenizer

Constructs a tokenizer based on jieba. It supports the cut() method to split text into tokens and the encode() method to convert text to token ids.

Parameters:

vocab (paddlenlp.data.Vocab) – An instance of paddlenlp.data.Vocab.

cut(sentence, cut_all=False, use_hmm=True)[source]#

The method used to cut the text into tokens.

Parameters:
  • sentence (str) – The text to be cut.

  • cut_all (bool, optional) – Whether to use full mode. If True, full mode returns all possible words found in the sentence, which is fast but less accurate. If False, accurate mode cuts the sentence into the most precise segmentation, which is suitable for text analysis. Default: False.

  • use_hmm (bool, optional) – Whether to use the HMM model to discover words that are not in the dictionary. Default: True.

Returns:

A list of tokens.

Return type:

list[str]

Example

from paddlenlp.data import Vocab, JiebaTokenizer
# The vocab file. Download the sample file first:
# wget https://bj.bcebos.com/paddlenlp/data/senta_word_dict.txt
vocab_file_path = './senta_word_dict.txt'
# Initialize the Vocab
vocab = Vocab.load_vocabulary(
    vocab_file_path,
    unk_token='[UNK]',
    pad_token='[PAD]')
tokenizer = JiebaTokenizer(vocab)

tokens = tokenizer.cut('我爱你中国')
print(tokens)
# ['我爱你', '中国']
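
Full mode and the HMM flag can be combined with cut(). A minimal sketch of the non-default modes (the exact segments depend on the jieba dictionary version, so the outputs are illustrative, not fixed):

# Full mode: return every dictionary word found in the sentence,
# fast but with overlapping segments.
tokens_full = tokenizer.cut('我爱你中国', cut_all=True)
print(tokens_full)

# Accurate mode without the HMM: out-of-dictionary words are no longer
# discovered and may be split into single characters.
tokens_no_hmm = tokenizer.cut('我爱你中国', use_hmm=False)
print(tokens_no_hmm)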

encode(sentence, cut_all=False, use_hmm=True)[source]#

The method used to convert text to ids. It first calls the cut() method to split the text into tokens, then converts the tokens to ids using the vocab.

Parameters:
  • sentence (str) – The text to be cut.

  • cut_all (bool, optional) – Whether to use full mode. If True, full mode returns all possible words found in the sentence, which is fast but less accurate. If False, accurate mode cuts the sentence into the most precise segmentation, which is suitable for text analysis. Default: False.

  • use_hmm (bool, optional) – Whether to use the HMM model to discover words that are not in the dictionary. Default: True.

Returns:

A list of ids.

Return type:

list[int]

Example

from paddlenlp.data import Vocab, JiebaTokenizer
# The vocab file. Download the sample file first:
# wget https://bj.bcebos.com/paddlenlp/data/senta_word_dict.txt
vocab_file_path = './senta_word_dict.txt'
# Initialize the Vocab
vocab = Vocab.load_vocabulary(
    vocab_file_path,
    unk_token='[UNK]',
    pad_token='[PAD]')
tokenizer = JiebaTokenizer(vocab)

ids = tokenizer.encode('我爱你中国')
print(ids)
# [1170578, 575565]
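
As a sketch of what encode() does under the hood, per the description above: cut the sentence, then look each token up in the vocab, with out-of-vocabulary tokens falling back to the unk_token id. to_indices() is the paddlenlp.data.Vocab token-to-id lookup helper:

# Roughly equivalent to tokenizer.encode('我爱你中国'):
tokens = tokenizer.cut('我爱你中国')
ids = vocab.to_indices(tokens)
# Tokens missing from the vocab map to the id of unk_token ('[UNK]' here).
print(ids)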