tokenizer

class JiebaTokenizer(vocab)[source]

Bases: paddlenlp.data.tokenizer.BaseTokenizer

Constructs a tokenizer based on jieba. It provides a cut() method that splits text into tokens and an encode() method that converts text to token ids.

Parameters

vocab (paddlenlp.data.Vocab) -- An instance of paddlenlp.data.Vocab.

cut(sentence, cut_all=False, use_hmm=True)[source]

Splits the text into tokens.

Parameters
  • sentence (str) -- The text to be cut.

  • cut_all (bool, optional) -- Whether to use full mode. If True, full mode returns all the possible words found in the sentence, which is fast but not accurate. If False, accurate mode tries to cut the sentence into the most accurate segmentation, which is suitable for text analysis. Default: False.

  • use_hmm (bool, optional) -- Whether to use the HMM model. Default: True.

Returns

A list of tokens.

Return type

list[str]

Example

from paddlenlp.data import Vocab, JiebaTokenizer
# Path to the vocab file. Download the sample file first:
# wget https://paddlenlp.bj.bcebos.com/data/senta_word_dict.txt
vocab_file_path = './senta_word_dict.txt'
# Initialize the Vocab
vocab = Vocab.load_vocabulary(
    vocab_file_path,
    unk_token='[UNK]',
    pad_token='[PAD]')
tokenizer = JiebaTokenizer(vocab)

tokens = tokenizer.cut('我爱你中国')
print(tokens)
# ['我爱你', '中国']
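
To make the difference between the two cut_all modes concrete, here is a minimal, self-contained sketch that does not use jieba or paddlenlp. The dictionary TOY_DICT and the helpers toy_cut_full and toy_cut_accurate are hypothetical names introduced only for illustration. The "accurate" helper uses plain forward maximum matching as a stand-in; it is not jieba's actual segmentation algorithm, and real jieba output in full mode may differ from this toy.

```python
# Hypothetical toy dictionary; NOT jieba's real dictionary.
TOY_DICT = {"我", "爱", "你", "我爱你", "中", "国", "中国"}
MAX_WORD_LEN = max(len(word) for word in TOY_DICT)


def toy_cut_full(sentence):
    """Full-mode sketch: list every dictionary word found at any position."""
    tokens = []
    for start in range(len(sentence)):
        last_end = min(start + MAX_WORD_LEN, len(sentence))
        for end in range(start + 1, last_end + 1):
            word = sentence[start:end]
            if word in TOY_DICT:
                tokens.append(word)
    return tokens


def toy_cut_accurate(sentence):
    """Accurate-mode stand-in: forward maximum matching, one segmentation."""
    tokens = []
    start = 0
    while start < len(sentence):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(MAX_WORD_LEN, len(sentence) - start), 0, -1):
            word = sentence[start:start + length]
            if length == 1 or word in TOY_DICT:
                tokens.append(word)
                start += length
                break
    return tokens


full_tokens = toy_cut_full("我爱你中国")
accurate_tokens = toy_cut_accurate("我爱你中国")
print(full_tokens)      # overlapping candidates, e.g. both '我' and '我爱你'
print(accurate_tokens)  # a single non-overlapping segmentation
```

In this toy, the full-mode helper returns overlapping candidates, while the accurate-mode stand-in returns exactly one segmentation whose tokens join back into the original sentence.
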
encode(sentence, cut_all=False, use_hmm=True)[source]

Converts the text to ids. It first calls the cut() method to split the text into tokens, then converts the tokens to ids using the vocab.

Parameters
  • sentence (str) -- The text to be cut.

  • cut_all (bool, optional) -- Whether to use full mode. If True, full mode returns all the possible words found in the sentence, which is fast but not accurate. If False, accurate mode tries to cut the sentence into the most accurate segmentation, which is suitable for text analysis. Default: False.

  • use_hmm (bool, optional) -- Whether to use the HMM model. Default: True.

Returns

A list of ids.

Return type

list[int]

Example

from paddlenlp.data import Vocab, JiebaTokenizer
# Path to the vocab file. Download the sample file first:
# wget https://paddlenlp.bj.bcebos.com/data/senta_word_dict.txt
vocab_file_path = './senta_word_dict.txt'
# Initialize the Vocab
vocab = Vocab.load_vocabulary(
    vocab_file_path,
    unk_token='[UNK]',
    pad_token='[PAD]')
tokenizer = JiebaTokenizer(vocab)

ids = tokenizer.encode('我爱你中国')
print(ids)
# [1170578, 575565]
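
The two-step flow described for encode() (cut first, then look up each token in the vocab) can be sketched without jieba or paddlenlp. Everything here is hypothetical: TOY_TOKEN_TO_ID stands in for a loaded Vocab, toy_cut stands in for cut() using greedy longest matching, and the fallback to the '[UNK]' id for out-of-vocabulary tokens is an assumption about how the lookup behaves when unk_token is set.

```python
# Hypothetical toy vocabulary; real ids come from the loaded vocab file.
TOY_TOKEN_TO_ID = {"[PAD]": 0, "[UNK]": 1, "我爱你": 2, "中国": 3}
MAX_TOKEN_LEN = max(len(token) for token in TOY_TOKEN_TO_ID)


def toy_cut(sentence):
    """Stand-in for cut(): greedy longest match against the toy vocabulary."""
    tokens = []
    start = 0
    while start < len(sentence):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(MAX_TOKEN_LEN, len(sentence) - start), 0, -1):
            token = sentence[start:start + length]
            if length == 1 or token in TOY_TOKEN_TO_ID:
                tokens.append(token)
                start += length
                break
    return tokens


def toy_encode(sentence):
    """Step 1: cut the text into tokens. Step 2: map each token to an id."""
    unk_id = TOY_TOKEN_TO_ID["[UNK]"]
    # Assumption: tokens missing from the vocabulary map to the [UNK] id.
    return [TOY_TOKEN_TO_ID.get(token, unk_id) for token in toy_cut(sentence)]


known_ids = toy_encode("我爱你中国")
mixed_ids = toy_encode("我爱你中国和平")
print(known_ids)
print(mixed_ids)
```

The second call shows the assumed fallback: the characters that are not in the toy vocabulary each receive the '[UNK]' id, so the output length still matches the number of tokens.
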