tokenizer

class JiebaTokenizer(vocab)
Bases: paddlenlp.data.tokenizer.BaseTokenizer
Constructs a tokenizer based on jieba. It supports the cut() method to split text into tokens and the encode() method to convert text into token ids.
- Parameters
vocab (paddlenlp.data.Vocab) -- An instance of paddlenlp.data.Vocab.
cut(sentence, cut_all=False, use_hmm=True)
The method used to cut the text into tokens.
- Parameters
sentence (str) -- The text that needs to be cut.
cut_all (bool, optional) -- Whether to use full mode. If True, full mode is used, which returns all possible words in the sentence; it is fast but not accurate. If False, accurate mode is used, which attempts to cut the sentence into the most accurate segmentation and is suitable for text analysis. Default: False.
use_hmm (bool, optional) -- Whether to use the HMM model. Default: True.
- Returns
A list of tokens.
- Return type
list[str]
Examples
from paddlenlp.data import Vocab, JiebaTokenizer

# The vocab file. The sample file can be downloaded first:
# wget https://bj.bcebos.com/paddlenlp/data/senta_word_dict.txt
vocab_file_path = './senta_word_dict.txt'

# Initialize the Vocab.
vocab = Vocab.load_vocabulary(
    vocab_file_path,
    unk_token='[UNK]',
    pad_token='[PAD]')

tokenizer = JiebaTokenizer(vocab)
tokens = tokenizer.cut('我爱你中国')
print(tokens)
# ['我爱你', '中国']
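The cut_all and use_hmm arguments select between the jieba segmentation modes described in the parameter list above. A minimal sketch, reusing the vocab setup from the example above (the exact segmentations depend on jieba's dictionary, so no outputs are shown):

from paddlenlp.data import Vocab, JiebaTokenizer

# Reuse the sample vocab from the example above.
vocab = Vocab.load_vocabulary(
    './senta_word_dict.txt',
    unk_token='[UNK]',
    pad_token='[PAD]')
tokenizer = JiebaTokenizer(vocab)

sentence = '我爱你中国'
# Accurate mode (the default): the most accurate segmentation.
print(tokenizer.cut(sentence))
# Full mode: all possible words found in the sentence, fast but not accurate.
print(tokenizer.cut(sentence, cut_all=True))
# Accurate mode with the HMM model disabled.
print(tokenizer.cut(sentence, use_hmm=False))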
encode(sentence, cut_all=False, use_hmm=True)
The method used to convert the text to ids. It first calls the cut() method to cut the text into tokens, then converts the tokens to ids using vocab.
- Parameters
sentence (str) -- The text that needs to be cut.
cut_all (bool, optional) -- Whether to use full mode. If True, full mode is used, which returns all possible words in the sentence; it is fast but not accurate. If False, accurate mode is used, which attempts to cut the sentence into the most accurate segmentation and is suitable for text analysis. Default: False.
use_hmm (bool, optional) -- Whether to use the HMM model. Default: True.
- Returns
A list of ids.
- Return type
list[int]
Examples
from paddlenlp.data import Vocab, JiebaTokenizer

# The vocab file. The sample file can be downloaded first:
# wget https://bj.bcebos.com/paddlenlp/data/senta_word_dict.txt
vocab_file_path = './senta_word_dict.txt'

# Initialize the Vocab.
vocab = Vocab.load_vocabulary(
    vocab_file_path,
    unk_token='[UNK]',
    pad_token='[PAD]')

tokenizer = JiebaTokenizer(vocab)
ids = tokenizer.encode('我爱你中国')
print(ids)
# [1170578, 575565]
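As described above, encode() first cuts the sentence and then maps the resulting tokens to ids through vocab. The lookup can also be done by hand; a minimal sketch, assuming paddlenlp.data.Vocab exposes a to_indices() method for token-to-id conversion (an assumption about the Vocab API, not part of this class):

from paddlenlp.data import Vocab, JiebaTokenizer

vocab = Vocab.load_vocabulary(
    './senta_word_dict.txt',
    unk_token='[UNK]',
    pad_token='[PAD]')
tokenizer = JiebaTokenizer(vocab)

sentence = '我爱你中国'
tokens = tokenizer.cut(sentence)   # ['我爱你', '中国']
ids = tokenizer.encode(sentence)   # [1170578, 575565]

# Assumption: vocab.to_indices(tokens) maps each token to its id, so the
# manual lookup should agree with encode() for the same arguments.
print(vocab.to_indices(tokens) == ids)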