vocab

class Vocab(counter=None, max_size=None, min_freq=1, token_to_idx=None, unk_token=None, pad_token=None, bos_token=None, eos_token=None, **kwargs)

Bases: object

Converts between tokens and their ids. It also provides functions for saving and loading the vocabulary.

Parameters:
  • counter (collections.Counter, optional) – A Counter instance that holds the tokens and their frequencies. Its keys are indexed in order of frequency to construct the token-to-index mapping. If None, token_to_idx must be provided as the mapping. Default: None.

  • max_size (int, optional) – Max size of vocab, not including special tokens. Default: None.

  • min_freq (int, optional) – Ignore tokens whose frequencies are less than min_freq. Default: 1.

  • token_to_idx (dict, optional) – A dict that specifies the mapping between tokens and indices to be used. If provided, the token-to-index mapping is adjusted according to it. If None, counter must be provided. Default: None.

  • unk_token (str, optional) – Special token for unknown tokens. Can be None if not needed. Default: None.

  • pad_token (str, optional) – Special token for padding. Can be None if not needed. Default: None.

  • bos_token (str, optional) – Special token for the beginning of a sequence (bos). Can be None if not needed. Default: None.

  • eos_token (str, optional) – Special token for the end of a sequence (eos). Can be None if not needed. Default: None.

  • kwargs (dict) – Keyword arguments ending with _token. They can be used to specify further special tokens, each of which is exposed as an attribute of the vocabulary and associated with an index.
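
The way counter, max_size, min_freq and the special tokens interact can be illustrated with a minimal, standard-library sketch. It is not PaddleNLP's implementation: index_tokens is a hypothetical helper, and both the placement of special tokens at the lowest indices and the alphabetical tie-breaking are assumptions made to keep the result deterministic.

```python
from collections import Counter


def index_tokens(counter, max_size=None, min_freq=1, special_tokens=()):
    """Hypothetical helper: index special tokens first, then the remaining
    tokens in descending order of frequency."""
    token_to_idx = {tok: i for i, tok in enumerate(special_tokens)}
    # Highest frequency first; ties broken alphabetically (an assumption).
    ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = 0
    for token, freq in ranked:
        # Stop once tokens are too rare or the size budget is used up.
        if freq < min_freq or (max_size is not None and kept >= max_size):
            break
        if token in token_to_idx:
            continue  # special tokens already have an index
        token_to_idx[token] = len(token_to_idx)
        kept += 1
    return token_to_idx


counter = Counter("a b b c c c d".split())
mapping = index_tokens(counter, max_size=3, special_tokens=("[PAD]", "[UNK]"))
print(mapping)
# {'[PAD]': 0, '[UNK]': 1, 'c': 2, 'b': 3, 'a': 4}
```

In this sketch max_size counts only the regular tokens, which matches the "not including special tokens" wording of the parameter description.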

to_tokens(indices)

Maps the input indices to tokens.

Parameters:

indices (int|list[int]|tuple[int]|numpy.ndarray) – The input index or indices to map. Must be an int or a 1-D list[int], tuple[int] or numpy.ndarray.

Returns:

The obtained token(s). If indices is an integer, a str is returned. If indices is a list/tuple of integers, a list of str is returned.

Return type:

str|list[str]

Example

from paddlenlp.data import Vocab
# The vocab file. Download the sample file first:
# wget https://bj.bcebos.com/paddlenlp/data/senta_word_dict.txt
vocab_file_path = './senta_word_dict.txt'
# Initialize the Vocab
vocab = Vocab.load_vocabulary(
    vocab_file_path,
    unk_token='[UNK]',
    pad_token='[PAD]')
tokens = vocab.to_tokens([0, 1, 2, 3])
print(tokens)
# ['[PAD]', '[UNK]', '一斤三', '意面屋']
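
The int-versus-sequence behavior described under Returns can be sketched without PaddleNLP as follows. The function below is a hypothetical stand-in that operates on a plain dict; it only illustrates the dispatch and does not handle numpy.ndarray input.

```python
def to_tokens(idx_to_token, indices):
    """Hypothetical stand-in: look up one index or a 1-D sequence of indices."""
    if isinstance(indices, int):
        return idx_to_token[indices]            # single int -> single str
    return [idx_to_token[i] for i in indices]   # list/tuple -> list of str


idx_to_token = {0: "[PAD]", 1: "[UNK]", 2: "hello", 3: "world"}
print(to_tokens(idx_to_token, 2))        # hello
print(to_tokens(idx_to_token, (0, 3)))   # ['[PAD]', 'world']
```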

to_indices(tokens)

Maps the input tokens into indices.

Parameters:

tokens (str|list[str]|tuple[str], optional) – The input token(s) for mapping.

Returns:

The obtained index or indices. If tokens is a str, an integer is returned. If tokens is a list/tuple of str, a list of integers is returned.

Return type:

int|list[int]

Example

from paddlenlp.data import Vocab
# The vocab file. Download the sample file first:
# wget https://bj.bcebos.com/paddlenlp/data/senta_word_dict.txt
vocab_file_path = './senta_word_dict.txt'
# Initialize the Vocab
vocab = Vocab.load_vocabulary(
    vocab_file_path,
    unk_token='[UNK]',
    pad_token='[PAD]')
indices = vocab.to_indices(['[PAD]', '[UNK]', '一斤三', '意面屋'])
print(indices)
# [0, 1, 2, 3]
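
This section does not spell out what happens to tokens that are missing from the vocabulary. The sketch below assumes they fall back to the index of unk_token when one is set, and models that with a defaultdict. The names lookup and to_indices are hypothetical stand-ins, not the library's code.

```python
from collections import defaultdict

token_to_idx = {"[PAD]": 0, "[UNK]": 1, "hello": 2, "world": 3}
unk_index = token_to_idx["[UNK]"]

# A defaultdict returns unk_index for any token that was never registered.
lookup = defaultdict(lambda: unk_index, token_to_idx)


def to_indices(tokens):
    """Hypothetical stand-in: map one token or a list/tuple of tokens."""
    if isinstance(tokens, str):
        return lookup[tokens]
    return [lookup[t] for t in tokens]


print(to_indices("world"))               # 3
print(to_indices(["hello", "missing"]))  # [2, 1]
```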

__call__(tokens)

Maps the input tokens into indices. It behaves the same as the to_indices() method.

See to_indices for details.

to_json(path=None)

Summarizes the vocab information as a JSON string. If path is given, the JSON string is also saved to that file. Both the JSON string and the saved file can be used to reconstruct the Vocab by calling the from_json() method.

Parameters:

path (str, optional) – The path to save the JSON string to. If None, the JSON string is not saved. Default: None.

Returns:

The JSON string containing the vocab information.

Return type:

str

Example

from paddlenlp.data import Vocab
# The vocab file. Download the sample file first:
# wget https://bj.bcebos.com/paddlenlp/data/senta_word_dict.txt
vocab_file_path = './senta_word_dict.txt'
# Initialize the Vocab
vocab = Vocab.load_vocabulary(
    vocab_file_path,
    unk_token='[UNK]',
    pad_token='[PAD]')
json_str = vocab.to_json(path='./vocab.json')

classmethod from_json(json_str)

Loads a Vocab from a JSON string or JSON file produced by the to_json() method.

Parameters:

json_str (str) – A JSON string, or the path of a file that contains one.

Returns:

An instance of Vocab generated from the information contained in the JSON string.

Return type:

Vocab

Example

from paddlenlp.data import Vocab
# The vocab file. Download the sample file first:
# wget https://bj.bcebos.com/paddlenlp/data/senta_word_dict.txt
vocab_file_path = './senta_word_dict.txt'
# Initialize the Vocab
vocab = Vocab.load_vocabulary(
    vocab_file_path,
    unk_token='[UNK]',
    pad_token='[PAD]')
json_str = vocab.to_json(path='./vocab.json')

vocab1 = Vocab.from_json(json_str)
vocab2 = Vocab.from_json('./vocab.json')
print(len(vocab), len(vocab1), len(vocab2))
# 1256608 1256608 1256608
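
The round trip between to_json() and from_json(), including the "JSON string or file path" argument, can be sketched with the standard library alone. The helpers dump_mapping and load_mapping are hypothetical, and so are the JSON layout (a single "token_to_idx" key) and the use of os.path.isfile to tell a path from a string; the real serialized format may differ.

```python
import json
import os
import tempfile


def dump_mapping(token_to_idx, path=None):
    """Hypothetical helper: serialize the mapping, optionally writing a file."""
    json_str = json.dumps({"token_to_idx": token_to_idx}, ensure_ascii=False)
    if path is not None:
        with open(path, "w", encoding="utf-8") as f:
            f.write(json_str)
    return json_str


def load_mapping(json_str_or_path):
    """Hypothetical helper: accept either a JSON string or a file path."""
    if os.path.isfile(json_str_or_path):
        with open(json_str_or_path, encoding="utf-8") as f:
            return json.load(f)["token_to_idx"]
    return json.loads(json_str_or_path)["token_to_idx"]


mapping = {"[PAD]": 0, "[UNK]": 1, "一斤三": 2}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocab.json")
    json_str = dump_mapping(mapping, path)
    from_string = load_mapping(json_str)   # reconstructed from the string
    from_file = load_mapping(path)         # reconstructed from the file

print(from_string == from_file == mapping)  # True
```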

classmethod from_dict(token_to_idx, unk_token=None, pad_token=None, bos_token=None, eos_token=None, **kwargs)

Builds the Vocab from a dict.

Parameters:
  • token_to_idx (dict) – A dict describing the mapping between tokens and indices.

  • unk_token (str, optional) – The special token for unknown tokens. Can be None if not needed. Default: None.

  • pad_token (str, optional) – The special token for padding. Can be None if not needed. Default: None.

  • bos_token (str, optional) – The special token for the beginning of a sequence (bos). Can be None if not needed. Default: None.

  • eos_token (str, optional) – The special token for the end of a sequence (eos). Can be None if not needed. Default: None.

  • kwargs (dict) – Keyword arguments ending with _token. They can be used to specify further special tokens, each of which is exposed as an attribute of the vocabulary and associated with an index.

Returns:

An instance of Vocab generated from the given dict and special tokens.

Return type:

Vocab

Example

from paddlenlp.data import Vocab
# The vocab file. Download the sample file first:
# wget https://bj.bcebos.com/paddlenlp/data/senta_word_dict.txt
vocab_file_path = './senta_word_dict.txt'
# Initialize the Vocab
vocab = Vocab.load_vocabulary(
    vocab_file_path,
    unk_token='[UNK]',
    pad_token='[PAD]')

vocab1 = Vocab.from_dict(vocab.token_to_idx)
print(len(vocab), len(vocab.token_to_idx), len(vocab1))
# 1256608 1256608 1256608

static build_vocab(iterator, max_size=None, min_freq=1, token_to_idx=None, unk_token=None, pad_token=None, bos_token=None, eos_token=None, **kwargs)

Builds the Vocab from the given iterator and other information. It first iterates over the iterator to construct a collections.Counter, which is then used to initialize the Vocab.

Parameters:
  • iterator (collections.Iterable) – Iterator of tokens. Each element should be a list of tokens if a word-level vocab is needed.

  • max_size (int, optional) – The max size of vocab, not including special tokens. Default: None.

  • min_freq (int, optional) – Ignore tokens whose frequencies are less than min_freq. Default: 1.

  • token_to_idx (dict, optional) – A dict that specifies the mapping between tokens and indices to be used. If provided, the token-to-index mapping is adjusted according to it. If None, the mapping is derived from the token frequencies alone. Default: None.

  • unk_token (str, optional) – The special token for unknown tokens, e.g. ‘<unk>’. Can be None if not needed. Default: None.

  • pad_token (str, optional) – The special token for padding, e.g. ‘<pad>’. Can be None if not needed. Default: None.

  • bos_token (str, optional) – The special token for the beginning of a sequence, e.g. ‘<bos>’. Can be None if not needed. Default: None.

  • eos_token (str, optional) – The special token for the end of a sequence, e.g. ‘<eos>’. Can be None if not needed. Default: None.

  • kwargs (dict) – Keyword arguments ending with _token. They can be used to specify further special tokens, each of which is exposed as an attribute of the vocabulary and associated with an index.

Returns:

An instance of Vocab generated from the given iterator and other information.

Return type:

Vocab

Example

from paddlenlp.data import Vocab
# The vocab file. Download the sample file first:
# wget https://bj.bcebos.com/paddlenlp/data/senta_word_dict.txt
vocab_file_path = './senta_word_dict.txt'
# Initialize the Vocab
vocab = Vocab.load_vocabulary(
    vocab_file_path,
    unk_token='[UNK]',
    pad_token='[PAD]')

vocab1 = Vocab.build_vocab([list(vocab.token_to_idx.keys())])
print(len(vocab), len(vocab1))
# 1256608 1256608
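
The counting step that build_vocab performs before initializing the Vocab can be illustrated with the standard library. This is a sketch of the idea only; the toy sentences are made up, and flattening with itertools.chain is a choice of this example rather than a statement about the library's code.

```python
from collections import Counter
from itertools import chain

# Each element of the iterator is a list of tokens (one tokenized sentence).
sentences = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]

# Flatten the sentences and count every token.
counter = Counter(chain.from_iterable(sentences))
print(counter.most_common(2))
# [('the', 3), ('sat', 2)]
```

If the elements were plain strings instead of token lists, the same counting would operate on single characters, which is why a word-level vocab needs each element to be a list of tokens.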

static load_vocabulary(filepath, unk_token=None, pad_token=None, bos_token=None, eos_token=None, **kwargs)

Builds the Vocab from a file, preserving all tokens, by calling the Vocab.from_dict() method. The file contains one token per line, and the line index is the index of the corresponding token.

Parameters:
  • filepath (str) – The path of the file to construct the vocabulary from.

  • unk_token (str, optional) – The special token for unknown tokens. Can be None if not needed. Default: None.

  • pad_token (str, optional) – The special token for padding. Can be None if not needed. Default: None.

  • bos_token (str, optional) – The special token for the beginning of a sequence (bos). Can be None if not needed. Default: None.

  • eos_token (str, optional) – The special token for the end of a sequence (eos). Can be None if not needed. Default: None.

  • kwargs (dict) – Keyword arguments ending with _token. They can be used to specify further special tokens, each of which is exposed as an attribute of the vocabulary and associated with an index.

Returns:

An instance of Vocab generated from the given file.

Return type:

Vocab

Example

from paddlenlp.data import Vocab
# The vocab file. Download the sample file first:
# wget https://bj.bcebos.com/paddlenlp/data/senta_word_dict.txt
vocab_file_path = './senta_word_dict.txt'
# Initialize the Vocab
vocab = Vocab.load_vocabulary(
    vocab_file_path,
    unk_token='[UNK]',
    pad_token='[PAD]')
print(len(vocab))
# 1256608
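
The "one token per line, line index is the token index" file format can be sketched as follows. read_vocab_file is a hypothetical helper, and zero-based line numbering is an assumption, chosen because the to_tokens example maps index 0 to '[PAD]'. The resulting dict is the kind of mapping that would then be handed to from_dict().

```python
import os
import tempfile


def read_vocab_file(filepath):
    """Hypothetical helper: map each line's token to its zero-based line index."""
    token_to_idx = {}
    with open(filepath, encoding="utf-8") as f:
        for index, line in enumerate(f):
            token_to_idx[line.rstrip("\n")] = index
    return token_to_idx


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocab.txt")
    # Write a tiny vocab file: one token per line.
    with open(path, "w", encoding="utf-8") as f:
        f.write("[PAD]\n[UNK]\nhello\nworld\n")
    token_to_idx = read_vocab_file(path)

print(token_to_idx)
# {'[PAD]': 0, '[UNK]': 1, 'hello': 2, 'world': 3}
```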

save_vocabulary(filepath)

Saves the Vocab to the specified file. It can be reloaded by calling load_vocabulary.

Parameters:

filepath (str) – The path of the file to save the vocabulary to.
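
For a saved file to be reloadable in the one-token-per-line format, the tokens have to be written in index order. The sketch below shows that idea with a hypothetical write_vocab_file helper; it assumes the indices are contiguous and start at 0, and it is not the library's implementation.

```python
import os
import tempfile


def write_vocab_file(token_to_idx, filepath):
    """Hypothetical helper: write one token per line, ordered by index."""
    ordered = sorted(token_to_idx, key=token_to_idx.get)
    with open(filepath, "w", encoding="utf-8") as f:
        for token in ordered:
            f.write(token + "\n")


token_to_idx = {"world": 3, "[PAD]": 0, "hello": 2, "[UNK]": 1}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocab.txt")
    write_vocab_file(token_to_idx, path)
    with open(path, encoding="utf-8") as f:
        lines = f.read().splitlines()

print(lines)
# ['[PAD]', '[UNK]', 'hello', 'world']
```

Reading the lines back with their line numbers reproduces the original mapping, which is what makes the save and load steps symmetric.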