token_embedding#

list_embedding_name()[source]#

Lists the names of all pre-trained embedding models that paddlenlp provides.

class TokenEmbedding(embedding_name='w2v.baidu_encyclopedia.target.word-word.dim300', unknown_token='[UNK]', unknown_token_vector=None, extended_vocab_path=None, trainable=True, keep_extended_vocab_only=False)[source]#

Bases: Embedding

A TokenEmbedding can load a pre-trained embedding model provided by paddlenlp by specifying its embedding name. Furthermore, a TokenEmbedding can load an extended vocabulary by specifying extended_vocab_path.

Parameters:
  • embedding_name (str, optional) -- The pre-trained embedding model name. Use paddlenlp.embeddings.list_embedding_name() to list the names of all embedding models that we provide. Defaults to w2v.baidu_encyclopedia.target.word-word.dim300.

  • unknown_token (str, optional) -- Specifies the unknown token. Defaults to [UNK].

  • unknown_token_vector (list, optional) -- The vector used to initialize the unknown token. If None, the unknown token's vector is initialized from a normal distribution. Defaults to None.

  • extended_vocab_path (str, optional) -- The file path of the extended vocabulary. Defaults to None.

  • trainable (bool, optional) -- Whether the embedding weights can be trained. Defaults to True.

  • keep_extended_vocab_only (bool, optional) -- Whether to keep only the extended vocabulary; effective only when extended_vocab_path is provided. Defaults to False.

set_trainable(trainable)[source]#

Sets whether the weights of the token embedding are trainable.

Parameters:

trainable (bool) -- If True, the weights can be trained; if False, the weights are fixed.

search(words)[source]#

Gets the vectors of the specified words.

Parameters:

words (list or str or int) -- The words to be searched.

Returns:

The vectors of the specified words.

Return type:

numpy.array

Example

from paddlenlp.embeddings import TokenEmbedding

embed = TokenEmbedding()
vector = embed.search('Welcome to use PaddlePaddle and PaddleNLP!')
get_idx_from_word(word)[source]#

Gets the index of the specified word by searching the word_to_idx dict.

Parameters:

word (list or str or int) -- The input token word whose index to look up.

Returns:

The index of the specified word.

Return type:

int

get_idx_list_from_words(words)[source]#

Gets the indices of the specified words by searching the word_to_idx dict.

Parameters:

words (list or str or int) -- The input token words whose indices to look up.

Returns:

The index list of the specified words.

Return type:

list

Example

from paddlenlp.embeddings import TokenEmbedding

embed = TokenEmbedding()
index = embed.get_idx_from_word('Welcome to use PaddlePaddle and PaddleNLP!')
# 635963
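The lookup behind these two methods can be sketched in plain Python. The toy word_to_idx dict and the [UNK] fallback below are illustrative assumptions for the sketch, not the real vocabulary or paddlenlp's actual implementation:

```python
# Minimal sketch of index lookup with an unknown-token fallback.
# The vocabulary below is a toy assumption, not paddlenlp's real one.
word_to_idx = {"[PAD]": 0, "[UNK]": 1, "paddle": 2, "nlp": 3}

def get_idx_from_word(word, word_to_idx, unk_token="[UNK]"):
    # Fall back to the unknown token's index for out-of-vocabulary words.
    return word_to_idx.get(word, word_to_idx[unk_token])

def get_idx_list_from_words(words, word_to_idx, unk_token="[UNK]"):
    # Accept a single word or a list of words, mirroring the API above.
    if isinstance(words, str):
        words = [words]
    return [get_idx_from_word(w, word_to_idx, unk_token) for w in words]

print(get_idx_list_from_words(["paddle", "unseen-word", "nlp"], word_to_idx))
# [2, 1, 3]
```

Note that out-of-vocabulary words all map to the unknown token's index, which is why extending the vocabulary via extended_vocab_path matters for domain-specific text.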
dot(word_a, word_b)[source]#

Calculates the dot product of two words. The dot product (scalar product) is an algebraic operation that takes two equal-length sequences of numbers (usually coordinate vectors) and returns a single number.

Parameters:
  • word_a (str) -- The first word string.

  • word_b (str) -- The second word string.

Returns:

The dot product of the two words.

Return type:

float

Example

from paddlenlp.embeddings import TokenEmbedding

embed = TokenEmbedding()
dot_product = embed.dot('PaddlePaddle', 'PaddleNLP!')
# 0.11827179
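The arithmetic behind this method can be sketched in plain Python on two hypothetical equal-length vectors. The values are illustrative, not real embedding entries, and real models use much higher dimensions (e.g. 300):

```python
# Dot product of two equal-length vectors: the sum of element-wise products.
def dot(vec_a, vec_b):
    assert len(vec_a) == len(vec_b), "vectors must have equal length"
    return sum(a * b for a, b in zip(vec_a, vec_b))

# Hypothetical 4-dimensional word vectors, for illustration only.
vec_a = [0.1, 0.2, -0.3, 0.4]
vec_b = [0.5, -0.1, 0.2, 0.3]

print(dot(vec_a, vec_b))  # ≈ 0.09
```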
cosine_sim(word_a, word_b)[source]#

Calculates the cosine similarity of two word vectors. Cosine similarity is the cosine of the angle between two n-dimensional vectors in an n-dimensional space.

Parameters:
  • word_a (str) -- The first word string.

  • word_b (str) -- The second word string.

Returns:

The cosine similarity of the two words.

Return type:

float

Example

from paddlenlp.embeddings import TokenEmbedding

embed = TokenEmbedding()
similarity = embed.cosine_sim('PaddlePaddle', 'PaddleNLP!')
# 0.99999994
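The formula can likewise be sketched in plain Python: cosine similarity is the dot product divided by the product of the two vectors' Euclidean norms, so it depends only on the angle between the vectors, not their magnitudes. The vectors below are illustrative assumptions:

```python
import math

# Cosine similarity of two equal-length vectors:
# cos(theta) = (a . b) / (||a|| * ||b||)
def cosine_sim(vec_a, vec_b):
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b)

# Hypothetical 4-dimensional word vector, for illustration only.
vec = [0.1, 0.2, -0.3, 0.4]
print(cosine_sim(vec, vec))                    # identical direction: ≈ 1.0
print(cosine_sim(vec, [-x for x in vec]))      # opposite direction: ≈ -1.0
```

This explains the value near 1.0 in the example above: the two word vectors point in almost the same direction in the embedding space.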