token_embedding#

list_embedding_name()[source]#

Lists the names of all pre-trained embedding models that PaddleNLP provides.
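
Examples

A minimal sketch of listing the available embedding names (the exact contents of the returned list depend on your paddlenlp version):

from paddlenlp.embeddings import list_embedding_name

names = list_embedding_name()
print(names)
# e.g. ['w2v.baidu_encyclopedia.target.word-word.dim300', ...]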

class TokenEmbedding(embedding_name='w2v.baidu_encyclopedia.target.word-word.dim300', unknown_token='[UNK]', unknown_token_vector=None, extended_vocab_path=None, trainable=True, keep_extended_vocab_only=False)[source]#

Bases: Embedding

A TokenEmbedding can load a pre-trained embedding model provided by PaddleNLP by specifying its embedding name. Furthermore, a TokenEmbedding can load an extended vocabulary by specifying extended_vocab_path.

Parameters:
  • embedding_name (str, optional) – The pre-trained embedding model name. Use paddlenlp.embeddings.list_embedding_name() to list the names of all embedding models that we provide. Defaults to w2v.baidu_encyclopedia.target.word-word.dim300.

  • unknown_token (str, optional) – Specifies the unknown token. Defaults to [UNK].

  • unknown_token_vector (list, optional) – The vector used to initialize the unknown token. If it is None, the unknown token's vector is initialized from a normal distribution. Defaults to None.

  • extended_vocab_path (str, optional) – The file path of extended vocabulary. Defaults to None.

  • trainable (bool, optional) – Whether the embedding weights are trainable. Defaults to True.

  • keep_extended_vocab_only (bool, optional) – Whether to keep only the extended vocabulary; effective only when extended_vocab_path is provided. Defaults to False.
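
Examples

A minimal sketch of constructing a TokenEmbedding, with and without an extended vocabulary (vocab.txt is a hypothetical file with one token per line):

from paddlenlp.embeddings import TokenEmbedding

# Load the default pre-trained embedding.
embed = TokenEmbedding(embedding_name='w2v.baidu_encyclopedia.target.word-word.dim300')

# Load the same embedding extended with a custom vocabulary;
# 'vocab.txt' is a hypothetical path used here for illustration.
extended_embed = TokenEmbedding(
    extended_vocab_path='vocab.txt',
    trainable=True,
)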

set_trainable(trainable)[source]#

Sets whether the weights of the token embedding are trainable.

Parameters:

trainable (bool) – If True, the weights can be updated during training; if False, the weights are fixed.
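
Examples

A minimal sketch of freezing and unfreezing the embedding weights:

from paddlenlp.embeddings import TokenEmbedding

embed = TokenEmbedding()
embed.set_trainable(False)  # freeze the weights
embed.set_trainable(True)   # make the weights trainable again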

search(words)[source]#

Gets the vectors of the specified words.

Parameters:

words (list or str or int) – The words to look up: a single word (str), a word index (int), or a list of them.

Returns:

The vectors of the specified words.

Return type:

numpy.ndarray

Examples

from paddlenlp.embeddings import TokenEmbedding

embed = TokenEmbedding()
# Look up the vector of a single token.
vector = embed.search('Welcome to use PaddlePaddle and PaddleNLP!')

get_idx_from_word(word)[source]#

Gets the index of the specified word by searching the word_to_idx dict.

Parameters:

word (list or str or int) – The input word to convert to a token index.

Returns:

The index of the specified word.

Return type:

int
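
Examples

from paddlenlp.embeddings import TokenEmbedding

embed = TokenEmbedding()
index = embed.get_idx_from_word('Welcome to use PaddlePaddle and PaddleNLP!')
# 635963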

get_idx_list_from_words(words)[source]#

Gets the indices of the specified words by searching the word_to_idx dict.

Parameters:

words (list or str or int) – The input words to convert to token indices.

Returns:

The list of indices of the specified words.

Return type:

list

Examples

from paddlenlp.embeddings import TokenEmbedding

embed = TokenEmbedding()
# Returns a list of token indices, one per input word.
indices = embed.get_idx_list_from_words(['PaddlePaddle', 'PaddleNLP'])

dot(word_a, word_b)[source]#

Calculates the dot product of the vectors of two words. The dot product (scalar product) is an algebraic operation that takes two equal-length sequences of numbers (usually coordinate vectors) and returns a single number.

Parameters:
  • word_a (str) – The first word string.

  • word_b (str) – The second word string.

Returns:

The dot product of the two word vectors.

Return type:

float

Examples

from paddlenlp.embeddings import TokenEmbedding

embed = TokenEmbedding()
dot_product = embed.dot('PaddlePaddle', 'PaddleNLP!')
# 0.11827179

cosine_sim(word_a, word_b)[source]#

Calculates the cosine similarity of 2 word vectors. Cosine similarity is the cosine of the angle between two n-dimensional vectors in an n-dimensional space.

Parameters:
  • word_a (str) – The first word string.

  • word_b (str) – The second word string.

Returns:

The cosine similarity of 2 words.

Return type:

float

Examples

from paddlenlp.embeddings import TokenEmbedding

embed = TokenEmbedding()
similarity = embed.cosine_sim('PaddlePaddle', 'PaddleNLP!')
# 0.99999994