token_embedding
- class TokenEmbedding(embedding_name='w2v.baidu_encyclopedia.target.word-word.dim300', unknown_token='[UNK]', unknown_token_vector=None, extended_vocab_path=None, trainable=True, keep_extended_vocab_only=False) [source]

Bases: Embedding

A TokenEmbedding can load a pre-trained embedding model provided by PaddleNLP by specifying an embedding name. Furthermore, a TokenEmbedding can load an extended vocabulary by specifying extended_vocab_path.

Parameters:
- embedding_name (str, optional) -- The pre-trained embedding model name. Use paddlenlp.embeddings.list_embedding_name() to list the names of all embedding models that we provide. Defaults to w2v.baidu_encyclopedia.target.word-word.dim300.
- unknown_token (str, optional) -- Specifies the unknown token. Defaults to [UNK].
- unknown_token_vector (list, optional) -- The vector used to initialize the unknown token. If None, the unknown token's vector is initialized from a normal distribution. Defaults to None.
- extended_vocab_path (str, optional) -- The file path of the extended vocabulary. Defaults to None.
- trainable (bool, optional) -- Whether the embedding weights are trainable. Defaults to True.
- keep_extended_vocab_only (bool, optional) -- Whether to keep only the extended vocabulary; effective only when extended_vocab_path is provided. Defaults to False.
- set_trainable(trainable) [source]

Sets whether the weights of the token embedding are trainable.

Parameters:
- trainable (bool) -- The weights can be trained if trainable is True, or are fixed if trainable is False.
- search(words) [source]

Gets the vectors of the specified words.

Parameters:
- words (list or str or int) -- The words to be searched.

Returns:
The vectors of the specified words.

Return type:
numpy.array
Examples

    from paddlenlp.embeddings import TokenEmbedding

    embed = TokenEmbedding()
    vector = embed.search('Welcome to use PaddlePaddle and PaddleNLP!')
- get_idx_from_word(word) [source]

Gets the index of the specified word by searching the word_to_idx dict.

Parameters:
- word (list or str or int) -- The input token word whose index we want to get.

Returns:
The index of the specified word.

Return type:
int
- get_idx_list_from_words(words) [source]

Gets the index list of the specified words by searching the word_to_idx dict.

Parameters:
- words (list or str or int) -- The input token words whose indices we want to get.

Returns:
The index list of the specified words.

Return type:
list

Examples

    from paddlenlp.embeddings import TokenEmbedding

    embed = TokenEmbedding()
    index = embed.get_idx_from_word('Welcome to use PaddlePaddle and PaddleNLP!')
    # 635963
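The lookup described above can be sketched in plain Python with a toy vocabulary (the word_to_idx dict and indices below are hypothetical, not the ones built by a real TokenEmbedding): out-of-vocabulary words fall back to the unknown token's index.

```python
# A minimal sketch of the word_to_idx lookup with a made-up vocabulary.
word_to_idx = {'[PAD]': 0, '[UNK]': 1, 'paddle': 2, 'nlp': 3}
unknown_token = '[UNK]'

def get_idx_from_word(word: str) -> int:
    # Look the word up; unseen words map to the unknown token's index.
    return word_to_idx.get(word, word_to_idx[unknown_token])

def get_idx_list_from_words(words) -> list:
    # Accept a single word or a list of words.
    if isinstance(words, str):
        words = [words]
    return [get_idx_from_word(w) for w in words]

print(get_idx_list_from_words(['paddle', 'nlp', 'unseen']))  # [2, 3, 1]
```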
- dot(word_a, word_b) [source]

Calculates the dot product of two words. The dot product (scalar product) is an algebraic operation that takes two equal-length sequences of numbers (usually coordinate vectors) and returns a single number.

Parameters:
- word_a (str) -- The first word string.
- word_b (str) -- The second word string.

Returns:
The dot product of the two words.

Return type:
float

Examples

    from paddlenlp.embeddings import TokenEmbedding

    embed = TokenEmbedding()
    dot_product = embed.dot('PaddlePaddle', 'PaddleNLP!')
    # 0.11827179
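The underlying computation is a plain vector dot product over the two words' embedding vectors. A minimal NumPy sketch with toy vectors (made-up 3-dimensional values for illustration; the default model here produces 300-dimensional vectors):

```python
import numpy as np

# Toy "word vectors" standing in for the two words' embeddings.
vec_a = np.array([1.0, 2.0, 3.0])
vec_b = np.array([4.0, 5.0, 6.0])

# Dot product: the sum of element-wise products.
dot_product = float(np.dot(vec_a, vec_b))
print(dot_product)  # 32.0
```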
- cosine_sim(word_a, word_b) [source]

Calculates the cosine similarity of 2 word vectors. Cosine similarity is the cosine of the angle between two n-dimensional vectors in an n-dimensional space.

Parameters:
- word_a (str) -- The first word string.
- word_b (str) -- The second word string.

Returns:
The cosine similarity of the two words.

Return type:
float

Examples

    from paddlenlp.embeddings import TokenEmbedding

    embed = TokenEmbedding()
    cosine_simi = embed.cosine_sim('PaddlePaddle', 'PaddleNLP!')
    # 0.99999994
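The formula behind this method is the dot product of the two embedding vectors divided by the product of their norms. A minimal NumPy sketch with toy 2-dimensional vectors (hypothetical values, not real embeddings):

```python
import numpy as np

def cosine_sim(vec_a: np.ndarray, vec_b: np.ndarray) -> float:
    # Cosine similarity: dot product over the product of the norms,
    # i.e. the cosine of the angle between the two vectors.
    return float(np.dot(vec_a, vec_b) /
                 (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
print(cosine_sim(a, a))            # 1.0 (identical direction)
print(round(cosine_sim(a, b), 4))  # 0.7071 (45-degree angle)
```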