utils¶
-
download_file
(save_dir, filename, url, md5=None)[source]¶ Download the file from the URL to the specified directory. If the file already exists, its md5 value is checked: when it matches the given md5, the existing file is reused; otherwise the file is downloaded again from the URL.
- Parameters
save_dir (string) – The directory in which the file is saved.
filename (string) – The filename under which the file is saved.
url (string) – The URL to download the file from.
md5 (string, optional) – The md5 value used to check the version of the downloaded file.
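The md5 check described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the library's actual implementation; `file_md5` and `needs_download` are hypothetical helper names, and reusing an existing file when no md5 is given is an assumption.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the hex md5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(save_dir, filename, md5=None):
    """Hypothetical helper: True if the file is missing or its md5 differs."""
    path = os.path.join(save_dir, filename)
    if not os.path.exists(path):
        return True
    if md5 is None:
        # Assumption: without a reference md5, an existing file is reused.
        return False
    return file_md5(path) != md5
```

A caller would download from `url` only when `needs_download(...)` returns True.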
-
download_check
(task)[source]¶ Check the resource status in the specified task.
- Parameters
task (string) – The name of the specified task.
-
cut_chinese_sent
(para)[source]¶ Split a Chinese paragraph into sentences more precisely; see “https://blog.csdn.net/blmoistawinde/article/details/82379256” for reference.
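As a rough illustration of this kind of rule-based splitting, the sketch below breaks a paragraph after sentence-ending punctuation while keeping a closing quote attached to its sentence. It is a simplified, assumed version and not the function's actual code; `split_chinese_sentences` is a hypothetical name.

```python
import re


def split_chinese_sentences(para):
    """Hypothetical splitter: break after 。！？ while respecting closing quotes."""
    # Break after an end mark that is not followed by a closing quote.
    para = re.sub(r"([。！？?])([^”’])", r"\1\n\2", para)
    # Break after a closing quote that directly follows an end mark.
    para = re.sub(r"([。！？?][”’])([^，。！？?])", r"\1\n\2", para)
    return para.rstrip().split("\n")


sentences = split_chinese_sentences("今天天气很好。我们去公园吧！他说：“好的。”然后就出发了。")
```

The two passes matter because a quoted sentence such as “好的。” should end after the closing quote, not before it.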
-
class
TermTreeNode
(sid: str, term: str, base: str, node_type: str = 'term', term_type: Optional[str] = None, hyper: Optional[str] = None, level: Optional[int] = None, alias: Optional[List[str]] = None, alias_ext: Optional[List[str]] = None, sub_type: Optional[List[str]] = None, sub_term: Optional[List[str]] = None, data: Optional[Dict[str, Any]] = None)[source]¶ Bases:
object
Definition of a term node. All members are protected, to keep the data structure rigorous.
- Parameters
sid (str) – term id of node.
term (str) – term, common name of this term.
base (str) – cb indicates concept base, eb indicates entity base.
term_type (Optional[str], optional) – type of this term, constructs the hierarchy of term nodes. Defaults to None.
hyper (Optional[str], optional) – parent type of a type node. Defaults to None.
node_type (str, optional) – type statement of node, type or term. Defaults to “term”.
alias (Optional[List[str]], optional) – alias of this term. Defaults to None.
alias_ext (Optional[List[str]], optional) – extended alias of this term, CANNOT be used in matching. Defaults to None.
sub_type (Optional[List[str]], optional) – grouped by some term. Defaults to None.
sub_term (Optional[List[str]], optional) – some lower term. Defaults to None.
data (Optional[Dict[str, Any]], optional) – to store full information of a term. Defaults to None.
-
class
TermTree
[source]¶ Bases:
object
TermTree class.
-
add_term
(term: Optional[str] = None, base: Optional[str] = None, term_type: Optional[str] = None, sub_type: Optional[List[str]] = None, sub_term: Optional[List[str]] = None, alias: Optional[List[str]] = None, alias_ext: Optional[List[str]] = None, data: Optional[Dict[str, Any]] = None)[source]¶ Add a term into TermTree.
- Parameters
term (str) – common name of the term.
base (str) – whether the term is a concept or an entity.
term_type (str) – term type of this term.
sub_type (Optional[List[str]], optional) – sub type of this term, must exist in TermTree. Defaults to None.
sub_term (Optional[List[str]], optional) – sub terms of this term. Defaults to None.
alias (Optional[List[str]], optional) – alias of this term. Defaults to None.
alias_ext (Optional[List[str]], optional) – extended alias of this term. Defaults to None.
data (Optional[Dict[str, Any]], optional) – full information of the term. Defaults to None.
-
find_term
(term: str, term_type: Optional[str] = None) → Tuple[bool, Optional[List[str]]][source]¶ Find a term in the TermTree. If the term does not exist, return None. If term_type is not None, only terms of this type are searched.
- Parameters
term (str) – term to look up.
term_type (Optional[str], optional) – find term in this term_type. Defaults to None.
- Returns
The lookup result for the term; None if the term does not exist.
- Return type
Union[None, List[str]]
-
build_from_dir
(term_schema_path, term_data_path, linking=True)[source]¶ Build TermTree from a directory which should contain type schema and term data.
- Parameters
term_schema_path – path to the type schema.
term_data_path – path to the term data.
linking (optional) – Defaults to True.
-
classmethod
from_dir
(term_schema_path, term_data_path, linking) → paddlenlp.taskflow.utils.TermTree[source]¶ Build TermTree from a directory which should contain type schema and term data.
- Parameters
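term_schema_path – path to the type schema.
term_data_path – path to the term data.
linking – same parameter as in build_from_dir.
- Returns
the TermTree built from the given schema and data.
- Return type
paddlenlp.taskflow.utils.TermTree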
-
-
levenstein_distance
(s1: str, s2: str) → int[source]¶ Calculate the minimal Levenshtein distance between s1 and s2.
- Parameters
s1 (str) – string
s2 (str) – string
- Returns
the minimal distance.
- Return type
int
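For readers unfamiliar with the metric, a standard dynamic-programming formulation of the Levenshtein distance looks like the sketch below. It is a generic textbook version with a hypothetical name (`edit_distance`), not necessarily how levenstein_distance is implemented.

```python
def edit_distance(s1, s2):
    """Minimal number of insertions, deletions and substitutions from s1 to s2."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # keep the shorter string on the inner loop
    previous = list(range(len(s2) + 1))  # distances from "" to prefixes of s2
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            replace_cost = previous[j - 1] + (c1 != c2)
            current.append(min(insert_cost, delete_cost, replace_cost))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

Only two rows of the table are kept at a time, so memory use is proportional to the shorter string.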
-
class
BurkhardKellerNode
(word: str)[source]¶ Bases:
object
Node implementation for BK-Tree. A BK-Tree node stores the current word and its approximate words, as calculated by Levenshtein distance.
- Parameters
word (str) – word of current node.
-
class
BurkhardKellerTree
[source]¶ Bases:
object
Implementation of BK-Tree.
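The general idea of a BK-tree can be sketched as follows: each child is stored under its edit distance to the parent word, and the triangle inequality prunes subtrees during a fuzzy search. This is a generic illustration under assumed names (`BKNode`, `BKTree`, `add`, `search`), not the interface of the classes documented here.

```python
def edit_distance(s1, s2):
    """Levenshtein distance via a two-row dynamic program."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(current[j - 1] + 1,
                               previous[j] + 1,
                               previous[j - 1] + (c1 != c2)))
        previous = current
    return previous[-1]


class BKNode:
    def __init__(self, word):
        self.word = word
        self.children = {}  # edit distance to this word -> child node


class BKTree:
    def __init__(self):
        self.root = None

    def add(self, word):
        if self.root is None:
            self.root = BKNode(word)
            return
        node = self.root
        while True:
            dist = edit_distance(word, node.word)
            if dist == 0:
                return  # word is already stored
            child = node.children.get(dist)
            if child is None:
                node.children[dist] = BKNode(word)
                return
            node = child

    def search(self, word, max_dist):
        """Return sorted (distance, word) pairs within max_dist of `word`."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            node = stack.pop()
            dist = edit_distance(word, node.word)
            if dist <= max_dist:
                results.append((dist, node.word))
            # Triangle inequality: matches can only sit under edges whose
            # label lies in [dist - max_dist, dist + max_dist].
            for edge, child in node.children.items():
                if dist - max_dist <= edge <= dist + max_dist:
                    stack.append(child)
        return sorted(results)


words = ["book", "books", "cake", "boo", "cape", "cart", "boon", "cook"]
tree = BKTree()
for w in words:
    tree.add(w)
matches = tree.search("bok", 1)
```

The pruning step is what makes the tree useful: most subtrees are skipped without computing any distances inside them.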
-
get_id_and_prob
(span_set, offset_mapping)[source]¶ Return text id and probability of predicted spans
- Parameters
span_set (set) – set of predicted spans.
offset_mapping (list[int]) – list of pairs holding the start and end character indices in the original text pair (prompt + text) for each token.
- Returns
sentence_id (list[tuple]) – start and end character indices in the original text.
prob (list[float]) – probabilities of the predicted spans.
-
class
WordTagRelationExtractor
(schema)[source]¶ Bases:
object
Implementation of an information extractor.
-
class
DataCollatorGP
(tokenizer: paddlenlp.transformers.tokenizer_utils_base.PretrainedTokenizerBase, padding: Union[bool, str, paddlenlp.transformers.tokenizer_utils_base.PaddingStrategy] = True, max_length: Union[int, NoneType] = None, label_maps: Union[dict, NoneType] = None, task_type: Union[str, NoneType] = None)[source]¶ Bases:
object
-
class
DataCollatorForErnieCtm
(tokenizer: paddlenlp.transformers.tokenizer_utils_base.PretrainedTokenizerBase, padding: Union[bool, str, paddlenlp.transformers.tokenizer_utils_base.PaddingStrategy] = True, model: Union[str, NoneType] = 'wordtag')[source]¶ Bases:
object
-
class
DocSpan
(start, length)¶ Bases:
tuple
-
length
¶ Alias for field number 1
-
start
¶ Alias for field number 0
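The "Alias for field number N" wording and the tuple base class are what Python generates for `collections.namedtuple` types, so DocSpan, Example and Feature can presumably be used like any named tuple. A minimal sketch with a locally defined stand-in (the values are made up for illustration):

```python
import collections

# Local stand-in with the same field order as the documented DocSpan.
DocSpan = collections.namedtuple("DocSpan", ["start", "length"])

span = DocSpan(start=128, length=64)

# Fields are reachable by name or by position (field number).
assert span.start == span[0]
assert span.length == span[1]

# Tuple unpacking works as for any tuple.
start, length = span
end = start + length
```

The same pattern applies to the larger tuples below, where access by field name is far less error-prone than access by index.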
-
-
class
Example
(keys, key_labels, doc_tokens, text, qas_id, model_type, seq_labels, ori_boxes, boxes, segment_ids, symbol_ids, im_base64, image_rois)¶ Bases:
tuple
-
boxes
¶ Alias for field number 8
-
doc_tokens
¶ Alias for field number 2
-
im_base64
¶ Alias for field number 11
-
image_rois
¶ Alias for field number 12
-
key_labels
¶ Alias for field number 1
-
keys
¶ Alias for field number 0
-
model_type
¶ Alias for field number 5
-
ori_boxes
¶ Alias for field number 7
-
qas_id
¶ Alias for field number 4
-
segment_ids
¶ Alias for field number 9
-
seq_labels
¶ Alias for field number 6
-
symbol_ids
¶ Alias for field number 10
-
text
¶ Alias for field number 3
-
-
class
Feature
(unique_id, example_index, qas_id, doc_span_index, tokens, token_to_orig_map, token_is_max_context, token_ids, position_ids, text_type_ids, text_symbol_ids, overlaps, key_labels, seq_labels, se_seq_labels, bio_seq_labels, bioes_seq_labels, keys, model_type, doc_tokens, doc_labels, text, boxes, segment_ids, im_base64, image_rois)¶ Bases:
tuple
-
bio_seq_labels
¶ Alias for field number 15
-
bioes_seq_labels
¶ Alias for field number 16
-
boxes
¶ Alias for field number 22
-
doc_labels
¶ Alias for field number 20
-
doc_span_index
¶ Alias for field number 3
-
doc_tokens
¶ Alias for field number 19
-
example_index
¶ Alias for field number 1
-
im_base64
¶ Alias for field number 24
-
image_rois
¶ Alias for field number 25
-
key_labels
¶ Alias for field number 12
-
keys
¶ Alias for field number 17
-
model_type
¶ Alias for field number 18
-
overlaps
¶ Alias for field number 11
-
position_ids
¶ Alias for field number 8
-
qas_id
¶ Alias for field number 2
-
se_seq_labels
¶ Alias for field number 14
-
segment_ids
¶ Alias for field number 23
-
seq_labels
¶ Alias for field number 13
-
text
¶ Alias for field number 21
-
text_symbol_ids
¶ Alias for field number 10
-
text_type_ids
¶ Alias for field number 9
-
token_ids
¶ Alias for field number 7
-
token_is_max_context
¶ Alias for field number 6
-
token_to_orig_map
¶ Alias for field number 5
-
tokens
¶ Alias for field number 4
-
unique_id
¶ Alias for field number 0
-
-
class
ProcessReader
(dataset=None, sample_transforms=None, batch_transforms=None, batch_size=None, shuffle=False, drop_last=False, drop_empty=True, mixup_epoch=-1, cutmix_epoch=-1, class_aware_sampling=False, use_process=False, use_fine_grained_loss=False, num_classes=80, bufsize=-1, memsize='3G', inputs_def=None, devices_num=1, num_trainers=1)[source]¶ Bases:
object
- Parameters
dataset (DataSet) – DataSet object
sample_transforms (list of BaseOperator) – a list of sample transforms operators.
batch_transforms (list of BaseOperator) – a list of batch transforms operators.
batch_size (int) – batch size.
shuffle (bool) – whether to shuffle the dataset. Default False.
drop_last (bool) – whether to drop the last batch. Default False.
drop_empty (bool) – whether to drop a sample when its ground truth (gt) is empty. Default True.
mixup_epoch (int) – mixup epoch number. Default is -1, meaning mixup is not used.
cutmix_epoch (int) – cutmix epoch number. Default is -1, meaning cutmix is not used.
class_aware_sampling (bool) – whether to use class-aware sampling. Default False.
worker_num (int) – number of working threads/processes. Default -1, meaning multi-threading/multi-processing is not used.
use_process (bool) – whether to use multiple processes. It only takes effect when worker_num > 1. Default False.
bufsize (int) – buffer size for multi-threads/multi-processes; note that one instance in the buffer is one batch of data.
memsize (str) – size of shared memory used in the result queue when use_process is true. Default 3G.
inputs_def (dict) – network input definition used to get the input fields, which determines the order of the returned data.
devices_num (int) – number of devices.
num_trainers (int) – number of trainers. Default 1.
-
pad_batch_data
(insts, pad_idx=0, max_seq_len=None, return_pos=False, return_input_mask=False, return_max_len=False, return_num_token=False, return_seq_lens=False, pad_2d_pos_ids=False, pad_segment_id=False, select=False, extract=False)[source]¶ Pad the instances to the max sequence length in batch, and generate the corresponding position data and attention bias.
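The core padding step can be sketched in plain Python as below. This is a simplified illustration, not the function itself: `pad_batch` is a hypothetical name, it always returns position ids and an input mask instead of honoring the return_* flags, and it assumes no instance is longer than max_seq_len.

```python
def pad_batch(insts, pad_idx=0, max_seq_len=None):
    """Pad token-id lists to a common length; also build positions and a mask."""
    if max_seq_len is None:
        max_seq_len = max(len(inst) for inst in insts)
    padded, positions, mask = [], [], []
    for inst in insts:
        n_pad = max_seq_len - len(inst)
        padded.append(list(inst) + [pad_idx] * n_pad)
        # Position ids count up over real tokens; padding reuses pad_idx.
        positions.append(list(range(len(inst))) + [pad_idx] * n_pad)
        # 1.0 marks a real token, 0.0 marks padding.
        mask.append([1.0] * len(inst) + [0.0] * n_pad)
    return padded, positions, mask


padded, positions, mask = pad_batch([[5, 6, 7], [8]])
```

A mask of this shape is the usual starting point for deriving an attention bias that keeps padded positions from being attended to.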