utils#

download_file(save_dir, filename, url, md5=None)[source]#

Download the file from the given url into the specified directory. If the file already exists, its md5 value is checked: if it matches the given md5, the existing file is reused; otherwise the file is downloaded again from the url.

Parameters:
  • save_dir (string) – The directory in which to save the file.

  • filename (string) – The filename under which to save the file.

  • url (string) – The url from which to download the file.

  • md5 (string, optional) – The expected md5 value used to verify the downloaded file.
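
A minimal usage sketch; the directory, filename, and url below are illustrative, and the import path (paddlenlp.taskflow.utils) is an assumption:

    from paddlenlp.taskflow.utils import download_file  # assumed import path

    # Download model_state.pdparams into ./checkpoints; if the file already
    # exists and its md5 matches the expected value, the cached copy is reused.
    download_file(
        save_dir="./checkpoints",                        # illustrative directory
        filename="model_state.pdparams",                 # illustrative filename
        url="https://example.com/model_state.pdparams",  # illustrative url
        md5=None,  # pass the expected md5 string to enable the version check
    )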

download_check(task)[source]#

Check the status of the resources required by the specified task.

Parameters:

task (string) – The name of the specified task.

add_docstrings(*docstr)[source]#

A decorator that appends the given docstrings to a class's docstring.

cut_chinese_sent(para)[source]#

Split Chinese text into sentences more precisely; adapted from https://blog.csdn.net/blmoistawinde/article/details/82379256.
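
A usage sketch (the import path is an assumption; the split shown in the comment follows the behavior described above):

    from paddlenlp.taskflow.utils import cut_chinese_sent  # assumed import path

    para = "今天天气很好!我们去公园散步吧。你觉得怎么样?"
    sentences = cut_chinese_sent(para)
    # Expected: one sentence per element, split on Chinese end-of-sentence
    # punctuation, e.g. ["今天天气很好!", "我们去公园散步吧。", "你觉得怎么样?"]
    print(sentences)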

class TermTreeNode(sid: str, term: str, base: str, node_type: str = 'term', term_type: str | None = None, hyper: str | None = None, level: int | None = None, alias: List[str] | None = None, alias_ext: List[str] | None = None, sub_type: List[str] | None = None, sub_term: List[str] | None = None, data: Dict[str, Any] | None = None)[source]#

Bases: object

Definition of a term node. All members are protected to keep the data structure rigorous.

Parameters:
  • sid (str) – term id of the node.

  • term (str) – the common name of this term.

  • base (str) – cb indicates concept base, eb indicates entity base.

  • term_type (Optional[str], optional) – type of this term, used to construct the hierarchy of term nodes. Defaults to None.

  • hyper (Optional[str], optional) – parent type of a type node. Defaults to None.

  • node_type (str, optional) – kind of node, either “type” or “term”. Defaults to “term”.

  • alias (Optional[List[str]], optional) – alias of this term. Defaults to None.

  • alias_ext (Optional[List[str]], optional) – extended alias of this term, CANNOT be used in matching. Defaults to None.

  • sub_type (Optional[List[str]], optional) – sub types grouped under this term. Defaults to None.

  • sub_term (Optional[List[str]], optional) – lower-level terms of this term. Defaults to None.

  • data (Optional[Dict[str, Any]], optional) – stores the full information of a term. Defaults to None.
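
A minimal construction sketch; the sid, term_type, and alias values are illustrative, and the import path is an assumption:

    from paddlenlp.taskflow.utils import TermTreeNode  # assumed import path

    node = TermTreeNode(
        sid="eb_python_001",               # illustrative term id
        term="python",                     # common name of the term
        base="eb",                         # "eb": entity base, "cb": concept base
        node_type="term",
        term_type="programming_language",  # illustrative type
        alias=["Python语言"],              # alias usable in matching
    )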

classmethod from_dict(data: Dict[str, Any])[source]#

Build a node from dictionary data.

Parameters:

data (Dict[str, Any]) – Dictionary containing all key-value data of the node.

Returns:

TermTree node object.

Return type:

TermTreeNode

classmethod from_json(json_str: str)[source]#

Build a node from JSON string.

Parameters:

json_str (str) – JSON string formatted by TermTree data.

Returns:

TermTree node object.

Return type:

TermTreeNode

class TermTree[source]#

Bases: object

TermTree class.

add_term(term: str | None = None, base: str | None = None, term_type: str | None = None, sub_type: List[str] | None = None, sub_term: List[str] | None = None, alias: List[str] | None = None, alias_ext: List[str] | None = None, data: Dict[str, Any] | None = None)[source]#

Add a term into TermTree.

Parameters:
  • term (str) – common name of the term.

  • base (str) – whether the term belongs to the concept base (cb) or the entity base (eb).

  • term_type (str) – term type of this term.

  • sub_type (Optional[List[str]], optional) – sub types of this term, which must exist in the TermTree. Defaults to None.

  • sub_term (Optional[List[str]], optional) – sub terms of this term. Defaults to None.

  • alias (Optional[List[str]], optional) – alias of this term. Defaults to None.

  • alias_ext (Optional[List[str]], optional) – extended alias of this term, CANNOT be used in matching. Defaults to None.

  • data (Optional[Dict[str, Any]], optional) – full information of the term. Defaults to None.

find_term(term: str, term_type: str | None = None) Tuple[bool, List[str] | None][source]#

Find a term in the TermTree. If the term does not exist, return None. If term_type is not None, only terms of that type are matched.

Parameters:
  • term (str) – term to look up.

  • term_type (Optional[str], optional) – term type to search within. Defaults to None.

Returns:

whether the term exists, together with the ids of the matching term nodes if found (otherwise None).

Return type:

Tuple[bool, Optional[List[str]]]

build_from_dir(term_schema_path, term_data_path, linking=True)[source]#

Build the TermTree from a type schema file and a term data file.

Parameters:

  • term_schema_path – path to the type schema file.

  • term_data_path – path to the term data file.

  • linking (bool, optional) – whether to enable term linking. Defaults to True.

classmethod from_dir(term_schema_path, term_data_path, linking) TermTree[source]#

Build a TermTree from a type schema file and a term data file.

Parameters:

  • term_schema_path – path to the type schema file.

  • term_data_path – path to the term data file.

  • linking (bool) – whether to enable term linking.

Returns:

the TermTree built from the given files.

Return type:

TermTree

save(save_dir)[source]#

Save the term tree to the directory save_dir.

Parameters:

save_dir (str) – the directory in which to save the term tree.
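
A hedged end-to-end sketch: the schema and data paths are illustrative placeholders, and the import path is an assumption:

    from paddlenlp.taskflow.utils import TermTree  # assumed import path

    # Build the tree from a type schema file and a term data file.
    tree = TermTree.from_dir("path/to/type_schema", "path/to/term_data", linking=True)

    found, term_ids = tree.find_term("苹果")  # illustrative lookup
    if found:
        print(term_ids)  # ids of the matching term nodes

    tree.save("./term_tree_out")  # illustrative output directory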

levenstein_distance(s1: str, s2: str) int[source]#

Calculate the minimal Levenshtein distance between s1 and s2.

Parameters:
  • s1 (str) – the first string.

  • s2 (str) – the second string.

Returns:

the minimal distance.

Return type:

int
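
For example (the import path is an assumption):

    from paddlenlp.taskflow.utils import levenstein_distance  # assumed import path

    # "kitten" -> "sitting" takes 3 single-character edits
    # (k->s substitution, e->i substitution, append g).
    print(levenstein_distance("kitten", "sitting"))  # 3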

class BurkhardKellerNode(word: str)[source]#

Bases: object

Node implementation for a BK-Tree. A BK-Tree node stores the information of the current word and its approximate words, as measured by Levenshtein distance.

Parameters:

word (str) – the word stored at the current node.

class BurkhardKellerTree[source]#

Bases: object

Implementation of a BK-Tree.

add(word: str)[source]#

Insert a word into the current tree. If the tree is empty, the word becomes the root.

Parameters:

word (str) – word to be inserted.

search_similar_word(word: str) List[str][source]#

Search for the stored words most similar to word (minimal Levenshtein distance).

Parameters:

word (str) – target word.

Returns:

similar words.

Return type:

List[str]
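
A usage sketch (the import path is an assumption, and the words are illustrative):

    from paddlenlp.taskflow.utils import BurkhardKellerTree  # assumed import path

    tree = BurkhardKellerTree()
    for w in ["hello", "help", "shell", "helmet"]:
        tree.add(w)

    # Returns the stored words closest to the query by Levenshtein distance.
    print(tree.search_similar_word("helo"))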

class TriedTree[source]#

Bases: object

Implementation of a trie tree (TriedTree).

add_word(word)[source]#

Add a single word into the TriedTree.

search(content)[source]#

Backward maximum matching

Parameters:

content (str) – string to be searched

Returns:

list of maximum matching words; each element represents the start and end positions of the matched string.

Return type:

List[Tuple]
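
A usage sketch (the import path is an assumption; per the docstring, each result element carries the start and end positions of a match):

    from paddlenlp.taskflow.utils import TriedTree  # assumed import path

    tree = TriedTree()
    for w in ["北京", "北京大学", "大学"]:
        tree.add_word(w)

    # Backward maximum matching: each element gives the start and end
    # positions of a maximal matching word in the content.
    print(tree.search("我在北京大学读书"))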

class Customization[source]#

Bases: object

User intervention based on an Aho-Corasick automaton.

load_customization(filename, sep=None)[source]#

Load the custom vocabulary file.

parse_customization(query, lac_tags, prefix=False)[source]#

Use the custom vocabulary to modify the LAC tagging results.
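
A heavily hedged sketch: the vocabulary file name is illustrative, and both the per-char tag format and the in-place modification of the tag list are assumptions based on the LAC custom-dictionary convention:

    from paddlenlp.taskflow.utils import Customization  # assumed import path

    custom = Customization()
    custom.load_customization("user_dict.txt")  # illustrative custom vocab file

    query = "天气预报说今天会下雨"
    # Placeholder per-char tags as produced by the LAC model; the exact tag
    # format is an assumption.
    lac_tags = ["n-B"] + ["n-I"] * (len(query) - 1)
    custom.parse_customization(query, lac_tags)  # assumed to modify lac_tags in place
    print(lac_tags)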

class SchemaTree(name='root', children=None)[source]#

Bases: object

Implementation of SchemaTree.

get_id_and_prob(span_set, offset_mapping)[source]#

Return the text ids and probabilities of predicted spans.

Parameters:
  • span_set (set) – set of predicted spans.

  • offset_mapping (list[int]) – list of pairs preserving the start and end char indices in the original text pair (prompt + text) for each token.

Returns:
  • sentence_id (list[tuple]) – indices of the start and end chars in the original text.

  • prob (list[float]) – probabilities of the predicted spans.

class WordTagRelationExtractor(schema)[source]#

Bases: object

Implementation of the information extractor.

classmethod from_dict(config_dict)[source]#

Make an instance from a configuration dictionary.

Parameters:

config_dict (Dict[str, Any]) – configuration dict.

classmethod from_json(json_str)[source]#

Build an instance from a JSON string.

classmethod from_pkl(pkl_path)[source]#

Build an instance from a serialized pickle file.

classmethod from_config(config_path)[source]#

Build an instance from a configuration file.

add_schema_from_dict(config_dict)[source]#

Add the schema from the dict.

extract_spo(all_items)[source]#

Pipeline of the SPO mining procedure.

Parameters:

all_items – the WordTag results from which to extract SPO triples.

class DataCollatorGP(tokenizer: paddlenlp.transformers.tokenizer_utils_base.PretrainedTokenizerBase, padding: bool | str | paddlenlp.transformers.tokenizer_utils_base.PaddingStrategy = True, max_length: int | None = None, label_maps: dict | None = None, task_type: str | None = None)[source]#

Bases: object

class DataCollatorForErnieCtm(tokenizer: paddlenlp.transformers.tokenizer_utils_base.PretrainedTokenizerBase, padding: bool | str | paddlenlp.transformers.tokenizer_utils_base.PaddingStrategy = True, model: str | None = 'wordtag')[source]#

Bases: object

class DocSpan(start, length)#

Bases: tuple

length#

Alias for field number 1

start#

Alias for field number 0
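
Since DocSpan is a plain namedtuple, a sliding-window chunk can be represented directly (the values are illustrative, and the import path is an assumption):

    from paddlenlp.taskflow.utils import DocSpan  # assumed import path

    span = DocSpan(start=0, length=384)  # illustrative window over a long document
    print(span.start, span.length)       # 0 384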

class Example(keys, key_labels, doc_tokens, text, qas_id, model_type, seq_labels, ori_boxes, boxes, segment_ids, symbol_ids, im_base64, image_rois)#

Bases: tuple

boxes#

Alias for field number 8

doc_tokens#

Alias for field number 2

im_base64#

Alias for field number 11

image_rois#

Alias for field number 12

key_labels#

Alias for field number 1

keys#

Alias for field number 0

model_type#

Alias for field number 5

ori_boxes#

Alias for field number 7

qas_id#

Alias for field number 4

segment_ids#

Alias for field number 9

seq_labels#

Alias for field number 6

symbol_ids#

Alias for field number 10

text#

Alias for field number 3

class Feature(unique_id, example_index, qas_id, doc_span_index, tokens, token_to_orig_map, token_is_max_context, token_ids, position_ids, text_type_ids, text_symbol_ids, overlaps, key_labels, seq_labels, se_seq_labels, bio_seq_labels, bioes_seq_labels, keys, model_type, doc_tokens, doc_labels, text, boxes, segment_ids, im_base64, image_rois)#

Bases: tuple

bio_seq_labels#

Alias for field number 15

bioes_seq_labels#

Alias for field number 16

boxes#

Alias for field number 22

doc_labels#

Alias for field number 20

doc_span_index#

Alias for field number 3

doc_tokens#

Alias for field number 19

example_index#

Alias for field number 1

im_base64#

Alias for field number 24

image_rois#

Alias for field number 25

key_labels#

Alias for field number 12

keys#

Alias for field number 17

model_type#

Alias for field number 18

overlaps#

Alias for field number 11

position_ids#

Alias for field number 8

qas_id#

Alias for field number 2

se_seq_labels#

Alias for field number 14

segment_ids#

Alias for field number 23

seq_labels#

Alias for field number 13

text#

Alias for field number 21

text_symbol_ids#

Alias for field number 10

text_type_ids#

Alias for field number 9

token_ids#

Alias for field number 7

token_is_max_context#

Alias for field number 6

token_to_orig_map#

Alias for field number 5

tokens#

Alias for field number 4

unique_id#

Alias for field number 0

class Compose(transforms, ctx=None)[source]#

Bases: object

Compose a sequence of transform operators.

class ProcessReader(dataset=None, sample_transforms=None, batch_transforms=None, batch_size=None, shuffle=False, drop_last=False, drop_empty=True, mixup_epoch=-1, cutmix_epoch=-1, class_aware_sampling=False, use_process=False, use_fine_grained_loss=False, num_classes=80, bufsize=-1, memsize='3G', inputs_def=None, devices_num=1, num_trainers=1)[source]#

Bases: object

Parameters:
  • dataset (DataSet) – DataSet object.

  • sample_transforms (list of BaseOperator) – a list of sample transform operators.

  • batch_transforms (list of BaseOperator) – a list of batch transform operators.

  • batch_size (int) – batch size.

  • shuffle (bool) – whether to shuffle the dataset. Default False.

  • drop_last (bool) – whether to drop the last batch. Default False.

  • drop_empty (bool) – whether to drop a sample when its ground truth is empty. Default True.

  • mixup_epoch (int) – mixup epoch number. Default is -1, meaning mixup is not used.

  • cutmix_epoch (int) – cutmix epoch number. Default is -1, meaning cutmix is not used.

  • class_aware_sampling (bool) – whether to use class-aware sampling. Default False.

  • worker_num (int) – number of worker threads/processes. Default -1, meaning multi-threading/multi-processing is not used.

  • use_process (bool) – whether to use multiple processes. Only effective when worker_num > 1. Default False.

  • bufsize (int) – buffer size for multi-threading/multi-processing; note that one instance in the buffer is one batch of data.

  • memsize (str) – size of the shared memory used in the result queue when use_process is true. Default 3G.

  • inputs_def (dict) – network input definition used to get input fields, which determines the order of the returned data.

  • devices_num (int) – number of devices.

  • num_trainers (int) – number of trainers. Default 1.

process(dataset)[source]#

worker(drop_empty=True, batch_samples=None)[source]#

Apply sample transforms and batch transforms.

pad_batch_data(insts, pad_idx=0, max_seq_len=None, return_pos=False, return_input_mask=False, return_max_len=False, return_num_token=False, return_seq_lens=False, pad_2d_pos_ids=False, pad_segment_id=False, select=False, extract=False)[source]#

Pad the instances to the max sequence length in batch, and generate the corresponding position data and attention bias.
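
A hedged sketch (the import path is an assumption, and the exact return structure depends on which return_* flags are set):

    import numpy as np

    from paddlenlp.taskflow.utils import pad_batch_data  # assumed import path

    insts = [[101, 7, 8, 102], [101, 9, 102]]  # illustrative token id lists
    # With return_input_mask=True the result is expected to also contain the
    # attention mask alongside the padded ids.
    outputs = pad_batch_data(insts, pad_idx=0, return_input_mask=True)
    print([np.array(x).shape for x in outputs])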

get_final_text(pred_text, orig_text, tokenizer, do_lower_case)[source]#

Project the tokenized prediction back to the original text.

find_bio_pos(label)[source]#

Find answer positions from BIO labels.

calEuclidean(x_list, y_list)[source]#

Calculate the Euclidean distance between x_list and y_list.

longestCommonSequence(question_tokens, context_tokens)[source]#

Compute the longest common subsequence between question_tokens and context_tokens.