utils

download_file(save_dir, filename, url, md5=None)[source]

Download the file from url to the specified directory. If the file already exists, its md5 value is checked: when it matches the given md5, the existing file is used; otherwise the file is downloaded again from url.

Parameters
  • save_dir (string) – The directory in which to save the file.

  • filename (string) – The filename under which to save the file.

  • url (string) – The url to download the file from.

  • md5 (string, optional) – The md5 value used to check the version of the downloaded file.
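
The check-then-download decision described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: `file_md5` and `needs_download` are hypothetical helper names, the network fetch itself is left out, and reusing an existing file when no md5 is given is an assumption.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(save_dir, filename, md5=None):
    """Decide whether the file must be fetched (hypothetical helper)."""
    path = os.path.join(save_dir, filename)
    if not os.path.exists(path):
        return True  # nothing on disk yet
    if md5 is None:
        return False  # assumption: reuse the existing file without a checksum
    return file_md5(path) != md5  # stale or corrupted copy: fetch again
```

A caller would fetch the url only when `needs_download(...)` returns True.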

download_check(task)[source]

Check the resource status of the specified task.

Parameters

task (string) – The name of the specified task.

add_docstrings(*docstr)[source]

Function that adds the given docstrings to the docstring of a class.
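
The `*docstr` signature suggests a decorator factory. The sketch below shows one way such a helper could work; `append_docstrings` and `Demo` are hypothetical names, and the real function may behave differently.

```python
def append_docstrings(*docstr):
    """Return a decorator that appends the given text to the target's __doc__."""

    def decorator(obj):
        # Tolerate targets that have no docstring yet.
        obj.__doc__ = (obj.__doc__ or "") + "".join(docstr)
        return obj

    return decorator


@append_docstrings("\nArgs:\n    text (str): input text.")
class Demo:
    """Demo task."""
```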

cut_chinese_sent(para)[source]

Cut Chinese sentences more precisely; see “https://blog.csdn.net/blmoistawinde/article/details/82379256” for reference.
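
A regex-based splitter in the spirit of this function could look like the sketch below. It is an illustration, not the library's code: `split_chinese_sentences` is a hypothetical name, and the rule set (break after sentence-ending punctuation, but keep a closing quote attached to its sentence) is an assumption about the intended behaviour.

```python
import re


def split_chinese_sentences(para):
    """Split a paragraph into sentences at Chinese end-of-sentence marks."""
    # Break after 。！？? unless a closing quote follows.
    para = re.sub(r"([。！？?])([^”’])", r"\1\n\2", para)
    # Break after a six-dot ASCII ellipsis.
    para = re.sub(r"(\.{6})([^”’])", r"\1\n\2", para)
    # Break after a double Chinese ellipsis.
    para = re.sub(r"(…{2})([^”’])", r"\1\n\2", para)
    # A closing quote after an end mark ends the sentence.
    para = re.sub(r"([。！？?][”’])([^，。！？?])", r"\1\n\2", para)
    return para.rstrip().split("\n")
```

For example, `split_chinese_sentences("今天天气很好。我们去公园吧！")` yields two sentences.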

class TermTreeNode(sid: str, term: str, base: str, node_type: str = 'term', term_type: Optional[str] = None, hyper: Optional[str] = None, level: Optional[int] = None, alias: Optional[List[str]] = None, alias_ext: Optional[List[str]] = None, sub_type: Optional[List[str]] = None, sub_term: Optional[List[str]] = None, data: Optional[Dict[str, Any]] = None)[source]

Bases: object

Definition of a term node. All members are protected, to keep the data structure rigorous.

Parameters
  • sid (str) – term id of node.

  • term (str) – term, common name of this term.

  • base (str) – cb indicates concept base, eb indicates entity base.

  • term_type (Optional[str], optional) – type of this term, used to build the hierarchy of term nodes. Defaults to None.

  • hyper (Optional[str], optional) – parent type of a type node. Defaults to None.

  • node_type (str, optional) – kind of node, either type or term. Defaults to “term”.

  • alias (Optional[List[str]], optional) – alias of this term. Defaults to None.

  • alias_ext (Optional[List[str]], optional) – extended alias of this term, CANNOT be used in matching. Defaults to None.

  • sub_type (Optional[List[str]], optional) – grouped by some term. Defaults to None.

  • sub_term (Optional[List[str]], optional) – some lower term. Defaults to None.

  • data (Optional[Dict[str, Any]], optional) – to store full information of a term. Defaults to None.

classmethod from_dict(data: Dict[str, Any])[source]

Build a node from dictionary data.

Parameters

data (Dict[str, Any]) – Dictionary containing all key-value data.

Returns

TermTree node object.

Return type

[type]

classmethod from_json(json_str: str)[source]

Build a node from JSON string.

Parameters

json_str (str) – JSON string in TermTree data format.

Returns

TermTree node object.

Return type

[type]
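
The two alternate constructors follow a common pattern: `from_json` parses the string and delegates to `from_dict`. The sketch below illustrates that pattern with a hypothetical, simplified `SimpleTermNode`. The key names simply mirror the constructor parameters and are an assumption; the keys used in real TermTree data may differ.

```python
import json
from dataclasses import dataclass, fields
from typing import Any, Dict, List, Optional


@dataclass
class SimpleTermNode:
    """Hypothetical, reduced stand-in for a term node."""

    sid: str
    term: str
    base: str
    node_type: str = "term"
    term_type: Optional[str] = None
    alias: Optional[List[str]] = None

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "SimpleTermNode":
        # Keep only keys that correspond to declared fields.
        names = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in names})

    @classmethod
    def from_json(cls, json_str: str) -> "SimpleTermNode":
        return cls.from_dict(json.loads(json_str))
```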

class TermTree[source]

Bases: object

TermTree class.

add_term(term: Optional[str] = None, base: Optional[str] = None, term_type: Optional[str] = None, sub_type: Optional[List[str]] = None, sub_term: Optional[List[str]] = None, alias: Optional[List[str]] = None, alias_ext: Optional[List[str]] = None, data: Optional[Dict[str, Any]] = None)[source]

Add a term into TermTree.

Parameters
  • term (str) – common name of the term.

  • base (str) – whether the term is a concept or an entity.

  • term_type (str) – term type of this term.

  • sub_type (Optional[List[str]], optional) – sub type of this term, must exist in TermTree. Defaults to None.

  • sub_term (Optional[List[str]], optional) – sub terms of this term. Defaults to None.

  • alias (Optional[List[str]], optional) – alias of this term. Defaults to None.

  • alias_ext (Optional[List[str]], optional) – extended alias of this term, CANNOT be used in matching. Defaults to None.

  • data (Optional[Dict[str, Any]], optional) – to store full information of a term. Defaults to None.

find_term(term: str, term_type: Optional[str] = None) → Tuple[bool, Optional[List[str]]][source]

Find a term in the TermTree. If the term does not exist, return None. If term_type is not None, only terms of this type are searched.

Parameters
  • term (str) – term to look up.

  • term_type (Optional[str], optional) – find term in this term_type. Defaults to None.

Returns

[description]

Return type

Union[None, List[str]]
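
A lookup with an optional type filter can be sketched as below. `TinyTermIndex` is a hypothetical stand-in, not the TermTree implementation. The `(found, ids)` return shape follows the signature shown above, and treating the list as term ids is an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


class TinyTermIndex:
    """Hypothetical index: term name -> ids, id -> term type."""

    def __init__(self) -> None:
        self._ids_by_term: Dict[str, List[str]] = defaultdict(list)
        self._type_by_id: Dict[str, Optional[str]] = {}

    def add_term(self, sid: str, term: str, term_type: Optional[str] = None) -> None:
        self._ids_by_term[term].append(sid)
        self._type_by_id[sid] = term_type

    def find_term(
        self, term: str, term_type: Optional[str] = None
    ) -> Tuple[bool, Optional[List[str]]]:
        ids = self._ids_by_term.get(term)
        if not ids:
            return False, None
        if term_type is not None:
            # Narrow the match to the requested type.
            ids = [sid for sid in ids if self._type_by_id[sid] == term_type]
            if not ids:
                return False, None
        return True, list(ids)
```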

build_from_dir(term_schema_path, term_data_path, linking=True)[source]

Build TermTree from a directory which should contain type schema and term data.

Parameters

dir ([type]) – [description]

classmethod from_dir(term_schema_path, term_data_path, linking) → paddlenlp.taskflow.utils.TermTree[source]

Build TermTree from a directory which should contain type schema and term data.

Parameters

source_dir ([type]) – [description]

Returns

[description]

Return type

TermTree

save(save_dir)[source]

Save term tree to directory save_dir

Parameters

save_dir ([type]) – Directory.

levenstein_distance(s1: str, s2: str) → int[source]

Calculate the Levenshtein (edit) distance between s1 and s2.

Parameters
  • s1 (str) – string

  • s2 (str) – string

Returns

the minimal distance.

Return type

int
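
For reference, the Levenshtein distance is usually computed with dynamic programming over prefixes. The sketch below uses the standard two-row formulation; `edit_distance` is a hypothetical name and this is not necessarily how the library computes it.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # keep the row as short as possible
    previous = list(range(len(s2) + 1))  # distances from "" to prefixes of s2
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(
                min(
                    previous[j] + 1,         # delete c1
                    current[j - 1] + 1,      # insert c2
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]
```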

class BurkhardKellerNode(word: str)[source]

Bases: object

Node implementation for BK-Tree. A BK-Tree node stores the current word and its approximate words, computed by Levenshtein distance.

Parameters

word (str) – word of current node.

class BurkhardKellerTree[source]

Bases: object

Implementation of BK-Tree.

add(word: str)[source]

Insert a word into the current tree. If the tree is empty, this word becomes the root.

Parameters

word (str) – word to be inserted.

search_similar_word(word: str) → List[str][source]

Search for the words most similar to the given word (minimal Levenshtein distance).

Parameters

word (str) – target word

Returns

similar words.

Return type

List[str]
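
A BK-tree keys each child by its edit distance to the parent, so a search can skip any subtree that the triangle inequality rules out. The sketch below is a generic illustration with hypothetical names (`BKNode`, `BKTree`, `edit_distance`); the `max_dist` tolerance parameter is an assumption and is not part of the documented signature.

```python
from typing import Dict, List, Optional


def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance via two-row dynamic programming."""
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (c1 != c2)))
        previous = current
    return previous[-1]


class BKNode:
    def __init__(self, word: str) -> None:
        self.word = word
        self.children: Dict[int, "BKNode"] = {}  # distance -> child


class BKTree:
    def __init__(self) -> None:
        self.root: Optional[BKNode] = None

    def add(self, word: str) -> None:
        if self.root is None:
            self.root = BKNode(word)
            return
        node = self.root
        while True:
            d = edit_distance(word, node.word)
            if d == 0:
                return  # already stored
            if d not in node.children:
                node.children[d] = BKNode(word)
                return
            node = node.children[d]

    def search(self, word: str, max_dist: int = 1) -> List[str]:
        """Return stored words within max_dist, closest first."""
        if self.root is None:
            return []
        found, stack = [], [self.root]
        while stack:
            node = stack.pop()
            d = edit_distance(word, node.word)
            if d <= max_dist:
                found.append((d, node.word))
            # Only subtrees keyed within [d - max_dist, d + max_dist] can match.
            for k, child in node.children.items():
                if d - max_dist <= k <= d + max_dist:
                    stack.append(child)
        return [w for _, w in sorted(found)]
```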

class TriedTree[source]

Bases: object

Implementation of TriedTree.

add_word(word)[source]

Add a single word into the TriedTree.

search(content)[source]

Backward maximum matching

Parameters

content (str) – string to be searched

Returns

list of maximum matching words, each element represents the starting and ending position of the matching string.

Return type

List[Tuple]
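
One way to implement backward maximum matching is to store reversed words in a trie, scan the text from right to left, and take the longest word ending at each position. The sketch below does that and returns `(start, end)` pairs with an exclusive end. `ReversedTrie` is a hypothetical name, and the library's own data structure and span convention may differ.

```python
from typing import Dict, List, Tuple

_END = ""  # end-of-word marker; never collides with a one-character key


class ReversedTrie:
    """Trie over reversed words, so lookups walk the text right to left."""

    def __init__(self) -> None:
        self.root: Dict[str, dict] = {}

    def add_word(self, word: str) -> None:
        node = self.root
        for ch in reversed(word):
            node = node.setdefault(ch, {})
        node[_END] = {}

    def search(self, content: str) -> List[Tuple[int, int]]:
        spans = []
        end = len(content)
        while end > 0:
            node, pos, best_start = self.root, end, None
            # Walk left while the trie still has a matching branch.
            while pos > 0 and content[pos - 1] in node:
                node = node[content[pos - 1]]
                pos -= 1
                if _END in node:
                    best_start = pos  # longest match found so far
            if best_start is None:
                end -= 1  # no word ends here; move one character left
            else:
                spans.append((best_start, end))
                end = best_start  # continue before the matched word
        spans.reverse()
        return spans
```

With the words 北京, 北京大学 and 大学 loaded, searching "我在北京大学读书" selects the longest match 北京大学 rather than its shorter parts.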

class Customization[source]

Bases: object

User intervention based on Aho-Corasick automaton

load_customization(filename, sep=None)[source]

Load the custom vocab.

parse_customization(query, lac_tags, prefix=False)[source]

Use the custom vocab to modify the LAC results.

class SchemaTree(name='root', children=None)[source]

Bases: object

Implementation of SchemaTree.

get_bool_ids_greater_than(probs, limit=0.5, return_prob=False)[source]

Get the indices along the last dimension of the probability arrays whose values are greater than a limit.

input: [[0.1, 0.1, 0.2, 0.5, 0.1, 0.3], [0.7, 0.6, 0.1, 0.1, 0.1, 0.1]], 0.4

output: [[3], [0, 1]]
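
The thresholding above can be reproduced with a short pure-Python sketch that works on nested lists. `ids_greater_than` is a hypothetical name, and returning `(index, probability)` pairs when `return_prob` is set is an assumption about that flag.

```python
def ids_greater_than(probs, limit=0.5, return_prob=False):
    """Indices along the last dimension whose probability exceeds limit."""
    if probs and isinstance(probs[0], (list, tuple)):
        # Recurse until the innermost (last) dimension is reached.
        return [ids_greater_than(row, limit, return_prob) for row in probs]
    if return_prob:
        return [(i, p) for i, p in enumerate(probs) if p > limit]
    return [i for i, p in enumerate(probs) if p > limit]
```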

get_span(start_ids, end_ids, with_prob=False)[source]

Get the set of spans from the lists of start and end positions. Every id can only be used once.

input: [1, 2, 10], [4, 12]

output: {(2, 4), (10, 12)}
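
The pairing can be done with a two-pointer sweep over the sorted positions: each end position is matched with the closest unused start at or before it. The sketch below reproduces the example above. `pair_spans` is a hypothetical name and the probability-carrying variant (`with_prob`) is left out.

```python
def pair_spans(start_ids, end_ids):
    """Pair start and end positions into a set of (start, end) spans."""
    starts, ends = sorted(start_ids), sorted(end_ids)
    start_for_end = {}
    i = j = 0
    while i < len(starts) and j < len(ends):
        if starts[i] == ends[j]:
            start_for_end[ends[j]] = starts[i]  # single-position span
            i += 1
            j += 1
        elif starts[i] < ends[j]:
            # A later start overwrites an earlier one for the same end.
            start_for_end[ends[j]] = starts[i]
            i += 1
        else:
            j += 1  # this end has no further candidate starts
    return {(s, e) for e, s in start_for_end.items()}
```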