semantic_html.utils
Functions
safe_xpath | |
generate_uuid | Generate a new UUID4 in URN format. |
normalize_whitespace | |
get_same_as | Extracts a URI (sameAs) from a node using XPath. |
extract_text_lxml | Extract plain text from lxml element, preserving paragraph line breaks. |
find_offset_with_context | |
extract_context | Extract prefix and suffix context around the node's text within its parent text. |
clean_html | Remove HTML elements mapped to 'IGNORE' via xpath and optionally remove empty tags using lxml. |
annotate_tree_with_rdfa | Annotate an lxml tree with RDFa 'typeof' attributes and namespace declarations based on mapping @context. |
mapping_lookup | Build lookup table for xpath entries in mapping. |
regex_wrap_tree | Traverses the lxml tree and wraps regex matches in <span> tags, mutates mapping. |
tokenize | Simple and robust tokenization. |
build_token_spans | Compute start and end character offsets for each token. |
_flatten_selector | Handle nested selector structures like refinedBy, Choice, List. |
normalize_wadm | Normalize WADM annotations into a simpler structure. |
resolve_texts | Build a lookup from @id → text for all resources in a JSON-LD @graph. |
_conll_from_annotations | Helper: map annotations to BIO labels for a given text. |
conll_to_string | Return sentences in CoNLL format as a string. |
Module Contents
- semantic_html.utils.safe_xpath(root, expr)
- semantic_html.utils.generate_uuid() -> str
Generate a new UUID4 in URN format.
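
For orientation, here is a minimal sketch of what a UUID4 in URN format looks like, using only the standard library `uuid` module. The helper name `make_urn_uuid` is hypothetical; this is a stand-in for illustration, not the library's own implementation.

```python
import uuid


def make_urn_uuid() -> str:
    """Return a random (version 4) UUID as a URN string (hypothetical stand-in)."""
    # uuid4() draws a random UUID; .urn renders it as 'urn:uuid:<hex-with-dashes>'.
    return uuid.uuid4().urn


example_id = make_urn_uuid()
```

Identifiers of this shape are convenient as `@id` values in JSON-LD because they are globally unique without needing a resolvable base URL.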
- semantic_html.utils.normalize_whitespace(text: str) -> str
- semantic_html.utils.get_same_as(node: lxml.etree._Element, xpath: str = 'self::a/@href | .//a/@href') -> str | None
Extract a URI (sameAs) from a node using XPath. By default, this matches <a href="…">, either on the node itself or on one of its descendants.
- Parameters:
node – The lxml Element to inspect.
xpath – XPath expression selecting the desired attribute(s). Default: 'self::a/@href | .//a/@href'
- Returns:
The first matching attribute value (string) or None.
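
The default lookup described above can be approximated with the standard library alone. The sketch below uses `xml.etree.ElementTree` instead of lxml, and the helper name `first_href` and the example URL are hypothetical; it illustrates the "node itself or descendants" behaviour rather than reproducing the library's code.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def first_href(node: ET.Element) -> Optional[str]:
    """Return the first href found on `node` or a descendant <a>, else None."""
    # Element.iter("a") yields the node itself (if it is an <a>) followed by
    # matching descendants in document order, which is similar in spirit to
    # the default XPath 'self::a/@href | .//a/@href'.
    for anchor in node.iter("a"):
        href = anchor.get("href")
        if href is not None:
            return href
    return None


# Hypothetical input: a paragraph containing one link.
paragraph = ET.fromstring(
    '<p>See <a href="https://example.org/entity/1">this entity</a>.</p>'
)
link = first_href(paragraph)
```

With lxml, the same idea would normally be expressed by evaluating the XPath expression directly, which also allows callers to pass a different expression.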
- semantic_html.utils.extract_text_lxml(node) -> str
Extract plain text from lxml element, preserving paragraph line breaks.
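
One possible reading of "preserving paragraph line breaks" is that each `<p>` becomes one line of output. The sketch below follows that assumption with the standard library's `xml.etree.ElementTree`; the helper name `text_with_paragraph_breaks` is hypothetical and the real function may treat other block elements differently.

```python
import xml.etree.ElementTree as ET


def text_with_paragraph_breaks(root: ET.Element) -> str:
    """Join the text of each <p> with newlines (assumed behaviour)."""
    paragraphs = list(root.iter("p"))
    if not paragraphs:
        # No paragraphs found: fall back to all text inside the element.
        return "".join(root.itertext()).strip()
    # itertext() flattens inline markup such as <b> or <a> into plain text.
    return "\n".join("".join(p.itertext()).strip() for p in paragraphs)


html = ET.fromstring("<div><p>First <b>bold</b> line.</p><p>Second line.</p></div>")
plain = text_with_paragraph_breaks(html)
```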
- semantic_html.utils.find_offset_with_context(text, prefix, suffix, doc_text, max_chars=30)
- semantic_html.utils.extract_context(node: lxml.etree._Element, max_chars: int = 30) -> tuple[str, str]
Extract prefix and suffix context around the node’s text within its parent text.
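
The prefix/suffix idea can be shown on plain strings. In the sketch below the node's text is represented by its character offsets within the parent text; the helper name `context_around` and the offset-based interface are assumptions made for illustration, since the real function works on an lxml element.

```python
from typing import Tuple


def context_around(parent_text: str, start: int, end: int,
                   max_chars: int = 30) -> Tuple[str, str]:
    """Return up to `max_chars` characters before `start` and after `end`."""
    # Clamp at 0 so a span near the beginning yields a shorter prefix.
    prefix = parent_text[max(0, start - max_chars):start]
    # Slicing past the end of a string is safe and yields a shorter suffix.
    suffix = parent_text[end:end + max_chars]
    return prefix, suffix


sentence = "The city of Vienna lies on the Danube."
begin = sentence.index("Vienna")
prefix, suffix = context_around(sentence, begin, begin + len("Vienna"), max_chars=8)
```

A prefix and suffix pair like this corresponds to what a text-quote style selector stores alongside the exact match, which helps to disambiguate repeated strings.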
- semantic_html.utils.clean_html(tree: lxml.etree._Element, mapping: dict, remove_empty_tags: bool = True) -> lxml.etree._Element
Remove HTML elements that the mapping assigns to 'IGNORE' via XPath and, optionally, remove empty tags using lxml.
- semantic_html.utils.annotate_tree_with_rdfa(tree: lxml.etree._Element, mapping: dict, context: dict = None) -> lxml.etree._Element
Annotate an lxml tree with RDFa 'typeof' attributes and namespace declarations based on the mapping's @context.
- semantic_html.utils.mapping_lookup(mapping)
Build a lookup table for the XPath entries in a mapping. Handles a single XPath string or a list of XPaths, as well as special Annotation subtypes. Returns the xpath_lookup dict.
- semantic_html.utils.regex_wrap_tree(tree: lxml.etree._Element, mapping: dict) -> lxml.etree._Element
Traverse the lxml tree and wrap regex matches in <span> tags. Note that this mutates mapping.
- semantic_html.utils.tokenize(text)
Simple and robust tokenization.
- semantic_html.utils.build_token_spans(text, tokens)
Compute start and end character offsets for each token.
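
To make the relationship between tokens and character offsets concrete, here is a minimal sketch. The regular expression and the helper names `simple_tokenize` and `token_spans` are assumptions; the library's tokenizer may split text differently.

```python
import re
from typing import List, Tuple


def simple_tokenize(text: str) -> List[str]:
    """Split into word tokens and single punctuation characters (assumed rule)."""
    return re.findall(r"\w+|[^\w\s]", text)


def token_spans(text: str, tokens: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets for each token, in order."""
    spans = []
    cursor = 0
    for token in tokens:
        # Search from the end of the previous token so repeated tokens
        # map to successive occurrences rather than the first one.
        start = text.find(token, cursor)
        if start == -1:
            raise ValueError(f"token {token!r} not found after offset {cursor}")
        end = start + len(token)
        spans.append((start, end))
        cursor = end
    return spans


sample = "Hello, world!"
sample_tokens = simple_tokenize(sample)
sample_spans = token_spans(sample, sample_tokens)
```

Each span satisfies `text[start:end] == token`, which is the property needed to align tokens with character-based annotations.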
- semantic_html.utils._flatten_selector(sel)
Handle nested selector structures like refinedBy, Choice, List. Always return a list of flat selectors.
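
A recursive flattening of this kind might look like the sketch below. It assumes that selectors are plain dicts, that `refinedBy` holds a nested selector, and that `Choice` and `List` containers keep their members under an `items` key, as in the W3C Web Annotation Data Model; the helper name `flatten_selector` is hypothetical and the library's handling may differ.

```python
from typing import Any, Dict, List


def flatten_selector(sel: Any) -> List[Dict[str, Any]]:
    """Return a flat list of selector dicts without nesting (assumed structure)."""
    if sel is None:
        return []
    if isinstance(sel, list):
        flat: List[Dict[str, Any]] = []
        for item in sel:
            flat.extend(flatten_selector(item))
        return flat
    if sel.get("type") in ("Choice", "List"):
        # Container types: flatten their members instead of the container.
        return flatten_selector(sel.get("items", []))
    # Plain selector: emit it without 'refinedBy', then append the refinement.
    own = {key: value for key, value in sel.items() if key != "refinedBy"}
    return [own] + flatten_selector(sel.get("refinedBy"))


nested = {
    "type": "XPathSelector",
    "value": "/p[1]",
    "refinedBy": {"type": "TextPositionSelector", "start": 0, "end": 5},
}
flat_selectors = flatten_selector(nested)
```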
- semantic_html.utils.normalize_wadm(wadm, whitelist=None, blacklist=None)
Normalize WADM annotations into a simpler structure:
  - entity_type: string (from body purpose: tagging)
  - selectors: list of selector dicts
Handles multiple tagging values robustly (e.g. "Annotation" + "Concept").
- semantic_html.utils.resolve_texts(jsonld)
Build a lookup from @id → text for all resources in a JSON-LD @graph.
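
The lookup described here is essentially a dict comprehension over the `@graph` array. The sketch below assumes that each resource stores its text under a key named `text`; that key name, the helper name `text_lookup`, and the sample identifiers are hypothetical, since the source does not say which property the library reads.

```python
from typing import Any, Dict


def text_lookup(jsonld: Dict[str, Any], text_key: str = "text") -> Dict[str, str]:
    """Map each resource's @id to its text, skipping resources without text."""
    lookup: Dict[str, str] = {}
    for resource in jsonld.get("@graph", []):
        # Only resources that carry both an identifier and text are indexed.
        if "@id" in resource and text_key in resource:
            lookup[resource["@id"]] = resource[text_key]
    return lookup


document = {
    "@graph": [
        {"@id": "urn:uuid:doc-1", "text": "Ada Lovelace visited London."},
        {"@id": "urn:uuid:ann-1", "type": "Annotation"},
    ]
}
texts = text_lookup(document)
```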
- semantic_html.utils._conll_from_annotations(text, annotations, max_span_tokens=None)
Helper: map annotations to BIO labels for a given text.
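
BIO labelling marks the first token of an entity with `B-`, following tokens of the same entity with `I-`, and everything else with `O`. The sketch below shows that mapping for character-span annotations; the `(start, end, entity_type)` tuple format, the tokenization rule, and the helper name `bio_labels` are assumptions, and the `max_span_tokens` parameter of the real helper is not modelled.

```python
import re
from typing import List, Optional, Tuple

Annotation = Tuple[int, int, str]  # (start offset, end offset, entity type)


def bio_labels(text: str, annotations: List[Annotation]) -> List[Tuple[str, str]]:
    """Label each token 'B-X', 'I-X', or 'O' from character-span annotations."""
    labelled: List[Tuple[str, str]] = []
    previous: Optional[int] = None  # index of the annotation covering the last token
    for match in re.finditer(r"\w+|[^\w\s]", text):
        current: Optional[int] = None
        for index, (start, end, _) in enumerate(annotations):
            # A token belongs to an annotation if it lies fully inside its span.
            if start <= match.start() and match.end() <= end:
                current = index
                break
        if current is None:
            label = "O"
        else:
            # Continue the entity if the previous token had the same annotation.
            prefix = "I" if current == previous else "B"
            label = f"{prefix}-{annotations[current][2]}"
        labelled.append((match.group(), label))
        previous = current
    return labelled


tagged = bio_labels("Ada Lovelace visited London.", [(0, 12, "PER"), (21, 27, "LOC")])
```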
- semantic_html.utils.conll_to_string(sentences)
Return sentences in CoNLL format as a string.
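
CoNLL-style output conventionally puts one token per line with its columns separated by whitespace, and separates sentences with a blank line. The sketch below assumes that each sentence is a list of `(token, label)` pairs and that columns are tab-separated; both the input shape and the helper name `to_conll` are hypothetical.

```python
from typing import List, Tuple


def to_conll(sentences: List[List[Tuple[str, str]]]) -> str:
    """Serialize sentences as 'token<TAB>label' lines with blank-line separators."""
    if not sentences:
        return ""
    blocks = []
    for sentence in sentences:
        # One line per token within a sentence.
        blocks.append("\n".join(f"{token}\t{label}" for token, label in sentence))
    # A blank line between sentences, and a trailing newline at the end.
    return "\n\n".join(blocks) + "\n"


conll_text = to_conll([
    [("Ada", "B-PER"), ("visited", "O")],
    [("London", "B-LOC")],
])
```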