semantic_html.utils
===================
.. py:module:: semantic_html.utils
Functions
---------
.. autoapisummary::
semantic_html.utils.safe_xpath
semantic_html.utils.generate_uuid
semantic_html.utils.normalize_whitespace
semantic_html.utils.get_same_as
semantic_html.utils.extract_text_lxml
semantic_html.utils.find_offset_with_context
semantic_html.utils.extract_context
semantic_html.utils.clean_html
semantic_html.utils.annotate_tree_with_rdfa
semantic_html.utils.mapping_lookup
semantic_html.utils.regex_wrap_tree
semantic_html.utils.tokenize
semantic_html.utils.build_token_spans
semantic_html.utils._flatten_selector
semantic_html.utils.normalize_wadm
semantic_html.utils.resolve_texts
semantic_html.utils._conll_from_annotations
semantic_html.utils.conll_to_string
Module Contents
---------------
.. py:function:: safe_xpath(root, expr)
.. py:function:: generate_uuid() -> str
Generate a new UUID4 in URN format.
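A minimal sketch of producing a UUID4 in URN form with the standard library. The helper name ``make_urn_uuid`` is hypothetical and is not claimed to be how ``generate_uuid`` is implemented.

```python
import uuid


def make_urn_uuid() -> str:
    """Return a random (version 4) UUID as a URN string."""
    # UUID.urn renders the value as 'urn:uuid:xxxxxxxx-xxxx-...'
    return uuid.uuid4().urn


urn = make_urn_uuid()
print(urn)  # a different 'urn:uuid:...' value on every call
```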
.. py:function:: normalize_whitespace(text: str) -> str
.. py:function:: get_same_as(node: lxml.etree._Element, xpath: str = 'self::a/@href | .//a/@href') -> str | None
Extract a URI (``sameAs``) from a node using XPath.
By default, the expression matches the ``href`` of the node itself (if it is an ``a`` element) or of any descendant ``a`` element.
:param node: The lxml element to inspect.
:param xpath: XPath expression selecting the desired attribute(s). Defaults to ``"self::a/@href | .//a/@href"``.
:returns: The first matching attribute value as a string, or ``None``.
.. py:function:: extract_text_lxml(node) -> str
Extract plain text from an lxml element, preserving paragraph line breaks.
.. py:function:: find_offset_with_context(text, prefix, suffix, doc_text, max_chars=30)
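The function above has no description. Its signature suggests that a quoted text is located inside ``doc_text`` and disambiguated by the surrounding ``prefix`` and ``suffix``. The sketch below illustrates that general idea only. ``locate_with_context`` and its scoring rule are assumptions, not the library's behaviour.

```python
def _common_prefix_len(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def _common_suffix_len(a, b):
    """Number of trailing characters that a and b share."""
    return _common_prefix_len(a[::-1], b[::-1])


def locate_with_context(text, prefix, suffix, doc_text, max_chars=30):
    """Return the start offset of the occurrence of `text` in `doc_text`
    whose surroundings best match `prefix` and `suffix`, or -1 if absent."""
    # ASSUMPTION: only the max_chars characters nearest the match are compared.
    prefix = prefix[max(0, len(prefix) - max_chars):]
    suffix = suffix[:max_chars]
    best_offset, best_score = -1, -1
    start = doc_text.find(text)
    while start != -1:
        end = start + len(text)
        before = doc_text[max(0, start - len(prefix)):start]
        after = doc_text[end:end + len(suffix)]
        score = _common_suffix_len(before, prefix) + _common_prefix_len(after, suffix)
        if score > best_score:
            best_offset, best_score = start, score
        start = doc_text.find(text, start + 1)
    return best_offset


doc = "The cat sat. The cat ran."
offset = locate_with_context("cat", "sat. The ", " ran.", doc)
print(offset)  # 17, the second "cat"
```

Scoring every occurrence and keeping the best one means the lookup still returns a position when the context matches only partially.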
.. py:function:: extract_context(node: lxml.etree._Element, max_chars: int = 30) -> tuple[str, str]
Extract prefix and suffix context around the node's text within its parent text.
.. py:function:: clean_html(tree: lxml.etree._Element, mapping: dict, remove_empty_tags: bool = True) -> lxml.etree._Element
Remove HTML elements matched by the XPath expressions mapped to ``'IGNORE'`` and, if ``remove_empty_tags`` is true, also remove empty tags.
.. py:function:: annotate_tree_with_rdfa(tree: lxml.etree._Element, mapping: dict, context: dict = None) -> lxml.etree._Element
Annotate an lxml tree with RDFa ``typeof`` attributes and namespace declarations based on the mapping's ``@context``.
.. py:function:: mapping_lookup(mapping)
Build a lookup table for the XPath entries in ``mapping``.
Handles entries given as a single XPath string or as a list of XPaths, as well as special Annotation subtypes.
Returns the ``xpath_lookup`` dict.
.. py:function:: regex_wrap_tree(tree: lxml.etree._Element, mapping: dict) -> lxml.etree._Element
Traverse the lxml tree and wrap regex matches in tags.
Mutates ``mapping`` in place.
.. py:function:: tokenize(text)
Simple and robust tokenization.
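The docstring does not say which tokenization rule is used. One common "simple" approach, shown here as an assumption, splits text into runs of word characters and single punctuation marks. ``simple_tokenize`` is a hypothetical name.

```python
import re

# ASSUMPTION: a token is a run of word characters or one non-space symbol.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word and punctuation tokens, dropping whitespace."""
    return TOKEN_RE.findall(text)


tokens = simple_tokenize("Dr. Smith visited Graz, Austria.")
print(tokens)
# ['Dr', '.', 'Smith', 'visited', 'Graz', ',', 'Austria', '.']
```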
.. py:function:: build_token_spans(text, tokens)
Compute start and end character offsets for each token.
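Character offsets can be recovered by searching for each token from the end of the previous one. This is a sketch under the assumption that tokens appear in ``text`` in order and unmodified. ``token_spans`` and its return shape are illustrative, not the documented return type.

```python
def token_spans(text, tokens):
    """Return (token, start, end) triples with end-exclusive character offsets."""
    spans = []
    cursor = 0  # search resumes after the previous token
    for token in tokens:
        start = text.find(token, cursor)
        if start == -1:
            raise ValueError(f"token {token!r} not found after offset {cursor}")
        end = start + len(token)
        spans.append((token, start, end))
        cursor = end
    return spans


spans = token_spans("Graz, Austria", ["Graz", ",", "Austria"])
print(spans)
# [('Graz', 0, 4), (',', 4, 5), ('Austria', 6, 13)]
```

Advancing the cursor keeps repeated tokens from all resolving to the first occurrence.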
.. py:function:: _flatten_selector(sel)
Handle nested selector structures such as ``refinedBy``, ``Choice`` and ``List``.
Always returns a list of flat selectors.
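Flattening of this kind is naturally recursive. The sketch below assumes that ``Choice`` and ``List`` nodes keep their members under an ``items`` key and that refinements are nested under ``refinedBy``, as in the Web Annotation vocabulary. ``flatten_selector`` is a hypothetical stand-in, not the private helper itself.

```python
def flatten_selector(sel):
    """Return a flat list of selector dicts, whatever the nesting of `sel`."""
    if sel is None:
        return []
    if isinstance(sel, list):
        flat = []
        for item in sel:
            flat.extend(flatten_selector(item))
        return flat
    if not isinstance(sel, dict):
        return []  # ASSUMPTION: non-dict leaves (e.g. bare IRIs) are skipped
    if sel.get("type") in ("Choice", "List"):
        return flatten_selector(sel.get("items", []))
    # Emit the selector without its refinement, then the refinement itself.
    flat = [{k: v for k, v in sel.items() if k != "refinedBy"}]
    flat.extend(flatten_selector(sel.get("refinedBy")))
    return flat


nested = {
    "type": "XPathSelector",
    "value": "//p[1]",
    "refinedBy": {"type": "TextPositionSelector", "start": 0, "end": 4},
}
flat = flatten_selector(nested)
print(flat)
```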
.. py:function:: normalize_wadm(wadm, whitelist=None, blacklist=None)
Normalize WADM annotations into a simpler structure:

- ``entity_type``: string, taken from bodies whose purpose is ``tagging``
- ``selectors``: list of selector dicts

Handles multiple tagging values robustly (e.g. "Annotation" + "Concept").
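A rough sketch of this kind of normalization follows. It rests on several assumptions: the input is a plain list of annotation dicts, the generic ``"Annotation"`` tag is skipped when a more specific tag exists, and ``whitelist`` and ``blacklist`` filter on the entity type. ``simplify_annotations`` is a hypothetical name and none of this is confirmed by the documentation.

```python
def simplify_annotations(annotations, whitelist=None, blacklist=None):
    """Reduce annotation dicts to {'entity_type', 'selectors'} records."""
    simplified = []
    for anno in annotations:
        bodies = anno.get("body", [])
        if isinstance(bodies, dict):
            bodies = [bodies]
        # Collect every value of bodies whose purpose is "tagging".
        tags = []
        for body in bodies:
            if isinstance(body, dict) and body.get("purpose") == "tagging":
                value = body.get("value")
                if isinstance(value, list):
                    tags.extend(value)
                elif value:
                    tags.append(value)
        # ASSUMPTION: prefer a specific tag over the generic "Annotation".
        specific = [t for t in tags if t != "Annotation"]
        candidates = specific or tags
        if not candidates:
            continue
        entity_type = candidates[0]
        if whitelist is not None and entity_type not in whitelist:
            continue
        if blacklist is not None and entity_type in blacklist:
            continue
        target = anno.get("target", {})
        selectors = target.get("selector", []) if isinstance(target, dict) else []
        if isinstance(selectors, dict):
            selectors = [selectors]
        simplified.append({"entity_type": entity_type, "selectors": selectors})
    return simplified


sample = [{
    "body": [
        {"purpose": "tagging", "value": "Annotation"},
        {"purpose": "tagging", "value": "Concept"},
    ],
    "target": {"selector": {"type": "TextQuoteSelector", "exact": "Graz"}},
}]
normalized = simplify_annotations(sample)
print(normalized)
```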
.. py:function:: resolve_texts(jsonld)
Build a lookup from @id → text for all resources in a JSON-LD @graph.
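The lookup can be sketched as a single pass over ``@graph``. Which property holds the text is not stated in the documentation. The ``text`` key used below, like the name ``build_text_lookup``, is an assumption made for illustration.

```python
def build_text_lookup(jsonld, text_key="text"):
    """Map each resource's @id to its text for resources in @graph."""
    lookup = {}
    for resource in jsonld.get("@graph", []):
        # ASSUMPTION: resources without an @id or without text are skipped.
        if isinstance(resource, dict) and "@id" in resource and text_key in resource:
            lookup[resource["@id"]] = resource[text_key]
    return lookup


doc = {
    "@graph": [
        {"@id": "urn:uuid:1", "text": "Graz is a city in Austria."},
        {"@id": "urn:uuid:2", "text": "Vienna is the capital."},
        {"@id": "urn:uuid:3"},
    ]
}
texts = build_text_lookup(doc)
print(texts["urn:uuid:1"])  # Graz is a city in Austria.
```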
.. py:function:: _conll_from_annotations(text, annotations, max_span_tokens=None)
Helper: map annotations to BIO labels for a given text.
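BIO labelling marks the first token of an entity with ``B-`` and the following ones with ``I-``, leaving everything else as ``O``. The sketch below assumes annotations arrive as ``(start, end, entity_type)`` character spans and that ``max_span_tokens`` drops overly long spans. Both are guesses about the helper's contract, and ``bio_labels`` is a hypothetical name.

```python
import re


def bio_labels(text, annotations, max_span_tokens=None):
    """Return (token, label) pairs for `text` using the BIO scheme."""
    tokens = [(m.group(), m.start(), m.end())
              for m in re.finditer(r"\w+|[^\w\s]", text)]
    labels = ["O"] * len(tokens)
    for start, end, entity_type in annotations:
        # Tokens whose character range overlaps the annotated span.
        covered = [i for i, (_, s, e) in enumerate(tokens) if s < end and e > start]
        if not covered:
            continue
        if max_span_tokens is not None and len(covered) > max_span_tokens:
            continue  # ASSUMPTION: over-long spans are ignored
        labels[covered[0]] = f"B-{entity_type}"
        for i in covered[1:]:
            labels[i] = f"I-{entity_type}"
    return [(tok, lab) for (tok, _, _), lab in zip(tokens, labels)]


pairs = bio_labels("Anna Meier lives in Graz.", [(0, 10, "PER"), (20, 24, "LOC")])
print(pairs)
# [('Anna', 'B-PER'), ('Meier', 'I-PER'), ('lives', 'O'),
#  ('in', 'O'), ('Graz', 'B-LOC'), ('.', 'O')]
```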
.. py:function:: conll_to_string(sentences)
Return sentences in CoNLL format as a string.
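CoNLL-style output puts one token and its label on each line and separates sentences with a blank line. The sketch assumes each sentence is a list of ``(token, label)`` pairs joined by a tab. The actual input shape and column separator may differ, and ``to_conll`` is a hypothetical name.

```python
def to_conll(sentences):
    """Serialize sentences of (token, label) pairs to a CoNLL-style string."""
    blocks = [
        "\n".join(f"{token}\t{label}" for token, label in sentence)
        for sentence in sentences
    ]
    # A blank line separates sentences; the output ends with one newline.
    return "\n\n".join(blocks) + "\n"


conll = to_conll([
    [("Anna", "B-PER"), ("lives", "O")],
    [("Graz", "B-LOC")],
])
print(conll)
```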