semantic_html.utils
===================

.. py:module:: semantic_html.utils


Functions
---------

.. autoapisummary::

   semantic_html.utils.safe_xpath
   semantic_html.utils.generate_uuid
   semantic_html.utils.normalize_whitespace
   semantic_html.utils.get_same_as
   semantic_html.utils.extract_text_lxml
   semantic_html.utils.find_offset_with_context
   semantic_html.utils.extract_context
   semantic_html.utils.clean_html
   semantic_html.utils.annotate_tree_with_rdfa
   semantic_html.utils.mapping_lookup
   semantic_html.utils.regex_wrap_tree
   semantic_html.utils.tokenize
   semantic_html.utils.build_token_spans
   semantic_html.utils._flatten_selector
   semantic_html.utils.normalize_wadm
   semantic_html.utils.resolve_texts
   semantic_html.utils._conll_from_annotations
   semantic_html.utils.conll_to_string


Module Contents
---------------

.. py:function:: safe_xpath(root, expr)

.. py:function:: generate_uuid() -> str

   Generate a new UUID4 in URN format.
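
   As a rough illustration of the URN form (not necessarily how this function is implemented), the standard library's ``uuid`` module can produce such a value directly. The helper name ``make_urn_uuid`` is hypothetical.

   .. code-block:: python

      import uuid

      def make_urn_uuid() -> str:
          """Hypothetical helper: return a random UUID4 as a URN string."""
          # UUID.urn renders the value as "urn:uuid:<36-character UUID>".
          return uuid.uuid4().urn

      urn = make_urn_uuid()
      print(urn)  # a different "urn:uuid:..." value on every call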


.. py:function:: normalize_whitespace(text: str) -> str

.. py:function:: get_same_as(node: lxml.etree._Element, xpath: str = 'self::a/@href | .//a/@href') -> str | None

   Extract a URI (sameAs) from a node using XPath.
   By default, matches ``<a href="...">`` on the node itself or on any of its descendants.

   :param node: The lxml Element to inspect.
   :param xpath: XPath expression selecting the desired attribute(s).
                 Default: ``"self::a/@href | .//a/@href"``

   :returns: The first matching attribute value (string) or None.
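
   The default expression can be tried out directly with lxml. The sketch below is an assumption about the general approach, not the module's actual code; ``first_href`` is a hypothetical stand-in for the function.

   .. code-block:: python

      from lxml import etree

      def first_href(node, xpath="self::a/@href | .//a/@href"):
          """Hypothetical stand-in: first XPath match as a string, else None."""
          matches = node.xpath(xpath)
          return str(matches[0]) if matches else None

      paragraph = etree.fromstring(
          '<p>See <a href="https://example.org/x">X</a></p>'
      )
      link = paragraph.find("a")
      plain = etree.fromstring("<p>no link here</p>")

      from_descendant = first_href(paragraph)  # href found below the node
      from_self = first_href(link)             # href found on the node itself
      missing = first_href(plain)              # no <a> at all

   The ``self::a/@href`` branch covers the case where the node is the link, and ``.//a/@href`` covers links nested anywhere below it.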


.. py:function:: extract_text_lxml(node) -> str

   Extract plain text from an lxml element, preserving paragraph line breaks.
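
   One way to get this effect is to flatten each ``<p>`` separately and join the results with newlines. This is only a sketch under that assumption; the real function may treat other block elements or whitespace differently, and ``paragraphs_to_text`` is a hypothetical name.

   .. code-block:: python

      from lxml import etree

      def paragraphs_to_text(node):
          """Hypothetical sketch: one output line per <p> descendant."""
          paragraphs = node.xpath(".//p")
          if not paragraphs:
              # No nested paragraphs: flatten the node's own text content.
              return "".join(node.itertext()).strip()
          return "\n".join("".join(p.itertext()).strip() for p in paragraphs)

      root = etree.fromstring(
          "<div><p>First <b>bold</b> line.</p><p>Second line.</p></div>"
      )
      text = paragraphs_to_text(root)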


.. py:function:: find_offset_with_context(text, prefix, suffix, doc_text, max_chars=30)

.. py:function:: extract_context(node: lxml.etree._Element, max_chars: int = 30) -> tuple[str, str]

   Extract prefix and suffix context around the node's text within its parent text.
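
   The idea of prefix/suffix context, and of using it to locate a span again, can be sketched on plain strings. Both helpers below are hypothetical and operate on character offsets rather than lxml nodes; how ``extract_context`` and ``find_offset_with_context`` actually match is not documented here.

   .. code-block:: python

      def context_around(doc_text, start, end, max_chars=30):
          """Hypothetical: up to max_chars of text before and after a span."""
          prefix = doc_text[max(0, start - max_chars):start]
          suffix = doc_text[end:end + max_chars]
          return prefix, suffix

      def find_with_context(text, prefix, suffix, doc_text):
          """Hypothetical: offset of the occurrence of text whose
          surroundings match the given prefix and suffix, else None."""
          pos = doc_text.find(text)
          while pos != -1:
              before = doc_text[:pos]
              after = doc_text[pos + len(text):]
              if before.endswith(prefix) and after.startswith(suffix):
                  return pos
              pos = doc_text.find(text, pos + 1)
          return None

      doc = "Paris is large. She visited Paris in May."
      start = doc.rfind("Paris")
      prefix, suffix = context_around(doc, start, start + len("Paris"), max_chars=8)
      offset = find_with_context("Paris", prefix, suffix, doc)

   The context disambiguates repeated strings: a plain search for ``"Paris"`` would return the first occurrence, while the context match selects the second.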


.. py:function:: clean_html(tree: lxml.etree._Element, mapping: dict, remove_empty_tags: bool = True) -> lxml.etree._Element

   Remove HTML elements mapped to ``'IGNORE'`` via XPath and, optionally, remove empty tags using lxml.


.. py:function:: annotate_tree_with_rdfa(tree: lxml.etree._Element, mapping: dict, context: dict = None) -> lxml.etree._Element

   Annotate an lxml tree with RDFa ``typeof`` attributes and namespace declarations based on the mapping's ``@context``.


.. py:function:: mapping_lookup(mapping)

   Build a lookup table for the XPath entries in ``mapping``.
   Handles a single XPath string or a list of XPaths, as well as special Annotation subtypes.

   :returns: The ``xpath_lookup`` dict.


.. py:function:: regex_wrap_tree(tree: lxml.etree._Element, mapping: dict) -> lxml.etree._Element

   Traverse the lxml tree and wrap regex matches in ``<span>`` tags.
   Also mutates ``mapping``.


.. py:function:: tokenize(text)

   Simple and robust tokenization.


.. py:function:: build_token_spans(text, tokens)

   Compute start and end character offsets for each token.
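
   A common way to compute such offsets is to scan the text left to right, searching for each token from the end of the previous one. The sketch below assumes that approach and pairs it with a simple regex tokenizer; both function names and the tokenization rule are assumptions, not the module's code.

   .. code-block:: python

      import re

      def simple_tokenize(text):
          """Hypothetical tokenizer: runs of word characters, or single
          punctuation marks."""
          return re.findall(r"\w+|[^\w\s]", text)

      def token_spans(text, tokens):
          """Hypothetical: (start, end) character offsets for each token."""
          spans = []
          cursor = 0
          for tok in tokens:
              # Search from the cursor so repeated tokens map to successive
              # positions instead of all resolving to the first occurrence.
              start = text.index(tok, cursor)
              end = start + len(tok)
              spans.append((start, end))
              cursor = end
          return spans

      text = "Hello, world!"
      tokens = simple_tokenize(text)     # ['Hello', ',', 'world', '!']
      spans = token_spans(text, tokens)  # [(0, 5), (5, 6), (7, 12), (12, 13)]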


.. py:function:: _flatten_selector(sel)

   Handle nested selector structures such as ``refinedBy``, ``Choice``, and ``List``.
   Always returns a list of flat selectors.
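
   A recursive walk is one plausible way to do this. The sketch below assumes that ``Choice`` and ``List`` nodes keep their members under an ``items`` key and that refinements hang off ``refinedBy``; the function name and these key names are assumptions for illustration.

   .. code-block:: python

      def flatten_selector(sel):
          """Hypothetical sketch: return nested selectors as a flat list."""
          if sel is None:
              return []
          if isinstance(sel, list):
              flat = []
              for item in sel:
                  flat.extend(flatten_selector(item))
              return flat
          if sel.get("type") in ("Choice", "List"):
              # Container node: only its members are real selectors.
              return flatten_selector(sel.get("items", []))
          # Plain selector: emit it without its refinement, then the refinement.
          own = {k: v for k, v in sel.items() if k != "refinedBy"}
          return [own] + flatten_selector(sel.get("refinedBy"))

      nested = {
          "type": "XPathSelector",
          "value": "/p[1]",
          "refinedBy": {"type": "TextPositionSelector", "start": 0, "end": 5},
      }
      flat = flatten_selector(nested)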


.. py:function:: normalize_wadm(wadm, whitelist=None, blacklist=None)

   Normalize WADM annotations into a simpler structure:

   - ``entity_type``: string (from bodies with purpose ``tagging``)
   - ``selectors``: list of selector dicts

   Handles multiple tagging values robustly (e.g. "Annotation" + "Concept").


.. py:function:: resolve_texts(jsonld)

   Build a lookup from @id → text for all resources in a JSON-LD @graph.
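
   In outline, this kind of lookup is a single pass over ``@graph``. The sketch below assumes each resource stores its text under a ``text`` key, which is a guess; the actual property name is not documented here, and ``text_lookup`` is a hypothetical name.

   .. code-block:: python

      def text_lookup(jsonld, text_key="text"):
          """Hypothetical sketch: map each resource's @id to its text."""
          lookup = {}
          for resource in jsonld.get("@graph", []):
              rid = resource.get("@id")
              # Skip resources without an identifier or without text.
              if rid is not None and text_key in resource:
                  lookup[rid] = resource[text_key]
          return lookup

      doc = {
          "@graph": [
              {"@id": "urn:uuid:1", "text": "First note"},
              {"@id": "urn:uuid:2", "text": "Second note"},
              {"@id": "urn:uuid:3"},  # no text, so it is left out
          ]
      }
      lookup = text_lookup(doc)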


.. py:function:: _conll_from_annotations(text, annotations, max_span_tokens=None)

   Helper: map annotations to BIO labels for a given text.
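
   BIO labelling marks the first token of an annotated span with ``B-<type>``, the following tokens with ``I-<type>``, and everything else with ``O``. The sketch below works on precomputed token offsets and assumes annotations are dicts with ``start``, ``end``, and ``label`` keys; it also assumes ``max_span_tokens`` means "skip spans longer than this", which may differ from the real helper.

   .. code-block:: python

      def bio_labels(token_spans, annotations, max_span_tokens=None):
          """Hypothetical sketch: one BIO label per token."""
          labels = ["O"] * len(token_spans)
          for ann in annotations:
              # Tokens whose character range overlaps the annotation.
              covered = [
                  i for i, (start, end) in enumerate(token_spans)
                  if start < ann["end"] and end > ann["start"]
              ]
              if not covered:
                  continue
              if max_span_tokens is not None and len(covered) > max_span_tokens:
                  continue
              labels[covered[0]] = "B-" + ann["label"]
              for i in covered[1:]:
                  labels[i] = "I-" + ann["label"]
          return labels

      # Offsets for the tokens of "Ada Lovelace wrote notes."
      spans = [(0, 3), (4, 12), (13, 18), (19, 24), (24, 25)]
      person = [{"start": 0, "end": 12, "label": "PER"}]
      labels = bio_labels(spans, person)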


.. py:function:: conll_to_string(sentences)

   Return sentences in CoNLL format as a string.
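
   CoNLL-style output puts one token and its label per line, with a blank line between sentences. The sketch below assumes each sentence is a list of ``(token, label)`` pairs and that columns are tab-separated; the real input shape and separator may differ, and ``to_conll`` is a hypothetical name.

   .. code-block:: python

      def to_conll(sentences, sep="\t"):
          """Hypothetical sketch: serialize (token, label) pairs."""
          blocks = []
          for sentence in sentences:
              blocks.append(
                  "\n".join(f"{token}{sep}{label}" for token, label in sentence)
              )
          # A blank line separates sentences; the output ends with a newline.
          return "\n\n".join(blocks) + "\n"

      sentences = [
          [("Ada", "B-PER"), ("wrote", "O")],
          [("Hi", "O")],
      ]
      conll = to_conll(sentences)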