semantic_html.utils
===================

.. py:module:: semantic_html.utils


Functions
---------

.. autoapisummary::

   semantic_html.utils.safe_xpath
   semantic_html.utils.generate_uuid
   semantic_html.utils.normalize_whitespace
   semantic_html.utils.get_same_as
   semantic_html.utils.extract_text_lxml
   semantic_html.utils.find_offset_with_context
   semantic_html.utils.extract_context
   semantic_html.utils.clean_html
   semantic_html.utils.annotate_tree_with_rdfa
   semantic_html.utils.mapping_lookup
   semantic_html.utils.regex_wrap_tree
   semantic_html.utils.tokenize
   semantic_html.utils.build_token_spans
   semantic_html.utils._flatten_selector
   semantic_html.utils.normalize_wadm
   semantic_html.utils.resolve_texts
   semantic_html.utils._conll_from_annotations
   semantic_html.utils.conll_to_string


Module Contents
---------------

.. py:function:: safe_xpath(root, expr)

.. py:function:: generate_uuid() -> str

   Generate a new UUID4 in URN format.


.. py:function:: normalize_whitespace(text: str) -> str

.. py:function:: get_same_as(node: lxml.etree._Element, xpath: str = 'self::a/@href | .//a/@href') -> str | None

   Extracts a URI (sameAs) from a node using XPath.
   Default behavior matches either on the node itself or in descendants.

   :param node: The lxml Element to inspect.
   :param xpath: XPath expression selecting the desired attribute(s).
                 Default: "self::a/@href | .//a/@href"

   :returns: The first matching attribute value (string) or None.


.. py:function:: extract_text_lxml(node) -> str

   Extract plain text from lxml element, preserving paragraph line breaks.


.. py:function:: find_offset_with_context(text, prefix, suffix, doc_text, max_chars=30)

.. py:function:: extract_context(node: lxml.etree._Element, max_chars: int = 30) -> tuple[str, str]

   Extract prefix and suffix context around the node's text within its parent text.


.. py:function:: clean_html(tree: lxml.etree._Element, mapping: dict, remove_empty_tags: bool = True) -> lxml.etree._Element

   Remove HTML elements mapped to 'IGNORE' via xpath and optionally remove empty tags using lxml.


.. py:function:: annotate_tree_with_rdfa(tree: lxml.etree._Element, mapping: dict, context: dict = None) -> lxml.etree._Element

   Annotate an lxml tree with RDFa 'typeof' attributes and namespace declarations based on mapping @context.


.. py:function:: mapping_lookup(mapping)

   Build lookup table for xpath entries in mapping.
   Handles string or list of xpaths, and special Annotation subtypes.
   Returns xpath_lookup dict.


.. py:function:: regex_wrap_tree(tree: lxml.etree._Element, mapping: dict) -> lxml.etree._Element

   Traverses the lxml tree and wraps regex matches in tags, mutates mapping.


.. py:function:: tokenize(text)

   Simple and robust tokenization.


.. py:function:: build_token_spans(text, tokens)

   Compute start and end character offsets for each token.


.. py:function:: _flatten_selector(sel)

   Handle nested selector structures like refinedBy, Choice, List.
   Always return a list of flat selectors.


.. py:function:: normalize_wadm(wadm, whitelist=None, blacklist=None)

   Normalize WADM annotations into a simpler structure:

   - entity_type: string (from body purpose: tagging)
   - selectors: list of selector dicts

   Handles multiple tagging values robustly (e.g. "Annotation" + "Concept").


.. py:function:: resolve_texts(jsonld)

   Build a lookup from @id → text for all resources in a JSON-LD @graph.


.. py:function:: _conll_from_annotations(text, annotations, max_span_tokens=None)

   Helper: map annotations to BIO labels for a given text.


.. py:function:: conll_to_string(sentences)

   Return sentences in CoNLL format as a string.
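

The tokenization, token-span, and BIO-labeling helpers listed on this page are documented only by one-line summaries, so the following is a minimal, standalone sketch of how such a pipeline could fit together. It does not reproduce the module's actual implementation. Every name prefixed with ``sketch_`` is hypothetical, as are the regex used for tokenization, the annotation dict keys (``start``, ``end``, ``entity_type``), and the tab-separated output layout.

```python
import re


def sketch_tokenize(text):
    """Split text into word tokens and single punctuation tokens (assumed rule)."""
    return re.findall(r"\w+|[^\w\s]", text)


def sketch_token_spans(text, tokens):
    """Return (start, end) character offsets for each token, scanning left to right."""
    spans = []
    cursor = 0
    for tok in tokens:
        start = text.find(tok, cursor)
        if start == -1:
            raise ValueError(f"token {tok!r} not found after offset {cursor}")
        end = start + len(tok)
        spans.append((start, end))
        cursor = end  # continue after this token so repeated tokens resolve in order
    return spans


def sketch_bio_labels(text, annotations):
    """Map character-offset annotations to one BIO label per token.

    `annotations` is a hypothetical list of dicts with the keys
    'start', 'end' (character offsets) and 'entity_type'.
    """
    tokens = sketch_tokenize(text)
    spans = sketch_token_spans(text, tokens)
    labels = ["O"] * len(tokens)
    for ann in annotations:
        inside = False
        for i, (start, end) in enumerate(spans):
            # A token is labeled only if it lies entirely inside the annotation.
            if start >= ann["start"] and end <= ann["end"]:
                labels[i] = ("I-" if inside else "B-") + ann["entity_type"]
                inside = True
    return list(zip(tokens, labels))


def sketch_conll_string(sentences):
    """Render (token, label) pairs one per line, with a blank line between sentences."""
    return "\n\n".join(
        "\n".join(f"{tok}\t{label}" for tok, label in sent) for sent in sentences
    )


text = "Ada Lovelace lived in London."
annotations = [
    {"start": 0, "end": 12, "entity_type": "Person"},
    {"start": 22, "end": 28, "entity_type": "Place"},
]
sentence = sketch_bio_labels(text, annotations)
print(sketch_conll_string([sentence]))
```

Scanning with a moving cursor, rather than searching from the start of the text each time, keeps the offsets correct when the same token occurs more than once. The real ``_conll_from_annotations`` also accepts ``max_span_tokens``, which this sketch leaves out because its semantics are not described here.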