semantic_html.utils

Functions

safe_xpath(root, expr)

generate_uuid(→ str)

Generate a new UUID4 in URN format.

normalize_whitespace(→ str)

get_same_as(→ str | None)

Extracts a URI (sameAs) from a node using XPath.

extract_text_lxml(→ str)

Extract plain text from lxml element, preserving paragraph line breaks.

find_offset_with_context(text, prefix, suffix, doc_text)

extract_context(→ tuple[str, str])

Extract prefix and suffix context around the node's text within its parent text.

clean_html(→ lxml.etree._Element)

Remove HTML elements mapped to 'IGNORE' via xpath and optionally remove empty tags using lxml.

annotate_tree_with_rdfa(→ lxml.etree._Element)

Annotate an lxml tree with RDFa 'typeof' attributes and namespace declarations based on mapping @context.

mapping_lookup(mapping)

Build lookup table for xpath entries in mapping.

regex_wrap_tree(→ lxml.etree._Element)

Traverses the lxml tree and wraps regex matches in <span> tags; mutates mapping.

tokenize(text)

Simple and robust tokenization.

build_token_spans(text, tokens)

Compute start and end character offsets for each token.

_flatten_selector(sel)

Handle nested selector structures like refinedBy, Choice, List.

normalize_wadm(wadm[, whitelist, blacklist])

Normalize WADM annotations into a simpler structure (entity_type and selectors).

resolve_texts(jsonld)

Build a lookup from @id → text for all resources in a JSON-LD @graph.

_conll_from_annotations(text, annotations[, ...])

Helper: map annotations to BIO labels for a given text.

conll_to_string(sentences)

Return sentences in CoNLL format as a string.

Module Contents

semantic_html.utils.safe_xpath(root, expr)
semantic_html.utils.generate_uuid() → str

Generate a new UUID4 in URN format.
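A minimal sketch of what "UUID4 in URN format" means, using only the standard library. The function name generate_uuid_sketch is hypothetical, and the library's own implementation may differ.

```python
import uuid


def generate_uuid_sketch() -> str:
    """Return a random (version 4) UUID rendered as a URN string."""
    # uuid4() draws a random UUID; .urn renders it with the "urn:uuid:" prefix.
    return uuid.uuid4().urn


value = generate_uuid_sketch()
print(value)  # a string that begins with "urn:uuid:"
```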

semantic_html.utils.normalize_whitespace(text: str) → str
semantic_html.utils.get_same_as(node: lxml.etree._Element, xpath: str = 'self::a/@href | .//a/@href') → str | None

Extracts a URI (sameAs) from a node using XPath. The default expression matches <a href="…"> either on the node itself or in its descendants.

Parameters:
  • node – The lxml Element to inspect.

  • xpath – XPath expression selecting the desired attribute(s). Default: 'self::a/@href | .//a/@href'

Returns:

The first matching attribute value (string) or None.
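The default expression can be tried directly with lxml. The sketch below assumes lxml is installed. The helper first_href and the example URL are hypothetical; the helper only mirrors the documented behavior (first match or None) and is not the library's code.

```python
from lxml import etree

DEFAULT_XPATH = "self::a/@href | .//a/@href"


def first_href(node, xpath=DEFAULT_XPATH):
    """Return the first attribute value selected by `xpath`, or None."""
    matches = node.xpath(xpath)  # attribute selections come back as a list of strings
    return str(matches[0]) if matches else None


paragraph = etree.fromstring(
    '<p>Born in <a href="https://example.org/vienna">Vienna</a>.</p>'
)
link = paragraph.find("a")

on_descendant = first_href(paragraph)  # matched through .//a/@href
on_self = first_href(link)             # matched through self::a/@href
no_match = first_href(etree.fromstring("<p>no link here</p>"))
```

Passing a different xpath argument (for example one that selects a resource attribute) would follow the same pattern.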

semantic_html.utils.extract_text_lxml(node) → str

Extract plain text from lxml element, preserving paragraph line breaks.

semantic_html.utils.find_offset_with_context(text, prefix, suffix, doc_text, max_chars=30)
semantic_html.utils.extract_context(node: lxml.etree._Element, max_chars: int = 30) → tuple[str, str]

Extract prefix and suffix context around the node’s text within its parent text.
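Why prefix and suffix context is useful can be illustrated on plain strings: when the same text occurs more than once in a document, the surrounding characters identify which occurrence is meant. The sketch below is an assumption about the general idea only. The names context_around and locate_with_context are hypothetical, they work on strings rather than lxml nodes, and the return conventions of the library functions are not documented here.

```python
def context_around(parent_text: str, start: int, end: int, max_chars: int = 30) -> tuple[str, str]:
    """Return up to max_chars characters before and after parent_text[start:end]."""
    prefix = parent_text[max(0, start - max_chars):start]
    suffix = parent_text[end:end + max_chars]
    return prefix, suffix


def locate_with_context(text: str, prefix: str, suffix: str, doc_text: str) -> int:
    """Return the offset of `text` where it is framed by prefix and suffix, or -1."""
    hit = doc_text.find(prefix + text + suffix)
    return hit + len(prefix) if hit != -1 else -1


doc = "Anna met Paris. Paris is a city."
# Take context around the second "Paris" (characters 16 to 21).
prefix, suffix = context_around(doc, 16, 21, max_chars=5)
offset = locate_with_context("Paris", prefix, suffix, doc)
naive_offset = doc.find("Paris")  # finds the first occurrence instead
```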

semantic_html.utils.clean_html(tree: lxml.etree._Element, mapping: dict, remove_empty_tags: bool = True) → lxml.etree._Element

Remove HTML elements mapped to 'IGNORE' via xpath and optionally remove empty tags using lxml.

semantic_html.utils.annotate_tree_with_rdfa(tree: lxml.etree._Element, mapping: dict, context: dict = None) → lxml.etree._Element

Annotate an lxml tree with RDFa 'typeof' attributes and namespace declarations based on mapping @context.

semantic_html.utils.mapping_lookup(mapping)

Build lookup table for xpath entries in mapping. Handles string or list of xpaths, and special Annotation subtypes. Returns xpath_lookup dict.

semantic_html.utils.regex_wrap_tree(tree: lxml.etree._Element, mapping: dict) → lxml.etree._Element

Traverses the lxml tree and wraps regex matches in <span> tags. Mutates mapping as a side effect.

semantic_html.utils.tokenize(text)

Simple and robust tokenization.

semantic_html.utils.build_token_spans(text, tokens)

Compute start and end character offsets for each token.
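One way tokenization and span computation can fit together is sketched below. The regular expression, the (start, end) tuple shape, and the names tokenize_sketch and token_spans_sketch are assumptions for illustration; the library's tokenizer and return format may differ.

```python
import re


def tokenize_sketch(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def token_spans_sketch(text, tokens):
    """Return (start, end) character offsets for each token, scanning left to right."""
    spans = []
    cursor = 0
    for token in tokens:
        start = text.index(token, cursor)  # search only after the previous token
        end = start + len(token)
        spans.append((start, end))
        cursor = end
    return spans


text = "Dr. Who lives here."
tokens = tokenize_sketch(text)
spans = token_spans_sketch(text, tokens)
```

Advancing the cursor after each match keeps repeated tokens (the two periods above) from resolving to the same offset.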

semantic_html.utils._flatten_selector(sel)

Handle nested selector structures like refinedBy, Choice, List. Always return a list of flat selectors.
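The flattening described above can be sketched as a small recursive function. This is an illustration under assumptions: it treats Choice and List as objects with an items array and refinedBy as a nested selector, following the Web Annotation vocabulary, and it drops the refinedBy key from each flattened selector. The name flatten_selector_sketch is hypothetical and the library's handling may differ.

```python
def flatten_selector_sketch(sel):
    """Return a flat list of selector dicts with no nesting left."""
    if sel is None:
        return []
    if isinstance(sel, list):
        flat = []
        for item in sel:
            flat.extend(flatten_selector_sketch(item))
        return flat
    if sel.get("type") in ("Choice", "List"):
        # Containers contribute only their members.
        return flatten_selector_sketch(sel.get("items", []))
    # A plain selector: emit it without refinedBy, then emit its refinement.
    flat = [{key: value for key, value in sel.items() if key != "refinedBy"}]
    flat.extend(flatten_selector_sketch(sel.get("refinedBy")))
    return flat


nested = {
    "type": "Choice",
    "items": [
        {
            "type": "XPathSelector",
            "value": "/html/body/p[1]",
            "refinedBy": {"type": "TextPositionSelector", "start": 4, "end": 9},
        },
        {"type": "TextQuoteSelector", "exact": "Paris"},
    ],
}
flat_selectors = flatten_selector_sketch(nested)
```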

semantic_html.utils.normalize_wadm(wadm, whitelist=None, blacklist=None)

Normalize WADM annotations into a simpler structure:
  • entity_type – string (from body purpose: tagging)

  • selectors – list of selector dicts

Handles multiple tagging values robustly (e.g. "Annotation" + "Concept").

semantic_html.utils.resolve_texts(jsonld)

Build a lookup from @id → text for all resources in a JSON-LD @graph.
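A rough sketch of such a lookup follows. Which property holds the text is not stated here, so the text_key parameter, its default "text", and the name build_text_lookup are assumptions made only for illustration.

```python
def build_text_lookup(jsonld, text_key="text"):
    """Map each resource's @id to its text, skipping resources that lack either."""
    lookup = {}
    for resource in jsonld.get("@graph", []):
        if "@id" in resource and text_key in resource:
            lookup[resource["@id"]] = resource[text_key]
    return lookup


document = {
    "@graph": [
        {"@id": "urn:uuid:doc-1", "text": "Marie Curie lived in Paris."},
        {"@id": "urn:uuid:note-1", "text": "A short note."},
        {"@id": "urn:uuid:no-text"},
    ]
}
texts = build_text_lookup(document)
```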

semantic_html.utils._conll_from_annotations(text, annotations, max_span_tokens=None)

Helper: map annotations to BIO labels for a given text.
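BIO labeling marks the first token of an entity with B-, later tokens of the same entity with I-, and everything else with O. The sketch below assumes annotations are (start, end, entity_type) character spans and uses a simple regex tokenizer. Both are assumptions, as is the name bio_labels_sketch, and the max_span_tokens parameter is not modeled.

```python
import re


def bio_labels_sketch(text, annotations):
    """Return (token, label) pairs using the BIO scheme."""
    tokens = [
        (match.group(), match.start(), match.end())
        for match in re.finditer(r"\w+|[^\w\s]", text)
    ]
    labels = ["O"] * len(tokens)
    for start, end, entity_type in annotations:
        # Tokens that lie fully inside the annotated character span.
        covered = [
            index
            for index, (_, tok_start, tok_end) in enumerate(tokens)
            if tok_start >= start and tok_end <= end
        ]
        for position, index in enumerate(covered):
            labels[index] = ("B-" if position == 0 else "I-") + entity_type
    return [(token, label) for (token, _, _), label in zip(tokens, labels)]


rows = bio_labels_sketch(
    "Marie Curie lived in Paris.",
    [(0, 11, "PER"), (21, 26, "LOC")],
)
```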

semantic_html.utils.conll_to_string(sentences)

Return sentences in CoNLL format as a string.
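CoNLL-style output conventionally places one token per line with its label in a separate column, and a blank line between sentences. The sketch below assumes each sentence is a list of (token, label) pairs and uses a tab as the column separator; the library's input shape and separator are not documented here, and conll_string_sketch is a hypothetical name.

```python
def conll_string_sketch(sentences):
    """Serialize sentences as token<TAB>label lines separated by blank lines."""
    blocks = [
        "\n".join(f"{token}\t{label}" for token, label in sentence)
        for sentence in sentences
    ]
    return "\n\n".join(blocks) + "\n"


output = conll_string_sketch(
    [
        [("Marie", "B-PER"), ("Curie", "I-PER")],
        [("Paris", "B-LOC")],
    ]
)
print(output)
```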