semantic_html.utils¶

Functions¶

`safe_xpath`(root, expr)
`generate_uuid`(→ str)	Generate a new UUID4 in URN format.
`normalize_whitespace`(→ str)
`get_same_as`(→ str | None)	Extracts a URI (sameAs) from a node using XPath.
`extract_text_lxml`(→ str)	Extract plain text from lxml element, preserving paragraph line breaks.
`find_offset_with_context`(text, prefix, suffix, doc_text)
`extract_context`(→ tuple[str, str])	Extract prefix and suffix context around the node's text within its parent text.
`clean_html`(→ lxml.etree._Element)	Remove HTML elements mapped to 'IGNORE' via xpath and optionally remove empty tags using lxml.
`annotate_tree_with_rdfa`(→ lxml.etree._Element)	Annotate an lxml tree with RDFa 'typeof' attributes and namespace declarations based on mapping @context.
`mapping_lookup`(mapping)	Build lookup table for xpath entries in mapping.
`regex_wrap_tree`(→ lxml.etree._Element)	Traverses the lxml tree and wraps regex matches in <span> tags; mutates mapping.
`tokenize`(text)	Simple and robust tokenization.
`build_token_spans`(text, tokens)	Compute start and end character offsets for each token.
`_flatten_selector`(sel)	Handle nested selector structures like refinedBy, Choice, List.
`normalize_wadm`(wadm[, whitelist, blacklist])	Normalize WADM annotations into a simpler structure.
`resolve_texts`(jsonld)	Build a lookup from @id → text for all resources in a JSON-LD @graph.
`_conll_from_annotations`(text, annotations[, ...])	Helper: map annotations to BIO labels for a given text.
`conll_to_string`(sentences)	Return sentences in CoNLL format as a string.

Module Contents¶

semantic_html.utils.safe_xpath(root, expr)¶

semantic_html.utils.generate_uuid() → str¶: Generate a new UUID4 in URN format.
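A UUID4 in URN form can be produced with the standard library alone. The sketch below illustrates the output format only; it is not necessarily how `generate_uuid` is implemented, and the name `make_urn_uuid` is hypothetical.

```python
import uuid


def make_urn_uuid() -> str:
    """Return a random (version 4) UUID rendered as a URN string."""
    # UUID.urn prefixes the hyphenated hex form with 'urn:uuid:'.
    return uuid.uuid4().urn


example = make_urn_uuid()
print(example)  # a different 'urn:uuid:' value on every call
```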

semantic_html.utils.normalize_whitespace(text: str) → str¶

semantic_html.utils.get_same_as(node: lxml.etree._Element, xpath: str = 'self::a/@href | .//a/@href') → str | None¶

Extracts a URI (sameAs) from a node using XPath. By default, matches <a href="…"> either on the node itself or in its descendants.

Parameters:

node – The lxml Element to inspect.
xpath – XPath expression selecting the desired attribute(s). Default: "self::a/@href | .//a/@href"

Returns:

The first matching attribute value (string) or None.

semantic_html.utils.extract_text_lxml(node) → str¶: Extract plain text from lxml element, preserving paragraph line breaks.

semantic_html.utils.find_offset_with_context(text, prefix, suffix, doc_text, max_chars=30)¶
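The page gives no description for this function. Its parameters resemble a quote-plus-context lookup, in which `prefix` and `suffix` disambiguate repeated occurrences of `text` inside `doc_text`. The sketch below assumes that reading; `locate_with_context` is a hypothetical stand-in, it ignores `max_chars`, and its scoring rule is an illustration rather than the library's behavior.

```python
def locate_with_context(text, prefix, suffix, doc_text):
    """Return the start offset of the best-matching occurrence of `text`, or -1."""
    best_offset, best_score = -1, -1
    start = doc_text.find(text)
    while start != -1:
        end = start + len(text)
        # Compare the characters directly before and after this occurrence
        # with the expected context.
        before = doc_text[max(0, start - len(prefix)):start]
        after = doc_text[end:end + len(suffix)]
        score = int(before == prefix) + int(after == suffix)
        # Ties keep the earliest occurrence because the comparison is strict.
        if score > best_score:
            best_offset, best_score = start, score
        start = doc_text.find(text, start + 1)
    return best_offset


doc = "the cat sat; the cat ran"
print(locate_with_context("cat", "; the ", " ran", doc))  # 17
```

Without the context strings, a plain `str.find` would return the first occurrence at offset 4.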

semantic_html.utils.extract_context(node: lxml.etree._Element, max_chars: int = 30) → tuple[str, str]¶: Extract prefix and suffix context around the node’s text within its parent text.

semantic_html.utils.clean_html(tree: lxml.etree._Element, mapping: dict, remove_empty_tags: bool = True) → lxml.etree._Element¶: Remove HTML elements mapped to ‘IGNORE’ via xpath and optionally remove empty tags using lxml.

semantic_html.utils.annotate_tree_with_rdfa(tree: lxml.etree._Element, mapping: dict, context: dict = None) → lxml.etree._Element¶: Annotate an lxml tree with RDFa ‘typeof’ attributes and namespace declarations based on mapping @context.

semantic_html.utils.mapping_lookup(mapping)¶: Build lookup table for xpath entries in mapping. Handles string or list of xpaths, and special Annotation subtypes. Returns xpath_lookup dict.

semantic_html.utils.regex_wrap_tree(tree: lxml.etree._Element, mapping: dict) → lxml.etree._Element¶: Traverses the lxml tree and wraps regex matches in <span> tags. Mutates mapping as a side effect.

semantic_html.utils.tokenize(text)¶: Simple and robust tokenization.

semantic_html.utils.build_token_spans(text, tokens)¶: Compute start and end character offsets for each token.
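Tokenization and offset computation usually go together: once tokens are known, each one can be located by scanning forward through the text. The sketch below shows one common way to do this; the regex and the names `simple_tokenize` and `token_spans` are assumptions and may differ from what `tokenize` and `build_token_spans` do.

```python
import re


def simple_tokenize(text):
    # Runs of word characters, or any single non-space punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)


def token_spans(text, tokens):
    """Return (start, end) character offsets per token; end is exclusive."""
    spans = []
    cursor = 0
    for token in tokens:
        # Search from the end of the previous token so repeated tokens
        # map to successive positions.
        start = text.find(token, cursor)
        if start == -1:
            raise ValueError(f"token {token!r} not found after offset {cursor}")
        end = start + len(token)
        spans.append((start, end))
        cursor = end
    return spans


text = "Hello, world!"
tokens = simple_tokenize(text)
print(tokens)                     # ['Hello', ',', 'world', '!']
print(token_spans(text, tokens))  # [(0, 5), (5, 6), (7, 12), (12, 13)]
```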

semantic_html.utils._flatten_selector(sel)¶: Handle nested selector structures like refinedBy, Choice, List. Always return a list of flat selectors.
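To make the idea of flattening concrete, here is a minimal sketch. It assumes selectors are dicts shaped as in the W3C Web Annotation Data Model, where `Choice` and `List` carry an `items` array and a selector may be narrowed with `refinedBy`. The function name and the output order are illustrative assumptions, not the library's documented behavior.

```python
def flatten_selector(sel):
    """Return a flat list of selector dicts with no nesting."""
    if sel is None:
        return []
    if isinstance(sel, list):
        flat = []
        for item in sel:
            flat.extend(flatten_selector(item))
        return flat
    if sel.get("type") in ("Choice", "List"):
        # Containers contribute only their members.
        return flatten_selector(sel.get("items", []))
    # Plain selector: emit it without its refinement, then the refinement itself.
    own = {key: value for key, value in sel.items() if key != "refinedBy"}
    return [own] + flatten_selector(sel.get("refinedBy"))


nested = {
    "type": "Choice",
    "items": [
        {
            "type": "XPathSelector",
            "value": "/html/body/p[1]",
            "refinedBy": {"type": "TextPositionSelector", "start": 4, "end": 9},
        },
        {"type": "TextQuoteSelector", "exact": "hello"},
    ],
}
print([s["type"] for s in flatten_selector(nested)])
# ['XPathSelector', 'TextPositionSelector', 'TextQuoteSelector']
```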

semantic_html.utils.normalize_wadm(wadm, whitelist=None, blacklist=None)¶: Normalize WADM annotations into a simpler structure:

- entity_type: string (from body purpose: tagging)
- selectors: list of selector dicts

Handles multiple tagging values robustly (e.g. "Annotation" + "Concept").

semantic_html.utils.resolve_texts(jsonld)¶: Build a lookup from @id → text for all resources in a JSON-LD @graph.
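A lookup of this kind can be built in a single pass over the `@graph` array. The sketch below is illustrative only: the page does not say which property holds a resource's text, so the `text` key, the `text_key` parameter, and the name `build_text_lookup` are all assumptions.

```python
def build_text_lookup(jsonld, text_key="text"):
    """Map each resource's @id to its text, skipping resources without either."""
    lookup = {}
    for resource in jsonld.get("@graph", []):
        resource_id = resource.get("@id")
        if resource_id is not None and text_key in resource:
            lookup[resource_id] = resource[text_key]
    return lookup


document = {
    "@graph": [
        {"@id": "urn:uuid:1", "text": "First paragraph."},
        {"@id": "urn:uuid:2", "text": "Second paragraph."},
        {"@id": "urn:uuid:3"},  # no text, so it is left out of the lookup
    ]
}
print(build_text_lookup(document))
# {'urn:uuid:1': 'First paragraph.', 'urn:uuid:2': 'Second paragraph.'}
```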

semantic_html.utils._conll_from_annotations(text, annotations, max_span_tokens=None)¶: Helper: map annotations to BIO labels for a given text.
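BIO labeling marks the first token of an annotated span with `B-<type>`, each following token with `I-<type>`, and every other token with `O`. The sketch below shows the scheme on character-offset annotations. The input shapes (tuples of offsets) and the name `bio_labels` are assumptions, and `max_span_tokens` is not modeled.

```python
def bio_labels(token_spans, annotations):
    """token_spans: list of (start, end); annotations: list of (start, end, label)."""
    labels = ["O"] * len(token_spans)
    for ann_start, ann_end, label in annotations:
        inside = False
        for i, (tok_start, tok_end) in enumerate(token_spans):
            # A token belongs to the annotation when it lies fully within it.
            if tok_start >= ann_start and tok_end <= ann_end:
                labels[i] = ("I-" if inside else "B-") + label
                inside = True
    return labels


# "Ada Lovelace wrote notes" with a PER annotation over characters 0..12.
spans = [(0, 3), (4, 12), (13, 18), (19, 24)]
print(bio_labels(spans, [(0, 12, "PER")]))  # ['B-PER', 'I-PER', 'O', 'O']
```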

semantic_html.utils.conll_to_string(sentences)¶: Return sentences in CoNLL format as a string.
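In the common two-column CoNLL layout, each line holds a token and its label separated by a tab, and a blank line separates sentences. The sketch below assumes each sentence is a list of `(token, label)` pairs; the actual input structure expected by `conll_to_string` is not shown on this page, and `to_conll` is a hypothetical name.

```python
def to_conll(sentences):
    """Serialize sentences (lists of (token, label) pairs) to CoNLL-style text."""
    blocks = [
        "\n".join(f"{token}\t{label}" for token, label in sentence)
        for sentence in sentences
    ]
    # A blank line separates sentences; the output ends with one newline.
    return "\n\n".join(blocks) + "\n"


sentences = [
    [("Ada", "B-PER"), ("Lovelace", "I-PER"), ("wrote", "O")],
    [("Notes", "O")],
]
print(to_conll(sentences), end="")
```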
