semantic_html.utils

Functions

generate_uuid(→ str)

Generate a new UUID4 in URN format.

normalize_whitespace(→ str)

extract_text_lxml(→ str)

Extract plain text from an HTML snippet using lxml.

find_offset_with_context(text, prefix, suffix, doc_text[, max_chars])

extract_context(tag[, max_chars])

clean_html(→ str)

Remove HTML elements mapped to 'IGNORE' and optionally remove empty tags.

annotate_html_with_rdfa(→ str)

Annotate an HTML string with RDFa 'typeof' attributes according to mapping.

build_tag_style_lookup(mapping)

Build lookup tables for tags, styles, and optional classes, including regex-only definitions.

Module Contents

semantic_html.utils.generate_uuid() → str

Generate a new UUID4 in URN format.
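As a rough illustration of what "UUID4 in URN format" means, the following sketch uses only the standard library. The function name `generate_uuid_sketch` is hypothetical, and this is not necessarily how `generate_uuid` is implemented internally; it only shows the shape of the value one would expect back.

```python
import uuid


def generate_uuid_sketch() -> str:
    """Hypothetical stand-in: return a random (version 4) UUID as a URN string."""
    # uuid4() draws a random UUID; .urn renders it as 'urn:uuid:<36-char hex form>'
    return uuid.uuid4().urn


value = generate_uuid_sketch()
print(value)
```

The result always starts with the `urn:uuid:` prefix, which makes it usable directly as an identifier in RDF or RDFa contexts.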

semantic_html.utils.normalize_whitespace(text: str) → str

semantic_html.utils.extract_text_lxml(html_snippet: str) → str

Extract plain text from an HTML snippet using lxml.
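The sketch below shows one common way to pull plain text out of an HTML fragment with lxml. It assumes lxml is installed; the helper name `extract_text_sketch` is hypothetical, and the actual `extract_text_lxml` may handle whitespace or edge cases differently.

```python
from lxml import html as lxml_html


def extract_text_sketch(html_snippet: str) -> str:
    """Hypothetical stand-in: parse a fragment and return its text without markup."""
    # fromstring() parses the snippet into an element tree
    root = lxml_html.fromstring(html_snippet)
    # text_content() concatenates the text of the element and all its descendants
    return root.text_content()


plain = extract_text_sketch("<p>Hello <b>world</b>!</p>")
print(plain)
```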

semantic_html.utils.find_offset_with_context(text, prefix, suffix, doc_text, max_chars=30)

semantic_html.utils.extract_context(tag, max_chars=30)

semantic_html.utils.clean_html(html: str, mapping: dict, remove_empty_tags: bool = True) → str

Remove HTML elements mapped to 'IGNORE' and optionally remove empty tags.

semantic_html.utils.annotate_html_with_rdfa(html: str, mapping: dict) → str

Annotate an HTML string with RDFa 'typeof' attributes according to mapping.

semantic_html.utils.build_tag_style_lookup(mapping)

Build lookup tables for tags, styles, and optional classes, including regex-only definitions.