semantic_html.utils
===================

.. py:module:: semantic_html.utils


Functions
---------

.. autoapisummary::

   semantic_html.utils.generate_uuid
   semantic_html.utils.normalize_whitespace
   semantic_html.utils.extract_text_lxml
   semantic_html.utils.find_offset_with_context
   semantic_html.utils.extract_context
   semantic_html.utils.clean_html
   semantic_html.utils.annotate_html_with_rdfa
   semantic_html.utils.build_tag_style_lookup


Module Contents
---------------

.. py:function:: generate_uuid() -> str

   Generate a new UUID4 in URN format.


.. py:function:: normalize_whitespace(text: str) -> str

.. py:function:: extract_text_lxml(html_snippet: str) -> str

   Extract plain text from an HTML snippet using lxml.


.. py:function:: find_offset_with_context(text, prefix, suffix, doc_text, max_chars=30)

.. py:function:: extract_context(tag, max_chars=30)

.. py:function:: clean_html(html: str, mapping: dict, remove_empty_tags: bool = True) -> str

   Remove HTML elements mapped to 'IGNORE' and optionally remove empty tags.


.. py:function:: annotate_html_with_rdfa(html: str, mapping: dict) -> str

   Annotate an HTML string with RDFa 'typeof' attributes according to mapping.


.. py:function:: build_tag_style_lookup(mapping)

   Build lookup tables for tags, styles, and optional classes, including regex-only definitions.
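
The behavior documented for ``generate_uuid`` (a UUID4 in URN format) and the
behavior suggested by the name ``normalize_whitespace`` can be approximated with
the standard library alone. The following is a minimal sketch, not the module's
actual implementation: the names ``sketch_generate_uuid`` and
``sketch_normalize_whitespace`` are hypothetical, and the whitespace rule
(collapse runs of whitespace into a single space and trim both ends) is an
assumption, since ``normalize_whitespace`` carries no docstring.

.. code-block:: python

   import uuid


   def sketch_generate_uuid() -> str:
       """Return a random UUID4 as a URN string (hypothetical stand-in)."""
       # UUID.urn yields the RFC 4122 URN form: "urn:uuid:<hex-with-dashes>".
       return uuid.uuid4().urn


   def sketch_normalize_whitespace(text: str) -> str:
       """Collapse whitespace runs and trim the ends (assumed behavior)."""
       # str.split() with no arguments splits on any run of whitespace
       # and discards leading and trailing whitespace.
       return " ".join(text.split())


   if __name__ == "__main__":
       print(sketch_generate_uuid())  # a fresh "urn:uuid:..." value on each call
       print(sketch_normalize_whitespace("  Hello \n\t world  "))  # Hello world

The real functions may differ in detail, for example in how non-breaking spaces
or newlines are treated, so check the module source before relying on these
assumptions.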