semantic_html.utils
===================

.. py:module:: semantic_html.utils

Functions
---------

.. autoapisummary::

   semantic_html.utils.generate_uuid
   semantic_html.utils.normalize_whitespace
   semantic_html.utils.extract_text_lxml
   semantic_html.utils.find_offset_with_context
   semantic_html.utils.extract_context
   semantic_html.utils.clean_html
   semantic_html.utils.annotate_html_with_rdfa
   semantic_html.utils.build_tag_style_lookup

Module Contents
---------------
.. py:function:: generate_uuid() -> str

   Generate a new UUID4 in URN format.

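
A UUID4 in URN format is a random UUID rendered as ``urn:uuid:<hex-with-dashes>``. The following is a minimal sketch of how such a value can be produced with the standard library; the name ``generate_uuid_sketch`` is hypothetical and is not claimed to match the module's actual implementation.

```python
import uuid


def generate_uuid_sketch() -> str:
    """Return a random (version 4) UUID as a URN string.

    Hypothetical stand-in for illustration only; the real
    generate_uuid() may be implemented differently.
    """
    # uuid.uuid4() creates a random UUID; .urn renders it as "urn:uuid:<uuid>".
    return uuid.uuid4().urn


example = generate_uuid_sketch()
print(example)  # the value differs on every call
```

Because the value is random, the example prints a different identifier on each run; only the ``urn:uuid:`` prefix and the overall length stay constant.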
.. py:function:: normalize_whitespace(text: str) -> str
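
The source gives no description for this function, so its exact behaviour is not documented here. As a sketch only, and assuming it collapses runs of whitespace into single spaces and trims the ends (a common meaning of "normalize whitespace", not confirmed by the module), an equivalent helper could look like this. The name ``normalize_whitespace_sketch`` is hypothetical.

```python
import re


def normalize_whitespace_sketch(text: str) -> str:
    """Collapse whitespace runs to one space and trim both ends.

    Assumed behaviour for illustration; the real
    normalize_whitespace() may differ.
    """
    # \s+ matches any run of spaces, tabs, or newlines.
    return re.sub(r"\s+", " ", text).strip()


cleaned = normalize_whitespace_sketch("  Hello,\n\t world  ")
print(cleaned)  # Hello, world
```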
.. py:function:: extract_text_lxml(html_snippet: str) -> str

   Extract plain text from an HTML snippet using lxml.

.. py:function:: find_offset_with_context(text, prefix, suffix, doc_text, max_chars=30)

.. py:function:: extract_context(tag, max_chars=30)

.. py:function:: clean_html(html: str, mapping: dict, remove_empty_tags: bool = True) -> str

   Remove HTML elements mapped to 'IGNORE' and optionally remove empty tags.

.. py:function:: annotate_html_with_rdfa(html: str, mapping: dict) -> str

   Annotate an HTML string with RDFa 'typeof' attributes according to mapping.

.. py:function:: build_tag_style_lookup(mapping)

   Build lookup tables for tags, styles, and optional classes, including regex-only definitions.