semantic_html.utils
Functions
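- generate_uuid(): Generate a new UUID4 in URN format.
- normalize_whitespace(text)
- extract_text_lxml(html_snippet): Extract plain text from an HTML snippet using lxml.
- find_offset_with_context(text, prefix, suffix, doc_text, max_chars=30)
- extract_context(tag, max_chars=30)
- clean_html(html, mapping, remove_empty_tags=True): Remove HTML elements mapped to 'IGNORE' and optionally remove empty tags.
- annotate_html_with_rdfa(html, mapping): Annotate an HTML string with RDFa 'typeof' attributes according to mapping.
- build_tag_style_lookup(mapping): Build lookup tables for tags, styles, and optional classes, including regex-only definitions.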
Module Contents
- semantic_html.utils.generate_uuid() → str
Generate a new UUID4 in URN format.
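The following is a minimal sketch of what "UUID4 in URN format" means, using only the standard library. The function name `new_urn_uuid` is hypothetical and is not part of `semantic_html`; the library's actual implementation may differ.

```python
import uuid


def new_urn_uuid() -> str:
    """Return a random (version 4) UUID rendered as a URN string."""
    # uuid4() draws a random UUID; the .urn property prefixes it with "urn:uuid:".
    return uuid.uuid4().urn


identifier = new_urn_uuid()
# The result has the shape "urn:uuid:xxxxxxxx-xxxx-4xxx-xxxx-xxxxxxxxxxxx".
print(identifier)
```

The URN form is convenient when the identifier is used as a node IRI in RDF or JSON-LD output, since it is already a valid absolute IRI.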
- semantic_html.utils.normalize_whitespace(text: str) → str
- semantic_html.utils.extract_text_lxml(html_snippet: str) → str
Extract plain text from an HTML snippet using lxml.
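As an illustration of plain-text extraction with lxml, the sketch below parses a snippet and concatenates its text nodes. It assumes the `lxml` package is installed; the helper name `text_from_snippet` is hypothetical, and the real `extract_text_lxml` may handle whitespace, empty input, or malformed markup differently.

```python
from lxml import html as lxml_html


def text_from_snippet(html_snippet: str) -> str:
    """Return the concatenated text content of an HTML snippet."""
    # fromstring() parses a fragment and returns its root element
    # (wrapping multiple top-level nodes in a container if needed).
    root = lxml_html.fromstring(html_snippet)
    # text_content() joins all descendant text nodes, dropping the tags.
    return root.text_content()


plain = text_from_snippet("<p>Hello <b>world</b>!</p>")
print(plain)
```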
- semantic_html.utils.find_offset_with_context(text, prefix, suffix, doc_text, max_chars=30)
- semantic_html.utils.extract_context(tag, max_chars=30)
- semantic_html.utils.clean_html(html: str, mapping: dict, remove_empty_tags: bool = True) → str
Remove HTML elements mapped to ‘IGNORE’ and optionally remove empty tags.
- semantic_html.utils.annotate_html_with_rdfa(html: str, mapping: dict) → str
Annotate an HTML string with RDFa ‘typeof’ attributes according to mapping.
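The structure of `mapping` is not documented here, so the sketch below only illustrates the general idea of adding RDFa `typeof` attributes. It assumes a simplified, hypothetical mapping from tag name to type IRI (`tag_to_type`) and assumes `lxml` is installed; the function name `annotate_sketch` is also hypothetical and does not reflect the library's actual mapping format or matching rules.

```python
from lxml import html as lxml_html


def annotate_sketch(html: str, tag_to_type: dict) -> str:
    """Add a 'typeof' attribute to every element whose tag is in tag_to_type."""
    # Wrap the input in a temporary <div> so fragments with several
    # top-level nodes parse cleanly.
    root = lxml_html.fragment_fromstring(html, create_parent="div")
    for element in root.iter():
        if element is root:
            continue  # skip the temporary wrapper
        rdf_type = tag_to_type.get(element.tag)
        if rdf_type:
            element.set("typeof", rdf_type)
    # Serialize the wrapper's children only, keeping any leading text.
    parts = [root.text or ""]
    parts.extend(lxml_html.tostring(child, encoding="unicode") for child in root)
    return "".join(parts)


# Hypothetical mapping: mark every <b> element as a schema.org Person.
annotated = annotate_sketch("<p>Met <b>Ada</b> today.</p>", {"b": "schema:Person"})
print(annotated)
```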
- semantic_html.utils.build_tag_style_lookup(mapping)
Build lookup tables for tags, styles, and optional classes, including regex-only definitions.