Extract text from HTML
How is html_text different from .xpath('//text()') from LXML or .get_text() from Beautiful Soup?
- Text extracted with html_text does not contain inline styles, JavaScript, comments, or other text that is not normally visible to users;
- html_text normalizes whitespace, but in a smarter way than .xpath('normalize-space()'), adding spaces around inline elements (which are often used as block elements in HTML markup) and trying to avoid adding extra spaces around punctuation;
- html_text can add newlines (e.g. after headers or paragraphs), so that the output text looks more like how it is rendered in browsers.
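The differences described above can be illustrated with a small standard-library sketch. This is not how html_text is implemented; it only mimics the ideas of dropping invisible text, collapsing whitespace, and adding newlines after block elements. The names `VisibleTextParser`, `visible_text`, `INVISIBLE_TAGS`, and `BLOCK_TAGS`, and the tag lists themselves, are assumptions made for this example.

```python
from html.parser import HTMLParser

# Hypothetical tag lists for this sketch; the real library's rules are richer.
INVISIBLE_TAGS = {"script", "style", "noscript"}
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


class VisibleTextParser(HTMLParser):
    """Collects text a user would normally see, skipping scripts, styles and comments."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside an invisible element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in INVISIBLE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in INVISIBLE_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            # Emit a newline after block-level elements such as headers and paragraphs.
            self._chunks.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    # handle_comment is left at its default (a no-op), so comments are dropped.


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    lines = []
    for line in "".join(parser._chunks).split("\n"):
        collapsed = " ".join(line.split())  # collapse runs of whitespace
        if collapsed:
            lines.append(collapsed)
    return "\n".join(lines)


sample = (
    "<h1>Title</h1><style>p{color:red}</style>"
    "<p>Hello   <b>world</b>!</p><!-- hidden -->"
    "<script>var x = 1;</script>"
)
print(visible_text(sample))
```

With the packaged library itself, the corresponding entry point is `html_text.extract_text(html)`, which applies its own, more careful rules for whitespace and layout.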
Release | Stable | Testing |
---|---|---|
Fedora Rawhide | 0.6.2-2.fc42 | - |
Fedora 42 | 0.6.2-2.fc42 | - |
Fedora 41 | 0.6.2-1.fc41 | - |
Fedora 40 | 0.6.2-1.fc40 | - |
You can contact the maintainers of this package via email at python-html-text dash maintainers at fedoraproject dot org.