I’d like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.
- Filter out HTML tags and resolve entities in Python
- Convert XML/HTML Entities into Unicode String in Python
html2text is a Python program that does a pretty good job at this. It converts HTML into plain text, formatted as Markdown.
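If installing a third-party package is not an option, a rough approximation is possible with only the standard library. The sketch below is not how html2text works; it is a minimal alternative built on `html.parser`, and the names `TextExtractor` and `html_to_text` are made up for illustration. It assumes that dropping `<script>`, `<style>`, and `<title>` content and breaking lines at a handful of block-level tags is close enough to what a browser copy would give.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text; a simplification, not a browser-accurate renderer."""

    # Content of these tags is never shown in the page body.
    SKIP = {"script", "style", "title"}
    # Assumed set of tags that start or end a line of text.
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        # convert_charrefs=True resolves entities such as &amp; and &eacute;
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def text(self):
        raw = "".join(self._parts)
        # Collapse runs of whitespace within each line and drop empty lines.
        lines = (" ".join(line.split()) for line in raw.splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    sample = "<p>Caf&eacute; &amp; bar</p><script>var x = 1;</script><p>Second <b>line</b></p>"
    print(html_to_text(sample))
```

With html2text installed, the equivalent call is `html2text.html2text(html)`, which returns the text with Markdown markup for links, emphasis, and headings rather than bare text.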