Module 5: clean_html.py (Cleaning HTML for posting)
Now, Microsoft Word likes to add lots of style and class attributes to its HTML in order to preserve the fonts and other formatting that you used in the document itself, but for my posts I wanted my site’s CSS to handle all of the formatting.
So I created a very quick little script that removes all of the class and style attributes from the HTML tags in a given string.
    import lxml.html

    def scrub_html_string(sent_string):
        # Parse the HTML
        html = lxml.html.fromstring(sent_string)
        for tag in html.xpath('//*[@class]'):
            # For each element with a class attribute, remove that class attribute
            tag.attrib.pop('class')
        for tag in html.xpath('//*[@style]'):
            # For each element with a style attribute, remove that style attribute
            tag.attrib.pop('style')
        # Note: on Python 3 this returns bytes; pass encoding='unicode' to get a str
        return lxml.html.tostring(html)
For this I again turned to some answers I’d Googled on Stack Overflow. This is probably another place where I could have come up with a more elegant solution than the one I implemented, but it does what I want and it runs quickly, so I didn’t put in the additional development time.
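For what it's worth, here is a rough sketch of what a slightly tidier version might look like. It is not the script I actually ran: the function name `strip_attributes`, its `attrs` parameter, and the sample string are all made up for illustration. The idea is to select every element carrying either attribute with a single XPath query, and to ask lxml for a `str` instead of bytes.

```python
import lxml.html


def strip_attributes(html_string, attrs=("class", "style")):
    """Return html_string with the named attributes removed from every element.

    Hypothetical helper for illustration; not part of the original script.
    """
    if not attrs:
        # Nothing to strip, and an empty XPath predicate would be invalid
        return html_string
    root = lxml.html.fromstring(html_string)
    # Build one predicate such as "@class or @style"
    predicate = " or ".join("@" + name for name in attrs)
    for element in root.xpath("//*[" + predicate + "]"):
        for name in attrs:
            # pop() with a default, so an element missing one attribute is fine
            element.attrib.pop(name, None)
    # encoding="unicode" makes lxml return a str rather than bytes
    return lxml.html.tostring(root, encoding="unicode")


# Made-up sample in the style of Word's exported markup
sample = (
    '<p class="MsoNormal" style="margin:0in">Hello '
    '<span style="color:red">world</span></p>'
)
cleaned = strip_attributes(sample)
print(cleaned)
```

Other attributes (an `id`, say) are left alone, and passing a different tuple for `attrs` would let the same function strip things like `lang` as well, assuming that is ever wanted.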