WordPress Book Posting Scripts

Module 5: clean_html.py (Cleaning HTML for posting)

Now, Microsoft Word likes to add lots of style and class attributes to the HTML it produces in order to preserve the fonts and other formatting that you used in the document itself, but for my posts I wanted my site’s CSS to handle all of the formatting.

So I created a quick little script that removes all of the class and style attributes from the HTML tags in a given string.

import lxml.html

def scrub_html_string(sent_string):
    # Parse the html
    html = lxml.html.fromstring(sent_string)

    for tag in html.xpath('//*[@class]'):
        # For each element with a class attribute, remove that class attribute
        tag.attrib.pop('class')
    for tag in html.xpath('//*[@style]'):
        # For each element with a style attribute, remove that style attribute
        tag.attrib.pop('style')

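    # Serialize the cleaned tree back to HTML
    # (under Python 3 this returns bytes; pass encoding='unicode' to get a str)
    return lxml.html.tostring(html)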

For this I again turned to some answers I’d Googled on Stack Overflow. This is probably another place where I could have come up with a more elegant solution than the one I implemented, but it does what I want and it runs quickly, so I didn’t put in the additional development time.
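
For anyone curious what a slightly tidier version might look like, here is a minimal sketch that does the same job in a single pass. It assumes Python 3 and lxml; the function name strip_attributes, its names parameter, and the sample string are hypothetical and are not part of the script above.

```python
import lxml.html


def strip_attributes(html_string, names=("class", "style")):
    """Return html_string with the given attributes removed from every element.

    Hypothetical helper for illustration only.
    """
    root = lxml.html.fromstring(html_string)

    # Build one XPath query, e.g. //*[@class or @style], so that every
    # element carrying at least one unwanted attribute is visited once.
    query = "//*[" + " or ".join("@" + name for name in names) + "]"

    for element in root.xpath(query):
        for name in names:
            # pop() with a default avoids a KeyError when an element
            # has only some of the listed attributes.
            element.attrib.pop(name, None)

    # encoding="unicode" makes tostring() return a str rather than bytes.
    return lxml.html.tostring(root, encoding="unicode")


# Made-up sample resembling Word's exported markup.
sample = (
    '<p class="MsoNormal" style="margin:0">Hello '
    '<span style="color:red">world</span></p>'
)
cleaned = strip_attributes(sample)
print(cleaned)
```

The only real differences from my script are the combined XPath query and the string return value, which saves a decode step if the result is going straight into a post body.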
