WordPress Book Posting Scripts

Module 3: chop_sections.py (Splitting the main document into chapters)

So now I have my main document turned into a .html file, and I can generate the dates that each chapter should be published on. The next thing I needed to do was to split the main document up into chapters, so that I could post each chapter individually.

I did this in a fairly brute force way: I read through the html file line by line and checked to see if the line I was on corresponded to a new heading (all of my chapter headings were h1 tags in the html). Whenever I hit a new heading, I created a file name corresponding to it and started writing all of the lines to that file until I hit another heading. Then I created a new file to receive the output.

This script is intended to chop up the full-length clean html version of the
book into chapter length snippets, based on the headings of sections.

import os

def new_heading(line):
    if line.strip()[:4].upper() == "<H1>":
        return True
    return False

def parse_file(source_html, export_path):

    def new_output(output_file):
        output_path = os.path.join(export_path, output_file)
        output_path = "%s.html" % output_path
        output_file = open(output_path, 'w')
        return (output_path, output_file)

    open_file = open(source_html, 'r')
    current_output = "FrontMaterial"
    all_outputs = []
    current_path, current_file = new_output(current_output)

    for line in open_file.readlines():
        if new_heading(line):
            heading = line.lower().strip()
            heading = heading.replace(" ", "_")
            heading = heading.replace("&nbsp;", "_")
            heading = heading.replace("<h1>", "")
            heading = heading.replace("</h1>", "")
            current_path, current_file = new_output(heading)
            print repr(heading), current_path
            # Do some formatting so that there aren't any arbitrary newlines
            # except after paragraph sections.
                                            " ").replace("</p>",
    return all_outputs

if __name__ == '__main__':
    def main():
        source_html = os.getcwd()
        source_html = os.path.join(source_html, r'htmls\temp.html')
        export_path = os.path.join(os.getcwd(), r'htmls')
        parse_file(source_html, export_path)

    print "DONE"

I scribbled this up without referencing anything online. I’ll admit that I could probably have written something a lot cleaner looking using some module specifically designed to parse html, but in this particular case it wasn’t worth the development time to do the research for that. This script works for what I need, runs fast, and I could slam it out in a few minutes.

I could also have had the script store these in memory instead of writing them to files. I chose to write them to files instead on the premise that I would eventually be expanding on this project to create a script that would automatically post progress reports for my future projects by using two folders for the chapters – first it would chop all the chapters up into one folder, and then compare those files with the results of a previous run (stored in a second folder) in order to determine which chapters had been edited and which chapters were new content. …But that’s getting ahead of myself.

Utility Scripts , ,

Leave a Reply

Your email address will not be published. Required fields are marked *