Module 3: chop_sections.py (Splitting the main document into chapters)
So now I had my main document converted to an .html file, and I could generate the date each chapter should be published on. The next step was to split the main document into chapters so that I could post each one individually.
I did this in a fairly brute-force way: I read through the html file line by line and checked whether the current line started a new heading (all of my chapter headings were h1 tags in the html). Anything before the first heading went into a file called FrontMaterial.html. Whenever I hit a heading, I closed the file I was writing to, built a file name from the heading text, and wrote every following line to that new file until I hit the next heading. The heading line itself only supplies the file name; it is not written into the chapter file.
""" This script is intended to chop up the full-length clean html version of the book into chapter length snippets, based on the headings of sections. """ import os def new_heading(line): if line.strip()[:4].upper() == "<H1>": return True return False def parse_file(source_html, export_path): def new_output(output_file): output_path = os.path.join(export_path, output_file) output_path = "%s.html" % output_path output_file = open(output_path, 'w') return (output_path, output_file) open_file = open(source_html, 'r') current_output = "FrontMaterial" all_outputs =  current_path, current_file = new_output(current_output) all_outputs.append(current_path) for line in open_file.readlines(): if new_heading(line): current_file.close() heading = line.lower().strip() heading = heading.replace(" ", "_") heading = heading.replace(" ", "_") heading = heading.replace("<h1>", "") heading = heading.replace("</h1>", "") current_path, current_file = new_output(heading) print repr(heading), current_path all_outputs.append(current_path) else: # Do some formatting so that there aren't any arbitrary newlines # except after paragraph sections. current_file.write(line.replace("\n", " ").replace("</p>", "</p>\n\n")) current_file.close() return all_outputs if __name__ == '__main__': def main(): source_html = os.getcwd() source_html = os.path.join(source_html, r'htmls\temp.html') export_path = os.path.join(os.getcwd(), r'htmls') parse_file(source_html, export_path) main() print "DONE"
I scribbled this up without referencing anything online. I’ll admit that I could probably have written something a lot cleaner-looking using a module specifically designed to parse html, but in this particular case the research wasn’t worth the development time. This script does what I need, runs fast, and took only a few minutes to slam out.
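For comparison, here is a minimal sketch of what a parser-based version might look like, using the html.parser module from Python 3's standard library. This is not what chop_sections.py does: the ChapterSplitter class and the split_chapters function are hypothetical names made up for illustration, and the sketch keeps the chapters in memory instead of writing them to files. It also assumes the headings are plain h1 tags, as they are in my document.

```python
from html.parser import HTMLParser


class ChapterSplitter(HTMLParser):
    """Hypothetical helper: start a new chapter at every <h1> element."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        # Each entry is [title, list_of_html_fragments].
        self.chapters = [["FrontMaterial", []]]
        self.in_heading = False
        self.heading_parts = []

    def _emit(self, text):
        # Text inside an <h1> becomes the title; everything else is body.
        if self.in_heading:
            self.heading_parts.append(text)
        else:
            self.chapters[-1][1].append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_heading = True
            self.heading_parts = []
        elif not self.in_heading:
            # Re-emit the tag exactly as it appeared in the source.
            self.chapters[-1][1].append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if not self.in_heading:
            self.chapters[-1][1].append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_heading = False
            title = "".join(self.heading_parts).strip()
            self.chapters.append([title, []])
        elif not self.in_heading:
            self.chapters[-1][1].append("</%s>" % tag)

    def handle_data(self, data):
        self._emit(data)

    def handle_entityref(self, name):
        self._emit("&%s;" % name)

    def handle_charref(self, name):
        self._emit("&#%s;" % name)


def split_chapters(html_text):
    """Return a list of (title, body_html) tuples, one per chapter."""
    splitter = ChapterSplitter()
    splitter.feed(html_text)
    splitter.close()  # flush any text still buffered by the parser
    return [(title, "".join(parts)) for title, parts in splitter.chapters]
```

Unlike the line-by-line version, this would not care whether a heading shares a line with other markup, but it is clearly more code than the quick script above, and it silently drops comments and doctype declarations.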
I could also have had the script store the chapters in memory instead of writing them to files. I chose files on the premise that I would eventually expand this project into a script that automatically posts progress reports for my future projects. That script would use two folders for the chapters: first it would chop all the chapters up into one folder, and then it would compare those files with the results of a previous run (stored in a second folder) to determine which chapters had been edited and which were new content. …But I’m getting ahead of myself.
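The two-folder comparison could be sketched roughly as follows. This is only a guess at how such a script might work, not code from the project: classify_chapters and its two folder arguments are hypothetical names, and the sketch assumes a chapter keeps the same file name between runs. It uses filecmp from the standard library to compare file contents.

```python
import filecmp
import os


def classify_chapters(current_dir, previous_dir):
    """Sort the chapter files in current_dir into new, edited and unchanged.

    Hypothetical helper: a chapter is matched to its earlier version purely
    by file name, so a renamed heading shows up as a new chapter.
    """
    new, edited, unchanged = [], [], []
    for name in sorted(os.listdir(current_dir)):
        if not name.endswith(".html"):
            continue
        current_path = os.path.join(current_dir, name)
        previous_path = os.path.join(previous_dir, name)
        if not os.path.exists(previous_path):
            new.append(name)
        elif filecmp.cmp(current_path, previous_path, shallow=False):
            # shallow=False compares the actual bytes, not just size and mtime.
            unchanged.append(name)
        else:
            edited.append(name)
    return {"new": new, "edited": edited, "unchanged": unchanged}
```

After each run, the contents of the first folder would need to be copied over the second one so that the next comparison starts from the latest published state.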