Module 1: doc_to_html.py (.docx to .html)
The first thing I had to do was find a way to convert my document to HTML. I did so with the following script:
import win32com.client as win32
import os


def convert_to_html(doc_path):
    """
    This uses Microsoft Word and win32com to open a specified document and
    then save it out to HTML.
    """
    word = win32.Dispatch("Word.Application")
    word.Visible = 0  # keep the Word window hidden
    word.Documents.Open(doc_path)
    doc = word.ActiveDocument

    save_as_file = os.path.join(os.getcwd(), r'htmls\temp.html')
    # FileFormat=10 is Word's filtered HTML format (wdFormatFilteredHTML).
    doc.SaveAs(FileName=save_as_file, FileFormat=10)
    print("Saved")  # report only after the save has actually happened
    doc.Close()
    print("Closed")
    return save_as_file


if __name__ == '__main__':
    doc_folder = r"C:\PathToTheFolderWithYourBook"
    doc = "NameOfYourBookFile.docx"
    doc_path = os.path.join(doc_folder, doc)
    return_path = convert_to_html(doc_path)
I developed this script by following Stack Overflow answers, found through Google, on using win32com to automate Microsoft Word.
The FileFormat argument passed to doc.SaveAs determines what the output file format will be; the extension in the save_as_file name does not. If you want to adjust the script to convert Word documents to other types of files, you can find out what to change FileFormat to by referencing this list of Word save format ID numbers.
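As a rough sketch of that adjustment, the snippet below picks the FileFormat value from the extension of the output file name, so the two cannot drift apart. The names WD_SAVE_FORMATS, pick_file_format and save_document_as are made up for illustration and are not part of my script. The ID numbers are the ones Word's WdSaveFormat list gives for plain text, RTF, filtered HTML and PDF; the table is deliberately not exhaustive.

```python
import os

# A few WdSaveFormat IDs from Word's object model (not an exhaustive list).
WD_SAVE_FORMATS = {
    ".txt": 2,    # wdFormatText
    ".rtf": 6,    # wdFormatRTF
    ".html": 10,  # wdFormatFilteredHTML, the value used in the script
    ".pdf": 17,   # wdFormatPDF
}


def pick_file_format(save_as_file):
    """Return the FileFormat ID that matches the extension of save_as_file."""
    ext = os.path.splitext(save_as_file)[1].lower()
    if ext not in WD_SAVE_FORMATS:
        raise ValueError("No save format known for extension %r" % ext)
    return WD_SAVE_FORMATS[ext]


def save_document_as(doc, save_as_file):
    """Save an open Word document, choosing FileFormat from the file name.

    `doc` is assumed to be a Word Document COM object, such as
    word.ActiveDocument in the conversion script.
    """
    doc.SaveAs(FileName=save_as_file,
               FileFormat=pick_file_format(save_as_file))
    return save_as_file
```

With this in place, calling save_document_as(doc, r'htmls\temp.pdf') inside convert_to_html would ask Word for a PDF instead of HTML, and an extension missing from the table raises a ValueError rather than silently saving in the wrong format.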