Tracking the names, places and invented words in your story

Wall covered with torn remnants of posters showing bits of Cyrillic writing
Photo by Лиза Спасибова

I've learned from earlier writing projects that to avoid editorial drift you need a single source of truth, and you need to produce every format from that single source.

Editorial drift is my term for when different formats of a story (ebook, web page, etc.) all feature different edits made at different times. Sometimes I manage to update all the formats so they're the same, but towards the end of a project I can run out of drive.

I'm working from a single source of truth for The Public Testimony of the Mercenary Called Graef. It's a single markdown file with markers at the end of each chapter, like this:

“That’s not going to keep my interest,” he says. “I’ll wager you this. You lose, you bring a jug of wine to my table. You win, you get a tercet.”

“Fine,” I say, “I’ll take your money.”

end:: “The Fall of Abbalas” 3240 

“I’m going to let you take it. For the sake of everyone in here. We hope you spend it on a bath,” he says, raising his voice over the chanting, the yelling, and the pub clatter. “You stink, towie, and we are tired of watching the lice crawl over you.”

Early on I realised that by using a simple text format I could easily extract all the character and place names from the file using a simple command line:

cat tptotmcg.txt | grep -o '\b[A-Z][a-z]*\b' | sort | uniq

It required a little editing by hand to remove all the English words that began sentences, but I did that as I went through and added details and definitions. Once I was done I had the beginning of a glossary, dramatis personae, and atlas in a single reference document. Very handy even if you're the one writing the story.

I couldn't face doing that a second time

Once it was created I couldn't face updating it. The command I used would create a new list that would have to be carefully compared against the original. So it sat untouched for months while I added new characters and places and vocabulary.

Then I realised I could automate the process. It would be easy using ChatGPT.

It wasn't easy. The code itself is simple, as you will see. The challenge was identifying words that I had created. An initial uppercase letter wasn't enough.

What made it possible in the end was remembering the existence of the Moby Project. If you've spent any amount of time programmatically dabbling with text you're probably aware of it. This public domain collection of lexical resources has some amazing components, but what I was interested in was single.txt – a file with over 350,000 words that included variant spellings and plurals. You can download single.txt here.

If a word in my story wasn't in that file, then I probably made it up. All I had to do was strip out punctuation, add a way to exclude specific words on demand, and it was done.
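
The heart of the approach is nothing more than a set-membership test against that word list. Here is a minimal sketch of the idea (the file names are the ones I've already mentioned; the full script below adds the punctuation handling, the exclusions, and the glossary bookkeeping):

import re

# load the Moby Project word list into a set for fast lookups
with open("single.txt", "r") as f:
    english_words = {line.strip().lower() for line in f}

# pull the words out of the story text
with open("tptotmcg.txt", "r") as f:
    story_words = re.findall(r"[A-Za-z]+", f.read())

# anything not in the Moby list is probably a word I invented
invented = sorted({w for w in story_words if w.lower() not in english_words})
print("\n".join(invented))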

I track and document the production side of The Public Testimony of the Mercenary Called Graef in Obsidian, a Personal Knowledge Management (PKM) tool that works a bit like a wiki. So the script also wraps every word it finds in double square brackets, so that if I ever have, or ever want to create, a dedicated page for a character, place or word, I can.
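
For illustration, entries in the resulting definitions file end up looking something like this (the names are borrowed from the excerpt above, and the definition text is only a placeholder):

[[Abbalas]] - (definition goes here)
[[Graef]]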

The code and caveats

I could have written this code, but damn doesn't ChatGPT make this kind of thing so much faster and easier.

If I didn't know how to program in Python, and didn't know something about text processing and programming in general, I'm not sure I could have used ChatGPT to produce this script. Its initial output didn't work. When I got it working, it didn't work well. Without the Moby Project, and my own knowledge of its existence, this script would never have reached a useful state. It took about 4-6 hours over several days to get there.

The code:

import re
import datetime
import shutil
import argparse
import os


def main(defs_file, src_file):
    word_set = set()
    terms_dict = {}

    # check defs_file exists
    if not os.path.exists(defs_file):
        print("Definitions file does not exist.")
        return

    # this is the file with all the existing names/places/vocab
    # it expects one word per line in 1 of 4 formats:
    # term
    # [[term]]
    # term - definition
    # [[term]] - definition
    with open(defs_file, "r") as f:
        for line in f:
            line = line.strip()
            if line:
                parts = line.split(" - ")
                term = parts[0].strip().strip("[]")
                definition = "" if len(parts) == 1 else parts[1].strip()
                terms_dict[term] = definition

    # check single.txt exists
    # this is the file from the Moby Project
    if not os.path.exists("single.txt"):
        print("Single words file does not exist.")
        return

    with open("single.txt", "r") as f:
        for line in f:
            word_set.add(line.strip().lower())

    # a file containing words you don't want to include
    if os.path.exists("excluded_words.txt"):
        with open("excluded_words.txt", "r") as f:
            for line in f:
                word_set.add(line.strip().lower())
    else:
        print("No excluded words file found.")

    # check src_file exists
    # this is the plaintext file containing your writing
    if not os.path.exists(src_file):
        print("Source file does not exist.")
        return

    with open(src_file, "r") as f:
        src_text = f.read()

    # remove possessives from names
    src_text = re.sub(r"(\w+)[\'\’]s\b", r"\1", src_text)
    # now remove all contractions
    src_text = re.sub(r"\b\w*[\'\’]\w*\b", "", src_text)
    # remove all other punctuation and numbers
    src_text = re.sub(r'[,\.\?!;:\(\)\*\-\'"“”\’0-9…—]', " ", src_text)
    # remove short words
    src_text = re.sub(r"\b\w{1,2}\b", "",  src_text)
    # remove extra spaces
    src_text = re.sub(r"\s+", " ", src_text)
    src_set = set(src_text.strip().split())

    # go through every word. If it's not English and not already in the definitions file, add it. Otherwise, ignore it.
    for word in src_set:
        word_lower = word.lower()
        if word not in terms_dict and word_lower not in word_set:
            terms_dict[word] = ""

    # the definition file gets backed up and renamed based on date and time
    # eg - defs.txt becomes defs-202302041522.txt
    # that only goes down to the minute, so don't rerun it too quickly
    backup_file = f"{defs_file.rsplit('.', 1)[0]}-{datetime.datetime.now().strftime('%Y%m%d%H%M')}.{defs_file.rsplit('.', 1)[1]}"
    shutil.copy2(defs_file, backup_file)

    # write out all the definitions in alphabetical order
    with open(defs_file, "w") as f:
        for term in sorted(terms_dict.keys(), key=lambda s: s.lower()):
            definition = terms_dict[term]
            if definition:
                f.write(f"[[{term}]] - {definition}\n")
            else:
                f.write(f"[[{term}]]\n")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--defs", required=True, help="definitions file")
    parser.add_argument("--src", required=True, help="source file")
    args = parser.parse_args()

    defs_file = args.defs
    src_file = args.src

    main(defs_file, src_file)

When you go to run the script from the command line, make sure your excludes file and single.txt are in the current directory.
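
The excludes file itself is nothing special: one word per line, in any case you like, since the script lowercases whatever it reads from it. The entries below are only placeholders:

somerealword
anotherrealword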

Then, if you saved the script as update_defs.py, it's just a matter of:

python ./update_defs.py --src novel.txt --defs defs.md

My command line is longer because my script is in one directory, my source file is deep in Dropbox so I can access it from my Raspberry Pi, and the file with my definitions is in my Obsidian vault which is under ~/Library/blah/blah/etc. I have the whole thing copied into my notes file in case it ever gets pushed out of my shell history.

I hope it's useful to you. If not, try asking ChatGPT to change it so it is. If you have any questions or comments, you can try me on Twitter.
