Ready to solve an AI authored mystery? Follow the clues and catch the killer at whodunit.rip.
Great mysteries fill bookshelves and streaming services, but it’s hard to name a mystery game as popular as any of Christie’s or Doyle’s works. Clue springs to mind with its murder theme, but play enough games and you’ll realize you’re solving logic puzzles, not mysteries.
There are other mystery games, but they’re usually role-playing, relying on a player to write and narrate the story, or they use previously-written mysteries, which eventually run out.
Enter Large Language Models. They’re so good at sounding good that I figured they could write a mystery to solve as a low-tech pen-and-paper game for family game night. It was the perfect excuse to play with AI.
I set out with a few goals:
- Develop a mental model for calling LLMs
- Test a few top-tier LLM services and their golang libraries
- Convince LLMs to write mysteries worthy of J. B. Fletcher
The first attempts sounded good but weren’t logically sound. Trial and error (mostly error) led to significant improvements. Most of the insights are obvious now, but that’s the advantage of hindsight. Let’s dive into Go and LLMs, problem-solving strategies, LLM advice learned the hard way, and using Temporal to manage unreliable LLM APIs.
What’s in a Mystery?
Every mystery has the same basic structure. There’s a scene that sets up the situation and introduces the setting — the locations in the story and the characters involved. Then there are clues that players discover as they investigate, each revealing a piece of the puzzle. Finally, there’s the solution that answers the central questions, and a denouement that wraps up the mystery, Scooby-Doo style.
To play, the mystery is printed, and clues are cut into individual paper strips. Players start a game by reading the scene. Then they take turns investigating one clue at a time. They can look for clues in a location, or interrogate a character for valuable information. (Or not. Some locations and characters are dead ends.) Once the player thinks they’ve solved the mystery, they make an accusation and check against the solution. If they’re right, they read the denouement to wrap everything up.
It’s like an episode of Columbo. The scene opens with the crime already committed, introducing us to the setting (maybe a fancy mansion) and the characters (the wealthy victim, the jealous spouse, the suspicious butler). As Columbo investigates, he uncovers clues — the broken vase, the muddy footprints, the overheard argument. The solution reveals who did it and why, and the denouement shows how justice is served.
The challenge is making all these pieces work together logically. The scene can’t give away the solution, but it needs to establish the mystery. The clues need to hint at the truth without being obvious or obscure. Most importantly, the solution needs to make sense given everything that came before. Convincing an LLM to write a coherent mystery might try even Jessica Fletcher’s patience.
Calling LLMs: Fundamentals
Gemini, OpenAI, and Claude were top of the list for testing. Fortunately, all the text models have similar inputs and outputs. They accept a system prompt and user input, returning the text response.
Most also accept historical prompts, which are helpful if you want to have a conversation. Many of the libraries have helpers to manage history for you, but under the hood, they’re keeping track of past prompts and responses.
A simplified interface looks like this:
type TextGenerator interface {
	Generate(in Input) (Response, error)
}

type Input struct {
	SystemPrompt string   `json:"systemPrompt,omitempty"`
	Prompt       Prompt   `json:"prompt,omitempty"`
	History      []Prompt `json:"history,omitempty"`
}

type Prompt struct {
	Role string `json:"role,omitempty"` // user, model
	Text string `json:"text,omitempty"`
}

type Response struct {
	Text string `json:"text,omitempty"`
}
With a little glue code, these models work with any LLM API.
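For instance, a thin adapter type can hide a provider’s SDK behind the `TextGenerator` interface. Here’s a minimal sketch: `fakeClient` and its `Complete` method are placeholders standing in for a real vendor client, not an actual library, and the interface types are repeated so the snippet compiles on its own.

```go
package main

import "fmt"

// The shared types from above, repeated so this snippet is self-contained.
type Prompt struct {
	Role string // user, model
	Text string
}

type Input struct {
	SystemPrompt string
	Prompt       Prompt
	History      []Prompt
}

type Response struct {
	Text string
}

type TextGenerator interface {
	Generate(in Input) (Response, error)
}

// fakeClient stands in for a real vendor SDK client.
type fakeClient struct{}

func (fakeClient) Complete(system string, messages []string) (string, error) {
	return fmt.Sprintf("response to %d message(s)", len(messages)), nil
}

// adapter translates our Input into the vendor's call shape.
type adapter struct{ c fakeClient }

func (a adapter) Generate(in Input) (Response, error) {
	// Flatten history plus the current prompt into the vendor's message list.
	msgs := make([]string, 0, len(in.History)+1)
	for _, p := range in.History {
		msgs = append(msgs, p.Role+": "+p.Text)
	}
	msgs = append(msgs, in.Prompt.Role+": "+in.Prompt.Text)

	text, err := a.c.Complete(in.SystemPrompt, msgs)
	if err != nil {
		return Response{}, err
	}
	return Response{Text: text}, nil
}

func main() {
	var g TextGenerator = adapter{}
	resp, _ := g.Generate(Input{Prompt: Prompt{Role: "user", Text: "Write a clue."}})
	fmt.Println(resp.Text)
}
```

Once each provider has an adapter like this, the rest of the code only sees `TextGenerator` and never cares which service is behind it.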
Building Better Mysteries
It took a few tries and a lot of prompting to get coherent mysteries.
All-In-One
The first attempt asked the LLM to write everything at once: scene, setting, characters, clues, and solution. LLMs are pretty good at mimicking example output, so it was easy to let the LLM do the heavy lifting.
The mysteries sounded great but didn’t make sense. Clues contradicted themselves. Every character looked guilty. The solution referenced characters not in the mystery. The mysteries just didn’t work.
But the LLM output followed the example format, and what it did write sounded great, full of clichés and cheeky bits. It was a start.
That first prompt looked something like this:
You are a mystery writer for the game “Whodunit?”. Please write a mystery. The mystery needs a title, a scene (introductory information), clues hidden at locations or with characters, a solution (questions the player needs to answer), and a denouement to summarize the events.
Write your output with the following structure.
<model>
{
  "title": "<...>",
  "scene": "Silence fell on a dark and stormy night in Wintertree Hall. <...>",
  "clues": [
    {
      "location": "Oyster Bar",
      "clue": "<...>"
    },
    {
      "characterRole": "conductor",
      "clue": "<...>"
    }
  ],
  "solution": {
    "Who is the murderer?": "<...>",
    "What was the motive?": "<...>",
    "What was the murder weapon?": "<...>"
  },
  "denouement": "<...>"
}
</model>
Writing longer, more detailed prompts was a natural next step. It was like writing instructions for a person. A mystery has intrinsic rules, and those rules need documenting if someone (or thing) is going to follow them.
Break it Down
The problem? The longer the instructions, the more the model “forgets.” It was too much information to keep in context all at once, and the fine details were lost. Just like when we break down problems as software engineers, breaking the mystery into smaller pieces makes it easier to keep the LLM focused on the important rules for that step and match example outputs.
The next attempt broke each step into its own prompt: write the scene, then the clues, then the solution and denouement. Each step was separate so the LLM wouldn’t be bombarded with unnecessary information.
At this point, the writing process started looking more and more like a workflow. There were multiple steps run in order, each making API calls to services that frequently returned errors due to system load and throttling limits. Each mystery made multiple calls, and I didn’t want one failing call to break the workflow and throw away all the good work that had already been written. When in doubt, I use Temporal to model workflows. Each API call became an activity, which Temporal automatically retries on error. Each mystery writing run became more stable, and it was easier to iterate and test prompt updates.
The LLM followed the output formats, but the logic was still off. It gave multiple characters motive and opportunity — because it hadn’t picked a guilty party yet. Imagine if the Blue’s Clues notebook listed everyone as the culprit.
Order Matters
LLMs write one word at a time. When asked to write the scene, then the clues, then the solution, it writes them in that order. Meaning, it wrote clues before it “knew” who the perpetrator was. That’s why it seemed like multiple characters could have done it — the LLM hadn’t decided which one was guilty yet!
This inspired the next approach: write the solution first, then the clues, then the scene and denouement. Elementary, my dear Watson.
At this point, the mysteries started making sense. Not all the time, but often enough for a junior detective to crack the case. Unfortunately, some mysteries still took Olympic-level mental gymnastics to connect the dots between the clues.
Write Like A Writer
The breakthrough came from thinking like a writer. Writers know what they’re writing towards. They start with plot points — the key events that drive the story. Those plot points culminate in an ending. So the final approach was: write plot points first, then use those to drive the solution and clues.
Here’s the workflow that finally worked:
func MysteryWriterWorkflow() (MysteryOutput, error) {
	// 1. Write plot points first
	plotPointIn := PlotPointsInput{
		Location: "An abandoned warehouse",
	}
	plotPointsOut, err := WritePlotPoints(plotPointIn)
	// check err

	// 2. Write the solution based on plot points
	solutionIn := SolutionInput{
		PlotPointsIn:  plotPointIn,
		PlotPointsOut: plotPointsOut,
	}
	solutionOut, err := WriteSolution(solutionIn)
	// check err

	// 3. Write clues that point to the solution
	cluesIn := CluesInput{
		SolutionIn:  solutionIn,
		SolutionOut: solutionOut,
	}
	cluesOut, err := WriteClues(cluesIn)
	// check err

	// 4. Finally, write the scene that sets everything up
	sceneDenouementIn := SceneDenouementInput{
		CluesIn:  cluesIn,
		CluesOut: cluesOut,
	}
	sceneDenouementOut, err := WriteSceneAndDenouement(sceneDenouementIn)
	// check err

	return sceneDenouementOut, nil
}
You’ll see that each step carries the previous step’s inputs and outputs. That helped manage the LLM’s context, so each step knew what had already been written in the mystery.
Tighten it Up
At this point, the mysteries made sense… most of the time. There were still a good number that didn’t follow instructions, especially nuanced details like “Assign clues to related locations and characters. For example, if Character A witnessed an argument between the murderer and victim, assign the clue to Character A.” These details made the mysteries flow, but the LLMs struggled to follow them consistently.
Again, the answer came from writers. Writers have editors to catch these discrepancies. The LLM needed an editor. The editor would read the original instructions, inputs, and outputs, and update the output wherever it didn’t match the instructions.
The editor uses the original prompt and input to revise the output. The goal is for the LLM to update the output so that it follows the instructions in the fewest changes possible.
func EditResponse(originalPrompt, originalInput, originalOutput string) (string, error) {
	systemPrompt := `Read the original prompt, input, and output.
Audit the output. Update it so that it follows all the instructions in the original prompt.
Make the fewest changes needed.`

	editorPrompt := fmt.Sprintf(`Original Prompt:
%s

Original Input:
%s

Original Output:
%s`, originalPrompt, originalInput, originalOutput)

	in := Input{
		SystemPrompt: systemPrompt,
		Prompt: Prompt{
			Role: "user",
			Text: editorPrompt,
		},
	}

	resp, err := Generate(in)
	// check err
	return resp.Text, nil
}
This was helpful for all the steps and caught issues as the workflow progressed.
LLM Learnings & Gotchas
Trial and error is a great way to learn, but it’s also nice to know the traps ahead of time.
Tell Them What You Want, Not What You Don’t Want
Telling an LLM not to do something is like asking someone not to think of a pink elephant. The model is more likely to do what you don’t want because you told it about it.
Instead of:
Don’t include the murderer in the scene. Don’t reveal the weapon.
Use:
The scene must establish the central mystery aligned with the goals. Describe the goals without giving anything away.
Long Instructions Make Sloppy Outputs
LLMs struggle managing details when there are lots of details. Break instructions down into separate tasks, and have the LLM work on them one at a time.
The first attempt asked the LLM to do everything in one shot: 1) write the mystery “prompt” that players read by describing the setting, introducing characters, and declaring the mystery goals (who, what, why?), 2) write clues that are revealed to players as the game progresses, and 3) write the solution. The LLM managed to match the output format and write all the mystery pieces requested, but the mysteries themselves were incoherent.
The LLM was being given too much to “think” about, and it was having trouble following all of the rules simultaneously.
“Please Try Again”
That’s essentially what the editor LLM step does. Review the previous instructions and output to fix everything that was missed.
There has been a lot of discussion online about the best way to go about it. Simply telling the LLM to “do it better” works, but I found that a little extra prompting, such as asking for the smallest number of changes necessary, had better results.
“Show Your Work”
Or “Explain Your Thoughts”. Asking the LLM to explain its reasoning before producing output made a noticeable improvement in mystery quality. It works with JSON output, too: the first field in the prompt example is "explanation": "<...>". Each output included a step-by-step explanation that supported the end result.
Unfortunately, this does increase output size, which can have a material cost. But it was worth it for the steps with logic requirements.
Be Prepared for API Errors and Throttled Responses
LLM APIs fail. Rate limits, outages, malformed outputs. Sometimes it felt like debugging a locked-room mystery — the kind even Encyclopedia Brown would walk away from.
The solution? Use Temporal workflows and activities for automatic retries:
llmRetryPolicy := &temporal.RetryPolicy{
	InitialInterval:    10 * time.Second,
	BackoffCoefficient: 2.0,
	MaximumAttempts:    5,
}

// Attach the policy to the activity options inside the workflow, so every
// LLM call retries automatically with exponential backoff.
ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
	StartToCloseTimeout: 5 * time.Minute,
	RetryPolicy:         llmRetryPolicy,
})
Large Personalities
The hardest part about comparing the models is how quickly new versions are released. Even small updates to the hosted models could make significant changes to output quality.
That said, the models do have their own quirks. The main axes I compared were tone and logical soundness.
Gemini
Gemini Pro 2.0 (and now 2.5) were the best at following longer prompts, likely due to their large context windows. The other models are catching up, but Gemini tended to stay on task better than the rest. It did struggle a bit with the output, though. It likes to wrap the response with ```json ``` tags, even when asked not to.
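A small sanitizer before JSON parsing handles that quirk. This is a sketch of the idea, not code from any SDK:

```go
package main

import (
	"fmt"
	"strings"
)

// stripJSONFence removes a wrapping ```json ... ``` code fence, if present,
// and leaves already-clean output untouched.
func stripJSONFence(s string) string {
	s = strings.TrimSpace(s)
	s = strings.TrimPrefix(s, "```json")
	s = strings.TrimPrefix(s, "```")
	s = strings.TrimSuffix(s, "```")
	return strings.TrimSpace(s)
}

func main() {
	raw := "```json\n{\"title\": \"Murder at Wintertree Hall\"}\n```"
	fmt.Println(stripJSONFence(raw)) // {"title": "Murder at Wintertree Hall"}
}
```

Running model output through a pass like this before json.Unmarshal saves a lot of spurious parse errors.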
OpenAI
OpenAI o3 and 4o models needed extra encouragement to match the mysterious tone. Their outputs felt more clinical: matter-of-fact and transactional. They did well following logical instructions and editing past responses, but made less engaging mysteries overall.
Anthropic
Claude Opus and Sonnet 4 found a middle ground — they matched the tone and did a good job following instructions.
Oh, just one more thing…
For the mystery of who writes pretty good mysteries most of the time? It was the LLM, with the prompts, in the workflow.
For the mystery of the AI-besting sleuths? Well, that’s up to you.