Go · 4 min read

Building Simple Durable Jobs: A Go Library for Resilient Workflows


Most job queue libraries give you a way to run a function later. That’s fine until your process crashes halfway through a multi-step workflow and you need to figure out what already happened. I built Simple Durable Jobs to solve that problem without pulling in heavy infrastructure like Temporal or Cadence.

The problem

Consider a workflow that provisions a cloud resource: create the instance, wait for it to boot, configure networking, attach storage, run health checks. If the process dies after step 3, you need to know that steps 1-3 completed so you can resume at step 4, not re-run everything and end up with duplicate resources.

Most lightweight job libraries don’t track intermediate progress. You either run the whole thing again or build your own state tracking. Temporal solves this beautifully, but it’s a distributed system with its own cluster to operate. Sometimes you just want a library you can embed.

How checkpointing works

The core idea is a CallState object that each workflow step receives. It carries a mutex-protected Checkpoints map that is persisted to the database after each step completes. On recovery, the library replays the workflow but skips steps whose checkpoints already exist.

func ProvisionWorkflow(ctx context.Context, cs *jobs.CallState) error {
    // Each Checkpoint runs its function at most once; on replay, a completed
    // step returns its stored result instead of executing again.
    instanceID, err := cs.Checkpoint("create-instance", func() (string, error) {
        return cloud.CreateInstance(cs.Input)
    })
    if err != nil {
        return err
    }

    _, err = cs.Checkpoint("configure-network", func() (string, error) {
        return cloud.ConfigureNetwork(instanceID)
    })
    return err
}

If the process crashes after create-instance completes, the next run sees that the checkpoint exists, skips the creation, and picks up at configure-network. The instance ID is stored in the checkpoint, so it's available without re-executing the call.

Fan-out/fan-in

Workflows often need to do things in parallel. Simple Durable Jobs supports fan-out with three completion strategies:

  • FailFast - cancel everything if any child fails
  • CollectAll - wait for all children regardless of failures
  • Threshold - succeed when N of M children complete

The parent job suspends while children execute, and the library handles the coordination. Each child is its own checkpointed workflow, so partial failures in a fan-out don’t lose progress on the children that succeeded.

Crash recovery

The library uses a heartbeat mechanism. Running jobs periodically update a last_heartbeat timestamp. A background reaper goroutine scans for jobs whose heartbeat is stale beyond a configurable threshold and marks them for retry.

Database-level locking (SELECT ... FOR UPDATE SKIP LOCKED) prevents multiple workers from grabbing the same job. This works across PostgreSQL, MySQL, and SQLite (with graceful fallback where the dialect doesn’t support row-level locking), so you don’t need Redis or an external lock service.

The monitoring dashboard

One thing I always wanted from a job library was visibility into what’s running without tailing logs. Simple Durable Jobs embeds a Svelte dashboard that connects to the Go backend via Connect-RPC streaming.

The dashboard shows:

  • Active, queued, and failed jobs in real-time
  • Checkpoint progress for running workflows
  • Retry history and error details
  • Cron schedule status

It’s served from the same Go binary, so there’s no separate frontend to deploy.


Database as queue

I deliberately chose to use the database as the job queue rather than adding Redis or a message broker. The library uses GORM, so it works with PostgreSQL, MySQL, and SQLite out of the box. For most workloads, a well-indexed table with row-level locking is fast enough and eliminates an entire infrastructure dependency. The library supports priority queues through a simple integer column, and cron scheduling through a parsed cron expression stored alongside the job definition.

When to use it

Simple Durable Jobs fits when you need durable workflows but don’t want to operate a Temporal cluster. If your workload involves multi-step processes that need crash recovery, fan-out parallelism, or cron scheduling, and you’re already running PostgreSQL, MySQL, or even just SQLite, it’s a good fit. If you need sub-millisecond dispatch latency or millions of jobs per second, you’ll want something purpose-built for that scale.

The full source is on GitHub.