How I approach a slow background job
2025-07-29
When a background job runs slow, I don’t just add retries or throw more threads at it.
I step back and ask:
- 📌 Is this a problem with the job itself?
- 📌 Or with the system around it?
That distinction matters, because a lot of performance issues look like code problems but are actually systemic.
Here’s how I break it down 👇
Check the payload size
Large job payloads (objects, blobs, nested data) hurt Redis performance, clog queues, and slow deserialization. Passing just IDs keeps jobs fast and retryable.
➡️ Small payloads support a healthier system.
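With Sidekiq, this is the difference between enqueuing a record and enqueuing its ID. A minimal sketch, assuming a hypothetical ReportWorker and Report model:

```ruby
# Heavy: serializes the whole record (and anything nested in it) into Redis.
ReportWorker.perform_async(report.attributes)

# Light: the payload is a single integer; the job reloads fresh data when it runs.
ReportWorker.perform_async(report.id)

class ReportWorker
  include Sidekiq::Job

  def perform(report_id)
    report = Report.find(report_id) # look up the record at execution time
    report.generate!                # hypothetical heavy work
  end
end
```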
Isolate external calls
APIs, file systems, DB writes, or anything else that depends on I/O should be isolated, timed, and retried safely.
➡️ External latency isn’t your job’s fault, but it becomes your system’s problem.
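Here's a sketch of what that isolation can look like, assuming a hypothetical InvoiceApi client; the timeout and the timing wrapper are the point, not the specific call:

```ruby
require "timeout"

class SyncInvoiceJob
  include Sidekiq::Job
  sidekiq_options retry: 5 # let Sidekiq handle backoff instead of looping inside the job

  def perform(invoice_id)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)

    Timeout.timeout(5) do          # hard cap so a slow API can't hold a worker thread hostage
      InvoiceApi.push(invoice_id)  # hypothetical external call
    end
  ensure
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    Sidekiq.logger.info("invoice #{invoice_id}: external call took #{elapsed.round(2)}s")
  end
end
```

In practice, per-request open/read timeouts on the HTTP client are safer than Timeout, but the principle is the same: the I/O boundary gets an explicit budget and its own measurement.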
Profile the internals
If the job is CPU-bound (like resizing images or crunching data), Ruby itself can be the bottleneck. I measure where the time goes and ask: does this work belong here? Should it move to a native extension, a separate service, or its own queue tier?
➡️ Sometimes the fix is not optimization, but extraction.
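To see whether the time is really CPU, something as simple as Benchmark from the standard library is a good first pass before reaching for a full profiler like stackprof. A sketch, assuming a hypothetical ThumbnailJob:

```ruby
require "benchmark"

class ThumbnailJob
  include Sidekiq::Job

  def perform(photo_id)
    photo = Photo.find(photo_id)

    timing = Benchmark.measure do
      photo.generate_thumbnails! # hypothetical CPU-heavy work
    end

    # If utime dominates real time, the job is CPU-bound and I/O tuning won't help.
    Sidekiq.logger.info(
      "thumbnails for #{photo_id}: #{timing.utime.round(2)}s CPU / #{timing.real.round(2)}s wall"
    )
  end
end
```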
Batch when you can
N+1 isn’t just a query problem; it shows up in jobs too. If a job loops over 100 records one at a time, maybe it should fetch and process them in one pass.
➡️ Batched jobs reduce I/O load and system thrash.
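A sketch of the before and after, assuming Active Record and a hypothetical notification flow:

```ruby
# Before: one job per record — N+1 at the queue level.
User.where(plan: "trial").find_each do |user|
  NotificationJob.perform_async(user.id)
end

# After: one job per batch of IDs; each job loads its slice in a single query.
User.where(plan: "trial").in_batches(of: 500) do |batch|
  BatchNotificationJob.perform_async(batch.pluck(:id))
end

class BatchNotificationJob
  include Sidekiq::Job

  def perform(user_ids)
    User.where(id: user_ids).find_each do |user|
      Notifier.trial_reminder(user) # hypothetical mailer/notifier
    end
  end
end
```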
Check for lock contention or shared resources
If multiple jobs touch the same records, deadlocks and retries follow. I look at DB locks, Sidekiq uniqueness, or whether the job needs idempotency guarantees.
➡️ Concurrency isn’t free. Structure matters.
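A minimal locking-plus-idempotency sketch, assuming Active Record and a hypothetical charge workflow:

```ruby
class ChargeOrderJob
  include Sidekiq::Job

  def perform(order_id)
    order = Order.find(order_id)

    order.with_lock do               # row-level lock: SELECT ... FOR UPDATE
      next if order.charged?         # idempotency guard: a retry becomes a no-op
      PaymentGateway.charge(order)   # hypothetical external charge
      order.update!(charged_at: Time.current)
    end
  end
end
```

The row lock serializes concurrent jobs touching the same order, and the state check means running the job twice does no harm.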
Check retry patterns in context
Retries can hide deeper failures. I check whether retries spike during deploys, outages, or congestion. I also ask: does this job actually need retries, or are they masking a design flaw?
➡️ Bad retry behavior is usually a system smell, not just a job bug.
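One way to make that behavior explicit instead of leaning on the default 25 retries is Sidekiq's retry options and exhaustion hook. A sketch on a hypothetical webhook job:

```ruby
class DeliverWebhookJob
  include Sidekiq::Job

  sidekiq_options retry: 3 # a deliberate cap instead of the default 25

  sidekiq_retries_exhausted do |job, exception|
    # Surface the failure so retries don't quietly absorb a deeper problem.
    Sidekiq.logger.warn("webhook #{job['args'].first} gave up: #{exception.message}")
  end

  def perform(webhook_id)
    WebhookClient.deliver(webhook_id) # hypothetical external delivery
  end
end
```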
Review queue structure and load
Is this job stuck behind heavy ones? Competing for threads? Underprovisioned? Sidekiq lets you shape priority and throughput—but the defaults won’t save you.
➡️ Queue topology is system design, not config trivia.
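Queue shaping mostly lives in configuration. A sketch of weighted queues in the usual config/sidekiq.yml (the names and weights here are made up):

```yaml
# config/sidekiq.yml
:concurrency: 10
:queues:
  - [critical, 4]   # checked roughly 4x as often as low
  - [default, 2]
  - [low, 1]
```

A notoriously heavy job class can also get its own queue and its own process, so it can't starve latency-sensitive work.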
Trace the upstream trigger
Sometimes the job’s slow because it shouldn’t exist. Maybe the work should be done earlier (precomputed), or later (lazily), or not at all (deleted feature).
➡️ Job design is downstream from system clarity.
💡 Slow jobs don’t always mean bad code. Sometimes they reveal load imbalances, architectural mismatches, or flawed assumptions.
So I go back to the start: Is the slowness inside the job or around it?
The answer shapes everything that follows.