The Bug That Ate a Day: It Only Hung in Production

This is the last post in my series on rebuilding my daily fantasy lineup optimizer (it all starts here). The math was the easy part. This post is about the day I lost to a bug that didn’t exist on my computer and only showed up once the tool was on a real server — the kind of problem that makes you question your sanity before you find it.

Worked on my laptop, hung in production: the same code behaving differently on a server


“Works on my machine”

On my laptop the optimizer was fast and flawless. I’d click generate, lineups appeared in a couple of seconds, everything streamed in nicely. So I put it on a small cloud server, opened the live page, uploaded a file, clicked generate — and it just sat there. Forever. No lineups, no error, no crash. The same code that solved in two seconds at home would spin endlessly the moment it ran on the server. That’s the worst kind of bug: not a clean failure you can read, but a silent hang with nothing in the logs to grab onto.

Chasing the wrong suspect

My first assumption was that the server was just too slow — a small shared machine versus my laptop. So I made the problem smaller: trimmed the player pool, capped the solve time, simplified the model. None of it helped. A tiny problem that should solve instantly still hung forever. That ruled out “slow” and pointed at “stuck,” which are very different diagnoses. Something wasn’t running slowly; something wasn’t running at all. I’d wasted a good chunk of the day optimizing speed for a problem that had nothing to do with speed.
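One way to tell "slow" from "stuck" without guessing is to put a deadline on the work and print where it is when the deadline passes. Here is a minimal sketch, assuming a Python app (the post doesn't say which language the optimizer uses). The names run_with_watchdog, work, and report are hypothetical, not taken from the real tool.

```python
import sys
import threading
import time
import traceback


def run_with_watchdog(work, deadline_seconds, report):
    """Run work() in the current thread; if it outlives the deadline,
    pass a message containing the caller's current stack to report()."""
    caller_id = threading.get_ident()

    def on_deadline():
        # Look up what the calling thread is executing right now.
        frame = sys._current_frames().get(caller_id)
        stack = "".join(traceback.format_stack(frame)) if frame else "(no frame)\n"
        report(f"still running after {deadline_seconds}s, currently at:\n{stack}")

    timer = threading.Timer(deadline_seconds, on_deadline)
    timer.daemon = True  # never keep the process alive just for the watchdog
    timer.start()
    started = time.perf_counter()
    try:
        result = work()
    finally:
        timer.cancel()  # a no-op if the timer has already fired
    elapsed = time.perf_counter() - started
    return result, elapsed
```

A slow solve produces a report and then finishes. A stuck one produces a report and never returns, and the stack in the report shows which call it is sitting in. One caveat: the timer is ordinary Python code, so it cannot fire if the hang is inside native code that never releases the interpreter lock. The standard library's faulthandler.dump_traceback_later is worth trying in that case.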

The one-line test that cracked it

Instead of guessing, I added a tiny diagnostic page that did nothing but run one trivial solve and report back. I loaded it on the server, and it answered instantly: the solver worked perfectly. That single result flipped the whole investigation. The solver wasn’t broken on the server at all. The difference was where I was running it. In that test page, the solve happened in the normal flow of handling a web request. In the full app, I’d been clever and pushed the solving onto a separate background worker so the page could stream results. That arrangement ran fine on my laptop, but on this particular server the solver hung every time it was driven from the background worker.
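For anyone who wants to build the same kind of diagnostic, here is a rough sketch of the idea using only the Python standard library. It is not the actual page from the optimizer. trivial_solve is a made-up stand-in (a brute-force pick of two players under a salary cap) marking where one tiny call to the real solver would go, and the player data is invented for illustration.

```python
import itertools
import json
import os
import platform
import threading
import time


def trivial_solve():
    """Stand-in for one tiny solve: best 2 of 4 players under a salary cap."""
    # (name, salary, projected points) -- made-up illustration data
    players = [("A", 10, 5.0), ("B", 20, 9.0), ("C", 15, 7.5), ("D", 30, 11.0)]
    cap = 40
    feasible = [
        pair for pair in itertools.combinations(players, 2)
        if sum(p[1] for p in pair) <= cap
    ]
    best = max(feasible, key=lambda pair: sum(p[2] for p in pair))
    return sorted(p[0] for p in best)


def run_diagnostic():
    """Run one solve inline and describe the environment it ran in."""
    started = time.perf_counter()
    report = {
        "thread": threading.current_thread().name,
        "pid": os.getpid(),
        "platform": platform.platform(),
        "python": platform.python_version(),
    }
    try:
        report["lineup"] = trivial_solve()
        report["ok"] = True
    except Exception as exc:  # report the failure instead of hiding it
        report["ok"] = False
        report["error"] = repr(exc)
    report["seconds"] = round(time.perf_counter() - started, 4)
    return report


def diagnostic_app(environ, start_response):
    """Minimal WSGI app: every request runs one solve and returns JSON."""
    body = json.dumps(run_diagnostic()).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]
```

To try it, serve diagnostic_app with wsgiref.simple_server.make_server("", 8000, diagnostic_app).serve_forever(), or mount it in whatever server the real app uses. The point of the design is that it changes one variable at a time: same server, same solver, but no background worker. Calling run_diagnostic from inside the worker as well would show whether the two environments report different threads or processes.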

The fix, once I understood it, was almost insulting in its simplicity: stop being clever, and run the solve in the same place the test page did — directly as part of the request. The hang vanished immediately. The whole day came down to a wrong assumption about which part was broken.

Getting the nice part back

Moving the solve into the request fixed the hang but cost me the feature I liked most — lineups appearing one at a time instead of all at once. It turned out I’d blamed the streaming for the original bug when the real culprit was the background worker. So I brought streaming back a different way: the request itself now sends each lineup out the moment it’s solved. Same stable approach that fixed the hang, and the live one-by-one reveal is back. Best of both, once I stopped confusing two separate problems for one.
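Streaming straight from the request can be sketched as a response body that is a generator: each lineup is written out as soon as it is solved, with no side worker involved. This is an illustration under assumptions, not the tool's real code. It assumes a Python WSGI app, solve_lineups is a hypothetical stand-in for the optimizer, and newline-delimited JSON is just one convenient format.

```python
import json
import time


def solve_lineups(count):
    """Hypothetical stand-in for the optimizer: yield one lineup per solve."""
    for number in range(1, count + 1):
        time.sleep(0.01)  # pretend each solve takes a moment
        players = [f"player-{number}-{slot}" for slot in range(3)]
        yield {"lineup": number, "players": players}


def streaming_app(environ, start_response):
    """WSGI app whose body is produced lazily, one line of JSON per lineup."""
    start_response("200 OK", [("Content-Type", "application/x-ndjson")])

    def body():
        # The solve runs right here, inside the request, and each result
        # is handed to the server before the next solve starts.
        for lineup in solve_lineups(count=3):
            yield (json.dumps(lineup) + "\n").encode("utf-8")

    return body()
```

The page can then read the response line by line and render each lineup as it arrives. One thing to check in a real deployment: a reverse proxy in front of the app may buffer the response, which would bring back the all-at-once behavior even though the app is streaming.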

Why the background worker mattered at all

It’s worth explaining why I’d added that background worker, because the reason was good even though the result wasn’t. A solve takes a few seconds, and I didn’t want the page to freeze with a spinner while it ran — I wanted lineups to appear one at a time as they were found. The natural way to get that is to hand the solving to a separate worker running alongside the web page, so the page stays free to report progress. It’s a completely standard pattern. It just happened to be the one thing this setup couldn’t stomach: the solver, driven from that side worker, would start and never come back. On my laptop the same pattern was fine, which is exactly why I never saw it coming.
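For reference, the standard pattern described above looks roughly like this in Python. It is a sketch under assumptions: the post doesn't say whether the worker was a thread, a process, or something else, so this uses a thread and a queue, and every name here is hypothetical. The one addition is a stall timeout, so that a worker that starts and never comes back surfaces as an error rather than an endless spinner.

```python
import queue
import threading

_DONE = object()  # sentinel the worker sends when it has finished


def start_worker(solve_all, results):
    """Run solve_all() on a side thread, pushing each lineup onto results."""
    def run():
        try:
            for lineup in solve_all():
                results.put(lineup)
        finally:
            results.put(_DONE)  # always signal completion, even after an error

    worker = threading.Thread(target=run, name="solver-worker", daemon=True)
    worker.start()
    return worker


def stream_from_worker(solve_all, stall_seconds=5.0):
    """Yield lineups as the worker produces them; fail loudly if it stalls."""
    results = queue.Queue()
    start_worker(solve_all, results)
    while True:
        try:
            item = results.get(timeout=stall_seconds)
        except queue.Empty:
            raise TimeoutError(
                f"worker produced nothing for {stall_seconds}s"
            ) from None
        if item is _DONE:
            return
        yield item
```

The timeout doesn't fix a hang, and it doesn't explain one either. It only turns "sat there forever" into an error with a timestamp, which is at least something to grab onto in the logs.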

That’s the cruel thing about production-only bugs. Your laptop and a cloud server differ in a hundred small ways — how many things run at once, how processes are allowed to talk to each other, what’s installed — and most of those differences never matter, until one does. You can’t reason your way to which one from your own machine, because on your own machine everything works. The only move is to get information from the broken environment itself, which is what that little diagnostic page finally did.

What I gave up, and what I didn’t

I’ll be honest about the trade. Solving inside the request means the server is busy for the few seconds it takes — so if a crowd ever shows up and everyone hits generate at once, they’ll wait their turn rather than all solving in parallel. For a free tool with normal traffic that’s a non-issue, and it’s a price I’ll happily pay for “works every time” over “clever and occasionally catatonic.” If it ever gets popular enough to matter, the fix is a sturdier job-queue system — but I’m not building that until there’s a real crowd to justify it. Shipping the simple thing that works beats shipping the elaborate thing that mostly works.

The takeaway

Two lessons I keep relearning. First, when something fails only in production, the bug is almost always in the environment, not the logic — and no amount of staring at the code on your laptop will show it to you. Second, the fastest way out of a silent failure isn’t a smarter guess, it’s a dumb little test that isolates one variable. That throwaway diagnostic page saved me from another day of flailing. The tool that came out the other side is live, free, and — finally — works in production: the MLB lineup optimizer. Thanks for following the series; if you build something with it, I’d love to hear how it goes.

