When Your Celery Workers Ran Tasks Twice: Migrating from Redis to RabbitMQ

What happened (the incident)

We had a production incident where long-running AI pricing tasks were requeued and re-executed, doubling downstream LLM API costs and wasting CPU time. The short timeline:

- A batch of long-running tasks (each ~20–25 minutes) completed on workers.
- Immediately after completion, the network layer silently dropped the broker TCP connection.
- Because the tasks were running with late ACKs, Celery never acknowledged the finished tasks to Redis.
- Redis treated the tasks as unacknowledged and requeued them — workers picked them up again and executed the tasks a second time.

The financial impact was unfortunate and immediate: LLM calls and compute doubled for the affected runs. The technical impact exposed two important failure modes when using Redis as a Celery broker for long-running tasks: visibility timeout semantics and idle TCP connection drops. ...
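The requeue mechanism can be sketched with a toy model. This is not Celery's or Redis's actual code — the class, method names, and task id below are all hypothetical — but it shows why a lost late ACK plus a visibility timeout leads to double execution: a delivered-but-unacknowledged message becomes visible again once its timeout elapses, and the next `deliver` hands it out a second time.

```python
class VisibilityTimeoutBroker:
    """Toy model (hypothetical) of visibility-timeout broker semantics."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.queue = []    # messages ready for delivery
        self.unacked = {}  # msg_id -> deadline by which an ACK must arrive

    def publish(self, msg_id):
        self.queue.append(msg_id)

    def deliver(self, now):
        # Requeue anything whose visibility timeout has expired.
        for msg_id, deadline in list(self.unacked.items()):
            if now >= deadline:
                del self.unacked[msg_id]
                self.queue.append(msg_id)
        if not self.queue:
            return None
        msg_id = self.queue.pop(0)
        self.unacked[msg_id] = now + self.visibility_timeout
        return msg_id

    def ack(self, msg_id):
        self.unacked.pop(msg_id, None)


# A 1-hour visibility timeout vs. a ~25-minute task whose late ACK
# never arrives because the broker connection was silently dropped.
broker = VisibilityTimeoutBroker(visibility_timeout=3600)
broker.publish("price-batch-42")

executions = []
executions.append(broker.deliver(0))     # worker 1 starts the task
# ...task finishes at ~1500s, but the ACK is lost on a dead socket...
executions.append(broker.deliver(3601))  # timeout expires: redelivered
print(executions)                        # ['price-batch-42', 'price-batch-42']
```

With Redis as the broker, Celery exposes this deadline as `broker_transport_options = {"visibility_timeout": ...}`, and late acknowledgement is opted into via `task_acks_late = True`; the combination above is exactly what made the finished-but-unacked tasks eligible for redelivery.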

March 1, 2026 · 10 min · Zeeshan Khan