When Your Celery Workers Ran Tasks Twice: Migrating from Redis to RabbitMQ

What happened (the incident)

We had a production incident where long-running AI pricing tasks were requeued and re-executed, doubling downstream LLM API costs and wasting CPU time. The short timeline:

- A batch of long-running tasks (each ~20–25 minutes) completed on workers.
- Immediately after completion, the network layer silently dropped the broker TCP connection.
- Because the tasks ran with late ACKs, Celery never acknowledged the finished tasks to Redis.
- Redis treated the tasks as unacknowledged and requeued them; workers picked them up again and executed them a second time.

The financial impact was unfortunate and immediate: LLM calls and compute doubled for the affected runs. The technical impact exposed two important failure modes when using Redis as a Celery broker for long-running tasks: visibility timeout semantics and idle TCP connection drops. ...
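The two failure modes named above correspond to specific knobs on Celery's Redis transport. A minimal sketch of the relevant configuration, assuming a Redis broker (the app name, broker URL, and timeout value are illustrative, not taken from the post):

```python
from celery import Celery

app = Celery("pricing", broker="redis://localhost:6379/0")

# With late ACKs, a task is acknowledged only after it finishes. If the
# ACK never reaches Redis (e.g. the connection was dropped), Redis
# redelivers the task once the visibility timeout elapses -- so the
# timeout must comfortably exceed the longest task's runtime.
app.conf.task_acks_late = True
app.conf.broker_transport_options = {
    "visibility_timeout": 3600,  # seconds; must be > max task runtime
    "socket_keepalive": True,    # TCP keepalives against silent idle drops
}
```

`visibility_timeout` and `socket_keepalive` are standard Redis transport options in Celery/Kombu; the values to use depend on your task durations and network behavior.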

March 1, 2026 · 10 min · Zeeshan Khan

Building a Scalable Chat Backend with LangGraph, FastAPI, Celery, and Redis

Introduction

Building a chat backend powered by LLMs seems straightforward at first. You create an API endpoint, invoke your LangGraph agent, and stream the response back to the client. It works beautifully in development.

Then reality hits. Users lose connection mid-response. Load balancers time out long-running requests. Your server restarts, and all in-flight conversations are lost. Scaling horizontally becomes a nightmare because each request is tightly coupled to the process handling it. ...
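The coupling problem above is exactly what a broker-backed task queue removes: the request handler only enqueues work and returns an id, while a separate worker executes it regardless of what happens to the client connection. A stdlib-only sketch of that pattern (the in-memory queue and `generate_reply` stand in for Celery/Redis and the LangGraph agent, which are not shown here):

```python
import queue
import threading
import uuid

jobs = queue.Queue()   # stands in for the Redis-backed broker
results = {}           # stands in for a shared result store

def generate_reply(prompt):
    # Placeholder for the LangGraph agent invocation.
    return f"echo: {prompt}"

def worker():
    # Runs independently of any HTTP request: if the client disconnects,
    # the job still completes and its result is stored under its id.
    while True:
        job_id, prompt = jobs.get()
        results[job_id] = generate_reply(prompt)
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

def submit(prompt):
    # The "API endpoint": enqueue and return immediately with an id.
    job_id = str(uuid.uuid4())
    jobs.put((job_id, prompt))
    return job_id

job = submit("hello")
jobs.join()            # in a real system the client polls or streams instead
print(results[job])    # -> "echo: hello"
```

In the real architecture the queue and result store live outside the API process, which is what makes horizontal scaling and restarts safe.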

January 28, 2026 · 11 min · Zeeshan Khan