Why your queue depth is the only chart you need
If you can only have one operational dashboard, make it the queue depth across every async system. Everything else is downstream of that one number.
Modern systems are full of queues — task queues, message buses, retry queues, dead letter queues. Each one is a potential pile-up. The single most predictive chart for production health is the depth of those queues over time.
What queue depth tells you
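If a queue is growing faster than it's draining, something downstream is slow or broken. If it's empty, you're either keeping up or not getting work. If it's stable but high, consumers are matching the arrival rate but never clearing the backlog, so every message waits in line and there's no headroom for a spike: that's a capacity problem. One chart tells you which.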
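As a rough illustration of those readings, here is a minimal Python sketch that labels a window of depth samples. The function name `classify_queue`, the `high_watermark` threshold, and the 5% `tolerance` are all made up for this example; pick values that fit your own queues.

```python
from statistics import mean


def classify_queue(depths, high_watermark=1000, tolerance=0.05):
    """Label a window of queue-depth samples (oldest first).

    high_watermark and tolerance are illustrative assumptions,
    not recommended values.
    """
    if len(depths) < 2:
        raise ValueError("need at least two samples")

    # Nothing queued during the whole window: keeping up, or no work arriving.
    if max(depths) == 0:
        return "empty"

    avg = mean(depths)
    net_change = depths[-1] - depths[0]

    # Drift smaller than `tolerance` of the average depth counts as stable.
    if net_change > tolerance * avg:
        return "growing"      # something downstream is slow or broken
    if net_change < -tolerance * avg:
        return "draining"     # backlog is being worked off

    # Stable: the only remaining question is how deep the standing backlog is.
    return "stable_high" if avg >= high_watermark else "stable_low"
```

For example, `classify_queue([5000, 5010, 4990])` lands in `"stable_high"`, while `classify_queue([100, 200, 400])` comes back as `"growing"`.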
Set alerts on rate, not size
A queue at 10,000 messages might be fine if it's stable. A queue at 100 messages might be a crisis if it just doubled. Alert on the derivative — the rate of growth — not the absolute number.
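One way to sketch a rate-based alert in Python, assuming you already collect `(timestamp_seconds, depth)` samples somewhere. The name `should_alert` and the thresholds `max_growth_per_sec`, `max_ratio`, and `min_depth` are hypothetical and exist only for this example.

```python
def should_alert(samples, max_growth_per_sec=1.0, max_ratio=2.0, min_depth=50):
    """Decide whether to alert based on growth, not absolute depth.

    samples: list of (timestamp_seconds, depth) tuples, oldest first.
    All thresholds are illustrative assumptions.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")

    (t0, d0), (t1, d1) = samples[0], samples[-1]
    elapsed = t1 - t0
    if elapsed <= 0:
        raise ValueError("samples must span a positive time interval")

    # Absolute growth rate in messages per second over the window.
    rate = (d1 - d0) / elapsed

    # Relative growth; max(d0, 1) avoids dividing by zero on an empty start.
    ratio = d1 / max(d0, 1)

    # Ignore relative jumps on tiny queues (e.g. 1 -> 3) to limit noise.
    doubled = ratio >= max_ratio and d1 >= min_depth

    return rate > max_growth_per_sec or doubled
```

With these defaults, `should_alert([(0, 10000), (60, 10005)])` stays quiet for the large but stable queue, while `should_alert([(0, 50), (60, 100)])` fires for the small queue that just doubled.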
Watch what's getting stuck before you watch what's failing.