Designing Error Aggregation Systems

Accepted Session
Short Form
Scheduled: Wednesday, June 22, 2011 from 1:30 – 2:15pm in B202/03


So often we’re focused solely on the performance of our production systems. When disaster strikes, your team needs to know when error conditions began, where they’re coming from, how often they occur, and when they last occurred. Parsing logs isn’t fast enough, and email can’t keep up or preserve metadata.


A large percentage of recovery time during an unexpected outage is often spent determining the extent of the problem and its source. Tools that help localize the problem and quickly measure its severity are extremely helpful. The last thing you need during an outage is to have your mail server fall over, too.

And yet, why don’t we have a general-purpose solution to this?

This talk will explore designing error aggregation systems. We’ll cover effectively capturing events, efficiently processing them, and displaying the relevant information in real time. Error aggregators nicely complement your existing logging and email systems, taking the heat when there’s a problem and intelligently rolling that data up for easy analysis during a crisis.

Speaking experience