Most of our friends tell us they monitor their systems at one-minute or five-minute granularity. They explain that the cost of monitoring limits their ability to monitor in more detail. We believe that sampling this coarsely can let problems go undetected for quite some time. Often these issues only become apparent when it's too late, after major performance problems or downtime.
We ran some numbers through a little math function to examine what happens to fault detection as you decrease the granularity of your time-series values. We use "faults" to mean those discovered by adaptive fault detection, but a fault can be anything interesting that shows up as a time-series value. To simplify things, we assumed that each fault existed for exactly one second, the sample size was 1000 seconds, and the fault rate was a random 10%. Note: Intervals is on the right axis.
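The original math function isn't shown here, so the following is only a minimal sketch of how such an experiment might be set up. It assumes that "a random 10%" means each second independently has a 10% chance of containing a fault, and that monitoring at a given interval observes just one second out of every `interval` seconds. The function name `simulate_detection` and its parameters are hypothetical, chosen for illustration.

```python
import random


def simulate_detection(interval, duration=1000, fault_rate=0.10, seed=None):
    """Return (faults_present, faults_detected) for one simulated run.

    Assumptions (not from the original experiment's code):
    - each one-second slot independently contains a fault with
      probability `fault_rate`;
    - a monitor with the given `interval` only observes seconds
      0, interval, 2 * interval, ... and sees a fault only if it
      falls exactly on one of those seconds.
    """
    rng = random.Random(seed)
    # Build the timeline: True where a one-second fault exists.
    faults = [rng.random() < fault_rate for _ in range(duration)]
    # Count the faults that coincide with a sampled second.
    detected = sum(faults[t] for t in range(0, duration, interval))
    return sum(faults), detected


if __name__ == "__main__":
    # Same seed for every interval, so each run sees the same faults.
    for interval in (1, 5, 10, 60, 300):
        total, detected = simulate_detection(interval, seed=42)
        share = detected / total if total else 0.0
        print(f"interval={interval:>3}s  detected={detected}/{total}  ({share:.1%})")
```

Reusing the same seed across intervals keeps the underlying faults identical, so any drop in detections comes from the coarser sampling alone. Running it with several different seeds gives a rough feel for how much the detected share varies from run to run.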
While we recognize that faults have their own distribution and duration, and that these assumptions are not perfect, this experiment implies:
- as granularity decreases, you have an exponential loss of fault detection ability,
- at low levels of granularity, you will likely miss 100% of faults,
- there may be an optimal granularity that matches the average existence time of a fault, and
- fault detection becomes more variable as you reduce granularity (i.e., results vary more when you are sampling a smaller number of faults).
So, how granular is your monitoring, and why? We'll be testing this theory with better data in the near future, so hang tight!