Today I felt like I completely failed at my job. I was ‘on shift’, so I was supposed to be keeping an eye on the health of all the systems at work. Yet for over 3 hours I failed to notice an issue that was staring me in the face, because of the ‘noise factor’.
At the beginning of my shift I noticed that some parts of our monitoring had been semi-broken for a large part of the weekend, so I focused on getting that fixed. Once it was fixed, my monitoring systems were showing about 2000 alarms, which is high. But because a large chunk of the infrastructure had gone unmonitored over the weekend, I didn’t think much of it. On top of this, there was some database maintenance ongoing.
So how did the noise fail me? Well, the number of database alarms should have been glaringly obvious. But because of the monitoring and maintenance issues, I didn’t take heed. When you are used to having between 100 and 200 alarms and a system is reporting 2000, I find it gets very difficult to get a handle on the real problems.
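One way to cut through this kind of noise is to break the alarm total down by category and compare each category against its usual level, so a spike in one area (such as databases) stands out even when the overall count is inflated. Below is a minimal Python sketch of that idea. The category names, the baseline numbers, the `find_spikes` helper, and the sample alarms are all made up for illustration and are not taken from any real monitoring system.

```python
from collections import Counter

# Hypothetical "normal day" alarm counts per category (illustrative values only).
BASELINE = {"database": 20, "network": 50, "host": 80}


def find_spikes(alarms, baseline, factor=3.0):
    """Return (category, count, ratio) for categories at or above factor x baseline.

    `alarms` is an iterable of (category, source) pairs. Categories missing
    from the baseline are treated as having a baseline of 1.
    """
    counts = Counter(category for category, _source in alarms)
    spikes = []
    for category, count in counts.items():
        expected = max(baseline.get(category, 1), 1)  # avoid dividing by zero
        ratio = count / expected
        if ratio >= factor:
            spikes.append((category, count, ratio))
    # Worst offenders first.
    return sorted(spikes, key=lambda spike: spike[2], reverse=True)


# Made-up sample: a noisy day where only the database alarms are far above normal.
alarms = (
    [("database", f"db-{i}") for i in range(400)]
    + [("network", f"sw-{i}") for i in range(60)]
    + [("host", f"srv-{i}") for i in range(100)]
)

for category, count, ratio in find_spikes(alarms, BASELINE):
    print(f"{category}: {count} alarms ({ratio:.1f}x baseline)")
```

With a per-category view like this, the overall total matters less than which category has drifted furthest from its own normal. The threshold `factor` is an arbitrary choice here and would need tuning against real history.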
Overall I know I didn’t fail, but for the next few hours I am going to be beating myself up about it.