Like most things in life: if you can, keep it simple. Running databases in production isn't easy, and after seven years of practice we've found the best way to monitor MongoDB is to simplify the problem.
We've been using MongoDB extensively to power many different components of our server monitoring product, ranging from basic user profiles all the way to high-throughput processing of over 30TB/month. This means we keep a very close eye on how our MongoDB clusters are performing, from the metrics we collect to the graphs we configure.
Let’s start with the MongoDB metrics that we really care about, and why you should too.
The list of available MongoDB metrics is overwhelming, but let's make it more manageable by homing in on the critical ones. As you'll be aware, when you're busy in Ops, distractions should be minimised so you can focus on what really matters.
The replication built into MongoDB through replica sets has worked very well in our experience. However, by default writes only need to be accepted by the primary member and are replicated to the secondaries asynchronously, i.e. MongoDB is eventually consistent by default. This means there is usually a short window during which data is not yet replicated and could be lost if the primary fails.
This is a known property, so for critical data you can adjust the write concern so the write only returns once the data has reached a certain number of secondaries. For other writes, you need to know when secondaries start to fall behind, because this can indicate problems such as network issues or insufficient hardware capacity.
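As a sketch of the idea, a small helper like this (the function name and threshold are ours, not a MongoDB API) picks a write concern document depending on how critical the write is:

```python
def write_concern_for(critical: bool, timeout_ms: int = 5000) -> dict:
    """Return a MongoDB write concern document (illustrative helper).

    Critical writes wait until a majority of replica set members have
    acknowledged the data; everything else only needs the primary,
    which is MongoDB's default behaviour. A wtimeout stops the wait
    from blocking forever if secondaries are lagging.
    """
    if critical:
        return {"w": "majority", "wtimeout": timeout_ms}
    return {"w": 1}
```

With a driver such as pymongo, the same document is expressed as a `WriteConcern` object passed when getting a collection handle.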
Replica secondaries can sometimes fall behind if you are moving a large number of chunks in a sharded cluster. As such, we only alert if the replicas stay behind for more than a certain period of time, e.g. if they recover within 30 minutes then we don't alert.
The op time date metric, as reported by Server Density, is the key one for measuring this.
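To make the measurement concrete, here is a hedged sketch of how lag can be derived from the member list returned by MongoDB's `replSetGetStatus` command (the member dicts mirror that command's `name`, `stateStr` and `optimeDate` fields; the helper name is ours):

```python
from datetime import datetime

def replication_lag(members: list) -> dict:
    """Seconds of replication lag per secondary, measured against the
    primary's optimeDate (the timestamp of its last applied operation).

    Each member dict is assumed to carry 'name', 'stateStr' and an
    'optimeDate' datetime, as in replSetGetStatus output.
    """
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }
```

An alerting layer would then only fire when these values stay high across the whole grace window (e.g. 30 minutes), so chunk-migration spikes resolve quietly.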
In normal operation, one member of the replica set will be primary and all the other members will be secondaries. This rarely changes, and if there is a member election we want to know why. Usually it resolves itself within seconds, but we still want to investigate the cause quickly, because there could have been a hardware or network failure.
Flapping between states is not a normal working condition; state changes should only happen deliberately, e.g. for maintenance, or during a valid incident such as a hardware failure.
You can find these metrics under the Replication Set grouping within Server Density.
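A minimal health check over the same `replSetGetStatus` member list might look like this (the expected-state set and function name are our assumptions for illustration):

```python
# States we expect to see in steady operation, as reported in the
# replSetGetStatus 'stateStr' field.
EXPECTED_STATES = {"PRIMARY", "SECONDARY", "ARBITER"}

def unhealthy_members(members: list) -> tuple:
    """Return (members in unexpected states, number of primaries).

    Anything in RECOVERING, ROLLBACK, DOWN etc., or any count of
    primaries other than exactly one, warrants investigation.
    """
    odd = [m["name"] for m in members
           if m["stateStr"] not in EXPECTED_STATES]
    primaries = sum(1 for m in members if m["stateStr"] == "PRIMARY")
    return odd, primaries
```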
As of MongoDB 3.0, locking is implemented at collection- or document-level granularity, depending on the storage engine used. This was part of a gradual change from MongoDB 2.6, which used database-level locking. However, some operations, such as dropping a collection, still take a global lock, so if this happens too often you will start seeing performance problems as other operations (including reads) get backed up in the queue.
We've seen a high effective lock % be a symptom of other issues within the database, e.g. poorly configured or missing indexes, disk hardware failures and bad schema design. This makes it important to know when the value stays high for a long time, because it can cause the server to slow down (and become unresponsive, triggering a replica state change) or the oplog to start lagging behind.
Locking is only reported globally for pre-3.0 MMAP storage engines, and is reported in different locations for MongoDB 3.0 and above.
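For the pre-3.0 layout mentioned above, the effective lock % can be derived from two `serverStatus` samples, since `globalLock.lockTime` and `globalLock.totalTime` are cumulative microsecond counters. A sketch (field names from the pre-3.0 serverStatus output; the helper is ours):

```python
def effective_lock_pct(prev: dict, curr: dict) -> float:
    """Effective global lock % between two serverStatus samples.

    Uses the pre-3.0 MMAP layout: globalLock.lockTime and
    globalLock.totalTime are cumulative microseconds since startup,
    so the ratio of their deltas is the lock share over the interval.
    """
    d_total = curr["globalLock"]["totalTime"] - prev["globalLock"]["totalTime"]
    d_lock = curr["globalLock"]["lockTime"] - prev["globalLock"]["lockTime"]
    return 100.0 * d_lock / d_total if d_total else 0.0
```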
Related to this is how much work your disks are doing, i.e. disk i/o % utilisation. Approaching 100% indicates your disks are at capacity and you need to upgrade them, e.g. from spinning disk to SSD. If you are already using SSDs, then you can provide more RAM, or you need to split the data into shards.
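On Linux, this utilisation figure can be computed the way `iostat` does it, from the "milliseconds spent doing I/Os" counter in `/proc/diskstats` (field 13 of each row). A small sketch, with helper names of our own choosing:

```python
def io_ticks_ms(diskstats_line: str) -> int:
    """Extract 'milliseconds spent doing I/Os' (the 13th field of a
    /proc/diskstats row) for one device."""
    return int(diskstats_line.split()[12])

def disk_util_pct(ticks_prev_ms: int, ticks_next_ms: int,
                  interval_ms: int) -> float:
    """%util in the iostat sense: the share of wall-clock time the
    device had at least one I/O in flight over the sample interval."""
    return min(100.0, 100.0 * (ticks_next_ms - ticks_prev_ms) / interval_ms)
```

Sampling the counter every few seconds and feeding consecutive readings into `disk_util_pct` gives the same %util curve you would see from `iostat -x`.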
Focussing on critical metrics does not mean ignoring everything else. Tracking non-critical metrics helps you catch issues before they escalate into critical production problems.
It is at this point that graphing and dashboards become an increasingly important tool in your arsenal. We suggest monitoring these metrics over time, whilst striving to make them as visible as possible to you and your team.
“Server Density dashboards quickly became our favourite feature – there’s a lot of metrics to track and our MongoDB dash has saved us from more than a few problems over the years. We display them on a big TV in the office so spikes are harder to miss & easier to reference.” – Alexandar Sandstrom, CTO at Skovik.
Here’s our internal MongoDB dashboard, and a deeper look into the individual metrics that we monitor:
Typical Server Density MongoDB dashboard. Get your custom dashboard →
Memory is probably the most important resource you can give MongoDB and so you want to make sure you always have enough. The rule of thumb is to always provide sufficient RAM for all of your indexes to fit in memory, and where possible, enough memory for all your data too.
Resident memory is the key metric here – MongoDB provides some useful statistics to show what it is doing with your memory, which are collected and reported by the Server Density MongoDB monitoring plugin too.
Page faults are related to memory because a page fault happens when MongoDB has to go to disk to find the data rather than finding it in memory. An increasing page fault rate indicates that there is insufficient memory, so you should consider increasing the available RAM.
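Since `serverStatus` reports `extra_info.page_faults` as a cumulative counter, the useful number is its rate of change between samples. A minimal sketch (the helper name is ours):

```python
def page_fault_rate(prev: dict, curr: dict, interval_s: float) -> float:
    """Page faults per second between two serverStatus samples.

    extra_info.page_faults is cumulative since startup, so the delta
    over the sample interval gives the current fault rate.
    """
    delta = (curr["extra_info"]["page_faults"]
             - prev["extra_info"]["page_faults"])
    return delta / interval_s
```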
“Our MongoDB production performance is very good, though we do have to work hard to tune indexes, and dynamic documents can make things go crazy on RAM.” – Ján Mochňak, PHP Developer at SalesChamp.
Every connection to MongoDB has an overhead that contributes to the memory required by the system. The number of connections is initially limited by the Unix ulimit settings, but will ultimately be limited by the server's resources, particularly memory.
High numbers of connections can also indicate problems elsewhere e.g. requests backing up due to high lock % or a problem with your application code opening too many connections.
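The `serverStatus` `connections` section reports both `current` and `available`, so headroom against the connection limit is easy to track. A sketch (helper name ours):

```python
def connection_headroom(server_status: dict) -> float:
    """Fraction of the connection limit still free, from the
    serverStatus 'connections' section (current + available together
    make up the configured limit)."""
    conns = server_status["connections"]
    limit = conns["current"] + conns["available"]
    return conns["available"] / limit
```

Alerting when this fraction gets small catches both runaway application code and requests backing up behind a high lock %.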
If you’re anything like us, giving yourself peace of mind when you’re on call is crucial. That’s why configuring sensible and reliable alerts is very important to monitoring workflows:
“Keeping our cluster healthy is critical to our business and we use Server Density to consistently monitor and tweak thresholds and alerts to ensure problems get addressed immediately.” – Mark Lichtenberg, Director of Technology at MACH Energy.
Oplog lag (op time date)
Being able to fail over to a replica of your data is only useful if the data is up to date, so you need to know when that’s no longer the case!
Depending on how much you care about the individual nodes in your set, you'll either want a critical alert on this so you can investigate right away, or a high priority alert so you can get to it quickly afterwards.
The real concern is when you're getting close to losing a majority. This will typically only happen when instances go down, so you would usually discover it through an availability or "no data received" type alert, with that being the critical trigger.
Disk i/o utilisation %
When disk i/o utilisation hits 100%, your disk is at capacity and becomes a bottleneck. Brief spikes to 100% are normal, so it is the duration that matters: anything longer than a few seconds will start causing issues.
This is not a critical alert that should wake you up but you should be aware when it starts to happen, and then look at page faults to see how this is impacting MongoDB’s ability to return data when it’s not in memory.
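Because duration is what matters here, the alert condition is "above threshold for at least N seconds" rather than a simple threshold. A sketch of that check (the helper name and sample format are our assumptions):

```python
def over_threshold_for(samples: list, threshold: float,
                       min_duration_s: float) -> bool:
    """True when the most recent unbroken run of samples at or above
    the threshold has lasted at least min_duration_s.

    samples: oldest-first list of (unix_timestamp, value) pairs.
    Brief spikes that dip back below the threshold reset the run,
    so they never fire the alert.
    """
    run_start = None
    for ts, value in samples:
        if value >= threshold:
            if run_start is None:
                run_start = ts
        else:
            run_start = None
    return (run_start is not None
            and samples[-1][0] - run_start >= min_duration_s)
```

The same shape of check works for the oplog lag grace window described earlier, just with a longer duration.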
Now that you know the critical and non-critical metrics, your next step should be finding the best way to collect those metrics.
For those worrying moments when you need to firefight a problem or outage, MongoDB includes a number of tools out of the box. These can all be run against a live MongoDB server and report stats in real time:
Real-time tools are great for troubleshooting in the moment, but you’ll also need to keep track of statistics over time, so you can spot trends and get notified when metrics hit certain thresholds. This is where monitoring software comes in.
Taking responsibility for building and maintaining your own monitoring isn't easy, but it is a cheaper alternative for teams that have the time and capacity to dedicate to it. Here are three of the most popular open-source monitoring tools:
Monitoring your infrastructure is a big job, let alone monitoring the monitoring environment itself. We found open-source tools complex to set up and maintain, so we built our own, and Server Density was born.
We're delighted to be one of the most secure and scalable server monitoring tools out there.
Server Density is a hosted monitoring tool that our team continues to build and maintain every day. We have a dedicated MongoDB plugin that pulls all of the MongoDB metrics you need, letting you:
Configure alerts that we reliably deliver.
Create custom dashboards.
Graph your MongoDB data over time.
We take MongoDB monitoring very seriously, and we've built our product to make sure you have just the features you need, all executed to our exacting standards with a focus on scalability, security and reliability.