
How to monitor MongoDB

Like most things in life, if you can, keep it simple. Running databases in production isn’t easy, and with 7 years of practice we’ve found the best way to monitor MongoDB is to simplify the problem.

We'll discuss:

  1. How to focus on the right MongoDB metrics (critical and non-critical).
  2. The top alerts you can’t sleep without.
  3. Deciding on the right MongoDB monitoring tools to slot into your workflow.

We’ve been using MongoDB extensively to power many different components of our server monitoring product, ranging from basic user profiles all the way to high-throughput processing of over 30TB/month. This means we keep a very close eye on how our MongoDB clusters are performing; from the metrics we collect, to the graphs we configure.

Let’s start with the MongoDB metrics that we really care about, and why you should too.

Key MongoDB monitoring metrics

The list of available MongoDB metrics is overwhelming, but let’s make it more manageable by homing in on the critical ones. As you’ll be aware, when you’re busy in Ops, distractions should be minimised so you can focus on what really matters.

Oplog replication Lag

The replication built into MongoDB through replica sets has worked very well in our experience. However, by default writes only need to be acknowledged by the primary member and then replicate down to the secondaries asynchronously, i.e. MongoDB is eventually consistent by default. This means there is usually a short window where data might not be replicated should the primary fail.

This is a known property, so for critical data, you can adjust the write concern to return only when data has reached a certain number of secondaries. For other writes, you need to know when secondaries start to fall behind because this can indicate problems such as network issues or insufficient hardware capacity.
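As a minimal sketch of adjusting the write concern, here is how you might require acknowledgement from a majority of replica set members using pymongo (the connection string, database and collection names are placeholders, and the 5 second timeout is an illustrative choice):

    from pymongo import MongoClient, WriteConcern

    # Placeholder connection string; replace with your own replica set URI.
    client = MongoClient("mongodb://db1.example.com,db2.example.com/?replicaSet=rs0")

    orders = client.mydb.get_collection(
        "orders",
        # Critical writes: wait until a majority of members have the data,
        # raising a timeout error if that takes longer than 5 seconds.
        write_concern=WriteConcern(w="majority", wtimeout=5000),
    )

    # This insert only returns once a majority of the replica set has
    # acknowledged the write, closing the "unreplicated data" window.
    orders.insert_one({"status": "paid", "amount": 100})

The trade-off is latency: waiting for a majority makes each write slower, which is why it makes sense for critical data rather than every write.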


Replica secondaries can sometimes fall behind if you are moving a large number of chunks in a sharded cluster. As such, we only alert if the replicas fall behind for more than a certain period of time, e.g. if they recover within 30 minutes then we don’t alert.

The op time date metric as reported by Server Density is the key one to measure this.
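If you want to compute the same lag yourself, a rough sketch with pymongo is to compare each secondary’s optimeDate against the primary’s in the output of replSetGetStatus (the connection string is a placeholder; the field names come from the standard replSetGetStatus document):

    from pymongo import MongoClient

    client = MongoClient("mongodb://db1.example.com/?replicaSet=rs0")

    status = client.admin.command("replSetGetStatus")
    members = status["members"]

    # state 1 = PRIMARY, state 2 = SECONDARY
    primary = next(m for m in members if m["state"] == 1)

    for member in members:
        if member["state"] != 2:
            continue
        # optimeDate is the timestamp of the last oplog entry this member applied.
        lag = (primary["optimeDate"] - member["optimeDate"]).total_seconds()
        print(f"{member['name']} is {lag:.0f}s behind the primary")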

Replica state

In normal operation, one member of the replica set will be primary and all the other members will be secondaries. This rarely changes and if there is a member election, we want to know why. Usually this happens within seconds and the condition resolves itself but we want to investigate the cause pretty quickly because there could have been a hardware or network failure.

Flapping between states is not a normal working condition; it should only happen deliberately (e.g. for maintenance) or during a valid incident (e.g. a hardware failure).

You can find these metrics under the Replication Set grouping within Server Density.
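As a hedged sketch of checking member states yourself, the same replSetGetStatus output reports a stateStr per member (the expected state names below are the standard MongoDB ones; the connection string is a placeholder):

    from pymongo import MongoClient

    client = MongoClient("mongodb://db1.example.com/?replicaSet=rs0")
    status = client.admin.command("replSetGetStatus")

    # In normal operation we expect exactly one PRIMARY and the rest SECONDARY
    # (ARBITER is also fine if you run one). Anything else is worth investigating.
    expected = {"PRIMARY", "SECONDARY", "ARBITER"}

    for member in status["members"]:
        if member["stateStr"] not in expected:
            print(f"ALERT: {member['name']} is in state {member['stateStr']}")

    primaries = [m for m in status["members"] if m["stateStr"] == "PRIMARY"]
    if len(primaries) != 1:
        print(f"ALERT: expected 1 primary, found {len(primaries)}")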

Locking and disk i/o % utilization

As of MongoDB 3.0, locking is implemented at collection or document level granularity, depending on the storage engine used: collection level for MMAPv1 and document level for WiredTiger. This was part of a gradual change from MongoDB 2.6, which locked at the database level. However, some operations, such as dropping a collection, still take a global lock, so if this happens too often you will start seeing performance problems as other operations (including reads) get backed up in the queue.

We’ve seen a high effective lock % be a symptom of other issues within the database, e.g. poorly configured indexes, missing indexes, disk hardware failures and bad schema design. This means it’s important to know when the value is high for a long time, because it can cause the server to slow down (and become unresponsive, triggering a replica state change) or the oplog to start to lag behind.

Locking is only reported globally for the pre-MongoDB 3.0 MMAP storage engine, and is reported in different locations of the server statistics for MongoDB 3.0 and above.

Related to this is how much work your disks are doing, i.e. disk i/o % utilization. Approaching 100% indicates your disks are at capacity and you need to upgrade them, e.g. from spinning disks to SSDs. If you are already using SSDs, then you either need to provide more RAM or split the data into shards.
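As a rough illustration, the global lock queue can be read from serverStatus with pymongo; a persistently non-zero queue alongside high disk utilization (which comes from OS tools such as iostat, not from MongoDB itself) usually points at one of the problems above. The connection string and the threshold are illustrative assumptions:

    from pymongo import MongoClient

    client = MongoClient("mongodb://db1.example.com/")
    status = client.admin.command("serverStatus")

    # Operations waiting on a lock; ideally this hovers around zero.
    queue = status["globalLock"]["currentQueue"]
    print(f"queued readers={queue['readers']} writers={queue['writers']}")

    # Illustrative threshold: a persistently large queue means operations are
    # backing up behind locks and is worth investigating.
    if queue["total"] > 50:
        print("ALERT: lock queue is backing up")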


Non-critical metrics to monitor MongoDB

Focussing on critical metrics does not equate to ignoring everything else. Tracking non-critical metrics helps to avoid issues that would escalate to critical production problems.

It is at this point that graphing and dashboards become an increasingly important tool in your arsenal. We suggest monitoring these metrics over time, whilst striving to make them as visible as possible to you and your team.

“Server Density dashboards quickly became our favourite feature – there’s a lot of metrics to track and our MongoDB dash has saved us from more than a few problems over the years. We display them on a big TV in the office so spikes are harder to miss & easier to reference.” – Alexandar Sandstrom, CTO at Skovik.

Here’s our internal MongoDB dashboard, and a deeper look into the individual metrics that we monitor:


Typical Server Density MongoDB dashboard. Get your custom dashboard →

Memory usage and page faults

Memory is probably the most important resource you can give MongoDB and so you want to make sure you always have enough. The rule of thumb is to always provide sufficient RAM for all of your indexes to fit in memory, and where possible, enough memory for all your data too.

Resident memory is the key metric here – MongoDB provides some useful statistics to show what it is doing with your memory, which are collected and reported by the Server Density MongoDB monitoring plugin too.

Page faults are related to memory because a page fault happens when MongoDB has to go to disk to find the data rather than memory. More page faults indicate that there is insufficient memory, so you should consider increasing the available RAM.
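A minimal sketch of pulling these two numbers from serverStatus with pymongo (the connection string is a placeholder; the field names are the standard serverStatus ones, and extra_info.page_faults may not be reported on every platform or storage engine):

    from pymongo import MongoClient

    client = MongoClient("mongodb://db1.example.com/")
    status = client.admin.command("serverStatus")

    # Resident memory (MB) actually held in RAM by the mongod process.
    resident_mb = status["mem"]["resident"]

    # Cumulative page faults: sample this over time, the rate is what matters.
    page_faults = status.get("extra_info", {}).get("page_faults")

    print(f"resident memory: {resident_mb} MB, page faults (total): {page_faults}")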

“Our MongoDB production performance is very good, though we do have to work hard to tune indexes, and dynamic documents can make things go crazy on RAM.” – Ján Mochňak, PHP Developer at SalesChamp.

Connections

Every connection to MongoDB has an overhead which contributes to the required memory for the system. This is initially limited by the Unix ulimit settings, but will then become limited by the server resources, particularly memory.

High numbers of connections can also indicate problems elsewhere e.g. requests backing up due to high lock % or a problem with your application code opening too many connections.
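A quick way to keep an eye on this is the connections document in serverStatus; a sketch with pymongo (the connection string is a placeholder):

    from pymongo import MongoClient

    client = MongoClient("mongodb://db1.example.com/")
    conns = client.admin.command("serverStatus")["connections"]

    # "current" counts open connections, "available" is how many more the
    # server will accept before hitting its limit (derived from ulimit).
    used_pct = 100.0 * conns["current"] / (conns["current"] + conns["available"])
    print(f"connections: {conns['current']} open, {used_pct:.1f}% of the limit used")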

Alerts you can’t sleep without.

If you’re anything like us, giving yourself peace of mind when you’re on call is crucial. That’s why configuring sensible and reliable alerts is very important to monitoring workflows:

“Keeping our cluster healthy is critical to our business and we use Server Density to consistently monitor and tweak thresholds and alerts to ensure problems get addressed immediately.” – Mark Lichtenberg, Director of Technology at MACH Energy.
Oplog lag (op time date)

Comments: Being able to fail over to a replica of your data is only useful if the data is up to date, so you need to know when that’s no longer the case!

Suggested alerts:

  • Op time date more than 5 minutes behind current time, for at least 15 minutes.
  • Op time date more than 10 minutes behind current time, for at least 5 minutes. Once the lag goes past 10 minutes it’s almost pointless to fail over because you risk losing 10 minutes of data, which for most applications would be unacceptable.

Replica state

Comments: Depending on how much you care about the individual nodes in your set, you’ll either want a critical alert on this so you can investigate right away, or a high priority alert so you can get to it quickly afterwards. The real concern is when you’re getting close to losing a majority. This will typically only happen with down instances, so discovering that would usually be combined with an availability or “no data received” type alert, with that being the critical trigger.

Suggested alerts:

  • Any change in replica state (critical or high priority, depending on how much you care about the individual nodes).

Disk i/o utilisation %

Comments: When the disk i/o utilisation % hits 100%, your disk is at capacity and it will become a bottleneck. This will likely happen regularly, so it is the duration that’s important, because anything longer than a few seconds will start causing issues. This is not a critical alert that should wake you up, but you should be aware when it starts to happen, and then look at page faults to see how this is impacting MongoDB’s ability to return data when it’s not in memory.

Suggested alerts:

  • Disk i/o utilisation greater than 95% for more than 2 minutes.
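If you are rolling your own checks rather than using a monitoring tool’s alerting, the “value over a threshold for a sustained period” pattern behind these suggestions looks roughly like the sketch below. The polling interval is an assumption, and get_oplog_lag_seconds is a hypothetical helper standing in for whatever lag measurement you already collect:

    import time

    def sustained_breach(read_value, threshold, duration_s, interval_s=30):
        """Return True once read_value() has exceeded threshold for duration_s."""
        breach_started = None
        while True:
            if read_value() > threshold:
                breach_started = breach_started or time.time()
                if time.time() - breach_started >= duration_s:
                    return True
            else:
                # Value recovered; reset the clock.
                breach_started = None
            time.sleep(interval_s)

    # Example: oplog lag more than 5 minutes behind, for at least 15 minutes
    # (get_oplog_lag_seconds is hypothetical, supplied by your own code).
    # sustained_breach(get_oplog_lag_seconds, threshold=300, duration_s=900)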

Tools to monitor MongoDB

Now that you know the critical and non-critical metrics, your next step should be finding the best way to collect those metrics.

MongoDB monitoring: Real time tools.

For those worrying moments when you need to firefight a problem or outage, MongoDB includes a number of tools out of the box. These can all be run against a live MongoDB server and report stats in real time:

  • mongostat: per-second counts of operations (inserts, queries, updates, deletes), queue lengths, memory and network usage.
  • mongotop: time spent reading and writing, broken down per collection.
  • The mongo shell: commands such as db.currentOp() and db.serverStatus() for ad-hoc inspection.
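If you prefer to script the same kind of real-time view, a minimal mongostat-style loop with pymongo might look like this; the one-second interval, the chosen counters and the connection string are assumptions:

    import time
    from pymongo import MongoClient

    client = MongoClient("mongodb://db1.example.com/")

    ops = ("insert", "query", "update", "delete", "getmore", "command")
    previous = None
    while True:
        # opcounters holds cumulative totals since the server started.
        counters = client.admin.command("serverStatus")["opcounters"]
        if previous:
            # Difference over one second gives per-second rates,
            # similar to what mongostat prints.
            rates = {op: counters[op] - previous[op] for op in ops}
            print(rates)
        previous = counters
        time.sleep(1)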

MongoDB monitoring: Building your own.

Real-time tools are great for troubleshooting in the moment, but you’ll also need to keep track of statistics over time, so you can spot trends and get notified when metrics hit certain thresholds. This is where monitoring software comes in.

Taking responsibility for building and maintaining your own monitoring isn’t easy, but it is a cheaper alternative for people who have the time and capacity to dedicate to it. Here are three of the most popular open-source monitoring tools:

Monitoring your infrastructure is a big job, let alone monitoring the monitoring environment itself. We found open source tools were complex to set up and maintain, so we built our own and Server Density was born.

We’re delighted to be regarded as one of the most secure and scalable server monitoring tools out there.

MongoDB monitoring: Alerts, dashboards and graphs.

Server Density is a hosted monitoring tool that our team continues to build and maintain every day. We have a dedicated MongoDB plugin to pull all of the MongoDB metrics you need, and with it you can:

  • Configure alerts that we reliably deliver.
  • Create custom dashboards.
  • Graph your MongoDB data over time.

We take MongoDB monitoring very seriously, and we’ve built our product to make sure you have just the features you need, all executed to our exacting standards with a focus on scalability, security and reliability.

Update: We hosted a live Hangout on Air with Paul Done from MongoDB discussing how to monitor MongoDB. We’ve made the slides and video available, which can be found below.

Monitor MongoDB Slides

Monitor MongoDB Video

