Jason Wilder's Blog

Optimizing MongoDB Indexes

Good indexes are an important part running a well performing application on MongoDB. MongoDB performs best when it can keep your indexes in RAM. Reducing the size of your indexes also leads to faster queries and the ability to manage more data with less RAM.

These are a few tips to reduce the size of your MongoDB indexes:

1) Determining Indexes Sizes

The first thing you should do is to understand the size of your indexes. You want to know the sizes before you make changes to confirm that the changes have actually reduced the size. Ideally, you are graphing your indexes over time with your monitoring tools.

Using the mongo shell you can run db.stats() to get database indexes stats:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
> db.stats()
{
  "db" : "examples1",
  "collections" : 6,
  "objects" : 403787,
  "avgObjSize" : 121.9966467469235,
  "dataSize" : 49260660,
  "storageSize" : 66695168,
  "numExtents" : 20,
  "indexes" : 9,
  "indexSize" : 48524560,
  "fileSize" : 520093696,
  "nsSizeMB" : 16,
  "ok" : 1
}
  • indexes - The number of indexes in examples1 DB
  • indexSize - The size of the indexes in example1 DB

Since each collection has indexes, you can run db.collection.stats() to see them:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
> db.address.stats()
{
  "ns" : "examples1.address",
  "count" : 3,
  "size" : 276,
  "avgObjSize" : 92,
  "storageSize" : 8192,
  "numExtents" : 1,
  "nindexes" : 2,
  "lastExtentSize" : 8192,
  "paddingFactor" : 1,
  "flags" : 1,
  "totalIndexSize" : 16352,
  "indexSizes" : {
      "_id_" : 8176,
      "_types_1" : 8176
  },
  "ok" : 1
}
  • totalIndexSize - The size of all indexes in the collection
  • indexSizes - A dictionary of index name and size

NOTE: all sizes returned by these commands are in bytes.

These commands are useful but they are tedious to use manually. To report on indexes stats, I wrote a utility, index-stats.py, that can be found in the mongodb-tools project on Github that makes things easier.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
(virtualenv) mongodb-tools$ ./index-stats.py
Checking DB: examples2.system.indexes
Checking DB: examples2.things
Checking DB: examples1.system.indexes
Checking DB: examples1.address
Checking DB: examples1.typeless_address
Checking DB: examples1.user
Checking DB: examples1.typeless_user

Index Overview
+----------------------------+------------------------+--------+------------+
|         Collection         |         Index          | % Size | Index Size |
+----------------------------+------------------------+--------+------------+
| examples1.address          | _id_                   |   0.0% |      7.98K |
| examples1.address          | _types_1               |   0.0% |      7.98K |
| examples1.typeless_address | _id_                   |   0.0% |      7.98K |
| examples1.typeless_user    | _id_                   |  10.1% |      6.21M |
| examples1.typeless_user    | address_id_1           |  10.1% |      6.21M |
| examples1.typeless_user    | typeless_address_ref_1 |   5.9% |      3.62M |
| examples1.user             | _id_                   |  10.1% |      6.21M |
| examples1.user             | _types_1               |   6.9% |      4.24M |
| examples1.user             | _types_1_address_id_1  |  12.2% |      7.51M |
| examples1.user             | _types_1_address_ref_1 |  26.2% |     16.09M |
| examples2.things           | _id_                   |  10.1% |      6.21M |
| examples2.things           | _types_1               |   8.4% |      5.13M |
+----------------------------+------------------------+--------+------------+

Top 5 Largest Indexes
+-------------------------+------------------------+--------+------------+
|        Collection       |         Index          | % Size | Index Size |
+-------------------------+------------------------+--------+------------+
| examples1.user          | _types_1_address_ref_1 |  26.2% |     16.09M |
| examples1.user          | _types_1_address_id_1  |  12.2% |      7.51M |
| examples1.typeless_user | _id_                   |  10.1% |      6.21M |
| examples2.things        | _types_1               |   8.4% |      5.13M |
| examples1.user          | _types_1               |   6.9% |      4.24M |
+-------------------------+------------------------+--------+------------+

Total Documents: 600016
Total Data Size: 74.77M
Total Index Size: 61.43M
RAM Headroom: 2.84G
Available RAM Headroom: 1.04G

The output shows the total index size, each index size, and their relative sizes to each other. In addition, the Top 5 Largest indexes are reported across all your collections. This makes it easy to determine your largest indexes and the ones where reducing their size will provide most benefit.

  • RAM Headroom is your physical memory - index size. A positive value means you have RAM available for indexes to fit in memory.
  • Available RAM Headroom is free memory - index size. Since other processes consume memory on this system, I don’t have the total RAM Headroom available.

The RAM Headroom stat idea comes from the MongoDB monitoring service I use, ServerDensity.

From this output, I would focus on the examples1.user collection and the types_1_address_ref_1 and types_1_address_id_1 indexes first.

2) Remove Redundant Indexes

If you have been releasing code changes over a period of time, you’ll likely end up with redundant indexes. MongoDB can use the prefix of a compound index if all the component parts are not available. In the previous output,

1
| examples1.user          | _types_1               |   6.9% |      4.24M |

is redundant with

1
2
| examples1.user          | _types_1_address_ref_1 |  26.2% |     16.09M |
| examples1.user          | _types_1_address_id_1  |  12.2% |      7.51M |

Because _types_1 is the prefix to these two indexes. Dropping it would save 4.2M on the total index size and be one less index to update when user documents change.

To make it easier to find these indexes, you can run redundant-indexes.py from mongodb-tools:

1
2
3
4
5
6
(virtualenv)mongodb-tools$ ./redundant-indexes.py
Checking DB: examples2
Checking DB: examples1
Index examples1.user[_types_1] may be redundant with examples1.user[_types_1_address_ref_1]
Index examples1.user[_types_1] may be redundant with examples1.user[_types_1_address_id_1]
Checking DB: local

3) Compact Command

If you are running MongoDB 2.0+, you can run the compact command to defragment your collections and rebuild the indexes. The compact command locks the database so make sure you know where you are running it beforehand. If you are running with replica sets, the easiest thing to do is to run it on your secondaries, one at a time, fail-over the primary to new secondary and run compact on the old primary.

4) MongoDB 2.0 Index Improvements

If you are not running MongoDB 2.0 or later, upgrading and rebuilding your indexes should provide about a 25% savings.

See Index Performance Enhancements

5) Check Index Criteria

Another thing to check is your index criteria. You want the values that are indexed to be small and as selective as possible. Indexing values that do not help MongoDB find your data faster slow queries down and increase the index size. If you are using a mapping framework for your application, and it support defining indexes in the code, you should check to see what it’s actually indexing. For example MongoEngine for Python uses a “_types” field to identify subclasses in the same collection. This can add a lot of space and may not add to the selectivity of you indexes.

In my test data, my largest index is:

1
| examples1.user             | _types_1_address_ref_1 |  26.2% |     16.09M |

Looking at the data for it:

1
2
3
4
5
6
7
8
9
10
11
12
13
> db.user.findOne()
{
  "_id" : ObjectId("4f2ef95c89a40a11c5000002"),
  "_types" : [
      "User"
  ],
  "address_id" : ObjectId("4f2ef95c89a40a11c5000000"),
  "address_ref" : {
      "$ref" : "address",
      "$id" : ObjectId("4f2ef95c89a40a11c5000000")
  },
  "_cls" : "User"
}

You can see that _types is an array with a value of User, the class name. Since I don’t have any subclasses of User in my code, indexing this value does not help the index selectivity. Another way of thinking about this is that each value in the index is going to have “User” as a prefix which adds a few extra bytes for value and does not increase the selectivity of the index.

Removing it in the code with:

1
2
class User(Document):
    meta {'index_types':False}

Changes the index to:

1
| examples1.user             | address_ref_1          |  16.8% |     12.39M |

About a 23% savings.

Digging in further, address_ref_1 is a ReferenceProperty to an Address object. The data above shows that it is a dictionary that contains the id of the reference field as well as the collection that it points to. If we change this ReferenceProperty to an ObjectIdProperty, which is what address_id, is, you can get additional savings:

1
2
| examples1.user             | address_id_1           |   9.5% |      6.21M |
| examples1.user             | address_ref_1          |  20.9% |     13.70M |

About a 53% savings. This changes the index value from being stored as a serialized dictionary to just and ObjectId which is likely highly optimized with MongoDB. Changing the property type does require code changes though and you also lose the automatic de-referencing capability provided by ReferenceProperties. It can produce significant savings though.

In total, we’ve reduced the original index by 61% by adjusting some index criteria and making some small code changes.

6) Delete/Move Old Data

In most applications, some data is accessed more frequently than others. If you have old data that won’t be accessed by your users, you may be able to purge it, move it to another un-indexed collection, or archive it somewhere outside of the DB. Ideally, you database contains and is indexing the working set of available data.

There are some other good optimization ideas that can be found here:

How do you tune your indexes?

Centralized Logging

Logs are a critical part of any system, they give you insight into what a system is doing as well what happened. Virtually every process running on a system generates logs in some form or another. Usually, these logs are written to files on local disks. When your system grows to multiple hosts, managing the logs and accessing them can get complicated. Searching for a particular error across hundreds of log files on hundreds of servers is difficult without good tools. A common approach to this problem is to setup a centralized logging solution so that multiple logs can be aggregated in a central location.

So what are your options?

File Replication

A simple approach is to setup file replication of your logs to a central server on a cron schedule. Usually rsync and cron are used since they are simple and straightforward to setup. This solution can work for a while but it doesn’t provide timely access to log data. It also doesn’t aggregate the logs and only co-locates them.

Syslog

Another option that you probably already have installed is syslog. Most people use rsyslog or syslog-ng which are two syslog implementations. These daemons allow processes to send log messages to them and the syslog configuration determines how the are stored. In a centralized logging setup, a central syslog daemon is setup on your network and the client logging dameons are setup to forward messages to the central daemon. A good write-up of this kind of setup can be found at: Centralized Logging Use Rsyslog

Syslog is great because just about everything uses it and you likely already have it installed on your system. With a central syslog server, you will likely need to figure out how to scale the server and make it highly-available.

Distributed Log Collectors

A new class of solutions that have come about have been designed for high-volume and high-throughput log and event collection. Most of these solutions are more general purpose event streaming and processing systems and logging is just one use case that can be solved using them. All of these have their specific features and differences but their architectures are fairly similar. They generally consist of logging clients and/or agents on each specific host. The agents forward logs to a cluster of collectors which in turn forward the messages to a scalable storage tier. The idea is that the collection tier is horizontally scalable to grow with the increase number of logging hosts and messages. Similarly, the storage tier is also intended to scale horizontally to grow with increased volume. This is gross simplification of all of these tools but they are a step beyond traditional syslog options.

  • Scribe - Scribe is scalable and reliable log aggregation server used and released by Facebook as open source. Scribe is written in C++ and uses Thrift for the protocol encoding. Since it uses thrift, virtually any language can work with it.

  • Flume - Flume is an Apache project for collecting, aggregating, and moving large amounts of log data. It stores all this data on HDFS.

  • logstash - logstash lets you ship, parse and index logs from any source. It works by defining inputs (files, syslog, etc.), filters (grep, split, multiline, etc..) and outputs (elasticsearch, mongodb, etc..). It also provides a UI for accessing and searching your logs. See Getting Started

  • Chukwa - Chukwa is another Apache project that collects logs onto HDFS.

  • fluentd - Fluentd is similar to logstash in that there are inputs and outputs for a large variety of sources and destination. Some of it’s design tenets are easy installation and small footprint. It doesn’t provide any storage tier itself but allows you to easily configure where your logs should be collected.

  • kafka - Kafka was developed at LinkedIn for their activity stream processing and is now an Apache incubator project. Although Kafka could be used for log collection this is not it’s primary use case. Setup requires Zookeeper to manage the cluster state.

  • Graylog2 - Graylog2 provides a UI for searching and analyzing logs. Logs are stored in MongoDB and/or elasticsearch. Graylog2 also provides the GELF logging format to overcome some issues with syslog message: 1024 byte limit and unstructured log messages. If you are logging long stacktraces, you may want to look into GELF.

  • splunk - Splunk is commercial product that has been around for several years. It provides a whole host of features for not only collecting logs but also analyzing and viewing them.

Hosted Logging Services

There are also several hosted “logging as a service” providers as well. The benefit of them is that you only need to configure your syslog forwarders or agents and they manage the collection, storage and access to the logs. All of the infrastructure that you have to setup and maintain is handled by them, freeing you up to focus on your application. Each service provide a simple setup (usuallysyslog forwarding based), an API and a UI to support search and analysis.

Octopress Blogging System

After several years of maintaining a Wordpress blog, I’ve decided to switch to Octopress. Wordpress worked well for me at first but it seemed to have more functionality than I really needed or wanted. The breaking point happened last night when I tried to upgrade it and the templates I was using no longer worked. Digging into the code was going to be more effort than it was worth. So now I’ve moved on to something simpler.

Octopress uses a different approach then Wordpress for generating content. Octopress generates a static site whereas Wordpress needs MySQL and PHP. I like the simplicty of Octopress and it also makes it easy version control everything w/ git.