Discussion:
[mongodb-user] Compound indexes for date range queries
John Myers
2016-10-17 22:54:18 UTC
Permalink
Hi,

I currently have very large collection with millions of documents, where
one field is the time of an event (in epoch seconds). All queries against
this collection will *at least* have this field present. There are some
other fields that can be included, depending on the application making the
query.

Example schema:
{
start: 1476744427,
sender: "Bob",
receiver: "Jane",
amount_sent: 45,
amount_received: 100
}

Let's say I want to support the following queries:

{$or: [{sender: "Bob", receiver: "Bob}], amount_sent: 30, start: {$gte:
12345, $lte: 5678}}
...
{start: {$gte: 12345, $lte: 5678}, amount_sent: 88}

If I need to sort, it will by most recent, so, descending.

The last one seems like I should just have an index on {start: -1,
amount_sent: 1} (or should it be amount_sent first??). My thought is that
if I use start as the first field in the compound index, other queries can
be supported, where other fields are searched on besides amount_sent, it
will just use the index to find the start times and then COLSCAN the rest.
For instance a query against "start" and "amount_received" can still
benefit from the index on start, but then have to scan for the
amount_received values.

Supporting the $or statement, I'm not sure, per the docs, each field in the
$or statement should have its own index, so do I need two here?
{start: -1, sender: 1} AND {start: -1, receiver: 1}

Thanks for any advice!
--
You received this message because you are subscribed to the Google Groups "mongodb-user"
group.

For other MongoDB technical support options, see: https://docs.mongodb.com/manual/support/
---
You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mongodb-user+***@googlegroups.com.
To post to this group, send email to mongodb-***@googlegroups.com.
Visit this group at https://groups.google.com/group/mongodb-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/mongodb-user/06a434f7-4760-414b-9d68-a2a23b9aafa2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Kevin Adistambha
2016-11-02 03:21:11 UTC
Permalink
Hi John

It’s difficult to say for sure what indexes is optimal without knowing for
sure the cardinality of your data. However, I tried to make an educated
guess regarding your data, and populated a test collection using 50,000
randomly generated documents to check out your queries.

I tried with the arguably simpler second query you posted:

{start: {$gte: 12345, $lte: 5678}, amount_sent: 88}

For this query, you need an index having the fields start and amount_sent.
You suggested using {start:-1, amount_sent:1}. However, this is not
optimal, which we can see in the explain() output. Note the use of hint()
<https://docs.mongodb.com/manual/reference/method/cursor.hint/> to force
db.test.explain('executionStats').find({amount_sent:10, start: {$gte: 1467295200, $lte: 1483189200}}).sort({start:-1}).hint({start:-1, amount_sent:1})
...
"nReturned": 253,
"executionTimeMillis": 38,
"totalKeysExamined": 24947,
"totalDocsExamined": 253,
...

From the explain('executionStats') output, MongoDB needs to scan ~25,000
keys to return 253 documents matching the query.

A better index is {amount_sent:1, start:1}. This index is structured by
first having an exact value of amount_sent, where for each value of
amount_sent, sorted values of start is stored. Here’s an illustration of
the index structure:
amount_sent 1 2 3 

start 1 2 3 
 1 2 3 
 1 2 3 
 


Since you need an exact value of amount_sent and a range of start (and
possibly sort using start), you can verify from the illustration above that
the index facilitates the query perfectly.
db.test.explain('executionStats').find({amount_sent:10, start: {$gte: 1467295200, $lte: 1483189200}}).sort({start:-1}).hint({amount_sent:1, start:1})
...
"nReturned": 253,
"executionTimeMillis": 1,
"totalKeysExamined": 253,
"totalDocsExamined": 253,
...

Now MongoDB examines 253 keys (instead of ~25,000) and returns the same
number of documents. One takeaway from this is that an index fits the query
if the numbers in nReturned, totalKeysExamined, and totalDocsExamined are
the same.

Your first query is more involved:

{$or: [{sender: “Bob”, receiver: “Bob}], amount_sent: 30, start: {$gte:
12345, $lte: 5678}}
db.test.explain('executionStats').find({$or:[{sender:"Oscar"}, {receiver:"Oscar"}], amount_sent:10, start: {$gte: 1467295200, $lte: 1483189200}}).sort({start:-1}).hint({sender:1, amount_sent:1, start:1})
...
"nReturned": 4,
"executionTimeMillis": 118,
"totalKeysExamined": 50000,
"totalDocsExamined": 50000,
...
"stage": "SORT_KEY_GENERATOR",
...

This is not great at all. MongoDB looks at 50,000 keys and documents
(basically the whole collection), only to return 4 of them. Also, note the
stage called “SORT_KEY_GENERATOR” there. The existence of this stage means
that MongoDB cannot use the index to present a sorted result, and thus is
doing an in-memory sort instead, which is limited to 32 MB. If your query
returns a large result set that exceeds 32 MB, the query will fail. See Use
Indexes to Sort Query Results
<https://docs.mongodb.com/v3.2/tutorial/sort-results-with-indexes/>.

We can improve this by refactoring your query (by unwrapping the $or
statement) and creating a couple of indexes to support both terms of the
$or query.

The refactored query looks like:

{$or:[
{sender:"Oscar", amount_sent:10, start: {$gte: 1467295200, $lte: 1483189200}},
{receiver:"Oscar", amount_sent:10, start: {$gte: 1467295200, $lte: 1483189200}}
]}

and the two indexes are {sender:1, amount_sent:1, start:1} and {receiver:1,
amount_sent:1, start:1}, to satisfy both terms of the $or query. The
db.test.explain('executionStats').find({$or:[{sender:"Oscar", amount_sent:10, start: {$gte: 1467295200, $lte: 1483189200}}, {receiver:"Oscar", amount_sent:10, start: {$gte: 1467295200, $lte: 1483189200}}]}).sort({start:-1})
...
"nReturned": 4,
"executionTimeMillis": 1,
"totalKeysExamined": 4,
"totalDocsExamined": 4,
...
"stage": "SORT_MERGE",
...

The refactored query and the two indexes appear to work together well. Now
MongoDB returns 4 documents, and examined exactly the same number of index
keys and documents. Also note the existence of a “SORT_MERGE” stage. This
is basically the merge stage of a merge sort, which is not bound to the 32
MB limit. This stage merges the two (already sorted) results for each of
the $or terms. Also, execution time of the query drops from 118 ms to just
1 ms.

Please note that the analysis I did is mostly an educated guess since I
don’t know the exact nature of your data. To optimize your use case, I
would encourage you to fully understand the explain() result:
https://docs.mongodb.com/v3.2/reference/explain-results/

Best regards,
Kevin
​
--
You received this message because you are subscribed to the Google Groups "mongodb-user"
group.

For other MongoDB technical support options, see: https://docs.mongodb.com/manual/support/
---
You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mongodb-user+***@googlegroups.com.
To post to this group, send email to mongodb-***@googlegroups.com.
Visit this group at https://groups.google.com/group/mongodb-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/mongodb-user/3c58f4f2-32b8-4889-8941-5d67a47accc2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Continue reading on narkive:
Loading...