Discussion: Distinct records on large dataset
User
2018-11-26 19:30:11 UTC
I have a collection with 5 billion records. This collection stores
metadata about files in a filesystem such that the "source" field in the
record points to the file where that piece of data comes from. With
external code, I do a "walk" in this filesystem to find new files that I
need to process the metadata for and store while also finding records that
I need to remove because the file no longer exists. I only know which ones
to remove from mongodb by finding all the sources of data in the
collection, then removing the ones I didn't see in the filesystem on my
prior walk.

I have 5 million files, which has just crossed the threshold where
mongodb can't return them in a distinct query; instead it gives me:
"distinct too big, 16mb cap"

A simple aggregation query of
db.transactions.aggregate([{"$group":{"_id":"$source"}}]) would work;
however, it does a COLLSCAN, which takes *way* longer than the few
seconds distinct("source") used to take, since I have the source field
already indexed. How can I get "$group" (or some other command) to
recognize that the index already has each unique value? Due to size
alone, I can't return the result in one BSON document. Instead, I need
one BSON document for each key in the index; at least that seems like
it would be the most performant way for mongodb to do this.

Basically, how can I get the functionality of $group to use the index
on source instead of doing a COLLSCAN? $group doesn't seem smart enough
to realize it doesn't have to pull the whole record, but only the
"source" field (which is available in the index).
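
For concreteness, the full shape of what I'm trying is something like
this (allowDiskUse added just in case; the aggregation returns a
cursor, so it sidesteps the 16mb single-document cap, but the plan is
still a COLLSCAN):

db.transactions.aggregate(
    [ { "$group": { "_id": "$source" } } ],
    { "allowDiskUse": true }
).forEach(function (doc) { print(doc._id); })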

Thanks for any help.
Bob Cochran
2018-11-26 19:55:43 UTC
Hi,

What is your MongoDB version? What is your topology? What is the operating system version for the server? For the client(s)? Do you have some sample documents and details of your indexing scheme?

Thanks

Bob
Matthew Zimmerman
2018-11-26 22:17:18 UTC
I am running version 4.0.x. The topology is one server with 1TB of
RAM and 4TB of NVMe disks. The OS is Arch Linux with kernel
4.18.5-arch1-1-ARCH #1 SMP PREEMPT. The client uses the mgo driver
with Go, but I'm just trying to do this in the mongo shell at the
moment, before I code the method of getting these unique values into
the "walker" client code.

The collection looks something like this:
{source:"file1.csv", "timestamp":ISOTime(2018/01/01 8:00am),
"email":["***@host.com"]}
{source:"file1.csv", "timestamp":ISOTime(2018/01/01 8:00am),
"email":["***@host.com"]}
{source:"file2.csv", "timestamp":ISOTime(2018/02/01 8:00am),
"IP":"***@host.com"}
{source:"file2.csv", "timestamp":ISOTime(2018/02/01 8:00am),
"email":["***@host.com","***@host.com"]}

There are sparse indexes on source, timestamp, IP, and email.

My understanding of the aggregation stages is that only some of them
will use an index. $match is one, but since every record has a "source"
field value, doing {"$match":{"source":{"$exists":true}}} doesn't
actually reduce the dataset for the $group stage at all.

$sort is another, but there's no argument to do a "unique" sort that I know of.
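
For concreteness, the pipeline shape I've been testing in the shell is
something like this (using the transactions collection name from the
original question):

db.transactions.aggregate([
    { "$match": { "source": { "$exists": true } } },
    { "$group": { "_id": "$source" } }
])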

Thanks for any help.
Robert Cochran
2018-11-26 22:32:03 UTC
Hi,

Are you really using a data type called "ISOTime"? That isn't supported in
the mongo shell. See this link for supported date types:
https://docs.mongodb.com/manual/core/shell-types/. Perhaps you mean
"ISODate()"? That gives a far more precise date and time representation
than what your example documents show in this post.

Thanks

Bob
Matthew Zimmerman
2018-11-26 22:50:07 UTC
Forgive me, I don't typically input records in JSON format, so I was
doing it from memory; given your description, I'm sure it's ISODate().
The client code handles all of the data parsing and writing of records
to the collection. The records given are just examples. The schema
holds ~15 different data types (in terms of client-side code), most of
which are represented as strings on the mongodb side. Each of those 15
has an index on it (for purposes of querying in the client code).

The client side code does enforce that each record put into the
collection has a "source" value, while the other 15 different data
types are not required.

I can give more specifics, however I'm not sure how that helps the
discussion and I don't want to distract from the core problem. The
core problem being that I have 5 billion records with 5 million unique
"source" values in the collection, and I don't want to do a COLLSCAN of
all 5 billion just to deduplicate down to the 5 million. Note that when
I had ~4 million unique "source" values, the distinct() command worked
great and returned within a few seconds, so I know returning quickly on
5 million is likely possible; I just have to figure out how to get it
to use the index instead of doing a COLLSCAN over the records.
'Kevin Adistambha' via mongodb-user
2018-11-27 05:54:07 UTC
Hi,

If I understand correctly, you have a set of files whose metadata you
keep in MongoDB. One file can have multiple entries in MongoDB.
Periodically, you list the files in the directory, query MongoDB
for unique filename values, then compare the two listings. For each file
that doesn’t exist in the directory file listing, you remove the
corresponding metadata in MongoDB to keep the two lists synced. Is this an
accurate description of your use case?

If this is correct, with regard to your question:

Basically, how can I get the functionality of $group to use the index of
source instead of doing a COLLSCAN?

In some cases, a COLLSCAN *could* be faster, since the query you’re doing
needs to look at every document in your collection anyway. However, this
is very use-case dependent: it depends on the size of your documents,
the amount of RAM you have, whether the whole index or the whole
collection fits in RAM, etc.

Having said that, you could try to do a $sort before $group to see if the
query runs any faster, such as:

db.collection.aggregate([ {$sort: {source: 1}}, {$group: {_id: '$source'}} ])

Note that this requires an index on the source field, e.g.
db.collection.createIndex({source: 1}).
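
To check whether the index is actually being used, you can ask for the
query plan, for example (a sketch; substitute your real collection
name):

db.collection.aggregate(
    [ {$sort: {source: 1}}, {$group: {_id: '$source'}} ],
    { explain: true }
)

If the winning plan shows an IXSCAN rather than a COLLSCAN, the sort is
being satisfied by the index.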

An alternative approach would be using a tool such as inotify
<http://man7.org/linux/man-pages/man7/inotify.7.html>. There is a blog post
titled Linux Filesystem Events with inotify
<https://www.linuxjournal.com/content/linux-filesystem-events-inotify> that
runs through an example on its use. The basic idea is to have a script that
monitors the output of inotify, and if a delete event is detected, do a
delete of the related source in the MongoDB database. This way, you could
avoid having to do a directory listing and a comparison with the aggregate
output.

OS version is ArchLinux

Please note that Arch Linux is not listed in the MongoDB Supported Platform
<https://docs.mongodb.com/manual/administration/production-notes/#supported-platforms>
list. I would encourage you to consider using one of the supported
platforms and follow the recommendations in the Production Notes
<https://docs.mongodb.com/manual/administration/production-notes/> for best
results. Having a Replica Set
<https://docs.mongodb.com/manual/replication/index.html> is also strongly
recommended in a production setting.

Best regards,
Kevin
User
2018-11-27 15:12:24 UTC
Kevin, your understanding is correct.

I'm not a mongodb expert by any means, but in this case I don't think a
COLLSCAN should be required, since the field is indexed and as such each
unique key is already in the index. The records themselves can have ~15
fields (or just two). I know that when the distinct() result didn't
overrun the 16mb BSON limit, it could return in a few seconds using the
index. Now that the BSON limit is hit, I can't get similar performance
using any other method.

The documentation says that $sort will use the index, however
db.collection.aggregate([{"$sort":{"source":1}},{"$group":{"_id":"$source"}}],{"explain":1})
shows the winning plan is a COLLSCAN. My only explanation is that my
(current) index is sparse (meaning blank values aren't indexed). I tested
that, and it does seem to be the case: if the index is sparse, $sort won't
use it; otherwise it will. So I'm rebuilding my index on "source" without
sparse, which should be done sometime tomorrow (it probably should have
been built that way anyway). Then I'll test its performance.
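
For the record, the rebuild is just something like this (the index name
is mine; background:true so the build doesn't block writes on 4.0):

db.collection.createIndex(
    { "source": 1 },
    { "name": "sourcenotsparse", "background": true }
)
// once it's built, the old sparse index can be dropped by name:
// db.collection.dropIndex("<old sparse index name>")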

I thought about using inotify, but that won't work in the general sense
because 1) inotify can only watch so many files and I think I'm past that
limit already, and 2) it can only watch file changes while the watcher is
active, so if the client process doing the watching fails, any
filesystem changes would be permanently missed unless some other walk
happens. I do actually keep a list of the files previously found in the
filesystem in a separate mongodb collection where the index on "source" is
unique, however that doesn't ensure the two collections are in sync.

Thanks.
User
2018-11-29 02:26:30 UTC
Well, $sort is using the index now, yet
db.collection.aggregate([{"$sort":{"source":1}},{"$group":{"_id":"$source"}}])
still hasn't returned after over an hour, when the distinct call would
return in ~10 seconds if I remember correctly.

Is there any way to make the "$group" processing multicore?

This is what explain says right now:
"winningPlan" : {
"stage" : "PROJECTION",
"transformBy" : {
"source" : 1,
"_id" : 0
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"source" : 1
},
"indexName" :
"sourcenotsparse",
"isMultiKey" :
false,
"multiKeyPaths" : {
"source" :
[ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" :
"forward",
"indexBounds" : {
"source" : [

"[MinKey, MaxKey]"
]
}
}
},
"rejectedPlans" : [ ]
}
}
},
{
"$group" : {
"_id" : "$source"
}
}

So in theory, mongodb could pull only the "keys" from the index (and not
the values), but if there are ~1000 entries for fileA and ~2000 entries
for fileB, the $group stage is going to have to process ~3000 entries
even though it will only output two of them in the end. This whole
process also seems to use only one core -- I've got 140 available, so it
would be nice if this could complete ~100x faster with more cores in use.

I know this is an "optimized" path for $sort/$group with a non-sparse
index, but I feel like the system should be able to make this very fast in
this particular scenario.

I'm a novice with C++... is there a place I could start looking to see
how to fix this?

Thanks!
Robert Cochran
2018-11-29 03:09:09 UTC
You should only work with very small collections -- not more than 20
documents, for example -- until you have your schema and indexing correct.
To restate what the official documentation says -- "test on a small test
system before implementing any changes on a production system."

Thanks so much

Bob
User
2018-11-29 13:48:16 UTC
Bob,

The problem is that the distinct call works great on a small test system.
Until you hit the 16mb limit (which is pretty large for just text), there's
no reason to switch away from distinct. I hit this after using this schema
in production for over a year; only when the collection reached a certain
size did I have to rethink the way I was doing it. Testing with a small
system doesn't bring out this issue.

If you read carefully, you will see that I am experimenting on another
collection with sparse vs. non-sparse indexes. Indeed, according to
explain:true, $sort doesn't use a sparse index but does use a
non-sparse index. On a smaller collection (5 million records, which
still goes over 16mb with distinct), the $sort/$group query can get me
the unique values of one field in 96 seconds, where distinct used to
return them (probably at 4.5 million records) in 3-5 seconds. On the
5-billion-record collection, distinct still returned in 3-5 seconds
(same number of unique "source" values as the 5-million collection),
but now the $sort/$group query (even using the non-sparse index) takes
about an hour and a half, because instead of only looking at the unique
keys in the index, it goes through the entire 5 billion records.

In this particular instance, under these particular circumstances, I
believe mongodb has the ability to be more performant by ignoring all 5
billion records and instead pulling the values only from the index. Since
there's an immediate PROJECTION stage which filters out all fields but the
field in the index, there's no reason to go through all 5 billion records
to determine what the index can already tell you (that there are no
additional values of "source" beyond what is in the index already). I
think by implementing a distinct() variant (distinctIter()?) that returns
an iterator of records with the distinct value as "_id", this could be
just as performant as the distinct call.

Thanks.
badi
2018-11-29 16:10:12 UTC
Hi, we usually solve poor performance of $group by micro-sharding
(multiple shards on the same server). It uses multiple cores.
Bye
'Kevin Adistambha' via mongodb-user
2018-11-29 23:41:17 UTC
Hi,

I understand that your original solution with distinct worked well, until
you hit a certain scale, and you were forced to look for other solutions.
Unfortunately, an aggregation query is single-threaded by its nature. The
distinct command is also not able to return a cursor currently. There is a
ticket for this functionality: SERVER-3141
<https://jira.mongodb.org/browse/SERVER-3141>.

In the meantime, off the top of my head I can think of three solutions:

One is to create a special collection containing the unique source
values. In this solution, you can exploit the fact that MongoDB enforces
a unique _id index on each collection. Essentially, you're maintaining
an index manually. The workflow could be:

1. When inserting into the main metadata collection, also insert into
this unique collection, e.g. db.unique.insert({_id: <filename>}).
   2. You can ignore any duplicate key error when inserting into this
   collection, since it just informs you that the filename already exists
   (see the sketch after this list).
   3. At cleanup time, just scan the unique collection, and delete the
   corresponding entries from both this collection and the main metadata
   collection.
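
A minimal shell sketch of steps 1 and 2 (the variable name and its
value are assumptions for illustration):

var filename = "file1.csv";  // the source value reported by your walker
var res = db.unique.insert({ _id: filename });
if (res.hasWriteError() && res.getWriteError().code !== 11000) {
    // 11000 is the duplicate key error code; ignore duplicates,
    // surface anything else
    throw res.getWriteError();
}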

Another solution is to do multiple aggregations bounded by some criteria
based on the characteristics of your data. For example, you can do one
aggregation on files starting with ‘a’, and another on files starting with
‘b’. If source is a string, you would need to use a regex to do this, e.g.:

db.collection.aggregate([ {$match: {source: /^a/}}, ... ])
db.collection.aggregate([ {$match: {source: /^b/}}, ... ])
...

Please see Regex index use
<https://docs.mongodb.com/manual/reference/operator/query/regex/#index-use>
for details to ensure that your regex query can use an index.
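
Putting it together, here is a sketch that partitions the work by the
first character of source (the prefix list is an assumption; adjust it
to the characters your filenames actually start with). Anchored,
case-sensitive regexes like these can use the {source: 1} index:

var uniques = [];
"abcdefghijklmnopqrstuvwxyz0123456789".split("").forEach(function (prefix) {
    db.collection.aggregate([
        {$match: {source: new RegExp("^" + prefix)}},
        {$group: {_id: '$source'}}
    ]).forEach(function (doc) { uniques.push(doc._id); });
});

Since each partition is an independent command, a client could also run
several of them in parallel, which may help with the single-core
limitation discussed earlier.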

The last solution I can think of would be to shard the collection, since
it appears the operation you need is now limited by what a single node
can handle given your current workflow. I believe you already have a
quite powerful machine, so scaling vertically further is probably
limited. This is an option if you don't want to change your current
workflow, but it will change your deployment significantly, so it will
require more planning, changes in backup methodology, and attention to
other ops concerns compared to what you currently have.

In any case, please remember to take a complete backup of your app & data
before doing any major changes, and to thoroughly test any changes before
implementing them in production.

Best regards,
Kevin