Discussion:
Aggregate Pipeline addToSet on nested array of records returning a set of arrays
(too old to reply)
UltimateCodeWarrior
2018-11-13 22:17:38 UTC
Permalink
Goal: For the fastest-possible queries with C#.NET and MongoDB v4.0,
perform multiple distinct(s) utilizing a single aggregation pipeline style
query on a document with nested array records and return a flattened out
set.

Current-Problem: For first level document entries it works great,
unexpected results when running the distinct on the nested array records.

Example Document:

{
"title" : "Movie A",
"released": 1984,
"minutes" : 90,
"category": "SciFi",
"cost": 24000000,
"gross": 48000000,
"rated": "PG",
"score": 3.4,
"cast" : [
{
"name": "Buzz",
"gender":"M",
"age": 28,
"country": "USA",
"role": "Leading",
"award" : "Golden Globe",
"pay": 500000
},
{
"name": "Sally",
"gender":"F",
"age": 21,
"country": "Austrailia",
"role": "Supporting",
"award" : "Academy",
"pay": 300000
}
]
}


How to Build with Mongo Client:

use <your database name>;

db.createCollection("movies");

db.movies.insertOne({ "title" : "Movie A", "released": 1984, "minutes":
90, "category": "SciFi", "cost": 24000000, "gross": 48000000, "rated":
"PG", "score": 3.4, "cast" : [ { "name": "Buzz", "gender":"M", "age":
28, "country": "USA", "role": "Leading", "award" : "Golden Globe", "pay":
500000 }, { "name": "Sally", "gender":"F", "age": 21, "country":
"Austrailia", "role": "Supporting", "award": "Academy", "pay": 300000 }
]});

db.movies.insertOne({ "title" : "Movie B", "released": 1986, "minutes":
120, "category": "Horror", "cost": 28000000, "gross": 40000000, "rated":
"R", "score": 4.4, "cast" : [ { "name": "Ryan", "gender":"M", "age": 55,
"country": "England", "role": "Leading", "award": "Oscar Nomination",
"pay": 100000 }, { "name": "Sally", "gender":"F", "age": 23, "country":
"Austrailia", "role": "Supporting", "award": "Academy", "pay": 300000 }
]});

db.movies.insertOne({ "title" : "Movie C", "released": 1988, "minutes":
106, "category": "Drama", "cost": 38000000, "gross": 45000000, "rated":
"PG13", "score": 2.4, "cast" : [ { "name": "Pat", "gender":"F", "age":
42, "country": "Ireland", "role": "Supporting", "award": "Irish", "pay":
350000 }, { "name": "Gene", "gender":"F", "age": 44, "country": "France",
"role": "Leading", "award": "Independent", "pay": 280000 } ]});


C#.NET Code w/ Mongo v 4.0 and 2.5 Driver version

MongoDatabase mdb = Mongo.Controller.Instance.getDB();
var collection = mdb.GetCollection("movies");
var match = new BsonDocument
{
{
"$match",
new BsonDocument
{
{ "released", new BsonDocument
{
{"$gte", 1984},
{ "$lte", 1988}
}
}
}
}
};

var group = new BsonDocument
{
{ "$group",
new BsonDocument {
{ "_id", 0},

{ "categoryDistinct", new BsonDocument {
{ "$addToSet", "$category" } }},
{ "ratedDistinct", new BsonDocument { {
"$addToSet", "$rated" } }},
{ "countryDistinct", new BsonDocument {
{ "$addToSet", "$cast.country" } }},
{ "awardDistinct", new BsonDocument { {
"$addToSet", "$cast.award" } }}
}
}
};


var pipeline = new[] { match, group };
var args = new AggregateArgs();
args.Pipeline = pipeline;
args.AllowDiskUse = true;

var results = collection.Aggregate(args).ToList();


foreach (var obj in results)
{
Console.WriteLine(obj.ToString());
}


Output:

{ "_id" : 0, "categoryDistinct" : ["Drama", "Horror", "SciFi"],
"ratedDistinct" : ["PG13", "R", "PG"], "countryDistinct" : [["Ireland",
"France"], ["England", "Austrailia"], ["USA", "Austrailia"]],
"awardDistinct" : [["Irish", "Independent"], ["Oscar Nomination",
"Academy"], ["Golden Globe", "Academy"]] }


Notes: categoryDistinct and ratedDistinct are coming back as I would
expect, but the countryDistinct and awardDistinct are not, they are coming
back in a nested sub array record form.


Desired Output : countryDistinct should have been
["Austrailia,England,France,Ireland,USA"]
awardDistinct should have been: ["Academy","Golden
Globe","Independent","Irish","Oscar Nomination"];
(I know I didn't specify a sort anywhere so if you can
give a pointer that will not impact speed, that would be greatly
appreciated.


Any pointers on how to do this efficiently -- speed consciously would be
greatly appreciated.

Thanks in Advance!
--
You received this message because you are subscribed to the Google Groups "mongodb-user"
group.

For other MongoDB technical support options, see: https://docs.mongodb.com/manual/support/
---
You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mongodb-user+***@googlegroups.com.
To post to this group, send email to mongodb-***@googlegroups.com.
Visit this group at https://groups.google.com/group/mongodb-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/mongodb-user/b9292db8-a68e-4c37-9033-7db7f36a6d8c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
'Wan Bachtiar' via mongodb-user
2018-11-15 00:56:33 UTC
Permalink
categoryDistinct and ratedDistinct are coming back as I would expect, but
the countryDistinct and awardDistinct are not, they are coming back in a
nested sub array record form.

Hi,

This is because cast is an array, and when the group access it as
cast.country and cast.award those fields are an array value of each of the
respective fields. Thus, adding those array values to a set would return an
array of arrays.

You can try to insert an $unwind
<https://docs.mongodb.com/manual/reference/operator/aggregation/unwind/>
stage after the $match and before the $group stage. For example:

db.movies.aggregate([
{"$match": {"released": {"$gte": 1984, "$lte": 1988}}},
{"$unwind": "$cast"},
{"$group": {
"_id": null,
"categoryDistinct": {"$addToSet":"$category"},
"ratedDistinct": {"$addToSet":"$rated"},
"countryDistinct": {"$addToSet":"$cast.country"},
"awardDistinct": {"$addToSet":"$cast.award"},
}
}
])

I know I didn’t specify a sort anywhere so if you can give a pointer that
will not impact speed, that would be greatly appreciated

Similarly, you can insert a $sort
<https://docs.mongodb.com/manual/reference/operator/aggregation/sort/>
stage after the $unwind and before $group to sort the documents before the
grouping stage. Any extra operations will impact speed, although at
various tolerable degrees. Based on the use case and requirements, you can
then perform adjustments. i.e. scale up, change document schema, etc.

You may also find Aggregation Pipeline Optimization
<https://docs.mongodb.com/manual/core/aggregation-pipeline-optimization/#aggregation-pipeline-optimization> a
useful reference.

Regards,
Wan.
​
--
You received this message because you are subscribed to the Google Groups "mongodb-user"
group.

For other MongoDB technical support options, see: https://docs.mongodb.com/manual/support/
---
You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mongodb-user+***@googlegroups.com.
To post to this group, send email to mongodb-***@googlegroups.com.
Visit this group at https://groups.google.com/group/mongodb-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/mongodb-user/54ce9760-ce5a-4355-8962-60f71f586d85%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
UltimateCodeWarrior
2018-11-15 03:30:23 UTC
Permalink
Thanks Wan!

I had this posted in a few places and what you had written was the
consensus. Your example was much more complete though, I appreciate it and
I picked up a few pointers from your post, namely setting _id to null as
well as being able to construct the query in javascript so that I can test
it in the mongo command line.

It does indeed work the way I needed it to, but it wasn't any faster
than what I was trying to optimize when dealing with over 100,000 records.
When I ran separate db.collection.distinct() on each of these fields, it
was actually faster to run multiple distinct queries rather than the
aggregation pipeline. I know that db.collection.distinct() is a limited /
special purpose aggregation pipeline function. I also tried launching
various threads (first 2, then 4 then 8) in C# with the year based $match
split into equal time chunks across those threads to try and speed up the
aggregation pipeline query -- and it seemed to actually have the reverse
effect and slowed the query down. Seems the single threaded back to back
synchronous distincts() are much faster. At least from C# as I was
hoping to save the overhead of issuing multiple queries to mongodb. Maybe
it's got it's own thread-pool built in and all I am doing is negating that
effect when I launch my own threads. It was disheartening after all that
effort to learn the more advanced aggregation pipeline way, but ending up
with no tangible benefit.

I wish there was a guide mongo operations to avoid if possible ($group,
$sort, $unwind) as they are CPU/RAM expensive. What would be cool is if
the mongodb founding developers could take a data set like I have uploaded
and show GOOD, BETTER and BEST ways of finding common things like
distincts, sums, difference, ratios in more complex documents, complete
with custom indexes and hints as well as show timing information each step
of the way. This would save a lot of frustration and headache. Knowing
when to use one operation over another is just as important as knowing how
to use an operation. That's the essence of what I am trying to figure out.
Post by 'Wan Bachtiar' via mongodb-user
categoryDistinct and ratedDistinct are coming back as I would expect, but
the countryDistinct and awardDistinct are not, they are coming back in a
nested sub array record form.
Hi,
This is because cast is an array, and when the group access it as
cast.country and cast.award those fields are an array value of each of
the respective fields. Thus, adding those array values to a set would
return an array of arrays.
You can try to insert an $unwind
<https://docs.mongodb.com/manual/reference/operator/aggregation/unwind/>
db.movies.aggregate([
{"$match": {"released": {"$gte": 1984, "$lte": 1988}}},
{"$unwind": "$cast"},
{"$group": {
"_id": null,
"categoryDistinct": {"$addToSet":"$category"},
"ratedDistinct": {"$addToSet":"$rated"},
"countryDistinct": {"$addToSet":"$cast.country"},
"awardDistinct": {"$addToSet":"$cast.award"},
}
}
])
I know I didn’t specify a sort anywhere so if you can give a pointer that
will not impact speed, that would be greatly appreciated
Similarly, you can insert a $sort
<https://docs.mongodb.com/manual/reference/operator/aggregation/sort/>
stage after the $unwind and before $group to sort the documents before
the grouping stage. Any extra operations will impact speed, although at
various tolerable degrees. Based on the use case and requirements, you can
then perform adjustments. i.e. scale up, change document schema, etc.
You may also find Aggregation Pipeline Optimization
<https://docs.mongodb.com/manual/core/aggregation-pipeline-optimization/#aggregation-pipeline-optimization> a
useful reference.
Regards,
Wan.
​
--
You received this message because you are subscribed to the Google Groups "mongodb-user"
group.

For other MongoDB technical support options, see: https://docs.mongodb.com/manual/support/
---
You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mongodb-user+***@googlegroups.com.
To post to this group, send email to mongodb-***@googlegroups.com.
Visit this group at https://groups.google.com/group/mongodb-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/mongodb-user/02e159b9-789b-4e0d-975c-584c9321dfd7%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Robert Cochran
2018-11-15 11:18:08 UTC
Permalink
Hi,

How much physical memory is installed on your MongoDB server device? I am
guessing it must be quite low on memory. Adding more memory would help you
with queries.

Thanks

Bob


On Thursday, November 15, 2018 at 5:15:24 AM UTC-5, UltimateCodeWarrior
Post by UltimateCodeWarrior
Thanks Wan!
I had this posted in a few places and what you had written was the
consensus. Your example was much more complete though, I appreciate it and
I picked up a few pointers from your post, namely setting _id to null as
well as being able to construct the query in javascript so that I can test
it in the mongo command line.
It does indeed work the way I needed it to, but it wasn't any faster
than what I was trying to optimize when dealing with over 100,000 records.
When I ran separate db.collection.distinct() on each of these fields, it
was actually faster to run multiple distinct queries rather than the
aggregation pipeline. I know that db.collection.distinct() is a limited /
special purpose aggregation pipeline function. I also tried launching
various threads (first 2, then 4 then 8) in C# with the year based $match
split into equal time chunks across those threads to try and speed up the
aggregation pipeline query -- and it seemed to actually have the reverse
effect and slowed the query down. Seems the single threaded back to back
synchronous distincts() are much faster. At least from C# as I was
hoping to save the overhead of issuing multiple queries to mongodb. Maybe
it's got it's own thread-pool built in and all I am doing is negating that
effect when I launch my own threads. It was disheartening after all that
effort to learn the more advanced aggregation pipeline way, but ending up
with no tangible benefit.
I wish there was a guide mongo operations to avoid if possible ($group,
$sort, $unwind) as they are CPU/RAM expensive. What would be cool is if
the mongodb founding developers could take a data set like I have uploaded
and show GOOD, BETTER and BEST ways of finding common things like
distincts, sums, difference, ratios in more complex documents, complete
with custom indexes and hints as well as show timing information each step
of the way. This would save a lot of frustration and headache. Knowing
when to use one operation over another is just as important as knowing how
to use an operation. That's the essence of what I am trying to figure out.
Post by 'Wan Bachtiar' via mongodb-user
categoryDistinct and ratedDistinct are coming back as I would expect, but
the countryDistinct and awardDistinct are not, they are coming back in a
nested sub array record form.
Hi,
This is because cast is an array, and when the group access it as
cast.country and cast.award those fields are an array value of each of
the respective fields. Thus, adding those array values to a set would
return an array of arrays.
You can try to insert an $unwind
<https://docs.mongodb.com/manual/reference/operator/aggregation/unwind/>
db.movies.aggregate([
{"$match": {"released": {"$gte": 1984, "$lte": 1988}}},
{"$unwind": "$cast"},
{"$group": {
"_id": null,
"categoryDistinct": {"$addToSet":"$category"},
"ratedDistinct": {"$addToSet":"$rated"},
"countryDistinct": {"$addToSet":"$cast.country"},
"awardDistinct": {"$addToSet":"$cast.award"},
}
}
])
I know I didn’t specify a sort anywhere so if you can give a pointer that
will not impact speed, that would be greatly appreciated
Similarly, you can insert a $sort
<https://docs.mongodb.com/manual/reference/operator/aggregation/sort/>
stage after the $unwind and before $group to sort the documents before
the grouping stage. Any extra operations will impact speed, although at
various tolerable degrees. Based on the use case and requirements, you can
then perform adjustments. i.e. scale up, change document schema, etc.
You may also find Aggregation Pipeline Optimization
<https://docs.mongodb.com/manual/core/aggregation-pipeline-optimization/#aggregation-pipeline-optimization> a
useful reference.
Regards,
Wan.
​
--
You received this message because you are subscribed to the Google Groups "mongodb-user"
group.

For other MongoDB technical support options, see: https://docs.mongodb.com/manual/support/
---
You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mongodb-user+***@googlegroups.com.
To post to this group, send email to mongodb-***@googlegroups.com.
Visit this group at https://groups.google.com/group/mongodb-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/mongodb-user/7ecfb11a-8e3e-4cc2-af9e-8ecef28bc10a%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Robert Cochran
2018-11-16 01:22:14 UTC
Permalink
Hi!

I hope you can assess the memory capacity of the host machine that your
MongoDB server is running on. You should be able to do very quick
aggregation queries on 100,000 records. If it takes a lot of time, perhaps
the underlying code is slow (.Net Core is a long way from being able to
execute with the speed of Node.js for example) and/or your machine is
resource starved. The more memory (RAM) you can give the physical hardware,
the better MongoDB will perform. Try to jack your memory up to 16 or 32 Gb.
Also try to find other processes which are eating your memory resources.
Shut them down if you or the system doesn't need them.

I played with your data a little. Thank you for posting it. The "movie"
documents are probably from a MongoDB University course, right? Perhaps you
are taking an M1xx series class? If so, that is fantastic. It is great to
learn MongoDB the database. One thing the online classes will cover in
detail is indexing. You can greatly speed up a query with good indexes.

I had a little fun by creating indexes on the "released" key:

MongoDB Enterprise > show collections
ucwmovies

MongoDB Enterprise > db.ucwmovies.createIndex( { "released" : 1 } )
{
"createdCollectionAutomatically" : false,
"numIndexesBefore" : 1,
"numIndexesAfter" : 2,
"ok" : 1
}

MongoDB Enterprise > db.ucwmovies.createIndex( { "released" : -1 } )
{
"createdCollectionAutomatically" : false,
"numIndexesBefore" : 2,
"numIndexesAfter" : 3,
"ok" : 1
}


Wan's query does a $match on that key. $match can make use of an index if
you place the $match stage early in the aggregation pipeline, as it is
here.

It will take a little while for each index to be built for 100,000
documents. Also indexing isn't a substitute for giving the server machine
an adequate amount of physical memory. Snap in a couple more sticks of
memory if your machine can take it and you can afford it. Then index on the
keys that you think should be indexed. In my case I think indexing 100,000
documents took less than a few seconds. Indexing millions of documents is
something that I'd notice more.

Thanks so much

Bob









On Thursday, November 15, 2018 at 5:15:24 AM UTC-5, UltimateCodeWarrior
Post by UltimateCodeWarrior
Thanks Wan!
I had this posted in a few places and what you had written was the
consensus. Your example was much more complete though, I appreciate it and
I picked up a few pointers from your post, namely setting _id to null as
well as being able to construct the query in javascript so that I can test
it in the mongo command line.
It does indeed work the way I needed it to, but it wasn't any faster
than what I was trying to optimize when dealing with over 100,000 records.
When I ran separate db.collection.distinct() on each of these fields, it
was actually faster to run multiple distinct queries rather than the
aggregation pipeline. I know that db.collection.distinct() is a limited /
special purpose aggregation pipeline function. I also tried launching
various threads (first 2, then 4 then 8) in C# with the year based $match
split into equal time chunks across those threads to try and speed up the
aggregation pipeline query -- and it seemed to actually have the reverse
effect and slowed the query down. Seems the single threaded back to back
synchronous distincts() are much faster. At least from C# as I was
hoping to save the overhead of issuing multiple queries to mongodb. Maybe
it's got it's own thread-pool built in and all I am doing is negating that
effect when I launch my own threads. It was disheartening after all that
effort to learn the more advanced aggregation pipeline way, but ending up
with no tangible benefit.
I wish there was a guide mongo operations to avoid if possible ($group,
$sort, $unwind) as they are CPU/RAM expensive. What would be cool is if
the mongodb founding developers could take a data set like I have uploaded
and show GOOD, BETTER and BEST ways of finding common things like
distincts, sums, difference, ratios in more complex documents, complete
with custom indexes and hints as well as show timing information each step
of the way. This would save a lot of frustration and headache. Knowing
when to use one operation over another is just as important as knowing how
to use an operation. That's the essence of what I am trying to figure out.
Post by 'Wan Bachtiar' via mongodb-user
categoryDistinct and ratedDistinct are coming back as I would expect, but
the countryDistinct and awardDistinct are not, they are coming back in a
nested sub array record form.
Hi,
This is because cast is an array, and when the group access it as
cast.country and cast.award those fields are an array value of each of
the respective fields. Thus, adding those array values to a set would
return an array of arrays.
You can try to insert an $unwind
<https://docs.mongodb.com/manual/reference/operator/aggregation/unwind/>
db.movies.aggregate([
{"$match": {"released": {"$gte": 1984, "$lte": 1988}}},
{"$unwind": "$cast"},
{"$group": {
"_id": null,
"categoryDistinct": {"$addToSet":"$category"},
"ratedDistinct": {"$addToSet":"$rated"},
"countryDistinct": {"$addToSet":"$cast.country"},
"awardDistinct": {"$addToSet":"$cast.award"},
}
}
])
I know I didn’t specify a sort anywhere so if you can give a pointer that
will not impact speed, that would be greatly appreciated
Similarly, you can insert a $sort
<https://docs.mongodb.com/manual/reference/operator/aggregation/sort/>
stage after the $unwind and before $group to sort the documents before
the grouping stage. Any extra operations will impact speed, although at
various tolerable degrees. Based on the use case and requirements, you can
then perform adjustments. i.e. scale up, change document schema, etc.
You may also find Aggregation Pipeline Optimization
<https://docs.mongodb.com/manual/core/aggregation-pipeline-optimization/#aggregation-pipeline-optimization> a
useful reference.
Regards,
Wan.
​
--
You received this message because you are subscribed to the Google Groups "mongodb-user"
group.

For other MongoDB technical support options, see: https://docs.mongodb.com/manual/support/
---
You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mongodb-user+***@googlegroups.com.
To post to this group, send email to mongodb-***@googlegroups.com.
Visit this group at https://groups.google.com/group/mongodb-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/mongodb-user/8b2de744-e17d-4f22-aa8a-9996fb1220bb%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
UltimateCodeWarrior
2018-11-16 02:52:45 UTC
Permalink
Hey Bob!
Thanks for the pointers, I welcome any and all suggestions.

I have not used to this Google Groups UI, it's a little bizarre on how to
post a reply, so I hope ...it gets to the right place.

So the machine I am using right now has 24 Gigs of Ram, Windows 10,
4-Core I7 Processor, Laptop, SSD. I am running C# Visual Studio, a C#
Web Server, WAMP, and Mongodb instance, Chrome (20 tabs), Notepad ++,
Examine 64, Eclipse, Slack, Sublime Text, and a couple of command line
prompt windows . The production level solution will be deployed on a
16GB Desktop machine with an 8-core I7 Processor, SSD. It won't have quite
so much stuff running.

I'm not part of that MongoDB University program, but glad to know it
exists. I will have to check into that course, it's always good to pick up
some job-saving pointers.

As far as the technology stack is concerned, it wasn't my first choice,
it's just an unlikely mix / odd couple that I was handed and need to
contend/cope with at the moment. I have noted your suggestion of Node.js
and should the need become imperative that we squeeze more speed out of the
machine, then...perhaps a Node.js re-write is in order and I am more than
game for that.

I feel I am suffering being on the bleeding edge because many of
the examples on the web (stack overflow) for C#-Mongo are from like 2012
and between the .NET API, Driver API, Driver Version, and the Mongo changes
its hard to find examples that aren't severely outdated. I had to
cobble/ patch many examples together to get to the example I posted. I
don't believe I have seen a single C#-Mongo4 Book on Amazon. If you know
of a good book, or one in development, I will gladly volunteer to be the
editor or feedback :)

I think I borrowed the idea of the movie database from another post I saw
that probably stuck in my subconscious. (Teacher-Student-Grade scenario
might hit a nerve with some talent, so I choose the Movie scenario) I
figured I would stick with something most people would understand
intuitively rather that have to do the mental hopscotch to understand the
data before they could go on to understand the code then question and
issue. I wanted to put in there enough substance to pose further questions
without having to radically change the data model.

So I read a few things in my travels on mongoDB.

#1 - User is limited to the amount of indexes, I think it was somewhere
around 64, some power of 2. This was a drawback to me because if the app
has dynamic queries where it allows the user to search by various criteria
(9-ways-from-Sunday), then it's possible to exceed the index limitation the
more query $match parameters that the user has opted to throw into the
mix. This happened in my case and so I was looking for a way to compress
many queries into a super aggregation pipeline so that I could ultimately
reduce the number of indexes. Well, I got partially there getting the
aggregation pipeline to work, but I didn't quite get the performance
results I wanted, even with indexing and threading. There's a good
chance I have missed something.

#2 - The size of the name of index was limited too, I think around 128
characters,so you have to figure out a more cryptic way of naming the
indexes, it cant be a 1:1 mangling of all of your query parameters like
the auto-namer does if you don't explicitly supply a name yourself,
otherwise you will run out of space and it will throw an exception.

#3 - The documentation cautions users of mongodb about having an
excessive number of indexes as well because it will slow down inserts as it
has to rebuild all of those indexes that were specified at the very
beginning. So, without being on the team that engineered the engine,
it's hard to know where that 'sweet spot' is. Is less more? or is more
better? I didn't find a good example in C# on how to use "hint" to force
an index either, the search continues...


A couple of questions on your indexes:

#1 - You are creating an ascending and descending index on the year.
I thought I read that mongodb was smart enough to know read an index
forward or backward, so why would it be necessary to create two?

#2 - When you created the indexes, it looked like you used the mongo
command line tool and it was probably a blocking/synchronous operation that
waited until completion before you got the command line prompt back? My
situation is slightly different, as I won't require my users to run through
a series of mongo commands to create indexes. I programatically create
indexes at the startup of the C# web server. Sometimes I add them after
the data has been added to test if they will indeed optimize the query by
speeding it up. I don't think the Driver gives me any sort of feedback to
know when the index is truly ready to 'go'. It seems that it's
asynchronous creation of the index itself, but I have know way of knowing
when the data has been indexed. What I usually do is shut down the C#
server after the indexes have been programatically inserted and then run
the db.movies.reIndex(); in the mongo command line shell and let that
finish . Then I exit mongo command line shell, shut down mongodb, restart
mongodb to make sure everything is running optimally before starting any
query profiling/timing commands. Is there a better way to do this/faster
way?


So with roughly 100k documents, with indexes on the released year, running
distincts on the fields separately seemed to be around 6 seconds.
When I tried C# threading where I would split up the released year into
2,4,8 equal chunks, and joined all the threads, it would be around 8
seconds, so no real gains there. It could be that the data was not as
spread out evenly in the released year bins as one would imagine, the data
could have been lopsided and so naturally some threads would have all the
heavy lifting and others would have nothing. So maybe if I threaded each
distinct and ran them asynchronously I would have gotten better performance
from threading.
When I tried the aggregation pipeline as mentioned earlier, I was getting
8-11.5 seconds. So things just kept getting worse and so I threw my hands
up and decided to go to a higher power and post the various groups for some
collaboration.

There's a couple of caveats that I would like to point out.

#1 - The data I have posted is similar in structure and element types
only, but not in element names / element content. I think my average
document size is about 2500 bytes according to mongo stats for the
collection.
#2 - There are virtually zero "skipped" documents in the data set that I
have when examining by the release date, in fact most of the times I'm
querying across all documents and release date doesn't cut things down
anyhow because I get the min and max release date earlier and use that in
the query or I just omit it all together .

So what I have decided to do (for now) in order to give the appearance of
faster queries is to actually pre-run them periodically in a background
thread on the C# server and cache the results. When the web service is
called, it will send back the cached results. This will prevent users
from bogging down mongo by clicking an asynchronous javascript dynamic
query button 10 times because they are impatient that the result is taking
a long time, meanwhile inadvertently flat lining the mongo server by
inundating it with overlapping / cascading resource gulping/choking queries
that further stretch out the timeline for getting results. It's a viscous
cycle.


All the Best!
--
You received this message because you are subscribed to the Google Groups "mongodb-user"
group.

For other MongoDB technical support options, see: https://docs.mongodb.com/manual/support/
---
You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mongodb-user+***@googlegroups.com.
To post to this group, send email to mongodb-***@googlegroups.com.
Visit this group at https://groups.google.com/group/mongodb-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/mongodb-user/9a47fd67-4c45-4c21-a9d5-ac7bb5b9c693%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
'Wan Bachtiar' via mongodb-user
2018-11-26 04:02:29 UTC
Permalink
When I ran separate db.collection.distinct() on each of these fields, it
was actually faster to run multiple distinct queries rather than the
aggregation pipeline. I know that db.collection.distinct() is a limited /
special purpose aggregation pipeline function

Hi,

Data in MongoDB has a flexible schema. Flexible schema allows you to focus
on your application design and let the database design conform for the
benefit of the application. If your application has a requirement for
specific queries, it would be beneficial to alter the document schema to
boost the query performance.

See Data Modelling Introduction
<https://docs.mongodb.org/manual/core/data-modeling-introduction/> and Data
Model Examples and Patterns
<https://docs.mongodb.org/manual/applications/data-models/>
for more information and examples.

User is limited to the amount of indexes, I think it was somewhere around
64, some power of 2. This was a drawback to me because if the app has
dynamic queries where it allows the user to search by various criteria
(9-ways-from-Sunday)

As you have mentioned in the thread, excessive number of indexes may affect
write operation performance. This is because for every write operation,
those indexes would have to be updated as well (not rebuild entirely).

Depending on the use case of your application, you can always ensure the
application add an indexed field to limit the scope of the query and
simplify query operations. See also Indexing Strategies
<https://docs.mongodb.com/manual/applications/indexes/>.

You are creating an ascending and descending index on the year. I thought I
read that mongodb was smart enough to know read an index forward or backward

Yes, you don’t need both ascending or descending indexes on a Single Field
Index <https://docs.mongodb.com/manual/core/index-single/> because MongoDB
can traverse the index in either direction. Although this may not be the
case with Compound Indexes
<https://docs.mongodb.com/manual/core/index-compound/>.

When you created the indexes, it looked like you used the mongo command
line tool and it was probably a blocking/synchronous operation that waited
until completion before you got the command line prompt back?

There are two types of index creation, foreground and background. The
default for mongo shell is to create in the foreground (blocking). Users
have the option to specify background to build an index in the background
so the operation does not block other database activities. See also
db.collection.createIndex()
<https://docs.mongodb.com/manual/reference/method/db.collection.createIndex/>

So with roughly 100k documents, with indexes on the released year, running
distincts on the fields separately seemed to be around 6 seconds.

I would suggest to perform cursor.explain()
<https://docs.mongodb.com/manual/reference/method/cursor.explain/#cursor.explain>
on the queries to analyse further what’s causing the slowness. See also Explain
Results <https://docs.mongodb.com/manual/reference/explain-results/>.

Also, you could perform some monitoring on the server to identify
bottlenecks. See also MongoDB Monitoring Tools
<https://docs.mongodb.com/manual/administration/monitoring/>

Regards,
Wan.
​
--
You received this message because you are subscribed to the Google Groups "mongodb-user"
group.

For other MongoDB technical support options, see: https://docs.mongodb.com/manual/support/
---
You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mongodb-user+***@googlegroups.com.
To post to this group, send email to mongodb-***@googlegroups.com.
Visit this group at https://groups.google.com/group/mongodb-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/mongodb-user/81aa90d7-1b56-4562-9da6-bc672d60c950%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Loading...