Discussion:
MongoDB performance
Angelo Immediata
2012-06-28 13:36:51 UTC
Permalink
Hi all
I know I keep asking about performance issues, especially around grouping
operations. Since MongoDB takes around 20 seconds to execute a query
(something similar to a classical SQL GROUP BY) on a collection of one and a
half million documents, I tried to investigate. Twenty seconds for a single
client running a single query is really not acceptable; I can't imagine how
long this query would take with several concurrent clients. We are talking
about this kind of query: select count(event_name), event_name from audits
group by event_name
I ran the mongostat and mongotop commands and attached their output to this
mail.

Can anybody tell me how I can improve performance?
Note that I'm already using indexes and the new aggregation framework.
I also attached the results of the mongo commands to this mail.

Do you have any suggestions?

Thank you
Cheers,
Angelo
Scott Hernandez
2012-06-28 14:39:28 UTC
Permalink
It would be helpful if you could post sample data (including indexes) and
your actual commands.
Angelo Immediata
2012-06-28 15:03:22 UTC
Permalink
Hi Scott
Attached to this mail you'll find one sample record (I have roughly one and
a half million of these records) and the list of my indexes.
Please look at the file "Samples.txt".

Cheers,
Angelo
Angelo Immediata
2012-06-29 08:00:33 UTC
Permalink
Hello
Just some further information. One of the slowest queries is the following:

{ "aggregate" : "audits" , "pipeline" : [ { "$group" : { "_id" : {
"hdr_event_number" : "$hdr_event_number" } , "numbEvent" : { "$sum" : 1 } } } ] }

I tried to modify it by adding a filter in this way:

{ "aggregate" : "audits" , "pipeline" : [ { "$match" : { "reciveDate" : {
"$gt" : 1338328800000 } } } , { "$group" : { "_id" : { "hdr_event_number" :
"$hdr_event_number" } , "numbEvent" : { "$sum" : 1 } } } ] }

In any case, in my real environment I can receive around one million records
a day; in fact my customer talks about 1 TB a day.

Thank you
Angelo
Angelo Immediata
2012-07-01 07:37:44 UTC
Permalink
Hello
Any suggestions or information?

Angelo
Scott Hernandez
2012-07-01 12:22:49 UTC
Permalink
This is limited by how the data must be processed. There is not a lot to
optimize other than to change tactics and store counters (pre-aggregate) so
you do not have to reprocess the data.
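
As a rough sketch of that idea (the audit_counts collection name and the
sample field values are assumptions, not part of the original suggestion):
every time an audit document is inserted, bump a per-event counter with an
atomic upsert, and the former group-by becomes a read of a small collection.

// hypothetical counter collection keyed by hdr_event_number
var audit = { hdr_event_number: 42, event_name: "login", reciveDate: 1338328800000 }  // sample values
db.audits.insert(audit)
db.audit_counts.update(
    { _id: audit.hdr_event_number },
    { $inc: { numbEvent: 1 } },
    true   // upsert: create the counter document the first time this event number is seen
)

// the 20-second aggregation is then replaced by a scan of the (tiny) counter collection
db.audit_counts.find()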
Jorge Costa
2012-07-01 16:24:46 UTC
Permalink
1,500,000 documents of your sample.txt (8 KB each) is roughly 12 GB.
On my SSD (which is not ideal for sequential IO) it takes 137 seconds to
read that amount of data.
If you look at your setup - number of shards vs. sequential disk speed -
it is likely you'll get a figure close to your 20 seconds.
Your filters will likely reduce the range of data to be read, but I reckon
you're still IO bound.

$ dd if=/dev/zero of=12gb.txt bs=8k count=1500000
1500000+0 records in
1500000+0 records out
12288000000 bytes (12 GB) copied, 22.4173 s, 548 MB/s

$ dd if=12gb.txt of=/dev/zero bs=8k count=1500000 iflag=direct
1500000+0 records in
1500000+0 records out
12288000000 bytes (12 GB) copied, 137.162 s, 89.6 MB/s

a) get more/faster disks
b) shard over more servers
c) read less
Angelo Immediata
2012-07-02 07:39:20 UTC
Permalink
So, as far as I understand, there is not much optimization I can do.
The suggestions are:

   - get more/faster disks: this is not possible for now. I'm in an
   evaluation phase and my managers will not give me permission to buy new,
   faster disks
   - shard over more servers: I can try to set up another server, install
   the same version of mongo on it and repeat the tests
   - read less: I really think this is impossible. In my tests I have only
   one client, which just reads the data returned by the query I showed
   earlier.

Well, I must admit I was expecting mongo to offer better performance for
this kind of query. Inserts and updates are really fast (even with the
indexes in place), but these queries are not, and I think they are the ones
most used in real-time data processing.

Angelo
Sam Millman
2012-07-02 08:26:16 UTC
Permalink
For real-time processing you would pre-aggregate your fields. You are using
count() operators and the like, which are not very efficient ways to
perform data processing.

A more efficient way is to create summary tables through which you can
reach your detail data more quickly. It costs more disk space, but it
allows for more localised queries.

I have only come in at the end and have not read the whole thread, but I
think the problem is something along those lines.

Edit:

OK, I just read that you're using the aggregation framework, which is not
even finished yet and still has a lot of speed fixes to come before it is
fully done. I am unsure whether index usage has been implemented in it yet;
I heard someone on here say it hasn't, but I'm not sure whether that has
changed. Either way, the aggregation framework is not even stable until 2.2.
Angelo Immediata
2012-07-02 10:07:28 UTC
Permalink
Pardon me, maybe it's my inexperience with mongo (I've been using it for a
couple of months), but when you talk about pre-aggregating data, what do you
mean? Could you give me a sample (possibly based on the data I sent earlier
in this thread)?

Thank you
Angelo
Sam Millman
2012-07-02 10:12:54 UTC
Permalink
Pre-aggregation is very common in many DBs when dealing with millions of
rows.

A good example is counting. Say you wish to count all shares on a FB wall
post, but doing it constantly is crashing your SQL server (or NoSQL server)
and causing lots of IO locks etc. A way to overcome this is to add a new
field on the wall post record called "total_shares". Each time a user
clicks share, or performs any other action that should produce a statistic
for data processing, you $inc this field atomically (or another field,
depending on conditions).

Once you have a field like this up and running, you can easily query it
without having to kill the server with long-running tasks like count().
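
A minimal sketch of that pattern, using the field name from the example
above (the wall_posts collection and the post id are made up):

var postId = 123   // hypothetical wall post identifier
db.wall_posts.update({ _id: postId }, { $inc: { total_shares: 1 } })   // run on every share click
db.wall_posts.findOne({ _id: postId }, { total_shares: 1 })            // cheap read instead of count()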
Sam Millman
2012-07-02 10:16:02 UTC
Permalink
Hmm, every time an event is logged you could update a separate table that
is pre-grouped, with numbEvent already filled in, which would satisfy this
aggregation:

{ "aggregate" : "audits" , "pipeline" : [ { "$group" : { "_id" : {
"hdr_event_number" : "$hdr_event_number" } , "numbEvent" : { "$sum" : 1 } } } ] }

You would still need to do heavier aggregation for certain pivots of the
data, but hopefully it will make 90% of the data processing easier.

You could also use an incremental MR (map-reduce) here. It would likely be
faster, since it only processes the delta of your data and merges it onto
the end of the previous set of MR results.
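
A minimal sketch of such an incremental map-reduce (the audit_counts output
collection and the lastRun timestamp are assumptions for the example;
out: { reduce: ... } folds the new delta into the existing results):

var lastRun = 1338328800000   // hypothetical timestamp of the previous run
db.audits.mapReduce(
    function () { emit(this.hdr_event_number, 1); },
    function (key, values) { return Array.sum(values); },
    {
        query: { reciveDate: { $gt: lastRun } },   // only the delta since the last run
        out: { reduce: "audit_counts" }            // re-reduce into the existing counts
    }
)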
Angelo Immediata
2012-07-02 12:58:00 UTC
Permalink
Hello Sam

Thank you so much for the clarification.
I also gave a look at this link
http://docs.mongodb.org/manual/use-cases/pre-aggregated-reports/ and it
helped me a lot.
I'll run some more tests using these approaches.

Thank you
Angelo
Angelo Immediata
2012-07-09 10:54:42 UTC
Permalink
Hello
Pardon the late reply...I had some other issues to solve.
I took a look at
http://docs.mongodb.org/manual/use-cases/pre-aggregated-reports/ for the
pre-aggregated data.
From what I understood, in order to improve performance in real-time data
processing I:

- should have a kind of daemon that, for example, every 30 days creates all
the "statistical records" I will need to query, with their counters set to a
default value of 0
- should, every time a new event happens in my software, upsert the right
"statistical record", incrementing the counter for the related hour, day, or
minute by 1

Did I understand the approach correctly?
Should I follow this kind of approach?

I know this may be trivial, but I'm not a great database expert :)

Cheers,
Angelo
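
To make the two steps above concrete, here is a minimal sketch in the mongo
shell, following the pattern from the pre-aggregated-reports article. The
collection name audit_stats_daily and all field names are illustrative
assumptions, not taken from the sample data in this thread.

// Step 1 (run by the periodic job): preallocate one document per event type
// and per day, with every hourly counter set to 0 so that later updates can
// happen in place.
function preallocateDay(eventNumber, day) {
    var hours = {};
    for (var h = 0; h < 24; h++) {
        hours[h] = 0;
    }
    db.audit_stats_daily.insert({
        _id: eventNumber + "/" + day,   // e.g. "4624/2012-07-09"
        event_number: eventNumber,
        day: day,
        total: 0,
        hourly: hours
    });
}

// Step 2 (run for every incoming audit event): atomically bump the counters.
// upsert: true creates the document if preallocation missed this event/day.
function recordEvent(eventNumber, eventDate) {
    var day = eventDate.toISOString().slice(0, 10);
    var inc = { total: 1 };
    inc["hourly." + eventDate.getUTCHours()] = 1;
    db.audit_stats_daily.update(
        { _id: eventNumber + "/" + day },
        { $inc: inc },
        { upsert: true }
    );
}

With counters maintained like this, the per-event-type totals no longer need
the 20-second $group over audits; they become a cheap read of a few small
documents, for example db.audit_stats_daily.find({ day: "2012-07-09" }).
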
Sam Millman
2012-07-09 11:07:56 UTC
Permalink
"daemon"

I would change that to cronjob. A daemon would be unneeded load right here.
So in Linux or Windows I would schedule a PHP script to run * * 30 * *
which basically means every 30 days. You could be clever and get it to
understand when months actually end but I think that is outside the scope
of this question atm.

"every time a new event happens in my software I should upset the related
and right "statistical record" by updating the related hour or day or
minute increasing the value of 1"

Indeed this is a common approach and one well demonstrated in the article
you linked. It is logical and robust. You will need to take care of places
where there is not a record but $inc allows for a default value as well so
if the field/row does not exist $inc will automatically be able to generate
that field and set it to its default value.

"Did I get good the message?"

Yep :).

"Should I follow this kind of approach?"

As kinda said above: yes, the docs are quite explanatory and comprehensive.
I would follow this approach.
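
A small illustration of the $inc behaviour described above, again in the
mongo shell; stats_demo is a throwaway collection name used only for this
example.

db.stats_demo.drop();   // start from an empty demo collection

// No document exists yet: with upsert: true the update creates it, and $inc
// treats the missing fields as 0 before incrementing them.
db.stats_demo.update(
    { _id: "4624/2012-07-09" },
    { $inc: { total: 1, "hourly.13": 1 } },
    { upsert: true }
);

db.stats_demo.findOne();
// -> roughly { "_id" : "4624/2012-07-09", "total" : 1, "hourly" : { "13" : 1 } }

// A later event in a different hour simply adds the new counter in place;
// "hourly.14" does not need to exist beforehand.
db.stats_demo.update(
    { _id: "4624/2012-07-09" },
    { $inc: { total: 1, "hourly.14": 1 } }
);
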
Angelo Immediata
2012-07-09 11:34:00 UTC
Permalink
Well...what can I say, Sam...if you ever come to Italy, there's a bottle of
wine on me :)

Thank you for the clarifications.

Angelo