Resurrecting this thread because it was helpful to me, and wanted to add that
Post by Bernie Hackett:
Huh, I don't have a great answer for that. Looking at the source for
DataFrame it doesn't appear to take a generic iterator, so you can't just
pass it the cursor. That's unfortunate. Wrapping the cursor in a list seems
to be your only choice.
Thanks again.
PS: regarding the list mentioned in the first comment of this thread, what
is the fastest way to convert a cursor to a Pandas DataFrame? I have tried
the following two approaches:

cursor = trade_collection.find({"dt": {"$gte": start_dt, "$lt": end_dt}}).sort("dt")
cursor.batch_size(50000)
elems = []
for c in cursor: elems.append(c)
df_trades = pd.DataFrame(elems)

tic = time.time()
cursor = trade_collection.find({"dt": {"$gte": start_dt, "$lt": end_dt}}).sort("dt")
cursor.batch_size(50000)
elems = list(cursor)
df_trades = pd.DataFrame(elems)
toc = time.time()
print('elapsed time: ', toc - tic)
The cursor is returning 150k trades as before. However, the above examples
each take almost 8 seconds. So it takes 8 seconds to create a DataFrame
containing all the trades that occurred in a particular day. Is there a
faster way? 8 seconds is still very slow.
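A rough sketch of one variation that might shave some time (the database and
collection names and the date range below are simply the ones used elsewhere
in this thread; adjust to your setup): project out _id so PyMongo never
decodes the ObjectIds, and build per-column lists so the DataFrame is
constructed from column arrays instead of a list of 150k dicts. Worth timing
against the versions above rather than taking on faith.

import datetime
import pandas as pd
import pymongo

client = pymongo.MongoClient('localhost', 27017)
trade_collection = client.taq.mycollection   # db/collection names from the original post
start_dt = datetime.datetime(2012, 1, 3)     # example date range from the explain output
end_dt = datetime.datetime(2012, 1, 4)

# fields= is the PyMongo 2.x keyword used elsewhere in this thread;
# newer releases call it projection=.
cursor = trade_collection.find(
    {"dt": {"$gte": start_dt, "$lt": end_dt}},
    fields={"_id": False},
).sort("dt")
cursor.batch_size(50000)

cols = {"dt": [], "p": [], "t": [], "v": []}  # one list per field in the trade documents
for doc in cursor:
    for k in cols:
        cols[k].append(doc[k])
df_trades = pd.DataFrame(cols)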
Sure thing. Sorry for the false starts. I hadn't had enough coffee and
should have realized there was something else wrong. :-)
Just ran a test: I have gone from 80 seconds down to 4 seconds when I
include a compound index.... I'm glad we got to the bottom of this, thanks
for your help.
Yes, you are going to need a compound index for this.
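For anyone following along, a minimal sketch of creating that compound index
from PyMongo (trade_collection is assumed to be the same collection handle
used in the queries above); an index on (dt, t) lets the range filter on dt
and the sort on t be satisfied by a single index walk:

import pymongo

# Compound index: equality/range key dt first, then the sort key t.
trade_collection.create_index([("dt", pymongo.ASCENDING), ("t", pymongo.ASCENDING)])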
1. pymongo.has_c() is also returning True for me, so that's not the issue.
2. Both the C++ and Python applications are accessing the same database
which resides on a single server so I am not working in a
distributed/sharded environment.
trade_collection.find({"dt": {"$gte": start_dt, "$lt": end_dt}}).explain()
{'allPlans': [{'cursor': 'BtreeCursor dt_1',
'indexBounds': {'dt': [[datetime.datetime(2012, 1, 3, 0, 0),
datetime.datetime(2012, 1, 4, 0, 0)]]},
'n': 154435,
'nscanned': 154435,
'nscannedObjects': 154435}],
'cursor': 'BtreeCursor dt_1',
'indexBounds': {'dt': [[datetime.datetime(2012, 1, 3, 0, 0),
datetime.datetime(2012, 1, 4, 0, 0)]]},
'indexOnly': False,
'isMultiKey': False,
'millis': 326,
'n': 154435,
'nChunkSkips': 0,
'nYields': 0,
'nscanned': 154435,
'nscannedAllPlans': 154435,
'nscannedObjects': 154435,
'nscannedObjectsAllPlans': 154435,
'oldPlan': {'cursor': 'BtreeCursor dt_1',
'indexBounds': {'dt': [[datetime.datetime(2012, 1, 3, 0, 0),
datetime.datetime(2012, 1, 4, 0, 0)]]}},
'scanAndOrder': False,
}
... this runs very fast and only 154435 of the 37454178 documents are scanned.
trade_collection.find({"dt": {"$gte": start_dt, "$lt": end_dt}}).sort("t").explain()
... including sort("t") means I am scanning all 37454178 documents. Below is the explain output:
{'allPlans': [{'cursor': 'BtreeCursor dt_1',
'indexBounds': {'dt': [[datetime.datetime(2012, 1, 3, 0, 0),
datetime.datetime(2012, 1, 4, 0, 0)]]},
'n': 23017,
'nscanned': 23017,
'nscannedObjects': 23017},
{'cursor': 'BtreeCursor t_1',
'indexBounds': {'t': [[{'$minElement': 1}, {'$maxElement': 1}]]},
'n': 154435,
'nscanned': 37454178,
'nscannedObjects': 37454178},
{'cursor': 'BasicCursor',
'indexBounds': {},
'n': 0,
'nscanned': 23016,
'nscannedObjects': 23016}],
'cursor': 'BtreeCursor t_1',
'indexBounds': {'t': [[{'$minElement': 1}, {'$maxElement': 1}]]},
'indexOnly': False,
'isMultiKey': False,
'millis': 88622,
'n': 154435,
'nChunkSkips': 0,
'nYields': 88,
'nscanned': 37454178,
'nscannedAllPlans': 37500211,
'nscannedObjects': 37454178,
'nscannedObjectsAllPlans': 37500211,
'oldPlan': {'cursor': 'BtreeCursor t_1',
'indexBounds': {'t': [[{'$minElement': 1}, {'$maxElement': 1}]]}},
'scanAndOrder': False,
}
I have created an index for "dt" which is the reason that the first query
is fast. However, perhaps I need a compound index which includes "dt" and
"t" so that the sort command introduced into the second query does not
result in the scanning of every document?
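Alternatively, would forcing the existing dt index with hint() avoid the full
scan, at the cost of sorting the ~150k matching documents in memory?
Something roughly like this (Cursor.hint is a standard PyMongo call):

import pymongo

cursor = trade_collection.find({"dt": {"$gte": start_dt, "$lt": end_dt}}).sort("t")
cursor.hint([("dt", pymongo.ASCENDING)])  # force the dt_1 index instead of t_1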
Thanks,
Rob.
You should check pymongo.has_c() as well. Some Linux distros ship two
separate C extensions for PyMongo.
Also, you said the same application written in C++ is orders of magnitude
faster. Did you run that application against the same database cluster? If
not, there may be some other issue here not related to PyMongo. The explain
plan and/or database profiler may help you discover the nature of your
performance problems.
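For example, something along these lines (connection details, database name,
and date range are just the ones mentioned elsewhere in this thread; the
explain field names below match the legacy explain format shown elsewhere in
the thread):

import datetime
import pymongo

client = pymongo.MongoClient('localhost', 27017)
db = client.taq
coll = db['mycollection']
start_dt = datetime.datetime(2012, 1, 3)
end_dt = datetime.datetime(2012, 1, 4)

# Explain the slow query: which index was used, and how many documents were scanned?
plan = coll.find({"dt": {"$gte": start_dt, "$lt": end_dt}}).sort("t").explain()
print(plan.get("cursor"), plan.get("nscanned"), plan.get("millis"))

# Profile operations slower than 100 ms, then inspect the most recent entries.
db.command("profile", 1, slowms=100)
for op in db.system.profile.find().sort("ts", pymongo.DESCENDING).limit(5):
    print(op.get("op"), op.get("millis"))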
Yes, bson.has_c() returns True for me also.
PS: I should mention that there are 40 million trades in the collection,
but my query is returning 150k trades (which is the number of trades
occurring on the date specified in the query). I have indexed the
collection correctly. Does the size of the collection explain the
slowdown? If yes, then that's not good, as tick databases will contain
billions of trades. Or perhaps there is another issue we are missing?
c.foo.bar.count()
150000
c.foo.bar.find_one()
{u'p': 9638, u'dt': datetime.datetime(2013, 8, 22, 18, 27, 59, 569000),
u'_id': ObjectId('5216582ffba52202877399f2'), u't': 79200772, u'v': 1}
import timeit
timeit.timeit(stmt='l = list(c.foo.bar.find())',
              setup='import pymongo; c = pymongo.MongoClient()', number=1)
1.0295019149780273
timeit.timeit(stmt='l = list(c.foo.bar.find(fields={"_id": False}))',
              setup='import pymongo; c = pymongo.MongoClient()', number=1)
0.6698269844055176
pymongo.has_c()
True
import bson
bson.has_c()
True
Hi Bernie,
I need all the fields in the documents; there are not many. Here is an example document:
{'_id': ObjectId('51fc9fb9e89db689aed7e386'),
'dt': datetime.datetime(2012, 9, 2, 22, 0, 0, 772000),
'p': 9638,
't': 79200772,
'v': 1}
It's good to hear that you are aiming to improve performance in PyMongo 3.0,
but as it stands PyMongo is so slow that it is not usable for big datasets in my
experience. I am experimenting with using MongoDB for storing financial
data. MongoDB is marketed as a tick data storage solution in the 10gen
presentation I posted (
http://www.10gen.com/presentations/webinar-how-banks-use-mongodb-tick-database
), however I'm realizing that it is not viable ... certainly not from
Python... and Python is arguably the most common route of access for a
quant working in finance. In my example I retrieved 150k documents (where
each document is a trade); typical usage would involve retrieving multiple
millions of trades, which would take far too long using PyMongo when it
takes 80 seconds to iterate over 150k trades.
Are there any other suggestions? If there is no way to make queries faster
from PyMongo then I have no choice but to find an alternative database,
which would be disappointing as I've invested quite a bit of time in
MongoDB over the last month and am happy with it apart from the speed,
which is a show stopper.
I see. I was confused by your example using batch_size and list(). The
documents you are describing sound very simple, and decoding three integers
(32 or 64 bit) is about the simplest thing PyMongo does. The datetime code
in PyMongo's C extensions could be sped up a bit when dealing with
timestamps after the epoch, but I think most of the processing time is
being taken up creating PyObjects (in this case PyLong, PyDict, PyDateTime,
PyUnicode) in C.
Can you post an example document I can test with?
In PyMongo 3.0 we plan to rework how Cursor buffers the data it retrieves,
which should improve performance a bit.
In the meantime, the only suggestion I can give you is to project out any
fields in the document you don't actually need, which will make PyMongo
have to do less work.
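For example (a sketch using the query from earlier in the thread; fields= is
the PyMongo 2.x spelling, projection= in later releases):

# Skip decoding the ObjectIds entirely; only request the fields you use.
cursor = db_collection.find(
    {"dt": {"$gte": start_dt, "$lt": end_dt}},
    fields={"_id": False},
).sort("t")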
Hi Bernie,
I was creating a list so I could pass the list to the constructor of a
Pandas DataFrame. However, this is not the issue, so I probably shouldn't
have included that line in my example. In the following example I simply
iterate over the cursor one document at a time and set a dummy variable
temp to 1 on each iteration. This takes the same length of time - about 80
seconds.
tic = time.time()
cursor = db_collection.find({"dt": {"$gte": start_dt, "$lt": end_dt}}).sort("t")
for s in cursor: temp = 1
toc = time.time()
print('elapsed time: ', toc - tic)
I'm well aware that I won't get the same speeds as C++, but 80 to 100
seconds to retrieve and iterate 150k documents is unacceptably slow. There
must be another solution? The same query takes a few seconds from C++ so
this is more than a marginal difference. Anyone who is accessing big
datasets using PyMongo has surely encountered this issue. Any idea?
Thanks, Rob.
- mylist = list(db_collection.find()) - That means PyMongo has to
retrieve and deserialize all 150k documents before your application can
process a single document. The batch_size option will have no impact here.
How are you doing the same in C++? Why not use a Cursor for what it's meant
for and incrementally process documents? (See the sketch after this list.)
- C++ is a lower-level, statically typed, natively compiled language.
Python is a very high-level, dynamically typed, interpreted language. The
performance of your Python application is never going to come close to
that of one written in C++. It's the nature of the language.
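A minimal sketch of the incremental approach (it assumes the db_collection
handle and date range from the code quoted below, and a hypothetical
process_trade callback standing in for whatever your application does per
document):

cursor = db_collection.find({"dt": {"$gte": start_dt, "$lt": end_dt}}).sort("t")
cursor.batch_size(10000)   # documents fetched per server round trip
for doc in cursor:
    process_trade(doc)     # hypothetical per-document work; begins as soon as the first batch arrives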
I have a database find query which returns 150k documents where each
document contains three integer fields and one datetime field. The
following code attempts to create a list from the cursor object. Iterating
the cursor is incredibly slow - about 80 seconds! The same operation via
the C++ drivers is orders of magnitude faster - it must be an issue with
PyMongo?
client = MongoClient()
client = MongoClient('localhost', 27017)
db = client.taq
collection_str = "mycollection"
db_collection = db[collection_str]
mylist = list(db_collection.find())
This issue has been discussed before and I tried the suggestions. One is to set a larger batch size:
cursor = db_collection.find()
cursor.batch_size(10000)
mylist = list(cursor)
However, this had no impact. A second suggestion was to check that the C
extensions are installed - I have them installed so this is not the issue.
The Mongo database is installed on the same machine so it is not a network
issue - it works fine from C++; querying from PyMongo is the issue.
Since MongoDB is marketed as being able to handle Big Data, surely there
is a way to retrieve data quickly via Python? This issue has been raised
before but I have yet to find a solution... has anyone got a suggestion that
works? In this case I am retrieving 150k documents, but normally the query
would be retrieving 1 million, so this is going to be a real issue for me.
PS each document is very simple, only containing a datetime field and
three integer fields.