Resurrecting this thread because it was helpful to me, and wanted to add that
Post by Bernie Hackett:
Huh, I don't have a great answer for that. Looking at the source for
DataFrame it doesn't appear to take a generic iterator, so you can't just
pass it the cursor. That's unfortunate. Wrapping the cursor in a list seems
to be your only choice.
Thanks again.
PS: regarding the list mentioned in the first comment of this thread, what
is the fastest way to convert a cursor to a Pandas DataFrame? I have tried
the following two approaches:

cursor = trade_collection.find({"dt": {"$gte": start_dt, "$lt": end_dt}}).sort("dt")
cursor.batch_size(50000)
elems = []
for c in cursor: elems.append(c)
df_trades = pd.DataFrame(elems)

tic = time.time()
cursor = trade_collection.find({"dt": {"$gte": start_dt, "$lt": end_dt}}).sort("dt")
cursor.batch_size(50000)
elems = list(cursor)
df_trades = pd.DataFrame(elems)
toc = time.time()
print('elapsed time: ', toc - tic)
The cursor is returning 150k trades as before. However, the above examples
each take almost 8 seconds. So it takes 8 seconds to create a DataFrame
containing all the trades that occurred in a particular day. Is there a
faster way? 8 seconds is still very slow.
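A rough sketch of one variation that might shave some time (the database and
collection names and the date range below are simply the ones used elsewhere
in this thread; adjust to your setup): project out _id so PyMongo never
decodes the ObjectIds, and build per-column lists so the DataFrame is
constructed from column arrays instead of a list of 150k dicts. Worth timing
against the versions above rather than taking on faith.

import datetime
import pandas as pd
import pymongo

client = pymongo.MongoClient('localhost', 27017)
trade_collection = client.taq.mycollection   # db/collection names from the original post
start_dt = datetime.datetime(2012, 1, 3)     # example date range from the explain output
end_dt = datetime.datetime(2012, 1, 4)

# fields= is the PyMongo 2.x keyword used elsewhere in this thread;
# newer releases call it projection=.
cursor = trade_collection.find(
    {"dt": {"$gte": start_dt, "$lt": end_dt}},
    fields={"_id": False},
).sort("dt")
cursor.batch_size(50000)

cols = {"dt": [], "p": [], "t": [], "v": []}  # one list per field in the trade documents
for doc in cursor:
    for k in cols:
        cols[k].append(doc[k])
df_trades = pd.DataFrame(cols)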
Sure thing. Sorry for the false starts. I hadn't had enough coffee and
should have realized there was something else wrong. :-)
Just ran a test: I have gone from 80 seconds down to 4 seconds when I
include a compound index.... I'm glad we got to the bottom of this, thanks
for your help.
Yes, you are going to need a compound index for this.
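For anyone following along, a minimal sketch of creating that compound index
from PyMongo (trade_collection is assumed to be the same collection handle
used in the queries above); an index on (dt, t) lets the range filter on dt
and the sort on t be satisfied by a single index walk:

import pymongo

# Compound index: equality/range key dt first, then the sort key t.
trade_collection.create_index([("dt", pymongo.ASCENDING), ("t", pymongo.ASCENDING)])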
1. pymongo.has_c() is also returning True for me, so that's not the issue.
2. Both the C++ and Python applications are accessing the same database
which resides on a single server so I am not working in a
distributed/sharded environment.
trade_collection.find({"dt": {"$gte": start_dt, "$lt": end_dt}}).explain()
{'allPlans': [{'cursor': 'BtreeCursor dt_1',
'indexBounds': {'dt': [[datetime.datetime(2012, 1, 3, 0, 0),
datetime.datetime(2012, 1, 4, 0, 0)]]},
'n': 154435,
'nscanned': 154435,
'nscannedObjects': 154435}],
'cursor': 'BtreeCursor dt_1',
'indexBounds': {'dt': [[datetime.datetime(2012, 1, 3, 0, 0),
datetime.datetime(2012, 1, 4, 0, 0)]]},
'indexOnly': False,
'isMultiKey': False,
'millis': 326,
'n': 154435,
'nChunkSkips': 0,
'nYields': 0,
'nscanned': 154435,
'nscannedAllPlans': 154435,
'nscannedObjects': 154435,
'nscannedObjectsAllPlans': 154435,
'oldPlan': {'cursor': 'BtreeCursor dt_1',
'indexBounds': {'dt': [[datetime.datetime(2012, 1, 3, 0, 0),
datetime.datetime(2012, 1, 4, 0, 0)]]}},
'scanAndOrder': False,
}
... this runs very fast and only 154435 of the 37454178 documents are scanned.
trade_collection.find({"dt": {"$gte": start_dt, "$lt": end_dt}}).sort("t").explain()
... including sort("t") means I am scanning all 37454178 documents. Below is the explain output:
{'allPlans': [{'cursor': 'BtreeCursor dt_1',
'indexBounds': {'dt': [[datetime.datetime(2012, 1, 3, 0, 0),
datetime.datetime(2012, 1, 4, 0, 0)]]},
'n': 23017,
'nscanned': 23017,
'nscannedObjects': 23017},
{'cursor': 'BtreeCursor t_1',
'indexBounds': {'t': [[{'$minElement': 1}, {'$maxElement': 1}]]},
'n': 154435,
'nscanned': 37454178,
'nscannedObjects': 37454178},
{'cursor': 'BasicCursor',
'indexBounds': {},
'n': 0,
'nscanned': 23016,
'nscannedObjects': 23016}],
'cursor': 'BtreeCursor t_1',
'indexBounds': {'t': [[{'$minElement': 1}, {'$maxElement': 1}]]},
'indexOnly': False,
'isMultiKey': False,
'millis': 88622,
'n': 154435,
'nChunkSkips': 0,
'nYields': 88,
'nscanned': 37454178,
'nscannedAllPlans': 37500211,
'nscannedObjects': 37454178,
'nscannedObjectsAllPlans': 37500211,
'oldPlan': {'cursor': 'BtreeCursor t_1',
'indexBounds': {'t': [[{'$minElement': 1}, {'$maxElement': 1}]]}},
'scanAndOrder': False,
}
I have created an index for "dt" which is the reason that the first query
is fast. However, perhaps I need a compound index which includes "dt" and
"t" so that the sort command introduced into the second query does not
result in the scanning of every document?
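Alternatively, would forcing the existing dt index with hint() avoid the full
scan, at the cost of sorting the ~150k matching documents in memory?
Something roughly like this (Cursor.hint is a standard PyMongo call):

import pymongo

cursor = trade_collection.find({"dt": {"$gte": start_dt, "$lt": end_dt}}).sort("t")
cursor.hint([("dt", pymongo.ASCENDING)])  # force the dt_1 index instead of t_1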
Thanks,
Rob.
You should check pymongo.has_c() as well. Some Linux distros ship two
separate C extensions for PyMongo.
Also, you said the same application written in C++ is orders of magnitude
faster. Did you run that application against the same database cluster? If
not, there may be some other issue here not related to PyMongo. The explain
plan and/or database profiler may help you discover the nature of your
performance problems.
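For example, something along these lines (connection details, database name,
and date range are just the ones mentioned elsewhere in this thread; the
explain field names below match the legacy explain format shown elsewhere in
the thread):

import datetime
import pymongo

client = pymongo.MongoClient('localhost', 27017)
db = client.taq
coll = db['mycollection']
start_dt = datetime.datetime(2012, 1, 3)
end_dt = datetime.datetime(2012, 1, 4)

# Explain the slow query: which index was used, and how many documents were scanned?
plan = coll.find({"dt": {"$gte": start_dt, "$lt": end_dt}}).sort("t").explain()
print(plan.get("cursor"), plan.get("nscanned"), plan.get("millis"))

# Profile operations slower than 100 ms, then inspect the most recent entries.
db.command("profile", 1, slowms=100)
for op in db.system.profile.find().sort("ts", pymongo.DESCENDING).limit(5):
    print(op.get("op"), op.get("millis"))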
Yes, bson.has_c() returns True for me also.
PS: I should mention that there are 40 million trades in the collection,
but my query is returning 150k trades (which is the number of trades
occurring on the date specified in the query). I have indexed the
collection correctly. Does the size of the collection explain the
slowdown? If yes, then that's not good, as tick databases will contain
billions of trades. Or perhaps there is another issue we are missing?
c.foo.bar.count()
150000
c.foo.bar.find_one()
{u'p': 9638, u'dt': datetime.datetime(2013, 8, 22, 18, 27, 59, 569000),
u'_id': ObjectId('5216582ffba52202877399f2'), u't': 79200772, u'v': 1}
import timeit
timeit.timeit(stmt='l = list(c.foo.bar.find())',
              setup='import pymongo; c = pymongo.MongoClient()', number=1)
1.0295019149780273
timeit.timeit(stmt='l = list(c.foo.bar.find(fields={"_id": False}))',
              setup='import pymongo; c = pymongo.MongoClient()', number=1)
0.6698269844055176
pymongo.has_c()
True
import bson
bson.has_c()
True
Hi Bernie,
I need all the fields in the documents; there are not many. Here is an example document:
{'_id': ObjectId('51fc9fb9e89db689aed7e386'),
'dt': datetime.datetime(2012, 9, 2, 22, 0, 0, 772000),
'p': 9638,
't': 79200772,
'v': 1}
It's good to hear that you are aiming to improve performance in PyMongo 3.0,
but as it stands PyMongo is so slow that it is not usable for big datasets in my
experience. I am experimenting with using MongoDB for storing financial
data. MongoDB is marketed as a tick data storage solution in the 10gen
presentation I posted (
http://www.10gen.com/presentations/webinar-how-banks-use-mongodb-tick-database
), however I'm realizing that it is not viable ... certainly not from
Python... and Python is arguably the most common route of access for a
quant working in finance. In my example I retrieved 150k documents (where
each document is a trade); typical usage would involve retrieving multiple
millions of trades, which would take far too long using PyMongo when it
takes 80 seconds to iterate over 150k trades.
Are there any other suggestions? If there is no way to make queries faster
from PyMongo then I have no choice but to find an alternative database,
which would be disappointing as I've invested quite a bit of time in
MongoDB over the last month and am happy with it apart from the speed,
which is a show stopper.
I see. I was confused by your example using batch_size and list(). The
documents you are describing sound very simple, and decoding three integers
(32 or 64 bit) is about the simplest thing PyMongo does. The datetime code
in PyMongo's C extensions could be sped up a bit when dealing with
timestamps after the epoch, but I think most of the processing time is
being taken up creating PyObjects (in this case PyLong, PyDict, PyDateTime,
PyUnicode) in C.
Can you post an example document I can test with?
In PyMongo 3.0 we plan to rework how Cursor buffers the data it retrieves,
which should improve performance a bit.
In the meantime, the only suggestion I can give you is to project out any
fields in the document you don't actually need, which will make PyMongo
have to do less work.
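For example (a sketch using the query from earlier in the thread; fields= is
the PyMongo 2.x spelling, projection= in later releases):

# Skip decoding the ObjectIds entirely; only request the fields you use.
cursor = db_collection.find(
    {"dt": {"$gte": start_dt, "$lt": end_dt}},
    fields={"_id": False},
).sort("t")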
Hi Bernie,
I was creating a list so I could pass the list to the constructor of a
Pandas DataFrame. However, this is not the issue, so I probably shouldn't
have included that line in my example. In the following example I simply
iterate over the cursor one document at a time and set a dummy variable
temp to 1 on each iteration. This takes the same length of time - about 80
seconds.
tic = time.time()
cursor = db_collection.find({"dt": {"$gte": start_dt, "$lt": end_dt}}).sort("t")
for s in cursor: temp = 1
toc = time.time()
print('elapsed time: ', toc - tic)
I'm well aware that I won't get the same speeds as C++, but 80 to 100
seconds to retrieve and iterate 150k documents is unacceptably slow. There
must be another solution? The same query takes a few seconds from C++ so
this is more than a marginal difference. Anyone who is accessing big
datasets using PyMongo has surely encountered this issue. Any idea?
Thanks, Rob.
- mylist = list(db_collection.find()) - That means PyMongo has to
retrieve and deserialize all 150k documents before your application can
process a single document. The batch_size option will have no impact here.
How are you doing the same in C++? Why not use a Cursor for what it's meant
for and incrementally process documents? (See the sketch after this list.)
- C++ is a lower-level, statically typed, natively compiled language.
Python is a very high-level, dynamically typed, interpreted language. The
performance of your Python application is never going to come close to
that of one written in C++. It's the nature of the language.
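A minimal sketch of the incremental approach (it assumes the db_collection
handle and date range from the code quoted below, and a hypothetical
process_trade callback standing in for whatever your application does per
document):

cursor = db_collection.find({"dt": {"$gte": start_dt, "$lt": end_dt}}).sort("t")
cursor.batch_size(10000)   # documents fetched per server round trip
for doc in cursor:
    process_trade(doc)     # hypothetical per-document work; begins as soon as the first batch arrives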
I have a database find query which returns 150k documents where each
document contains three integer fields and one datetime field. The
following code attempts to create a list from the cursor object. Iterating
the cursor is incredibly slow - about 80 seconds! The same operation via
the C++ drivers is orders of magnitude faster - it must be an issue with
PyMongo?
client = MongoClient()
client = MongoClient('localhost', 27017)
db = client.taq
collection_str = "mycollection"
db_collection = db[collection_str]
mylist = list(db_collection.find())
This issue has been discussed before and I tried the suggestions. One is to set a larger batch size:
cursor = db_collection.find()
cursor.batch_size(10000)
mylist = list(cursor)
However, this had no impact. A second suggestion was to check that the C
extensions are installed - I have them installed so this is not the issue.
The Mongo database is installed on the same machine so it is not a network
issue - it works fine from C++; querying from PyMongo is the issue.
Since MongoDB is marketed as being able to handle Big Data, surely there
is a way to retrieve data quickly via Python? This issue has been raised
before but I have yet to find a solution... has anyone got a suggestion that
works? In this case I am retrieving 150k documents, but normally the query
would be retrieving 1 million, so this is going to be a real issue for me.
PS each document is very simple, only containing a datetime field and
three integer fields.