Discussion:
Reading large (2.5GB) BSON files with PyMongo eats up 30GB+ RAM
Matthias Lee
2012-10-18 01:28:46 UTC
Hello there,

I've been using PyMongo for a while and have read a few smaller BSON files,
but today I was trying to convert a large BSON file to JSON (it contains no
binary data). Every way I tried reading and decoding it resulted in me
maxing out my RAM at 32GB.

Is there a more efficient way of reading/decoding BSON than this?
import bson
f = open("bigBson.bson", 'rb')
result = bson.decode_all(f.read())

Perhaps it can be decoded incrementally?

In comparison, using mongorestore to load the same file barely increased my
memory usage.

Thanks,

Matthias
Bernie Hackett
2012-10-18 01:47:38 UTC
I'm a little surprised to hear that PyMongo would use 30+ GB of RAM to
decode, but mongorestore isn't a very good comparison. mongorestore
reads each document and inserts it into the database. Your Python code,
by comparison, reads the entire file into a string and passes that
entire string to decode_all, which then has to create dictionary
objects for all of the documents in the file and return the entire
file as a list of dictionaries. We haven't even gotten to inserting
the documents into MongoDB yet. That's never going to use memory
efficiently.
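
If you don't actually need the whole file in memory at once, you can walk
the dump one document at a time. Here's a rough, untested sketch (it assumes
the file is a standard mongodump-style concatenation of BSON documents, each
starting with its 4-byte little-endian length; iter_bson_documents is just a
name I made up):

import struct
import bson

def iter_bson_documents(path):
    # Yield one decoded document at a time instead of materializing
    # the whole dump as a list of dictionaries.
    with open(path, 'rb') as f:
        while True:
            prefix = f.read(4)
            if len(prefix) < 4:
                break  # end of file
            # The BSON length prefix counts itself, so the rest of the
            # document is (length - 4) bytes.
            doc_len = struct.unpack('<i', prefix)[0]
            body = f.read(doc_len - 4)
            # decode_all on a single document's bytes returns a
            # one-element list.
            yield bson.decode_all(prefix + body)[0]

That way each document can be written out (or inserted) and garbage
collected before the next one is read.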
Bernie Hackett
2012-10-18 01:55:47 UTC
There is some code in the mongo-hadoop connector that can help you with this:

https://github.com/mongodb/mongo-hadoop/blob/master/streaming/language_support/python/pymongo_hadoop/input.py#L7-50

Also, make sure you are using the C extensions for PyMongo. You can
check like this:

python -c 'import pymongo; print pymongo.has_c()'
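
Since your end goal is JSON, you could combine an incremental reader like
that one (or the iter_bson_documents sketch from my previous message, which
is not part of PyMongo itself) with bson.json_util and never hold more than
one document in memory. A rough sketch, assuming newline-delimited JSON
output is acceptable:

from bson import json_util

# Stream the dump to newline-delimited JSON, one document at a time.
with open("bigBson.json", "w") as out:
    for doc in iter_bson_documents("bigBson.bson"):
        out.write(json_util.dumps(doc) + "\n")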
Matthias Lee
2012-10-19 14:55:24 UTC
Thanks, I will have a look at the hadoop connector.

I did check, and I do have the C extension.