Discussion:
Reading large (2.5GB) BSON files with PyMongo eats up 30GB+ RAM
Matthias Lee
2012-10-18 01:28:46 UTC
Hello there,

I've been using PyMongo for a while and have read a few smaller BSON files,
but today I was trying to convert a large BSON file to JSON (it contains no
binary data). Every way I tried reading and decoding it resulted in me
maxing out my RAM at 32GB.

Is there a more efficient way of reading/decoding BSON than this?
import bson
f = open("bigBson.bson", 'rb')
result = bson.decode_all(f.read())

Perhaps it can be decoded incrementally?

In comparison, using mongorestore to load the same file barely increased my
memory usage.

Thanks,

Matthias
Bernie Hackett
2012-10-18 01:47:38 UTC
I'm a little surprised to hear that PyMongo would use 30+ GB of RAM to
decode, but mongorestore isn't a very good comparison. mongorestore
reads each document and inserts it into the database. Your Python code,
by comparison, reads the entire file into a string and passes that
entire string to decode_all, which then has to create dictionary
objects for all of the documents in the file and return the entire
file as a list of dictionaries. We haven't even gotten to inserting
the documents into MongoDB yet. That's never going to use memory
efficiently.
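
If you don't actually need the whole file in memory at once, you can walk
the dump one document at a time. Here's a rough, untested sketch (it assumes
the file is a standard mongodump-style concatenation of BSON documents, each
starting with its 4-byte little-endian length; iter_bson_documents is just a
name I made up):

import struct
import bson

def iter_bson_documents(path):
    # Yield one decoded document at a time instead of materializing
    # the whole dump as a list of dictionaries.
    with open(path, 'rb') as f:
        while True:
            prefix = f.read(4)
            if len(prefix) < 4:
                break  # end of file
            # The BSON length prefix counts itself, so the rest of the
            # document is (length - 4) bytes.
            doc_len = struct.unpack('<i', prefix)[0]
            body = f.read(doc_len - 4)
            # decode_all on a single document's bytes returns a
            # one-element list.
            yield bson.decode_all(prefix + body)[0]

That way each document can be written out (or inserted) and garbage
collected before the next one is read.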
Bernie Hackett
2012-10-18 01:55:47 UTC
There is some code in the mongo-hadoop connector that can help you with this:

https://github.com/mongodb/mongo-hadoop/blob/master/streaming/language_support/python/pymongo_hadoop/input.py#L7-50

Also, make sure you are using the C extensions for PyMongo. You can
check like this:

python -c 'import pymongo; print pymongo.has_c()'
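
Since your end goal is JSON, you could combine an incremental reader like
that one (or the iter_bson_documents sketch from my previous message, which
is not part of PyMongo itself) with bson.json_util and never hold more than
one document in memory. A rough sketch, assuming newline-delimited JSON
output is acceptable:

from bson import json_util

# Stream the dump to newline-delimited JSON, one document at a time.
with open("bigBson.json", "w") as out:
    for doc in iter_bson_documents("bigBson.bson"):
        out.write(json_util.dumps(doc) + "\n")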
Matthias Lee
2012-10-19 14:55:24 UTC
Thanks, I will have a look at the hadoop connector.

I did check, and I do have the C extension.