Discussion:
[mongodb-user] Optimizing data compression
j***@findmeon.com
2015-10-10 16:28:28 UTC
Permalink
We're giving MongoDB a detailed look as the 'manager' for our archived
data. We operate a web spider and need to save snapshots for a while
(potential re-indexing based on new rulesets). Write once, never update,
rarely read, highly likely to delete. In testing different types of
compression strategies, the best results came from bucketing many documents
together.

Naturally, WiredTiger caught my eye. The block_compressor seems to do
exactly what we need. Great!

Running some tests on our sample data, it's not working as well as I'd
hoped. I'm only seeing about 50% compression.

I think this might be caused by two things:

1. the level of zlib compression
2. the size of the blocks that WiredTiger manages

Does anyone know what these are set to, and whether they are controllable? I
found various information online saying that WiredTiger uses zlib level 6 (but
no source) and that the block size of WiredTiger storage was
configurable -- but that was pre-MongoDB, on a direct WiredTiger install.

Right now, it looks like we'd get better compression on our dataset using
document level compression at zlib level 9.
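For comparison, the document-level approach described above can be sketched with Python's standard zlib module (the payload contents and sizes here are illustrative, not our real data):

```python
import zlib

def compress_doc(payload: bytes, level: int = 9) -> bytes:
    """Compress a snapshot payload at the application level,
    before it is stored in the database."""
    return zlib.compress(payload, level)

def decompress_doc(blob: bytes) -> bytes:
    """Restore the original payload bytes."""
    return zlib.decompress(blob)

# Repetitive sample data, similar in spirit to bucketed page snapshots.
sample = b"<html><body>snapshot</body></html>" * 200
blob = compress_doc(sample)
assert decompress_doc(blob) == sample
assert len(blob) < len(sample)
```

At level 9 the compressor sees the whole document at once, which is why bucketing many documents together helped in our tests.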
--
You received this message because you are subscribed to the Google Groups "mongodb-user"
group.

For other MongoDB technical support options, see: http://www.mongodb.org/about/support/.
---
You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mongodb-user+***@googlegroups.com.
To post to this group, send email to mongodb-***@googlegroups.com.
Visit this group at http://groups.google.com/group/mongodb-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/mongodb-user/c7e8197d-57b1-4ac1-ab87-a33ad7dafd46%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
David Hows
2015-10-12 22:51:02 UTC
Permalink
By default WiredTiger uses the Z_DEFAULT_COMPRESSION level within zlib,
which equates to roughly level 6 of compression according to the zlib docs.
Unfortunately, that setting is not exposed in any of the APIs, so there
isn't much that can be done about the zlib compression level.

You can control the size of the blocks of data being compressed, to some
extent, by setting a leaf_page_max value for both indexes and collections.
Collections use the WiredTiger default of 32KB pages and indexes use 16KB
pages. These values are set when a collection is created, so you will need
to create new collections with the options outlined below in order to see a
difference.

If possible, I'd recommend some trial and error: increase the above sizes,
then create new collections and fill them with dummy data to see what
impact that has on your gross storage size.

The leaf_page_max value can be changed with
--wiredTigerCollectionConfigString. For example,
--wiredTigerCollectionConfigString="leaf_page_max=128K" sets the
leaf_page_max value to 128K for all new collections. The same can be done
for indexes with the --wiredTigerIndexConfigString command line option.


These two options have config file YAML equivalents:
storage.wiredTiger.collectionConfig.configString
storage.wiredTiger.indexConfig.configString
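Put together, a mongod.conf fragment using those keys might look like this (the 128K and 64K values are illustrative, not recommendations -- tune them against your own data):

```yaml
storage:
  wiredTiger:
    collectionConfig:
      configString: "leaf_page_max=128K"
    indexConfig:
      configString: "leaf_page_max=64K"
```

As noted above, these only affect collections and indexes created after the setting is in place.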
Frederick Cheung
2015-10-13 09:29:54 UTC
Permalink
Post by David Hows
You could control the size of the blocks of data being compressed to some
extent by setting a leaf_page_max value for both indexes and collections.
Collections use the WT default of 32KB and indexes use 16K pages. These
values are set when the collections are created, so you will need to create
new collections with the options outlined below in order to see a
difference.
Hi,

What would the potential downsides be of increasing leaf_page_max? We're
also in a situation where many of our documents are > 32K.

Thanks,

Fred
David Hows
2015-10-13 22:58:22 UTC
Permalink
Hi Fred,

The potential downside of increasing the page size is that you may wind up
over-reading when a given document needs to be fetched from disk, since
larger pages can (potentially) contain more documents. This means you may
read a page holding, say, four documents from disk when a given query
needs only one of them.

If many of your documents are larger than the current page size, then it
may be more efficient to increase it.

As ever, you are ALWAYS best to do some performance testing with the
various settings to confirm what's best.

- David
j***@findmeon.com
2015-10-16 19:18:51 UTC
Permalink
For the benefit of others:

I migrated some data the other day. I didn't change the block size, though
I will probably try that once I get a grasp on the size distribution of
documents in the new setup.

The migration was pretty simple - I left block compression as zlib and
replaced a single field in each document with an LZMA level 9 compressed
version. So our application is responsible for compressing the 'payload'
of snapshots, while Mongo compresses the metadata.
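The field-level approach described above can be sketched in Python's standard lzma module (the 'payload' field name and document shape are assumptions for illustration):

```python
import lzma

def pack_snapshot(doc: dict) -> dict:
    """Compress only the large 'payload' field at the application
    level; the remaining metadata stays uncompressed, leaving it to
    the storage engine's block compressor."""
    packed = dict(doc)
    packed["payload"] = lzma.compress(doc["payload"], preset=9)
    return packed

def unpack_snapshot(doc: dict) -> dict:
    """Restore the original payload bytes for reads."""
    restored = dict(doc)
    restored["payload"] = lzma.decompress(doc["payload"])
    return restored

# Illustrative snapshot document with a repetitive payload.
doc = {"url": "http://example.com", "payload": b"<html>...</html>" * 500}
stored = pack_snapshot(doc)
assert unpack_snapshot(stored)["payload"] == doc["payload"]
```

Since our workload is write once, rarely read, the decompression cost on reads is acceptable for us.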

Using PostgreSQL, which automatically 'TOASTs' documents - about 60GB
between the table and index.
Using WiredTiger + zlib, raw documents - 36GB.
Using WiredTiger + zlib, app-compressed documents - 31GB.

Once I get a good idea of what the new distribution of compressed data
is... I'll run some tests to see if there is any noticeable difference.
David Hows
2015-10-13 01:12:39 UTC
Permalink
Another thing that may be a factor is checkpointing. As WiredTiger runs
along and takes checkpoints, these can be appended to the data files,
increasing their size; unfortunately, there is little that can be done
about this either.
Post by j***@findmeon.com
We're giving MongoDB a detailed look as the 'manager' for our archived
data. We operate a web spider and need to save snapshots for a while
(potential re-indexing based on new rulesets). Write once, never update,
rarely read, highly likely to delete. In testing different types of
compression strategies, the best results came from bucketing many documents
together.
Naturally, WiredTiger caught my eye. The block_compressor seems to do
exactly what we need. Great!
Running some tests on our sample data, it's not working as well as I'd
hope. I'm only seeing about 50% compression.
1. the level of zlib compression
2. the size of the blocks that WiredTiger manages
Does anyone know what these are set to, and whether they are controllable? I
found various information online saying that WiredTiger uses zlib level 6 (but
no source) and that the block size of WiredTiger storage was
configurable -- but that was pre-MongoDB, on a direct WiredTiger install.
Right now, it looks like we'd get better compression on our dataset using
document level compression at zlib level 9.
j***@findmeon.com
2015-10-13 06:30:11 UTC
Permalink
Thanks David.

The block size looks to be the issue. 50% of our current index is in the
100-200K range.

Based on the above, I may just switch to compressing the documents at the
application level. The block size we'd need is pretty large.