Discussion:
[mongodb-user] Proper way to update reasonably large data using mongo-spark connector via pyspark
Xiao Cui
2017-05-19 21:11:19 UTC
Hello everyone,

I'm using Spark as an ETL tool for our data pipeline (mainly pyspark, hosted on
EMR). At the very end of the pipeline, we use the mongo-spark connector to export
the results to a Mongo database. We have seen these writes drag down the
performance of the whole mongo instance, e.g.:

command: insert { insert: "xxx_intermediate_wip", ordered: true, documents:
309 } ninserted:309 keyUpdates:0 writeConflicts:0 numYields:0 reslen:80
locks:{ Global: { acquireCount: { r: 315, w: 315 } }, MMAPV1Journal: {
acquireCount: { w: 322 }, acquireWaitCount: { w: 4 }, timeAcquiringMicros:
{ w: 7690 } }, Database: { acquireCount: { w: 315 }, acquireWaitCount: { w:
1 }, timeAcquiringMicros: { w: 13713400 } }, Collection: { acquireCount: {
W: 6 }, acquireWaitCount: { W: 6 }, timeAcquiringMicros: { W: 86102310 } },
Metadata: { acquireCount: { w: 309 } }, oplog: { acquireCount: { W: 309 },
acquireWaitCount: { W: 1 } } } protocol:op_query 100165ms
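
For reference, the lock wait fields in this log line can be tallied with a short script. This is only a sketch: the regular expression and the helper names (`lock_wait_seconds`, `total_seconds`) are illustrative assumptions, not part of any MongoDB tool, and summing the per-lock waits is an approximation since waits may overlap. On the line above it shows roughly 86 s waiting on the Collection lock and 13.7 s on the Database lock, out of a 100 s operation.

```python
import re

# The log line from the post, rejoined into one string.
LOG_LINE = (
    'command: insert { insert: "xxx_intermediate_wip", ordered: true, documents: '
    '309 } ninserted:309 keyUpdates:0 writeConflicts:0 numYields:0 reslen:80 '
    'locks:{ Global: { acquireCount: { r: 315, w: 315 } }, MMAPV1Journal: { '
    'acquireCount: { w: 322 }, acquireWaitCount: { w: 4 }, timeAcquiringMicros: '
    '{ w: 7690 } }, Database: { acquireCount: { w: 315 }, acquireWaitCount: { w: '
    '1 }, timeAcquiringMicros: { w: 13713400 } }, Collection: { acquireCount: { '
    'W: 6 }, acquireWaitCount: { W: 6 }, timeAcquiringMicros: { W: 86102310 } }, '
    'Metadata: { acquireCount: { w: 309 } }, oplog: { acquireCount: { W: 309 }, '
    'acquireWaitCount: { W: 1 } } } protocol:op_query 100165ms'
)

# Hypothetical pattern: matches only lock sections that report a wait time.
LOCK_WAIT = re.compile(
    r"(\w+): \{ acquireCount: \{[^}]*\}, acquireWaitCount: \{[^}]*\}, "
    r"timeAcquiringMicros: \{ \w: (\d+) \}"
)


def lock_wait_seconds(line):
    """Map each lock name to the seconds spent waiting to acquire it."""
    return {name: int(micros) / 1e6 for name, micros in LOCK_WAIT.findall(line)}


def total_seconds(line):
    """Return the operation duration from the trailing '<n>ms' field."""
    match = re.search(r"(\d+)ms\s*$", line)
    return int(match.group(1)) / 1000


if __name__ == "__main__":
    total = total_seconds(LOG_LINE)
    for name, secs in lock_wait_seconds(LOG_LINE).items():
        print(f"{name}: {secs:.3f}s ({secs / total:.1%} of the operation)")
```

If the tally is right, nearly all of the 100 s is spent waiting for locks rather than inserting, which points at contention on the instance.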

The questions/issues I have:
1. Is there a way to call ensureIndex on the collection through the mongo-spark
connector's Python API? Would this help speed up writes?
2. Is there a way to override the default batch size here:
https://github.com/mongodb/mongo-spark/blob/master/src/main/scala/com/mongodb/spark/MongoSpark.scala
I noticed that every insert is around 300 documents.
3. Are there any recommended "best practices" for this kind of scenario?

(The dataframe is around 4M records; we are using Mongo 3.2 without
WiredTiger.)


Thanks in advance!
--
You received this message because you are subscribed to the Google Groups "mongodb-user"
group.

For other MongoDB technical support options, see: https://docs.mongodb.com/manual/support/
---
You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mongodb-user+***@googlegroups.com.
To post to this group, send email to mongodb-***@googlegroups.com.
Visit this group at https://groups.google.com/group/mongodb-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/mongodb-user/d599132f-6785-4d7a-b1d1-5074ce01c659%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
'Wan Bachtiar' via mongodb-user
2017-06-28 07:56:01 UTC
We have seen that writing causing the mongo instance’s whole performance
down, i.e.

Hi Xiao,

It’s been a while since you posted this question. Have you found a way to
improve insert performance?

Before going deeper into the Spark config, I would recommend narrowing the
scope of the performance test, for example by testing how well your MongoDB
instance on its own handles 4M inserts. The goal is to find the performance
bottleneck with simple tests, such as checking your MongoDB memory and disk
I/O. See also MongoDB Capacity Planning
<https://www.mongodb.com/blog/post/capacity-planning-and-hardware-provisioning-mongodb-ten-minutes>
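
As a rough illustration of such an isolated test, the sketch below times batched inserts outside of Spark. It is a minimal harness under stated assumptions: `batches` and `time_inserts` are hypothetical helper names, and `insert_batch` is any callable that writes one list of documents, so the harness can be tried without a database first.

```python
import time
from itertools import islice


def batches(docs, size):
    """Yield successive lists of at most `size` documents from an iterable."""
    iterator = iter(docs)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def time_inserts(docs, insert_batch, batch_size):
    """Send docs to `insert_batch` in chunks; return (count, elapsed seconds)."""
    start = time.perf_counter()
    count = 0
    for chunk in batches(docs, batch_size):
        insert_batch(chunk)  # e.g. a wrapper around a driver's bulk insert
        count += len(chunk)
    return count, time.perf_counter() - start


if __name__ == "__main__":
    # Dry run against an in-memory sink instead of a real collection.
    sink = []
    docs = ({"i": i} for i in range(10_000))
    count, elapsed = time_inserts(docs, sink.append, batch_size=300)
    print(f"{count} docs in {len(sink)} batches, {elapsed:.4f}s")
```

With PyMongo, `insert_batch` could be something like `lambda chunk: collection.insert_many(chunk, ordered=False)`, where `collection` is assumed to be a test collection on the instance being measured. Running it with a few different batch sizes, while watching memory and disk I/O, should show whether the instance or the Spark job is the limiting factor.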

You may find the following performance related resources useful:

- MongoDB 3.2 write operation performance
<https://docs.mongodb.com/v3.2/core/write-performance/>
- MongoDB 3.2: analysing performance
<https://docs.mongodb.com/v3.2/administration/analyzing-mongodb-performance/>
- MongoDB University M201: MongoDB Performance
<https://university.mongodb.com/courses/M201/about>

Is there a way to ensureIndex on the collection through mongo-spark
connector’s python api? Will this help for speed up writing?.

Generally, adding an index won’t improve insert operations, since every
additional index has to be maintained on each insert. It may speed up the
query part of update operations, but that depends on the update operation
itself.

we are using Mongo 3.2 without wiredTiger

If possible, I would recommend testing the WiredTiger
<https://docs.mongodb.com/v3.2/core/wiredtiger/> storage engine for your
use case.

Regards,
Wan.