Xiao Cui
2017-05-19 21:11:19 UTC
Hello everyone,
I'm using Spark as an ETL tool for our data pipeline (mainly PySpark, hosted on
EMR). At the very end of the step, we use the MongoDB Spark connector to export
the results to a MongoDB database. We have seen that these writes degrade the
performance of the whole mongo instance, e.g.:
command: insert { insert: "xxx_intermediate_wip", ordered: true, documents:
309 } ninserted:309 keyUpdates:0 writeConflicts:0 numYields:0 reslen:80
locks:{ Global: { acquireCount: { r: 315, w: 315 } }, MMAPV1Journal: {
acquireCount: { w: 322 }, acquireWaitCount: { w: 4 }, timeAcquiringMicros:
{ w: 7690 } }, Database: { acquireCount: { w: 315 }, acquireWaitCount: { w:
1 }, timeAcquiringMicros: { w: 13713400 } }, Collection: { acquireCount: {
W: 6 }, acquireWaitCount: { W: 6 }, timeAcquiringMicros: { W: 86102310 } },
Metadata: { acquireCount: { w: 309 } }, oplog: { acquireCount: { W: 309 },
acquireWaitCount: { W: 1 } } } protocol:op_query 100165ms
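
To make the entry above easier to read, here is a minimal sketch that pulls the lock wait times out of it and converts them to seconds. The helper names (lock_wait_seconds, total_seconds) are my own and not part of any MongoDB tool, and the regular expression only assumes the layout of this one log line.

```python
import re

# The slow-insert log entry quoted above, joined onto a single line.
LOG_LINE = (
    'command: insert { insert: "xxx_intermediate_wip", ordered: true, documents: 309 } '
    'ninserted:309 keyUpdates:0 writeConflicts:0 numYields:0 reslen:80 '
    'locks:{ Global: { acquireCount: { r: 315, w: 315 } }, '
    'MMAPV1Journal: { acquireCount: { w: 322 }, acquireWaitCount: { w: 4 }, '
    'timeAcquiringMicros: { w: 7690 } }, '
    'Database: { acquireCount: { w: 315 }, acquireWaitCount: { w: 1 }, '
    'timeAcquiringMicros: { w: 13713400 } }, '
    'Collection: { acquireCount: { W: 6 }, acquireWaitCount: { W: 6 }, '
    'timeAcquiringMicros: { W: 86102310 } }, '
    'Metadata: { acquireCount: { w: 309 } }, '
    'oplog: { acquireCount: { W: 309 }, acquireWaitCount: { W: 1 } } } '
    'protocol:op_query 100165ms'
)

# Matches only lock sections that report a timeAcquiringMicros value.
_LOCK_WAIT = re.compile(
    r'(\w+): \{ acquireCount: \{[^}]*\}, acquireWaitCount: \{[^}]*\}, '
    r'timeAcquiringMicros: \{ \w: (\d+) \}'
)


def lock_wait_seconds(line):
    """Return {lock name: seconds spent waiting to acquire it}."""
    return {name: int(micros) / 1e6 for name, micros in _LOCK_WAIT.findall(line)}


def total_seconds(line):
    """Return the total operation time from the trailing '<n>ms' field."""
    return int(re.search(r'(\d+)ms$', line).group(1)) / 1000


if __name__ == "__main__":
    total = total_seconds(LOG_LINE)
    for name, secs in lock_wait_seconds(LOG_LINE).items():
        print(f"{name}: {secs:.2f}s waiting ({secs / total:.0%} of {total:.1f}s)")
```

Read this way, the insert spent about 86.1 s waiting on the Collection lock and about 13.7 s on the Database lock, out of roughly 100.2 s in total, so nearly all of the time went to lock waits rather than to the insert itself.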
The questions/issues I have:
1. Is there a way to call ensureIndex on the collection through the mongo-spark
connector's Python API? Would this help speed up the writes?
2. Is there a way that we can override the default batch size here:
https://github.com/mongodb/mongo-spark/blob/master/src/main/scala/com/mongodb/spark/MongoSpark.scala
I ask because I noticed that every insert contains around 300 documents.
3. Are there any recommended "best practices" for this kind of scenario?
(The DataFrame is around 4M records; we are using MongoDB 3.2 without
WiredTiger.)
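
For a sense of scale on question 2, here is a rough back-of-the-envelope sketch of how many insert commands the export needs at the batch size seen in the log. The alternative batch size of 1000 is purely hypothetical and is not a known connector setting; the function name is also my own.

```python
TOTAL_RECORDS = 4_000_000   # approximate DataFrame size mentioned above
OBSERVED_BATCH = 309        # documents in the logged insert command


def insert_commands(total_records, batch_size):
    """Number of insert commands needed to write total_records in batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Integer ceiling division: a final partial batch still costs one command.
    return (total_records + batch_size - 1) // batch_size


if __name__ == "__main__":
    # 1000 is a hypothetical larger batch size, used only for comparison.
    for batch in (OBSERVED_BATCH, 1000):
        print(batch, insert_commands(TOTAL_RECORDS, batch))
```

At about 309 documents per insert, 4M records work out to roughly 12,945 insert commands, each of which may have to wait on the collection lock. Spark issues these from several partitions in parallel, so this is a count of commands, not an estimate of wall-clock time.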
Thanks in advance!
--
You received this message because you are subscribed to the Google Groups "mongodb-user"
group.
For other MongoDB technical support options, see: https://docs.mongodb.com/manual/support/
---
You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mongodb-user+***@googlegroups.com.
To post to this group, send email to mongodb-***@googlegroups.com.
Visit this group at https://groups.google.com/group/mongodb-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/mongodb-user/d599132f-6785-4d7a-b1d1-5074ce01c659%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.