Discussion: MongoDB Log Level
rhea ghosh
2014-11-26 19:05:27 UTC
Hi,

I'm having trouble finding this in the documentation: what are the various
log levels (also known as verbosity)? I've looked for this in the
documentation, but maybe I'm looking in the wrong place. I know there are
levels 0-5, but beyond that I haven't found a description of what the
different levels mean.

It seems that 0 is the warning level and only logs warnings, unless you set
the profiling level differently, in which case it will also log slow
queries.

I turned the log level up to 1 because I'm trying to track down a heartbeat
issue, but it seems to be logging all commands and reads/writes, and I don't
really see any more heartbeat information being logged on the primary. Also,
my primary log file rocketed up to 5GB with the log level at 1 for half a
day, which seems really unusual given that the logs are normally around 2MB.

So what do the different log levels actually stand for, and what information
is tracked at each one? And how much should I expect the logs to grow if I
turn the level up?

Thanks
Adam C
2014-11-27 16:53:58 UTC
Log level 0 is the default and unless you are debugging a specific issue
with your mongod/mongos you should leave it at that level. Slow queries
will still be logged at level 0, once they go beyond the configured slowms
threshold
<http://docs.mongodb.org/manual/reference/method/db.setProfilingLevel/#db.setProfilingLevel>
(default 100ms). The logs get very chatty with any increase, but
especially at the higher levels.
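
For reference, here is a minimal sketch of checking and adjusting the
slow-operation threshold from the mongo shell (the 200ms value is only an
example, not a recommendation for your workload):

    // run against the database you are interested in
    db.getProfilingStatus()        // returns something like { "was" : 0, "slowms" : 100 }
    db.setProfilingLevel(0, 200)   // keep the profiler off, but only log operations slower than 200ms

Note that slowms is a server-wide setting, so raising it changes what gets
written to the log even with the profiler off.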

There is an open issue to document the differences between the log levels,
but things need to be cleaned up a bit in the core database before that can
happen in a sane way. The Jira issue for that (if you would like to
watch/vote) is DOCS-3678 <https://jira.mongodb.org/browse/DOCS-3678>.

For now, unless you have a real need for increased logging, the
recommendation is to leave it at the default level of 0.
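
If you do need to raise it temporarily, the verbosity can be changed at
runtime without a restart, roughly like this (values are illustrative, and
the change is not persisted across restarts):

    db.adminCommand({ setParameter: 1, logLevel: 1 })   // bump verbosity to 1 while debugging
    db.adminCommand({ getParameter: 1, logLevel: 1 })   // confirm the current value
    db.adminCommand({ setParameter: 1, logLevel: 0 })   // drop back to the default when finished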

Adam
rhea
2014-11-27 20:01:31 UTC
Yes, I've noticed the chattiness increase with an increased log level. I
guess I'm just confused that there doesn't seem to be any extra heartbeat
information, simply that more queries are being logged (at least from what
I've seen in my logs so far).

I would like to increase my log level because I'm trying to troubleshoot an
issue where my primary server is slow to respond to my secondary and
arbiter. We've done a Wireshark trace on all the traffic between the nodes
and determined it is not an infrastructure issue; the network looks
perfectly healthy at the time of this slowness. So I want to see if there is
any information available from the database, since the boxes are new and
look fine as well. I would just change the timeout for failover to something
slightly longer, since we are getting so many false positives, but I don't
believe this is configurable. So I'd like to increase logging to get to the
bottom of the issue, but I'd prefer not to have insanely large logs if I'm
not going to get the information I need from them.
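
(As a side note, rotating the log on demand at least keeps any single file a
manageable size while the verbosity is raised; a minimal sketch from the
mongo shell, run as an admin user:)

    db.adminCommand({ logRotate: 1 })   // renames the current log file and starts a new one

Sending SIGUSR1 to the mongod process has the same effect if you'd rather
script it from cron.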
Adam Comerford
2014-11-27 20:12:14 UTC
When you say "slow to respond" what do you mean? Are you talking about
replica set hearbeats, or some other operation?

Also, what version are you using? Those heartbeats, as of version 2.4.5 at
least, don't require a lock even with authentication enabled (if you are
using a version older than 2.4.5 and it is heartbeats, I would recommend
upgrading).

Generally, you don't need to have debugging enabled to figure out what is
causing replica set heartbeats to be slow, because (since they are so
lightweight) if they are slow then lots of other things will be too; there
will be no lack of symptoms to choose from.
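
Even at the default verbosity, rs.status() shows per-member heartbeat
timing, which may already tell you whether responses are slow; a rough
sketch from the mongo shell:

    rs.status().members.forEach(function (m) {
        // pingMs and lastHeartbeatRecv are reported for the other members, not for the node you run this on
        print(m.name + "  " + m.stateStr + "  pingMs=" + m.pingMs + "  lastHeartbeatRecv=" + m.lastHeartbeatRecv);
    });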

Adam
--
Adam Comerford | ***@mongodb.com
Senior Solutions Engineer, MongoDB EMEA
rhea
2014-11-27 20:46:11 UTC
When we looked through the trace we could see clear examples where the
replica set heartbeats were, in fact, slow to respond. It would take 14-16s
before the primary would send a response, even though we can see it ACK the
request from the secondary (or arbiter). And then we would see an election
go through. At any rate, we haven't seen any latency in our application or
any monitoring on the mongos, and looking in the logs it all looks pretty
normal. So it's curious for us that we don't see other symptoms, but we do
see this happen every few weeks. Typically we don't have any problems, but
recently there was a bug (which has been fixed, I believe) where MongoDB
wouldn't roll back replica set operations properly. I'm not sure which build
it was fixed in, but we are running 2.6.1 and just had it happen a couple of
weeks ago, so the fix must be in a later build than the one we have. Aside
from this heartbeat issue we haven't run into any other problems with our
cluster, so it would be nice to iron this out so that we have a stable
cluster again.
Adam C
2014-11-29 06:14:04 UTC
Well, it doesn't sound like an issue I recall from the earlier 2.6 builds,
but I may just not remember the issue in question. There are >160 issues
closed between 2.6.1 and 2.6.5 - I'll have a look through them and see if
any that I might have missed can account for such a slow response to a
heartbeat. Is an upgrade to 2.6.5 something you would be willing to try? We
don't have a specific smoking gun, but before we look at investigating a new
potential bug it is always advisable to be on the latest point release to
make sure the issue has not already been fixed.

It would certainly be interesting to get a look at the traces showing the
slow responses also, though I understand that might not be possible.

Adam
rhea
2014-11-29 15:17:12 UTC
I can post the traces when I get back in the office on Monday if that will
help.

And sorry for not being clear. The bug was that the primary that stepped
down because of slow heartbeats wasn't able to roll back the operations, and
we ended up having to manually resync because it just shut down.

I would be willing to try upgrading to 2.6.5; it would just require a little
testing beforehand, since we have production systems running on it.
rhea ghosh
2014-12-02 17:35:22 UTC
OK, I'm attaching the traces here. There are two traces from the original
primary and one from the original secondary server. You'll see that the blue
highlighted part shows where the primary responds to the secondary, but it
takes about 16 seconds to do so. And then in the trace to the arbiter you'll
see the highlighted part where the primary finally replied to the arbiter
after 35 seconds, by which time the arbiter had already closed the TCP
connection. The red highlights are where the original secondary and the
arbiter reset the connection with the original primary.

These are the logs from the original primary at the time, which I don't feel
show any reason for it to be slow.

2014-11-20T16:59:06.575-0600 [conn60120] command admin.$cmd command: serverStatus { serverStatus: 1 } ntoreturn:1 keyUpdates:0 numYields:0 locks(micros) r:39 reslen:3656 866ms
2014-11-20T16:59:07.579-0600 [conn60120] end connection xx.x.xxx.xxx:xxxxx (173 connections now open)
2014-11-20T16:59:08.528-0600 [conn60124] end connection xx.x.xxx.xxx:xxxxx (172 connections now open)
2014-11-20T16:59:29.613-0600 [clientcursormon] mem (MB) res:236057 virt:3745418
2014-11-20T16:59:29.613-0600 [clientcursormon] mapped (incl journal view):3737954
2014-11-20T16:59:29.613-0600 [clientcursormon] connections:172
2014-11-20T16:59:29.613-0600 [clientcursormon] replication threads:32
2014-11-20T16:59:30.754-0600 [conn57521] end connection xx.x.xxx.xxx:xxxxx (171 connections now open)
2014-11-20T17:00:00.433-0600 [initandlisten] connection accepted from xx.x.xxx.xxx:xxxxx #60126 (172 connections now open)
2014-11-20T17:00:00.433-0600 [initandlisten] connection accepted from xx.x.xxx.xxx:xxxxx #60127 (173 connections now open)
2014-11-20T17:00:00.433-0600 [conn60125] end connection xx.x.xxx.xxx:xxxxx (172 connections now open)
2014-11-20T17:00:00.433-0600 [initandlisten] connection accepted from xx.x.xxx.xxx:xxxxx #60128 (174 connections now open)
2014-11-20T17:00:00.433-0600 [conn60126] end connection xx.x.xxx.xxx:xxxxx (172 connections now open)
2014-11-20T17:00:00.433-0600 [initandlisten] connection accepted from xx.x.xxx.xxx:xxxxx #60129 (174 connections now open)
2014-11-20T17:00:00.433-0600 [initandlisten] connection accepted from xx.x.xxx.xxx:xxxxx #60130 (174 connections now open)
2014-11-20T17:00:00.433-0600 [initandlisten] connection accepted from xx.x.xxx.xxx:xxxxx #60131 (175 connections now open)
2014-11-20T17:00:00.433-0600 [conn60127] end connection xx.x.xxx.xxx:xxxxx (174 connections now open)
2014-11-20T17:00:00.433-0600 [conn60128] end connection xx.x.xxx.xxx:xxxxx (174 connections now open)
2014-11-20T17:00:00.434-0600 [conn60130] end connection xx.x.xxx.xxx:xxxxx (174 connections now open)
2014-11-20T17:00:00.433-0600 [conn60129] end connection xx.x.xxx.xxx:xxxxx (174 connections now open)
2014-11-20T17:00:00.433-0600 [initandlisten] connection accepted from xx.x.xxx.xxx:xxxxx #60132 (176 connections now open)
2014-11-20T17:00:00.434-0600 [conn60131] end connection xx.x.xxx.xxx:xxxxx (171 connections now open)
2014-11-20T17:00:00.434-0600 [initandlisten] connection accepted from xx.x.xxx.xxx:xxxxx #60133 (173 connections now open)
2014-11-20T17:00:00.434-0600 [initandlisten] connection accepted from xx.x.xxx.xxx:xxxxx #60134 (173 connections now open)
2014-11-20T17:00:00.434-0600 [conn60132] end connection xx.x.xxx.xxx:xxxxx (172 connections now open)
2014-11-20T17:00:00.434-0600 [conn60134] authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-11-20T17:00:00.435-0600 [rsHealthPoll] replset info ARBITER:27017 thinks that we are down
2014-11-20T17:00:00.477-0600 [rsHealthPoll] replSet member MONGOSERVER2:27017 is now in state PRIMARY
2014-11-20T17:00:00.477-0600 [rsMgr] stepping down; another primary was elected more recently
2014-11-20T17:00:00.478-0600 [rsMgr] replSet relinquishing primary state
2014-11-20T17:00:00.478-0600 [rsMgr] replSet SECONDARY
2014-11-20T17:00:00.478-0600 [rsMgr] replSet closing client sockets after relinquishing primary

As I said before, I'm not unwilling to try 2.6.5, but I just don't know if
it will resolve the issues that I've been seeing, and since I have a
production system up and running, I'd need to test before I can make the
upgrade.
rhea ghosh
2014-12-08 17:22:48 UTC
Also, along with the logs from the original primary server, these are the
logs from the secondary server and the arbiter.

Original secondary (MONGOSERVER2) logs:
2014-11-20T16:59:01.735-0600 [conn59639] authenticate db: admin { authenticate: 1, nonce: "xxx", user: "svcClusterAdmin", key: "xxx" }
2014-11-20T16:59:02.000-0600 [repl index builder 30] Index Build(background): 36200500/124435801 29%
2014-11-20T16:59:05.003-0600 [repl index builder 30] Index Build(background): 36577700/124435801 29%
2014-11-20T16:59:08.000-0600 [repl index builder 30] Index Build(background): 36958700/124435801 29%
2014-11-20T16:59:08.528-0600 [conn59637] end connection 10.0.240.250:59921 (70 connections now open)
2014-11-20T16:59:11.000-0600 [repl index builder 30] Index Build(background): 37338400/124435801 30%
2014-11-20T16:59:14.000-0600 [repl index builder 30] Index Build(background): 37697700/124435801 30%
2014-11-20T16:59:17.010-0600 [repl index builder 30] Index Build(background): 38079300/124435801 30%
2014-11-20T16:59:18.527-0600 [rsHealthPoll] DBClientCursor::init call() failed
2014-11-20T16:59:18.547-0600 [rsHealthPoll] can't authenticate to CHILXPROD061:27017 (10.0.240.250) failed as internal user, error: DBClientBase::findN: transport error: CHILXPROD061:27017 ns: local.$cmd query: { getnonce: 1 }
2014-11-20T16:59:20.000-0600 [repl index builder 30] Index Build(background): 38458700/124435801 30%
2014-11-20T16:59:23.000-0600 [repl index builder 30] Index Build(background): 38835100/124435801 31%
2014-11-20T16:59:26.000-0600 [repl index builder 30] Index Build(background): 39211100/124435801 31%
2014-11-20T16:59:26.645-0600 [conn59638] end connection xxx.xx.xxx.xxx:xxxxx (69 connections now open)
2014-11-20T16:59:26.645-0600 [initandlisten] connection accepted from xxx.xx.xxx.xxx:xxxxx #59640 (70 connections now open)
2014-11-20T16:59:26.678-0600 [conn59640] authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-11-20T16:59:28.549-0600 [rsHealthPoll] DBClientCursor::init call() failed
2014-11-20T16:59:28.556-0600 [rsHealthPoll] replSet info MONGOSERVER1:27017 is down (or slow to respond):
2014-11-20T16:59:28.556-0600 [rsHealthPoll] replSet member MONGOSERVER1:27017 is now in state DOWN
2014-11-20T16:59:28.556-0600 [rsMgr] replSet info electSelf 1
2014-11-20T16:59:29.000-0600 [repl index builder 30] Index Build(background): 39581200/124435801 31%
2014-11-20T16:59:30.766-0600 [rsMgr] replSet PRIMARY
2014-11-20T16:59:32.000-0600 [repl index builder 30] Index Build(background): 39965000/124435801 32%
2014-11-20T16:59:33.972-0600 [conn59639] command admin.$cmd command: serverStatus { serverStatus: 1 } ntoreturn:1 keyUpdates:0 numYields:0 locks(micros) r:49 reslen:3608 154ms
2014-11-20T16:59:35.000-0600 [repl index builder 30] Index Build(background): 40289100/124435801 32%
2014-11-20T16:59:38.000-0600 [repl index builder 30] Index Build(background): 40679700/124435801 32%
2014-11-20T16:59:38.735-0600 [journal] old journal file will be removed: /opt/mongodata/journal/j._87
2014-11-20T16:59:40.597-0600 [rsHealthPoll] DBClientCursor::init call() failed
2014-11-20T16:59:41.000-0600 [repl index builder 30] Index Build(background): 41064300/124435801 33%
2014-11-20T16:59:44.000-0600 [repl index builder 30] Index Build(background): 41449500/124435801 33%
2014-11-20T16:59:47.000-0600 [repl index builder 30] Index Build(background): 41858100/124435801 33%
2014-11-20T16:59:50.000-0600 [repl index builder 30] Index Build(background): 42233500/124435801 33%
2014-11-20T16:59:52.598-0600 [rsHealthPoll] DBClientCursor::init call() failed
2014-11-20T16:59:53.000-0600 [repl index builder 30] Index Build(background): 42613000/124435801 34%
2014-11-20T16:59:56.000-0600 [repl index builder 30] Index Build(background): 42954300/124435801 34%
2014-11-20T16:59:56.685-0600 [conn59640] end connection xxx.xx.xxx.xxx:xxxxx (69 connections now open)
2014-11-20T16:59:56.686-0600 [initandlisten] connection accepted from
10.0.104.189:50816 #59641 (70 connections now open)
2014-11-20T16:59:56.726-0600 [conn59641] authenticate db: local {
authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-11-20T16:59:59.000-0600 [repl index builder 30] Index
Build(background): 43317900/124435801 34%
2014-11-20T17:00:00.435-0600 [rsHealthPoll] replSet member
MONGOSERVER1:27017 is up
2014-11-20T17:00:00.435-0600 [rsHealthPoll] replSet member
MONGOSERVER1:27017 is now in state PRIMARY
2014-11-20T17:00:00.435-0600 [rsMgr] another PRIMARY detected but it should
step down since it was elected earlier than me
2014-11-20T17:00:00.435-0600 [rsMgr] another PRIMARY detected but it should
step down since it was elected earlier than me

arbiter logs
2014-11-20T16:58:50.711-0600 [conn41035] authenticate db: local {
authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-11-20T16:59:13.954-0600 [rsHealthPoll] DBClientCursor::init call()
failed
2014-11-20T16:59:13.954-0600 [rsHealthPoll] can't authenticate to
CHILXPROD061:27017 (10.0.240.250) failed as internal user, error:
DBClientBase::findN: transport error: CHILXPROD061:27017 ns: local.$cmd
query: { getnonce: 1 }
2014-11-20T16:59:20.741-0600 [conn41036] authenticate db: local {
authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-11-20T16:59:23.955-0600 [rsHealthPoll] DBClientCursor::init call()
failed
2014-11-20T16:59:23.955-0600 [rsHealthPoll] replSet info MONGOSERVER1:27017
is down (or slow to respond):
2014-11-20T16:59:23.955-0600 [rsHealthPoll] replSet member
MONGOSERVER1:27017 is now in state DOWN
2014-11-20T16:59:28.556-0600 [conn41036] replSet info voting yea for
CHILXPROD062:27017 (1)
2014-11-20T16:59:32.680-0600 [rsHealthPoll] replSet member
MONGOSERVER2:27017 is now in state PRIMARY
2014-11-20T16:59:35.956-0600 [rsHealthPoll] DBClientCursor::init call()
failed
2014-11-20T16:59:47.958-0600 [rsHealthPoll] DBClientCursor::init call()
failed
2014-11-20T16:59:50.790-0600 [conn41037] authenticate db: local {
authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-11-20T16:59:59.959-0600 [rsHealthPoll] DBClientCursor::init call()
failed
2014-11-20T17:00:00.434-0600 [conn41038] authenticate db: local {
authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-11-20T17:00:01.962-0600 [rsHealthPoll] replSet member
MONGOSERVER1:27017 is up
2014-11-20T17:00:01.962-0600 [rsHealthPoll] replSet member MONGOSERVER1:27017
is now in state SECONDARY


Any additional thoughts as to why this could be happening would be helpful,
since it is clearly unrelated to the network and it seems unlikely that it's
the actual hardware. It would be nice to have some idea of why MongoDB
pauses for 16 seconds or so without sending any response back to the other
nodes.
Asya Kamsky
2014-12-08 17:32:40 UTC
Permalink
Hi Rhea:

From the secondary and arbiter logs we can see they couldn't connect to the primary:

2014-11-20T16:59:13.954-0600 [rsHealthPoll] can't authenticate to
CHILXPROD061:27017 (10.0.240.250) failed as internal user, error:
DBClientBase::findN: transport error: CHILXPROD061:27017 ns: local.$cmd
query: { getnonce: 1 }
2014-11-20T16:59:18.547-0600 [rsHealthPoll] can't authenticate to
CHILXPROD061:27017 (10.0.240.250) failed as internal user, error:
DBClientBase::findN: transport error: CHILXPROD061:27017 ns: local.$cmd
query: { getnonce: 1 }
2014-11-20T16:59:40.597-0600 [rsHealthPoll] DBClientCursor::init call()
failed
and the failures continue until:
2014-11-20T17:00:01.962-0600 [rsHealthPoll] replSet member
MONGOSERVER1:27017 is up
2014-11-20T17:00:00.435-0600 [rsHealthPoll] replSet member
MONGOSERVER1:27017 is up

From the primary, we simply don't see the attempts to connect:
2014-11-20T16:59:07.579-0600 [conn60120] end connection xx.x.xxx.xxx:xxxxx (
173 connections now open)
2014-11-20T16:59:08.528-0600 [conn60124] end connection xx.x.xxx.xxx:xxxxx (
172 connections now open)
2014-11-20T16:59:29.613-0600 [clientcursormon] mem (MB) res:236057 virt:
3745418
2014-11-20T16:59:29.613-0600 [clientcursormon] mapped (incl journal view):
3737954
2014-11-20T16:59:29.613-0600 [clientcursormon] connections:172
2014-11-20T16:59:29.613-0600 [clientcursormon] replication threads:32
2014-11-20T16:59:30.754-0600 [conn57521] end connection xx.x.xxx.xxx:xxxxx (
171 connections now open)
2014-11-20T17:00:00.433-0600 [initandlisten] connection accepted from xx.x.
xxx.xxx:xxxxx #60126 (172 connections now open)
2014-11-20T17:00:00.433-0600 [initandlisten] connection accepted from xx.x.
xxx.xxx:xxxxx #60127 (173 connections now open)


There isn't really any way to tell what was happening on the OS level - is
it unusual for your primary to log *nothing* for an entire 30 seconds? If so,
then it's possible that the mongod process was de-prioritized by the OS for
some reason and basically couldn't do anything while swapped out...

Is this the complete log from that time? The primary's rsHealthPoll should
be running and connecting to the other members of the replica set - we
don't see any evidence of that either - which very strongly suggests that
something happening on the primary's box was blocking everything from
progressing normally...
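
One quick way to check for that from the mongo shell - a minimal sketch,
assuming the mem.* and extra_info.page_faults counters that mongod reports on
Linux; a large jump in page faults across such a gap would point at
paging/swapping:

// Take two serverStatus snapshots across a suspect window and compare
// the resident size and page-fault counters.
var s1 = db.serverStatus();
sleep(30 * 1000);                        // wait across the window of interest
var s2 = db.serverStatus();
print("resident MB: " + s1.mem.resident + " -> " + s2.mem.resident);
print("page faults: " + s1.extra_info.page_faults + " -> " + s2.extra_info.page_faults);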

Asya
rhea ghosh
2014-12-08 18:03:21 UTC
Permalink
Hi Asya,

Thanks for responding. Yes, that's exactly what I thought was strange: the
primary didn't log anything for a whole 30 seconds, not the connection
attempts, nothing. And yes, that is the entire log for that time period; I
assumed you didn't really need to see much before/after, when it is behaving
normally again.

It is clear that the secondary and arbiter weren't able to communicate with
the primary, but the network (which is what I initially thought could be the
problem) looks fine: the traces I attached a couple of responses ago were
completely clean, and they show the primary simply not sending a response
even though the connection was healthy.

Also, the reason I started this thread was that I wanted to increase the
logging level to see if there would be more useful information in the logs,
but doing so made my logs insanely large: a typical day is 2-5MB of logs, and
that went up to over 5GB just by changing the log level from 0 to 1 on the
primary. To be clear, I raised the level *after* this incident occurred, so
there was nothing useful in the output; it just seemed to log every query on
the server, and I eventually had to turn it back down because the drive I
keep the logs on simply doesn't have that kind of space. I'm just trying to
isolate where the problem could be so I can work on it.
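
For reference, these are the runtime knobs involved here - a minimal sketch
using commands available in 2.6 (getParameter/setParameter for the global
verbosity, logRotate to start a fresh file, and slowms via
db.setProfilingLevel):

// Read and reset the global verbosity at runtime, and rotate the log file
// so the oversized log can be compressed or removed.
db.adminCommand({ getParameter: 1, logLevel: 1 })   // current verbosity
db.adminCommand({ setParameter: 1, logLevel: 0 })   // back to the default
db.adminCommand({ logRotate: 1 })                   // start a fresh log file
// Slow operations are still logged at level 0 once they exceed slowms:
db.setProfilingLevel(0, 100)                        // profiler off, slowms = 100ms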

In your response it seems like the problem could potentially be swapping, or
some other OS process, but that seems pretty unlikely. Currently the only
thing running on the box is MongoDB, so there isn't much competition for
resources. The boxes have 32 cores, a 1TB Cachecade drive, 20TB of SATA
drives, 256GB RAM and 4GB swap, none of which I've observed to be under any
strain, especially in the middle of the day when these communication lapses
have been happening.

Is there any reason that MongoDB would fail to write to a log file? Or is
there any way to get logging focused on replica set health instead of
logging all the queries? I'm less interested in query data and more
interested in the general health of the cluster/servers, so I can prevent
this issue from occurring.
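
Something like the following is the kind of thing I have in mind - a minimal
sketch, assuming the field names rs.status() returns in 2.6, that only
records replica set state changes rather than every query:

// Poll rs.status() and print a line whenever a member's state changes.
var last = {};
while (true) {
    rs.status().members.forEach(function (m) {
        if (last[m.name] !== m.stateStr) {
            print(new Date().toISOString() + " " + m.name + " -> " + m.stateStr +
                  (m.pingMs !== undefined ? " (ping " + m.pingMs + "ms)" : ""));
            last[m.name] = m.stateStr;
        }
    });
    sleep(1000);
}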
Asya Kamsky
2014-12-08 19:04:54 UTC
Permalink
Unfortunately, in the current version all of the logging goes up when you
raise the verbosity level. Starting with 2.8 there are separate verbosity
controls for the different subcomponents, which is exactly what you would
want here.
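
For example (syntax as it looks in the 2.8 release candidates - double-check
against whichever release you end up on):

// Raise only the replication component, leaving everything else at default:
db.setLogLevel(2, "replication")
// The equivalent server parameter form:
db.adminCommand({ setParameter: 1,
                  logComponentVerbosity: { replication: { verbosity: 2 } } })
// -1 means "inherit the global default" again:
db.setLogLevel(-1, "replication")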

It sounds like you're saying this happens periodically - the primary logs
nothing for some non-trivial amount of time, and during that time it neither
responds to nor sends any heartbeats?

I'm wondering if that's coinciding with flushing data files... Is this
system in MMS by any chance?
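
A quick way to see whether the gaps line up with flushes - a minimal check,
assuming the MMAPv1 backgroundFlushing counters in serverStatus:

// How long periodic data-file flushes take and when the last one finished.
var bf = db.serverStatus().backgroundFlushing;
print("flushes: " + bf.flushes + ", avg ms: " + bf.average_ms +
      ", last ms: " + bf.last_ms + ", last finished: " + bf.last_finished);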

Asya
rhea ghosh
2014-12-08 19:58:28 UTC
Permalink
Oh sweet, so 2.8 will have more controls on the logs; I'll really be looking
forward to that.

Yes, this does happen periodically: the primary logs nothing for some
non-trivial amount of time, including not responding to heartbeats. We
really only noticed this behavior because there's a bug in 2.6.1 where the
primary wasn't able to roll back some writes when it stepped down after this
period of non-communication with the secondary and arbiter, so I had to
manually resync everything. I believe the rollback issue is fixed in 2.6.5,
and at some point we want to upgrade, but if the logging features are in 2.8
we may want to wait until that is out.

This system isn't on MMS currently because the server is not externally
facing and therefore can't connect to MMS. But I have set up some monitoring
so I can check CPU, memory, etc. and keep an eye on whether things appear
relatively normal. I'm also running mongostat in the background (so far our
system has been handling it without any impact on performance) and sending
that info to logstash and kibana, to get some of the same information MMS
provides. However, for this time period I can't tell if anything was
happening, because I don't see any information logged between 16:59:07 and
17:00:01 on the original primary server. If I look at the secondary,
however, it's fine and still reporting back information up until it becomes
the primary at 16:59:30. It would be nice to know whether this happened
while flushing data to files, but unfortunately we have a gap in our
monitoring at that time: it seems anything trying to access MongoDB got
nothing back for just under 30s. The systems running on it are production
systems, but they aren't so time-sensitive that we would notice a 30s lag,
which is potentially why we didn't catch this until we hit that rollback bug.

Is there any way other than MMS and mongostat to get the monitoring info
exposed?
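
What I have in mind is roughly this - a minimal sketch, assuming the 2.6
serverStatus field names, that pulls the same counters mongostat reads and
could be shipped to our logstash/kibana setup:

// Much of what mongostat and MMS display comes from serverStatus; a loop
// like this could feed the same counters into an external pipeline.
while (true) {
    var s = db.serverStatus();
    print(JSON.stringify({
        ts: new Date(),
        connections: s.connections.current,
        opcounters: s.opcounters,
        residentMB: s.mem.resident,
        flushAvgMs: s.backgroundFlushing ? s.backgroundFlushing.average_ms : null
    }));
    sleep(10 * 1000);
}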

Thanks for your help!
Post by Asya Kamsky
Unfortunately in the current version all the logging goes up when you up
the verbose level. Starting with 2.8 we have separate controls for
verbosity of different subcomponent and that is exactly what you would want
here.
It sounds like you're saying this happens periodically? (Where primary
logs nothing for some non-trivial amount of time) including not responding
or sending any heartbeats during that time?
I'm wondering if that's coinciding with flushing data files... Is this
system in MMS by any chance?
Asya
Hi Asya,
Thanks for responding. Yes, that's exactly what I thought was strange is
that the primary didn't log anything for a whole 30seconds, not the
attempts, nothing. And yes that is the entire log for that time period, I
assumed you don't really need to see too much before/after when it is
behaving normally again.
It is clear that the secondary and arbiter weren't able to communicate but
since I attached the network traces, which is what I initially thought
could be the problem and they were completely clean and it looks like the
primary just wasn't sending a response (if you look a couple responses ago
you'll see those attachments) even though the connection was healthy.
Also the reason I started this thread was that I wanted to increase the
logging level to see if there would be more useful information in the logs
if I do this but when I did that it ended up making my logs insanely large,
like from a typical day you would have 2-5MB of logs and that went up to
over 5GB just by changing the log level from 0 to 1 on the primary. To be
clear I upped the logs *after* this occurred so there was nothing useful,
it just seemed to log every query on the server and I eventually had to
turn them down because the drive I keep the logs on simply doesn't have
that kind of space. But I'm just trying to isolate where the problem could
be so I can try to work on that.
In your response it seems like the problem could potentially be swapping,
or some other OS process but that seems pretty unlikely. Currently the only
thing running on the box is MongoDB so there isn't much competition for
resources. The boxes have 32 cores, 1TB Cachecade drive, 20TB SATA drives,
256GB RAM and 4GB Swap none of which I've observed to be under any strain
especially in the middle of the day when these communication lapses have
been happening.
Is there any reason that MongoDB would fail to write to a log file? Or any
way to get some sort of logging especially on the replica set health
instead of logging all the queries. I'm less interested in getting query
data and more on the general health of the cluster/servers so I can prevent
this issue from occurring.
2014-11-20T16:59:13.954-0600 [rsHealthPoll] can't authenticate to
DBClientBase::findN: transport error: CHILXPROD061:27017 ns: local.$cmd
query: { getnonce: 1 }
2014-11-20T16:59:18.547-0600 [rsHealthPoll] can't authenticate to
DBClientBase::findN: transport error: CHILXPROD061:27017 ns: local.$cmd
query: { getnonce: 1 }
2014-11-20T16:59:40.597-0600 [rsHealthPoll] DBClientCursor::init call()
failed
failures until
2014-11-20T17:00:01.962-0600 [rsHealthPoll] replSet member
MONGOSERVER1:27017 is up
2014-11-20T17:00:00.435-0600 [rsHealthPoll] replSet member
MONGOSERVER1:27017 is up
xxxxx (173 connections now open)
2014-11-20T16:59:08.528-0600 [conn60124] end connection xx.x.xxx.xxx:xxxxx
(172 connections now open)
3745418
2014-11-20T16:59:29.613-0600 [clientcursormon] mapped (incl journal view
):3737954
2014-11-20T16:59:29.613-0600 [clientcursormon] connections:172
2014-11-20T16:59:29.613-0600 [clientcursormon] replication threads:32
xxxxx (171 connections now open)
2014-11-20T17:00:00.433-0600 [initandlisten] connection accepted from xx.x
.xxx.xxx:xxxxx #60126 (172 connections now open)
2014-11-20T17:00:00.433-0600 [initandlisten] connection accepted from xx.x
.xxx.xxx:xxxxx #60127 (173 connections now open)
There isn't really any way to tell what was happening on the OS level - is
it unusual for your primary to log *nothing* for entire 30 seconds? If so,
then it's possible that the mongod process was de-prioritized by the OS for
some reason and basically couldn't do anything while swapped out...
Is this a complete log from this time? The primary's rsHealthPoll should
be running and connecting to the other members of the replica set - we
don't see any evidence of that either - it really suggests very strongly
something was happening on the primary's box that was possibly blocking
everything from progressing normally...
Asya
Also along with the logs and the original primary server these are the
logs from the secondary server and the arbiter.
original secondary (MONGOSERVER2) logs
2014-11-20T16:59:01.735-0600 [conn59639] authenticate db: admin { authenticate: 1, nonce: "xxx", user: "svcClusterAdmin", key: "xxx" }
2014-11-20T16:59:02.000-0600 [repl index builder 30] Index Build(background): 36200500/124435801 29%
2014-11-20T16:59:05.003-0600 [repl index builder 30] Index Build(background): 36577700/124435801 29%
2014-11-20T16:59:08.000-0600 [repl index builder 30] Index Build(background): 36958700/124435801 29%
2014-11-20T16:59:08.528-0600 [conn59637] end connection 10.0.240.250:59921 (70 connections now open)
2014-11-20T16:59:11.000-0600 [repl index builder 30] Index Build(background): 37338400/124435801 30%
2014-11-20T16:59:14.000-0600 [repl index builder 30] Index Build(background): 37697700/124435801 30%
2014-11-20T16:59:17.010-0600 [repl index builder 30] Index Build(background): 38079300/124435801 30%
2014-11-20T16:59:18.527-0600 [rsHealthPoll] DBClientCursor::init call() failed
2014-11-20T16:59:18.547-0600 [rsHealthPoll] can't authenticate to DBClientBase::findN: transport error: CHILXPROD061:27017 ns: local.$cmd query: { getnonce: 1 }
2014-11-20T16:59:20.000-0600 [repl index builder 30] Index Build(background): 38458700/124435801 30%
2014-11-20T16:59:23.000-0600 [repl index builder 30] Index Build(background): 38835100/124435801 31%
2014-11-20T16:59:26.000-0600 [repl index builder 30] Index Build(background): 39211100/124435801 31%
2014-11-20T16:59:26.645-0600 [conn59638] end connection xxx.xx.xxx.xxx:xxxxx (69 connections now open)
2014-11-20T16:59:26.645-0600 [initandlisten] connection accepted from xxx.xx.xxx.xxx:xxxxx #59640 (70 connections now open)
2014-11-20T16:59:26.678-0600 [conn59640] authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-11-20T16:59:28.549-0600 [rsHealthPoll] DBClientCursor::init call() failed
2014-11-20T16:59:28.556-0600 [rsHealthPoll] replSet info
2014-11-20T16:59:28.556-0600 [rsHealthPoll] replSet member MONGOSERVER1:27017 is now in state DOWN
2014-11-20T16:59:28.556-0600 [rsMgr] replSet info electSelf 1
2014-11-20T16:59:29.000-0600 [repl index builder 30] Index Build(background): 39581200/124435801 31%
2014-11-20T16:59:30.766-0600 [rsMgr] replSet PRIMARY
2014-11-20T16:59:32.000-0600 [repl index builder 30] Index Build(background): 39965000/124435801 32%
serverStatus { serverStatus: 1 } ntoreturn:1 keyUpdates:0 numYields:0 locks(micros) r:49 reslen:3608 154ms
2014-11-20T16:59:35.000-0600 [repl index builder 30] Index Build(background): 40289100/124435801 32%
2014-11-20T16:59:38.000-0600 [repl index builder 30] Index Build(background): 40679700/124435801 32%
/opt/mongodata/journal/j._87
2014-11-20T16:59:40.597-0600 [rsHealthPoll] DBClientCursor::init call() failed
2014-11-20T16:59:41.000-0600 [repl index builder 30] Index Build(background): 41064300/124435801 33%
2014-11-20T16:59:44.000-0600 [repl index builder 30] Index Build(background): 41449500/124435801 33%
2014-11-20T16:59:47.000-0600 [repl index builder 30] Index Build(background): 41858100/124435801 33%
2014-11-20T16:59:50.000-0600 [repl index builder 30] Index Build(background): 42233500/124435801 33%
2014-11-20T16:59:52.598-0600 [rsHealthPoll] DBClientCursor::init call() failed
2014-11-20T16:59:53.000-0600 [repl index builder 30] Index Build(background): 42613000/124435801 34%
2014-11-20T16:59:56.000-0600 [repl index builder 30] Index Build(background): 42954300/124435801 34%
2014-11-20T16:59:56.685-0600 [conn59640] end connection xxx.xx.xxx.xxx:xxxxx (69 connections now open)
2014-11-20T16:59:56.686-0600 [initandlisten] connection accepted from 10.0.104.189:50816 #59641 (70 connections now open)
2014-11-20T16:59:56.726-0600 [conn59641] authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-11-20T16:59:59.000-0600 [repl index builder 30] Index Build(background): 43317900/124435801 34%
2014-11-20T17:00:00.435-0600 [rsHealthPoll] replSet member MONGOSERVER1:27017 is up
2014-11-20T17:00:00.435-0600 [rsHealthPoll] replSet member MONGOSERVER1:27017 is now in state PRIMARY
2014-11-20T17:00:00.435-0600 [rsMgr] another PRIMARY detected but it should step down since it was elected earlier than me
2014-11-20T17:00:00.435-0600 [rsMgr] another PRIMARY detected but it should step down since it was elected earlier than me

arbiter logs
2014-11-20T16:58:50.711-0600 [conn41035] authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-11-20T16:59:13.954-0600 [rsHealthPoll] DBClientCursor::init call() failed
2014-11-20T16:59:13.954-0600 [rsHealthPoll] can't authenticate to DBClientBase::findN: transport error: CHILXPROD061:27017 ns: local.$cmd query: { getnonce: 1 }
2014-11-20T16:59:20.741-0600 [conn41036] authenticate db: local { authenticate: 1, nonce: "xxx", user: "__system", key: "xxx" }
2014-11-20T16:59:23.955-0600 [rsHealthPoll] DBClientCursor::init call() failed
2014-11-20T16:59:23.955-0600 [rsHealthPoll] replSet info
2014-11-20T16:59:23.955-0600 [rsHealthPoll] replSet member MONGOSERVER1:27017 is now in state DOWN
2014-11-20T16:59:28.556-0600 [conn41036] replSet info voting yea for CHILXPROD062:27017 (1)
2014-11-20T16:59:32.680-0600 [rsHealthPoll] replSet member MONGOSERVER2:27017 is now in state PRIMARY
2014-11-20T16:59:35.956-0600 [rsHealthPoll] DBClientCursor::init call() failed
2014-11-20T16:59:47.958-0600 [rsHealthPoll] DBClientCursor::init call() failed
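
The two "another PRIMARY detected but it should step down" lines in the
secondary's log are MONGOSERVER2 noticing that the old primary has reappeared
while still claiming PRIMARY; per that message, the member elected earlier is
the one expected to step down. To verify that the set has settled on a single
primary after an episode like this, asking each member directly is enough - a
minimal sketch for the mongo shell, using the host names from these logs:

    // Ask each data-bearing member who it thinks the primary is; once the
    // election has settled they should all report the same "primary" host.
    ["MONGOSERVER1:27017", "MONGOSERVER2:27017"].forEach(function (host) {
        var conn = new Mongo(host);                  // direct connection to one member
        var res = conn.getDB("admin").isMaster();    // isMaster works without authentication
        print(host + "  ismaster=" + res.ismaster + "  primary=" + res.primary);
    });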