Discussion:
[mongodb-user] MongoDB heartbeat issue during a NAT problem
Martin Gradev
2018-12-04 15:30:35 UTC
Permalink
Hello,



I faced a very strange behavior from our MongoDB clusters in 2 projects
simultaneously. Both projects are not related to each other, so facing the
same issue simultaneously means that there was an issue not related to our
application which runs on top.



Here are some details about our setup:



1. I have 3 DCs - DC1, DC2 and DC3.

1.1. DC3 contain only MongoDB arbiter nodes.

1.2. DC1 and DC2 contain MongoDB replica members.

1.3. Each replica set contain 2 members from DC1, 2 members from DC2 and 1
member (arbiter) from DC3. DC3 member has priority 1, DC1 members have
priority 2 and DC2 members have priority 3.



This setup ensures that the active MongoDB DC will be DC2, as the nodes
have the highest priority. If DC2 is down, DC1 will take the primary role
until the issue in DC2 is resolved.


[image: mongodb_issue.jpg] <about:invalid#zClosurez>






















I experienced an unpredicted situation - The Cisco ASR router, responsible
for the NAT translation in DC2 started to reject the new connections, as
its NAT connection pool was exhausted. Surprisingly, this lead to *no
re-election for the Primary*. This resulted our MongoDB database clusters
to be not accessible from our frontend (apache2) nodes, as the client side
was trying to make a connection to the primary node in DC2. Because of the
above issue, the new connection was not possible.



In our MongoDB logs, I found that during our downtime, there were no health
check errors for our clusters. The only errors I found were related to the
oplog connectivity timeout.



2018-12-02T06:10:42.412Z I ASIO [NetworkInterfaceASIO-RS-0] Failed to
connect to 11.22.33.43:10000 - NetworkInterfaceExceededTimeLimit: Operation
timed out, request was RemoteCommand 299125357 -- target:11.22.33.43:10000
db:admin cmd:{ isMaster: 1 }

2018-12-02T06:10:42.413Z I ASIO [NetworkInterfaceASIO-RS-0] Dropping
all pooled connections to 11.22.33.43:10000 due to failed operation on a
connection

2018-12-02T06:10:42.413Z I ASIO [NetworkInterfaceASIO-RS-0] Failed to
connect to 11.22.33.43:10000 - NetworkInterfaceExceededTimeLimit: Operation
timed out, request was RemoteCommand 299125359 -- target:11.22.33.43:10000
db:admin cmd:{ isMaster: 1 }

2018-12-02T06:10:42.413Z I REPL [replication-678] Error returned from
oplog query (no more query restarts left): ExceededTimeLimit: Operation
timed out, request was RemoteCommand 299125357 -- target:11.22.33.43:10000
db:admin cmd:{ isMaster: 1 }

2018-12-02T06:10:42.413Z I ASIO [NetworkInterfaceASIO-RS-0] Connecting
to 11.22.33.43:10000

2018-12-02T06:10:42.413Z I REPL [SyncSourceFeedback] SyncSourceFeedback
error sending update to 11.22.33.43:10000: ExceededTimeLimit: Operation
timed out, request was RemoteCommand 299125357 -- target:11.22.33.43:10000
db:admin cmd:{ isMaster: 1 }

2018-12-02T06:10:42.413Z W REPL [rsBackgroundSync] Fetcher stopped
querying remote oplog with error: ExceededTimeLimit: Operation timed out,
request was RemoteCommand 299125357 -- target:11.22.33.43:10000 db:admin
cmd:{ isMaster: 1 }

2018-12-02T06:10:42.413Z I REPL [rsBackgroundSync] sync source
candidate: 11.22.33.46:10000

................

2018-12-02T06:11:02.413Z I ASIO [NetworkInterfaceASIO-RS-0] Failed to
connect to 11.22.33.43:10000 - NetworkInterfaceExceededTimeLimit: Operation
timed out, request was RemoteCommand 299125480 -- target:11.22.33.43:10000
db:admin cmd:{ isMaster: 1 }

2018-12-02T06:11:02.413Z I ASIO [NetworkInterfaceASIO-RS-0] Dropping
all pooled connections to 11.22.33.43:10000 due to failed operation on a
connection

2018-12-02T06:11:02.413Z I ASIO [NetworkInterfaceASIO-RS-0] Failed to
connect to 11.22.33.46:10000 - NetworkInterfaceExceededTimeLimit: Operation
timed out, request was RemoteCommand 299125482 -- target:11.22.33.46:10000
db:admin cmd:{ isMaster: 1 }

2018-12-02T06:11:02.414Z I ASIO [NetworkInterfaceASIO-RS-0] Dropping
all pooled connections to 11.22.33.46:10000 due to failed operation on a
connection

2018-12-02T06:11:02.414Z I REPL [replication-678] Blacklisting
11.22.33.46:10000 due to error: 'ExceededTimeLimit: Couldn't get a
connection within the time limit' for 10s until:
2018-12-02T06:11:12.413+0000

2018-12-02T06:11:02.414Z I REPL [replication-678] sync source
candidate: 11.22.33.43:10000



The 11.22.33.43 corresponds to db03 in DC2, 11.22.33.46 is db04 in DC2.



As you can see, the oplog synchronization timed out on db03 (DC2, which
lead to synchronization attempt from db04 (DC2). During the same time, the
health check was succeeding, as I am missing such errors in our logs.



Once I logged on the database from a node in DC1, I was able to see that
the health check timestamp is accurate. For more details, you can see the
attached rs.status and rs.config.



"optimeDate" : ISODate("2018-12-02T06:13:47Z"),

"optimeDurableDate" :
ISODate("2018-12-02T06:13:47Z"),

"lastHeartbeat" :
ISODate("2018-12-02T07:04:51.836Z"),

"lastHeartbeatRecv" :
ISODate("2018-12-02T07:04:51.223Z"),



From this log, I can see that the optimeDate was 51minutes behind the
lastHeartbeat, which proves that the heartbeat succeeded.



In order to resolve this, I had to manually reconfigure our replica sets to
exclude DC2, which resolved the issue.



I was able to reproduce the same behavior by applying this rule on our
*iptables -I INPUT -j DROP -p tcp --destination-port 10000 --syn*
This resulted the Secondary nodes to not keep up with the primary.



"optimeDate" :
ISODate("2018-12-04T14:09:58Z"),

"optimeDurableDate" :
ISODate("2018-12-04T14:09:58Z"),

"lastHeartbeat" :
ISODate("2018-12-04T14:45:51.137Z"),

"lastHeartbeatRecv" :
ISODate("2018-12-04T14:45:48.889Z"),



I need your help for 2 things:



1. Confirm if this is expected behavior.

2. Decide the best approach for countermeasure, as I need to overcome this
situation.


Best Regards,
Martin
--
You received this message because you are subscribed to the Google Groups "mongodb-user"
group.

For other MongoDB technical support options, see: https://docs.mongodb.com/manual/support/
---
You received this message because you are subscribed to the Google Groups "mongodb-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mongodb-user+***@googlegroups.com.
To post to this group, send email to mongodb-***@googlegroups.com.
Visit this group at https://groups.google.com/group/mongodb-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/mongodb-user/66fb8cec-34e3-4a8c-903c-1e6551b120e5%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Loading...