Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NetworkPolicy xDS stream stalls for ~20 seconds before finishing updates on all worker threads #958

Open
jrajahalme opened this issue Oct 3, 2024 · 2 comments

Comments

@jrajahalme
Copy link
Member

NPDS version 247 is received at 07:14:02, some worker threads update immediately after, but other only at 07:14:22, causing Cilium Agent to not be able to get ACKs on policy updates in time (100ms timeout is much less than 20 seconds):

from logs-cilium-envoy-kgzx9-cilium-envoy-20241003-071420.log:

2024-10-03T07:14:02.719903223Z [2024-10-03 07:14:02.719][11][debug][config] [external/envoy/source/extensions/config_subscription/grpc/grpc_mux_impl.cc:268] Received gRPC message for type.googleapis.com/cilium.NetworkPolicy at version 247
2024-10-03T07:14:02.720433651Z [2024-10-03 07:14:02.719][11][debug][config] [cilium/network_policy.cc:1200] Received Network Policy for endpoint 122 in onConfigUpdate() version 247
2024-10-03T07:14:02.720443269Z [2024-10-03 07:14:02.719][11][debug][config] [cilium/network_policy.cc:1200] Received Network Policy for endpoint 2491 in onConfigUpdate() version 247
2024-10-03T07:14:02.720451755Z [2024-10-03 07:14:02.719][11][debug][config] [cilium/network_policy.cc:1200] Received Network Policy for endpoint 1166 in onConfigUpdate() version 247
2024-10-03T07:14:02.720461132Z [2024-10-03 07:14:02.719][11][debug][config] [cilium/network_policy.cc:1200] Received Network Policy for endpoint 2283 in onConfigUpdate() version 247
2024-10-03T07:14:02.720470640Z [2024-10-03 07:14:02.719][11][debug][config] [cilium/network_policy.cc:1200] Received Network Policy for endpoint 3851 in onConfigUpdate() version 247
2024-10-03T07:14:02.720545489Z [2024-10-03 07:14:02.719][11][debug][config] [cilium/network_policy.cc:1200] Received Network Policy for endpoint 3704 in onConfigUpdate() version 247
2024-10-03T07:14:02.720578060Z [2024-10-03 07:14:02.719][11][debug][config] [cilium/network_policy.cc:1200] Received Network Policy for endpoint 4060 in onConfigUpdate() version 247
2024-10-03T07:14:02.720596565Z [2024-10-03 07:14:02.719][11][debug][init] [external/envoy/source/common/init/manager_impl.cc:24] added target NetworkPolicyMap manager for version 247 to init manager NetworkPolicyMap manager for version 247
2024-10-03T07:14:02.720688046Z [2024-10-03 07:14:02.719][18][trace][config] [cilium/network_policy.cc:1158] Cilium L7 NetworkPolicyMap::onConfigUpdate(): Starting updates on the worker thread for version 247
2024-10-03T07:14:02.720712881Z [2024-10-03 07:14:02.719][21][trace][config] [cilium/network_policy.cc:1158] Cilium L7 NetworkPolicyMap::onConfigUpdate(): Starting updates on the worker thread for version 247
2024-10-03T07:14:02.720718071Z [2024-10-03 07:14:02.719][19][trace][config] [cilium/network_policy.cc:1158] Cilium L7 NetworkPolicyMap::onConfigUpdate(): Starting updates on the worker thread for version 247
2024-10-03T07:14:02.720723432Z [2024-10-03 07:14:02.719][11][debug][init] [external/envoy/source/common/init/manager_impl.cc:53] init manager NetworkPolicyMap manager for version 247 initializing
2024-10-03T07:14:02.720750111Z [2024-10-03 07:14:02.720][11][debug][init] [external/envoy/source/common/init/target_impl.cc:15] init manager NetworkPolicyMap manager for version 247 initializing target NetworkPolicyMap manager for version 247
2024-10-03T07:14:02.720758056Z [2024-10-03 07:14:02.720][11][debug][config] [external/envoy/source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:89] gRPC config for type.googleapis.com/cilium.NetworkPolicy accepted with 7 resources with version 247
2024-10-03T07:14:22.599105864Z [2024-10-03 07:14:22.596][20][trace][config] [cilium/network_policy.cc:1158] Cilium L7 NetworkPolicyMap::onConfigUpdate(): Starting updates on the worker thread for version 247
2024-10-03T07:14:22.599476795Z [2024-10-03 07:14:22.598][11][trace][config] [cilium/network_policy.cc:1158] Cilium L7 NetworkPolicyMap::onConfigUpdate(): Starting updates on the main thread for version 247
2024-10-03T07:14:22.599488767Z [2024-10-03 07:14:22.598][11][debug][init] [external/envoy/source/common/init/watcher_impl.cc:14] target NetworkPolicyMap manager for version 247 initialized, notifying init manager NetworkPolicyMap manager for version 247
2024-10-03T07:14:22.599492905Z [2024-10-03 07:14:22.598][11][debug][init] [external/envoy/source/common/init/watcher_impl.cc:14] init manager NetworkPolicyMap manager for version 247 initialized, notifying NetworkPolicyMap manager for version 247
2024-10-03T07:14:22.610338237Z [2024-10-03 07:14:22.610][11][debug][init] [external/envoy/source/common/init/watcher_impl.cc:31] init manager NetworkPolicyMap manager for version 247 destroyed
2024-10-03T07:14:22.621076334Z [2024-10-03 07:14:22.621][11][debug][init] [external/envoy/source/common/init/target_impl.cc:34] target NetworkPolicyMap manager for version 247 destroyed
2024-10-03T07:14:22.621218048Z [2024-10-03 07:14:22.621][11][debug][init] [external/envoy/source/common/init/watcher_impl.cc:31] NetworkPolicyMap manager for version 247 destroyed

Agent logs (logs-cilium-x8hcj-cilium-agent-20241003-071420.log):

2024-10-03T07:14:01.000668320Z time="2024-10-03T07:14:01Z" level=debug msg="preparing new cache transaction: upserting 1 entries, deleting 0 entries" subsys=xds xdsCachedVersion=247 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicy
2024-10-03T07:14:01.001285060Z time="2024-10-03T07:14:01Z" level=debug msg="cache unmodified by transaction; aborting" subsys=xds xdsCachedVersion=247 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicy
2024-10-03T07:14:01.001443175Z time="2024-10-03T07:14:01Z" level=debug msg="preparing new cache transaction: upserting 1 entries, deleting 0 entries" subsys=xds xdsCachedVersion=247 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicy
2024-10-03T07:14:01.001572256Z time="2024-10-03T07:14:01Z" level=debug msg="cache unmodified by transaction; aborting" subsys=xds xdsCachedVersion=247 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicy
2024-10-03T07:14:02.718246744Z time="2024-10-03T07:14:02Z" level=debug msg="preparing new cache transaction: upserting 1 entries, deleting 0 entries" subsys=xds xdsCachedVersion=247 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicy
2024-10-03T07:14:02.718389742Z time="2024-10-03T07:14:02Z" level=debug msg="cache unmodified by transaction; aborting" subsys=xds xdsCachedVersion=247 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicy
2024-10-03T07:14:02.718940838Z time="2024-10-03T07:14:02Z" level=debug msg="preparing new cache transaction: upserting 1 entries, deleting 0 entries" subsys=xds xdsCachedVersion=247 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicy
2024-10-03T07:14:02.718961556Z time="2024-10-03T07:14:02Z" level=debug msg="updating resource in cache" subsys=xds xdsCachedVersion=247 xdsResourceName=3851 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicy
2024-10-03T07:14:02.719011749Z time="2024-10-03T07:14:02Z" level=debug msg="committing cache transaction and notifying of new version" subsys=xds xdsCachedVersion=247 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicy
2024-10-03T07:14:02.719113429Z time="2024-10-03T07:14:02Z" level=debug msg="sending xDS response with 7 resources" subsys=xds xdsCachedVersion=247 xdsCanary=false xdsClientNode="host~127.0.0.1~no-id~localdomain" xdsNonce=247 xdsStreamID=5 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicy
...
2024-10-03T07:14:02.818710515Z time="2024-10-03T07:14:02Z" level=warning msg="Timed out waiting for datapath updates of FQDN IP information; returning response. Consider increasing --tofqdns-proxy-response-max-delay if this keeps happening." subsys=daemon
...
2024-10-03T07:14:22.604181483Z time="2024-10-03T07:14:22Z" level=debug msg="received request from xDS stream" subsys=xds xdsAckedVersion=247 xdsClientNode="host~127.0.0.1~no-id~localdomain" xdsNonce=247 xdsStreamID=5 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicy
2024-10-03T07:14:22.604194406Z time="2024-10-03T07:14:22Z" level=debug msg="notifying observers of ACKs" subsys=xds xdsAckedVersion=247 xdsClientNode="host~127.0.0.1~no-id~localdomain" xdsNonce=247 xdsStreamID=5 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicy
2024-10-03T07:14:22.604199907Z time="2024-10-03T07:14:22Z" level=debug msg="completing ACK: &{247 type.googleapis.com/cilium.NetworkPolicy map[]}" subsys=xds xdsAckedVersion=247 xdsClientNode=127.0.0.1 xdsNonce=247 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicy
2024-10-03T07:14:22.604205507Z time="2024-10-03T07:14:22Z" level=debug msg="starting watch on 0 resources" subsys=xds xdsAckedVersion=247 xdsClientNode="host~127.0.0.1~no-id~localdomain" xdsNonce=247 xdsStreamID=5 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicy
2024-10-03T07:14:22.604210677Z time="2024-10-03T07:14:22Z" level=debug msg="getting 0 resources from set" subsys=xds xdsAckedVersion=247 xdsClientNode=127.0.0.1 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicy
2024-10-03T07:14:22.604215827Z time="2024-10-03T07:14:22Z" level=debug msg="no resource names requested, returning all 7 resources" subsys=xds xdsAckedVersion=247 xdsClientNode=127.0.0.1 xdsTypeURL=type.googleapis.com/cilium.NetworkPolicy
@jrajahalme
Copy link
Member Author

@jrajahalme
Copy link
Member Author

The affected worker thread [20] is doing something between 07:14:02 and 07:14:22:

2024-10-03T07:14:02.124384806Z [2024-10-03 07:14:02.124][20][debug][connection] [external/envoy/source/common/network/connection_impl.cc:278] [Tags: "ConnectionId":"128"] closing socket: 1
2024-10-03T07:14:12.357880125Z [2024-10-03 07:14:12.357][20][trace][connection] [external/envoy/source/common/network/connection_impl.cc:469] [Tags: "ConnectionId":"128"] raising connection event 1
2024-10-03T07:14:12.357898981Z [2024-10-03 07:14:12.357][20][trace][conn_handler] [external/envoy/source/common/listener_manager/active_stream_listener_base.cc:126] [Tags: "ConnectionId":"128"] tcp connection on event 1
2024-10-03T07:14:12.357904441Z [2024-10-03 07:14:12.357][20][debug][conn_handler] [external/envoy/source/common/listener_manager/active_stream_listener_base.cc:136] [Tags: "ConnectionId":"128"] adding to cleanup list
2024-10-03T07:14:12.357909881Z [2024-10-03 07:14:12.357][20][trace][main] [external/envoy/source/common/event/dispatcher_impl.cc:228] item added to deferred deletion list (size=1)
2024-10-03T07:14:12.357915311Z [2024-10-03 07:14:12.357][20][trace][main] [external/envoy/source/common/event/dispatcher_impl.cc:228] item added to deferred deletion list (size=2)
2024-10-03T07:14:12.357920881Z [2024-10-03 07:14:12.357][20][trace][main] [external/envoy/source/common/event/dispatcher_impl.cc:122] clearing deferred deletion list (size=2)
2024-10-03T07:14:12.358216078Z [2024-10-03 07:14:12.357][20][debug][connection] [external/envoy/source/common/network/connection_impl.cc:146] [Tags: "ConnectionId":"132"] closing data_to_write=0 type=1
2024-10-03T07:14:12.358227920Z [2024-10-03 07:14:12.358][20][debug][connection] [external/envoy/source/common/network/connection_impl.cc:278] [Tags: "ConnectionId":"132"] closing socket: 1
2024-10-03T07:14:22.599030373Z [2024-10-03 07:14:22.596][20][trace][connection] [external/envoy/source/common/network/connection_impl.cc:469] [Tags: "ConnectionId":"132"] raising connection event 1
2024-10-03T07:14:22.599052324Z [2024-10-03 07:14:22.596][20][trace][conn_handler] [external/envoy/source/common/listener_manager/active_stream_listener_base.cc:126] [Tags: "ConnectionId":"132"] tcp connection on event 1
2024-10-03T07:14:22.599058585Z [2024-10-03 07:14:22.596][20][debug][conn_handler] [external/envoy/source/common/listener_manager/active_stream_listener_base.cc:136] [Tags: "ConnectionId":"132"] adding to cleanup list
2024-10-03T07:14:22.599092148Z [2024-10-03 07:14:22.596][20][trace][main] [external/envoy/source/common/event/dispatcher_impl.cc:228] item added to deferred deletion list (size=1)
2024-10-03T07:14:22.599097278Z [2024-10-03 07:14:22.596][20][trace][main] [external/envoy/source/common/event/dispatcher_impl.cc:228] item added to deferred deletion list (size=2)
2024-10-03T07:14:22.599101696Z [2024-10-03 07:14:22.596][20][trace][main] [external/envoy/source/common/event/dispatcher_impl.cc:122] clearing deferred deletion list (size=2)
2024-10-03T07:14:22.599105864Z [2024-10-03 07:14:22.596][20][trace][config] [cilium/network_policy.cc:1158] Cilium L7 NetworkPolicyMap::onConfigUpdate(): Starting updates on the worker thread for version 247
2024-10-03T07:14:22.599109921Z [2024-10-03 07:14:22.596][20][trace][config] [cilium/network_policy.cc:1165] Cilium updating network policy for endpoint fd00:10:244:2::ed86
2024-10-03T07:14:22.599113959Z [2024-10-03 07:14:22.596][20][trace][config] [cilium/network_policy.cc:1165] Cilium updating network policy for endpoint 10.244.2.133

It is curious that the policy update starts progressing on the same exact millisecond with a closing event on connection [132] after it has drained. There is no history for this connection in the logs, but it seems like the worker event loop was stalling until that connection had drained out?

@jrajahalme jrajahalme changed the title NetworkPolicy xDS stream stalls for 20 seconds before finishing updates on all worker threads NetworkPolicy xDS stream stalls for ~20 seconds before finishing updates on all worker threads Oct 3, 2024
jrajahalme added a commit that referenced this issue Oct 3, 2024
…pdate policies

Add a warning log if worker threads take longer than 100ms to update
their network policy maps. This is to help identify cases like #958,
where worker thread [20] took almost 20 seconds to start running the
update.

Related: #958

Signed-off-by: Jarno Rajahalme <[email protected]>
jrajahalme added a commit that referenced this issue Oct 3, 2024
…pdate policies

Add a warning log if worker threads take longer than 100ms to update
their network policy maps. This is to help identify cases like #958,
where worker thread [20] took almost 20 seconds to start running the
update.

Related: #958

Signed-off-by: Jarno Rajahalme <[email protected]>
github-merge-queue bot pushed a commit that referenced this issue Oct 3, 2024
…pdate policies

Add a warning log if worker threads take longer than 100ms to update
their network policy maps. This is to help identify cases like #958,
where worker thread [20] took almost 20 seconds to start running the
update.

Related: #958

Signed-off-by: Jarno Rajahalme <[email protected]>
jrajahalme added a commit that referenced this issue Oct 3, 2024
…pdate policies

Add a warning log if worker threads take longer than 100ms to update
their network policy maps. This is to help identify cases like #958,
where worker thread [20] took almost 20 seconds to start running the
update.

Related: #958

Signed-off-by: Jarno Rajahalme <[email protected]>
jrajahalme added a commit that referenced this issue Oct 3, 2024
…pdate policies

[ upstream commit 769db97 ]

Add a warning log if worker threads take longer than 100ms to update
their network policy maps. This is to help identify cases like #958,
where worker thread [20] took almost 20 seconds to start running the
update.

Related: #958

Signed-off-by: Jarno Rajahalme <[email protected]>
sayboras pushed a commit that referenced this issue Oct 3, 2024
…pdate policies

[ upstream commit 769db97 ]

Add a warning log if worker threads take longer than 100ms to update
their network policy maps. This is to help identify cases like #958,
where worker thread [20] took almost 20 seconds to start running the
update.

Related: #958

Signed-off-by: Jarno Rajahalme <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant