-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
:enhancement: Docer image for testing purporsses #80
Comments
Hey @ansible42. Coincidentally, we were just discussing this internally. :) Could you open an issue on the ni/nilrt.git repo? That's the top-level of the NI LinuxRT project, and is were we would like to track requests like this. I'd be interested to understand more about your requirements, so that I can feed that into our backlog - but we can have that discussion on the new issue. |
I'll shoot you an email as some info is not public. |
Closing this issue in favor of ni/nilrt#175 since it's not kernel-related. |
In commit 0345691 ("tick/rcu: Stop allowing RCU_SOFTIRQ in idle") the new function report_idle_softirq() was created by breaking code out of the existing can_stop_idle_tick() for kernels v5.18 and newer. In doing so, the code essentially went from a one conditional: if (a && b && c) warn(); to a three conditional: if (!a) return; if (!b) return; if (!c) return; warn(); However, it seems one of the conditionals didn't get a "!" removed. Compare the instance of local_bh_blocked() in the old code: - if (ratelimit < 10 && !local_bh_blocked() && - (local_softirq_pending() & SOFTIRQ_STOP_IDLE_MASK)) { - pr_warn("NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #%02x!!!\n", - (unsigned int) local_softirq_pending()); - ratelimit++; - } ...to the usage in the new (5.18+) code: + /* On RT, softirqs handling may be waiting on some lock */ + if (!local_bh_blocked()) + return false; It seems apparent that the "!" should be removed from the new code. This issue lay dormant until another fixup for the same commit was added in commit a7e282c ("tick/rcu: Fix bogus ratelimit condition"). This commit realized the ratelimit was essentially set to zero instead of ten, and hence *no* softirq pending messages would ever be issued. Once this commit was backported via linux-stable, both the v6.1 and v6.4 preempt-rt kernels started printing out 10 instances of this at boot: NOHZ tick-stop error: local softirq work is pending, handler ni#80!!! Just to double check my understanding of things, I confirmed that the v5.18-rt did print the pending-80 messages with a cherry pick of the ratelimit fix, and then confirmed no pending softirq messages were printed with a revert of mainline's 034569 on a v5.18-rt baseline. Finally I confirmed it fixed the issue on v6.1-rt and v6.4-rt, and also didn't break anything on a defconfig of mainline master of today. Fixes: 0345691 ("tick/rcu: Stop allowing RCU_SOFTIRQ in idle") Cc: Wen Yang <[email protected]> Cc: Thomas Gleixner <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: "Paul E. McKenney" <[email protected]> Cc: Frederic Weisbecker <[email protected]> Signed-off-by: Paul Gortmaker <[email protected]> Reviewed-by: Wen Yang <[email protected]> Tested-by: Ahmad Fatoum <[email protected]> Signed-off-by: Sebastian Andrzej Siewior <[email protected]> Link: https://lore.kernel.org/r/[email protected]
[ Upstream commit 96c1fa0 ] In commit 0345691 ("tick/rcu: Stop allowing RCU_SOFTIRQ in idle") the new function report_idle_softirq() was created by breaking code out of the existing can_stop_idle_tick() for kernels v5.18 and newer. In doing so, the code essentially went from a one conditional: if (a && b && c) warn(); to a three conditional: if (!a) return; if (!b) return; if (!c) return; warn(); But that conversion got the condition for the RT specific local_bh_blocked() wrong. The original condition was: !local_bh_blocked() but the conversion failed to negate it so it ended up as: if (!local_bh_blocked()) return false; This issue lay dormant until another fixup for the same commit was added in commit a7e282c ("tick/rcu: Fix bogus ratelimit condition"). This commit realized the ratelimit was essentially set to zero instead of ten, and hence *no* softirq pending messages would ever be issued. Once this commit was backported via linux-stable, both the v6.1 and v6.4 preempt-rt kernels started printing out 10 instances of this at boot: NOHZ tick-stop error: local softirq work is pending, handler ni#80!!! Remove the negation and return when local_bh_blocked() evaluates to true to bring the correct behaviour back. Fixes: 0345691 ("tick/rcu: Stop allowing RCU_SOFTIRQ in idle") Signed-off-by: Paul Gortmaker <[email protected]> Signed-off-by: Thomas Gleixner <[email protected]> Tested-by: Ahmad Fatoum <[email protected]> Reviewed-by: Wen Yang <[email protected]> Acked-by: Frederic Weisbecker <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sasha Levin <[email protected]>
commit 26bda3d upstream. Right now it is possible to do a vfe_get() with the internal reference count at 1. If vfe_check_clock_rates() returns non-zero then we will leave the reference count as-is and run: - pm_runtime_put_sync() - vfe->ops->pm_domain_off() skip: - camss_disable_clocks() Subsequent vfe_put() calls will when the ref-count is non-zero unconditionally run: - pm_runtime_put_sync() - vfe->ops->pm_domain_off() - camss_disable_clocks() vfe_get() should not attempt to roll-back on error when the ref-count is non-zero as the upper layers will still do their own vfe_put() operations. vfe_put() will drop the reference count and do the necessary power domain release, the cleanup jumps in vfe_get() should only be run when the ref-count is zero. [ 50.095796] CPU: 7 PID: 3075 Comm: cam Not tainted 6.3.2+ ni#80 [ 50.095798] Hardware name: LENOVO 21BXCTO1WW/21BXCTO1WW, BIOS N3HET82W (1.54 ) 05/26/2023 [ 50.095799] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 50.095802] pc : refcount_warn_saturate+0xf4/0x148 [ 50.095804] lr : refcount_warn_saturate+0xf4/0x148 [ 50.095805] sp : ffff80000c7cb8b0 [ 50.095806] x29: ffff80000c7cb8b0 x28: ffff16ecc0e3fc10 x27: 0000000000000000 [ 50.095810] x26: 0000000000000000 x25: 0000000000020802 x24: 0000000000000000 [ 50.095813] x23: ffff16ecc7360640 x22: 00000000ffffffff x21: 0000000000000005 [ 50.095815] x20: ffff16ed175f4400 x19: ffffb4d9852942a8 x18: ffffffffffffffff [ 50.095818] x17: ffffb4d9852d4a48 x16: ffffb4d983da5db8 x15: ffff80000c7cb320 [ 50.095821] x14: 0000000000000001 x13: 2e656572662d7265 x12: 7466612d65737520 [ 50.095823] x11: 00000000ffffefff x10: ffffb4d9850cebf0 x9 : ffffb4d9835cf954 [ 50.095826] x8 : 0000000000017fe8 x7 : c0000000ffffefff x6 : 0000000000057fa8 [ 50.095829] x5 : ffff16f813fe3d08 x4 : 0000000000000000 x3 : ffff621e8f4d2000 [ 50.095832] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff16ed32119040 [ 50.095835] Call trace: [ 50.095836] refcount_warn_saturate+0xf4/0x148 [ 50.095838] device_link_put_kref+0x84/0xc8 [ 50.095843] device_link_del+0x38/0x58 [ 50.095846] vfe_pm_domain_off+0x3c/0x50 [qcom_camss] [ 50.095860] vfe_put+0x114/0x140 [qcom_camss] [ 50.095869] csid_set_power+0x2c8/0x408 [qcom_camss] [ 50.095878] pipeline_pm_power_one+0x164/0x170 [videodev] [ 50.095896] pipeline_pm_power+0xc4/0x110 [videodev] [ 50.095909] v4l2_pipeline_pm_use+0x5c/0xa0 [videodev] [ 50.095923] v4l2_pipeline_pm_get+0x1c/0x30 [videodev] [ 50.095937] video_open+0x7c/0x100 [qcom_camss] [ 50.095945] v4l2_open+0x84/0x130 [videodev] [ 50.095960] chrdev_open+0xc8/0x250 [ 50.095964] do_dentry_open+0x1bc/0x498 [ 50.095966] vfs_open+0x34/0x40 [ 50.095968] path_openat+0xb44/0xf20 [ 50.095971] do_filp_open+0xa4/0x160 [ 50.095974] do_sys_openat2+0xc8/0x188 [ 50.095975] __arm64_sys_openat+0x6c/0xb8 [ 50.095977] invoke_syscall+0x50/0x128 [ 50.095982] el0_svc_common.constprop.0+0x4c/0x100 [ 50.095985] do_el0_svc+0x40/0xa8 [ 50.095988] el0_svc+0x2c/0x88 [ 50.095991] el0t_64_sync_handler+0xf4/0x120 [ 50.095994] el0t_64_sync+0x190/0x198 [ 50.095996] ---[ end trace 0000000000000000 ]--- Fixes: 7790969 ("media: camss: vfe: Fix runtime PM imbalance on error") Cc: [email protected] Signed-off-by: Bryan O'Donoghue <[email protected]> Reviewed-by: Laurent Pinchart <[email protected]> Signed-off-by: Hans Verkuil <[email protected]> Signed-off-by: Greg Kroah-Hartman <[email protected]>
[ Upstream commit e2b706c ] When I perform the following test operations: 1.ip link add br0 type bridge 2.brctl addif br0 eth0 3.ip addr add 239.0.0.1/32 dev eth0 4.ip addr add 239.0.0.1/32 dev br0 5.ip addr add 224.0.0.1/32 dev br0 6.while ((1)) do ifconfig br0 up ifconfig br0 down done 7.send IGMPv2 query packets to port eth0 continuously. For example, ./mausezahn ethX -c 0 "01 00 5e 00 00 01 00 72 19 88 aa 02 08 00 45 00 00 1c 00 01 00 00 01 02 0e 7f c0 a8 0a b7 e0 00 00 01 11 64 ee 9b 00 00 00 00" The preceding tests may trigger the refcnt uaf issue of the mc list. The stack is as follows: refcount_t: addition on 0; use-after-free. WARNING: CPU: 21 PID: 144 at lib/refcount.c:25 refcount_warn_saturate (lib/refcount.c:25) CPU: 21 PID: 144 Comm: ksoftirqd/21 Kdump: loaded Not tainted 6.7.0-rc1-next-20231117-dirty ni#80 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 RIP: 0010:refcount_warn_saturate (lib/refcount.c:25) RSP: 0018:ffffb68f00657910 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff8a00c3bf96c0 RCX: ffff8a07b6160908 RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff8a07b6160900 RBP: ffff8a00cba36862 R08: 0000000000000000 R09: 00000000ffff7fff R10: ffffb68f006577c0 R11: ffffffffb0fdcdc8 R12: ffff8a00c3bf9680 R13: ffff8a00c3bf96f0 R14: 0000000000000000 R15: ffff8a00d8766e00 FS: 0000000000000000(0000) GS:ffff8a07b6140000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055f10b520b28 CR3: 000000039741a000 CR4: 00000000000006f0 Call Trace: <TASK> igmp_heard_query (net/ipv4/igmp.c:1068) igmp_rcv (net/ipv4/igmp.c:1132) ip_protocol_deliver_rcu (net/ipv4/ip_input.c:205) ip_local_deliver_finish (net/ipv4/ip_input.c:234) __netif_receive_skb_one_core (net/core/dev.c:5529) netif_receive_skb_internal (net/core/dev.c:5729) netif_receive_skb (net/core/dev.c:5788) br_handle_frame_finish (net/bridge/br_input.c:216) nf_hook_bridge_pre (net/bridge/br_input.c:294) __netif_receive_skb_core (net/core/dev.c:5423) __netif_receive_skb_list_core (net/core/dev.c:5606) __netif_receive_skb_list (net/core/dev.c:5674) netif_receive_skb_list_internal (net/core/dev.c:5764) napi_gro_receive (net/core/gro.c:609) e1000_clean_rx_irq (drivers/net/ethernet/intel/e1000/e1000_main.c:4467) e1000_clean (drivers/net/ethernet/intel/e1000/e1000_main.c:3805) __napi_poll (net/core/dev.c:6533) net_rx_action (net/core/dev.c:6735) __do_softirq (kernel/softirq.c:554) run_ksoftirqd (kernel/softirq.c:913) smpboot_thread_fn (kernel/smpboot.c:164) kthread (kernel/kthread.c:388) ret_from_fork (arch/x86/kernel/process.c:153) ret_from_fork_asm (arch/x86/entry/entry_64.S:250) </TASK> The root causes are as follows: Thread A Thread B ... netif_receive_skb br_dev_stop ... br_multicast_leave_snoopers ... __ip_mc_dec_group ... __igmp_group_dropped igmp_rcv igmp_stop_timer igmp_heard_query //ref = 1 ip_ma_put igmp_mod_timer refcount_dec_and_test igmp_start_timer //ref = 0 ... refcount_inc //ref increases from 0 When the device receives an IGMPv2 Query message, it starts the timer immediately, regardless of whether the device is running. If the device is down and has left the multicast group, it will cause the mc list refcount uaf issue. Fixes: 1da177e ("Linux-2.6.12-rc2") Signed-off-by: Zhengchao Shao <[email protected]> Reviewed-by: Eric Dumazet <[email protected]> Reviewed-by: Hangbin Liu <[email protected]> Signed-off-by: David S. Miller <[email protected]> Signed-off-by: Sasha Levin <[email protected]>
Including a docker image in the official pipeline would be helpful so that we can use that image for testing in our own CI workflows.
The text was updated successfully, but these errors were encountered: