Message ID | 20220930173652.1251349-1-koba.ko@canonical.com (mailing list archive) |
---|---|
State | Changes Requested |
Headers | show |
Series | [V2] dmaengine: Fix client_count is countered one more incorrectly. | expand |
Tested-by: Joel Savitz <jsavitz@redhat.com> Reviewed-by: Joel Savitz <jsavitz@redhat.com>
Hi Vinod, Thoughts on this patch? Maybe changing the summary to "dmaengine: Fix double increment of client_count in dma_chan_get()" would be clearer? On Sat, Oct 01, 2022 at 01:36:52AM +0800, Koba Ko wrote: > If the passed client_count is 0, > it would be incremented by balance_ref_count first > then increment one more. > This would cause client_count to 2. > > cat /sys/class/dma/dma0chan*/in_use > 2 > 2 > 2 Would this be better? The first time dma_chan_get() is called for a channel the channel client_count is incorrectly incremented twice for public channels, first in balance_ref_count(), and again prior to returning. This results in an incorrect client count which will lead to the channel resources not being freed when they should be. A simple test of repeated module load and unload of async_tx on a Dell Power Edge R7425 also shows this resulting in a kref underflow warning. Regards, Jerry > > Fixes: d2f4f99db3e9 ("dmaengine: Rework dma_chan_get") > Signed-off-by: Koba Ko <koba.ko@canonical.com> > Reviewed-by: Jie Hai <haijie1@huawei.com> > Test-by: Jie Hai <haijie1@huawei.com> > Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> > Reviewed-by: Dave Jiang <dave.jiang@intel.com> > > --- > V2: Remove [3/3] on subject. > --- > drivers/dma/dmaengine.c | 7 ++++--- > 1 file changed, 4 insertions(+), 3 deletions(-) > > diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c > index 2cfa8458b51be..78f8a9f3ad825 100644 > --- a/drivers/dma/dmaengine.c > +++ b/drivers/dma/dmaengine.c > @@ -451,7 +451,8 @@ static int dma_chan_get(struct dma_chan *chan) > /* The channel is already in use, update client count */ > if (chan->client_count) { > __module_get(owner); > - goto out; > + chan->client_count++; > + return 0; > } > > if (!try_module_get(owner)) > @@ -470,11 +471,11 @@ static int dma_chan_get(struct dma_chan *chan) > goto err_out; > } > > + chan->client_count++; > + > if (!dma_has_cap(DMA_PRIVATE, chan->device->cap_mask)) > balance_ref_count(chan); > > -out: > - chan->client_count++; > return 0; > > err_out: > -- > 2.25.1 >
On 02-11-22, 11:07, Jerry Snitselaar wrote: > Hi Vinod, > > Thoughts on this patch? > > Maybe changing the summary to "dmaengine: Fix double increment of client_count in dma_chan_get()" > would be clearer? Yes that would be better > > On Sat, Oct 01, 2022 at 01:36:52AM +0800, Koba Ko wrote: > > If the passed client_count is 0, > > it would be incremented by balance_ref_count first > > then increment one more. > > This would cause client_count to 2. > > > > cat /sys/class/dma/dma0chan*/in_use > > 2 > > 2 > > 2 > > Would this be better? > > The first time dma_chan_get() is called for a channel the channel > client_count is incorrectly incremented twice for public channels, > first in balance_ref_count(), and again prior to returning. This > results in an incorrect client count which will lead to the > channel resources not being freed when they should be. A simple > test of repeated module load and unload of async_tx on a Dell > Power Edge R7425 also shows this resulting in a kref underflow > warning. Agree, also if you have the underflow warning handy, do add it to the log
On Fri, Nov 04, 2022 at 06:55:21PM +0530, Vinod Koul wrote: > On 02-11-22, 11:07, Jerry Snitselaar wrote: > > Hi Vinod, > > > > Thoughts on this patch? > > > > Maybe changing the summary to "dmaengine: Fix double increment of client_count in dma_chan_get()" > > would be clearer? > > Yes that would be better > > > > > On Sat, Oct 01, 2022 at 01:36:52AM +0800, Koba Ko wrote: > > > If the passed client_count is 0, > > > it would be incremented by balance_ref_count first > > > then increment one more. > > > This would cause client_count to 2. > > > > > > cat /sys/class/dma/dma0chan*/in_use > > > 2 > > > 2 > > > 2 > > > > Would this be better? > > > > The first time dma_chan_get() is called for a channel the channel > > client_count is incorrectly incremented twice for public channels, > > first in balance_ref_count(), and again prior to returning. This > > results in an incorrect client count which will lead to the > > channel resources not being freed when they should be. A simple > > test of repeated module load and unload of async_tx on a Dell > > Power Edge R7425 also shows this resulting in a kref underflow > > warning. > > Agree, also if you have the underflow warning handy, do add it to the > log > I don't know if Canonical saw that in their testing, but that was how our QE initially found the issue. [ 124.329662] async_tx: api initialized (async) [ 129.000627] async_tx: api initialized (async) [ 130.047839] ------------[ cut here ]------------ [ 130.052472] refcount_t: underflow; use-after-free. [ 130.057279] WARNING: CPU: 3 PID: 19364 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110 [ 130.065811] Modules linked in: async_tx(-) rfkill intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd ipmi_ssif kvm_amd dcdbas kvm mgag200 drm_shmem_helper acpi_ipmi irqbypass drm_kms_helper ipmi_si syscopyarea sysfillrect rapl pcspkr ipmi_devintf sysimgblt fb_sys_fops k10temp i2c_piix4 ipmi_msghandler acpi_power_meter acpi_cpufreq vfat fat drm fuse xfs libcrc32c sd_mod t10_pi sg ahci crct10dif_pclmul libahci crc32_pclmul crc32c_intel ghash_clmulni_intel igb megaraid_sas i40e libata i2c_algo_bit ccp sp5100_tco dca dm_mirror dm_region_hash dm_log dm_mod [last unloaded: async_tx] [ 130.117361] CPU: 3 PID: 19364 Comm: modprobe Kdump: loaded Not tainted 5.14.0-185.el9.x86_64 #1 [ 130.126091] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.18.0 01/17/2022 [ 130.133806] RIP: 0010:refcount_warn_saturate+0xba/0x110 [ 130.139041] Code: 01 01 e8 6d bd 55 00 0f 0b e9 72 9d 8a 00 80 3d 26 18 9c 01 00 75 85 48 c7 c7 f8 a3 03 9d c6 05 16 18 9c 01 01 e8 4a bd 55 00 <0f> 0b e9 4f 9d 8a 00 80 3d 01 18 9c 01 00 0f 85 5e ff ff ff 48 c7 [ 130.157807] RSP: 0018:ffffbf98898afe68 EFLAGS: 00010286 [ 130.163036] RAX: 0000000000000000 RBX: ffff9da06028e598 RCX: 0000000000000000 [ 130.170172] RDX: ffff9daf9de26480 RSI: ffff9daf9de198a0 RDI: ffff9daf9de198a0 [ 130.177316] RBP: ffff9da7cddf3970 R08: 0000000000000000 R09: 00000000ffff7fff [ 130.184459] R10: ffffbf98898afd00 R11: ffffffff9d9e8c28 R12: ffff9da7cddf1970 [ 130.191596] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 130.198739] FS: 00007f646435c740(0000) GS:ffff9daf9de00000(0000) knlGS:0000000000000000 [ 130.206832] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 130.212586] CR2: 00007f6463b214f0 CR3: 00000008ab98c000 CR4: 00000000003506e0 [ 130.219729] Call Trace: [ 130.222192] <TASK> [ 130.224305] dma_chan_put+0x10d/0x110 [ 130.227988] dmaengine_put+0x7a/0xa0 [ 130.231575] __do_sys_delete_module.constprop.0+0x178/0x280 [ 130.237157] ? syscall_trace_enter.constprop.0+0x145/0x1d0 [ 130.242652] do_syscall_64+0x5c/0x90 [ 130.246240] ? exc_page_fault+0x62/0x150 [ 130.250178] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 130.255243] RIP: 0033:0x7f6463a3f5ab [ 130.258830] Code: 73 01 c3 48 8b 0d 75 a8 1b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 45 a8 1b 00 f7 d8 64 89 01 48 [ 130.277591] RSP: 002b:00007fff22f972c8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 [ 130.285164] RAX: ffffffffffffffda RBX: 000055b6786edd40 RCX: 00007f6463a3f5ab [ 130.292303] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055b6786edda8 [ 130.299443] RBP: 000055b6786edd40 R08: 0000000000000000 R09: 0000000000000000 [ 130.306584] R10: 00007f6463b9eac0 R11: 0000000000000206 R12: 000055b6786edda8 [ 130.313731] R13: 0000000000000000 R14: 000055b6786edda8 R15: 00007fff22f995f8 [ 130.320875] </TASK> [ 130.323081] ---[ end trace eff7156d56b5cf25 ]--- Regards, Jerry > -- > ~Vinod
On 04-11-22, 08:36, Jerry Snitselaar wrote: > On Fri, Nov 04, 2022 at 06:55:21PM +0530, Vinod Koul wrote: > > On 02-11-22, 11:07, Jerry Snitselaar wrote: > > > Hi Vinod, > > > > > > Thoughts on this patch? > > > > > > Maybe changing the summary to "dmaengine: Fix double increment of client_count in dma_chan_get()" > > > would be clearer? > > > > Yes that would be better > > > > > > > > On Sat, Oct 01, 2022 at 01:36:52AM +0800, Koba Ko wrote: > > > > If the passed client_count is 0, > > > > it would be incremented by balance_ref_count first > > > > then increment one more. > > > > This would cause client_count to 2. > > > > > > > > cat /sys/class/dma/dma0chan*/in_use > > > > 2 > > > > 2 > > > > 2 > > > > > > Would this be better? > > > > > > The first time dma_chan_get() is called for a channel the channel > > > client_count is incorrectly incremented twice for public channels, > > > first in balance_ref_count(), and again prior to returning. This > > > results in an incorrect client count which will lead to the > > > channel resources not being freed when they should be. A simple > > > test of repeated module load and unload of async_tx on a Dell > > > Power Edge R7425 also shows this resulting in a kref underflow > > > warning. > > > > Agree, also if you have the underflow warning handy, do add it to the > > log > > > > I don't know if Canonical saw that in their testing, but that was how our > QE initially found the issue. > > [ 124.329662] async_tx: api initialized (async) > [ 129.000627] async_tx: api initialized (async) > [ 130.047839] ------------[ cut here ]------------ > [ 130.052472] refcount_t: underflow; use-after-free. > [ 130.057279] WARNING: CPU: 3 PID: 19364 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110 > [ 130.065811] Modules linked in: async_tx(-) rfkill intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd ipmi_ssif kvm_amd dcdbas kvm mgag200 drm_shmem_helper acpi_ipmi irqbypass drm_kms_helper ipmi_si syscopyarea sysfillrect rapl pcspkr ipmi_devintf sysimgblt fb_sys_fops k10temp i2c_piix4 ipmi_msghandler acpi_power_meter acpi_cpufreq vfat fat drm fuse xfs libcrc32c sd_mod t10_pi sg ahci crct10dif_pclmul libahci crc32_pclmul crc32c_intel ghash_clmulni_intel igb megaraid_sas i40e libata i2c_algo_bit ccp sp5100_tco dca dm_mirror dm_region_hash dm_log dm_mod [last unloaded: async_tx] > [ 130.117361] CPU: 3 PID: 19364 Comm: modprobe Kdump: loaded Not tainted 5.14.0-185.el9.x86_64 #1 > [ 130.126091] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.18.0 01/17/2022 > [ 130.133806] RIP: 0010:refcount_warn_saturate+0xba/0x110 > [ 130.139041] Code: 01 01 e8 6d bd 55 00 0f 0b e9 72 9d 8a 00 80 3d 26 18 9c 01 00 75 85 48 c7 c7 f8 a3 03 9d c6 05 16 18 9c 01 01 e8 4a bd 55 00 <0f> 0b e9 4f 9d 8a 00 80 3d 01 18 9c 01 00 0f 85 5e ff ff ff 48 c7 > [ 130.157807] RSP: 0018:ffffbf98898afe68 EFLAGS: 00010286 > [ 130.163036] RAX: 0000000000000000 RBX: ffff9da06028e598 RCX: 0000000000000000 > [ 130.170172] RDX: ffff9daf9de26480 RSI: ffff9daf9de198a0 RDI: ffff9daf9de198a0 > [ 130.177316] RBP: ffff9da7cddf3970 R08: 0000000000000000 R09: 00000000ffff7fff > [ 130.184459] R10: ffffbf98898afd00 R11: ffffffff9d9e8c28 R12: ffff9da7cddf1970 > [ 130.191596] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 > [ 130.198739] FS: 00007f646435c740(0000) GS:ffff9daf9de00000(0000) knlGS:0000000000000000 > [ 130.206832] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 130.212586] CR2: 00007f6463b214f0 CR3: 00000008ab98c000 CR4: 00000000003506e0 > [ 130.219729] Call Trace: > [ 130.222192] <TASK> > [ 130.224305] dma_chan_put+0x10d/0x110 > [ 130.227988] dmaengine_put+0x7a/0xa0 > [ 130.231575] __do_sys_delete_module.constprop.0+0x178/0x280 > [ 130.237157] ? syscall_trace_enter.constprop.0+0x145/0x1d0 > [ 130.242652] do_syscall_64+0x5c/0x90 > [ 130.246240] ? exc_page_fault+0x62/0x150 > [ 130.250178] entry_SYSCALL_64_after_hwframe+0x63/0xcd > [ 130.255243] RIP: 0033:0x7f6463a3f5ab > [ 130.258830] Code: 73 01 c3 48 8b 0d 75 a8 1b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 45 a8 1b 00 f7 d8 64 89 01 48 > [ 130.277591] RSP: 002b:00007fff22f972c8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 > [ 130.285164] RAX: ffffffffffffffda RBX: 000055b6786edd40 RCX: 00007f6463a3f5ab > [ 130.292303] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055b6786edda8 > [ 130.299443] RBP: 000055b6786edd40 R08: 0000000000000000 R09: 0000000000000000 > [ 130.306584] R10: 00007f6463b9eac0 R11: 0000000000000206 R12: 000055b6786edda8 > [ 130.313731] R13: 0000000000000000 R14: 000055b6786edda8 R15: 00007fff22f995f8 > [ 130.320875] </TASK> > [ 130.323081] ---[ end trace eff7156d56b5cf25 ]--- This is good, give more details and should be added to log in next iteration
hi Vinod, Should i re-submit v3 as per comments? Koba Ko On Sat, Nov 5, 2022 at 3:18 PM Vinod Koul <vkoul@kernel.org> wrote: > > On 04-11-22, 08:36, Jerry Snitselaar wrote: > > On Fri, Nov 04, 2022 at 06:55:21PM +0530, Vinod Koul wrote: > > > On 02-11-22, 11:07, Jerry Snitselaar wrote: > > > > Hi Vinod, > > > > > > > > Thoughts on this patch? > > > > > > > > Maybe changing the summary to "dmaengine: Fix double increment of client_count in dma_chan_get()" > > > > would be clearer? > > > > > > Yes that would be better > > > > > > > > > > > On Sat, Oct 01, 2022 at 01:36:52AM +0800, Koba Ko wrote: > > > > > If the passed client_count is 0, > > > > > it would be incremented by balance_ref_count first > > > > > then increment one more. > > > > > This would cause client_count to 2. > > > > > > > > > > cat /sys/class/dma/dma0chan*/in_use > > > > > 2 > > > > > 2 > > > > > 2 > > > > > > > > Would this be better? > > > > > > > > The first time dma_chan_get() is called for a channel the channel > > > > client_count is incorrectly incremented twice for public channels, > > > > first in balance_ref_count(), and again prior to returning. This > > > > results in an incorrect client count which will lead to the > > > > channel resources not being freed when they should be. A simple > > > > test of repeated module load and unload of async_tx on a Dell > > > > Power Edge R7425 also shows this resulting in a kref underflow > > > > warning. > > > > > > Agree, also if you have the underflow warning handy, do add it to the > > > log > > > > > > > I don't know if Canonical saw that in their testing, but that was how our > > QE initially found the issue. > > > > [ 124.329662] async_tx: api initialized (async) > > [ 129.000627] async_tx: api initialized (async) > > [ 130.047839] ------------[ cut here ]------------ > > [ 130.052472] refcount_t: underflow; use-after-free. > > [ 130.057279] WARNING: CPU: 3 PID: 19364 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110 > > [ 130.065811] Modules linked in: async_tx(-) rfkill intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd ipmi_ssif kvm_amd dcdbas kvm mgag200 drm_shmem_helper acpi_ipmi irqbypass drm_kms_helper ipmi_si syscopyarea sysfillrect rapl pcspkr ipmi_devintf sysimgblt fb_sys_fops k10temp i2c_piix4 ipmi_msghandler acpi_power_meter acpi_cpufreq vfat fat drm fuse xfs libcrc32c sd_mod t10_pi sg ahci crct10dif_pclmul libahci crc32_pclmul crc32c_intel ghash_clmulni_intel igb megaraid_sas i40e libata i2c_algo_bit ccp sp5100_tco dca dm_mirror dm_region_hash dm_log dm_mod [last unloaded: async_tx] > > [ 130.117361] CPU: 3 PID: 19364 Comm: modprobe Kdump: loaded Not tainted 5.14.0-185.el9.x86_64 #1 > > [ 130.126091] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.18.0 01/17/2022 > > [ 130.133806] RIP: 0010:refcount_warn_saturate+0xba/0x110 > > [ 130.139041] Code: 01 01 e8 6d bd 55 00 0f 0b e9 72 9d 8a 00 80 3d 26 18 9c 01 00 75 85 48 c7 c7 f8 a3 03 9d c6 05 16 18 9c 01 01 e8 4a bd 55 00 <0f> 0b e9 4f 9d 8a 00 80 3d 01 18 9c 01 00 0f 85 5e ff ff ff 48 c7 > > [ 130.157807] RSP: 0018:ffffbf98898afe68 EFLAGS: 00010286 > > [ 130.163036] RAX: 0000000000000000 RBX: ffff9da06028e598 RCX: 0000000000000000 > > [ 130.170172] RDX: ffff9daf9de26480 RSI: ffff9daf9de198a0 RDI: ffff9daf9de198a0 > > [ 130.177316] RBP: ffff9da7cddf3970 R08: 0000000000000000 R09: 00000000ffff7fff > > [ 130.184459] R10: ffffbf98898afd00 R11: ffffffff9d9e8c28 R12: ffff9da7cddf1970 > > [ 130.191596] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 > > [ 130.198739] FS: 00007f646435c740(0000) GS:ffff9daf9de00000(0000) knlGS:0000000000000000 > > [ 130.206832] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [ 130.212586] CR2: 00007f6463b214f0 CR3: 00000008ab98c000 CR4: 00000000003506e0 > > [ 130.219729] Call Trace: > > [ 130.222192] <TASK> > > [ 130.224305] dma_chan_put+0x10d/0x110 > > [ 130.227988] dmaengine_put+0x7a/0xa0 > > [ 130.231575] __do_sys_delete_module.constprop.0+0x178/0x280 > > [ 130.237157] ? syscall_trace_enter.constprop.0+0x145/0x1d0 > > [ 130.242652] do_syscall_64+0x5c/0x90 > > [ 130.246240] ? exc_page_fault+0x62/0x150 > > [ 130.250178] entry_SYSCALL_64_after_hwframe+0x63/0xcd > > [ 130.255243] RIP: 0033:0x7f6463a3f5ab > > [ 130.258830] Code: 73 01 c3 48 8b 0d 75 a8 1b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 45 a8 1b 00 f7 d8 64 89 01 48 > > [ 130.277591] RSP: 002b:00007fff22f972c8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 > > [ 130.285164] RAX: ffffffffffffffda RBX: 000055b6786edd40 RCX: 00007f6463a3f5ab > > [ 130.292303] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055b6786edda8 > > [ 130.299443] RBP: 000055b6786edd40 R08: 0000000000000000 R09: 0000000000000000 > > [ 130.306584] R10: 00007f6463b9eac0 R11: 0000000000000206 R12: 000055b6786edda8 > > [ 130.313731] R13: 0000000000000000 R14: 000055b6786edda8 R15: 00007fff22f995f8 > > [ 130.320875] </TASK> > > [ 130.323081] ---[ end trace eff7156d56b5cf25 ]--- > > This is good, give more details and should be added to log in next > iteration > > -- > ~Vinod
I would like to send v3. Because the comments provided by Jerry are more detail, i would like replace the original with his comments. Here's draft, dmaengine: Fix double increment of client_count in dma_chan_get() The first time dma_chan_get() is called for a channel the channel client_count is incorrectly incremented twice for public channels, first in balance_ref_count(), and again prior to returning. This results in an incorrect client count which will lead to the channel resources not being freed when they should be. A simple test of repeated module load and unload of async_tx on a Dell Power Edge R7425 also shows this resulting in a kref underflow warning. [ 124.329662] async_tx: api initialized (async) [ 129.000627] async_tx: api initialized (async) [ 130.047839] ------------[ cut here ]------------ [ 130.052472] refcount_t: underflow; use-after-free. [ 130.057279] WARNING: CPU: 3 PID: 19364 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110 [ 130.065811] Modules linked in: async_tx(-) rfkill intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd ipmi_ssif kvm_amd dcdbas kvm mgag200 drm_shmem_helper acpi_ipmi irqbypass drm_kms_helper ipmi_si syscopyarea sysfillrect rapl pcspkr ipmi_devintf sysimgblt fb_sys_fops k10temp i2c_piix4 ipmi_msghandler acpi_power_meter acpi_cpufreq vfat fat drm fuse xfs libcrc32c sd_mod t10_pi sg ahci crct10dif_pclmul libahci crc32_pclmul crc32c_intel ghash_clmulni_intel igb megaraid_sas i40e libata i2c_algo_bit ccp sp5100_tco dca dm_mirror dm_region_hash dm_log dm_mod [last unloaded: async_tx] [ 130.117361] CPU: 3 PID: 19364 Comm: modprobe Kdump: loaded Not tainted 5.14.0-185.el9.x86_64 #1 [ 130.126091] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.18.0 01/17/2022 [ 130.133806] RIP: 0010:refcount_warn_saturate+0xba/0x110 [ 130.139041] Code: 01 01 e8 6d bd 55 00 0f 0b e9 72 9d 8a 00 80 3d 26 18 9c 01 00 75 85 48 c7 c7 f8 a3 03 9d c6 05 16 18 9c 01 01 e8 4a bd 55 00 <0f> 0b e9 4f 9d 8a 00 80 3d 01 18 9c 01 00 0f 85 5e ff ff ff 48 c7 [ 130.157807] RSP: 0018:ffffbf98898afe68 EFLAGS: 00010286 [ 130.163036] RAX: 0000000000000000 RBX: ffff9da06028e598 RCX: 0000000000000000 [ 130.170172] RDX: ffff9daf9de26480 RSI: ffff9daf9de198a0 RDI: ffff9daf9de198a0 [ 130.177316] RBP: ffff9da7cddf3970 R08: 0000000000000000 R09: 00000000ffff7fff [ 130.184459] R10: ffffbf98898afd00 R11: ffffffff9d9e8c28 R12: ffff9da7cddf1970 [ 130.191596] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 130.198739] FS: 00007f646435c740(0000) GS:ffff9daf9de00000(0000) knlGS:0000000000000000 [ 130.206832] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 130.212586] CR2: 00007f6463b214f0 CR3: 00000008ab98c000 CR4: 00000000003506e0 [ 130.219729] Call Trace: [ 130.222192] <TASK> [ 130.224305] dma_chan_put+0x10d/0x110 [ 130.227988] dmaengine_put+0x7a/0xa0 [ 130.231575] __do_sys_delete_module.constprop.0+0x178/0x280 [ 130.237157] ? syscall_trace_enter.constprop.0+0x145/0x1d0 [ 130.242652] do_syscall_64+0x5c/0x90 [ 130.246240] ? exc_page_fault+0x62/0x150 [ 130.250178] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 130.255243] RIP: 0033:0x7f6463a3f5ab [ 130.258830] Code: 73 01 c3 48 8b 0d 75 a8 1b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 45 a8 1b 00 f7 d8 64 89 01 48 [ 130.277591] RSP: 002b:00007fff22f972c8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 [ 130.285164] RAX: ffffffffffffffda RBX: 000055b6786edd40 RCX: 00007f6463a3f5ab [ 130.292303] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055b6786edda8 [ 130.299443] RBP: 000055b6786edd40 R08: 0000000000000000 R09: 0000000000000000 [ 130.306584] R10: 00007f6463b9eac0 R11: 0000000000000206 R12: 000055b6786edda8 [ 130.313731] R13: 0000000000000000 R14: 000055b6786edda8 R15: 00007fff22f995f8 [ 130.320875] </TASK> [ 130.323081] ---[ end trace eff7156d56b5cf25 ]--- cat /sys/class/dma/dma0chan*/in_use would get the wrong result. 2 2 2 On Wed, Nov 16, 2022 at 2:32 PM Koba Ko <koba.ko@canonical.com> wrote: > > hi Vinod, > Should i re-submit v3 as per comments? > > Koba Ko > > On Sat, Nov 5, 2022 at 3:18 PM Vinod Koul <vkoul@kernel.org> wrote: > > > > On 04-11-22, 08:36, Jerry Snitselaar wrote: > > > On Fri, Nov 04, 2022 at 06:55:21PM +0530, Vinod Koul wrote: > > > > On 02-11-22, 11:07, Jerry Snitselaar wrote: > > > > > Hi Vinod, > > > > > > > > > > Thoughts on this patch? > > > > > > > > > > Maybe changing the summary to "dmaengine: Fix double increment of client_count in dma_chan_get()" > > > > > would be clearer? > > > > > > > > Yes that would be better > > > > > > > > > > > > > > On Sat, Oct 01, 2022 at 01:36:52AM +0800, Koba Ko wrote: > > > > > > If the passed client_count is 0, > > > > > > it would be incremented by balance_ref_count first > > > > > > then increment one more. > > > > > > This would cause client_count to 2. > > > > > > > > > > > > cat /sys/class/dma/dma0chan*/in_use > > > > > > 2 > > > > > > 2 > > > > > > 2 > > > > > > > > > > Would this be better? > > > > > > > > > > The first time dma_chan_get() is called for a channel the channel > > > > > client_count is incorrectly incremented twice for public channels, > > > > > first in balance_ref_count(), and again prior to returning. This > > > > > results in an incorrect client count which will lead to the > > > > > channel resources not being freed when they should be. A simple > > > > > test of repeated module load and unload of async_tx on a Dell > > > > > Power Edge R7425 also shows this resulting in a kref underflow > > > > > warning. > > > > > > > > Agree, also if you have the underflow warning handy, do add it to the > > > > log > > > > > > > > > > I don't know if Canonical saw that in their testing, but that was how our > > > QE initially found the issue. > > > > > > [ 124.329662] async_tx: api initialized (async) > > > [ 129.000627] async_tx: api initialized (async) > > > [ 130.047839] ------------[ cut here ]------------ > > > [ 130.052472] refcount_t: underflow; use-after-free. > > > [ 130.057279] WARNING: CPU: 3 PID: 19364 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110 > > > [ 130.065811] Modules linked in: async_tx(-) rfkill intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd ipmi_ssif kvm_amd dcdbas kvm mgag200 drm_shmem_helper acpi_ipmi irqbypass drm_kms_helper ipmi_si syscopyarea sysfillrect rapl pcspkr ipmi_devintf sysimgblt fb_sys_fops k10temp i2c_piix4 ipmi_msghandler acpi_power_meter acpi_cpufreq vfat fat drm fuse xfs libcrc32c sd_mod t10_pi sg ahci crct10dif_pclmul libahci crc32_pclmul crc32c_intel ghash_clmulni_intel igb megaraid_sas i40e libata i2c_algo_bit ccp sp5100_tco dca dm_mirror dm_region_hash dm_log dm_mod [last unloaded: async_tx] > > > [ 130.117361] CPU: 3 PID: 19364 Comm: modprobe Kdump: loaded Not tainted 5.14.0-185.el9.x86_64 #1 > > > [ 130.126091] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.18.0 01/17/2022 > > > [ 130.133806] RIP: 0010:refcount_warn_saturate+0xba/0x110 > > > [ 130.139041] Code: 01 01 e8 6d bd 55 00 0f 0b e9 72 9d 8a 00 80 3d 26 18 9c 01 00 75 85 48 c7 c7 f8 a3 03 9d c6 05 16 18 9c 01 01 e8 4a bd 55 00 <0f> 0b e9 4f 9d 8a 00 80 3d 01 18 9c 01 00 0f 85 5e ff ff ff 48 c7 > > > [ 130.157807] RSP: 0018:ffffbf98898afe68 EFLAGS: 00010286 > > > [ 130.163036] RAX: 0000000000000000 RBX: ffff9da06028e598 RCX: 0000000000000000 > > > [ 130.170172] RDX: ffff9daf9de26480 RSI: ffff9daf9de198a0 RDI: ffff9daf9de198a0 > > > [ 130.177316] RBP: ffff9da7cddf3970 R08: 0000000000000000 R09: 00000000ffff7fff > > > [ 130.184459] R10: ffffbf98898afd00 R11: ffffffff9d9e8c28 R12: ffff9da7cddf1970 > > > [ 130.191596] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 > > > [ 130.198739] FS: 00007f646435c740(0000) GS:ffff9daf9de00000(0000) knlGS:0000000000000000 > > > [ 130.206832] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > [ 130.212586] CR2: 00007f6463b214f0 CR3: 00000008ab98c000 CR4: 00000000003506e0 > > > [ 130.219729] Call Trace: > > > [ 130.222192] <TASK> > > > [ 130.224305] dma_chan_put+0x10d/0x110 > > > [ 130.227988] dmaengine_put+0x7a/0xa0 > > > [ 130.231575] __do_sys_delete_module.constprop.0+0x178/0x280 > > > [ 130.237157] ? syscall_trace_enter.constprop.0+0x145/0x1d0 > > > [ 130.242652] do_syscall_64+0x5c/0x90 > > > [ 130.246240] ? exc_page_fault+0x62/0x150 > > > [ 130.250178] entry_SYSCALL_64_after_hwframe+0x63/0xcd > > > [ 130.255243] RIP: 0033:0x7f6463a3f5ab > > > [ 130.258830] Code: 73 01 c3 48 8b 0d 75 a8 1b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 45 a8 1b 00 f7 d8 64 89 01 48 > > > [ 130.277591] RSP: 002b:00007fff22f972c8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 > > > [ 130.285164] RAX: ffffffffffffffda RBX: 000055b6786edd40 RCX: 00007f6463a3f5ab > > > [ 130.292303] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000055b6786edda8 > > > [ 130.299443] RBP: 000055b6786edd40 R08: 0000000000000000 R09: 0000000000000000 > > > [ 130.306584] R10: 00007f6463b9eac0 R11: 0000000000000206 R12: 000055b6786edda8 > > > [ 130.313731] R13: 0000000000000000 R14: 000055b6786edda8 R15: 00007fff22f995f8 > > > [ 130.320875] </TASK> > > > [ 130.323081] ---[ end trace eff7156d56b5cf25 ]--- > > > > This is good, give more details and should be added to log in next > > iteration > > > > -- > > ~Vinod
diff --git a/drivers/dma/dmaengine.c b/drivers/dma/dmaengine.c index 2cfa8458b51be..78f8a9f3ad825 100644 --- a/drivers/dma/dmaengine.c +++ b/drivers/dma/dmaengine.c @@ -451,7 +451,8 @@ static int dma_chan_get(struct dma_chan *chan) /* The channel is already in use, update client count */ if (chan->client_count) { __module_get(owner); - goto out; + chan->client_count++; + return 0; } if (!try_module_get(owner)) @@ -470,11 +471,11 @@ static int dma_chan_get(struct dma_chan *chan) goto err_out; } + chan->client_count++; + if (!dma_has_cap(DMA_PRIVATE, chan->device->cap_mask)) balance_ref_count(chan); -out: - chan->client_count++; return 0; err_out: