Message ID: 20231130152047.200169-1-dongyun.liu@transsion.com (mailing list archive)
State:      New, archived
Series:     zram: Using GFP_ATOMIC instead of GFP_KERNEL to allocate bitmap memory in backing_dev_store
On 11/30/23 8:20 AM, Dongyun Liu wrote:
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index d77d3664ca08..ee6c22c50e09 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -514,7 +514,7 @@ static ssize_t backing_dev_store(struct device *dev,
>
> 	nr_pages = i_size_read(inode) >> PAGE_SHIFT;
> 	bitmap_sz = BITS_TO_LONGS(nr_pages) * sizeof(long);
> -	bitmap = kvzalloc(bitmap_sz, GFP_KERNEL);
> +	bitmap = kmalloc(bitmap_sz, GFP_ATOMIC);
> 	if (!bitmap) {
> 		err = -ENOMEM;
> 		goto out;

Outside of this moving from a zeroed alloc to one that does not, the
change looks woefully incomplete. Why does this allocation need to be
GFP_ATOMIC, and:

1) file_name = kmalloc(PATH_MAX, GFP_KERNEL); does not
2) filp_open() -> getname_kernel() -> __getname() does not
3) filp_open() -> getname_kernel() does not
4) bdev_open_by_dev() does not

IOW, you have a slew of GFP_KERNEL allocations in there, and you
probably just patched the largest one. But the core issue remains.

The whole handling of backing_dev_store() looks pretty broken.
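[Editor's note: for orientation, the four allocation sites enumerated above map onto backing_dev_store() roughly as below. This is a condensed, non-compilable sketch paraphrasing upstream drivers/block/zram/zram_drv.c, not the literal function body.]

```c
static ssize_t backing_dev_store(struct device *dev, ...)
{
	/* (1) taken before init_lock -- GFP_KERNEL is unproblematic here */
	file_name = kmalloc(PATH_MAX, GFP_KERNEL);

	down_write(&zram->init_lock);
	...
	/* (2)(3) filp_open() -> getname_kernel() -> __getname(),
	 * all allocating with GFP_KERNEL under init_lock */
	backing_dev = filp_open(file_name, O_RDWR | O_LARGEFILE, 0);
	...
	/* (4) bdev_open_by_dev() also allocates with GFP_KERNEL */
	...
	/* the only allocation the patch changed */
	bitmap = kvzalloc(bitmap_sz, GFP_KERNEL);
	...
	up_write(&zram->init_lock);
}
```

Sites (2) through (4) all run with init_lock held for writing, which is why changing only the bitmap allocation leaves the reported reclaim recursion reachable through the other paths.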
On 2023/11/30 23:37, Jens Axboe wrote:
> On 11/30/23 8:20 AM, Dongyun Liu wrote:
>> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
>> index d77d3664ca08..ee6c22c50e09 100644
>> --- a/drivers/block/zram/zram_drv.c
>> +++ b/drivers/block/zram/zram_drv.c
>> @@ -514,7 +514,7 @@ static ssize_t backing_dev_store(struct device *dev,
>>
>> 	nr_pages = i_size_read(inode) >> PAGE_SHIFT;
>> 	bitmap_sz = BITS_TO_LONGS(nr_pages) * sizeof(long);
>> -	bitmap = kvzalloc(bitmap_sz, GFP_KERNEL);
>> +	bitmap = kmalloc(bitmap_sz, GFP_ATOMIC);
>> 	if (!bitmap) {
>> 		err = -ENOMEM;
>> 		goto out;
>
> Outside of this moving from a zeroed alloc to one that does not, the
> change looks woefully incomplete. Why does this allocation need to be
> GFP_ATOMIC, and:

Using GFP_ATOMIC indicates that the caller cannot reclaim or sleep, so
we can avoid the risk of deadlock when acquiring zram->init_lock again
in zram_bvec_write.

> 1) file_name = kmalloc(PATH_MAX, GFP_KERNEL); does not

There is no zram->init_lock held here, so there is no need to use
GFP_ATOMIC.

> 2) filp_open() -> getname_kernel() -> __getname() does not
> 3) filp_open() -> getname_kernel() does not
> 4) bdev_open_by_dev() does not

These are missing the use of GFP_ATOMIC.

> IOW, you have a slew of GFP_KERNEL allocations in there, and you
> probably just patched the largest one. But the core issue remains.
>
> The whole handling of backing_dev_store() looks pretty broken.

Indeed, this patch only solves the biggest problem and does not fix the
issue fundamentally, because there are several paths in
backing_dev_store that allocate memory while holding zram->init_lock,
all of which would need to be modified, and I did not consider it
thoroughly. Obviously, a larger and better patch is needed to eliminate
this risk, but it is currently not necessary. Thank you for your
kindness and patience.
On 11/30/23 11:51 PM, Dongyun Liu wrote:
> On 2023/11/30 23:37, Jens Axboe wrote:
>> On 11/30/23 8:20 AM, Dongyun Liu wrote:
>>> [patch diff snipped]
>>
>> Outside of this moving from a zeroed alloc to one that does not, the
>> change looks woefully incomplete. Why does this allocation need to be
>> GFP_ATOMIC, and:
>
> Using GFP_ATOMIC indicates that the caller cannot reclaim or sleep, so
> we can avoid the risk of deadlock when acquiring zram->init_lock again
> in zram_bvec_write.

Yes, I am very much aware of how gfp allocation flags work and why it's
broken. It was a rhetorical question as to why you think you could get
away with just fixing one of them.

>> 1) file_name = kmalloc(PATH_MAX, GFP_KERNEL); does not
>
> There is no zram->init_lock held here, so there is no need to use
> GFP_ATOMIC.

True

>> 2) filp_open() -> getname_kernel() -> __getname() does not
>> 3) filp_open() -> getname_kernel() does not
>> 4) bdev_open_by_dev() does not
>
> These are missing the use of GFP_ATOMIC.

Indeed!

>> IOW, you have a slew of GFP_KERNEL allocations in there, and you
>> probably just patched the largest one. But the core issue remains.
>>
>> The whole handling of backing_dev_store() looks pretty broken.
>
> Indeed, this patch only solves the biggest problem and does not fix
> the issue fundamentally, because there are several paths in
> backing_dev_store that allocate memory while holding zram->init_lock,
> all of which would need to be modified, and I did not consider it
> thoroughly. Obviously, a larger and better patch is needed to
> eliminate this risk, but it is currently not necessary.

You agree that it doesn't fix the issue, it just happens to fix the one
that you hit. And then you jump to the conclusion that this is all
that's needed to fix it. Ehm, confused?
On (23/12/01 07:19), Jens Axboe wrote:
> >> IOW, you have a slew of GFP_KERNEL allocations in there, and you
> >> probably just patched the largest one. But the core issue remains.
> >>
> >> The whole handling of backing_dev_store() looks pretty broken.
> >
> > Indeed, this patch only solves the biggest problem and does not fix
> > the issue fundamentally, because there are several paths in
> > backing_dev_store that allocate memory while holding
> > zram->init_lock, all of which would need to be modified, and I did
> > not consider it thoroughly. Obviously, a larger and better patch is
> > needed to eliminate this risk, but it is currently not necessary.
>
> You agree that it doesn't fix the issue, it just happens to fix the
> one that you hit. And then you jump to the conclusion that this is all
> that's needed to fix it. Ehm, confused?

Yeah. zram is very sensitive to memory - zsmalloc pool needs physical
pages (allocated on demand) to keep data written to zram. zram probably
simply should not be used on systems under such heavy memory pressure.
Throwing GFP_ATOMICs at zram isn't going to fix anything.

Jens, you said that zram's backing device handling is broken. Got a
minute to elaborate?
On (23/11/30 23:20), Dongyun Liu wrote:
> INFO: task init:331 blocked for more than 120 seconds. "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:init state:D stack: 0 pid: 1 ppid: 0 flags:0x04000000
> Call trace:
>  __switch_to+0x244/0x4e4
>  __schedule+0x5bc/0xc48
>  schedule+0x80/0x164
>  rwsem_down_read_slowpath+0x4fc/0xf9c
>  __down_read+0x140/0x188
>  down_read+0x14/0x24
>  try_wakeup_wbd_thread+0x78/0x1ec [zram]
>  __zram_bvec_write+0x720/0x878 [zram]
>  zram_bvec_rw+0xa8/0x234 [zram]
>  zram_submit_bio+0x16c/0x268 [zram]
>  submit_bio_noacct+0x128/0x3c8
>  submit_bio+0x1cc/0x3d0
>  __swap_writepage+0x5c4/0xd4c
>  swap_writepage+0x130/0x158
>  pageout+0x1f4/0x478
>  shrink_page_list+0x9b4/0x1eb8
>  shrink_inactive_list+0x2f4/0xaa8
>  shrink_lruvec+0x184/0x340
>  shrink_node_memcgs+0x84/0x3a0
>  shrink_node+0x2c4/0x6c4
>  shrink_zones+0x16c/0x29c
>  do_try_to_free_pages+0xe4/0x2b4
>  try_to_free_pages+0x388/0x7b4
>  __alloc_pages_direct_reclaim+0x88/0x278
>  __alloc_pages_slowpath+0x4ec/0xf6c
>  __alloc_pages_nodemask+0x1f4/0x3dc
>  kmalloc_order+0x54/0x338
>  kmalloc_order_trace+0x34/0x1bc
>  __kmalloc+0x5e8/0x9c0
>  kvmalloc_node+0xa8/0x264
>  backing_dev_store+0x1a4/0x818 [zram]
>  dev_attr_store+0x38/0x8c
>  sysfs_kf_write+0x64/0xc4

Hmm, I'm not really following this backtrace. Backing device
configuration is only possible on un-initialized zram device.
If it's uninitialized, then why is it being used for swapout
later in the call stack?
On Thu, Nov 30, 2023 at 11:20:47PM +0800, Dongyun Liu wrote:
> We encountered a deadlock issue when using zram on kernel version
> 5.10. The reason is that backing_dev_store will first hold the
> zram->init_lock, and then allocate memory with kvmalloc_node. At this
> moment, the system may be in a state of memory shortage, so it will
> enter the memory swapping process. In zram_bvec_write, we will trigger
> our own thread to move data from zram to the backing device to free up
> more available memory, which is the work done in try_wakeup_zram_wbd.
> In this function, zram->init_lock will be acquired again to read
> zram->bd_wb_limit, which causes a deadlock as the following call trace
> shows.

Isn't it your custom feature instead of upstream?

> The latest version of Linux does not have the bug, but there is a
> potential risk in the future. If we try to acquire zram->init_lock
> again in the zram process (for example, when we need to read
> zram->bd_wb_limit), there is a risk of deadlock. The bug is quite
> elusive, and it only appears with low

It would be very helpful if you show how the zram in the upstream could
cause the deadlock instead of your custom feature.

> probability after conducting a week-long stress test using 50 devices.
> Therefore, I suggest passing the GFP_ATOMIC flag to allocate memory in
> backing_dev_store, to avoid similar issues we encountered. Please
> consider it, thank you.
>
> INFO: task init:331 blocked for more than 120 seconds. "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [call trace snipped]
>
> Signed-off-by: Dongyun Liu <dongyun.liu@transsion.com>
> ---
>  drivers/block/zram/zram_drv.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index d77d3664ca08..ee6c22c50e09 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -514,7 +514,7 @@ static ssize_t backing_dev_store(struct device *dev,
>
>  	nr_pages = i_size_read(inode) >> PAGE_SHIFT;
>  	bitmap_sz = BITS_TO_LONGS(nr_pages) * sizeof(long);
> -	bitmap = kvzalloc(bitmap_sz, GFP_KERNEL);
> +	bitmap = kmalloc(bitmap_sz, GFP_ATOMIC);
>  	if (!bitmap) {
>  		err = -ENOMEM;
>  		goto out;
> --
> 2.25.1
On 2023/12/1 22:19, Jens Axboe wrote:
> On 11/30/23 11:51 PM, Dongyun Liu wrote:
>> On 2023/11/30 23:37, Jens Axboe wrote:
>>> On 11/30/23 8:20 AM, Dongyun Liu wrote:
>>>> [patch diff snipped]
>>>
>>> Outside of this moving from a zeroed alloc to one that does not, the
>>> change looks woefully incomplete. Why does this allocation need to
>>> be GFP_ATOMIC, and:
>>
>> Using GFP_ATOMIC indicates that the caller cannot reclaim or sleep,
>> so we can avoid the risk of deadlock when acquiring zram->init_lock
>> again in zram_bvec_write.
>
> Yes, I am very much aware of how gfp allocation flags work and why
> it's broken. It was a rhetorical question as to why you think you
> could get away with just fixing one of them.
>
>>> 1) file_name = kmalloc(PATH_MAX, GFP_KERNEL); does not
>>
>> There is no zram->init_lock held here, so there is no need to use
>> GFP_ATOMIC.
>
> True
>
>>> 2) filp_open() -> getname_kernel() -> __getname() does not
>>> 3) filp_open() -> getname_kernel() does not
>>> 4) bdev_open_by_dev() does not
>>
>> These are missing the use of GFP_ATOMIC.
>
> Indeed!
>
>>> IOW, you have a slew of GFP_KERNEL allocations in there, and you
>>> probably just patched the largest one. But the core issue remains.
>>>
>>> The whole handling of backing_dev_store() looks pretty broken.
>>
>> Indeed, this patch only solves the biggest problem and does not fix
>> the issue fundamentally [...]
>
> You agree that it doesn't fix the issue, it just happens to fix the
> one that you hit. And then you jump to the conclusion that this is all
> that's needed to fix it. Ehm, confused?

Hi, Jens,

Maybe there's something wrong with my expression. You can put it this
way: I agree with you that it doesn't fix the issue.
On 2023/12/1 23:39, Sergey Senozhatsky wrote:
> On (23/11/30 23:20), Dongyun Liu wrote:
>> INFO: task init:331 blocked for more than 120 seconds. "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [call trace snipped]
>
> Hmm, I'm not really following this backtrace. Backing device
> configuration is only possible on un-initialized zram device.
> If it's uninitialized, then why is it being used for swapout
> later in the call stack?

Uh, at this moment, zram has finished initializing and is working. The
backing device is an optional zram-based feature. I think it can be
created later.
On (23/12/02 23:50), Dongyun Liu wrote:
> On 2023/12/1 23:39, Sergey Senozhatsky wrote:
> > On (23/11/30 23:20), Dongyun Liu wrote:
> > > INFO: task init:331 blocked for more than 120 seconds. "echo 0 >
> > > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > [call trace snipped]
> >
> > Hmm, I'm not really following this backtrace. Backing device
> > configuration is only possible on un-initialized zram device.
> > If it's uninitialized, then why is it being used for swapout
> > later in the call stack?
>
> Uh, at this moment, zram has finished initializing and is working.
> The backing device is an optional zram-based feature. I think it can
> be created later.

backing_dev_store() can't be called on an initialized device,
that's what init_done() at the very beginning of backing_dev_store()
is supposed to ensure:

	...
	down_write(&zram->init_lock);
	if (init_done(zram)) {
		pr_info("Can't setup backing device for initialized device\n");
		err = -EBUSY;
		goto out;
	}
	...
On 2023/12/3 9:23, Sergey Senozhatsky wrote:
> On (23/12/02 23:50), Dongyun Liu wrote:
>> On 2023/12/1 23:39, Sergey Senozhatsky wrote:
>>> On (23/11/30 23:20), Dongyun Liu wrote:
>>>> INFO: task init:331 blocked for more than 120 seconds. "echo 0 >
>>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> [call trace snipped]
>>>
>>> Hmm, I'm not really following this backtrace. Backing device
>>> configuration is only possible on un-initialized zram device.
>>> If it's uninitialized, then why is it being used for swapout
>>> later in the call stack?
>>
>> Uh, at this moment, zram has finished initializing and is working.
>> The backing device is an optional zram-based feature. I think it can
>> be created later.
>
> backing_dev_store() can't be called on an initialized device,
> that's what init_done() at the very beginning of backing_dev_store()
> is supposed to ensure:
>
> 	...
> 	down_write(&zram->init_lock);
> 	if (init_done(zram)) {
> 		pr_info("Can't setup backing device for initialized device\n");
> 		err = -EBUSY;
> 		goto out;
> 	}
> 	...

We have modified the logic here to allow backing_dev_store() to be
called multiple times. That's why the call trace comes in. On the other
hand, I realized that I should show the upstream's call stack instead
of my custom feature next time. Thank you for your tolerance.
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index d77d3664ca08..ee6c22c50e09 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -514,7 +514,7 @@ static ssize_t backing_dev_store(struct device *dev,
 
 	nr_pages = i_size_read(inode) >> PAGE_SHIFT;
 	bitmap_sz = BITS_TO_LONGS(nr_pages) * sizeof(long);
-	bitmap = kvzalloc(bitmap_sz, GFP_KERNEL);
+	bitmap = kmalloc(bitmap_sz, GFP_ATOMIC);
 	if (!bitmap) {
 		err = -ENOMEM;
 		goto out;
We encountered a deadlock issue when using zram on kernel version 5.10.
The reason is that backing_dev_store will first hold the
zram->init_lock, and then allocate memory with kvmalloc_node. At this
moment, the system may be in a state of memory shortage, so it will
enter the memory swapping process. In zram_bvec_write, we will trigger
our own thread to move data from zram to the backing device to free up
more available memory, which is the work done in try_wakeup_zram_wbd.
In this function, zram->init_lock will be acquired again to read
zram->bd_wb_limit, which causes a deadlock as the following call trace
shows.

The latest version of Linux does not have the bug, but there is a
potential risk in the future. If we try to acquire zram->init_lock
again in the zram process (for example, when we need to read
zram->bd_wb_limit), there is a risk of deadlock. The bug is quite
elusive, and it only appears with low probability after conducting a
week-long stress test using 50 devices. Therefore, I suggest passing
the GFP_ATOMIC flag to allocate memory in backing_dev_store, to avoid
similar issues we encountered. Please consider it, thank you.

INFO: task init:331 blocked for more than 120 seconds. "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:init state:D stack: 0 pid: 1 ppid: 0 flags:0x04000000
Call trace:
 __switch_to+0x244/0x4e4
 __schedule+0x5bc/0xc48
 schedule+0x80/0x164
 rwsem_down_read_slowpath+0x4fc/0xf9c
 __down_read+0x140/0x188
 down_read+0x14/0x24
 try_wakeup_wbd_thread+0x78/0x1ec [zram]
 __zram_bvec_write+0x720/0x878 [zram]
 zram_bvec_rw+0xa8/0x234 [zram]
 zram_submit_bio+0x16c/0x268 [zram]
 submit_bio_noacct+0x128/0x3c8
 submit_bio+0x1cc/0x3d0
 __swap_writepage+0x5c4/0xd4c
 swap_writepage+0x130/0x158
 pageout+0x1f4/0x478
 shrink_page_list+0x9b4/0x1eb8
 shrink_inactive_list+0x2f4/0xaa8
 shrink_lruvec+0x184/0x340
 shrink_node_memcgs+0x84/0x3a0
 shrink_node+0x2c4/0x6c4
 shrink_zones+0x16c/0x29c
 do_try_to_free_pages+0xe4/0x2b4
 try_to_free_pages+0x388/0x7b4
 __alloc_pages_direct_reclaim+0x88/0x278
 __alloc_pages_slowpath+0x4ec/0xf6c
 __alloc_pages_nodemask+0x1f4/0x3dc
 kmalloc_order+0x54/0x338
 kmalloc_order_trace+0x34/0x1bc
 __kmalloc+0x5e8/0x9c0
 kvmalloc_node+0xa8/0x264
 backing_dev_store+0x1a4/0x818 [zram]
 dev_attr_store+0x38/0x8c
 sysfs_kf_write+0x64/0xc4
 kernfs_fop_write_iter+0x168/0x2ac
 vfs_write+0x300/0x37c
 ksys_write+0x7c/0xf0
 __arm64_sys_write+0x20/0x30
 el0_svc_common+0xd4/0x270
 el0_svc+0x28/0x98
 el0_sync_handler+0x8c/0xf0
 el0_sync+0x1b4/0x1c0

Signed-off-by: Dongyun Liu <dongyun.liu@transsion.com>
---
 drivers/block/zram/zram_drv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)