
zram: Using GFP_ATOMIC instead of GFP_KERNEL to allocate bitmap memory in backing_dev_store

Message ID 20231130152047.200169-1-dongyun.liu@transsion.com (mailing list archive)
State New, archived
Series zram: Using GFP_ATOMIC instead of GFP_KERNEL to allocate bitmap memory in backing_dev_store

Commit Message

Dongyun Liu Nov. 30, 2023, 3:20 p.m. UTC
We encountered a deadlock issue when using zram on kernel version 5.10.
The reason is that backing_dev_store first takes zram->init_lock and
then allocates memory with kvmalloc_node. At that moment the system may
be short on memory, so the allocation enters direct reclaim and starts
swapping pages out. In zram_bvec_write we trigger our own thread to
move data from zram to the backing device to free up more memory; this
is the work done in try_wakeup_zram_wbd. That function acquires
zram->init_lock again to read zram->bd_wb_limit, which causes the
deadlock shown in the call trace below.

The latest version of Linux does not have this bug, but the risk
remains for the future. If we ever acquire zram->init_lock again on the
zram I/O path (for example, to read zram->bd_wb_limit), there is a risk
of deadlock. The bug is quite elusive; it appeared only with low
probability during a week-long stress test across 50 devices.
Therefore, I suggest passing the GFP_ATOMIC flag when allocating memory
in backing_dev_store, to avoid issues like the one we encountered.
Please consider it. Thank you.

INFO: task init:331 blocked for more than 120 seconds.  "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:init   state:D stack:    0 pid:    1 ppid:     0 flags:0x04000000
Call trace:
  __switch_to+0x244/0x4e4
  __schedule+0x5bc/0xc48
  schedule+0x80/0x164
  rwsem_down_read_slowpath+0x4fc/0xf9c
  __down_read+0x140/0x188
  down_read+0x14/0x24
  try_wakeup_wbd_thread+0x78/0x1ec [zram]
  __zram_bvec_write+0x720/0x878 [zram]
  zram_bvec_rw+0xa8/0x234 [zram]
  zram_submit_bio+0x16c/0x268 [zram]
  submit_bio_noacct+0x128/0x3c8
  submit_bio+0x1cc/0x3d0
  __swap_writepage+0x5c4/0xd4c
  swap_writepage+0x130/0x158
  pageout+0x1f4/0x478
  shrink_page_list+0x9b4/0x1eb8
  shrink_inactive_list+0x2f4/0xaa8
  shrink_lruvec+0x184/0x340
  shrink_node_memcgs+0x84/0x3a0
  shrink_node+0x2c4/0x6c4
  shrink_zones+0x16c/0x29c
  do_try_to_free_pages+0xe4/0x2b4
  try_to_free_pages+0x388/0x7b4
  __alloc_pages_direct_reclaim+0x88/0x278
  __alloc_pages_slowpath+0x4ec/0xf6c
  __alloc_pages_nodemask+0x1f4/0x3dc
  kmalloc_order+0x54/0x338
  kmalloc_order_trace+0x34/0x1bc
  __kmalloc+0x5e8/0x9c0
  kvmalloc_node+0xa8/0x264
  backing_dev_store+0x1a4/0x818 [zram]
  dev_attr_store+0x38/0x8c
  sysfs_kf_write+0x64/0xc4
  kernfs_fop_write_iter+0x168/0x2ac
  vfs_write+0x300/0x37c
  ksys_write+0x7c/0xf0
  __arm64_sys_write+0x20/0x30
  el0_svc_common+0xd4/0x270
  el0_svc+0x28/0x98
  el0_sync_handler+0x8c/0xf0
  el0_sync+0x1b4/0x1c0
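
To make the lock inversion explicit, here is a minimal sketch of the
two paths (try_wakeup_wbd_thread and the wbd fields belong to our
vendor writeback feature, not upstream zram; the bodies are heavily
simplified and the names illustrative):

    /* Path A: sysfs write to "backing_dev" (simplified sketch) */
    static ssize_t backing_dev_store(struct device *dev,
                                     struct device_attribute *attr,
                                     const char *buf, size_t len)
    {
            struct zram *zram = dev_to_zram(dev);
            unsigned long nr_pages = 1; /* from i_size_read() in real code */
            size_t bitmap_sz = BITS_TO_LONGS(nr_pages) * sizeof(long);
            unsigned long *bitmap;

            down_write(&zram->init_lock);
            /*
             * GFP_KERNEL may enter direct reclaim; under memory
             * pressure, reclaim swaps pages out to this very zram
             * device and ends up in Path B below.
             */
            bitmap = kvzalloc(bitmap_sz, GFP_KERNEL);
            if (!bitmap) {
                    up_write(&zram->init_lock);
                    return -ENOMEM;
            }
            /* open backing file, publish bitmap, etc. */
            up_write(&zram->init_lock);
            return len;
    }

    /* Path B: reclaim-driven write into zram (vendor feature) */
    static void try_wakeup_wbd_thread(struct zram *zram)
    {
            /* blocks forever: Path A holds init_lock for writing */
            down_read(&zram->init_lock);
            if (zram->bd_wb_limit)
                    wake_up_process(zram->wbd_thread);
            up_read(&zram->init_lock);
    }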

Signed-off-by: Dongyun Liu <dongyun.liu@transsion.com>
---
 drivers/block/zram/zram_drv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Jens Axboe Nov. 30, 2023, 3:37 p.m. UTC | #1
On 11/30/23 8:20 AM, Dongyun Liu wrote:
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index d77d3664ca08..ee6c22c50e09 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -514,7 +514,7 @@ static ssize_t backing_dev_store(struct device *dev,
>  
>  	nr_pages = i_size_read(inode) >> PAGE_SHIFT;
>  	bitmap_sz = BITS_TO_LONGS(nr_pages) * sizeof(long);
> -	bitmap = kvzalloc(bitmap_sz, GFP_KERNEL);
> +	bitmap = kmalloc(bitmap_sz, GFP_ATOMIC);
>  	if (!bitmap) {
>  		err = -ENOMEM;
>  		goto out;

Outside of this moving from a zeroed alloc to one that does not, the
change looks woefully incomplete. Why does this allocation need to be
GFP_ATOMIC, and:

1) file_name = kmalloc(PATH_MAX, GFP_KERNEL); does not
2) filp_open() -> getname_kernel() -> __getname() does not
3) filp_open() -> getname_kernel() does not
4) bdev_open_by_dev() does not

IOW, you have a slew of GFP_KERNEL allocations in there, and you
probably just patched the largest one. But the core issue remains.

The whole handling of backing_dev_store() looks pretty broken.
Dongyun Liu Dec. 1, 2023, 6:51 a.m. UTC | #2
On 2023/11/30 23:37, Jens Axboe wrote:
> On 11/30/23 8:20 AM, Dongyun Liu wrote:
>> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
>> index d77d3664ca08..ee6c22c50e09 100644
>> --- a/drivers/block/zram/zram_drv.c
>> +++ b/drivers/block/zram/zram_drv.c
>> @@ -514,7 +514,7 @@ static ssize_t backing_dev_store(struct device *dev,
>>   
>>   	nr_pages = i_size_read(inode) >> PAGE_SHIFT;
>>   	bitmap_sz = BITS_TO_LONGS(nr_pages) * sizeof(long);
>> -	bitmap = kvzalloc(bitmap_sz, GFP_KERNEL);
>> +	bitmap = kmalloc(bitmap_sz, GFP_ATOMIC);
>>   	if (!bitmap) {
>>   		err = -ENOMEM;
>>   		goto out;
> 
> Outside of this moving from a zeroed alloc to one that does not, the
> change looks woefully incomplete. Why does this allocation need to be
> GFP_ATOMIC, and:

By using GFP_ATOMIC we indicate that the caller cannot reclaim or
sleep, which prevents the risk of deadlock when acquiring
zram->init_lock again in zram_bvec_write.
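
For illustration only (not part of the patch), the difference between
the two flags at this call site:

    /* GFP_KERNEL: may sleep and enter direct reclaim, which can
     * swap out to zram and re-take zram->init_lock -> deadlock */
    bitmap = kvzalloc(bitmap_sz, GFP_KERNEL);

    /* GFP_ATOMIC: never sleeps or reclaims, so it cannot recurse
     * into zram, but it taps reserves and fails more readily */
    bitmap = kmalloc(bitmap_sz, GFP_ATOMIC);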

> 
> 1) file_name = kmalloc(PATH_MAX, GFP_KERNEL); does not

There is no zram->init_lock held here, so there is no need to use
GFP_ATOMIC.

> 2) filp_open() -> getname_kernel() -> __getname() does not
> 3) filp_open() -> getname_kernel() does not
> 4) bdev_open_by_dev() does not

Yes, I missed converting these to GFP_ATOMIC as well.

> 
> IOW, you have a slew of GFP_KERNEL allocations in there, and you
> probably just patched the largest one. But the core issue remains.
> 
> The whole handling of backing_dev_store() looks pretty broken.
> 

Indeed, this patch only addresses the biggest problem and does not
fix the issue fundamentally: several paths in backing_dev_store
allocate memory while zram->init_lock is held, and all of them would
need to be modified, which I did not consider thoroughly. Obviously,
a larger and better patch is needed to eliminate this risk, but it
does not seem necessary at the moment.
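
For example, one direction (an untested sketch, not what this patch
does) could be to run the whole locked section under a scoped
memalloc_noio_save(), which would also cover the other GFP_KERNEL
allocations you listed:

    unsigned int noio_flags;

    down_write(&zram->init_lock);
    /*
     * Scoped GFP_NOIO: kvzalloc(), filp_open(), bdev_open_by_dev()
     * below may still reclaim, but reclaim will not issue I/O, so
     * it can never write back into zram while init_lock is held.
     */
    noio_flags = memalloc_noio_save();
    /* existing backing_dev_store() body */
    memalloc_noio_restore(noio_flags);
    up_write(&zram->init_lock);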

Thank you for your kindness and patience.
Jens Axboe Dec. 1, 2023, 2:19 p.m. UTC | #3
On 11/30/23 11:51 PM, Dongyun Liu wrote:
> 
> 
> On 2023/11/30 23:37, Jens Axboe wrote:
>> On 11/30/23 8:20 AM, Dongyun Liu wrote:
>>> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
>>> index d77d3664ca08..ee6c22c50e09 100644
>>> --- a/drivers/block/zram/zram_drv.c
>>> +++ b/drivers/block/zram/zram_drv.c
>>> @@ -514,7 +514,7 @@ static ssize_t backing_dev_store(struct device *dev,
>>>         nr_pages = i_size_read(inode) >> PAGE_SHIFT;
>>>       bitmap_sz = BITS_TO_LONGS(nr_pages) * sizeof(long);
>>> -    bitmap = kvzalloc(bitmap_sz, GFP_KERNEL);
>>> +    bitmap = kmalloc(bitmap_sz, GFP_ATOMIC);
>>>       if (!bitmap) {
>>>           err = -ENOMEM;
>>>           goto out;
>>
>> Outside of this moving from a zeroed alloc to one that does not, the
>> change looks woefully incomplete. Why does this allocation need to be
>> GFP_ATOMIC, and:
> 
> By using GFP_ATOMIC we indicate that the caller cannot reclaim or
> sleep, which prevents the risk of deadlock when acquiring
> zram->init_lock again in zram_bvec_write.

Yes, I am very much aware of how gfp allocation flags work and why
it's broken. It was a rhetorical question as to why you think you could
get away with just fixing one of them.

>> 1) file_name = kmalloc(PATH_MAX, GFP_KERNEL); does not
> 
> There is no zram->init_lock held here, so there is no need to use
> GFP_ATOMIC.

True

>> 2) filp_open() -> getname_kernel() -> __getname() does not
>> 3) filp_open() -> getname_kernel() does not
>> 4) bdev_open_by_dev() does not
> 
> Yes, I missed converting these to GFP_ATOMIC as well.

Indeed!

>> IOW, you have a slew of GFP_KERNEL allocations in there, and you
>> probably just patched the largest one. But the core issue remains.
>>
>> The whole handling of backing_dev_store() looks pretty broken.
>>
> 
> Indeed, this patch only addresses the biggest problem and does not
> fix the issue fundamentally: several paths in backing_dev_store
> allocate memory while zram->init_lock is held, and all of them would
> need to be modified, which I did not consider thoroughly. Obviously,
> a larger and better patch is needed to eliminate this risk, but it
> does not seem necessary at the moment.

You agree that it doesn't fix the issue, it just happens to fix the one
that you hit. And then you jump to the conclusion that this is all
that's needed to fix it. Ehm, confused?
Sergey Senozhatsky Dec. 1, 2023, 3:28 p.m. UTC | #4
On (23/12/01 07:19), Jens Axboe wrote:
> >> IOW, you have a slew of GFP_KERNEL allocations in there, and you
> >> probably just patched the largest one. But the core issue remains.
> >>
> >> The whole handling of backing_dev_store() looks pretty broken.
> >>
> > 
> > Indeed, this patch only addresses the biggest problem and does not
> > fix the issue fundamentally: several paths in backing_dev_store
> > allocate memory while zram->init_lock is held, and all of them would
> > need to be modified, which I did not consider thoroughly. Obviously,
> > a larger and better patch is needed to eliminate this risk, but it
> > does not seem necessary at the moment.
> 
> You agree that it doesn't fix the issue, it just happens to fix the one
> that you hit. And then you jump to the conclusion that this is all
> that's needed to fix it. Ehm, confused?

Yeah.

zram is very sensitive to memory - zsmalloc pool needs physical pages
(allocated on demand) to keep data written to zram. zram probably simply
should not be used on systems under such heavy memory pressure. Throwing
GFP_ATOMICs at zram isn't going to fix anything.

Jens, you said that zram's backing device handling is broken. Got a minute
to elaborate?
Sergey Senozhatsky Dec. 1, 2023, 3:39 p.m. UTC | #5
On (23/11/30 23:20), Dongyun Liu wrote:
> INFO: task init:331 blocked for more than 120 seconds.  "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:init   state:D stack:    0 pid:    1 ppid:     0 flags:0x04000000
> Call trace:
>   __switch_to+0x244/0x4e4
>   __schedule+0x5bc/0xc48
>   schedule+0x80/0x164
>   rwsem_down_read_slowpath+0x4fc/0xf9c
>   __down_read+0x140/0x188
>   down_read+0x14/0x24
>   try_wakeup_wbd_thread+0x78/0x1ec [zram]
>   __zram_bvec_write+0x720/0x878 [zram]
>   zram_bvec_rw+0xa8/0x234 [zram]
>   zram_submit_bio+0x16c/0x268 [zram]
>   submit_bio_noacct+0x128/0x3c8
>   submit_bio+0x1cc/0x3d0
>   __swap_writepage+0x5c4/0xd4c
>   swap_writepage+0x130/0x158
>   pageout+0x1f4/0x478
>   shrink_page_list+0x9b4/0x1eb8
>   shrink_inactive_list+0x2f4/0xaa8
>   shrink_lruvec+0x184/0x340
>   shrink_node_memcgs+0x84/0x3a0
>   shrink_node+0x2c4/0x6c4
>   shrink_zones+0x16c/0x29c
>   do_try_to_free_pages+0xe4/0x2b4
>   try_to_free_pages+0x388/0x7b4
>   __alloc_pages_direct_reclaim+0x88/0x278
>   __alloc_pages_slowpath+0x4ec/0xf6c
>   __alloc_pages_nodemask+0x1f4/0x3dc
>   kmalloc_order+0x54/0x338
>   kmalloc_order_trace+0x34/0x1bc
>   __kmalloc+0x5e8/0x9c0
>   kvmalloc_node+0xa8/0x264
>   backing_dev_store+0x1a4/0x818 [zram]
>   dev_attr_store+0x38/0x8c
>   sysfs_kf_write+0x64/0xc4

Hmm, I'm not really following this backtrace. Backing device
configuration is only possible on an un-initialized zram device.
If it's uninitialized, then why is it being used for swapout
later in the call stack?
Minchan Kim Dec. 1, 2023, 6:41 p.m. UTC | #6
On Thu, Nov 30, 2023 at 11:20:47PM +0800, Dongyun Liu wrote:
> We encountered a deadlock issue when using zram on kernel version 5.10.
> The reason is that backing_dev_store first takes zram->init_lock and
> then allocates memory with kvmalloc_node. At that moment the system
> may be short on memory, so the allocation enters direct reclaim and
> starts swapping pages out. In zram_bvec_write we trigger our own
> thread to move data from zram to the backing device to free up more
> memory; this is the work done in try_wakeup_zram_wbd. That function
> acquires zram->init_lock again to read zram->bd_wb_limit, which causes
> the deadlock shown in the call trace below.

Isn't this your custom feature rather than upstream code?

> 
> The latest version of Linux does not have this bug, but the risk
> remains for the future. If we ever acquire zram->init_lock again on the
> zram I/O path (for example, to read zram->bd_wb_limit), there is a risk
> of deadlock. The bug is quite elusive; it appeared only with low

It would be very helpful if you showed how upstream zram could cause
the deadlock, rather than your custom feature.

> probability during a week-long stress test across 50 devices.
> Therefore, I suggest passing the GFP_ATOMIC flag when allocating memory
> in backing_dev_store, to avoid issues like the one we encountered.
> Please consider it. Thank you.
> 
> INFO: task init:331 blocked for more than 120 seconds.  "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:init   state:D stack:    0 pid:    1 ppid:     0 flags:0x04000000
> Call trace:
>   __switch_to+0x244/0x4e4
>   __schedule+0x5bc/0xc48
>   schedule+0x80/0x164
>   rwsem_down_read_slowpath+0x4fc/0xf9c
>   __down_read+0x140/0x188
>   down_read+0x14/0x24
>   try_wakeup_wbd_thread+0x78/0x1ec [zram]
>   __zram_bvec_write+0x720/0x878 [zram]
>   zram_bvec_rw+0xa8/0x234 [zram]
>   zram_submit_bio+0x16c/0x268 [zram]
>   submit_bio_noacct+0x128/0x3c8
>   submit_bio+0x1cc/0x3d0
>   __swap_writepage+0x5c4/0xd4c
>   swap_writepage+0x130/0x158
>   pageout+0x1f4/0x478
>   shrink_page_list+0x9b4/0x1eb8
>   shrink_inactive_list+0x2f4/0xaa8
>   shrink_lruvec+0x184/0x340
>   shrink_node_memcgs+0x84/0x3a0
>   shrink_node+0x2c4/0x6c4
>   shrink_zones+0x16c/0x29c
>   do_try_to_free_pages+0xe4/0x2b4
>   try_to_free_pages+0x388/0x7b4
>   __alloc_pages_direct_reclaim+0x88/0x278
>   __alloc_pages_slowpath+0x4ec/0xf6c
>   __alloc_pages_nodemask+0x1f4/0x3dc
>   kmalloc_order+0x54/0x338
>   kmalloc_order_trace+0x34/0x1bc
>   __kmalloc+0x5e8/0x9c0
>   kvmalloc_node+0xa8/0x264
>   backing_dev_store+0x1a4/0x818 [zram]
>   dev_attr_store+0x38/0x8c
>   sysfs_kf_write+0x64/0xc4
>   kernfs_fop_write_iter+0x168/0x2ac
>   vfs_write+0x300/0x37c
>   ksys_write+0x7c/0xf0
>   __arm64_sys_write+0x20/0x30
>   el0_svc_common+0xd4/0x270
>   el0_svc+0x28/0x98
>   el0_sync_handler+0x8c/0xf0
>   el0_sync+0x1b4/0x1c0
> 
> Signed-off-by: Dongyun Liu <dongyun.liu@transsion.com>
> ---
>  drivers/block/zram/zram_drv.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index d77d3664ca08..ee6c22c50e09 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -514,7 +514,7 @@ static ssize_t backing_dev_store(struct device *dev,
>  
>  	nr_pages = i_size_read(inode) >> PAGE_SHIFT;
>  	bitmap_sz = BITS_TO_LONGS(nr_pages) * sizeof(long);
> -	bitmap = kvzalloc(bitmap_sz, GFP_KERNEL);
> +	bitmap = kmalloc(bitmap_sz, GFP_ATOMIC);
>  	if (!bitmap) {
>  		err = -ENOMEM;
>  		goto out;
> -- 
> 2.25.1
>
Dongyun Liu Dec. 2, 2023, 1:54 p.m. UTC | #7
On 2023/12/1 22:19, Jens Axboe wrote:
> On 11/30/23 11:51 PM, Dongyun Liu wrote:
>>
>>
>> On 2023/11/30 23:37, Jens Axboe wrote:
>>> On 11/30/23 8:20 AM, Dongyun Liu wrote:
>>>> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
>>>> index d77d3664ca08..ee6c22c50e09 100644
>>>> --- a/drivers/block/zram/zram_drv.c
>>>> +++ b/drivers/block/zram/zram_drv.c
>>>> @@ -514,7 +514,7 @@ static ssize_t backing_dev_store(struct device *dev,
>>>>          nr_pages = i_size_read(inode) >> PAGE_SHIFT;
>>>>        bitmap_sz = BITS_TO_LONGS(nr_pages) * sizeof(long);
>>>> -    bitmap = kvzalloc(bitmap_sz, GFP_KERNEL);
>>>> +    bitmap = kmalloc(bitmap_sz, GFP_ATOMIC);
>>>>        if (!bitmap) {
>>>>            err = -ENOMEM;
>>>>            goto out;
>>>
>>> Outside of this moving from a zeroed alloc to one that does not, the
>>> change looks woefully incomplete. Why does this allocation need to be
>>> GFP_ATOMIC, and:
>>
>> By using GFP_ATOMIC we indicate that the caller cannot reclaim or
>> sleep, which prevents the risk of deadlock when acquiring
>> zram->init_lock again in zram_bvec_write.
> 
> Yes, I am very much aware of how gfp allocation flags work and why
> it's broken. It was a rhetorical question as to why you think you could
> get away with just fixing one of them.
> 
>>> 1) file_name = kmalloc(PATH_MAX, GFP_KERNEL); does not
>>
>> There is no zram->init_lock held here, so there is no need to use
>> GFP_ATOMIC.
> 
> True
> 
>>> 2) filp_open() -> getname_kernel() -> __getname() does not
>>> 3) filp_open() -> getname_kernel() does not
>>> 4) bdev_open_by_dev() does not
>>
>> Yes, I missed converting these to GFP_ATOMIC as well.
> 
> Indeed!
> 
>>> IOW, you have a slew of GFP_KERNEL allocations in there, and you
>>> probably just patched the largest one. But the core issue remains.
>>>
>>> The whole handling of backing_dev_store() looks pretty broken.
>>>
>>
>> Indeed, this patch only addresses the biggest problem and does not
>> fix the issue fundamentally: several paths in backing_dev_store
>> allocate memory while zram->init_lock is held, and all of them would
>> need to be modified, which I did not consider thoroughly. Obviously,
>> a larger and better patch is needed to eliminate this risk, but it
>> does not seem necessary at the moment.
> 
> You agree that it doesn't fix the issue, it just happens to fix the one
> that you hit. And then you jump to the conclusion that this is all
> that's needed to fix it. Ehm, confused?
> 

Hi Jens, maybe there's something wrong with my expression. To be
clear: I agree with you that it doesn't fix the issue.
Dongyun Liu Dec. 2, 2023, 3:50 p.m. UTC | #8
On 2023/12/1 23:39, Sergey Senozhatsky wrote:
> On (23/11/30 23:20), Dongyun Liu wrote:
>> INFO: task init:331 blocked for more than 120 seconds.  "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> task:init   state:D stack:    0 pid:    1 ppid:     0 flags:0x04000000
>> Call trace:
>>    __switch_to+0x244/0x4e4
>>    __schedule+0x5bc/0xc48
>>    schedule+0x80/0x164
>>    rwsem_down_read_slowpath+0x4fc/0xf9c
>>    __down_read+0x140/0x188
>>    down_read+0x14/0x24
>>    try_wakeup_wbd_thread+0x78/0x1ec [zram]
>>    __zram_bvec_write+0x720/0x878 [zram]
>>    zram_bvec_rw+0xa8/0x234 [zram]
>>    zram_submit_bio+0x16c/0x268 [zram]
>>    submit_bio_noacct+0x128/0x3c8
>>    submit_bio+0x1cc/0x3d0
>>    __swap_writepage+0x5c4/0xd4c
>>    swap_writepage+0x130/0x158
>>    pageout+0x1f4/0x478
>>    shrink_page_list+0x9b4/0x1eb8
>>    shrink_inactive_list+0x2f4/0xaa8
>>    shrink_lruvec+0x184/0x340
>>    shrink_node_memcgs+0x84/0x3a0
>>    shrink_node+0x2c4/0x6c4
>>    shrink_zones+0x16c/0x29c
>>    do_try_to_free_pages+0xe4/0x2b4
>>    try_to_free_pages+0x388/0x7b4
>>    __alloc_pages_direct_reclaim+0x88/0x278
>>    __alloc_pages_slowpath+0x4ec/0xf6c
>>    __alloc_pages_nodemask+0x1f4/0x3dc
>>    kmalloc_order+0x54/0x338
>>    kmalloc_order_trace+0x34/0x1bc
>>    __kmalloc+0x5e8/0x9c0
>>    kvmalloc_node+0xa8/0x264
>>    backing_dev_store+0x1a4/0x818 [zram]
>>    dev_attr_store+0x38/0x8c
>>    sysfs_kf_write+0x64/0xc4
> 
> Hmm, I'm not really following this backtrace. Backing device
> configuration is only possible on an un-initialized zram device.
> If it's uninitialized, then why is it being used for swapout
> later in the call stack?

At this point, zram has finished initializing and is in use. The
backing device is an optional zram feature; I think it can be
configured later.
Sergey Senozhatsky Dec. 3, 2023, 1:23 a.m. UTC | #9
On (23/12/02 23:50), Dongyun Liu wrote:
> On 2023/12/1 23:39, Sergey Senozhatsky wrote:
> > On (23/11/30 23:20), Dongyun Liu wrote:
> > > INFO: task init:331 blocked for more than 120 seconds.  "echo 0 >
> > > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > task:init   state:D stack:    0 pid:    1 ppid:     0 flags:0x04000000
> > > Call trace:
> > >    __switch_to+0x244/0x4e4
> > >    __schedule+0x5bc/0xc48
> > >    schedule+0x80/0x164
> > >    rwsem_down_read_slowpath+0x4fc/0xf9c
> > >    __down_read+0x140/0x188
> > >    down_read+0x14/0x24
> > >    try_wakeup_wbd_thread+0x78/0x1ec [zram]
> > >    __zram_bvec_write+0x720/0x878 [zram]
> > >    zram_bvec_rw+0xa8/0x234 [zram]
> > >    zram_submit_bio+0x16c/0x268 [zram]
> > >    submit_bio_noacct+0x128/0x3c8
> > >    submit_bio+0x1cc/0x3d0
> > >    __swap_writepage+0x5c4/0xd4c
> > >    swap_writepage+0x130/0x158
> > >    pageout+0x1f4/0x478
> > >    shrink_page_list+0x9b4/0x1eb8
> > >    shrink_inactive_list+0x2f4/0xaa8
> > >    shrink_lruvec+0x184/0x340
> > >    shrink_node_memcgs+0x84/0x3a0
> > >    shrink_node+0x2c4/0x6c4
> > >    shrink_zones+0x16c/0x29c
> > >    do_try_to_free_pages+0xe4/0x2b4
> > >    try_to_free_pages+0x388/0x7b4
> > >    __alloc_pages_direct_reclaim+0x88/0x278
> > >    __alloc_pages_slowpath+0x4ec/0xf6c
> > >    __alloc_pages_nodemask+0x1f4/0x3dc
> > >    kmalloc_order+0x54/0x338
> > >    kmalloc_order_trace+0x34/0x1bc
> > >    __kmalloc+0x5e8/0x9c0
> > >    kvmalloc_node+0xa8/0x264
> > >    backing_dev_store+0x1a4/0x818 [zram]
> > >    dev_attr_store+0x38/0x8c
> > >    sysfs_kf_write+0x64/0xc4
> > 
> > Hmm, I'm not really following this backtrace. Backing device
> > configuration is only possible on un-initialized zram device.
> > If it's uninitialized, then why is it being used for swapout
> > later in the call stack?
> 
> At this point, zram has finished initializing and is in use. The
> backing device is an optional zram feature; I think it can be
> configured later.

backing_dev_store() can't be called on an initialized device;
that's what init_done() at the very beginning of backing_dev_store()
is supposed to ensure:

...
	down_write(&zram->init_lock);
	if (init_done(zram)) {
		pr_info("Can't setup backing device for initialized device\n");
		err = -EBUSY;
		goto out;
	}
...
Dongyun Liu Dec. 3, 2023, 2:45 p.m. UTC | #10
On 2023/12/3 9:23, Sergey Senozhatsky wrote:
> On (23/12/02 23:50), Dongyun Liu wrote:
>> On 2023/12/1 23:39, Sergey Senozhatsky wrote:
>>> On (23/11/30 23:20), Dongyun Liu wrote:
>>>> INFO: task init:331 blocked for more than 120 seconds.  "echo 0 >
>>>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>> task:init   state:D stack:    0 pid:    1 ppid:     0 flags:0x04000000
>>>> Call trace:
>>>>     __switch_to+0x244/0x4e4
>>>>     __schedule+0x5bc/0xc48
>>>>     schedule+0x80/0x164
>>>>     rwsem_down_read_slowpath+0x4fc/0xf9c
>>>>     __down_read+0x140/0x188
>>>>     down_read+0x14/0x24
>>>>     try_wakeup_wbd_thread+0x78/0x1ec [zram]
>>>>     __zram_bvec_write+0x720/0x878 [zram]
>>>>     zram_bvec_rw+0xa8/0x234 [zram]
>>>>     zram_submit_bio+0x16c/0x268 [zram]
>>>>     submit_bio_noacct+0x128/0x3c8
>>>>     submit_bio+0x1cc/0x3d0
>>>>     __swap_writepage+0x5c4/0xd4c
>>>>     swap_writepage+0x130/0x158
>>>>     pageout+0x1f4/0x478
>>>>     shrink_page_list+0x9b4/0x1eb8
>>>>     shrink_inactive_list+0x2f4/0xaa8
>>>>     shrink_lruvec+0x184/0x340
>>>>     shrink_node_memcgs+0x84/0x3a0
>>>>     shrink_node+0x2c4/0x6c4
>>>>     shrink_zones+0x16c/0x29c
>>>>     do_try_to_free_pages+0xe4/0x2b4
>>>>     try_to_free_pages+0x388/0x7b4
>>>>     __alloc_pages_direct_reclaim+0x88/0x278
>>>>     __alloc_pages_slowpath+0x4ec/0xf6c
>>>>     __alloc_pages_nodemask+0x1f4/0x3dc
>>>>     kmalloc_order+0x54/0x338
>>>>     kmalloc_order_trace+0x34/0x1bc
>>>>     __kmalloc+0x5e8/0x9c0
>>>>     kvmalloc_node+0xa8/0x264
>>>>     backing_dev_store+0x1a4/0x818 [zram]
>>>>     dev_attr_store+0x38/0x8c
>>>>     sysfs_kf_write+0x64/0xc4
>>>
>>> Hmm, I'm not really following this backtrace. Backing device
>>> configuration is only possible on an un-initialized zram device.
>>> If it's uninitialized, then why is it being used for swapout
>>> later in the call stack?
>>
>> At this point, zram has finished initializing and is in use. The
>> backing device is an optional zram feature; I think it can be
>> configured later.
> 
> backing_dev_store() can't be called on an initialized device;
> that's what init_done() at the very beginning of backing_dev_store()
> is supposed to ensure:
> 

We have modified the logic here to allow backing_dev_store() to be
called multiple times; that is how this call trace becomes possible.
I also realize that I should show the upstream call stack instead of
our custom feature next time. Thank you for your patience.

> ...
> 	down_write(&zram->init_lock);
> 	if (init_done(zram)) {
> 		pr_info("Can't setup backing device for initialized device\n");
> 		err = -EBUSY;
> 		goto out;
> 	}
> ...

Patch

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index d77d3664ca08..ee6c22c50e09 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -514,7 +514,7 @@  static ssize_t backing_dev_store(struct device *dev,
 
 	nr_pages = i_size_read(inode) >> PAGE_SHIFT;
 	bitmap_sz = BITS_TO_LONGS(nr_pages) * sizeof(long);
-	bitmap = kvzalloc(bitmap_sz, GFP_KERNEL);
+	bitmap = kmalloc(bitmap_sz, GFP_ATOMIC);
 	if (!bitmap) {
 		err = -ENOMEM;
 		goto out;