mbox series

[v3,0/9] cachefiles: random bugfixes

Message ID 20240628062930.2467993-1-libaokun@huaweicloud.com (mailing list archive)
Headers show
Series cachefiles: random bugfixes | expand

Message

Baokun Li June 28, 2024, 6:29 a.m. UTC
From: Baokun Li <libaokun1@huawei.com>

Hi all!

This is the third version of this patch series, in which another patch set
is subsumed into this one to avoid confusing the two patch sets.
(https://patchwork.kernel.org/project/linux-fsdevel/list/?series=854914)

Thank you, Jia Zhu, Gao Xiang, Jeff Layton, for the feedback in the
previous version.

We've been testing ondemand mode for cachefiles since January, and we're
almost done. We hit a lot of issues during the testing period, and this
patch series fixes some of the issues. The patches have passed internal
testing without regression.

The following is a brief overview of the patches, see the patches for
more details.

Patch 1-2: Add fscache_try_get_volume() helper function to avoid
fscache_volume use-after-free on cache withdrawal.

Patch 3: Fix cachefiles_lookup_cookie() and cachefiles_withdraw_cache()
concurrency causing cachefiles_volume use-after-free.

Patch 4: Propagate error codes returned by vfs_getxattr() to avoid
endless loops.

Patch 5-7: A read request waiting for reopen could be closed maliciously
before the reopen worker is executing or waiting to be scheduled. So
ondemand_object_worker() may be called after the info and object and even
the cache have been freed and trigger use-after-free. So use
cancel_work_sync() in cachefiles_ondemand_clean_object() to cancel the
reopen worker or wait for it to finish. Since it makes no sense to wait
for the daemon to complete the reopen request, to avoid this pointless
operation blocking cancel_work_sync(), Patch 1 avoids request generation
by the DROPPING state when the request has not been sent, and Patch 2
flushes the requests of the current object before cancel_work_sync().

Patch 8: Cyclic allocation of msg_id to avoid msg_id reuse misleading
the daemon to cause hung.

Patch 9: Hold xas_lock during polling to avoid dereferencing reqs causing
use-after-free. This issue was triggered frequently in our tests, and we
found that anolis 5.10 had fixed it. So to avoid failing the test, this
patch is pushed upstream as well.

Comments and questions are, as always, welcome.
Please let me know what you think.

Thanks,
Baokun

Changes since v2:
  * Collect Acked-by from Jeff Layton.(Thanks for your ack!)
  * Collect RVB from Gao Xiang.(Thanks for your review!)
  * Patch 1-4 from another patch set.
  * Pathch 4,6,7: Optimise commit messages and subjects.

Changes since v1:
  * Collect RVB from Jia Zhu and Gao Xiang.(Thanks for your review!)
  * Pathch 1,2:Add more commit messages.
  * Pathch 3:Add Fixes tag as suggested by Jia Zhu.
  * Pathch 4:No longer changing "do...while" to "retry" to focus changes
    and optimise commit messages.
  * Pathch 5: Drop the internal RVB tag.

v1: https://lore.kernel.org/all/20240424033409.2735257-1-libaokun@huaweicloud.com
v2: https://lore.kernel.org/all/20240515125136.3714580-1-libaokun@huaweicloud.com

Baokun Li (7):
  netfs, fscache: export fscache_put_volume() and add
    fscache_try_get_volume()
  cachefiles: fix slab-use-after-free in fscache_withdraw_volume()
  cachefiles: fix slab-use-after-free in cachefiles_withdraw_cookie()
  cachefiles: propagate errors from vfs_getxattr() to avoid infinite
    loop
  cachefiles: stop sending new request when dropping object
  cachefiles: cancel all requests for the object that is being dropped
  cachefiles: cyclic allocation of msg_id to avoid reuse

Hou Tao (1):
  cachefiles: wait for ondemand_object_worker to finish when dropping
    object

Jingbo Xu (1):
  cachefiles: add missing lock protection when polling

 fs/cachefiles/cache.c          | 45 ++++++++++++++++++++++++++++-
 fs/cachefiles/daemon.c         |  4 +--
 fs/cachefiles/internal.h       |  3 ++
 fs/cachefiles/ondemand.c       | 52 ++++++++++++++++++++++++++++++----
 fs/cachefiles/volume.c         |  1 -
 fs/cachefiles/xattr.c          |  5 +++-
 fs/netfs/fscache_volume.c      | 14 +++++++++
 fs/netfs/internal.h            |  2 --
 include/linux/fscache-cache.h  |  6 ++++
 include/trace/events/fscache.h |  4 +++
 10 files changed, 123 insertions(+), 13 deletions(-)

Comments

Gao Xiang June 28, 2024, 7:39 a.m. UTC | #1
Hi Baokun,

On 2024/6/28 14:29, libaokun@huaweicloud.com wrote:
> From: Baokun Li <libaokun1@huawei.com>
> 
> Hi all!
> 
> This is the third version of this patch series, in which another patch set
> is subsumed into this one to avoid confusing the two patch sets.
> (https://patchwork.kernel.org/project/linux-fsdevel/list/?series=854914)
> 
> Thank you, Jia Zhu, Gao Xiang, Jeff Layton, for the feedback in the
> previous version.
> 
> We've been testing ondemand mode for cachefiles since January, and we're
> almost done. We hit a lot of issues during the testing period, and this
> patch series fixes some of the issues. The patches have passed internal
> testing without regression.
> 
> The following is a brief overview of the patches, see the patches for
> more details.
> 
> Patch 1-2: Add fscache_try_get_volume() helper function to avoid
> fscache_volume use-after-free on cache withdrawal.
> 
> Patch 3: Fix cachefiles_lookup_cookie() and cachefiles_withdraw_cache()
> concurrency causing cachefiles_volume use-after-free.
> 
> Patch 4: Propagate error codes returned by vfs_getxattr() to avoid
> endless loops.
> 
> Patch 5-7: A read request waiting for reopen could be closed maliciously
> before the reopen worker is executing or waiting to be scheduled. So
> ondemand_object_worker() may be called after the info and object and even
> the cache have been freed and trigger use-after-free. So use
> cancel_work_sync() in cachefiles_ondemand_clean_object() to cancel the
> reopen worker or wait for it to finish. Since it makes no sense to wait
> for the daemon to complete the reopen request, to avoid this pointless
> operation blocking cancel_work_sync(), Patch 1 avoids request generation
> by the DROPPING state when the request has not been sent, and Patch 2
> flushes the requests of the current object before cancel_work_sync().
> 
> Patch 8: Cyclic allocation of msg_id to avoid msg_id reuse misleading
> the daemon to cause hung.
> 
> Patch 9: Hold xas_lock during polling to avoid dereferencing reqs causing
> use-after-free. This issue was triggered frequently in our tests, and we
> found that anolis 5.10 had fixed it. So to avoid failing the test, this
> patch is pushed upstream as well.
> 
> Comments and questions are, as always, welcome.
> Please let me know what you think.

Patch 4-9 looks good to me, and they are independent to patch 1-3
so personally I guess they could go upstream in advance.

I hope the way to fix cachefiles in patch 1-4 could be also
confirmed by David and Jeff since they relates the generic
cachefiles logic anyway.

Thanks,
Gao Xiang

> 
> Thanks,
> Baokun
Baokun Li June 28, 2024, 11:37 a.m. UTC | #2
Hi Xiang,

On 2024/6/28 15:39, Gao Xiang wrote:
> Hi Baokun,
>
> On 2024/6/28 14:29, libaokun@huaweicloud.com wrote:
>> From: Baokun Li <libaokun1@huawei.com>
>>
>> Hi all!
>>
>> This is the third version of this patch series, in which another 
>> patch set
>> is subsumed into this one to avoid confusing the two patch sets.
>> (https://patchwork.kernel.org/project/linux-fsdevel/list/?series=854914)
>>
>> Thank you, Jia Zhu, Gao Xiang, Jeff Layton, for the feedback in the
>> previous version.
>>
>> We've been testing ondemand mode for cachefiles since January, and we're
>> almost done. We hit a lot of issues during the testing period, and this
>> patch series fixes some of the issues. The patches have passed internal
>> testing without regression.
>>
>> The following is a brief overview of the patches, see the patches for
>> more details.
>>
>> Patch 1-2: Add fscache_try_get_volume() helper function to avoid
>> fscache_volume use-after-free on cache withdrawal.
>>
>> Patch 3: Fix cachefiles_lookup_cookie() and cachefiles_withdraw_cache()
>> concurrency causing cachefiles_volume use-after-free.
>>
>> Patch 4: Propagate error codes returned by vfs_getxattr() to avoid
>> endless loops.
>>
>> Patch 5-7: A read request waiting for reopen could be closed maliciously
>> before the reopen worker is executing or waiting to be scheduled. So
>> ondemand_object_worker() may be called after the info and object and 
>> even
>> the cache have been freed and trigger use-after-free. So use
>> cancel_work_sync() in cachefiles_ondemand_clean_object() to cancel the
>> reopen worker or wait for it to finish. Since it makes no sense to wait
>> for the daemon to complete the reopen request, to avoid this pointless
>> operation blocking cancel_work_sync(), Patch 1 avoids request generation
>> by the DROPPING state when the request has not been sent, and Patch 2
>> flushes the requests of the current object before cancel_work_sync().
>>
>> Patch 8: Cyclic allocation of msg_id to avoid msg_id reuse misleading
>> the daemon to cause hung.
>>
>> Patch 9: Hold xas_lock during polling to avoid dereferencing reqs 
>> causing
>> use-after-free. This issue was triggered frequently in our tests, and we
>> found that anolis 5.10 had fixed it. So to avoid failing the test, this
>> patch is pushed upstream as well.
>>
>> Comments and questions are, as always, welcome.
>> Please let me know what you think.
>
> Patch 4-9 looks good to me, and they are independent to patch 1-3
> so personally I guess they could go upstream in advance.

Thank you for your review!

Indeed, the first three patches have no dependencies on the later
ones, and the later ones can be merged in first.
>
> I hope the way to fix cachefiles in patch 1-4 could be also
> confirmed by David and Jeff since they relates the generic
> cachefiles logic anyway.
Yes, the first four patches modify the generic process for cachefiles,
and it would be great if David and Jeff could take a look at those.
The logic for patches 2 and 3 is a bit more complex, so the review
may take some time.

Cheers,
Baokun
Baokun Li July 2, 2024, 12:25 p.m. UTC | #3
Friendly ping...

On 2024/6/28 14:29, libaokun@huaweicloud.com wrote:
> From: Baokun Li <libaokun1@huawei.com>
>
> Hi all!
>
> This is the third version of this patch series, in which another patch set
> is subsumed into this one to avoid confusing the two patch sets.
> (https://patchwork.kernel.org/project/linux-fsdevel/list/?series=854914)
>
> Thank you, Jia Zhu, Gao Xiang, Jeff Layton, for the feedback in the
> previous version.
>
> We've been testing ondemand mode for cachefiles since January, and we're
> almost done. We hit a lot of issues during the testing period, and this
> patch series fixes some of the issues. The patches have passed internal
> testing without regression.
>
> The following is a brief overview of the patches, see the patches for
> more details.
>
> Patch 1-2: Add fscache_try_get_volume() helper function to avoid
> fscache_volume use-after-free on cache withdrawal.
>
> Patch 3: Fix cachefiles_lookup_cookie() and cachefiles_withdraw_cache()
> concurrency causing cachefiles_volume use-after-free.
>
> Patch 4: Propagate error codes returned by vfs_getxattr() to avoid
> endless loops.
>
> Patch 5-7: A read request waiting for reopen could be closed maliciously
> before the reopen worker is executing or waiting to be scheduled. So
> ondemand_object_worker() may be called after the info and object and even
> the cache have been freed and trigger use-after-free. So use
> cancel_work_sync() in cachefiles_ondemand_clean_object() to cancel the
> reopen worker or wait for it to finish. Since it makes no sense to wait
> for the daemon to complete the reopen request, to avoid this pointless
> operation blocking cancel_work_sync(), Patch 1 avoids request generation
> by the DROPPING state when the request has not been sent, and Patch 2
> flushes the requests of the current object before cancel_work_sync().
>
> Patch 8: Cyclic allocation of msg_id to avoid msg_id reuse misleading
> the daemon to cause hung.
>
> Patch 9: Hold xas_lock during polling to avoid dereferencing reqs causing
> use-after-free. This issue was triggered frequently in our tests, and we
> found that anolis 5.10 had fixed it. So to avoid failing the test, this
> patch is pushed upstream as well.
>
> Comments and questions are, as always, welcome.
> Please let me know what you think.
>
> Thanks,
> Baokun
>
> Changes since v2:
>    * Collect Acked-by from Jeff Layton.(Thanks for your ack!)
>    * Collect RVB from Gao Xiang.(Thanks for your review!)
>    * Patch 1-4 from another patch set.
>    * Pathch 4,6,7: Optimise commit messages and subjects.
>
> Changes since v1:
>    * Collect RVB from Jia Zhu and Gao Xiang.(Thanks for your review!)
>    * Pathch 1,2:Add more commit messages.
>    * Pathch 3:Add Fixes tag as suggested by Jia Zhu.
>    * Pathch 4:No longer changing "do...while" to "retry" to focus changes
>      and optimise commit messages.
>    * Pathch 5: Drop the internal RVB tag.
>
> v1: https://lore.kernel.org/all/20240424033409.2735257-1-libaokun@huaweicloud.com
> v2: https://lore.kernel.org/all/20240515125136.3714580-1-libaokun@huaweicloud.com
>
> Baokun Li (7):
>    netfs, fscache: export fscache_put_volume() and add
>      fscache_try_get_volume()
>    cachefiles: fix slab-use-after-free in fscache_withdraw_volume()
>    cachefiles: fix slab-use-after-free in cachefiles_withdraw_cookie()
>    cachefiles: propagate errors from vfs_getxattr() to avoid infinite
>      loop
>    cachefiles: stop sending new request when dropping object
>    cachefiles: cancel all requests for the object that is being dropped
>    cachefiles: cyclic allocation of msg_id to avoid reuse
>
> Hou Tao (1):
>    cachefiles: wait for ondemand_object_worker to finish when dropping
>      object
>
> Jingbo Xu (1):
>    cachefiles: add missing lock protection when polling
>
>   fs/cachefiles/cache.c          | 45 ++++++++++++++++++++++++++++-
>   fs/cachefiles/daemon.c         |  4 +--
>   fs/cachefiles/internal.h       |  3 ++
>   fs/cachefiles/ondemand.c       | 52 ++++++++++++++++++++++++++++++----
>   fs/cachefiles/volume.c         |  1 -
>   fs/cachefiles/xattr.c          |  5 +++-
>   fs/netfs/fscache_volume.c      | 14 +++++++++
>   fs/netfs/internal.h            |  2 --
>   include/linux/fscache-cache.h  |  6 ++++
>   include/trace/events/fscache.h |  4 +++
>   10 files changed, 123 insertions(+), 13 deletions(-)
>