Message ID | 20240606184818.1566920-1-yosryahmed@google.com (mailing list archive)
---|---
State | New
Series | mm: zswap: add VM_BUG_ON() if large folio swapin is attempted
On Thu, Jun 6, 2024 at 1:22 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 06.06.24 20:48, Yosry Ahmed wrote:
> > With ongoing work to support large folio swapin, it is important to make
> > sure we do not pass large folios to zswap_load() without implementing
> > proper support.
> >
> > For example, if a swapin fault observes that contiguous PTEs are
> > pointing to contiguous swap entries and tries to swap them in as a large
> > folio, swap_read_folio() will pass in a large folio to zswap_load(), but
> > zswap_load() will only effectively load the first page in the folio. If
> > the first page is not in zswap, the folio will be read from disk, even
> > though other pages may be in zswap.
> >
> > In both cases, this will lead to silent data corruption.
> >
> > Proper large folio swapin support needs to go into zswap before zswap
> > can be enabled in a system that supports large folio swapin.
> >
> > Looking at callers of swap_read_folio(), it seems like they are either
> > allocated from __read_swap_cache_async() or do_swap_page() in the
> > SWP_SYNCHRONOUS_IO path. Both of which allocate order-0 folios, so we
> > are fine for now.
> >
> > Add a VM_BUG_ON() in zswap_load() to make sure that we detect changes in
> > the order of those allocations without proper handling of zswap.
> >
> > Alternatively, swap_read_folio() (or its callers) can be updated to have
> > a fallback mechanism that splits large folios or reads subpages
> > separately. Similar logic may be needed anyway in case part of a large
> > folio is already in the swapcache and the rest of it is swapped out.
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > ---
> >
> > Sorry for the long CC list, I just found myself repeatedly looking at
> > new series that add swap support for mTHPs / large folios, making sure
> > they do not break with zswap or make incorrect assumptions. This debug
> > check should give us some peace of mind. Hopefully this patch will also
> > raise awareness among people who are working on this.
> >
> > ---
> >  mm/zswap.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index b9b35ef86d9be..6007252429bb2 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1577,6 +1577,9 @@ bool zswap_load(struct folio *folio)
> >       if (!entry)
> >               return false;
> >
> > +     /* Zswap loads do not handle large folio swapins correctly yet */
> > +     VM_BUG_ON(folio_test_large(folio));
> > +
>
> There is no way we could have a WARN_ON_ONCE() and recover, right?

Not without making more fundamental changes to the surrounding swap
code. Currently zswap_load() returns either true (folio was loaded
from zswap) or false (folio is not in zswap).

To handle this correctly zswap_load() would need to tell
swap_read_folio() which subpages are in zswap and have been loaded,
and then swap_read_folio() would need to read the remaining subpages
from disk. This of course assumes that the caller of swap_read_folio()
made sure that the entire folio is swapped out and protected against
races with other swapins.

Also, because swap_read_folio() cannot split the folio itself, other
swap_read_folio_*() functions that are called from it should be
updated to handle swapping in tail subpages, which may be questionable
in its own right.

An alternative would be that zswap_load() (or a separate interface)
could tell swap_read_folio() that the folio is partially in zswap,
then we can just bail and tell the caller that it cannot read the
large folio and that it should be split.

There may be other options as well, but the bottom line is that it is
possible, but probably not something that we want to do right now.

A stronger protection method would be to introduce a config option or
boot parameter for large folio swapin, and then make CONFIG_ZSWAP
depend on it being disabled, or have zswap check it at boot and refuse
to be enabled if it is on.
On Fri, Jun 7, 2024 at 8:32 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Jun 6, 2024 at 1:22 PM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 06.06.24 20:48, Yosry Ahmed wrote:
> > > With ongoing work to support large folio swapin, it is important to make
> > > sure we do not pass large folios to zswap_load() without implementing
> > > proper support.
[...]
> > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

Acked-by: Barry Song <baohua@kernel.org>

this has been observed by me[1], that's why you can find the below
code in my swapin patch

+static struct folio *alloc_swap_folio(struct vm_fault *vmf)
+{
+	...
+	/*
+	 * a large folio being swapped-in could be partially in
+	 * zswap and partially in swap devices, zswap doesn't
+	 * support large folios yet, we might get corrupted
+	 * zero-filled data by reading all subpages from swap
+	 * devices while some of them are actually in zswap
+	 */
+	if (is_zswap_enabled())
+		goto fallback;

[1] https://lore.kernel.org/linux-mm/20240304081348.197341-6-21cnbao@gmail.com/

[...]

Thanks
Barry
On Thu, Jun 6, 2024 at 1:55 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Fri, Jun 7, 2024 at 8:32 AM Yosry Ahmed <yosryahmed@google.com> wrote:
[...]
> Acked-by: Barry Song <baohua@kernel.org>

Thanks!

> this has been observed by me[1], that's why you can find the below
> code in my swapin patch

Thanks for the pointer. I suppose if we add more generic swapin
support we'll have to add a similar check in __read_swap_cache_async(),
unless we are planning to reuse alloc_swap_folio() there.

It would be nice if we can have a global check for this rather than
add it to all different swapin paths (although I suspect there are
only two allocation paths right now).

We can always disable zswap if mTHP swapin is enabled, or always
disable mTHP swapin if zswap is enabled. At least until proper support
is added.

[...]
On Fri, Jun 7, 2024 at 9:17 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 06.06.24 22:31, Yosry Ahmed wrote:
> > On Thu, Jun 6, 2024 at 1:22 PM David Hildenbrand <david@redhat.com> wrote:
[...]
> > There may be other options as well, but the bottom line is that it is
> > possible, but probably not something that we want to do right now.
> >
> > A stronger protection method would be to introduce a config option or
> > boot parameter for large folio swapin, and then make CONFIG_ZSWAP
> > depend on it being disabled, or have zswap check it at boot and refuse
> > to be enabled if it is on.
>
> Right, sounds like the VM_BUG_ON() really is not that easily avoidable.
>
> I was wondering, if we could WARN_ON_ONCE and make the swap code detect
> this like a read-error from disk.
>
> I think do_swap_page() detects that by checking if the folio is not
> uptodate:
>
>         if (unlikely(!folio_test_uptodate(folio))) {
>                 ret = VM_FAULT_SIGBUS;
>                 goto out_nomap;
>         }
>
> So maybe WARN_ON_ONCE() + triggering that might be a bit nicer to the
> system (but the app would crash either way, there is no way around it).

I'd rather fallback to small folios swapin instead crashing apps till
we fix the large folio swapin in zswap :-)

+static struct folio *alloc_swap_folio(struct vm_fault *vmf)
+{
+	...
+
+	if (is_zswap_enabled())
+		goto fallback;

> --
> Cheers,
>
> David / dhildenb

Thanks
Barry
On Thu, Jun 6, 2024 at 2:17 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 06.06.24 22:31, Yosry Ahmed wrote:
[...]
> Right, sounds like the VM_BUG_ON() really is not that easily avoidable.
>
> I was wondering, if we could WARN_ON_ONCE and make the swap code detect
> this like a read-error from disk.
>
> I think do_swap_page() detects that by checking if the folio is not
> uptodate:
>
>         if (unlikely(!folio_test_uptodate(folio))) {
>                 ret = VM_FAULT_SIGBUS;
>                 goto out_nomap;
>         }
>
> So maybe WARN_ON_ONCE() + triggering that might be a bit nicer to the
> system (but the app would crash either way, there is no way around it).

It seems like most paths will handle a folio that is not uptodate
correctly. Both swapin_readahead() and the direct call to
swap_read_folio() in do_swap_page() should work correctly in this
case.

The shmem swapin path seems like it will return -EIO, which in the
fault path will also sigbus, and in the file read/write path I assume
will be handled correctly.

However, looking at the swapoff paths, it seems like we don't really
check uptodate. For example, shmem_unuse_swap_entries() will just
throw -EIO away. Maybe it is handled on a higher level by the fact
that the number of swap entries will not drop to zero, so swapoff will
not complete? :)

Anyway, I believe it may be possible to just not set uptodate, but I
am not very sure how reliable it will be. It may be better than
nothing anyway, I guess?
On Thu, Jun 6, 2024 at 2:30 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Fri, Jun 7, 2024 at 9:17 AM David Hildenbrand <david@redhat.com> wrote:
[...]
> > So maybe WARN_ON_ONCE() + triggering that might be a bit nicer to the
> > system (but the app would crash either way, there is no way around it).
>
> I'd rather fallback to small folios swapin instead crashing apps till we fix
> the large folio swapin in zswap :-)

I think David is referring to catching the buggy cases that do not
properly fall back to small folios with zswap, not as an alternative
to the fallback. This is at least what I had in mind with the patch.
On Fri, Jun 7, 2024 at 9:37 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Thu, Jun 6, 2024 at 2:30 PM Barry Song <21cnbao@gmail.com> wrote: > > > > On Fri, Jun 7, 2024 at 9:17 AM David Hildenbrand <david@redhat.com> wrote: > > > > > > On 06.06.24 22:31, Yosry Ahmed wrote: > > > > On Thu, Jun 6, 2024 at 1:22 PM David Hildenbrand <david@redhat.com> wrote: > > > >> > > > >> On 06.06.24 20:48, Yosry Ahmed wrote: > > > >>> With ongoing work to support large folio swapin, it is important to make > > > >>> sure we do not pass large folios to zswap_load() without implementing > > > >>> proper support. > > > >>> > > > >>> For example, if a swapin fault observes that contiguous PTEs are > > > >>> pointing to contiguous swap entries and tries to swap them in as a large > > > >>> folio, swap_read_folio() will pass in a large folio to zswap_load(), but > > > >>> zswap_load() will only effectively load the first page in the folio. If > > > >>> the first page is not in zswap, the folio will be read from disk, even > > > >>> though other pages may be in zswap. > > > >>> > > > >>> In both cases, this will lead to silent data corruption. > > > >>> > > > >>> Proper large folio swapin support needs to go into zswap before zswap > > > >>> can be enabled in a system that supports large folio swapin. > > > >>> > > > >>> Looking at callers of swap_read_folio(), it seems like they are either > > > >>> allocated from __read_swap_cache_async() or do_swap_page() in the > > > >>> SWP_SYNCHRONOUS_IO path. Both of which allocate order-0 folios, so we > > > >>> are fine for now. > > > >>> > > > >>> Add a VM_BUG_ON() in zswap_load() to make sure that we detect changes in > > > >>> the order of those allocations without proper handling of zswap. > > > >>> > > > >>> Alternatively, swap_read_folio() (or its callers) can be updated to have > > > >>> a fallback mechanism that splits large folios or reads subpages > > > >>> separately. 
Similar logic may be needed anyway in case part of a large > > > >>> folio is already in the swapcache and the rest of it is swapped out. > > > >>> > > > >>> Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > > >>> --- > > > >>> > > > >>> Sorry for the long CC list, I just found myself repeatedly looking at > > > >>> new series that add swap support for mTHPs / large folios, making sure > > > >>> they do not break with zswap or make incorrect assumptions. This debug > > > >>> check should give us some peace of mind. Hopefully this patch will also > > > >>> raise awareness among people who are working on this. > > > >>> > > > >>> --- > > > >>> mm/zswap.c | 3 +++ > > > >>> 1 file changed, 3 insertions(+) > > > >>> > > > >>> diff --git a/mm/zswap.c b/mm/zswap.c > > > >>> index b9b35ef86d9be..6007252429bb2 100644 > > > >>> --- a/mm/zswap.c > > > >>> +++ b/mm/zswap.c > > > >>> @@ -1577,6 +1577,9 @@ bool zswap_load(struct folio *folio) > > > >>> if (!entry) > > > >>> return false; > > > >>> > > > >>> + /* Zswap loads do not handle large folio swapins correctly yet */ > > > >>> + VM_BUG_ON(folio_test_large(folio)); > > > >>> + > > > >> > > > >> There is no way we could have a WARN_ON_ONCE() and recover, right? > > > > > > > > Not without making more fundamental changes to the surrounding swap > > > > code. Currently zswap_load() returns either true (folio was loaded > > > > from zswap) or false (folio is not in zswap). > > > > > > > > To handle this correctly zswap_load() would need to tell > > > > swap_read_folio() which subpages are in zswap and have been loaded, > > > > and then swap_read_folio() would need to read the remaining subpages > > > > from disk. This of course assumes that the caller of swap_read_folio() > > > > made sure that the entire folio is swapped out and protected against > > > > races with other swapins. 
> > > > > > > > Also, because swap_read_folio() cannot split the folio itself, other > > > > swap_read_folio_*() functions that are called from it should be > > > > updated to handle swapping in tail subpages, which may be questionable > > > > in its own right. > > > > > > > > An alternative would be that zswap_load() (or a separate interface) > > > > could tell swap_read_folio() that the folio is partially in zswap, > > > > then we can just bail and tell the caller that it cannot read the > > > > large folio and that it should be split. > > > > > > > > There may be other options as well, but the bottom line is that it is > > > > possible, but probably not something that we want to do right now. > > > > > > > > A stronger protection method would be to introduce a config option or > > > > boot parameter for large folio swapin, and then make CONFIG_ZSWAP > > > > depend on it being disabled, or have zswap check it at boot and refuse > > > > to be enabled if it is on. > > > > > > Right, sounds like the VM_BUG_ON() really is not that easily avoidable. > > > > > > I was wondering, if we could WARN_ON_ONCE and make the swap code detect > > > this like a read-error from disk. > > > > > > I think do_swap_page() detects that by checking if the folio is not > > > uptodate: > > > > > > if (unlikely(!folio_test_uptodate(folio))) { > > > ret = VM_FAULT_SIGBUS; > > > goto out_nomap; > > > } > > > > > > So maybe WARN_ON_ONCE() + triggering that might be a bit nicer to the > > > system (but the app would crash either way, there is no way around it). > > > > > > > I'd rather fallback to small folios swapin instead crashing apps till we fix > > the large folio swapin in zswap :-) > > I think David is referring to catching the buggy cases that do not > properly fallback to small folios with zswap, not as an alternative to > the fallback. This is at least what I had in mind with the patch. Cool. Thanks for the clarification. 
I'm fine with keeping the fallback, whether it's the current VM_BUG_ON or David's recommended SIGBUS.

Currently, mainline doesn't support large folio swap-in, so I see your patch as a helpful warning for those attempting large folio swap-in, to avoid mistakes like loading large folios from zswap.

In fact, I spent a week trying to figure out why my app was crashing before I added 'if (zswap_is_enabled()) goto fallback'. If your patch had come earlier, it would have saved me that week of work :-)

To me, a direct crash seems like a more straightforward way to prompt people to fall back when attempting large folio swap-ins. So, I am slightly in favor of your current patch.

Thanks
Barry
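The fallback Barry describes (refusing to swap in a large folio while zswap may hold the data, and dropping to order-0 instead) can be sketched as a small user-space model. This is purely illustrative: `choose_swapin_order` and `zswap_may_hold_data` are invented names, and in real kernel code the check would sit in the swapin allocation path, not look like this.

```c
#include <stdbool.h>

/*
 * Hypothetical model: zswap_load() cannot handle large folios, so any
 * swapin attempted while zswap may hold the data must fall back to a
 * single order-0 page ("goto fallback" in Barry's words).
 */
int choose_swapin_order(bool zswap_may_hold_data, int preferred_order)
{
    if (zswap_may_hold_data)
        return 0;               /* fall back to a single page */
    return preferred_order;     /* safe to try the mTHP order */
}
```

With zswap in play the preferred mTHP order is ignored and order-0 is used, so zswap only ever sees single-page loads.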
On 06.06.24 23:53, Barry Song wrote: > On Fri, Jun 7, 2024 at 9:37 AM Yosry Ahmed <yosryahmed@google.com> wrote: >> >> On Thu, Jun 6, 2024 at 2:30 PM Barry Song <21cnbao@gmail.com> wrote: >>> >>> On Fri, Jun 7, 2024 at 9:17 AM David Hildenbrand <david@redhat.com> wrote: >>>> >>>> On 06.06.24 22:31, Yosry Ahmed wrote: >>>>> On Thu, Jun 6, 2024 at 1:22 PM David Hildenbrand <david@redhat.com> wrote: >>>>>> >>>>>> On 06.06.24 20:48, Yosry Ahmed wrote: >>>>>>> With ongoing work to support large folio swapin, it is important to make >>>>>>> sure we do not pass large folios to zswap_load() without implementing >>>>>>> proper support. >>>>>>> >>>>>>> For example, if a swapin fault observes that contiguous PTEs are >>>>>>> pointing to contiguous swap entries and tries to swap them in as a large >>>>>>> folio, swap_read_folio() will pass in a large folio to zswap_load(), but >>>>>>> zswap_load() will only effectively load the first page in the folio. If >>>>>>> the first page is not in zswap, the folio will be read from disk, even >>>>>>> though other pages may be in zswap. >>>>>>> >>>>>>> In both cases, this will lead to silent data corruption. >>>>>>> >>>>>>> Proper large folio swapin support needs to go into zswap before zswap >>>>>>> can be enabled in a system that supports large folio swapin. >>>>>>> >>>>>>> Looking at callers of swap_read_folio(), it seems like they are either >>>>>>> allocated from __read_swap_cache_async() or do_swap_page() in the >>>>>>> SWP_SYNCHRONOUS_IO path. Both of which allocate order-0 folios, so we >>>>>>> are fine for now. >>>>>>> >>>>>>> Add a VM_BUG_ON() in zswap_load() to make sure that we detect changes in >>>>>>> the order of those allocations without proper handling of zswap. >>>>>>> >>>>>>> Alternatively, swap_read_folio() (or its callers) can be updated to have >>>>>>> a fallback mechanism that splits large folios or reads subpages >>>>>>> separately. 
Similar logic may be needed anyway in case part of a large >>>>>>> folio is already in the swapcache and the rest of it is swapped out. >>>>>>> >>>>>>> Signed-off-by: Yosry Ahmed <yosryahmed@google.com> >>>>>>> --- >>>>>>> >>>>>>> Sorry for the long CC list, I just found myself repeatedly looking at >>>>>>> new series that add swap support for mTHPs / large folios, making sure >>>>>>> they do not break with zswap or make incorrect assumptions. This debug >>>>>>> check should give us some peace of mind. Hopefully this patch will also >>>>>>> raise awareness among people who are working on this. >>>>>>> >>>>>>> --- >>>>>>> mm/zswap.c | 3 +++ >>>>>>> 1 file changed, 3 insertions(+) >>>>>>> >>>>>>> diff --git a/mm/zswap.c b/mm/zswap.c >>>>>>> index b9b35ef86d9be..6007252429bb2 100644 >>>>>>> --- a/mm/zswap.c >>>>>>> +++ b/mm/zswap.c >>>>>>> @@ -1577,6 +1577,9 @@ bool zswap_load(struct folio *folio) >>>>>>> if (!entry) >>>>>>> return false; >>>>>>> >>>>>>> + /* Zswap loads do not handle large folio swapins correctly yet */ >>>>>>> + VM_BUG_ON(folio_test_large(folio)); >>>>>>> + >>>>>> >>>>>> There is no way we could have a WARN_ON_ONCE() and recover, right? >>>>> >>>>> Not without making more fundamental changes to the surrounding swap >>>>> code. Currently zswap_load() returns either true (folio was loaded >>>>> from zswap) or false (folio is not in zswap). >>>>> >>>>> To handle this correctly zswap_load() would need to tell >>>>> swap_read_folio() which subpages are in zswap and have been loaded, >>>>> and then swap_read_folio() would need to read the remaining subpages >>>>> from disk. This of course assumes that the caller of swap_read_folio() >>>>> made sure that the entire folio is swapped out and protected against >>>>> races with other swapins. 
>>>>> >>>>> Also, because swap_read_folio() cannot split the folio itself, other >>>>> swap_read_folio_*() functions that are called from it should be >>>>> updated to handle swapping in tail subpages, which may be questionable >>>>> in its own right. >>>>> >>>>> An alternative would be that zswap_load() (or a separate interface) >>>>> could tell swap_read_folio() that the folio is partially in zswap, >>>>> then we can just bail and tell the caller that it cannot read the >>>>> large folio and that it should be split. >>>>> >>>>> There may be other options as well, but the bottom line is that it is >>>>> possible, but probably not something that we want to do right now. >>>>> >>>>> A stronger protection method would be to introduce a config option or >>>>> boot parameter for large folio swapin, and then make CONFIG_ZSWAP >>>>> depend on it being disabled, or have zswap check it at boot and refuse >>>>> to be enabled if it is on. >>>> >>>> Right, sounds like the VM_BUG_ON() really is not that easily avoidable. >>>> >>>> I was wondering, if we could WARN_ON_ONCE and make the swap code detect >>>> this like a read-error from disk. >>>> >>>> I think do_swap_page() detects that by checking if the folio is not >>>> uptodate: >>>> >>>> if (unlikely(!folio_test_uptodate(folio))) { >>>> ret = VM_FAULT_SIGBUS; >>>> goto out_nomap; >>>> } >>>> >>>> So maybe WARN_ON_ONCE() + triggering that might be a bit nicer to the >>>> system (but the app would crash either way, there is no way around it). >>>> >>> >>> I'd rather fallback to small folios swapin instead crashing apps till we fix >>> the large folio swapin in zswap :-) >> >> I think David is referring to catching the buggy cases that do not >> properly fallback to small folios with zswap, not as an alternative to >> the fallback. This is at least what I had in mind with the patch. > > Cool. Thanks for the clarification. I'm fine with keeping the fallback, > whether it's the current VM_BUG_ON or David's recommended > SIGBUS. 
> > Currently, mainline doesn't support large folios swap-in, so I see > your patch as a helpful warning for those attempting large folio > swap-in to avoid making mistakes like loading large folios from > zswap. > > In fact, I spent a week trying to figure out why my app was crashing > before I added 'if (zswap_is_enabled()) goto fallback'. If your patch > had come earlier, it would have saved me that week of work :-) > > To me, a direct crash seems like a more straightforward way to > prompt people to fallback when attempting large folio swap-ins. > So, I am slightly in favor of your current patch. BUG_ON and friends are frowned-upon, in particular in scenarios where we can recover or that are so unexpected that they can be found during early testing. I have no strong opinion on this one, but likely a VM_WARN_ON would also be sufficient to find such issues early during testing. No need to crash the machine.
On Fri, Jun 7, 2024 at 12:11 AM David Hildenbrand <david@redhat.com> wrote: > > On 06.06.24 23:53, Barry Song wrote: > > On Fri, Jun 7, 2024 at 9:37 AM Yosry Ahmed <yosryahmed@google.com> wrote: > >> > >> On Thu, Jun 6, 2024 at 2:30 PM Barry Song <21cnbao@gmail.com> wrote: > >>> > >>> On Fri, Jun 7, 2024 at 9:17 AM David Hildenbrand <david@redhat.com> wrote: > >>>> > >>>> On 06.06.24 22:31, Yosry Ahmed wrote: > >>>>> On Thu, Jun 6, 2024 at 1:22 PM David Hildenbrand <david@redhat.com> wrote: > >>>>>> > >>>>>> On 06.06.24 20:48, Yosry Ahmed wrote: > >>>>>>> With ongoing work to support large folio swapin, it is important to make > >>>>>>> sure we do not pass large folios to zswap_load() without implementing > >>>>>>> proper support. > >>>>>>> > >>>>>>> For example, if a swapin fault observes that contiguous PTEs are > >>>>>>> pointing to contiguous swap entries and tries to swap them in as a large > >>>>>>> folio, swap_read_folio() will pass in a large folio to zswap_load(), but > >>>>>>> zswap_load() will only effectively load the first page in the folio. If > >>>>>>> the first page is not in zswap, the folio will be read from disk, even > >>>>>>> though other pages may be in zswap. > >>>>>>> > >>>>>>> In both cases, this will lead to silent data corruption. > >>>>>>> > >>>>>>> Proper large folio swapin support needs to go into zswap before zswap > >>>>>>> can be enabled in a system that supports large folio swapin. > >>>>>>> > >>>>>>> Looking at callers of swap_read_folio(), it seems like they are either > >>>>>>> allocated from __read_swap_cache_async() or do_swap_page() in the > >>>>>>> SWP_SYNCHRONOUS_IO path. Both of which allocate order-0 folios, so we > >>>>>>> are fine for now. > >>>>>>> > >>>>>>> Add a VM_BUG_ON() in zswap_load() to make sure that we detect changes in > >>>>>>> the order of those allocations without proper handling of zswap. 
> >>>>>>> > >>>>>>> Alternatively, swap_read_folio() (or its callers) can be updated to have > >>>>>>> a fallback mechanism that splits large folios or reads subpages > >>>>>>> separately. Similar logic may be needed anyway in case part of a large > >>>>>>> folio is already in the swapcache and the rest of it is swapped out. > >>>>>>> > >>>>>>> Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > >>>>>>> --- > >>>>>>> > >>>>>>> Sorry for the long CC list, I just found myself repeatedly looking at > >>>>>>> new series that add swap support for mTHPs / large folios, making sure > >>>>>>> they do not break with zswap or make incorrect assumptions. This debug > >>>>>>> check should give us some peace of mind. Hopefully this patch will also > >>>>>>> raise awareness among people who are working on this. > >>>>>>> > >>>>>>> --- > >>>>>>> mm/zswap.c | 3 +++ > >>>>>>> 1 file changed, 3 insertions(+) > >>>>>>> > >>>>>>> diff --git a/mm/zswap.c b/mm/zswap.c > >>>>>>> index b9b35ef86d9be..6007252429bb2 100644 > >>>>>>> --- a/mm/zswap.c > >>>>>>> +++ b/mm/zswap.c > >>>>>>> @@ -1577,6 +1577,9 @@ bool zswap_load(struct folio *folio) > >>>>>>> if (!entry) > >>>>>>> return false; > >>>>>>> > >>>>>>> + /* Zswap loads do not handle large folio swapins correctly yet */ > >>>>>>> + VM_BUG_ON(folio_test_large(folio)); > >>>>>>> + > >>>>>> > >>>>>> There is no way we could have a WARN_ON_ONCE() and recover, right? > >>>>> > >>>>> Not without making more fundamental changes to the surrounding swap > >>>>> code. Currently zswap_load() returns either true (folio was loaded > >>>>> from zswap) or false (folio is not in zswap). > >>>>> > >>>>> To handle this correctly zswap_load() would need to tell > >>>>> swap_read_folio() which subpages are in zswap and have been loaded, > >>>>> and then swap_read_folio() would need to read the remaining subpages > >>>>> from disk. 
This of course assumes that the caller of swap_read_folio() > >>>>> made sure that the entire folio is swapped out and protected against > >>>>> races with other swapins. > >>>>> > >>>>> Also, because swap_read_folio() cannot split the folio itself, other > >>>>> swap_read_folio_*() functions that are called from it should be > >>>>> updated to handle swapping in tail subpages, which may be questionable > >>>>> in its own right. > >>>>> > >>>>> An alternative would be that zswap_load() (or a separate interface) > >>>>> could tell swap_read_folio() that the folio is partially in zswap, > >>>>> then we can just bail and tell the caller that it cannot read the > >>>>> large folio and that it should be split. > >>>>> > >>>>> There may be other options as well, but the bottom line is that it is > >>>>> possible, but probably not something that we want to do right now. > >>>>> > >>>>> A stronger protection method would be to introduce a config option or > >>>>> boot parameter for large folio swapin, and then make CONFIG_ZSWAP > >>>>> depend on it being disabled, or have zswap check it at boot and refuse > >>>>> to be enabled if it is on. > >>>> > >>>> Right, sounds like the VM_BUG_ON() really is not that easily avoidable. > >>>> > >>>> I was wondering, if we could WARN_ON_ONCE and make the swap code detect > >>>> this like a read-error from disk. > >>>> > >>>> I think do_swap_page() detects that by checking if the folio is not > >>>> uptodate: > >>>> > >>>> if (unlikely(!folio_test_uptodate(folio))) { > >>>> ret = VM_FAULT_SIGBUS; > >>>> goto out_nomap; > >>>> } > >>>> > >>>> So maybe WARN_ON_ONCE() + triggering that might be a bit nicer to the > >>>> system (but the app would crash either way, there is no way around it). 
> >>>> > >>> > >>> I'd rather fallback to small folios swapin instead crashing apps till we fix > >>> the large folio swapin in zswap :-) > >> > >> I think David is referring to catching the buggy cases that do not > >> properly fallback to small folios with zswap, not as an alternative to > >> the fallback. This is at least what I had in mind with the patch. > > > > Cool. Thanks for the clarification. I'm fine with keeping the fallback, > > whether it's the current VM_BUG_ON or David's recommended > > SIGBUS. > > > > Currently, mainline doesn't support large folios swap-in, so I see > > your patch as a helpful warning for those attempting large folio > > swap-in to avoid making mistakes like loading large folios from > > zswap. > > > > In fact, I spent a week trying to figure out why my app was crashing > > before I added 'if (zswap_is_enabled()) goto fallback'. If your patch > > had come earlier, it would have saved me that week of work :-) > > > > To me, a direct crash seems like a more straightforward way to > > prompt people to fallback when attempting large folio swap-ins. > > So, I am slightly in favor of your current patch. > > BUG_ON and friends are frowned-upon, in particular in scenarios where we > can recover or that are so unexpected that they can be found during > early testing. > > I have no strong opinion on this one, but likely a VM_WARN_ON would also > be sufficient to find such issues early during testing. No need to crash > the machine. I thought VM_BUG_ON() was less frowned-upon than BUG_ON(), but after some digging I found your patches to checkpatch and Linus clearly stating that it isn't. 
How about something like the following (untested)? It is the minimal recovery we can do, but it should work for a lot of cases, and it does nothing beyond a warning if we can swap in the large folio from disk:

diff --git a/mm/page_io.c b/mm/page_io.c
index f1a9cfab6e748..8f441dd8e109f 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -517,7 +517,6 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
 	delayacct_swapin_start();
 
 	if (zswap_load(folio)) {
-		folio_mark_uptodate(folio);
 		folio_unlock(folio);
 	} else if (data_race(sis->flags & SWP_FS_OPS)) {
 		swap_read_folio_fs(folio, plug);
diff --git a/mm/zswap.c b/mm/zswap.c
index 6007252429bb2..cc04db6bb217e 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1557,6 +1557,22 @@ bool zswap_load(struct folio *folio)
 
 	VM_WARN_ON_ONCE(!folio_test_locked(folio));
 
+	/*
+	 * Large folios should not be swapped in while zswap is being used, as
+	 * they are not properly handled.
+	 *
+	 * If any of the subpages are in zswap, reading from disk would result
+	 * in data corruption, so return true without marking the folio
+	 * uptodate so that an IO error is emitted (e.g. do_swap_page() will
+	 * sigfault).
+	 *
+	 * Otherwise, return false and read the folio from disk.
+	 */
+	if (WARN_ON_ONCE(folio_test_large(folio))) {
+		if (xa_find(tree, &offset, offset + folio_nr_pages(folio) - 1, 0))
+			return true;
+		return false;
+	}
+
 	/*
 	 * When reading into the swapcache, invalidate our entry. The
 	 * swapcache can be the authoritative owner of the page and
@@ -1593,7 +1609,7 @@ bool zswap_load(struct folio *folio)
 		zswap_entry_free(entry);
 		folio_mark_dirty(folio);
 	}
-
+	folio_mark_uptodate(folio);
 	return true;
 }

One problem is that even if zswap was never enabled, the warning will be emitted just because CONFIG_ZSWAP is on. Perhaps we need a variable or static key recording whether zswap was "ever" enabled.

Barry, I suspect your zswap_is_enabled() check is deficient for similar reasons: zswap could have been enabled before and then disabled.
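The decision logic of the proposed recovery can be modeled outside the kernel. The sketch below is a user-space stand-in, not kernel code: `model_zswap_load` and the `present` array (which plays the role of the zswap xarray lookup) are invented for illustration; the real patch uses `xa_find()` over the folio's swap-offset range.

```c
#include <stdbool.h>
#include <stddef.h>

/* Possible outcomes of the proposed zswap_load() behavior. */
enum load_result {
    READ_FROM_DISK,     /* zswap_load() returns false */
    IO_ERROR,           /* returns true without uptodate -> caller sees IO error */
    LOADED_UPTODATE,    /* order-0 hit: loaded and marked uptodate */
};

/*
 * Model of the recovery: present[i] says whether swap offset i has a
 * zswap entry (standing in for the xarray the real code searches).
 */
enum load_result model_zswap_load(const bool *present, size_t nslots,
                                  size_t offset, size_t nr_pages)
{
    if (nr_pages > 1) {                 /* large folio: unsupported */
        for (size_t i = offset; i < offset + nr_pages && i < nslots; i++)
            if (present[i])             /* xa_find() over the range */
                return IO_ERROR;        /* disk read would corrupt data */
        return READ_FROM_DISK;          /* nothing in zswap: disk is safe */
    }
    /* order-0: the normal, supported path */
    return present[offset] ? LOADED_UPTODATE : READ_FROM_DISK;
}
```

A large folio whose range overlaps any zswap entry is deliberately failed (better an IO error than silent corruption); one with no overlap is read from disk as usual.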
>> I have no strong opinion on this one, but likely a VM_WARN_ON would also >> be sufficient to find such issues early during testing. No need to crash >> the machine. > > I thought VM_BUG_ON() was less frowned-upon than BUG_ON(), but after > some digging I found your patches to checkpatch and Linus clearly > stating that it isn't. :) yes. VM_BUG_ON is not particularly helpful IMHO. If you want something to be found early during testing, VM_WARN_ON is good enough. Ever since Fedora stopped enabling CONFIG_DEBUG_VM, VM_* friends are primarily reported during early/development testing only. But maybe some distro out there still sets it. > > How about something like the following (untested), it is the minimal > recovery we can do but should work for a lot of cases, and does > nothing beyond a warning if we can swapin the large folio from disk: > > diff --git a/mm/page_io.c b/mm/page_io.c > index f1a9cfab6e748..8f441dd8e109f 100644 > --- a/mm/page_io.c > +++ b/mm/page_io.c > @@ -517,7 +517,6 @@ void swap_read_folio(struct folio *folio, struct > swap_iocb **plug) > delayacct_swapin_start(); > > if (zswap_load(folio)) { > - folio_mark_uptodate(folio); > folio_unlock(folio); > } else if (data_race(sis->flags & SWP_FS_OPS)) { > swap_read_folio_fs(folio, plug); > diff --git a/mm/zswap.c b/mm/zswap.c > index 6007252429bb2..cc04db6bb217e 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -1557,6 +1557,22 @@ bool zswap_load(struct folio *folio) > > VM_WARN_ON_ONCE(!folio_test_locked(folio)); > > + /* > + * Large folios should not be swapped in while zswap is being used, as > + * they are not properly handled. > + * > + * If any of the subpages are in zswap, reading from disk would result > + * in data corruption, so return true without marking the folio uptodate > + * so that an IO error is emitted (e.g. do_swap_page() will sigfault). > + * > + * Otherwise, return false and read the folio from disk. 
> + */
> + if (WARN_ON_ONCE(folio_test_large(folio))) {
> + if (xa_find(tree, &offset, offset + folio_nr_pages(folio) - 1, 0))
> + return true;
> + return false;
> + }
> +
> /*
> * When reading into the swapcache, invalidate our entry. The
> * swapcache can be the authoritative owner of the page and
> @@ -1593,7 +1609,7 @@ bool zswap_load(struct folio *folio)
> zswap_entry_free(entry);
> folio_mark_dirty(folio);
> }
> -
> + folio_mark_uptodate(folio);
> return true;
> }
>
> One problem is that even if zswap was never enabled, the warning will
> be emitted just if CONFIG_ZSWAP is on. Perhaps we need a variable or
> static key if zswap was "ever" enabled.

We should use WARN_ON_ONCE() only for things that cannot happen. So if there are cases where this could be triggered today, it would be problematic -- especially if it can be triggered from unprivileged user space. But if we're concerned about other code messing up our invariant in the future (e.g., enabling large folios without taking proper care of zswap etc.), we're good to add it.
On Fri, Jun 7, 2024 at 11:52 AM David Hildenbrand <david@redhat.com> wrote: > > >> I have no strong opinion on this one, but likely a VM_WARN_ON would also > >> be sufficient to find such issues early during testing. No need to crash > >> the machine. > > > > I thought VM_BUG_ON() was less frowned-upon than BUG_ON(), but after > > some digging I found your patches to checkpatch and Linus clearly > > stating that it isn't. > > :) yes. > > VM_BUG_ON is not particularly helpful IMHO. If you want something to be > found early during testing, VM_WARN_ON is good enough. > > Ever since Fedora stopped enabling CONFIG_DEBUG_VM, VM_* friends are > primarily reported during early/development testing only. But maybe some > distro out there still sets it. > > > > > How about something like the following (untested), it is the minimal > > recovery we can do but should work for a lot of cases, and does > > nothing beyond a warning if we can swapin the large folio from disk: > > > > diff --git a/mm/page_io.c b/mm/page_io.c > > index f1a9cfab6e748..8f441dd8e109f 100644 > > --- a/mm/page_io.c > > +++ b/mm/page_io.c > > @@ -517,7 +517,6 @@ void swap_read_folio(struct folio *folio, struct > > swap_iocb **plug) > > delayacct_swapin_start(); > > > > if (zswap_load(folio)) { > > - folio_mark_uptodate(folio); > > folio_unlock(folio); > > } else if (data_race(sis->flags & SWP_FS_OPS)) { > > swap_read_folio_fs(folio, plug); > > diff --git a/mm/zswap.c b/mm/zswap.c > > index 6007252429bb2..cc04db6bb217e 100644 > > --- a/mm/zswap.c > > +++ b/mm/zswap.c > > @@ -1557,6 +1557,22 @@ bool zswap_load(struct folio *folio) > > > > VM_WARN_ON_ONCE(!folio_test_locked(folio)); > > > > + /* > > + * Large folios should not be swapped in while zswap is being used, as > > + * they are not properly handled. 
> > + *
> > + * If any of the subpages are in zswap, reading from disk would result
> > + * in data corruption, so return true without marking the folio uptodate
> > + * so that an IO error is emitted (e.g. do_swap_page() will sigfault).
> > + *
> > + * Otherwise, return false and read the folio from disk.
> > + */
> > + if (WARN_ON_ONCE(folio_test_large(folio))) {
> > + if (xa_find(tree, &offset, offset + folio_nr_pages(folio) - 1, 0))
> > + return true;
> > + return false;
> > + }
> > +
> > /*
> > * When reading into the swapcache, invalidate our entry. The
> > * swapcache can be the authoritative owner of the page and
> > @@ -1593,7 +1609,7 @@ bool zswap_load(struct folio *folio)
> > zswap_entry_free(entry);
> > folio_mark_dirty(folio);
> > }
> > -
> > + folio_mark_uptodate(folio);
> > return true;
> > }
> >
> > One problem is that even if zswap was never enabled, the warning will
> > be emitted just if CONFIG_ZSWAP is on. Perhaps we need a variable or
> > static key if zswap was "ever" enabled.
>
> We should use WARN_ON_ONCE() only for things that cannot happen. So if
> there are cases where this could be triggered today, it would be
> problematic -- especially if it can be triggered from unprivileged user
> space. But if we're concerned of other code messing up our invariant in
> the future (e.g., enabling large folios without taking proper care about
> zswap etc), we're good to add it.

Right now I can't see any paths allocating large folios for swapin, so I think it cannot happen. Once someone tries adding it, the warning will fire if CONFIG_ZSWAP is used, even if zswap is disabled.

At this point we will have several options:
- Make large folio swapin depend on !CONFIG_ZSWAP for now.
- Keep track of whether zswap was ever enabled and make the warning conditional on it. We should also always fall back to order-0 if zswap was ever enabled.
- Properly handle large folio swapin with zswap.

Does this sound reasonable to you?
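The second option, tracking whether zswap was "ever" enabled, can be sketched with a sticky flag. This is a hypothetical user-space model: the names `zswap_set_enabled`, `zswap_ever_enabled`, and `large_folio_swapin_allowed` are invented, and real kernel code would likely use a static key rather than a plain bool.

```c
#include <stdbool.h>

bool zswap_enabled;         /* current runtime state */
bool zswap_ever_enabled;    /* sticky: set once, never cleared */

/*
 * Model of toggling zswap at runtime. Disabling zswap does not clear
 * the sticky flag, because entries compressed while it was on may
 * still be sitting in zswap.
 */
void zswap_set_enabled(bool on)
{
    zswap_enabled = on;
    if (on)
        zswap_ever_enabled = true;
}

/* Large folio swapin is only safe if zswap never held any data. */
bool large_folio_swapin_allowed(void)
{
    return !zswap_ever_enabled;
}
```

This captures why a plain `zswap_is_enabled()` check is not enough: once zswap has been enabled even briefly, swapped-out data may live there, so the fallback must stay in force after it is disabled.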
On 07.06.24 20:58, Yosry Ahmed wrote: > On Fri, Jun 7, 2024 at 11:52 AM David Hildenbrand <david@redhat.com> wrote: >> >>>> I have no strong opinion on this one, but likely a VM_WARN_ON would also >>>> be sufficient to find such issues early during testing. No need to crash >>>> the machine. >>> >>> I thought VM_BUG_ON() was less frowned-upon than BUG_ON(), but after >>> some digging I found your patches to checkpatch and Linus clearly >>> stating that it isn't. >> >> :) yes. >> >> VM_BUG_ON is not particularly helpful IMHO. If you want something to be >> found early during testing, VM_WARN_ON is good enough. >> >> Ever since Fedora stopped enabling CONFIG_DEBUG_VM, VM_* friends are >> primarily reported during early/development testing only. But maybe some >> distro out there still sets it. >> >>> >>> How about something like the following (untested), it is the minimal >>> recovery we can do but should work for a lot of cases, and does >>> nothing beyond a warning if we can swapin the large folio from disk: >>> >>> diff --git a/mm/page_io.c b/mm/page_io.c >>> index f1a9cfab6e748..8f441dd8e109f 100644 >>> --- a/mm/page_io.c >>> +++ b/mm/page_io.c >>> @@ -517,7 +517,6 @@ void swap_read_folio(struct folio *folio, struct >>> swap_iocb **plug) >>> delayacct_swapin_start(); >>> >>> if (zswap_load(folio)) { >>> - folio_mark_uptodate(folio); >>> folio_unlock(folio); >>> } else if (data_race(sis->flags & SWP_FS_OPS)) { >>> swap_read_folio_fs(folio, plug); >>> diff --git a/mm/zswap.c b/mm/zswap.c >>> index 6007252429bb2..cc04db6bb217e 100644 >>> --- a/mm/zswap.c >>> +++ b/mm/zswap.c >>> @@ -1557,6 +1557,22 @@ bool zswap_load(struct folio *folio) >>> >>> VM_WARN_ON_ONCE(!folio_test_locked(folio)); >>> >>> + /* >>> + * Large folios should not be swapped in while zswap is being used, as >>> + * they are not properly handled. 
>>> + * >>> + * If any of the subpages are in zswap, reading from disk would result >>> + * in data corruption, so return true without marking the folio uptodate >>> + * so that an IO error is emitted (e.g. do_swap_page() will sigfault). >>> + * >>> + * Otherwise, return false and read the folio from disk. >>> + */ >>> + if (WARN_ON_ONCE(folio_test_large(folio))) { >>> + if (xa_find(tree, &offset, offset + >>> folio_nr_pages(folio) - 1, 0)) >>> + return true; >>> + return false; >>> + } >>> + >>> /* >>> * When reading into the swapcache, invalidate our entry. The >>> * swapcache can be the authoritative owner of the page and >>> @@ -1593,7 +1609,7 @@ bool zswap_load(struct folio *folio) >>> zswap_entry_free(entry); >>> folio_mark_dirty(folio); >>> } >>> - >>> + folio_mark_uptodate(folio); >>> return true; >>> } >>> >>> One problem is that even if zswap was never enabled, the warning will >>> be emitted just if CONFIG_ZSWAP is on. Perhaps we need a variable or >>> static key if zswap was "ever" enabled. >> >> We should use WARN_ON_ONCE() only for things that cannot happen. So if >> there are cases where this could be triggered today, it would be >> problematic -- especially if it can be triggered from unprivileged user >> space. But if we're concerned of other code messing up our invariant in >> the future (e.g., enabling large folios without taking proper care about >> zswap etc), we're good to add it. > > Right now I can't see any paths allocating large folios for swapin, so > I think it cannot happen. Once someone tries adding it, the warning > will fire if CONFIG_ZSWAP is used, even if zswap is disabled. > > At this point we will have several options: > - Make large folios swapin depend on !CONFIG_ZSWAP for now. > - Keep track if zswap was ever enabled and make the warning > conditional on it. We should also always fallback to order-0 if zswap > was ever enabled. > - Properly handle large folio swapin with zswap. > > Does this sound reasonable to you? 
Yes, probably easiest is 1) -> 3), or, if we don't want to glue it to config options, 2) -> 3).
On Fri, Jun 7, 2024 at 11:58 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Fri, Jun 7, 2024 at 11:52 AM David Hildenbrand <david@redhat.com> wrote: > > > > >> I have no strong opinion on this one, but likely a VM_WARN_ON would also > > >> be sufficient to find such issues early during testing. No need to crash > > >> the machine. > > > > > > I thought VM_BUG_ON() was less frowned-upon than BUG_ON(), but after > > > some digging I found your patches to checkpatch and Linus clearly > > > stating that it isn't. > > > > :) yes. > > > > VM_BUG_ON is not particularly helpful IMHO. If you want something to be > > found early during testing, VM_WARN_ON is good enough. > > > > Ever since Fedora stopped enabling CONFIG_DEBUG_VM, VM_* friends are > > primarily reported during early/development testing only. But maybe some > > distro out there still sets it. > > > > > > > > How about something like the following (untested), it is the minimal > > > recovery we can do but should work for a lot of cases, and does > > > nothing beyond a warning if we can swapin the large folio from disk: > > > > > > diff --git a/mm/page_io.c b/mm/page_io.c > > > index f1a9cfab6e748..8f441dd8e109f 100644 > > > --- a/mm/page_io.c > > > +++ b/mm/page_io.c > > > @@ -517,7 +517,6 @@ void swap_read_folio(struct folio *folio, struct > > > swap_iocb **plug) > > > delayacct_swapin_start(); > > > > > > if (zswap_load(folio)) { > > > - folio_mark_uptodate(folio); > > > folio_unlock(folio); > > > } else if (data_race(sis->flags & SWP_FS_OPS)) { > > > swap_read_folio_fs(folio, plug); > > > diff --git a/mm/zswap.c b/mm/zswap.c > > > index 6007252429bb2..cc04db6bb217e 100644 > > > --- a/mm/zswap.c > > > +++ b/mm/zswap.c > > > @@ -1557,6 +1557,22 @@ bool zswap_load(struct folio *folio) > > > > > > VM_WARN_ON_ONCE(!folio_test_locked(folio)); > > > > > > + /* > > > + * Large folios should not be swapped in while zswap is being used, as > > > + * they are not properly handled. 
> > > + *
> > > + * If any of the subpages are in zswap, reading from disk would result
> > > + * in data corruption, so return true without marking the folio uptodate
> > > + * so that an IO error is emitted (e.g. do_swap_page() will sigfault).
> > > + *
> > > + * Otherwise, return false and read the folio from disk.
> > > + */
> > > + if (WARN_ON_ONCE(folio_test_large(folio))) {
> > > + if (xa_find(tree, &offset, offset + folio_nr_pages(folio) - 1, 0))
> > > + return true;
> > > + return false;
> > > + }
> > > +
> > > /*
> > > * When reading into the swapcache, invalidate our entry. The
> > > * swapcache can be the authoritative owner of the page and
> > > @@ -1593,7 +1609,7 @@ bool zswap_load(struct folio *folio)
> > > zswap_entry_free(entry);
> > > folio_mark_dirty(folio);
> > > }
> > > -
> > > + folio_mark_uptodate(folio);
> > > return true;
> > > }
> > >
> > > One problem is that even if zswap was never enabled, the warning will
> > > be emitted just if CONFIG_ZSWAP is on. Perhaps we need a variable or
> > > static key if zswap was "ever" enabled.
> >
> > We should use WARN_ON_ONCE() only for things that cannot happen. So if
> > there are cases where this could be triggered today, it would be
> > problematic -- especially if it can be triggered from unprivileged user
> > space. But if we're concerned of other code messing up our invariant in
> > the future (e.g., enabling large folios without taking proper care about
> > zswap etc), we're good to add it.
>
> Right now I can't see any paths allocating large folios for swapin, so
> I think it cannot happen. Once someone tries adding it, the warning
> will fire if CONFIG_ZSWAP is used, even if zswap is disabled.
> At this point we will have several options:

Here is my take on this:

> - Make large folios swapin depend on !CONFIG_ZSWAP for now.

I think a WARN_ON or BUG_ON is better. I would need to revert this change when I am working on 3). It is a made-up rule, not a real dependency anyway.
> - Keep track if zswap was ever enabled and make the warning
> conditional on it. We should also always fallback to order-0 if zswap
> was ever enabled.

IMHO, falling back to order-0 inside zswap is not desired because it complicates the zswap code. We should not pass large folios to zswap if zswap is not ready to handle them. The core swap code already has the fallback to order-0. If we get to 3), then this fallback in zswap would need to be removed. It is a transitional thing, so maybe better not to introduce it in the first place.

> - Properly handle large folio swapin with zswap.

That obviously is ideal.

Chris
On Sat, Jun 8, 2024 at 5:43 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Fri, Jun 7, 2024 at 12:11 AM David Hildenbrand <david@redhat.com> wrote: > > > > On 06.06.24 23:53, Barry Song wrote: > > > On Fri, Jun 7, 2024 at 9:37 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > >> > > >> On Thu, Jun 6, 2024 at 2:30 PM Barry Song <21cnbao@gmail.com> wrote: > > >>> > > >>> On Fri, Jun 7, 2024 at 9:17 AM David Hildenbrand <david@redhat.com> wrote: > > >>>> > > >>>> On 06.06.24 22:31, Yosry Ahmed wrote: > > >>>>> On Thu, Jun 6, 2024 at 1:22 PM David Hildenbrand <david@redhat.com> wrote: > > >>>>>> > > >>>>>> On 06.06.24 20:48, Yosry Ahmed wrote: > > >>>>>>> With ongoing work to support large folio swapin, it is important to make > > >>>>>>> sure we do not pass large folios to zswap_load() without implementing > > >>>>>>> proper support. > > >>>>>>> > > >>>>>>> For example, if a swapin fault observes that contiguous PTEs are > > >>>>>>> pointing to contiguous swap entries and tries to swap them in as a large > > >>>>>>> folio, swap_read_folio() will pass in a large folio to zswap_load(), but > > >>>>>>> zswap_load() will only effectively load the first page in the folio. If > > >>>>>>> the first page is not in zswap, the folio will be read from disk, even > > >>>>>>> though other pages may be in zswap. > > >>>>>>> > > >>>>>>> In both cases, this will lead to silent data corruption. > > >>>>>>> > > >>>>>>> Proper large folio swapin support needs to go into zswap before zswap > > >>>>>>> can be enabled in a system that supports large folio swapin. > > >>>>>>> > > >>>>>>> Looking at callers of swap_read_folio(), it seems like they are either > > >>>>>>> allocated from __read_swap_cache_async() or do_swap_page() in the > > >>>>>>> SWP_SYNCHRONOUS_IO path. Both of which allocate order-0 folios, so we > > >>>>>>> are fine for now. 
> > >>>>>>> > > >>>>>>> Add a VM_BUG_ON() in zswap_load() to make sure that we detect changes in > > >>>>>>> the order of those allocations without proper handling of zswap. > > >>>>>>> > > >>>>>>> Alternatively, swap_read_folio() (or its callers) can be updated to have > > >>>>>>> a fallback mechanism that splits large folios or reads subpages > > >>>>>>> separately. Similar logic may be needed anyway in case part of a large > > >>>>>>> folio is already in the swapcache and the rest of it is swapped out. > > >>>>>>> > > >>>>>>> Signed-off-by: Yosry Ahmed <yosryahmed@google.com> > > >>>>>>> --- > > >>>>>>> > > >>>>>>> Sorry for the long CC list, I just found myself repeatedly looking at > > >>>>>>> new series that add swap support for mTHPs / large folios, making sure > > >>>>>>> they do not break with zswap or make incorrect assumptions. This debug > > >>>>>>> check should give us some peace of mind. Hopefully this patch will also > > >>>>>>> raise awareness among people who are working on this. > > >>>>>>> > > >>>>>>> --- > > >>>>>>> mm/zswap.c | 3 +++ > > >>>>>>> 1 file changed, 3 insertions(+) > > >>>>>>> > > >>>>>>> diff --git a/mm/zswap.c b/mm/zswap.c > > >>>>>>> index b9b35ef86d9be..6007252429bb2 100644 > > >>>>>>> --- a/mm/zswap.c > > >>>>>>> +++ b/mm/zswap.c > > >>>>>>> @@ -1577,6 +1577,9 @@ bool zswap_load(struct folio *folio) > > >>>>>>> if (!entry) > > >>>>>>> return false; > > >>>>>>> > > >>>>>>> + /* Zswap loads do not handle large folio swapins correctly yet */ > > >>>>>>> + VM_BUG_ON(folio_test_large(folio)); > > >>>>>>> + > > >>>>>> > > >>>>>> There is no way we could have a WARN_ON_ONCE() and recover, right? > > >>>>> > > >>>>> Not without making more fundamental changes to the surrounding swap > > >>>>> code. Currently zswap_load() returns either true (folio was loaded > > >>>>> from zswap) or false (folio is not in zswap). 
> > >>>>> > > >>>>> To handle this correctly zswap_load() would need to tell > > >>>>> swap_read_folio() which subpages are in zswap and have been loaded, > > >>>>> and then swap_read_folio() would need to read the remaining subpages > > >>>>> from disk. This of course assumes that the caller of swap_read_folio() > > >>>>> made sure that the entire folio is swapped out and protected against > > >>>>> races with other swapins. > > >>>>> > > >>>>> Also, because swap_read_folio() cannot split the folio itself, other > > >>>>> swap_read_folio_*() functions that are called from it should be > > >>>>> updated to handle swapping in tail subpages, which may be questionable > > >>>>> in its own right. > > >>>>> > > >>>>> An alternative would be that zswap_load() (or a separate interface) > > >>>>> could tell swap_read_folio() that the folio is partially in zswap, > > >>>>> then we can just bail and tell the caller that it cannot read the > > >>>>> large folio and that it should be split. > > >>>>> > > >>>>> There may be other options as well, but the bottom line is that it is > > >>>>> possible, but probably not something that we want to do right now. > > >>>>> > > >>>>> A stronger protection method would be to introduce a config option or > > >>>>> boot parameter for large folio swapin, and then make CONFIG_ZSWAP > > >>>>> depend on it being disabled, or have zswap check it at boot and refuse > > >>>>> to be enabled if it is on. > > >>>> > > >>>> Right, sounds like the VM_BUG_ON() really is not that easily avoidable. > > >>>> > > >>>> I was wondering, if we could WARN_ON_ONCE and make the swap code detect > > >>>> this like a read-error from disk. 
> > >>>> > > >>>> I think do_swap_page() detects that by checking if the folio is not > > >>>> uptodate: > > >>>> > > >>>> if (unlikely(!folio_test_uptodate(folio))) { > > >>>> ret = VM_FAULT_SIGBUS; > > >>>> goto out_nomap; > > >>>> } > > >>>> > > >>>> So maybe WARN_ON_ONCE() + triggering that might be a bit nicer to the > > >>>> system (but the app would crash either way, there is no way around it). > > >>>> > > >>> > > >>> I'd rather fallback to small folios swapin instead crashing apps till we fix > > >>> the large folio swapin in zswap :-) > > >> > > >> I think David is referring to catching the buggy cases that do not > > >> properly fallback to small folios with zswap, not as an alternative to > > >> the fallback. This is at least what I had in mind with the patch. > > > > > > Cool. Thanks for the clarification. I'm fine with keeping the fallback, > > > whether it's the current VM_BUG_ON or David's recommended > > > SIGBUS. > > > > > > Currently, mainline doesn't support large folios swap-in, so I see > > > your patch as a helpful warning for those attempting large folio > > > swap-in to avoid making mistakes like loading large folios from > > > zswap. > > > > > > In fact, I spent a week trying to figure out why my app was crashing > > > before I added 'if (zswap_is_enabled()) goto fallback'. If your patch > > > had come earlier, it would have saved me that week of work :-) > > > > > > To me, a direct crash seems like a more straightforward way to > > > prompt people to fallback when attempting large folio swap-ins. > > > So, I am slightly in favor of your current patch. > > > > BUG_ON and friends are frowned-upon, in particular in scenarios where we > > can recover or that are so unexpected that they can be found during > > early testing. > > > > I have no strong opinion on this one, but likely a VM_WARN_ON would also > > be sufficient to find such issues early during testing. No need to crash > > the machine. 
>
> I thought VM_BUG_ON() was less frowned-upon than BUG_ON(), but after
> some digging I found your patches to checkpatch and Linus clearly
> stating that it isn't.
>
> How about something like the following (untested), it is the minimal
> recovery we can do but should work for a lot of cases, and does
> nothing beyond a warning if we can swapin the large folio from disk:
>
> diff --git a/mm/page_io.c b/mm/page_io.c
> index f1a9cfab6e748..8f441dd8e109f 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -517,7 +517,6 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
>  	delayacct_swapin_start();
>
>  	if (zswap_load(folio)) {
> -		folio_mark_uptodate(folio);
>  		folio_unlock(folio);
>  	} else if (data_race(sis->flags & SWP_FS_OPS)) {
>  		swap_read_folio_fs(folio, plug);
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 6007252429bb2..cc04db6bb217e 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1557,6 +1557,22 @@ bool zswap_load(struct folio *folio)
>
>  	VM_WARN_ON_ONCE(!folio_test_locked(folio));
>
> +	/*
> +	 * Large folios should not be swapped in while zswap is being used, as
> +	 * they are not properly handled.
> +	 *
> +	 * If any of the subpages are in zswap, reading from disk would result
> +	 * in data corruption, so return true without marking the folio uptodate
> +	 * so that an IO error is emitted (e.g. do_swap_page() will sigfault).
> +	 *
> +	 * Otherwise, return false and read the folio from disk.
> +	 */
> +	if (WARN_ON_ONCE(folio_test_large(folio))) {
> +		if (xa_find(tree, &offset, offset + folio_nr_pages(folio) - 1, 0))
> +			return true;
> +		return false;
> +	}
> +
>  	/*
>  	 * When reading into the swapcache, invalidate our entry. The
>  	 * swapcache can be the authoritative owner of the page and
> @@ -1593,7 +1609,7 @@ bool zswap_load(struct folio *folio)
>  		zswap_entry_free(entry);
>  		folio_mark_dirty(folio);
>  	}
> -
> +	folio_mark_uptodate(folio);
>  	return true;
>  }
>
> One problem is that even if zswap was never enabled, the warning will
> be emitted just if CONFIG_ZSWAP is on. Perhaps we need a variable or
> static key if zswap was "ever" enabled.
>
> Barry, I suspect your is_zswap_enabled() check is deficient for
> similar reasons, zswap could have been enabled before then became
> disabled.

I don't understand this. If zswap was enabled before but is disabled
when I am loading data, will I get corrupted data because zswap was once
enabled? If not, it seems like nothing important.

Thanks
Barry
On Sat, Jun 8, 2024 at 6:58 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Fri, Jun 7, 2024 at 11:52 AM David Hildenbrand <david@redhat.com> wrote: > > > > >> I have no strong opinion on this one, but likely a VM_WARN_ON would also > > >> be sufficient to find such issues early during testing. No need to crash > > >> the machine. > > > > > > I thought VM_BUG_ON() was less frowned-upon than BUG_ON(), but after > > > some digging I found your patches to checkpatch and Linus clearly > > > stating that it isn't. > > > > :) yes. > > > > VM_BUG_ON is not particularly helpful IMHO. If you want something to be > > found early during testing, VM_WARN_ON is good enough. > > > > Ever since Fedora stopped enabling CONFIG_DEBUG_VM, VM_* friends are > > primarily reported during early/development testing only. But maybe some > > distro out there still sets it. > > > > > > > > How about something like the following (untested), it is the minimal > > > recovery we can do but should work for a lot of cases, and does > > > nothing beyond a warning if we can swapin the large folio from disk: > > > > > > diff --git a/mm/page_io.c b/mm/page_io.c > > > index f1a9cfab6e748..8f441dd8e109f 100644 > > > --- a/mm/page_io.c > > > +++ b/mm/page_io.c > > > @@ -517,7 +517,6 @@ void swap_read_folio(struct folio *folio, struct > > > swap_iocb **plug) > > > delayacct_swapin_start(); > > > > > > if (zswap_load(folio)) { > > > - folio_mark_uptodate(folio); > > > folio_unlock(folio); > > > } else if (data_race(sis->flags & SWP_FS_OPS)) { > > > swap_read_folio_fs(folio, plug); > > > diff --git a/mm/zswap.c b/mm/zswap.c > > > index 6007252429bb2..cc04db6bb217e 100644 > > > --- a/mm/zswap.c > > > +++ b/mm/zswap.c > > > @@ -1557,6 +1557,22 @@ bool zswap_load(struct folio *folio) > > > > > > VM_WARN_ON_ONCE(!folio_test_locked(folio)); > > > > > > + /* > > > + * Large folios should not be swapped in while zswap is being used, as > > > + * they are not properly handled. 
> > > + * > > > + * If any of the subpages are in zswap, reading from disk would result > > > + * in data corruption, so return true without marking the folio uptodate > > > + * so that an IO error is emitted (e.g. do_swap_page() will sigfault). > > > + * > > > + * Otherwise, return false and read the folio from disk. > > > + */ > > > + if (WARN_ON_ONCE(folio_test_large(folio))) { > > > + if (xa_find(tree, &offset, offset + > > > folio_nr_pages(folio) - 1, 0)) > > > + return true; > > > + return false; > > > + } > > > + > > > /* > > > * When reading into the swapcache, invalidate our entry. The > > > * swapcache can be the authoritative owner of the page and > > > @@ -1593,7 +1609,7 @@ bool zswap_load(struct folio *folio) > > > zswap_entry_free(entry); > > > folio_mark_dirty(folio); > > > } > > > - > > > + folio_mark_uptodate(folio); > > > return true; > > > } > > > > > > One problem is that even if zswap was never enabled, the warning will > > > be emitted just if CONFIG_ZSWAP is on. Perhaps we need a variable or > > > static key if zswap was "ever" enabled. > > > > We should use WARN_ON_ONCE() only for things that cannot happen. So if > > there are cases where this could be triggered today, it would be > > problematic -- especially if it can be triggered from unprivileged user > > space. But if we're concerned of other code messing up our invariant in > > the future (e.g., enabling large folios without taking proper care about > > zswap etc), we're good to add it. > > Right now I can't see any paths allocating large folios for swapin, so > I think it cannot happen. Once someone tries adding it, the warning > will fire if CONFIG_ZSWAP is used, even if zswap is disabled. > > At this point we will have several options: > - Make large folios swapin depend on !CONFIG_ZSWAP for now. It appears quite problematic. 
We lack control over whether the kernel build will enable CONFIG_ZSWAP, particularly when aiming for a common defconfig across all platforms to streamline configurations. For instance, in the case of ARM, this was once a significant goal. Simply trigger a single WARN or BUG if an attempt is made to load large folios in zswap_load, while ensuring that zswap_is_enabled() remains unaffected. In the mainline code, large folio swap-in support is absent, so this warning is intended for debugging purposes and targets a very small audience—perhaps fewer than five individuals worldwide. Real users won’t encounter this warning, as it remains hidden from their view. > - Keep track if zswap was ever enabled and make the warning > conditional on it. We should also always fallback to order-0 if zswap > was ever enabled. > - Properly handle large folio swapin with zswap. > > Does this sound reasonable to you? Thanks Barry
On Sat, Jun 8, 2024 at 7:17 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> [..]
> > > One problem is that even if zswap was never enabled, the warning will
> > > be emitted just if CONFIG_ZSWAP is on. Perhaps we need a variable or
> > > static key if zswap was "ever" enabled.
> > >
> > > Barry, I suspect your is_zswap_enabled() check is deficient for
> > > similar reasons, zswap could have been enabled before then became
> > > disabled.
> >
> > I don't understand this. if zswap was enabled before but is disabled when
> > I am loading data, will I get corrupted data before zswap was once enabled?
> > If not, it seems nothing important.
>
> If zswap was enabled and then disabled, some pages may still be in
> zswap. We do not load the pages from zswap when it is disabled, we
> just stop storing new pages.
>
> So if you just rely in checking whether zswap is enabled at swapin
> time to decide whether to use large folios, you may end up with a
> situation where zswap is disabled, yet parts of the large folio you
> are trying to swapin (or all of it) is in zswap.
>
> This is why I think we'll need to track whether zswap was ever enabled
> instead (or if a page was ever stored).

Thanks! That doesn't seem good. Do we have a simple way to clean up
zswap when it is disabled? It seems not easy. Just like we do for
swapoff, or for disabling a cache, we ensure they are clean - that
would be a real "disable".

Thanks
Barry
[..] > > > > > > > > How about something like the following (untested), it is the minimal > > > > recovery we can do but should work for a lot of cases, and does > > > > nothing beyond a warning if we can swapin the large folio from disk: > > > > > > > > diff --git a/mm/page_io.c b/mm/page_io.c > > > > index f1a9cfab6e748..8f441dd8e109f 100644 > > > > --- a/mm/page_io.c > > > > +++ b/mm/page_io.c > > > > @@ -517,7 +517,6 @@ void swap_read_folio(struct folio *folio, struct > > > > swap_iocb **plug) > > > > delayacct_swapin_start(); > > > > > > > > if (zswap_load(folio)) { > > > > - folio_mark_uptodate(folio); > > > > folio_unlock(folio); > > > > } else if (data_race(sis->flags & SWP_FS_OPS)) { > > > > swap_read_folio_fs(folio, plug); > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > > > index 6007252429bb2..cc04db6bb217e 100644 > > > > --- a/mm/zswap.c > > > > +++ b/mm/zswap.c > > > > @@ -1557,6 +1557,22 @@ bool zswap_load(struct folio *folio) > > > > > > > > VM_WARN_ON_ONCE(!folio_test_locked(folio)); > > > > > > > > + /* > > > > + * Large folios should not be swapped in while zswap is being used, as > > > > + * they are not properly handled. > > > > + * > > > > + * If any of the subpages are in zswap, reading from disk would result > > > > + * in data corruption, so return true without marking the folio uptodate > > > > + * so that an IO error is emitted (e.g. do_swap_page() will sigfault). > > > > + * > > > > + * Otherwise, return false and read the folio from disk. > > > > + */ > > > > + if (WARN_ON_ONCE(folio_test_large(folio))) { > > > > + if (xa_find(tree, &offset, offset + > > > > folio_nr_pages(folio) - 1, 0)) > > > > + return true; > > > > + return false; > > > > + } > > > > + > > > > /* > > > > * When reading into the swapcache, invalidate our entry. 
The > > > > * swapcache can be the authoritative owner of the page and > > > > @@ -1593,7 +1609,7 @@ bool zswap_load(struct folio *folio) > > > > zswap_entry_free(entry); > > > > folio_mark_dirty(folio); > > > > } > > > > - > > > > + folio_mark_uptodate(folio); > > > > return true; > > > > } > > > > > > > > One problem is that even if zswap was never enabled, the warning will > > > > be emitted just if CONFIG_ZSWAP is on. Perhaps we need a variable or > > > > static key if zswap was "ever" enabled. > > > > > > We should use WARN_ON_ONCE() only for things that cannot happen. So if > > > there are cases where this could be triggered today, it would be > > > problematic -- especially if it can be triggered from unprivileged user > > > space. But if we're concerned of other code messing up our invariant in > > > the future (e.g., enabling large folios without taking proper care about > > > zswap etc), we're good to add it. > > > > Right now I can't see any paths allocating large folios for swapin, so > > I think it cannot happen. Once someone tries adding it, the warning > > will fire if CONFIG_ZSWAP is used, even if zswap is disabled. > > At this point we will have several options: > > Here is my take on this: > > > - Make large folios swapin depend on !CONFIG_ZSWAP for now. > > I think a WARON or BUG_ON is better. I would need to revert this > change when I am working on 3). It is a make up rule, not a real > dependency any way. I am intending to send a new version with WARN_ON_ONCE() and some attempt to recover. It is not a rule, it is just that we don't have the support for it today. > > > - Keep track if zswap was ever enabled and make the warning > > conditional on it. We should also always fallback to order-0 if zswap > > was ever enabled. > > IMHO, falling back to order-0 inside zswap is not desired because it > complicates the zswap code. We should not pass large folio to zswap if > zswap is not ready to handle large folio. 
The core swap already has
> the fall back to order-0. If we get to 3), then this fall back in
> zswap needs to be removed. It is a transitional thing then maybe not
> introduce it in the first place.

We cannot split the folio inside zswap. What I meant is that the swapin
code should fall back to order-0 if zswap is being used, to avoid
passing large folios to zswap.

>
> > - Properly handle large folio swapin with zswap.
> That obviously is ideal.
>
> Chris
[..]
> > One problem is that even if zswap was never enabled, the warning will
> > be emitted just if CONFIG_ZSWAP is on. Perhaps we need a variable or
> > static key if zswap was "ever" enabled.
> >
> > Barry, I suspect your is_zswap_enabled() check is deficient for
> > similar reasons, zswap could have been enabled before then became
> > disabled.
>
> I don't understand this. if zswap was enabled before but is disabled when
> I am loading data, will I get corrupted data before zswap was once enabled?
> If not, it seems nothing important.

If zswap was enabled and then disabled, some pages may still be in
zswap. We do not load the pages from zswap when it is disabled, we
just stop storing new pages.

So if you just rely on checking whether zswap is enabled at swapin
time to decide whether to use large folios, you may end up with a
situation where zswap is disabled, yet parts of the large folio you
are trying to swap in (or all of it) are in zswap.

This is why I think we'll need to track whether zswap was ever enabled
instead (or if a page was ever stored).

>
> Thanks
> Barry
On Fri, Jun 7, 2024 at 3:09 PM Barry Song <21cnbao@gmail.com> wrote: > > On Sat, Jun 8, 2024 at 6:58 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Fri, Jun 7, 2024 at 11:52 AM David Hildenbrand <david@redhat.com> wrote: > > > > > > >> I have no strong opinion on this one, but likely a VM_WARN_ON would also > > > >> be sufficient to find such issues early during testing. No need to crash > > > >> the machine. > > > > > > > > I thought VM_BUG_ON() was less frowned-upon than BUG_ON(), but after > > > > some digging I found your patches to checkpatch and Linus clearly > > > > stating that it isn't. > > > > > > :) yes. > > > > > > VM_BUG_ON is not particularly helpful IMHO. If you want something to be > > > found early during testing, VM_WARN_ON is good enough. > > > > > > Ever since Fedora stopped enabling CONFIG_DEBUG_VM, VM_* friends are > > > primarily reported during early/development testing only. But maybe some > > > distro out there still sets it. > > > > > > > > > > > How about something like the following (untested), it is the minimal > > > > recovery we can do but should work for a lot of cases, and does > > > > nothing beyond a warning if we can swapin the large folio from disk: > > > > > > > > diff --git a/mm/page_io.c b/mm/page_io.c > > > > index f1a9cfab6e748..8f441dd8e109f 100644 > > > > --- a/mm/page_io.c > > > > +++ b/mm/page_io.c > > > > @@ -517,7 +517,6 @@ void swap_read_folio(struct folio *folio, struct > > > > swap_iocb **plug) > > > > delayacct_swapin_start(); > > > > > > > > if (zswap_load(folio)) { > > > > - folio_mark_uptodate(folio); > > > > folio_unlock(folio); > > > > } else if (data_race(sis->flags & SWP_FS_OPS)) { > > > > swap_read_folio_fs(folio, plug); > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > > > index 6007252429bb2..cc04db6bb217e 100644 > > > > --- a/mm/zswap.c > > > > +++ b/mm/zswap.c > > > > @@ -1557,6 +1557,22 @@ bool zswap_load(struct folio *folio) > > > > > > > > VM_WARN_ON_ONCE(!folio_test_locked(folio)); > > > 
> > > > > + /* > > > > + * Large folios should not be swapped in while zswap is being used, as > > > > + * they are not properly handled. > > > > + * > > > > + * If any of the subpages are in zswap, reading from disk would result > > > > + * in data corruption, so return true without marking the folio uptodate > > > > + * so that an IO error is emitted (e.g. do_swap_page() will sigfault). > > > > + * > > > > + * Otherwise, return false and read the folio from disk. > > > > + */ > > > > + if (WARN_ON_ONCE(folio_test_large(folio))) { > > > > + if (xa_find(tree, &offset, offset + > > > > folio_nr_pages(folio) - 1, 0)) > > > > + return true; > > > > + return false; > > > > + } > > > > + > > > > /* > > > > * When reading into the swapcache, invalidate our entry. The > > > > * swapcache can be the authoritative owner of the page and > > > > @@ -1593,7 +1609,7 @@ bool zswap_load(struct folio *folio) > > > > zswap_entry_free(entry); > > > > folio_mark_dirty(folio); > > > > } > > > > - > > > > + folio_mark_uptodate(folio); > > > > return true; > > > > } > > > > > > > > One problem is that even if zswap was never enabled, the warning will > > > > be emitted just if CONFIG_ZSWAP is on. Perhaps we need a variable or > > > > static key if zswap was "ever" enabled. > > > > > > We should use WARN_ON_ONCE() only for things that cannot happen. So if > > > there are cases where this could be triggered today, it would be > > > problematic -- especially if it can be triggered from unprivileged user > > > space. But if we're concerned of other code messing up our invariant in > > > the future (e.g., enabling large folios without taking proper care about > > > zswap etc), we're good to add it. > > > > Right now I can't see any paths allocating large folios for swapin, so > > I think it cannot happen. Once someone tries adding it, the warning > > will fire if CONFIG_ZSWAP is used, even if zswap is disabled. 
> >
> > At this point we will have several options:
> > - Make large folios swapin depend on !CONFIG_ZSWAP for now.
>
> It appears quite problematic. We lack control over whether the kernel build
> will enable CONFIG_ZSWAP, particularly when aiming for a common
> defconfig across all platforms to streamline configurations. For instance,
> in the case of ARM, this was once a significant goal.
>
> Simply trigger a single WARN or BUG if an attempt is made to load
> large folios in zswap_load, while ensuring that zswap_is_enabled()
> remains unaffected. In the mainline code, large folio swap-in support
> is absent, so this warning is intended for debugging purposes and
> targets a very small audience—perhaps fewer than five individuals
> worldwide. Real users won’t encounter this warning, as it remains
> hidden from their view.

I can make the warning only fire if any part of the folio is in zswap,
to avoid getting warnings from zswap_load() if we never actually use
zswap; that's reasonable. I wanted to warn if we reach zswap_load()
with any large folio at all for higher coverage only. I will send
something out in the next week or so.

> > - Keep track if zswap was ever enabled and make the warning
> > conditional on it. We should also always fallback to order-0 if zswap
> > was ever enabled.
> > - Properly handle large folio swapin with zswap.
> >
> > Does this sound reasonable to you?
>
> Thanks
> Barry
On Fri, Jun 7, 2024 at 4:28 PM Barry Song <21cnbao@gmail.com> wrote: > > On Sat, Jun 8, 2024 at 7:17 AM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > [..] > > > > One problem is that even if zswap was never enabled, the warning will > > > > be emitted just if CONFIG_ZSWAP is on. Perhaps we need a variable or > > > > static key if zswap was "ever" enabled. > > > > > > > > Barry, I suspect your is_zswap_enabled() check is deficient for > > > > similar reasons, zswap could have been enabled before then became > > > > disabled. > > > > > > I don't understand this. if zswap was enabled before but is disabled when > > > I am loading data, will I get corrupted data before zswap was once enabled? > > > If not, it seems nothing important. > > > > If zswap was enabled and then disabled, some pages may still be in > > zswap. We do not load the pages from zswap when it is disabled, we > > just stop storing new pages. > > > > So if you just rely in checking whether zswap is enabled at swapin > > time to decide whether to use large folios, you may end up with a > > situation where zswap is disabled, yet parts of the large folio you > > are trying to swapin (or all of it) is in zswap. > > > > This is why I think we'll need to track whether zswap was ever enabled > > instead (or if a page was ever stored). > > Thanks! It doesn't seem good. Do we have a simple way to clean zswap > when it is disabled? seems not easy? Just like we do swapoff, or disable > cache, we ensure they are clean - this is a real "disable". I think it's just simpler, and disabling zswap at runtime is not that common. Keep in mind that if zswap being disabled takes time, then we'll want to handle races between zswap being disabled and incoming swapins, it will probably end up being a state machine or so. Without a valid use case, I think the complexity is not justified. It's probably simpler to just track if zswap was ever used, and disable large folio swapin. This should be a transitional state anyway.
diff --git a/mm/zswap.c b/mm/zswap.c
index b9b35ef86d9be..6007252429bb2 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1577,6 +1577,9 @@ bool zswap_load(struct folio *folio)
 	if (!entry)
 		return false;
 
+	/* Zswap loads do not handle large folio swapins correctly yet */
+	VM_BUG_ON(folio_test_large(folio));
+
 	if (entry->length)
 		zswap_decompress(entry, folio);
 	else
With ongoing work to support large folio swapin, it is important to make
sure we do not pass large folios to zswap_load() without implementing
proper support.

For example, if a swapin fault observes that contiguous PTEs are
pointing to contiguous swap entries and tries to swap them in as a large
folio, swap_read_folio() will pass in a large folio to zswap_load(), but
zswap_load() will only effectively load the first page in the folio. If
the first page is not in zswap, the folio will be read from disk, even
though other pages may be in zswap.

In both cases, this will lead to silent data corruption.

Proper large folio swapin support needs to go into zswap before zswap
can be enabled in a system that supports large folio swapin.

Looking at callers of swap_read_folio(), it seems like they are either
allocated from __read_swap_cache_async() or do_swap_page() in the
SWP_SYNCHRONOUS_IO path. Both of which allocate order-0 folios, so we
are fine for now.

Add a VM_BUG_ON() in zswap_load() to make sure that we detect changes in
the order of those allocations without proper handling of zswap.

Alternatively, swap_read_folio() (or its callers) can be updated to have
a fallback mechanism that splits large folios or reads subpages
separately. Similar logic may be needed anyway in case part of a large
folio is already in the swapcache and the rest of it is swapped out.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---

Sorry for the long CC list, I just found myself repeatedly looking at
new series that add swap support for mTHPs / large folios, making sure
they do not break with zswap or make incorrect assumptions. This debug
check should give us some peace of mind. Hopefully this patch will also
raise awareness among people who are working on this.

---
 mm/zswap.c | 3 +++
 1 file changed, 3 insertions(+)