| Message ID | 20250107094347.l37isnk3w2nmpx2i@AALNPWDAGOMEZ1.aal.scsc.local (mailing list archive) |
|---|---|
| State | New |
| Series | Swap Min Order |
On 07.01.25 10:43, Daniel Gomez wrote:
> Hi,

Hi,

> High-capacity SSDs require writes to be aligned with the drive's
> indirection unit (IU), which is typically >4 KiB, to avoid RMW. To
> support swap on these devices, we need to ensure that writes do not
> cross IU boundaries. So, I think this may require increasing the minimum
> allocation size for swap users.

How would we handle swapout/swapin when we have smaller pages (just
imagine someone does a mmap(4KiB))?

Could this be something that gets abstracted/handled by the swap
implementation? (i.e., multiple small folios get added to the swapcache
but get written out / read in as a single unit?)

I recall that we have been talking about a better swap abstraction for
years :)

Might be a good topic for LSF/MM (might or might not be a better place
than the MM alignment session).
On Tue, Jan 07, 2025 at 11:31:05AM +0100, David Hildenbrand wrote:
> On 07.01.25 10:43, Daniel Gomez wrote:
> > Hi,
>
> Hi,
>
> > High-capacity SSDs require writes to be aligned with the drive's
> > indirection unit (IU), which is typically >4 KiB, to avoid RMW. To
> > support swap on these devices, we need to ensure that writes do not
> > cross IU boundaries. So, I think this may require increasing the minimum
> > allocation size for swap users.
>
> How would we handle swapout/swapin when we have smaller pages (just imagine
> someone does a mmap(4KiB))?

Swapout would need to be aligned to the IU. An mmap of 4 KiB would
have to perform an IU-sized write, e.g. 16 KiB or 32 KiB, to avoid any
potential RMW penalty. So, I think aligning the mmap allocation to the
IU would guarantee a write of the required granularity and alignment.
But let's also look at your suggestion below with the swapcache.

Swapin can still be performed at LBA-format granularity (e.g. 4 KiB)
without the same write penalty implications, only affecting performance
if I/Os do not conform to these boundaries. So, reading at IU
boundaries is preferred for optimal performance, but it is not a
'requirement'.

> Could this be something that gets abstracted/handled by the swap
> implementation? (i.e., multiple small folios get added to the swapcache but
> get written out / read in as a single unit?)

Do you mean merging like in the block layer? I'm not entirely sure this
could guarantee the I/O boundaries as deterministically as min-order
large folio allocations do in the page cache. But I guess it's worth
exploring as an optimization.

> I recall that we have been talking about a better swap abstraction for years
> :)

Adding Chris Li to the cc list in case he has more input.

> Might be a good topic for LSF/MM (might or might not be a better place than
> the MM alignment session).

Both options work for me. LSF/MM is in 12 weeks, so having an earlier
session would be great.

Daniel

> --
> Cheers,
>
> David / dhildenb
On 07.01.25 13:29, Daniel Gomez wrote:
> On Tue, Jan 07, 2025 at 11:31:05AM +0100, David Hildenbrand wrote:
>> On 07.01.25 10:43, Daniel Gomez wrote:
>>> Hi,
>>
>> Hi,
>>
>>> High-capacity SSDs require writes to be aligned with the drive's
>>> indirection unit (IU), which is typically >4 KiB, to avoid RMW. To
>>> support swap on these devices, we need to ensure that writes do not
>>> cross IU boundaries. So, I think this may require increasing the minimum
>>> allocation size for swap users.
>>
>> How would we handle swapout/swapin when we have smaller pages (just imagine
>> someone does a mmap(4KiB))?
>
> Swapout would require to be aligned to the IU. An mmap of 4 KiB would
> have to perform an IU KiB write, e.g. 16 KiB or 32 KiB, to avoid any
> potential RMW penalty. So, I think aligning the mmap allocation to the
> IU would guarantee a write of the required granularity and alignment.

We must be prepared to handle any VMA layout with single-page VMAs,
single-page holes etc ... :/ IMHO we should try to handle this
transparently to the application.

> But let's also look at your suggestion below with swapcache.
>
> Swapin can still be performed at LBA format levels (e.g. 4 KiB) without
> the same write penalty implications, and only affecting performance
> if I/Os are not conformant to these boundaries. So, reading at IU
> boundaries is preferred to get optimal performance, not a 'requirement'.
>
>> Could this be something that gets abstracted/handled by the swap
>> implementation? (i.e., multiple small folios get added to the swapcache but
>> get written out / read in as a single unit?)
>
> Do you mean merging like in the block layer? I'm not entirely sure if
> this could guarantee deterministically the I/O boundaries the same way
> it does min order large folio allocations in the page cache. But I guess
> is worth exploring as optimization.

Maybe the swapcache could somehow abstract that? We currently have the
swap slot allocator, that assigns slots to pages.

Assuming we have a 16 KiB BS but a 4 KiB page, we might have various
options to explore.

For example, we could size swap slots 16 KiB, and assign even 4 KiB
pages a single slot. This would waste swap space with small folios; the
waste would go away with large folios.

If we stick to 4 KiB swap slots, maybe pageout() could be taught to
effectively write back "everything" residing in the relevant swap slots
that span a BS?

I recall there was a discussion about atomic writes involving multiple
pages, and how hard that is. Maybe with swapping it is "easier"?
Absolutely no expert on that, unfortunately. Hoping Chris has some
ideas.

>> I recall that we have been talking about a better swap abstraction for years
>> :)
>
> Adding Chris Li to the cc list in case he has more input.
>
>> Might be a good topic for LSF/MM (might or might not be a better place than
>> the MM alignment session).
>
> Both options work for me. LSF/MM is in 12 weeks so, having a previous
> session would be great.

Both work for me.
On Tue, Jan 07, 2025 at 05:41:23PM +0100, David Hildenbrand wrote:
> On 07.01.25 13:29, Daniel Gomez wrote:
> > On Tue, Jan 07, 2025 at 11:31:05AM +0100, David Hildenbrand wrote:
> > > On 07.01.25 10:43, Daniel Gomez wrote:
> > > > Hi,
> > >
> > > Hi,
> > >
> > > > High-capacity SSDs require writes to be aligned with the drive's
> > > > indirection unit (IU), which is typically >4 KiB, to avoid RMW. To
> > > > support swap on these devices, we need to ensure that writes do not
> > > > cross IU boundaries. So, I think this may require increasing the minimum
> > > > allocation size for swap users.
> > >
> > > How would we handle swapout/swapin when we have smaller pages (just imagine
> > > someone does a mmap(4KiB))?
> >
> > Swapout would require to be aligned to the IU. An mmap of 4 KiB would
> > have to perform an IU KiB write, e.g. 16 KiB or 32 KiB, to avoid any
> > potential RMW penalty. So, I think aligning the mmap allocation to the
> > IU would guarantee a write of the required granularity and alignment.
>
> We must be prepared to handle any VMA layout with single-page VMAs,
> single-page holes etc ... :/ IMHO we should try to handle this transparently
> to the application.
>
> > But let's also look at your suggestion below with swapcache.
> >
> > Swapin can still be performed at LBA format levels (e.g. 4 KiB) without
> > the same write penalty implications, and only affecting performance
> > if I/Os are not conformant to these boundaries. So, reading at IU
> > boundaries is preferred to get optimal performance, not a 'requirement'.
> >
> > > Could this be something that gets abstracted/handled by the swap
> > > implementation? (i.e., multiple small folios get added to the swapcache but
> > > get written out / read in as a single unit?)
> >
> > Do you mean merging like in the block layer? I'm not entirely sure if
> > this could guarantee deterministically the I/O boundaries the same way
> > it does min order large folio allocations in the page cache. But I guess
> > is worth exploring as optimization.
>
> Maybe the swapcache could somehow abstract that? We currently have the swap
> slot allocator, that assigns slots to pages.
>
> Assuming we have a 16 KiB BS but a 4 KiB page, we might have various options
> to explore.
>
> For example, we could size swap slots 16 KiB, and assign even 4 KiB pages a
> single slot. This would waste swap space with small folios, that would go
> away with large folios.

So batching order-0 folios in bigger slots that match the FS BS (e.g.
16 KiB) to perform disk writes, right? Can we also assign different
orders to the same slot? And can we batch folios while keeping
alignment to the BS (IU)?

> If we stick to 4 KiB swap slots, maybe pageout() could be taught to
> effectively writeback "everything" residing in the relevant swap slots that
> span a BS?
>
> I recall there was a discussion about atomic writes involving multiple
> pages, and how it is hard. Maybe with swaping it is "easier"? Absolutely no
> expert on that, unfortunately. Hoping Chris has some ideas.

Not sure about the discussion, but I guess the main concerns for atomic
writes and swapping are the alignment and the questions I raised above.

> > > I recall that we have been talking about a better swap abstraction for years
> > > :)
> >
> > Adding Chris Li to the cc list in case he has more input.
> >
> > > Might be a good topic for LSF/MM (might or might not be a better place than
> > > the MM alignment session).
> >
> > Both options work for me. LSF/MM is in 12 weeks so, having a previous
> > session would be great.
>
> Both work for me.

Can we start by scheduling this topic for the next available MM
session? It would be great to get initial feedback/thoughts/concerns,
etc., while we keep this thread going.

> --
> Cheers,
>
> David / dhildenb
diff --git a/mm/swapfile.c b/mm/swapfile.c
index b0a9071cfe1d..80a9dbe9645a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3128,6 +3128,7 @@ static int claim_swapfile(struct swap_info_struct *si, struct inode *inode)
 		si->flags |= SWP_BLKDEV;
 	} else if (S_ISREG(inode->i_mode)) {
 		si->bdev = inode->i_sb->s_bdev;
+		si->flags |= SWP_BLKDEV;
 	}
 
 	return 0;