mbox series

[0/8] mm/swap: optimize swap cache search space

Message ID 20240417160842.76665-1-ryncsn@gmail.com (mailing list archive)
Headers show
Series mm/swap: optimize swap cache search space | expand

Message

Kairui Song April 17, 2024, 4:08 p.m. UTC
From: Kairui Song <kasong@tencent.com>

Currently we use one swap_address_space for every 64M chunk to reduce lock
contention, this is like having a set of smaller swap files inside one
big swap file. But when doing swap cache look up or insert, we are
still using the offset of the whole large swap file. This is OK for
correctness, as the offset (key) is unique.

But Xarray is specially optimized for small indexes, it creates the
redix tree levels lazily to be just enough to fit the largest key
stored in one Xarray. So we are wasting tree nodes unnecessarily.

For 64M chunk it should only take at most 3 level to contain everything.
But we are using the offset from the whole swap file, so the offset (key)
value will be way beyond 64M, and so will the tree level.

Optimize this by reduce the swap cache search space into 64M scope.

Test with `time memhog 128G` inside a 8G memcg using 128G swap (ramdisk
with SWP_SYNCHRONOUS_IO dropped, tested 3 times, results are stable. The
test result is similar but the improvement is smaller if SWP_SYNCHRONOUS_IO
is enabled, as swap out path can never skip swap cache):

Before:
6.07user 250.74system 4:17.26elapsed 99%CPU (0avgtext+0avgdata 8373376maxresident)k
0inputs+0outputs (55major+33555018minor)pagefaults 0swaps

After (+1.8% faster):
6.08user 246.09system 4:12.58elapsed 99%CPU (0avgtext+0avgdata 8373248maxresident)k
0inputs+0outputs (54major+33555027minor)pagefaults 0swaps

Similar result with MySQL and sysbench using swap:
Before:
94055.61 qps

After (+0.8% faster):
94834.91 qps

There is alse a very slight drop of radix tree node slab usage:
Before: 303952K
After:  302224K

For this series:

There are multiple places that expect mixed type of pages (page cache or
swap cache), eg. migration, huge memory split; There are four helpers
for that:

- page_index
- page_file_offset
- folio_index
- folio_file_pos

So this series first cleaned up usage of page_index and
page_file_offset, then convert folio_index and folio_file_pos to be
compatible with separate offsets. And introduce a new helper
swap_cache_index for swap internal usage, replace swp_offset with
swap_cache_index when used to retrieve folio from swap cache.

And idealy, we may want to reduce SWAP_ADDRESS_SPACE_SHIFT from 14 to
12: Default Xarray chunk offset is 6, so we have 3 level trees instead
of 2 level trees just for 2 extra bits. But swap cache is based on
address_space struct, with 4 times more metadata sparsely distributed
in memory it waste more cacheline, the performance gain from this
series is almost canceled. So firstly, just have a cleaner seperation
of offsets.

Patch 1/8 - 6/8: Clean up usage of page_index and page_file_offset
Patch 7/8: Convert folio_index and folio_file_pos to be compatible with
  separate offset.
Patch 8/8: Introduce swap_cache_index and use it when doing lookup in
  swap cache.

This series is part of effort to reduce swap cache overhead, and ultimately
remove SWP_SYNCHRONOUS_IO and unify swap cache usage as proposed before:
https://lore.kernel.org/lkml/20240326185032.72159-1-ryncsn@gmail.com/

Kairui Song (8):
  NFS: remove nfs_page_lengthg and usage of page_index
  nilfs2: drop usage of page_index
  f2fs: drop usage of page_index
  ceph: drop usage of page_index
  cifs: drop usage of page_file_offset
  mm/swap: get the swap file offset directly
  mm: drop page_index/page_file_offset and convert swap helpers to use
    folio
  mm/swap: reduce swap cache search space

 fs/ceph/dir.c           |  2 +-
 fs/ceph/inode.c         |  2 +-
 fs/f2fs/data.c          |  5 ++---
 fs/nfs/internal.h       | 19 -------------------
 fs/nilfs2/bmap.c        |  3 +--
 fs/smb/client/file.c    |  2 +-
 include/linux/mm.h      | 13 -------------
 include/linux/pagemap.h | 19 +++++++++----------
 mm/huge_memory.c        |  2 +-
 mm/memcontrol.c         |  2 +-
 mm/mincore.c            |  2 +-
 mm/page_io.c            |  6 +++---
 mm/shmem.c              |  2 +-
 mm/swap.h               | 12 ++++++++++++
 mm/swap_state.c         | 12 ++++++------
 mm/swapfile.c           | 17 +++++++++++------
 16 files changed, 51 insertions(+), 69 deletions(-)

Comments

Huang, Ying April 22, 2024, 7:54 a.m. UTC | #1
Hi, Kairui,

Kairui Song <ryncsn@gmail.com> writes:

> From: Kairui Song <kasong@tencent.com>
>
> Currently we use one swap_address_space for every 64M chunk to reduce lock
> contention, this is like having a set of smaller swap files inside one
> big swap file. But when doing swap cache look up or insert, we are
> still using the offset of the whole large swap file. This is OK for
> correctness, as the offset (key) is unique.
>
> But Xarray is specially optimized for small indexes, it creates the
> redix tree levels lazily to be just enough to fit the largest key
> stored in one Xarray. So we are wasting tree nodes unnecessarily.
>
> For 64M chunk it should only take at most 3 level to contain everything.
> But we are using the offset from the whole swap file, so the offset (key)
> value will be way beyond 64M, and so will the tree level.
>
> Optimize this by reduce the swap cache search space into 64M scope.

In general, I think that it makes sense to reduce the depth of the
xarray.

One concern is that IIUC we make swap cache behaves like file cache if
possible.  And your change makes swap cache and file cache diverge more.
Is it possible for us to keep them similar?

For example,

Is it possible to return the offset inside 64M range in
__page_file_index() (maybe rename it)?

Is it possible to add "start_offset" support in xarray, so "index"
will subtract "start_offset" before looking up / inserting?

Is it possible to use multiple range locks to protect one xarray to
improve the lock scalability?  This is why we have multiple "struct
address_space" for one swap device.  And, we may have same lock
contention issue for large files too.

I haven't look at the code in details.  So, my idea may not make sense
at all.  If so, sorry about that.

Hi, Matthew,

Can you teach me on this too?

--
Best Regards,
Huang, Ying
Kairui Song April 22, 2024, 3:20 p.m. UTC | #2
On Mon, Apr 22, 2024 at 3:56 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Hi, Kairui,
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > From: Kairui Song <kasong@tencent.com>
> >
> > Currently we use one swap_address_space for every 64M chunk to reduce lock
> > contention, this is like having a set of smaller swap files inside one
> > big swap file. But when doing swap cache look up or insert, we are
> > still using the offset of the whole large swap file. This is OK for
> > correctness, as the offset (key) is unique.
> >
> > But Xarray is specially optimized for small indexes, it creates the
> > redix tree levels lazily to be just enough to fit the largest key
> > stored in one Xarray. So we are wasting tree nodes unnecessarily.
> >
> > For 64M chunk it should only take at most 3 level to contain everything.
> > But we are using the offset from the whole swap file, so the offset (key)
> > value will be way beyond 64M, and so will the tree level.
> >
> > Optimize this by reduce the swap cache search space into 64M scope.
>

Hi,

Thanks for the comments!

> In general, I think that it makes sense to reduce the depth of the
> xarray.
>
> One concern is that IIUC we make swap cache behaves like file cache if
> possible.  And your change makes swap cache and file cache diverge more.
> Is it possible for us to keep them similar?

So far in this series, I think there is no problem for that, the two
main helpers for retrieving file & cache offset: folio_index and
folio_file_pos will work fine and be compatible with current users.

And if we convert to share filemap_* functions for swap cache / page
cache, they are mostly already accepting index as an argument so no
trouble at all.

>
> For example,
>
> Is it possible to return the offset inside 64M range in
> __page_file_index() (maybe rename it)?

Not sure what you mean by this, __page_file_index will be gone as we
convert to folio.
And this series did delete / rename it (it might not be easy to see
this, the usage of these helpers is not very well organized before
this series so some clean up is involved).
It was previously only used through page_index (deleted) /
folio_index, and, now folio_index will be returning the offset inside
the 64M range.

I guess I just did what you wanted? :)

My cover letter and commit message might be not clear enough, I can update it.

>
> Is it possible to add "start_offset" support in xarray, so "index"
> will subtract "start_offset" before looking up / inserting?

xarray struct seems already very full, and this usage doesn't look
generic to me, might be better to fix this kind of issue case by case.

>
> Is it possible to use multiple range locks to protect one xarray to
> improve the lock scalability?  This is why we have multiple "struct
> address_space" for one swap device.  And, we may have same lock
> contention issue for large files too.

Good question, this series can improve the tree depth issue for swap
cache, but contention in address space is still a thing.

A more generic solution might involve changes of xarray API or use
some other data struct?

(BTW I think reducing the search space and resolving lock contention
is not necessarily related, reducing the search space by having a
large table of small trees should still perform better for swap
cache).


>
> I haven't look at the code in details.  So, my idea may not make sense
> at all.  If so, sorry about that.
>
> Hi, Matthew,
>
> Can you teach me on this too?
>
> --
> Best Regards,
> Huang, Ying
Huang, Ying April 23, 2024, 1:29 a.m. UTC | #3
Kairui Song <ryncsn@gmail.com> writes:

> On Mon, Apr 22, 2024 at 3:56 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Hi, Kairui,
>>
>> Kairui Song <ryncsn@gmail.com> writes:
>>
>> > From: Kairui Song <kasong@tencent.com>
>> >
>> > Currently we use one swap_address_space for every 64M chunk to reduce lock
>> > contention, this is like having a set of smaller swap files inside one
>> > big swap file. But when doing swap cache look up or insert, we are
>> > still using the offset of the whole large swap file. This is OK for
>> > correctness, as the offset (key) is unique.
>> >
>> > But Xarray is specially optimized for small indexes, it creates the
>> > redix tree levels lazily to be just enough to fit the largest key
>> > stored in one Xarray. So we are wasting tree nodes unnecessarily.
>> >
>> > For 64M chunk it should only take at most 3 level to contain everything.
>> > But we are using the offset from the whole swap file, so the offset (key)
>> > value will be way beyond 64M, and so will the tree level.
>> >
>> > Optimize this by reduce the swap cache search space into 64M scope.
>>
>
> Hi,
>
> Thanks for the comments!
>
>> In general, I think that it makes sense to reduce the depth of the
>> xarray.
>>
>> One concern is that IIUC we make swap cache behaves like file cache if
>> possible.  And your change makes swap cache and file cache diverge more.
>> Is it possible for us to keep them similar?
>
> So far in this series, I think there is no problem for that, the two
> main helpers for retrieving file & cache offset: folio_index and
> folio_file_pos will work fine and be compatible with current users.
>
> And if we convert to share filemap_* functions for swap cache / page
> cache, they are mostly already accepting index as an argument so no
> trouble at all.
>
>>
>> For example,
>>
>> Is it possible to return the offset inside 64M range in
>> __page_file_index() (maybe rename it)?
>
> Not sure what you mean by this, __page_file_index will be gone as we
> convert to folio.
> And this series did delete / rename it (it might not be easy to see
> this, the usage of these helpers is not very well organized before
> this series so some clean up is involved).
> It was previously only used through page_index (deleted) /
> folio_index, and, now folio_index will be returning the offset inside
> the 64M range.
>
> I guess I just did what you wanted? :)

Good!

> My cover letter and commit message might be not clear enough, I can update it.
>
>>
>> Is it possible to add "start_offset" support in xarray, so "index"
>> will subtract "start_offset" before looking up / inserting?
>
> xarray struct seems already very full, and this usage doesn't look
> generic to me, might be better to fix this kind of issue case by case.

Just some open question.

>>
>> Is it possible to use multiple range locks to protect one xarray to
>> improve the lock scalability?  This is why we have multiple "struct
>> address_space" for one swap device.  And, we may have same lock
>> contention issue for large files too.
>
> Good question, this series can improve the tree depth issue for swap
> cache, but contention in address space is still a thing.

The lock contention for swap cache has been reduced via using multiple
xarray in commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB
trunks").  But it fixes that for swap cache only, not for file cache in
general.  We have observed similar lock contention issue for file cache
too.  And the method isn't perfect too, like the issue you found here.
In general, it's about what is "file" for swap device.

> A more generic solution might involve changes of xarray API or use
> some other data struct?
>
> (BTW I think reducing the search space and resolving lock contention
> is not necessarily related, reducing the search space by having a
> large table of small trees should still perform better for swap
> cache).

--
Best Regards,
Huang, Ying
Matthew Wilcox April 23, 2024, 3:20 a.m. UTC | #4
On Mon, Apr 22, 2024 at 03:54:58PM +0800, Huang, Ying wrote:
> Is it possible to add "start_offset" support in xarray, so "index"
> will subtract "start_offset" before looking up / inserting?

We kind of have that with XA_FLAGS_ZERO_BUSY which is used for
XA_FLAGS_ALLOC1.  But that's just one bit for the entry at 0.  We could
generalise it, but then we'd have to store that somewhere and there's
no obvious good place to store it that wouldn't enlarge struct xarray,
which I'd be reluctant to do.

> Is it possible to use multiple range locks to protect one xarray to
> improve the lock scalability?  This is why we have multiple "struct
> address_space" for one swap device.  And, we may have same lock
> contention issue for large files too.

It's something I've considered.  The issue is search marks.  If we delete
an entry, we may have to walk all the way up the xarray clearing bits as
we go and I'd rather not grab a lock at each level.  There's a convenient
4 byte hole between nr_values and parent where we could put it.

Oh, another issue is that we use i_pages.xa_lock to synchronise
address_space.nrpages, so I'm not sure that a per-node lock will help.

But I'm conscious that there are workloads which show contention on
xa_lock as their limiting factor, so I'm open to ideas to improve all
these things.
Huang, Ying April 24, 2024, 2:24 a.m. UTC | #5
Hi, Matthew,

Matthew Wilcox <willy@infradead.org> writes:

> On Mon, Apr 22, 2024 at 03:54:58PM +0800, Huang, Ying wrote:
>> Is it possible to add "start_offset" support in xarray, so "index"
>> will subtract "start_offset" before looking up / inserting?
>
> We kind of have that with XA_FLAGS_ZERO_BUSY which is used for
> XA_FLAGS_ALLOC1.  But that's just one bit for the entry at 0.  We could
> generalise it, but then we'd have to store that somewhere and there's
> no obvious good place to store it that wouldn't enlarge struct xarray,
> which I'd be reluctant to do.
>
>> Is it possible to use multiple range locks to protect one xarray to
>> improve the lock scalability?  This is why we have multiple "struct
>> address_space" for one swap device.  And, we may have same lock
>> contention issue for large files too.
>
> It's something I've considered.  The issue is search marks.  If we delete
> an entry, we may have to walk all the way up the xarray clearing bits as
> we go and I'd rather not grab a lock at each level.  There's a convenient
> 4 byte hole between nr_values and parent where we could put it.
>
> Oh, another issue is that we use i_pages.xa_lock to synchronise
> address_space.nrpages, so I'm not sure that a per-node lock will help.

Thanks for looking at this.

> But I'm conscious that there are workloads which show contention on
> xa_lock as their limiting factor, so I'm open to ideas to improve all
> these things.

I have no idea so far because my very limited knowledge about xarray.

--
Best Regards,
Huang, Ying
Chris Li April 26, 2024, 11:16 p.m. UTC | #6
Hi Ying,

On Tue, Apr 23, 2024 at 7:26 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Hi, Matthew,
>
> Matthew Wilcox <willy@infradead.org> writes:
>
> > On Mon, Apr 22, 2024 at 03:54:58PM +0800, Huang, Ying wrote:
> >> Is it possible to add "start_offset" support in xarray, so "index"
> >> will subtract "start_offset" before looking up / inserting?
> >
> > We kind of have that with XA_FLAGS_ZERO_BUSY which is used for
> > XA_FLAGS_ALLOC1.  But that's just one bit for the entry at 0.  We could
> > generalise it, but then we'd have to store that somewhere and there's
> > no obvious good place to store it that wouldn't enlarge struct xarray,
> > which I'd be reluctant to do.
> >
> >> Is it possible to use multiple range locks to protect one xarray to
> >> improve the lock scalability?  This is why we have multiple "struct
> >> address_space" for one swap device.  And, we may have same lock
> >> contention issue for large files too.
> >
> > It's something I've considered.  The issue is search marks.  If we delete
> > an entry, we may have to walk all the way up the xarray clearing bits as
> > we go and I'd rather not grab a lock at each level.  There's a convenient
> > 4 byte hole between nr_values and parent where we could put it.
> >
> > Oh, another issue is that we use i_pages.xa_lock to synchronise
> > address_space.nrpages, so I'm not sure that a per-node lock will help.
>
> Thanks for looking at this.
>
> > But I'm conscious that there are workloads which show contention on
> > xa_lock as their limiting factor, so I'm open to ideas to improve all
> > these things.
>
> I have no idea so far because my very limited knowledge about xarray.

For the swap file usage, I have been considering an idea to remove the
index part of the xarray from swap cache. Swap cache is different from
file cache in a few aspects.
For one if we want to have a folio equivalent of "large swap entry".
Then the natural alignment of those swap offset on does not make
sense. Ideally we should be able to write the folio to un-aligned swap
file locations.

The other aspect for swap files is that, we already have different
data structures organized around swap offset, swap_map and
swap_cgroup. If we group the swap related data structure together. We
can add a pointer to a union of folio or a shadow swap entry. We can
use atomic updates on the swap struct member or breakdown the access
lock by ranges just like swap cluster does.

I want to discuss those ideas in the upcoming LSF/MM meet up as well.

Chris
Huang, Ying April 28, 2024, 1:14 a.m. UTC | #7
Chris Li <chrisl@kernel.org> writes:

> Hi Ying,
>
> On Tue, Apr 23, 2024 at 7:26 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Hi, Matthew,
>>
>> Matthew Wilcox <willy@infradead.org> writes:
>>
>> > On Mon, Apr 22, 2024 at 03:54:58PM +0800, Huang, Ying wrote:
>> >> Is it possible to add "start_offset" support in xarray, so "index"
>> >> will subtract "start_offset" before looking up / inserting?
>> >
>> > We kind of have that with XA_FLAGS_ZERO_BUSY which is used for
>> > XA_FLAGS_ALLOC1.  But that's just one bit for the entry at 0.  We could
>> > generalise it, but then we'd have to store that somewhere and there's
>> > no obvious good place to store it that wouldn't enlarge struct xarray,
>> > which I'd be reluctant to do.
>> >
>> >> Is it possible to use multiple range locks to protect one xarray to
>> >> improve the lock scalability?  This is why we have multiple "struct
>> >> address_space" for one swap device.  And, we may have same lock
>> >> contention issue for large files too.
>> >
>> > It's something I've considered.  The issue is search marks.  If we delete
>> > an entry, we may have to walk all the way up the xarray clearing bits as
>> > we go and I'd rather not grab a lock at each level.  There's a convenient
>> > 4 byte hole between nr_values and parent where we could put it.
>> >
>> > Oh, another issue is that we use i_pages.xa_lock to synchronise
>> > address_space.nrpages, so I'm not sure that a per-node lock will help.
>>
>> Thanks for looking at this.
>>
>> > But I'm conscious that there are workloads which show contention on
>> > xa_lock as their limiting factor, so I'm open to ideas to improve all
>> > these things.
>>
>> I have no idea so far because my very limited knowledge about xarray.
>
> For the swap file usage, I have been considering an idea to remove the
> index part of the xarray from swap cache. Swap cache is different from
> file cache in a few aspects.
> For one if we want to have a folio equivalent of "large swap entry".
> Then the natural alignment of those swap offset on does not make
> sense. Ideally we should be able to write the folio to un-aligned swap
> file locations.
>
> The other aspect for swap files is that, we already have different
> data structures organized around swap offset, swap_map and
> swap_cgroup. If we group the swap related data structure together. We
> can add a pointer to a union of folio or a shadow swap entry.

The shadow swap entry may be freed.  So we need to prepare for that.
And, in current design, only swap_map[] is allocated if the swap space
isn't used.  That needs to be considered too.

> We can use atomic updates on the swap struct member or breakdown the
> access lock by ranges just like swap cluster does.

The swap code uses xarray in a simple way.  That gives us opportunity to
optimize.  For example, it makes it easy to use multiple xarray
instances for one swap device.

> I want to discuss those ideas in the upcoming LSF/MM meet up as well.

Good!

--
Best Regards,
Huang, Ying
Chris Li April 28, 2024, 2:43 a.m. UTC | #8
On Sat, Apr 27, 2024 at 6:16 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>
> > Hi Ying,
> >
> > For the swap file usage, I have been considering an idea to remove the
> > index part of the xarray from swap cache. Swap cache is different from
> > file cache in a few aspects.
> > For one if we want to have a folio equivalent of "large swap entry".
> > Then the natural alignment of those swap offset on does not make
> > sense. Ideally we should be able to write the folio to un-aligned swap
> > file locations.
> >
> > The other aspect for swap files is that, we already have different
> > data structures organized around swap offset, swap_map and
> > swap_cgroup. If we group the swap related data structure together. We
> > can add a pointer to a union of folio or a shadow swap entry.
>
> The shadow swap entry may be freed.  So we need to prepare for that.

Free the shadow swap entry will just set the pointer to NULL.
Are you concerned that the memory allocated for the pointer is not
free to the system after the shadow swap entry is free?

It will be subject to fragmentation on the free swap entry.
In that regard, xarray is also subject to fragmentation. It will not
free the internal node if the node has one xa_index not freed. Even if
the xarray node is freed to slab, at slab level there is fragmentation
as well, the backing page might not free to the system.

> And, in current design, only swap_map[] is allocated if the swap space
> isn't used.  That needs to be considered too.

I am aware of that. I want to make the swap_map[] not static allocated
any more either.
The swap_map static allocation forces the rest of the swap data
structure to have other means to sparsely allocate their data
structure, repeating the fragmentation elsewhere, in different
ways.That is also the one major source of the pain point hacking on
the swap code. The data structure is spread into too many different
places.

> > We can use atomic updates on the swap struct member or breakdown the
> > access lock by ranges just like swap cluster does.
>
> The swap code uses xarray in a simple way.  That gives us opportunity to
> optimize.  For example, it makes it easy to use multiple xarray

The fixed swap offset range makes it like an array. There are many
ways to shard the array like swap entry, e.g. swap cluster is one way
to shard it. Multiple xarray is another way. We can also do multiple
xarray like sharding, or even more fancy ones.

Chris
Huang, Ying April 28, 2024, 3:21 a.m. UTC | #9
Chris Li <chrisl@kernel.org> writes:

> On Sat, Apr 27, 2024 at 6:16 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Chris Li <chrisl@kernel.org> writes:
>>
>> > Hi Ying,
>> >
>> > For the swap file usage, I have been considering an idea to remove the
>> > index part of the xarray from swap cache. Swap cache is different from
>> > file cache in a few aspects.
>> > For one if we want to have a folio equivalent of "large swap entry".
>> > Then the natural alignment of those swap offset on does not make
>> > sense. Ideally we should be able to write the folio to un-aligned swap
>> > file locations.
>> >
>> > The other aspect for swap files is that, we already have different
>> > data structures organized around swap offset, swap_map and
>> > swap_cgroup. If we group the swap related data structure together. We
>> > can add a pointer to a union of folio or a shadow swap entry.
>>
>> The shadow swap entry may be freed.  So we need to prepare for that.
>
> Free the shadow swap entry will just set the pointer to NULL.
> Are you concerned that the memory allocated for the pointer is not
> free to the system after the shadow swap entry is free?
>
> It will be subject to fragmentation on the free swap entry.
> In that regard, xarray is also subject to fragmentation. It will not
> free the internal node if the node has one xa_index not freed. Even if
> the xarray node is freed to slab, at slab level there is fragmentation
> as well, the backing page might not free to the system.

Sorry my words were confusing.  What I wanted to say is that the xarray
node may be freed.

>> And, in current design, only swap_map[] is allocated if the swap space
>> isn't used.  That needs to be considered too.
>
> I am aware of that. I want to make the swap_map[] not static allocated
> any more either.

Yes.  That's possible.

> The swap_map static allocation forces the rest of the swap data
> structure to have other means to sparsely allocate their data
> structure, repeating the fragmentation elsewhere, in different
> ways.That is also the one major source of the pain point hacking on
> the swap code. The data structure is spread into too many different
> places.

Look forward to more details to compare :-)

>> > We can use atomic updates on the swap struct member or breakdown the
>> > access lock by ranges just like swap cluster does.
>>
>> The swap code uses xarray in a simple way.  That gives us opportunity to
>> optimize.  For example, it makes it easy to use multiple xarray
>
> The fixed swap offset range makes it like an array. There are many
> ways to shard the array like swap entry, e.g. swap cluster is one way
> to shard it. Multiple xarray is another way. We can also do multiple
> xarray like sharding, or even more fancy ones.

--
Best Regards,
Huang, Ying
Chris Li April 28, 2024, 5:26 p.m. UTC | #10
On Sat, Apr 27, 2024 at 8:23 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>
> > On Sat, Apr 27, 2024 at 6:16 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Chris Li <chrisl@kernel.org> writes:
> > Free the shadow swap entry will just set the pointer to NULL.
> > Are you concerned that the memory allocated for the pointer is not
> > free to the system after the shadow swap entry is free?
> >
> > It will be subject to fragmentation on the free swap entry.
> > In that regard, xarray is also subject to fragmentation. It will not
> > free the internal node if the node has one xa_index not freed. Even if
> > the xarray node is freed to slab, at slab level there is fragmentation
> > as well, the backing page might not free to the system.
>
> Sorry my words were confusing.  What I wanted to say is that the xarray
> node may be freed.

Somehow I get that is what you mean :-) My previous reply still
applies here. The xarray node freeing will be subject to the
fragmentation at slab level. The actual backing page might not release
to the kernel after the node freeing.

>
> >> And, in current design, only swap_map[] is allocated if the swap space
> >> isn't used.  That needs to be considered too.
> >
> > I am aware of that. I want to make the swap_map[] not static allocated
> > any more either.
>
> Yes.  That's possible.

Of course there will be a price to pay for that. The current swap_map
is only 1 byte per entry. That swap map count size per swap entry is
going to be hard to beat in the alternatives. Hopefully find the trade
off in other places.

>
> > The swap_map static allocation forces the rest of the swap data
> > structure to have other means to sparsely allocate their data
> > structure, repeating the fragmentation elsewhere, in different
> > ways.That is also the one major source of the pain point hacking on
> > the swap code. The data structure is spread into too many different
> > places.
>
> Look forward to more details to compare :-)

Sure. When I make more progress I will post it.

Chris
Kairui Song April 28, 2024, 5:37 p.m. UTC | #11
On Sat, Apr 27, 2024 at 7:16 AM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Ying,
>
> On Tue, Apr 23, 2024 at 7:26 PM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Hi, Matthew,
> >
> > Matthew Wilcox <willy@infradead.org> writes:
> >
> > > On Mon, Apr 22, 2024 at 03:54:58PM +0800, Huang, Ying wrote:
> > >> Is it possible to add "start_offset" support in xarray, so "index"
> > >> will subtract "start_offset" before looking up / inserting?
> > >
> > > We kind of have that with XA_FLAGS_ZERO_BUSY which is used for
> > > XA_FLAGS_ALLOC1.  But that's just one bit for the entry at 0.  We could
> > > generalise it, but then we'd have to store that somewhere and there's
> > > no obvious good place to store it that wouldn't enlarge struct xarray,
> > > which I'd be reluctant to do.
> > >
> > >> Is it possible to use multiple range locks to protect one xarray to
> > >> improve the lock scalability?  This is why we have multiple "struct
> > >> address_space" for one swap device.  And, we may have same lock
> > >> contention issue for large files too.
> > >
> > > It's something I've considered.  The issue is search marks.  If we delete
> > > an entry, we may have to walk all the way up the xarray clearing bits as
> > > we go and I'd rather not grab a lock at each level.  There's a convenient
> > > 4 byte hole between nr_values and parent where we could put it.
> > >
> > > Oh, another issue is that we use i_pages.xa_lock to synchronise
> > > address_space.nrpages, so I'm not sure that a per-node lock will help.
> >
> > Thanks for looking at this.
> >
> > > But I'm conscious that there are workloads which show contention on
> > > xa_lock as their limiting factor, so I'm open to ideas to improve all
> > > these things.
> >
> > I have no idea so far because my very limited knowledge about xarray.
>
> For the swap file usage, I have been considering an idea to remove the
> index part of the xarray from swap cache. Swap cache is different from
> file cache in a few aspects.
> For one if we want to have a folio equivalent of "large swap entry".
> Then the natural alignment of those swap offset on does not make
> sense. Ideally we should be able to write the folio to un-aligned swap
> file locations.
>

Hi Chris,

This sound interesting, I have a few questions though...

Are you suggesting we handle swap on file and swap on device
differently? Swap on file is much less frequently used than swap on
device I think.

And what do you mean "index part of the xarray"? If we need a cache,
xarray still seems one of the best choices to hold the content.

> The other aspect for swap files is that, we already have different
> data structures organized around swap offset, swap_map and
> swap_cgroup. If we group the swap related data structure together. We
> can add a pointer to a union of folio or a shadow swap entry. We can
> use atomic updates on the swap struct member or breakdown the access
> lock by ranges just like swap cluster does.
>
> I want to discuss those ideas in the upcoming LSF/MM meet up as well.

Looking forward to it!

>
> Chris
Kairui Song April 28, 2024, 5:45 p.m. UTC | #12
On Mon, Apr 29, 2024 at 1:37 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Sat, Apr 27, 2024 at 7:16 AM Chris Li <chrisl@kernel.org> wrote:
> >
> > Hi Ying,
> >
> > On Tue, Apr 23, 2024 at 7:26 PM Huang, Ying <ying.huang@intel.com> wrote:
> > >
> > > Hi, Matthew,
> > >
> > > Matthew Wilcox <willy@infradead.org> writes:
> > >
> > > > On Mon, Apr 22, 2024 at 03:54:58PM +0800, Huang, Ying wrote:
> > > >> Is it possible to add "start_offset" support in xarray, so "index"
> > > >> will subtract "start_offset" before looking up / inserting?
> > > >
> > > > We kind of have that with XA_FLAGS_ZERO_BUSY which is used for
> > > > XA_FLAGS_ALLOC1.  But that's just one bit for the entry at 0.  We could
> > > > generalise it, but then we'd have to store that somewhere and there's
> > > > no obvious good place to store it that wouldn't enlarge struct xarray,
> > > > which I'd be reluctant to do.
> > > >
> > > >> Is it possible to use multiple range locks to protect one xarray to
> > > >> improve the lock scalability?  This is why we have multiple "struct
> > > >> address_space" for one swap device.  And, we may have same lock
> > > >> contention issue for large files too.
> > > >
> > > > It's something I've considered.  The issue is search marks.  If we delete
> > > > an entry, we may have to walk all the way up the xarray clearing bits as
> > > > we go and I'd rather not grab a lock at each level.  There's a convenient
> > > > 4 byte hole between nr_values and parent where we could put it.
> > > >
> > > > Oh, another issue is that we use i_pages.xa_lock to synchronise
> > > > address_space.nrpages, so I'm not sure that a per-node lock will help.
> > >
> > > Thanks for looking at this.
> > >
> > > > But I'm conscious that there are workloads which show contention on
> > > > xa_lock as their limiting factor, so I'm open to ideas to improve all
> > > > these things.
> > >
> > > I have no idea so far because my very limited knowledge about xarray.
> >
> > For the swap file usage, I have been considering an idea to remove the
> > index part of the xarray from swap cache. Swap cache is different from
> > file cache in a few aspects.
> > For one if we want to have a folio equivalent of "large swap entry".
> > Then the natural alignment of those swap offset on does not make
> > sense. Ideally we should be able to write the folio to un-aligned swap
> > file locations.
> >
>
> Hi Chris,
>
> This sound interesting, I have a few questions though...
>
> Are you suggesting we handle swap on file and swap on device
> differently? Swap on file is much less frequently used than swap on
> device I think.
>
> And what do you mean "index part of the xarray"? If we need a cache,
> xarray still seems one of the best choices to hold the content.
>
> > The other aspect for swap files is that, we already have different
> > data structures organized around swap offset, swap_map and
> > swap_cgroup. If we group the swap related data structure together. We
> > can add a pointer to a union of folio or a shadow swap entry. We can
> > use atomic updates on the swap struct member or breakdown the access
> > lock by ranges just like swap cluster does.

Oh, and BTW I'm also trying to breakdown the swap address space range
(from 64M to 16M, SWAP_ADDRESS_SPACE_SHIFT from 14 to
12). It's a simple approach, but the coupling and increased memory
usage of address_space structure makes the performance go into
regression (about -2% for worst real world workload). I found this
part very performance sensitive, so basically I'm not making much
progress for the future items I mentioned in this cover letter. New
ideas could be very helpful!

> >
> > I want to discuss those ideas in the upcoming LSF/MM meet up as well.
>
> Looking forward to it!
>
> >
> > Chris
Chris Li April 29, 2024, 5:50 a.m. UTC | #13
On Sun, Apr 28, 2024 at 10:37 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Sat, Apr 27, 2024 at 7:16 AM Chris Li <chrisl@kernel.org> wrote:
> >
> > Hi Ying,
> >
> > On Tue, Apr 23, 2024 at 7:26 PM Huang, Ying <ying.huang@intel.com> wrote:
> > >
> > > Hi, Matthew,
> > >
> > > Matthew Wilcox <willy@infradead.org> writes:
> > >
> > > > On Mon, Apr 22, 2024 at 03:54:58PM +0800, Huang, Ying wrote:
> > > >> Is it possible to add "start_offset" support in xarray, so "index"
> > > >> will subtract "start_offset" before looking up / inserting?
> > > >
> > > > We kind of have that with XA_FLAGS_ZERO_BUSY which is used for
> > > > XA_FLAGS_ALLOC1.  But that's just one bit for the entry at 0.  We could
> > > > generalise it, but then we'd have to store that somewhere and there's
> > > > no obvious good place to store it that wouldn't enlarge struct xarray,
> > > > which I'd be reluctant to do.
> > > >
> > > >> Is it possible to use multiple range locks to protect one xarray to
> > > >> improve the lock scalability?  This is why we have multiple "struct
> > > >> address_space" for one swap device.  And, we may have same lock
> > > >> contention issue for large files too.
> > > >
> > > > It's something I've considered.  The issue is search marks.  If we delete
> > > > an entry, we may have to walk all the way up the xarray clearing bits as
> > > > we go and I'd rather not grab a lock at each level.  There's a convenient
> > > > 4 byte hole between nr_values and parent where we could put it.
> > > >
> > > > Oh, another issue is that we use i_pages.xa_lock to synchronise
> > > > address_space.nrpages, so I'm not sure that a per-node lock will help.
> > >
> > > Thanks for looking at this.
> > >
> > > > But I'm conscious that there are workloads which show contention on
> > > > xa_lock as their limiting factor, so I'm open to ideas to improve all
> > > > these things.
> > >
> > > I have no idea so far because my very limited knowledge about xarray.
> >
> > For the swap file usage, I have been considering an idea to remove the
> > index part of the xarray from swap cache. Swap cache is different from
> > file cache in a few aspects.
> > For one if we want to have a folio equivalent of "large swap entry".
> > Then the natural alignment of those swap offset on does not make
> > sense. Ideally we should be able to write the folio to un-aligned swap
> > file locations.
> >
>
> Hi Chris,
>
> This sound interesting, I have a few questions though...
>
> Are you suggesting we handle swap on file and swap on device
> differently? Swap on file is much less frequently used than swap on
> device I think.

That is not what I have in mind. The swap struct idea did not
distinguish the swap file vs swap device.BTW, I sometimes use swap on
file because I did not allocate a swap partition in advance.

>
> And what do you mean "index part of the xarray"? If we need a cache,
> xarray still seems one of the best choices to hold the content.

We still need to look up swap file offset -> folio. However if we
allocate each swap offset a "struct swap", then the folio lookup can
be as simple as get the swap_struc by offset, then atomic read of
swap_structt->folio.

Not sure how you come to the conclusion for "best choices"?  It is one
choice, but it has its drawbacks. The natural alignment requirement of
xarray, e.g. 2M large swap entries need to be written to 2M aligned
offset, that is an unnecessary restriction. If we allocate the "struct
swap" ourselves, we have more flexibility.

> > The other aspect for swap files is that, we already have different
> > data structures organized around swap offset, swap_map and
> > swap_cgroup. If we group the swap related data structure together. We
> > can add a pointer to a union of folio or a shadow swap entry. We can
> > use atomic updates on the swap struct member or breakdown the access
> > lock by ranges just like swap cluster does.
> >
> > I want to discuss those ideas in the upcoming LSF/MM meet up as well.
>
> Looking forward to it!

Thanks, I will post more when I get more progress on that.

>
> Oh, and BTW I'm also trying to breakdown the swap address space range
> (from 64M to 16M, SWAP_ADDRESS_SPACE_SHIFT from 14 to
> 12). It's a simple approach, but the coupling and increased memory
> usage of address_space structure makes the performance go into
> regression (about -2% for worst real world workload). I found this

Yes, that sounds plausible.

> part very performance sensitive, so basically I'm not making much
> progress for the future items I mentioned in this cover letter. New
> ideas could be very helpful!
>

The swap_struct idea is very different from what you are trying to do
in this series. It is more related to my LSF/MM topic on the swap back
end overhaul. More long term and bigger undertakings.

Chris