[POC] xfs: reduce ilock contention on buffered randrw workload

Message ID 20190404165737.30889-1-amir73il@gmail.com
State New
Series
  • [POC] xfs: reduce ilock contention on buffered randrw workload

Commit Message

Amir Goldstein April 4, 2019, 4:57 p.m. UTC
This patch improves performance of a mixed random rw workload
on xfs without relaxing the atomic buffered read/write guarantee
that xfs has always provided.

We achieve that by calling generic_file_read_iter() twice:
once with a discard iterator to warm up the page cache before taking
the shared ilock, and once again under the shared ilock.

Since this is a POC patch, it also includes a separate fix to the
copy_page_to_iter() helper when called with a discard iterator.
As far as I could see, there is no other caller in the kernel that
passes a discard iterator to this helper.

Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---

Hi Folks,

With this experimental patch I was able to bring the performance
of a random rw workload benchmark on xfs much closer to ext4.

Ext4, like most Linux filesystems, does not take the shared inode
lock on buffered reads and does not provide atomic buffered
reads w.r.t. buffered writes.

Following are the numbers I got with filebench randomrw workload [1]
on a VM with 4 CPUs and spindles.

Note that this improvement is unrelated to the rw_semaphore starvation
issue that was observed when running the same benchmark on a fast SSD
drive [2].

=== random read/write - cold page cache ===
--- EXT4 ---
 filebench randomrw (8 read threads, 8 write threads)
 kernel 5.1.0-rc2, ext4
 rand-write1          862304ops    14235ops/s 111.2mb/s      0.5ms/op
 rand-read1           22065ops      364ops/s   2.8mb/s     21.5ms/op

--- XFS ---
 filebench randomrw (8 read threads, 8 write threads)
 kernel 5.1.0-rc2, xfs
 rand-write1          39451ops      657ops/s   5.1mb/s     12.0ms/op
 rand-read1           2035ops       34ops/s   0.3mb/s    232.7ms/op

--- XFS+ ---
 filebench randomrw (8 read threads, 8 write threads)
 kernel 5.1.0-rc2+ (xfs page cache warmup patch)
 rand-write1          935597ops    15592ops/s 121.8mb/s      0.5ms/op
 rand-read1           4446ops       74ops/s   0.6mb/s    107.6ms/op

To measure the effects of the two passes of generic_file_read_iter(), I ran
a random read benchmark [3] on a 5GB file with warm and cold page cache.

=== random read - cold page cache ===
--- EXT4 ---
 filebench randomread (8 read threads) - cold page cache
 kernel 5.1.0-rc2
 rand-read1           23589ops      393ops/s   3.1mb/s     20.3ms/op

--- XFS ---
 filebench randomread (8 read threads) - cold page cache
 kernel 5.1.0-rc2
 rand-read1           20578ops      343ops/s   2.7mb/s     23.3ms/op

--- XFS+ ---
 filebench randomread (8 read threads) - cold page cache
 kernel 5.1.0-rc2+ (xfs page cache warmup patch)
 rand-read1           20476ops      341ops/s   2.7mb/s     23.4ms/op

=== random read - warm page cache ===
--- EXT4 ---
 filebench randomread (8 read threads) - warm page cache
 kernel 5.1.0-rc2
 rand-read1           58168696ops   969410ops/s 7573.5mb/s      0.0ms/op

--- XFS ---
 filebench randomread (8 read threads) - warm page cache
 kernel 5.1.0-rc2
 rand-read1           52748818ops   878951ops/s 6866.8mb/s      0.0ms/op

--- XFS+ ---
 filebench randomread (8 read threads) - warm page cache
 kernel 5.1.0-rc2+ (xfs page cache warmup patch)
 rand-read1           52770537ops   879445ops/s 6870.7mb/s      0.0ms/op

The numbers from this benchmark do not show any measurable difference for
a readers-only workload with either cold or warm page cache.

If needed, I can provide more measurements with fio or with different
workloads and different drives.

Fire away!

Thanks,
Amir.

[1] https://github.com/amir73il/filebench/blob/overlayfs-devel/workloads/randomrw.f
[2] https://marc.info/?l=linux-xfs&m=155347265016053&w=2
[3] https://github.com/amir73il/filebench/blob/overlayfs-devel/workloads/randomread.f

 fs/xfs/xfs_file.c | 14 ++++++++++++++
 lib/iov_iter.c    |  5 +++--
 2 files changed, 17 insertions(+), 2 deletions(-)

Comments

Dave Chinner April 4, 2019, 9:17 p.m. UTC | #1
On Thu, Apr 04, 2019 at 07:57:37PM +0300, Amir Goldstein wrote:
> This patch improves performance of mixed random rw workload
> on xfs without relaxing the atomic buffered read/write guaranty
> that xfs has always provided.
> 
> We achieve that by calling generic_file_read_iter() twice.
> Once with a discard iterator to warm up page cache before taking
> the shared ilock and once again under shared ilock.

This will race with things like truncate, hole punching, etc. that
serialise IO and invalidate the page cache for data integrity
reasons under the IOLOCK. These rely on there being no IO to the
inode in progress at all to work correctly, which this patch
violates. IOWs, while this is fast, it is not safe and so is not a
viable approach to solving the problem.

FYI, I'm working on a range lock implementation that should both
solve the performance issue and the reader starvation issue at the
same time by allowing concurrent buffered reads and writes to
different file ranges.

IO range locks will allow proper exclusion for other extent
manipulation operations like fallocate and truncate, and eventually
even allow truncate, hole punch, file extension, etc to run
concurrently with other non-overlapping IO. They solve more than
just the performance issue you are seeing....

Cheers,

Dave.
Amir Goldstein April 5, 2019, 2:02 p.m. UTC | #2
On Fri, Apr 5, 2019 at 12:17 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, Apr 04, 2019 at 07:57:37PM +0300, Amir Goldstein wrote:
> > This patch improves performance of mixed random rw workload
> > on xfs without relaxing the atomic buffered read/write guaranty
> > that xfs has always provided.
> >
> > We achieve that by calling generic_file_read_iter() twice.
> > Once with a discard iterator to warm up page cache before taking
> > the shared ilock and once again under shared ilock.
>
> This will race with thing like truncate, hole punching, etc that
> serialise IO and invalidate the page cache for data integrity
> reasons under the IOLOCK. These rely on there being no IO to the
> inode in progress at all to work correctly, which this patch
> violates. IOWs, while this is fast, it is not safe and so not a
> viable approach to solving the problem.
>

This statement leaves me wondering: if ext4 does not take
i_rwsem in generic_file_read_iter(), how does ext4 (or any other
fs for that matter) guarantee buffered read synchronization with
truncate, hole punching, etc.?
The answer in ext4's case is i_mmap_sem, which is read locked
in the page fault handler.

And xfs does the same type of synchronization with MMAPLOCK,
so while my patch may not be safe, I cannot follow why from your
explanation, so please explain if I am missing something.

One thing that Darrick mentioned earlier was that IOLOCK is also
used by xfs to synchronize pNFS leases (probably listed under
'etc' in your explanation). I concede that my patch does not look safe
w.r.t. pNFS leases, but that can be sorted out with a hammer like
#ifndef CONFIG_EXPORTFS_BLOCK_OPS
or with finer instruments.
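
For illustration, the pNFS "hammer" on top of the POC patch could look
something like the sketch below - only do the unlocked page cache
warm-up when pNFS block layouts cannot be handed out at all. This is
just a sketch of the idea (untested), and it only addresses the pNFS
lease concern, nothing else:

	} else {
#ifndef CONFIG_EXPORTFS_BLOCK_OPS
		/*
		 * No pNFS block layouts are possible, so the unlocked page
		 * cache warm-up cannot race with a layout/lease break.
		 */
		struct iov_iter	iter;
		loff_t		pos = iocb->ki_pos;

		iov_iter_discard(&iter, READ, iov_iter_count(to));
		ret = generic_file_read_iter(iocb, &iter);
		if (ret <= 0)
			return ret;

		iocb->ki_pos = pos;
#endif
		xfs_ilock(ip, XFS_IOLOCK_SHARED);
	}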

> FYI, I'm working on a range lock implementation that should both
> solve the performance issue and the reader starvation issue at the
> same time by allowing concurrent buffered reads and writes to
> different file ranges.
>
> IO range locks will allow proper exclusion for other extent
> manipulation operations like fallocate and truncate, and eventually
> even allow truncate, hole punch, file extension, etc to run
> concurrently with other non-overlapping IO. They solve more than
> just the performance issue you are seeing....
>

I'm glad to hear that. IO range locks are definitely a more holistic
solution to the problem going forward.

However, I am still interested in continuing the discussion on my POC
patch. One reason is that I am guessing it would be much easier for
distros to backport and pick up to solve performance issues.

Even if my patch doesn't get applied upstream nor picked up by distros,
I would still like to understand its flaws and limitations. I know...
if I break it, I get to keep the pieces, but the information that you
provide helps me make my risk assessments.

Thanks,
Amir.
Dave Chinner April 7, 2019, 11:27 p.m. UTC | #3
On Fri, Apr 05, 2019 at 05:02:33PM +0300, Amir Goldstein wrote:
> On Fri, Apr 5, 2019 at 12:17 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Thu, Apr 04, 2019 at 07:57:37PM +0300, Amir Goldstein wrote:
> > > This patch improves performance of mixed random rw workload
> > > on xfs without relaxing the atomic buffered read/write guaranty
> > > that xfs has always provided.
> > >
> > > We achieve that by calling generic_file_read_iter() twice.
> > > Once with a discard iterator to warm up page cache before taking
> > > the shared ilock and once again under shared ilock.
> >
> > This will race with thing like truncate, hole punching, etc that
> > serialise IO and invalidate the page cache for data integrity
> > reasons under the IOLOCK. These rely on there being no IO to the
> > inode in progress at all to work correctly, which this patch
> > violates. IOWs, while this is fast, it is not safe and so not a
> > viable approach to solving the problem.
> >
> 
> This statement leaves me wondering, if ext4 does not takes
> i_rwsem on generic_file_read_iter(), how does ext4 (or any other
> fs for that matter) guaranty buffered read synchronization with
> truncate, hole punching etc?
> The answer in ext4 case is i_mmap_sem, which is read locked
> in the page fault handler.

Nope, the  i_mmap_sem is for serialisation of /page faults/ against
truncate, holepunching, etc. Completely irrelevant to the read()
path.

> And xfs does the same type of synchronization with MMAPLOCK,
> so while my patch may not be safe, I cannot follow why from your
> explanation, so please explain if I am missing something.

mmap_sem inversions require independent locks for IO path and page
faults - the MMAPLOCK does not protect anything in the
read()/write() IO path.

> One thing that Darrick mentioned earlier was that IOLOCK is also
> used by xfs to synchronization pNFS leases (probably listed under
> 'etc' in your explanation).

PNFS leases are separate to the IO locking. All the IO locking does
is serialise new IO submission against the process of breaking
leases so that extent maps that have been shared under the lease are
invalidated correctly. i.e. we can't start IO until the lease has
been recalled and the external client has guaranteed it won't
read/write data from the stale extent map.

If you do IO outside the IOLOCK, then you break those serialisation
guarantees and risk data corruption and/or stale data exposure...

> I consent that my patch does not look safe
> w.r.t pNFS leases, but that can be sorted out with a hammer
> #ifndef CONFIG_EXPORTFS_BLOCK_OPS
> or with finer instruments.

All you see is this:

truncate:				read()

IOLOCK_EXCL
  flush relevant cached data
  truncate page cache
					pre-read page cache between
					new eof and old eof
					IOLOCK_SHARED
					<blocks>
  start transaction
  ILOCK_EXCL
    update isize
    remove extents
....
  commit xactn
IOLOCK unlock
					<gets lock>
					sees beyond EOF, returns 0


So you see the read() doing the right thing (detects EOF, returns a
short read). Great.

But what I see is uptodate pages containing stale data being left in
the page cache beyond EOF. That is the problem here - truncate must
not leave stale pages beyond EOF behind - it's the landmine that
causes future things to go wrong.

e.g. now the app does post-eof preallocation so the range those
pages are cached over is allocated as unwritten - the filesystem
will do this without even looking at the page cache because it's
beyond EOF.  Now we extend the file past those cached pages, and
iomap_zero() sees the range as unwritten and so does not write zeros
to the blocks between the old EOF and the new EOF. Now the app reads
from that range (say it does a sub-page write, triggering a page
cache RMW cycle). The read goes to instantiate the page cache page,
finds a page already in the cache that is uptodate, and uses it
without zeroing or reading from disk.

And now we have stale data exposure and/or data corruption.
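
To make that sequence concrete, here is a rough userspace sketch of the
ordering described above (the file name, sizes and missing error
checking are illustrative only - this is not a reliable reproducer,
just the syscall pattern, and it assumes the file already holds 2MB of
non-zero data):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#define MB	(1024 * 1024)

int main(void)
{
	char	buf[512];
	int	fd = open("testfile", O_RDWR);

	/*
	 * A racing reader thread pread()ing between the new and old EOF
	 * here is what leaves uptodate pages cached beyond the new EOF.
	 */
	ftruncate(fd, 1 * MB);			/* truncate down */

	/* post-EOF preallocation - blocks allocated as unwritten */
	fallocate(fd, FALLOC_FL_KEEP_SIZE, 1 * MB, 1 * MB);

	ftruncate(fd, 2 * MB);			/* extend past the cached pages */

	/*
	 * Should see zeros; stale data from the leftover cached pages
	 * here means stale data exposure.
	 */
	pread(fd, buf, sizeof(buf), 1 * MB + 4096);
	return 0;
}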

I can come up with quite a few scenarios where this particular
"populate cache after invalidation" race can cause similar problems
for XFS. Hole punch and most of the other fallocate extent
manipulations have the same serialisation requirements - no IO over
the range of the operation can be *initiated* between the /start/ of
the page cache invalidation and the end of the specific extent
manipulation operation.

So how does ext4 avoid this problem on truncate?

History lesson: truncate in Linux (and hence ext4) has traditionally
been serialised by the hacky post-page-lock checks that are strewn
all through the page cache and mm/ subsystem. i.e. every time you
look up and lock a page cache page, you have to check the
page->mapping and page->index to ensure that the lookup-and-lock
hasn't raced with truncate. This only works because truncate
requires the inode size to be updated before invalidating the page
cache - that's the "hacky" part of it.
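
That check pattern looks roughly like this wherever the page cache does
a lookup-and-lock (a simplified sketch, not a verbatim copy of
mm/filemap.c):

repeat:
	page = find_get_page(mapping, index);
	if (!page)
		goto no_cached_page;
	lock_page(page);
	/* Did the page get truncated while we waited for the lock? */
	if (unlikely(page->mapping != mapping)) {
		unlock_page(page);
		put_page(page);
		goto repeat;
	}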

IOWs, the burden of detecting truncate races is strewn throughout
the mm/ subsystem, rather than being the responsibility of the
filesystem. This is made worse by the fact that this mechanism simply
doesn't work for hole punching because there is no file size change
to indicate that the page lookup is racing with an in-progress
invalidation.

That means the mm/ and page cache code is unable to detect hole
punch races, and so the serialisation of invalidation vs page cache
instantiation has to be done in the filesystem. And no Linux native
filesystem had the infrastructure for such serialisation because
they never had to implement anything to ensure truncate was
serialised against new and in-progress IO.

The result of this is that, AFAICT, ext4 does not protect against
read() vs hole punch races - its hole punching code does:

Hole Punch:				read():

inode_lock()
inode_dio_wait(inode);
down_write(i_mmap_sem)
truncate_pagecache_range()
					ext4_file_iter_read()
					  ext4_map_blocks()
					    down_read(i_data_sem)
					     <gets mapping>
					<populates page cache over hole>
					<reads stale data into cache>
					.....
down_write(i_data_sem)
  remove extents

IOWs, ext4 is safe against truncate because of the
change-inode-size-before-invalidation hacks, but the lack of
serialisation of buffered reads means that hole punch and other similar
fallocate based extent manipulations can race against reads....

> However, I am still interested to continue the discussion on my POC
> patch. One reason is that I am guessing it would be much easier for
> distros to backport and pick up to solve performance issues.

Upstream first, please. If it's not fit for upstream, then it most
definitely is not fit for backporting to distro kernels.

Cheers,

Dave.
Amir Goldstein April 8, 2019, 9:02 a.m. UTC | #4
On Mon, Apr 8, 2019 at 2:27 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Fri, Apr 05, 2019 at 05:02:33PM +0300, Amir Goldstein wrote:
> > On Fri, Apr 5, 2019 at 12:17 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Thu, Apr 04, 2019 at 07:57:37PM +0300, Amir Goldstein wrote:
> > > > This patch improves performance of mixed random rw workload
> > > > on xfs without relaxing the atomic buffered read/write guaranty
> > > > that xfs has always provided.
> > > >
> > > > We achieve that by calling generic_file_read_iter() twice.
> > > > Once with a discard iterator to warm up page cache before taking
> > > > the shared ilock and once again under shared ilock.
> > >
> > > This will race with thing like truncate, hole punching, etc that
> > > serialise IO and invalidate the page cache for data integrity
> > > reasons under the IOLOCK. These rely on there being no IO to the
> > > inode in progress at all to work correctly, which this patch
> > > violates. IOWs, while this is fast, it is not safe and so not a
> > > viable approach to solving the problem.
> > >
> >
> > This statement leaves me wondering, if ext4 does not takes
> > i_rwsem on generic_file_read_iter(), how does ext4 (or any other
> > fs for that matter) guaranty buffered read synchronization with
> > truncate, hole punching etc?
> > The answer in ext4 case is i_mmap_sem, which is read locked
> > in the page fault handler.
>
> Nope, the  i_mmap_sem is for serialisation of /page faults/ against
> truncate, holepunching, etc. Completely irrelevant to the read()
> path.
>

I'm at a loss here. Why are page faults completely irrelevant to the read()
path? Aren't full pages supposed to be faulted in on read() after
truncate_pagecache_range()? And aren't partial pages supposed
to be partially zeroed and uptodate after truncate_pagecache_range()?

> > And xfs does the same type of synchronization with MMAPLOCK,
> > so while my patch may not be safe, I cannot follow why from your
> > explanation, so please explain if I am missing something.
>
> mmap_sem inversions require independent locks for IO path and page
> faults - the MMAPLOCK does not protect anything in the
> read()/write() IO path.
>
[...]
>
> All you see is this:
>
> truncate:                               read()
>
> IOLOCK_EXCL
>   flush relevant cached data
>   truncate page cache
>                                         pre-read page cache between
>                                         new eof and old eof
>                                         IOLOCK_SHARED
>                                         <blocks>
>   start transaction
>   ILOCK_EXCL
>     update isize
>     remove extents
> ....
>   commit xactn
> IOLOCK unlock
>                                         <gets lock>
>                                         sees beyond EOF, returns 0
>
>
> So you see the read() doing the right thing (detect EOF, returning
> short read). Great.
>
> But what I see is uptodate pages containing stale data being left in
> the page cache beyond EOF. That is th eproblem here - truncate must
> not leave stale pages beyond EOF behind - it's the landmine that
> causes future things to go wrong.
>
> e.g. now the app does post-eof preallocation so the range those
> pages are cached over are allocated as unwritten - the filesystem
> will do this without even looking at the page cache because it's
> beyond EOF.  Now we extend the file past those cached pages, and
> iomap_zero() sees the range as unwritten and so does not write zeros
> to the blocks between the old EOF and the new EOF. Now the app reads
> from that range (say it does a sub-page write, triggering a page
> cache RMW cycle). the read goes to instantiate the page cache page,
> finds a page already in the cache that is uptodate, and uses it
> without zeroing or reading from disk.
>
> And now we have stale data exposure and/or data corruption.
>
> If can come up with quite a few scenarios where this particular
> "populate cache after invalidation" race can cause similar problems
> for XFS. Hole punch and most of the other fallocate extent
> manipulations have the same serialisation requirements - no IO over
> the range of the operation can be *initiated* between the /start/ of
> the page cache invalidation and the end of the specific extent
> manipulation operation.
>
> So how does ext4 avoid this problem on truncate?
>
> History lesson: truncate in Linux (and hence ext4) has traditionally
> been serialised by the hacky post-page-lock checks that are strewn
> all through the page cache and mm/ subsystem. i.e. every time you
> look up and lock a page cache page, you have to check the
> page->mapping and page->index to ensure that the lookup-and-lock
> hasn't raced with truncate. This only works because truncate
> requires the inode size to be updated before invalidating the page
> cache - that's the "hacky" part of it.
>
> IOWs, the burden of detecting truncate races is strewn throughout
> the mm/ subsystem, rather than being the responisibility of the
> filesystem. This is made worse by the fact this mechanism simply
> doesn't work for hole punching because there is no file size change
> to indicate that the page lookup is racing with an in-progress
> invalidation.
>
> That means the mm/ and page cache code is unable to detect hole
> punch races, and so the serialisation of invalidation vs page cache
> instantiation has to be done in the filesystem. And no Linux native
> filesystem had the infrastructure for such serialisation because
> they never had to implement anything to ensure truncate was
> serialised against new and in-progress IO.
>
> The result of this is that, AFAICT, ext4 does not protect against
> read() vs hole punch races - it's hole punching code it does:
>
> Hole Punch:                             read():
>
> inode_lock()
> inode_dio_wait(inode);
> down_write(i_mmap_sem)
> truncate_pagecache_range()
>                                         ext4_file_iter_read()
>                                           ext4_map_blocks()
>                                             down_read(i_data_sem)
>                                              <gets mapping>
>                                         <populates page cache over hole>
>                                         <reads stale data into cache>
>                                         .....
> down_write(i_data_sem)
>   remove extents
>
> IOWs, ext4 is safe against truncate because of the
> change-inode-size-before-invalidation hacks, but the lack of
> serialise buffered reads means that hole punch and other similar
> fallocate based extent manipulations can race against reads....
>

Adding some ext4 folks to comment on the above.
Could it be that those races were already addressed by Lukas' work:
https://lore.kernel.org/patchwork/cover/371861/

Thanks,
Amir.
Jan Kara April 8, 2019, 10:33 a.m. UTC | #5
On Fri 05-04-19 08:17:30, Dave Chinner wrote:
> FYI, I'm working on a range lock implementation that should both
> solve the performance issue and the reader starvation issue at the
> same time by allowing concurrent buffered reads and writes to
> different file ranges.

Are you aware of the range locks Davidlohr has implemented [1]? They didn't
get merged because he had no in-tree user at the time (he was more aiming at
converting mmap_sem, which is rather difficult). But the generic lock
implementation should be quite usable.

Added Davidlohr to CC.

								Honza

[1] https://lkml.org/lkml/2017/3/7/22
Jan Kara April 8, 2019, 11:03 a.m. UTC | #6
On Mon 08-04-19 09:27:28, Dave Chinner wrote:
> The result of this is that, AFAICT, ext4 does not protect against
> read() vs hole punch races - it's hole punching code it does:
> 
> Hole Punch:				read():
> 
> inode_lock()
> inode_dio_wait(inode);
> down_write(i_mmap_sem)
> truncate_pagecache_range()
> 					ext4_file_iter_read()
> 					  ext4_map_blocks()
> 					    down_read(i_data_sem)
> 					     <gets mapping>
> 					<populates page cache over hole>
> 					<reads stale data into cache>
> 					.....
> down_write(i_data_sem)
>   remove extents
> 
> IOWs, ext4 is safe against truncate because of the
> change-inode-size-before-invalidation hacks, but the lack of
> serialise buffered reads means that hole punch and other similar
> fallocate based extent manipulations can race against reads....

Hum, you are right. Ext4 is buggy in this regard. I've fixed the race for
page faults in ea3d7209ca01 "ext4: fix races between page faults and hole
punching" but didn't realize the problem is there for buffered reads as
well. I'll think about how we can fix this. Thanks for noticing this!

								Honza
Jan Kara April 8, 2019, 2:11 p.m. UTC | #7
On Mon 08-04-19 12:02:34, Amir Goldstein wrote:
> On Mon, Apr 8, 2019 at 2:27 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Fri, Apr 05, 2019 at 05:02:33PM +0300, Amir Goldstein wrote:
> > > On Fri, Apr 5, 2019 at 12:17 AM Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > On Thu, Apr 04, 2019 at 07:57:37PM +0300, Amir Goldstein wrote:
> > > > > This patch improves performance of mixed random rw workload
> > > > > on xfs without relaxing the atomic buffered read/write guaranty
> > > > > that xfs has always provided.
> > > > >
> > > > > We achieve that by calling generic_file_read_iter() twice.
> > > > > Once with a discard iterator to warm up page cache before taking
> > > > > the shared ilock and once again under shared ilock.
> > > >
> > > > This will race with thing like truncate, hole punching, etc that
> > > > serialise IO and invalidate the page cache for data integrity
> > > > reasons under the IOLOCK. These rely on there being no IO to the
> > > > inode in progress at all to work correctly, which this patch
> > > > violates. IOWs, while this is fast, it is not safe and so not a
> > > > viable approach to solving the problem.
> > > >
> > >
> > > This statement leaves me wondering, if ext4 does not takes
> > > i_rwsem on generic_file_read_iter(), how does ext4 (or any other
> > > fs for that matter) guaranty buffered read synchronization with
> > > truncate, hole punching etc?
> > > The answer in ext4 case is i_mmap_sem, which is read locked
> > > in the page fault handler.
> >
> > Nope, the  i_mmap_sem is for serialisation of /page faults/ against
> > truncate, holepunching, etc. Completely irrelevant to the read()
> > path.
> >
> 
> I'm at lost here. Why are page faults completely irrelevant to read()
> path? Aren't full pages supposed to be faulted in on read() after
> truncate_pagecache_range()?

During read(2), pages are not "faulted in". Just look at
what generic_file_buffered_read() does. It uses completely separate code to
add pages to the page cache, trigger readahead, and possibly call ->readpage() to
fill the page with data. The "fault" path (handled by filemap_fault()) applies
only to accesses from userspace to mmaps.
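
Roughly, the two paths look like this (simplified call chains for a
5.1-era kernel; the per-fs lock names are what ext4 and xfs take in
their fault handlers):

read(2):	->read_iter()
		  -> generic_file_read_iter()
		    -> generic_file_buffered_read()
		      -> add pages to page cache, readahead, ->readpage()

mmap access:	page fault
		  -> ext4_filemap_fault() / xfs_filemap_fault()
		       takes i_mmap_sem / XFS_MMAPLOCK_SHARED
		     -> filemap_fault()
		       -> add pages to page cache, readahead, ->readpage()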

> And aren't partial pages supposed to be partially zeroed and uptodate
> after truncate_pagecache_range()?

truncate_pagecache_range() does take care of page zeroing, that is correct
(if those pages are in page cache in the first place). I'm not sure how
this relates to the above discussion though.

								Honza
Davidlohr Bueso April 8, 2019, 4:37 p.m. UTC | #8
On Mon, 2019-04-08 at 12:33 +0200, Jan Kara wrote:
> On Fri 05-04-19 08:17:30, Dave Chinner wrote:
> > FYI, I'm working on a range lock implementation that should both
> > solve the performance issue and the reader starvation issue at the
> > same time by allowing concurrent buffered reads and writes to
> > different file ranges.
> 
> Are you aware of range locks Davidlohr has implemented [1]? It didn't get
> merged because he had no in-tree user at the time (he was more aiming at
> converting mmap_sem which is rather difficult). But the generic lock
> implementation should be well usable.
> 
> Added Davidlohr to CC.
> 
> 								Honza
> 
> [1] https://lkml.org/lkml/2017/3/7/22

fyi this was the latest version (had some naming updates per peterz).

https://lkml.org/lkml/2018/2/4/232
Amir Goldstein April 8, 2019, 5:41 p.m. UTC | #9
On Mon, Apr 8, 2019 at 5:11 PM Jan Kara <jack@suse.cz> wrote:
>
> On Mon 08-04-19 12:02:34, Amir Goldstein wrote:
> > On Mon, Apr 8, 2019 at 2:27 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Fri, Apr 05, 2019 at 05:02:33PM +0300, Amir Goldstein wrote:
> > > > On Fri, Apr 5, 2019 at 12:17 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > >
> > > > > On Thu, Apr 04, 2019 at 07:57:37PM +0300, Amir Goldstein wrote:
> > > > > > This patch improves performance of mixed random rw workload
> > > > > > on xfs without relaxing the atomic buffered read/write guaranty
> > > > > > that xfs has always provided.
> > > > > >
> > > > > > We achieve that by calling generic_file_read_iter() twice.
> > > > > > Once with a discard iterator to warm up page cache before taking
> > > > > > the shared ilock and once again under shared ilock.
> > > > >
> > > > > This will race with thing like truncate, hole punching, etc that
> > > > > serialise IO and invalidate the page cache for data integrity
> > > > > reasons under the IOLOCK. These rely on there being no IO to the
> > > > > inode in progress at all to work correctly, which this patch
> > > > > violates. IOWs, while this is fast, it is not safe and so not a
> > > > > viable approach to solving the problem.
> > > > >
> > > >
> > > > This statement leaves me wondering, if ext4 does not takes
> > > > i_rwsem on generic_file_read_iter(), how does ext4 (or any other
> > > > fs for that matter) guaranty buffered read synchronization with
> > > > truncate, hole punching etc?
> > > > The answer in ext4 case is i_mmap_sem, which is read locked
> > > > in the page fault handler.
> > >
> > > Nope, the  i_mmap_sem is for serialisation of /page faults/ against
> > > truncate, holepunching, etc. Completely irrelevant to the read()
> > > path.
> > >
> >
> > I'm at lost here. Why are page faults completely irrelevant to read()
> > path? Aren't full pages supposed to be faulted in on read() after
> > truncate_pagecache_range()?
>
> During read(2), pages are not "faulted in". Just look at
> what generic_file_buffered_read() does. It uses completely separate code to
> add page to page cache, trigger readahead, and possibly call ->readpage() to
> fill the page with data. "fault" path (handled by filemap_fault()) applies
> only to accesses from userspace to mmaps.
>

Oh! Thanks for fixing my blind spot.
So if you agree with Dave that ext4, and who knows what other filesystems,
are vulnerable to populating the page cache with stale "uptodate" data,
then it seems to me that xfs is also vulnerable via readahead(2) and
posix_fadvise().
Mind you, I recently added an fadvise f_op, so it could be used by
xfs to synchronize with the IOLOCK.
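
Something along these lines, sketched on top of the ->fadvise() file
operation (untested, just to show the idea of taking the IOLOCK around
advice that populates the page cache; generic_fadvise() may need to be
exported for this):

STATIC int
xfs_file_fadvise(
	struct file	*file,
	loff_t		offset,
	loff_t		len,
	int		advice)
{
	struct xfs_inode *ip = XFS_I(file_inode(file));
	int		lockflags = 0;
	int		ret;

	/*
	 * WILLNEED instantiates page cache pages, so give it the same
	 * protection from hole punch et al. as the buffered read path.
	 */
	if (advice == POSIX_FADV_WILLNEED) {
		lockflags = XFS_IOLOCK_SHARED;
		xfs_ilock(ip, lockflags);
	}
	ret = generic_fadvise(file, offset, len, advice);
	if (lockflags)
		xfs_iunlock(ip, lockflags);
	return ret;
}

And since readahead(2) is implemented via POSIX_FADV_WILLNEED these
days, IIRC this would cover both entry points.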

Perhaps a better solution would be for truncate_pagecache_range()
to leave zeroed or Unwritten (i.e. lazy zeroed by read) pages in page
cache. When we have shared pages for files, these pages could be
deduped.

Thanks,
Amir.
Jan Kara April 9, 2019, 8:26 a.m. UTC | #10
On Mon 08-04-19 20:41:09, Amir Goldstein wrote:
> On Mon, Apr 8, 2019 at 5:11 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Mon 08-04-19 12:02:34, Amir Goldstein wrote:
> > > On Mon, Apr 8, 2019 at 2:27 AM Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > On Fri, Apr 05, 2019 at 05:02:33PM +0300, Amir Goldstein wrote:
> > > > > On Fri, Apr 5, 2019 at 12:17 AM Dave Chinner <david@fromorbit.com> wrote:
> > > > > >
> > > > > > On Thu, Apr 04, 2019 at 07:57:37PM +0300, Amir Goldstein wrote:
> > > > > > > This patch improves performance of mixed random rw workload
> > > > > > > on xfs without relaxing the atomic buffered read/write guaranty
> > > > > > > that xfs has always provided.
> > > > > > >
> > > > > > > We achieve that by calling generic_file_read_iter() twice.
> > > > > > > Once with a discard iterator to warm up page cache before taking
> > > > > > > the shared ilock and once again under shared ilock.
> > > > > >
> > > > > > This will race with thing like truncate, hole punching, etc that
> > > > > > serialise IO and invalidate the page cache for data integrity
> > > > > > reasons under the IOLOCK. These rely on there being no IO to the
> > > > > > inode in progress at all to work correctly, which this patch
> > > > > > violates. IOWs, while this is fast, it is not safe and so not a
> > > > > > viable approach to solving the problem.
> > > > > >
> > > > >
> > > > > This statement leaves me wondering, if ext4 does not takes
> > > > > i_rwsem on generic_file_read_iter(), how does ext4 (or any other
> > > > > fs for that matter) guaranty buffered read synchronization with
> > > > > truncate, hole punching etc?
> > > > > The answer in ext4 case is i_mmap_sem, which is read locked
> > > > > in the page fault handler.
> > > >
> > > > Nope, the  i_mmap_sem is for serialisation of /page faults/ against
> > > > truncate, holepunching, etc. Completely irrelevant to the read()
> > > > path.
> > > >
> > >
> > > I'm at lost here. Why are page faults completely irrelevant to read()
> > > path? Aren't full pages supposed to be faulted in on read() after
> > > truncate_pagecache_range()?
> >
> > During read(2), pages are not "faulted in". Just look at
> > what generic_file_buffered_read() does. It uses completely separate code to
> > add page to page cache, trigger readahead, and possibly call ->readpage() to
> > fill the page with data. "fault" path (handled by filemap_fault()) applies
> > only to accesses from userspace to mmaps.
> >
> 
> Oh! thanks for fixing my blind spot.
> So if you agree with Dave that ext4, and who knows what other fs,
> are vulnerable to populating page cache with stale "uptodate" data,

Not that many filesystems support punching holes but you're right.

> then it seems to me that also xfs is vulnerable via readahead(2) and
> posix_fadvise().

Yes, this is correct AFAICT.

> Mind you, I recently added an fadvise f_op, so it could be used by
> xfs to synchronize with IOLOCK.

And yes, this should work.

> Perhaps a better solution would be for truncate_pagecache_range()
> to leave zeroed or Unwritten (i.e. lazy zeroed by read) pages in page
> cache. When we have shared pages for files, these pages could be
> deduped.

No, I wouldn't really mess with sharing pages due to this. It would be hard
to make that scale reasonably and it would be rather complex. We really need a
proper and reasonably simple synchronization mechanism between operations
removing blocks from an inode and operations filling in the page cache of the
inode. The page lock was supposed to provide this but doesn't quite work
because hole punching first removes pagecache pages and then goes on to remove
all the blocks.

So I agree with Dave that going for a range lock is really the cleanest way
forward here without causing big regressions for mixed rw workloads. I'm
just thinking how to best do that without introducing a lot of boilerplate
code into each filesystem.

								Honza
Dave Chinner April 11, 2019, 1:11 a.m. UTC | #11
On Mon, Apr 08, 2019 at 09:37:09AM -0700, Davidlohr Bueso wrote:
> On Mon, 2019-04-08 at 12:33 +0200, Jan Kara wrote:
> > On Fri 05-04-19 08:17:30, Dave Chinner wrote:
> > > FYI, I'm working on a range lock implementation that should both
> > > solve the performance issue and the reader starvation issue at the
> > > same time by allowing concurrent buffered reads and writes to
> > > different file ranges.
> > 
> > Are you aware of range locks Davidlohr has implemented [1]? It didn't get
> > merged because he had no in-tree user at the time (he was more aiming at
> > converting mmap_sem which is rather difficult). But the generic lock
> > implementation should be well usable.
> > 
> > Added Davidlohr to CC.
> > 
> > 								Honza
> > 
> > [1] https://lkml.org/lkml/2017/3/7/22
> 
> fyi this was the latest version (had some naming updates per peterz).
> 
> https://lkml.org/lkml/2018/2/4/232

No, I wasn't aware of these because they haven't ever been posted to
a list I subscribe to and they haven't been merged. I'll go have a
look at them over the next few days.

Cheers,

Dave.
Dave Chinner April 16, 2019, 12:22 p.m. UTC | #12
On Thu, Apr 11, 2019 at 11:11:17AM +1000, Dave Chinner wrote:
> On Mon, Apr 08, 2019 at 09:37:09AM -0700, Davidlohr Bueso wrote:
> > On Mon, 2019-04-08 at 12:33 +0200, Jan Kara wrote:
> > > On Fri 05-04-19 08:17:30, Dave Chinner wrote:
> > > > FYI, I'm working on a range lock implementation that should both
> > > > solve the performance issue and the reader starvation issue at the
> > > > same time by allowing concurrent buffered reads and writes to
> > > > different file ranges.
> > > 
> > > Are you aware of range locks Davidlohr has implemented [1]? It didn't get
> > > merged because he had no in-tree user at the time (he was more aiming at
> > > converting mmap_sem which is rather difficult). But the generic lock
> > > implementation should be well usable.
> > > 
> > > Added Davidlohr to CC.
> > > 
> > > 								Honza
> > > 
> > > [1] https://lkml.org/lkml/2017/3/7/22
> > 
> > fyi this was the latest version (had some naming updates per peterz).
> > 
> > https://lkml.org/lkml/2018/2/4/232
> 
> No, I wasn't aware of these because they haven't ever been posted to
> a list I subscribe to and they haven't been merged. I'll go have a
> look at them over the next few days.

Rather disappointing, to tell the truth.

The implementation doesn't scale nearly as well as a rwsem for concurrent IO
using shared access (i.e. XFS direct IO). That's the baseline we
need to get close to for range locks to be a viable prospect.

Fio randrw numbers on a single file on a pmem device on a 16p
machine using 4kB AIO-DIO iodepth 128 w/ fio on 5.1.0-rc3:

			IOPS read/write (direct IO)
fio processes		rwsem			rangelock
 1			78k / 78k		75k / 75k
 2			131k / 131k		123k / 123k
 4			267k / 267k		183k / 183k
 8			372k / 372k		177k / 177k
 16			315k / 315k		135k / 135k

So uncontended, non-overlapping range performance is not
really even in the ballpark right now, unfortunately.

To indicate that range locking is actually working, let's do
buffered read/write, which takes the rwsem exclusive for writes
and so will be bound by that:

			IOPS read/write (buffered IO)
fio processes		rwsem			rangelock
 1			57k / 57k		64k / 64k
 2			61k / 61k		111k / 111k
 4			61k / 61k		228k / 228k
 8			55k / 55k		195k / 195k
 16			15k / 15k		 40k /  40k

So the win for mixed buffered IO is huge, but it comes at the cost
of direct IO performance. We can't make that tradeoff, so if this
implementation cannot be substantially improved we really can't
use it.

Fundamentally, the problem is that the interval tree work is all
done under a spinlock, so by the time we get to 16p, 70% of the CPU
time being burnt is on the spinlock, and only 30% of the CPU time
is actually doing IO work:

+   70.78%     2.50%  [kernel]             [k] do_raw_spin_lock
+   69.72%     0.07%  [kernel]             [k] _raw_spin_lock_irqsave
-   68.27%    68.10%  [kernel]             [k] __pv_queued_spin_lock_slowpath
.....
+   15.61%     0.07%  [kernel]             [k] range_write_unlock
+   15.39%     0.06%  [kernel]             [k] range_read_unlock
+   15.11%     0.12%  [kernel]             [k] range_read_lock
+   15.11%     0.13%  [kernel]             [k] range_write_lock
.....
+   12.92%     0.11%  [kernel]             [k] range_is_locked

FWIW, I'm not convinced about the scalability of the rb/interval
tree, to tell you the truth. We got rid of the rbtree in XFS for
cache indexing because the multi-level pointer chasing was just too
expensive to do under a spinlock - it's just not a cache efficient
structure for random index object storage.

FWIW, I have a basic hack to replace the i_rwsem in XFS with a full
range read or write lock with my XFS range lock implementation so it
just behaves like a rwsem at this point. It is not in any way
optimised at this point. Numbers for the same AIO-DIO test are:

			IOPS read/write (direct IO)
processes	rwsem		DB rangelock	XFS rangelock
 1		78k / 78k	75k / 75k	74k / 74k
 2		131k / 131k	123k / 123k	134k / 134k
 4		267k / 267k	183k / 183k	246k / 246k
 8		372k / 372k	177k / 177k	306k / 306k
 16		315k / 315k	135k / 135k	240k / 240k

Performance is much closer to rwsems, but the single lock critical
section still causes serious amounts of cacheline contention on pure
shared-read lock workloads like this.

FWIW, the XFS rangelock numbers match what I've been seeing with
userspace perf tests using unoptimised pthread mutexes - ~600,000
lock/unlock cycles a second over ~100 concurrent non-overlapping
ranges seem to be the limitation before futex contention burns down
the house

As it is, I've done all the work to make XFS use proper range locks
on top of your patch - I'll go back tomorrow and modify the XFS
range locks to use the same API as your lock implementation. I'm
interested to see what falls out when they are used as real range
locks rather than a glorified rwsem....

Cheers,

Dave.

PS: there's a double lock bug in range_is_locked().....
Dave Chinner April 18, 2019, 3:10 a.m. UTC | #13
On Tue, Apr 16, 2019 at 10:22:40PM +1000, Dave Chinner wrote:
> On Thu, Apr 11, 2019 at 11:11:17AM +1000, Dave Chinner wrote:
> > On Mon, Apr 08, 2019 at 09:37:09AM -0700, Davidlohr Bueso wrote:
> > > On Mon, 2019-04-08 at 12:33 +0200, Jan Kara wrote:
> > > > On Fri 05-04-19 08:17:30, Dave Chinner wrote:
> > > > > FYI, I'm working on a range lock implementation that should both
> > > > > solve the performance issue and the reader starvation issue at the
> > > > > same time by allowing concurrent buffered reads and writes to
> > > > > different file ranges.
> > > > 
> > > > Are you aware of range locks Davidlohr has implemented [1]? It didn't get
> > > > merged because he had no in-tree user at the time (he was more aiming at
> > > > converting mmap_sem which is rather difficult). But the generic lock
> > > > implementation should be well usable.
> > > > 
> > > > Added Davidlohr to CC.
.....
> Fio randrw numbers on a single file on a pmem device on a 16p
> machine using 4kB AIO-DIO iodepth 128 w/ fio on 5.1.0-rc3:
> 
> 			IOPS read/write (direct IO)
> fio processes		rwsem			rangelock
>  1			78k / 78k		75k / 75k
>  2			131k / 131k		123k / 123k
>  4			267k / 267k		183k / 183k
>  8			372k / 372k		177k / 177k
>  16			315k / 315k		135k / 135k
....

> FWIW, I'm not convinced about the scalability of the rb/interval
> tree, to tell you the truth. We got rid of the rbtree in XFS for
> cache indexing because the multi-level pointer chasing was just too
> expensive to do under a spinlock - it's just not a cache efficient
> structure for random index object storage.

Yeah, definitely not convinced an rbtree is the right structure
here. Locking of the tree is the limitation....

> FWIW, I have basic hack to replace the i_rwsem in XFS with a full
> range read or write lock with my XFS range lock implementation so it
> just behaves like a rwsem at this point. It is not in any way
> optimised at this point. Numbers for same AIO-DIO test are:

Now the stuff I've been working on has the same interface as
Davidlohr's patch, so I can swap and change them without thinking
about it. It's still completely unoptimised, but:

			IOPS read/write (direct IO)
processes	rwsem		DB rangelock	XFS rangelock
 1		78k / 78k	75k / 75k	72k / 72k
 2		131k / 131k	123k / 123k	133k / 133k
 4		267k / 267k	183k / 183k	237k / 237k
 8		372k / 372k	177k / 177k	265k / 265k
 16		315k / 315k	135k / 135k	228k / 228k

It's still substantially faster than the interval tree code.

BTW, if I take away the rwsem serialisation altogether, this
test tops out at just under 500k/500k at 8 threads, and at 16
threads has started dropping off (~440k/440k). So the rwsem is
a scalability limitation at just 8 threads....

/me goes off and thinks more about adding optimistic lock coupling
to the XFS iext btree to get rid of the need for tree-wide
locking altogether
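
For anyone not familiar with the term, optimistic lock coupling is
essentially seqcount-validated descent through the btree - a hand-wavy
sketch (names made up, this is not the actual iext btree code):

retry:
	node = tree->root;
	seq = read_seqcount_begin(&node->seq);
	while (!node_is_leaf(node)) {
		child = node_find_child(node, key);
		child_seq = read_seqcount_begin(&child->seq);
		/* did a writer change the parent while we looked at it? */
		if (read_seqcount_retry(&node->seq, seq))
			goto retry;
		node = child;
		seq = child_seq;
	}
	/*
	 * Leaf reached without taking any locks - writers only lock and
	 * bump the sequence counts of the nodes they modify, so
	 * non-conflicting lookups and modifications never serialise on a
	 * tree-wide lock.
	 */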

Cheers,

Dave.
Davidlohr Bueso April 18, 2019, 6:21 p.m. UTC | #14
On Thu, 2019-04-18 at 13:10 +1000, Dave Chinner wrote:
> Now the stuff I've been working on has the same interface as
> Davidlohr's patch, so I can swap and change them without thinking
> about it. It's still completely unoptimised, but:
> 
> 			IOPS read/write (direct IO)
> processes	rwsem		DB rangelock	XFS
> rangelock
>  1		78k / 78k	75k / 75k	72k / 72k
>  2		131k / 131k	123k / 123k	133k / 133k
>  4		267k / 267k	183k / 183k	237k / 237k
>  8		372k / 372k	177k / 177k	265k / 265k
>  16		315k / 315k	135k / 135k	228k / 228k
> 
> It's still substantially faster than the interval tree code.

In general, another big difference between rangelocks and rwsems
(when comparing them with full ranges) is that the latter will do
writer optimistic spinning, so saving a context switch under the right
scenarios provides major wins for rwsems -- I'm not sure if this
applies to your fio tests, though. And pretty soon readers will also do
this, hence rwsem will become a try-hard-not-to-sleep lock.

One of the reasons why I was hesitant with Btrees was the fact that
insertion requires memory allocation, something I wanted to avoid...
per your numbers, sacrificing tree depth was the wrong choice. Thanks
for sharing these numbers.

> 
> BTW, if I take away the rwsem serialisation altogether, this
> test tops out at just under 500k/500k at 8 threads, and at 16
> threads has started dropping off (~440k/440k). So the rwsem is
> a scalability limitation at just 8 threads....
> 
> /me goes off and thinks more about adding optimistic lock coupling
> to the XFS iext btree to get rid of the need for tree-wide
> locking altogether

I was not aware of this code.

Thanks,
Davidlohr
Dave Chinner April 20, 2019, 11:54 p.m. UTC | #15
On Thu, Apr 18, 2019 at 11:21:34AM -0700, Davidlohr Bueso wrote:
> On Thu, 2019-04-18 at 13:10 +1000, Dave Chinner wrote:
> > Now the stuff I've been working on has the same interface as
> > Davidlohr's patch, so I can swap and change them without thinking
> > about it. It's still completely unoptimised, but:
> > 
> > 			IOPS read/write (direct IO)
> > processes	rwsem		DB rangelock	XFS
> > rangelock
> >  1		78k / 78k	75k / 75k	72k / 72k
> >  2		131k / 131k	123k / 123k	133k / 133k
> >  4		267k / 267k	183k / 183k	237k / 237k
> >  8		372k / 372k	177k / 177k	265k / 265k
> >  16		315k / 315k	135k / 135k	228k / 228k
> > 
> > It's still substantially faster than the interval tree code.
> 
> In general another big difference between both rangelock vs rwsems
> (when comparing them with full ranges) is that the latter will do
> writer optimistic spinning, so saving a context switch under the right
> scenarios provides mayor wins for rwsems -- I'm not sure if this
> applies to your fio tests, though. And pretty soon readers will also do
> this, hence rwsem will become a try-hard-not-to-sleep lock.

Right, the optimistic spin-on-owner behaviour basically causes both
rwsems and mutexes to behave as spin locks under these fio workloads
until all CPUs are spinning at 100% CPU usage before they even start
to behave like a sleeping lock. Essentially, when I'm using the XFS
range locks as "full range only" rwsem equivalent locks, they end up
behaving like rwsems in terms of spinning on the tree mutex as the
single read-lock range is shared between all concurrent readers and
so has relatively low lookup and modification cost.

If I optimised the record update, it would be even closer to the
rwsem performance, but full range read locks are going to be rare.
Hell, even conflicting lock ranges are going to be rare in the IO
path, because the application is doing something very wrong if it is
doing overlapping concurrent IOs....

> One of the reasons why I was hesitant with Btrees was the fact that
> insertion requires memory allocation, something I wanted to avoid...

Yup, that's an issue, and one of the reasons why I've used a mutex
for the tree lock rather than a spin lock. But in the end, there's
no difference to CPU usage on contention as they both act as spin
locks in these cases...

If it becomes an issue, I'll add a mempool for the tree nodes and
make sure that it is pre-populated sufficiently before starting a
multi-level tree modification.

> per your numbers, sacrificing tree depth was the wrong choice. Thanks
> for sharing these numbers.

Yeah, anything that pointer chases through cachelines and dirties
multiple cachelines on rebalance really doesn't scale for a
fast-lookup structure. For btrees, multi-level split/merge hurts, but
for the lookup and simple insert/remove case the cacheline miss cost
is substantially amortised by packing multiple records per
cacheline.

> > BTW, if I take away the rwsem serialisation altogether, this
> > test tops out at just under 500k/500k at 8 threads, and at 16
> > threads has started dropping off (~440k/440k). So the rwsem is
> > a scalability limitation at just 8 threads....
> > 
> > /me goes off and thinks more about adding optimistic lock coupling
> > to the XFS iext btree to get rid of the need for tree-wide
> > locking altogether
> 
> I was not aware of this code.

It's relatively new, and directly tailored to the needs of caching
the XFS extent tree - it's not really a generic btree in that its
record store format is the XFS on-disk extent record. I.e. it
only stores 54 bits of start offset and 21 bits of length in its 16
byte records, and the rest of the space is for the record data.

As such, I'm not really implementing a generic range lock - it is
highly tailored to the requirements and constraints of XFS
filesystem IO. e.g. using max(block size, PAGE_SIZE) granularity
range indexing means we can support maximum IO sizes (4GB) with 21
bits of length in the record.  And RANGE_LOCK_FULL is implemented as
length = MAXEXTLEN + a hidden state bit so it supports exclusive
locking from (start, RANGE_LOCK_FULL) and things like truncate, EOF
zeroing use this to exclude IO from the range they are working on
correctly....

Cheers,

Dave.
Boaz Harrosh April 22, 2019, 10:55 a.m. UTC | #16
On 08/04/19 14:03, Jan Kara wrote:
<>
> Hum, you are right. Ext4 is buggy in this regard. I've fixed the race for
> page fault in ea3d7209ca01 "ext4: fix races between page faults and hole
> punching" but didn't realize the problem is there for buffered reads as
> well. I'll think how we can fix this. Thanks for noticing this!
> 

This is very interesting. Please CC me if you devise a test for this.

For a long time I have wanted to enhance fsstress with a --verify mode where
each writer always writes, as data, the long-sized file offset it is writing
to (i.e. the file looks like an increasing long counter), and readers expect
to see only this pattern or zero.
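
Something like this minimal sketch of the write/verify pattern
(hypothetical helpers, not actual fsstress code):

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/types.h>

/* every long in the file holds its own byte offset */
static void write_pattern(int fd, off_t off, size_t len)
{
	long buf[len / sizeof(long)];

	for (size_t i = 0; i < len / sizeof(long); i++)
		buf[i] = off + i * sizeof(long);
	pwrite(fd, buf, len, off);
}

/* readers must see either the pattern or zero - anything else is
 * stale or corrupt data */
static int verify_pattern(int fd, off_t off, size_t len)
{
	long buf[len / sizeof(long)];

	if (pread(fd, buf, len, off) != (ssize_t)len)
		return -1;
	for (size_t i = 0; i < len / sizeof(long); i++)
		if (buf[i] && buf[i] != (long)(off + i * sizeof(long)))
			return -1;
	return 0;
}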

> 								Honza
> 

Thanks
Boaz
Dave Chinner May 3, 2019, 4:17 a.m. UTC | #17
On Sun, Apr 21, 2019 at 09:54:12AM +1000, Dave Chinner wrote:
> On Thu, Apr 18, 2019 at 11:21:34AM -0700, Davidlohr Bueso wrote:
> > On Thu, 2019-04-18 at 13:10 +1000, Dave Chinner wrote:
> > > Now the stuff I've been working on has the same interface as
> > > Davidlohr's patch, so I can swap and change them without thinking
> > > about it. It's still completely unoptimised, but:
> > > 
> > > 			IOPS read/write (direct IO)
> > > processes	rwsem		DB rangelock	XFS
> > > rangelock
> > >  1		78k / 78k	75k / 75k	72k / 72k
> > >  2		131k / 131k	123k / 123k	133k / 133k
> > >  4		267k / 267k	183k / 183k	237k / 237k
> > >  8		372k / 372k	177k / 177k	265k / 265k
> > >  16		315k / 315k	135k / 135k	228k / 228k
> > > 
> > > It's still substantially faster than the interval tree code.
....
> > > /me goes off and thinks more about adding optimistic lock coupling
> > > to the XFS iext btree to get rid of the need for tree-wide
> > > locking altogether
> > 
> > I was not aware of this code.
> 
> It's relatively new, and directly tailored to the needs of caching
> the XFS extent tree - it's not really a generic btree in that it's
> record store format is the XFS on-disk extent record. i.e. it
> only stores 54 bits of start offset and 21 bits of length in it's 16
> byte records, and the rest of the space is for the record data.

So now I have a mostly working OLC btree based on this tree, which is
plumbed into xfsprogs userspace and some testing code. I think I can
say now that the code will actually work, and it /should/ scale
better than a rwsem.

The userspace test harness that I have ran a "thread profile" to
indicate scalability. Basically it ran each thread in a different
offset range and locked a hundred ranges and then unlocked them, and
then looped over this. The btree is a single level for the first 14
locks, 2-level for up to 210 locks, and 3-level for up to 3150
locks. Hence most of this testing results in the btree being 2-3
levels and so largely removes the global root node lock as a point
of contention. It's "best case" for concurrency for an OLC btree.

On a 16p machine:

		     Range lock/unlock ops/s
threads		mutex btree		OLC btree
  1		  5239442		  949487
  2		  1014466		 1398539
  4		   985940		 2405275
  8		   733195		 3288435
  16		   653429		 2809225

When looking at these numbers, remember that the mutex btree kernel
range lock performed a lot better than the interval tree range lock,
and was only ~30% down on an rwsem. The mutex btree code shows
cache residency effects for the single threaded load, hence it looks
much faster than it is for occasional and multithreaded access.

However, at 2 threads (where hot CPU caches don't affect the
performance), the OLC btree is 40% faster, and at 8 threads it is
4.5x faster than the mutex btree. The OLC btree starts slowing down
at 16 threads, largely because the tree itself doesn't have enough
depth to provide the interior nodes to scale to higher concurrency
levels without contention, but it's still running at 4.5x faster
than the mutex btree....

The best part is when I run worst-case threaded workloads on the
OLC btree. If I run the same 100-lock loops, but this time change
the offsets of each thread so they interleave into adjacent records
in the btree (i.e. every thread touches every leaf), then the
performance is still pretty damn good:

		     Range lock/unlock ops/s
threads		Worst Case		Best Case
  1		  1045991		  949487
  2		  1530212		 1398539
  4		  1147099		 2405275
  8		  1602114		 3288435
  16		  1731890		 2809225

IOWs, performance is down and somewhat variable around tree
height changes (4 threads straddles the 2-3 level tree height
threshold), but it's still a massive improvement on the mutex_btree
and it's not going backwards as threads are added.

Concept proven.

Next steps are:

	- separate the OLC btree from the XFS iext btree
	  implementation. It will still have a similar interface
	  (i.e. can't manipulate the btree records directly), but
	  there's sufficient difference in structure for them to be
	  separate implementations.
	- expand records out to full 64bit extents. The iext tree
	  memory usage constraints no longer apply, so the record
	  size can go up a little bit.
	- work out whether RCU read locking and kfree_rcu() will
	  work with the requirement to do memory allocation while
	  holding rcu_read_lock(). Alternative is an internal
	  garbage collector mechanism, kinda like I've hacked up to
	  simulate kfree_rcu() in userspace.
	- fix all the little bugs that still exist in the code.
	- Think about structural optimisations like parent pointers
	  to avoid costly path walks to find parents for 
	  modifications.

Cheers,

Dave.
Dave Chinner May 3, 2019, 5:17 a.m. UTC | #18
On Fri, May 03, 2019 at 02:17:27PM +1000, Dave Chinner wrote:
> Concept proven.
> 
> Next steps are:
....
> 	- work out whether RCU read locking and kfree_rcu() will
> 	  work with the requirement to do memory allocation while
> 	  holding rcu_read_lock(). Alternative is an internal
> 	  garbage collector mechanism, kinda like I've hacked up to
> 	  simulate kfree_rcu() in userspace.

Internal RCU interactions are now solved. Actually very simple in
the end, should be very easy to integrate into the code.

Cheers,

Dave.

Patch

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 1f2e284..4e5f88a 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -240,6 +240,20 @@  xfs_file_buffered_aio_read(
 		if (!xfs_ilock_nowait(ip, XFS_IOLOCK_SHARED))
 			return -EAGAIN;
 	} else {
+		/*
+		 * Warm up page cache to minimize time spent under
+		 * shared ilock.
+		 */
+		struct iov_iter	iter;
+		loff_t pos = iocb->ki_pos;
+
+		iov_iter_discard(&iter, READ, iov_iter_count(to));
+		ret = generic_file_read_iter(iocb, &iter);
+		if (ret <= 0)
+			return ret;
+
+		iocb->ki_pos = pos;
+
 		xfs_ilock(ip, XFS_IOLOCK_SHARED);
 	}
 	ret = generic_file_read_iter(iocb, to);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index ea36dc3..b22e433 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -893,9 +893,10 @@  size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 		size_t wanted = copy_to_iter(kaddr + offset, bytes, i);
 		kunmap_atomic(kaddr);
 		return wanted;
-	} else if (unlikely(iov_iter_is_discard(i)))
+	} else if (unlikely(iov_iter_is_discard(i))) {
+		i->count -= bytes;
 		return bytes;
-	else if (likely(!iov_iter_is_pipe(i)))
+	} else if (likely(!iov_iter_is_pipe(i)))
 		return copy_page_to_iter_iovec(page, offset, bytes, i);
 	else
 		return copy_page_to_iter_pipe(page, offset, bytes, i);