Message ID | 20230520163603.1794256-1-willy@infradead.org |
---|---|
Series | Create large folios in iomap buffered write path |
Hi,

> Wang Yugui has a workload which would be improved by using large folios.
> Until now, we've only created large folios in the readahead path,
> but this workload writes without reading. The decision of what size
> folio to create is based purely on the size of the write() call (unlike
> readahead where we keep history and can choose to create larger folios
> based on that history even if individual reads are small).
>
> The third patch looks like it's an optional extra but is actually needed
> for the first two patches to work in the write path, otherwise it limits
> the length that iomap_get_folio() sees to PAGE_SIZE.

very good test result on 6.4.0-rc2.
# just drop ';' in 'if (bytes > folio_size(folio) - offset);' of [PATCH 3/3].

fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30 -iodepth 1
  -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 -numjobs=4
  -directory=/mnt/test
fio WRITE: bw=7655MiB/s (8027MB/s).

Now it is the same as 5.15.y

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2023/05/21

> Matthew Wilcox (Oracle) (3):
>   filemap: Allow __filemap_get_folio to allocate large folios
>   iomap: Create large folios in the buffered write path
>   iomap: Copy larger chunks from userspace
>
>  fs/gfs2/bmap.c          |  2 +-
>  fs/iomap/buffered-io.c  | 32 ++++++++++++++++++------------
>  include/linux/iomap.h   |  2 +-
>  include/linux/pagemap.h | 29 ++++++++++++++++++++++++---
>  mm/filemap.c            | 44 ++++++++++++++++++++++++++++-------------
>  mm/folio-compat.c       |  2 +-
>  mm/readahead.c          | 13 ------------
>  7 files changed, 78 insertions(+), 46 deletions(-)
>
> --
> 2.39.2
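[For readers following along: the stray ';' Wang mentions terminates the if statement, so whatever follows it runs unconditionally. A minimal userspace sketch of the pattern, with folio_size_stub() and the clamp body as illustrative stand-ins rather than the actual hunk from [PATCH 3/3]:

#include <stdio.h>
#include <stddef.h>

/* Stand-in for folio_size(folio); the real value depends on the folio. */
static size_t folio_size_stub(void)
{
	return 4096;
}

int main(void)
{
	size_t offset = 0;
	size_t bytes = 2048;	/* caller wants less than one folio */

	/* Buggy shape: the trailing ';' ends the if statement, so the
	 * clamp below runs unconditionally and overwrites bytes even
	 * though it was already small enough. */
	if (bytes > folio_size_stub() - offset);
		bytes = folio_size_stub() - offset;
	printf("with the stray ';': bytes = %zu\n", bytes);	/* 4096 */

	/* Fixed shape: drop the ';' so the clamp is conditional. */
	bytes = 2048;
	if (bytes > folio_size_stub() - offset)
		bytes = folio_size_stub() - offset;
	printf("without it:         bytes = %zu\n", bytes);	/* 2048 */

	return 0;
}

Compilers typically flag the buggy form with -Wempty-body or -Wmisleading-indentation, which is how such typos are usually caught.]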
On Sun, May 21, 2023 at 08:49:53AM +0800, Wang Yugui wrote:
> Hi,
>
> > Wang Yugui has a workload which would be improved by using large folios.
> > Until now, we've only created large folios in the readahead path,
> > but this workload writes without reading. The decision of what size
> > folio to create is based purely on the size of the write() call (unlike
> > readahead where we keep history and can choose to create larger folios
> > based on that history even if individual reads are small).
> >
> > The third patch looks like it's an optional extra but is actually needed
> > for the first two patches to work in the write path, otherwise it limits
> > the length that iomap_get_folio() sees to PAGE_SIZE.
>
> very good test result on 6.4.0-rc2.
> # just drop ';' in 'if (bytes > folio_size(folio) - offset);' of [PATCH 3/3].
>
> fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30 -iodepth 1
>   -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 -numjobs=4
>   -directory=/mnt/test
> fio WRITE: bw=7655MiB/s (8027MB/s).
>
> Now it is the same as 5.15.y

Great!  How close is that to saturating the theoretical write bandwidth
of your storage?
Hi,

> On Sun, May 21, 2023 at 08:49:53AM +0800, Wang Yugui wrote:
> > Hi,
> >
> > > Wang Yugui has a workload which would be improved by using large folios.
> > > Until now, we've only created large folios in the readahead path,
> > > but this workload writes without reading. The decision of what size
> > > folio to create is based purely on the size of the write() call (unlike
> > > readahead where we keep history and can choose to create larger folios
> > > based on that history even if individual reads are small).
> > >
> > > The third patch looks like it's an optional extra but is actually needed
> > > for the first two patches to work in the write path, otherwise it limits
> > > the length that iomap_get_folio() sees to PAGE_SIZE.
> >
> > very good test result on 6.4.0-rc2.
> > # just drop ';' in 'if (bytes > folio_size(folio) - offset);' of [PATCH 3/3].
> >
> > fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30 -iodepth 1
> >   -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 -numjobs=4
> >   -directory=/mnt/test
> > fio WRITE: bw=7655MiB/s (8027MB/s).
> >
> > Now it is the same as 5.15.y
>
> Great!  How close is that to saturating the theoretical write bandwidth
> of your storage?

SSD: PCIe3 SSD U.2 *4 (1.6T/3.2T/3.2T/3.2T, just 800G used) through an
NVMe adapter.
CPU: E5-2680 v2 @ 2.80GHz *2
memory: DDR3 * 3 channel

So the theoretical write bandwidth may be over 12GB/s, but 8GB/s is
probably the achievable bandwidth in practice because of
- the write depth limit
- the NVMe adapter limit
- CPU/memory limits

I also noticed a huge improvement for single-thread dd.
# 6.4.0-rc2 with this patch
#dd-9.1 conv=fsync bs=1024K count=32K if=/dev/zero of=/mnt/test/dd.txt
34359738368 bytes (34 GB, 32 GiB) copied, 6.96108 s, 4.6 GiB/s

But it is about 2.2GiB/s on xfs/5.15.y.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2023/05/21
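[A back-of-envelope check of that ">12GB/s theoretical" figure, assuming each of the four U.2 drives sits on a PCIe 3.0 x4 link; the actual adapter topology isn't stated above:

#include <stdio.h>

/*
 * PCIe 3.0 runs at 8 GT/s per lane with 128b/130b encoding, so usable
 * bandwidth per lane is roughly 8e9 * 128/130 / 8 bytes per second.
 * Assumes four drives, each on an x4 link; adapter, protocol and
 * filesystem overheads are ignored.
 */
int main(void)
{
	double per_lane  = 8e9 * 128.0 / 130.0 / 8.0;	/* ~0.98 GB/s */
	double per_drive = per_lane * 4;		/* x4 link: ~3.9 GB/s */
	double aggregate = per_drive * 4;		/* four drives */

	printf("aggregate link bandwidth: %.1f GB/s\n", aggregate / 1e9);
	return 0;
}

This prints about 15.8 GB/s of raw link bandwidth, comfortably above the 12GB/s estimate, which is consistent with the ~8GB/s observed being limited by the adapter, queue depth and CPU/memory rather than the drives themselves.]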
On Sat, May 20, 2023 at 05:36:00PM +0100, Matthew Wilcox (Oracle) wrote:
> Wang Yugui has a workload which would be improved by using large folios.
I think that's a bit of a misrepresentation of the situation. The
workload in question has regressed from ~8GB/s to 2.5GB/s due to
page cache structure contention caused by XFS limiting writeback bio
chain length to bound worst case IO completion latency in 5.15. This
causes several tasks to attempt concurrent exclusive locking of the
mapping tree: write(), writeback IO submission, writeback IO
completion and multiple memory reclaim tasks (both direct and
background). Limiting worst-case latency means that IO completion is
accessing the mapping tree much more frequently (every 16MB, instead
of 16-32GB), and that has driven this workload into lock contention
breakdown.
This was shown in profiles indicating the regression was caused
by page cache contention causing excessive CPU usage in the
writeback flusher thread, limiting IO submission rates. This is not
something we can fix in XFS - it's an exclusive lock access
issue in the page cache...
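[To put rough numbers on "much more frequently": a minimal sketch, reusing the ~8GB/s fio figure from earlier in the thread purely for illustration, and assuming each bio chain completion has to walk the mapping tree for the folios it covers:

#include <stdio.h>

/*
 * Completion events per second at ~8GB/s for 16MB bio chains (the
 * bounded-latency behaviour described above) versus 16GB chains.
 */
int main(void)
{
	double bw = 8e9;		/* bytes per second, from the fio run */

	printf("16MB chains: %.0f completions/sec\n", bw / 16e6);
	printf("16GB chains: %.2f completions/sec\n", bw / 16e9);
	return 0;
}

That is roughly 500 completion-side visits to the mapping tree per second instead of one every couple of seconds.]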
Mitigating the performance regression by enabling large folios for
XFS doesn't actually fix any of the locking problems; it just
reduces lock traffic in the IO path by a couple of orders of
magnitude. The problem will come back as bandwidth increases again.
Also, the same problem will affect other filesystems that aren't
capable of using large folios.
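[A quick sketch of where "a couple of orders of magnitude" comes from: page cache entries touched per 16MB of writeback with 4KiB pages versus 1MiB folios. 1MiB is chosen here to match the 1MiB writes in the fio job above; the patches size folios from the write, so this is an assumption, not a guarantee:

#include <stdio.h>

int main(void)
{
	long chunk = 16L << 20;		/* 16MB writeback chunk */

	/* Mapping tree entries touched per chunk. */
	printf("4KiB pages:  %ld\n", chunk / (4L << 10));	/* 4096 */
	printf("1MiB folios: %ld\n", chunk / (1L << 20));	/* 16 */
	return 0;
}

Roughly a 256x reduction in mapping tree operations for the same amount of IO, without changing the locking itself.]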
Hence I think we really need to document this problem with the
mitigations being proposed so that, in future, we know how to
recognise when we hit these page cache limitations again. i.e.
I think it would also be a good idea to include some of the
analysis that pointed to the page cache contention problem here
(either in the cover letter so it can be used as a merge tag
message, or in a commit), rather than present it as "here's a
performance improvement" without any context of what problem it's
actually mitigating....
Cheers,
Dave.
Hi,

> Wang Yugui has a workload which would be improved by using large folios.
> Until now, we've only created large folios in the readahead path,
> but this workload writes without reading. The decision of what size
> folio to create is based purely on the size of the write() call (unlike
> readahead where we keep history and can choose to create larger folios
> based on that history even if individual reads are small).
>
> The third patch looks like it's an optional extra but is actually needed
> for the first two patches to work in the write path, otherwise it limits
> the length that iomap_get_folio() sees to PAGE_SIZE.
>
> Matthew Wilcox (Oracle) (3):
>   filemap: Allow __filemap_get_folio to allocate large folios
>   iomap: Create large folios in the buffered write path
>   iomap: Copy larger chunks from userspace

[PATCH 2/3] is a little difficult to backport to 6.1.y (LTS); it needs
some patches as dependencies.

Could you provide those patches based on 6.1.y (LTS), along with the
dependency list? Then we can do more testing on 6.1.y (LTS) too.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2023/05/21

>  fs/gfs2/bmap.c          |  2 +-
>  fs/iomap/buffered-io.c  | 32 ++++++++++++++++++------------
>  include/linux/iomap.h   |  2 +-
>  include/linux/pagemap.h | 29 ++++++++++++++++++++++++---
>  mm/filemap.c            | 44 ++++++++++++++++++++++++++++-------------
>  mm/folio-compat.c       |  2 +-
>  mm/readahead.c          | 13 ------------
>  7 files changed, 78 insertions(+), 46 deletions(-)
>
> --
> 2.39.2
Hi,

> Hi,
>
> > Wang Yugui has a workload which would be improved by using large folios.
> > Until now, we've only created large folios in the readahead path,
> > but this workload writes without reading. The decision of what size
> > folio to create is based purely on the size of the write() call (unlike
> > readahead where we keep history and can choose to create larger folios
> > based on that history even if individual reads are small).
> >
> > The third patch looks like it's an optional extra but is actually needed
> > for the first two patches to work in the write path, otherwise it limits
> > the length that iomap_get_folio() sees to PAGE_SIZE.
> >
> > Matthew Wilcox (Oracle) (3):
> >   filemap: Allow __filemap_get_folio to allocate large folios
> >   iomap: Create large folios in the buffered write path
> >   iomap: Copy larger chunks from userspace
>
> [PATCH 2/3] is a little difficult to backport to 6.1.y (LTS); it needs
> some patches as dependencies.
>
> Could you provide those patches based on 6.1.y (LTS), along with the
> dependency list? Then we can do more testing on 6.1.y (LTS) too.

I selected 8 patches as dependencies:

d7b64041164c :Dave Chinner:        iomap: write iomap validity checks
7a70a5085ed0 :Andreas Gruenbacher: iomap: Add __iomap_put_folio helper
80baab88bb93 :Andreas Gruenbacher: iomap/gfs2: Unlock and put folio in page_done handler
40405dddd98a :Andreas Gruenbacher: iomap: Rename page_done handler to put_folio
98321b5139f9 :Andreas Gruenbacher: iomap: Add iomap_get_folio helper
9060bc4d3aca :Andreas Gruenbacher: iomap/gfs2: Get page in page_prepare handler
07c22b56685d :Andreas Gruenbacher: iomap: Add __iomap_get_folio helper
c82abc239464 :Andreas Gruenbacher: iomap: Rename page_prepare handler to get_folio

then rebased patches 1 and 2 (see the attached files).

Now we can test patches 1, 2 and 3 on 6.1.31.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2023/05/31