Message ID | 20230520163603.1794256-1-willy@infradead.org |
---|---|
Series | Create large folios in iomap buffered write path |
Hi,

> Wang Yugui has a workload which would be improved by using large folios.
> Until now, we've only created large folios in the readahead path,
> but this workload writes without reading. The decision of what size
> folio to create is based purely on the size of the write() call (unlike
> readahead where we keep history and can choose to create larger folios
> based on that history even if individual reads are small).
>
> The third patch looks like it's an optional extra but is actually needed
> for the first two patches to work in the write path, otherwise it limits
> the length that iomap_get_folio() sees to PAGE_SIZE.

very good test result on 6.4.0-rc2.
# just drop ';' in 'if (bytes > folio_size(folio) - offset);' of [PATCH 3/3].

fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30 -iodepth 1
  -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 -numjobs=4
  -directory=/mnt/test
fio WRITE: bw=7655MiB/s (8027MB/s).

Now it is the same as 5.15.y

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2023/05/21

> Matthew Wilcox (Oracle) (3):
>   filemap: Allow __filemap_get_folio to allocate large folios
>   iomap: Create large folios in the buffered write path
>   iomap: Copy larger chunks from userspace
>
>  fs/gfs2/bmap.c          |  2 +-
>  fs/iomap/buffered-io.c  | 32 ++++++++++++++++++------------
>  include/linux/iomap.h   |  2 +-
>  include/linux/pagemap.h | 29 ++++++++++++++++++++++++---
>  mm/filemap.c            | 44 ++++++++++++++++++++++++++++-------------
>  mm/folio-compat.c       |  2 +-
>  mm/readahead.c          | 13 ------------
>  7 files changed, 78 insertions(+), 46 deletions(-)
>
> --
> 2.39.2
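[For readers following along: the stray ';' Wang mentions terminates the if statement, so whatever follows it runs unconditionally. A minimal userspace sketch of the pattern, with folio_size_stub() and the clamp body as illustrative stand-ins rather than the actual hunk from [PATCH 3/3]:

#include <stdio.h>
#include <stddef.h>

/* Stand-in for folio_size(folio); the real value depends on the folio. */
static size_t folio_size_stub(void)
{
	return 4096;
}

int main(void)
{
	size_t offset = 0;
	size_t bytes = 2048;	/* caller wants less than one folio */

	/* Buggy shape: the trailing ';' ends the if statement, so the
	 * clamp below runs unconditionally and overwrites bytes even
	 * though it was already small enough. */
	if (bytes > folio_size_stub() - offset);
		bytes = folio_size_stub() - offset;
	printf("with the stray ';': bytes = %zu\n", bytes);	/* 4096 */

	/* Fixed shape: drop the ';' so the clamp is conditional. */
	bytes = 2048;
	if (bytes > folio_size_stub() - offset)
		bytes = folio_size_stub() - offset;
	printf("without it:         bytes = %zu\n", bytes);	/* 2048 */

	return 0;
}

Compilers typically flag the buggy form with -Wempty-body or -Wmisleading-indentation, which is how such typos are usually caught.]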
On Sun, May 21, 2023 at 08:49:53AM +0800, Wang Yugui wrote:
> Hi,
>
> > Wang Yugui has a workload which would be improved by using large folios.
> > Until now, we've only created large folios in the readahead path,
> > but this workload writes without reading. The decision of what size
> > folio to create is based purely on the size of the write() call (unlike
> > readahead where we keep history and can choose to create larger folios
> > based on that history even if individual reads are small).
> >
> > The third patch looks like it's an optional extra but is actually needed
> > for the first two patches to work in the write path, otherwise it limits
> > the length that iomap_get_folio() sees to PAGE_SIZE.
>
> very good test result on 6.4.0-rc2.
> # just drop ';' in 'if (bytes > folio_size(folio) - offset);' of [PATCH 3/3].
>
> fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30 -iodepth 1
>   -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 -numjobs=4
>   -directory=/mnt/test
> fio WRITE: bw=7655MiB/s (8027MB/s).
>
> Now it is the same as 5.15.y

Great!  How close is that to saturating the theoretical write bandwidth
of your storage?
Hi,

> On Sun, May 21, 2023 at 08:49:53AM +0800, Wang Yugui wrote:
> > Hi,
> >
> > > Wang Yugui has a workload which would be improved by using large folios.
> > > Until now, we've only created large folios in the readahead path,
> > > but this workload writes without reading. The decision of what size
> > > folio to create is based purely on the size of the write() call (unlike
> > > readahead where we keep history and can choose to create larger folios
> > > based on that history even if individual reads are small).
> > >
> > > The third patch looks like it's an optional extra but is actually needed
> > > for the first two patches to work in the write path, otherwise it limits
> > > the length that iomap_get_folio() sees to PAGE_SIZE.
> >
> > very good test result on 6.4.0-rc2.
> > # just drop ';' in 'if (bytes > folio_size(folio) - offset);' of [PATCH 3/3].
> >
> > fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30 -iodepth 1
> >   -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 -numjobs=4
> >   -directory=/mnt/test
> > fio WRITE: bw=7655MiB/s (8027MB/s).
> >
> > Now it is the same as 5.15.y
>
> Great!  How close is that to saturating the theoretical write bandwidth
> of your storage?

SSD: PCIe3 SSD U.2 *4 (1.6T/3.2T/3.2T/3.2T, just 800G used) through an
NVMe adapter.
CPU: E5-2680 v2 @ 2.80GHz *2
memory: DDR3 * 3 channel

So the theoretical write bandwidth may be over 12GB/s, but 8GB/s is
probably the achievable bandwidth in practice because of
- the write depth limit
- the NVMe adapter limit
- CPU/memory limits

I also noticed a huge improvement for single-thread dd.
# 6.4.0-rc2 with this patch
#dd-9.1 conv=fsync bs=1024K count=32K if=/dev/zero of=/mnt/test/dd.txt
34359738368 bytes (34 GB, 32 GiB) copied, 6.96108 s, 4.6 GiB/s

But it is about 2.2GiB/s on xfs/5.15.y.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2023/05/21
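[A back-of-envelope check of that ">12GB/s theoretical" figure, assuming each of the four U.2 drives sits on a PCIe 3.0 x4 link; the actual adapter topology isn't stated above:

#include <stdio.h>

/*
 * PCIe 3.0 runs at 8 GT/s per lane with 128b/130b encoding, so usable
 * bandwidth per lane is roughly 8e9 * 128/130 / 8 bytes per second.
 * Assumes four drives, each on an x4 link; adapter, protocol and
 * filesystem overheads are ignored.
 */
int main(void)
{
	double per_lane  = 8e9 * 128.0 / 130.0 / 8.0;	/* ~0.98 GB/s */
	double per_drive = per_lane * 4;		/* x4 link: ~3.9 GB/s */
	double aggregate = per_drive * 4;		/* four drives */

	printf("aggregate link bandwidth: %.1f GB/s\n", aggregate / 1e9);
	return 0;
}

This prints about 15.8 GB/s of raw link bandwidth, comfortably above the 12GB/s estimate, which is consistent with the ~8GB/s observed being limited by the adapter, queue depth and CPU/memory rather than the drives themselves.]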
On Sat, May 20, 2023 at 05:36:00PM +0100, Matthew Wilcox (Oracle) wrote:
> Wang Yugui has a workload which would be improved by using large folios.
I think that's a bit of a misrepresentation of the situation. The
workload in question has regressed from ~8GB/s to 2.5GB/s due to
page cache structure contention caused by XFS limiting writeback bio
chain length to bound worst case IO completion latency in 5.15. This
causes several tasks to attempt concurrent exclusive locking of the
mapping tree: write(), writeback IO submission, writeback IO
completion and multiple memory reclaim tasks (both direct and
background). Limiting worst-case latency means that IO completion is
accessing the mapping tree much more frequently (every 16MB, instead
of 16-32GB), and that has driven this workload into lock contention
breakdown.
This was shown in profiles indicating the regression was caused
by page cache contention causing excessive CPU usage in the
writeback flusher thread, limiting IO submission rates. This is not
something we can fix in XFS - it's an exclusive lock access
issue in the page cache...
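[To put rough numbers on "much more frequently": a minimal sketch, reusing the ~8GB/s fio figure from earlier in the thread purely for illustration, and assuming each bio chain completion has to walk the mapping tree for the folios it covers:

#include <stdio.h>

/*
 * Completion events per second at ~8GB/s for 16MB bio chains (the
 * bounded-latency behaviour described above) versus 16GB chains.
 */
int main(void)
{
	double bw = 8e9;		/* bytes per second, from the fio run */

	printf("16MB chains: %.0f completions/sec\n", bw / 16e6);
	printf("16GB chains: %.2f completions/sec\n", bw / 16e9);
	return 0;
}

That is roughly 500 completion-side visits to the mapping tree per second instead of one every couple of seconds.]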
Mitigating the performance regression by enabling large folios for
XFS doesn't actually fix any of the locking problems; it just
reduces lock traffic in the IO path by a couple of orders of
magnitude. The problem will come back as bandwidth increases again.
Also, the same problem will affect other filesystems that aren't
capable of using large folios.
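[A quick sketch of where "a couple of orders of magnitude" comes from: page cache entries touched per 16MB of writeback with 4KiB pages versus 1MiB folios. 1MiB is chosen here to match the 1MiB writes in the fio job above; the patches size folios from the write, so this is an assumption, not a guarantee:

#include <stdio.h>

int main(void)
{
	long chunk = 16L << 20;		/* 16MB writeback chunk */

	/* Mapping tree entries touched per chunk. */
	printf("4KiB pages:  %ld\n", chunk / (4L << 10));	/* 4096 */
	printf("1MiB folios: %ld\n", chunk / (1L << 20));	/* 16 */
	return 0;
}

Roughly a 256x reduction in mapping tree operations for the same amount of IO, without changing the locking itself.]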
Hence I think we really need to document this problem with the
mitigations being proposed so that, in future, we know how to
recognise when we hit these page cache limitations again. i.e.
I think it would also be a good idea to include some of the
analysis that pointed to the page cache contention problem here
(either in the cover letter so it can be used as a merge tag
message, or in a commit), rather than present it as "here's a
performance improvement" without any context of what problem it's
actually mitigating....
Cheers,
Dave.
Hi,

> Wang Yugui has a workload which would be improved by using large folios.
> Until now, we've only created large folios in the readahead path,
> but this workload writes without reading. The decision of what size
> folio to create is based purely on the size of the write() call (unlike
> readahead where we keep history and can choose to create larger folios
> based on that history even if individual reads are small).
>
> The third patch looks like it's an optional extra but is actually needed
> for the first two patches to work in the write path, otherwise it limits
> the length that iomap_get_folio() sees to PAGE_SIZE.
>
> Matthew Wilcox (Oracle) (3):
>   filemap: Allow __filemap_get_folio to allocate large folios
>   iomap: Create large folios in the buffered write path
>   iomap: Copy larger chunks from userspace

[PATCH 2/3] is a little difficult to backport to 6.1.y (LTS); it needs
some patches as dependencies.

Could you provide those patches based on 6.1.y (LTS), along with the
dependency list? Then we can do more testing on 6.1.y (LTS) too.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2023/05/21

>  fs/gfs2/bmap.c          |  2 +-
>  fs/iomap/buffered-io.c  | 32 ++++++++++++++++++------------
>  include/linux/iomap.h   |  2 +-
>  include/linux/pagemap.h | 29 ++++++++++++++++++++++++---
>  mm/filemap.c            | 44 ++++++++++++++++++++++++++++-------------
>  mm/folio-compat.c       |  2 +-
>  mm/readahead.c          | 13 ------------
>  7 files changed, 78 insertions(+), 46 deletions(-)
>
> --
> 2.39.2
Hi,

> Hi,
>
> > Wang Yugui has a workload which would be improved by using large folios.
> > Until now, we've only created large folios in the readahead path,
> > but this workload writes without reading. The decision of what size
> > folio to create is based purely on the size of the write() call (unlike
> > readahead where we keep history and can choose to create larger folios
> > based on that history even if individual reads are small).
> >
> > The third patch looks like it's an optional extra but is actually needed
> > for the first two patches to work in the write path, otherwise it limits
> > the length that iomap_get_folio() sees to PAGE_SIZE.
> >
> > Matthew Wilcox (Oracle) (3):
> >   filemap: Allow __filemap_get_folio to allocate large folios
> >   iomap: Create large folios in the buffered write path
> >   iomap: Copy larger chunks from userspace
>
> [PATCH 2/3] is a little difficult to backport to 6.1.y (LTS); it needs
> some patches as dependencies.
>
> Could you provide those patches based on 6.1.y (LTS), along with the
> dependency list? Then we can do more testing on 6.1.y (LTS) too.

I selected 8 patches as dependencies:

d7b64041164c :Dave Chinner:        iomap: write iomap validity checks
7a70a5085ed0 :Andreas Gruenbacher: iomap: Add __iomap_put_folio helper
80baab88bb93 :Andreas Gruenbacher: iomap/gfs2: Unlock and put folio in page_done handler
40405dddd98a :Andreas Gruenbacher: iomap: Rename page_done handler to put_folio
98321b5139f9 :Andreas Gruenbacher: iomap: Add iomap_get_folio helper
9060bc4d3aca :Andreas Gruenbacher: iomap/gfs2: Get page in page_prepare handler
07c22b56685d :Andreas Gruenbacher: iomap: Add __iomap_get_folio helper
c82abc239464 :Andreas Gruenbacher: iomap: Rename page_prepare handler to get_folio

then rebased patches 1 and 2 (see the attached files).

Now we can test patches 1, 2 and 3 on 6.1.31.

Best Regards
Wang Yugui (wangyugui@e16-tech.com)
2023/05/31