Message ID | 20230710130253.3484695-1-willy@infradead.org
---|---
Series | Create large folios in iomap buffered write path
On Mon, Jul 10, 2023 at 02:02:44PM +0100, Matthew Wilcox (Oracle) wrote:
> Commit ebb7fb1557b1 limited the length of ioend chains to 4096 entries
> to improve worst-case latency. Unfortunately, this had the effect of
> limiting the performance of:
>
> fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30 \
>     -iodepth 1 -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 \
>     -numjobs=4 -directory=/mnt/test

When you say performance, do you mean overall throughput / IOPS /
latency, or all of them?

And who noticed it / reported it? The above incantation seems pretty
specific, so I'm curious who runs that test and what sort of workflow
it is trying to replicate.

> The problem ends up being lock contention on the i_pages spinlock as we
> clear the writeback bit on each folio (and propagate that up through
> the tree). By using larger folios, we decrease the number of folios
> to be processed by a factor of 256 for this benchmark, eliminating the
> lock contention.

The implication here seems to be that the cost of searching for a
larger folio is pretty negligible compared to the gains of finding one.
That seems nice, but it gets me wondering whether there are other
benchmarks under which there are penalties instead.

Ie, is the above a microbenchmark where this yields good results?

> It's also the right thing to do. This is a project that has been on
> the back burner for years, it just hasn't been important enough to do
> before now.

Commit ebb7fb1557b1 ("xfs, iomap: limit individual ioend chain lengths
in writeback") dates back just one year, so it gets me wondering how a
project that has been on the back burner for years now finds its
motivation in a regression that is only one year old.

What was the original motivation of the older project dating this
effort back to its inception?

  Luis
On Mon, Jul 10, 2023 at 03:55:17PM -0700, Luis Chamberlain wrote:
> On Mon, Jul 10, 2023 at 02:02:44PM +0100, Matthew Wilcox (Oracle) wrote:
> > Commit ebb7fb1557b1 limited the length of ioend chains to 4096 entries
> > to improve worst-case latency. Unfortunately, this had the effect of
> > limiting the performance of:
> >
> > fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30 \
> >     -iodepth 1 -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 \
> >     -numjobs=4 -directory=/mnt/test
>
> When you say performance, do you mean overall throughput / IOPS /
> latency, or all of them?

This is buffered I/O, so when we run out of RAM, we block until the
dirty pages are written back. I suppose that makes it throughput, but
it's throughput from the bottom of the page cache to storage, not the
usual app-to-page-cache bottleneck.

> And who noticed it / reported it? The above incantation seems pretty
> specific, so I'm curious who runs that test and what sort of workflow
> it is trying to replicate.

Wang Yugui, who is on the cc, reported it.
https://lore.kernel.org/linux-xfs/20230508172406.1CF3.409509F4@e16-tech.com/

> > The problem ends up being lock contention on the i_pages spinlock as we
> > clear the writeback bit on each folio (and propagate that up through
> > the tree). By using larger folios, we decrease the number of folios
> > to be processed by a factor of 256 for this benchmark, eliminating the
> > lock contention.
>
> The implication here seems to be that the cost of searching for a
> larger folio is pretty negligible compared to the gains of finding one.
> That seems nice, but it gets me wondering whether there are other
> benchmarks under which there are penalties instead.
>
> Ie, is the above a microbenchmark where this yields good results?

It happens to be constructed in such a way that it yields the optimum
outcome in this case, but that clearly wasn't deliberate. And the
solution is the one I had in mind from before the bug report came in.
I don't think you'll be able to find a benchmark that regresses as a
result of this, but I welcome your attempt!

> > It's also the right thing to do. This is a project that has been on
> > the back burner for years, it just hasn't been important enough to do
> > before now.
>
> Commit ebb7fb1557b1 ("xfs, iomap: limit individual ioend chain lengths
> in writeback") dates back just one year, so it gets me wondering how a
> project that has been on the back burner for years now finds its
> motivation in a regression that is only one year old.
>
> What was the original motivation of the older project dating this
> effort back to its inception?

We should create larger folios on write. It just wasn't important
enough to do before now. But now there's an actual user who cares.
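To make the "factor of 256" concrete, here is a rough back-of-the-envelope sketch of how many folios writeback completion has to touch (clearing the writeback bit on each one under the i_pages lock) for the fio job quoted above. It assumes 4 KiB base pages and 1 MiB (order-8) folios matching the 1024Ki write size; the folio order iomap actually ends up allocating is an assumption here, and the numbers are illustrative arithmetic rather than measurements.

```c
/*
 * Back-of-the-envelope model of the "factor of 256" claim: how many
 * page-cache objects does writeback completion have to clear the
 * writeback bit on (each under the i_pages lock) for the quoted fio
 * job?  Illustrative only -- not kernel code; the 1 MiB folio size is
 * an assumption matching the 1024Ki write size.
 */
#include <stdio.h>

int main(void)
{
	const unsigned long long size_per_job = 32ULL << 30;	/* -size=32Gi */
	const int numjobs = 4;					/* -numjobs=4 */
	const unsigned long long base_page = 4096;		/* 4 KiB pages */
	const unsigned long long large_folio = 1ULL << 20;	/* assumed 1 MiB folios */

	unsigned long long small = numjobs * (size_per_job / base_page);
	unsigned long long large = numjobs * (size_per_job / large_folio);

	printf("order-0 folios to complete: %llu\n", small);
	printf("large folios to complete:   %llu\n", large);
	printf("reduction factor:           %llux\n", small / large);
	return 0;
}
```

Compiled and run, this prints roughly 33.5 million order-0 folios versus about 131 thousand large folios, i.e. the 256x reduction in i_pages lock acquisitions referred to in the cover letter.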
On Mon, Jul 10, 2023 at 03:55:17PM -0700, Luis Chamberlain wrote:
> On Mon, Jul 10, 2023 at 02:02:44PM +0100, Matthew Wilcox (Oracle) wrote:
> > Commit ebb7fb1557b1 limited the length of ioend chains to 4096 entries
> > to improve worst-case latency. Unfortunately, this had the effect of
> > limiting the performance of:
> >
> > fio -name write-bandwidth -rw=write -bs=1024Ki -size=32Gi -runtime=30 \
> >     -iodepth 1 -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 \
> >     -numjobs=4 -directory=/mnt/test
>
> When you say performance, do you mean overall throughput / IOPS /
> latency, or all of them?
>
> And who noticed it / reported it?

https://lore.kernel.org/linux-xfs/20230508172406.1CF3.409509F4@e16-tech.com/

> The above incantation seems pretty specific, so I'm curious who runs
> that test and what sort of workflow it is trying to replicate.

Not specific at all. It's just a basic concurrent sequential buffered
write performance test. It needs multiple jobs to max out typical
cheap PCIe 4.0 NVMe SSD storage (i.e. 6-8GB/s) because sequential
non-async buffered writes are CPU bound at (typically) 2-3GB/s per
file write.

> > The problem ends up being lock contention on the i_pages spinlock as we
> > clear the writeback bit on each folio (and propagate that up through
> > the tree). By using larger folios, we decrease the number of folios
> > to be processed by a factor of 256 for this benchmark, eliminating the
> > lock contention.
>
> The implication here seems to be that the cost of searching for a
> larger folio is pretty negligible compared to the gains of finding one.
> That seems nice, but it gets me wondering whether there are other
> benchmarks under which there are penalties instead.
>
> Ie, is the above a microbenchmark where this yields good results?

No, the workload gains are general - they avoid the lock contention
problems involved with cycling, accounting and state changes for
millions of objects (order-0 folios) a second through a single
exclusive lock (the mapping tree lock) by reducing the mapping tree
lock cycling by a couple of orders of magnitude.

> > It's also the right thing to do. This is a project that has been on
> > the back burner for years, it just hasn't been important enough to do
> > before now.
>
> Commit ebb7fb1557b1 ("xfs, iomap: limit individual ioend chain lengths
> in writeback") dates back just one year, so it gets me wondering how a
> project that has been on the back burner for years now finds its
> motivation in a regression that is only one year old.

The change in ebb7fb1557b1 is just the straw that broke the camel's
back. It got rid of the massive IO batch processing we used to
minimise the inherent cross-process mapping tree contention in
buffered writes. i.e. the process doing write() calls, multiple
kswapds doing memory reclaim, writeback submission and writeback
completion all contend at the same time for the mapping tree lock.

We largely removed IO submission and completion from the picture with
huge batch processing, but that then started causing latency problems
with IO completion processing. So we went back to smaller chunks of
IO submission and completion, and that means we went from 3 threads
contending on the mapping tree lock to 5 threads. That drove the
system into catastrophic lock breakdown on the mapping tree lock, and
so -everything- then went really slow because each write() task burns
down at least 5 CPUs on the mapping tree lock....
This is not an XFS issue to solve - this is a general page cache
problem, and we've always wanted to fix it in the page cache, either
by large folio support or by large block size support that required
aligned high-order pages in the page cache. Same solution - fewer
pages to iterate - but different methods...

> What was the original motivation of the older project dating this
> effort back to its inception?

https://www.kernel.org/doc/ols/2006/ols2006v1-pages-177-192.pdf

That was run on a 64kB page size machine (itanic), but there were
signs that the mapping tree lock would be an issue in future. Indeed,
when these large SSI supercomputers moved to x86-64 (down to 4kB page
size) a couple of years later, the mapping tree lock popped up to the
top of the list of buffered write throughput limiting factors. i.e.
the larger the NUMA distance between the workload doing the write and
the node the mapping tree lock is located on, the slower buffered
writes go and the more CPU we burn on the mapping tree lock.

We carefully worked around that with cpusets and locality control,
and the same was then done in HPC environments on these types of
machines, and hence it wasn't an immediate limiting factor... But
we're talking about multi-million dollar supercomputers here, and in
most cases people just rewrote the apps to use direct IO, so the
problem went away and the HPC apps could easily use all the
performance the storage provided....

IOWs, we've known about this problem for over 15 years, but the
difference now is that consumer hardware is capable of >10GB/s write
speeds (current PCIe 5.0 NVMe SSDs) for just a couple of hundred
dollars, rather than the multi-million dollar machines in carefully
optimised environments where we first saw it back in 2006...

Cheers,

Dave.
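For readers who want to see the shape of the contention Dave describes, below is a userspace toy (not kernel code) in which a handful of threads, standing in for the write() task, kswapds, and writeback submission/completion, all cycle one shared lock once per "folio" they process. The thread count, per-thread working-set size, and folio sizes are assumptions made for the demo; the only point it illustrates is that growing the folio size shrinks the number of trips through a single shared lock by the same factor.

```c
/*
 * Toy model of several threads cycling one shared lock once per folio.
 * Not the kernel's mapping tree locking -- just an illustration of how
 * lock traffic scales with folio size.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NTHREADS	5		/* write(), kswapds, wb submit/complete */
#define WORK_BYTES	(1ULL << 30)	/* assumed 1 GiB of page cache per thread */

static pthread_mutex_t mapping_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long long folio_bytes;
static unsigned long long total_acquisitions;

static void *cycle_lock(void *arg)
{
	unsigned long long folios = WORK_BYTES / folio_bytes;

	(void)arg;
	for (unsigned long long i = 0; i < folios; i++) {
		pthread_mutex_lock(&mapping_lock);
		total_acquisitions++;	/* stand-in for per-folio state change */
		pthread_mutex_unlock(&mapping_lock);
	}
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t threads[NTHREADS];

	/* folio size in bytes from argv, defaulting to 4 KiB */
	folio_bytes = (argc > 1) ? strtoull(argv[1], NULL, 0) : 4096;
	if (folio_bytes == 0)
		folio_bytes = 4096;

	for (int i = 0; i < NTHREADS; i++)
		pthread_create(&threads[i], NULL, cycle_lock, NULL);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(threads[i], NULL);

	printf("folio size %llu bytes: %llu lock acquisitions across %d threads\n",
	       folio_bytes, total_acquisitions, NTHREADS);
	return 0;
}
```

Built with `-pthread` and run with 4096 versus 1048576 as the argument, the 1 MiB case makes 256x fewer acquisitions of the shared lock, which is the same reduction in mapping tree lock cycling that larger folios buy in the writeback path discussed above.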