Message ID: 20210715033704.692967-1-willy@infradead.org (mailing list archive)
Series: Memory folios
On Thu, Jul 15, 2021 at 04:34:46AM +0100, Matthew Wilcox (Oracle) wrote:
> Managing memory in 4KiB pages is a serious overhead. Many benchmarks benefit from a larger "page size". As an example, an earlier iteration of this idea which used compound pages (and wasn't particularly tuned) got a 7% performance boost when compiling the kernel.
>
> Using compound pages or THPs exposes a weakness of our type system. Functions are often unprepared for compound pages to be passed to them, and may only act on PAGE_SIZE chunks. Even functions which are aware of compound pages may expect a head page, and do the wrong thing if passed a tail page.
>
> We also waste a lot of instructions ensuring that we're not looking at a tail page. Almost every call to PageFoo() contains one or more hidden calls to compound_head(). This also happens for get_page(), put_page() and many more functions.
>
> This patch series uses a new type, the struct folio, to manage memory. It converts enough of the page cache, iomap and XFS to use folios instead of pages, and then adds support for multi-page folios. It passes xfstests (running on XFS) with no regressions compared to v5.14-rc1.

Hey Willy,

I must confess I've lost the thread of the plot in terms of how you hope to get the Memory folio work merged upstream. There are some partial patch sets that just have the mm core, and then there were some larger patchsets, including some in the past which, as I recall, would touch ext4 (but which isn't in this set).

I was wondering if you could perhaps post a roadmap for how this patch set might be broken up, and which subsections you were hoping to target for the upcoming merge window versus the following merge windows.

Also I assume that for file systems that aren't converted to use Folios, there won't be any performance regressions --- is that correct? Or is that something we need to watch for? Put another way, if we don't land all of the memory folio patches before the end of the calendar year, and we cut an LTS release with some file systems converted and some file systems not yet converted, are there any potential problems in that eventuality?

Thanks!

- Ted
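To make the cover letter's point concrete for readers new to the thread, here is a simplified sketch (not a quote from the series) of the hidden compound_head() calls it describes, and of the struct folio type that removes them. The helper names are real kernel ones from this era; the bodies are trimmed down for illustration.

/*
 * Pre-folio kernels, simplified: every page flag test and refcount
 * operation must defend against being handed a tail page, so each one
 * hides at least one compound_head() lookup.
 */
static __always_inline int PageDirty(struct page *page)
{
	return test_bit(PG_dirty, &compound_head(page)->flags);
}

static inline void put_page(struct page *page)
{
	page = compound_head(page);	/* another hidden lookup */

	if (put_page_testzero(page))
		__put_page(page);
}

/*
 * The folio idea, reduced to its essence: a struct folio always refers
 * to a head (or order-0) page, never a tail page, so folio operations
 * can drop the check.  The page-to-folio conversion happens once, at
 * the boundary.
 */
struct folio {
	struct page page;	/* never a tail page */
};

static inline struct folio *page_folio(struct page *page)
{
	return (struct folio *)compound_head(page);
}

static inline void folio_get(struct folio *folio)
{
	page_ref_inc(&folio->page);	/* no hidden compound_head() */
}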
On Thu, Jul 15, 2021 at 11:56:07AM -0400, Theodore Y. Ts'o wrote:
> On Thu, Jul 15, 2021 at 04:34:46AM +0100, Matthew Wilcox (Oracle) wrote:
> > Managing memory in 4KiB pages is a serious overhead. Many benchmarks benefit from a larger "page size". As an example, an earlier iteration of this idea which used compound pages (and wasn't particularly tuned) got a 7% performance boost when compiling the kernel.
> >
> > Using compound pages or THPs exposes a weakness of our type system. Functions are often unprepared for compound pages to be passed to them, and may only act on PAGE_SIZE chunks. Even functions which are aware of compound pages may expect a head page, and do the wrong thing if passed a tail page.
> >
> > We also waste a lot of instructions ensuring that we're not looking at a tail page. Almost every call to PageFoo() contains one or more hidden calls to compound_head(). This also happens for get_page(), put_page() and many more functions.
> >
> > This patch series uses a new type, the struct folio, to manage memory. It converts enough of the page cache, iomap and XFS to use folios instead of pages, and then adds support for multi-page folios. It passes xfstests (running on XFS) with no regressions compared to v5.14-rc1.
>
> Hey Willy,
>
> I must confess I've lost the thread of the plot in terms of how you hope to get the Memory folio work merged upstream. There are some partial patch sets that just have the mm core, and then there were some larger patchsets, including some in the past which, as I recall, would touch ext4 (but which isn't in this set).
>
> I was wondering if you could perhaps post a roadmap for how this patch set might be broken up, and which subsections you were hoping to target for the upcoming merge window versus the following merge windows.

Hi Ted! Great questions. This particular incarnation of the patch set is the one Linus asked for -- show the performance win of using compound pages in the page cache.

I think of this patchset as having six parts:

1-32:    core; introduce struct folio, get/put, flags
33-50:   memcg
51-89:   page cache, part 1
90-107:  block + iomap
108-124: page cache, part 2
125-138: actually use compound pages in the page cache

I'm hoping to get the first three parts (patches 1-89) into the next merge window. That gets us to the point where filesystems can start to use folios themselves (ie it does the initial Amdahl step and then everything else can happen in parallel).

> Also I assume that for file systems that aren't converted to use Folios, there won't be any performance regressions --- is that correct? Or is that something we need to watch for? Put another way, if we don't land all of the memory folio patches before the end of the calendar year, and we cut an LTS release with some file systems converted and some file systems not yet converted, are there any potential problems in that eventuality?

I suppose I can't guarantee that there will be no performance regressions as a result (eg 5899593f51e6 was a regression that was seen as a result of some of the prep work for folios), but I do not anticipate any for unconverted filesystems. There might be a tiny performance penalty for supporting arbitrary-order pages instead of just orders 0 and 9, but I haven't seen anything to suggest it's noticeable. I would expect to see a tiny performance win from removing all the compound_head() calls in the VFS core.
I have a proposal in to Plumbers filesystem track where I intend to go over all the ways I'm going to want filesystems to change to take advantage of folios. I think that will be a good venue to discuss how to handle buffer_head based filesystems in a multi-page folio world. I wouldn't expect anything to have to change before the end of the year.

I only have four patches in my extended tree which touch ext4, and they're all in the context of making treewide changes to all filesystems:

 - Converting ->set_page_dirty to ->dirty_folio,
 - Converting ->invalidatepage to ->invalidate_folio,
 - Converting ->readpage to ->read_folio,
 - Changing readahead_page() to readahead_folio()

None of those patches are in great shape at this point, and I wouldn't ask anyone to review them. I am anticipating that some filesystems will never be converted to multi-page folios (although all filesystems should be converted to single-page folios so we can remove the folio compat code).
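For orientation, the four conversions listed above eventually settled into address_space_operations shapes roughly like the following in later mainline kernels. At the time of this mail they were still unreviewed drafts, so treat these signatures as illustrative rather than as the patches under discussion.

/*
 * Rough shapes the listed conversions later took in mainline; the
 * drafts in the extended tree at this point may have differed.
 */
struct address_space_operations {
	/* was: int (*set_page_dirty)(struct page *); */
	bool (*dirty_folio)(struct address_space *mapping, struct folio *folio);

	/* was: void (*invalidatepage)(struct page *, unsigned int, unsigned int); */
	void (*invalidate_folio)(struct folio *folio, size_t offset, size_t len);

	/* was: int (*readpage)(struct file *, struct page *); */
	int (*read_folio)(struct file *file, struct folio *folio);

	/* ... other ops unchanged ... */
};

/* was: struct page *readahead_page(struct readahead_control *ractl); */
struct folio *readahead_folio(struct readahead_control *ractl);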
Hi Matthew,

(Sorry for the late response, I could not find time earlier)

On Thu, Jul 15, 2021 at 04:34:46AM +0100, Matthew Wilcox (Oracle) wrote:
> Managing memory in 4KiB pages is a serious overhead. Many benchmarks benefit from a larger "page size". As an example, an earlier iteration of this idea which used compound pages (and wasn't particularly tuned) got a 7% performance boost when compiling the kernel.
>
> Using compound pages or THPs exposes a weakness of our type system. Functions are often unprepared for compound pages to be passed to them, and may only act on PAGE_SIZE chunks. Even functions which are aware of compound pages may expect a head page, and do the wrong thing if passed a tail page.
>
> We also waste a lot of instructions ensuring that we're not looking at a tail page. Almost every call to PageFoo() contains one or more hidden calls to compound_head(). This also happens for get_page(), put_page() and many more functions.
>
> This patch series uses a new type, the struct folio, to manage memory. It converts enough of the page cache, iomap and XFS to use folios instead of pages, and then adds support for multi-page folios. It passes xfstests (running on XFS) with no regressions compared to v5.14-rc1.

I like the idea of folio, and the first patches I've reviewed look good.

Most of the changelogs (at least at the first patches) mention reduction of the kernel size for your configuration on x86. I wonder, what happens if you build the kernel with a "non-distro" configuration, e.g. defconfig or tiny.config?

Also, what is the difference on !x86 builds?

> Git: https://git.infradead.org/users/willy/pagecache.git/shortlog/refs/tags/folio_14
On Tue, Jul 20, 2021 at 01:54:38PM +0300, Mike Rapoport wrote:
> Most of the changelogs (at least at the first patches) mention reduction of the kernel size for your configuration on x86. I wonder, what happens if you build the kernel with a "non-distro" configuration, e.g. defconfig or tiny.config?

I did an allnoconfig build and that reduced in size by ~2KiB.

> Also, what is the difference on !x86 builds?

I don't generally do non-x86 builds ... feel free to compare for yourself! I imagine it'll be 2-4 instructions per call to compound_head(). ie something like:

	load page into reg S
	load reg S + 8 into reg T
	test bottom bit of reg T
	cond-move reg T - 1 to reg S

becomes

	load folio into reg S

the exact spelling of those instructions will vary from architecture to architecture; some will take more instructions than others. Possibly it means we end up using one fewer register and so reducing the number of registers spilled to the stack. Probably not, though.
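The pseudo-assembly above corresponds roughly to this C: a simplified sketch of compound_head() as it looked around v5.14, not code from the patch series itself.

/*
 * Simplified sketch of compound_head() circa v5.14: tail pages keep a
 * pointer to their head page, with bit 0 set, in page->compound_head,
 * so every caller pays for a load, a bit test and a conditional fixup.
 */
static inline struct page *compound_head(struct page *page)
{
	unsigned long head = READ_ONCE(page->compound_head);

	if (unlikely(head & 1))
		return (struct page *)(head - 1);
	return page;
}

/*
 * A struct folio is never a tail page, so the folio equivalent is just
 * "use the pointer" - the single "load folio into reg S" case above.
 */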
On Tue, Jul 20, 2021 at 01:41:15PM +0100, Matthew Wilcox wrote:
> On Tue, Jul 20, 2021 at 01:54:38PM +0300, Mike Rapoport wrote:
> > Most of the changelogs (at least at the first patches) mention reduction of the kernel size for your configuration on x86. I wonder, what happens if you build the kernel with a "non-distro" configuration, e.g. defconfig or tiny.config?
>
> I did an allnoconfig build and that reduced in size by ~2KiB.
>
> > Also, what is the difference on !x86 builds?
>
> I don't generally do non-x86 builds ... feel free to compare for yourself!

I did allnoconfig and defconfig for arm64 and powerpc.

All except arm64::defconfig show a decrease of ~1KiB, while arm64::defconfig actually increased by ~500 bytes. I didn't dig into objdumps yet.

I also tried to build arm but it failed with:

  CC      fs/remap_range.o
fs/remap_range.c: In function 'vfs_dedupe_file_range_compare':
fs/remap_range.c:250:3: error: implicit declaration of function 'flush_dcache_folio'; did you mean 'flush_cache_louis'? [-Werror=implicit-function-declaration]
  250 |   flush_dcache_folio(src_folio);
      |   ^~~~~~~~~~~~~~~~~~
      |   flush_cache_louis
cc1: some warnings being treated as errors

> I imagine it'll be 2-4 instructions per call to compound_head(). ie something like:
>
> 	load page into reg S
> 	load reg S + 8 into reg T
> 	test bottom bit of reg T
> 	cond-move reg T - 1 to reg S
>
> becomes
>
> 	load folio into reg S
>
> the exact spelling of those instructions will vary from architecture to architecture; some will take more instructions than others. Possibly it means we end up using one fewer register and so reducing the number of registers spilled to the stack. Probably not, though.
On Tue, Jul 20, 2021 at 06:17:26PM +0300, Mike Rapoport wrote:
> On Tue, Jul 20, 2021 at 01:41:15PM +0100, Matthew Wilcox wrote:
> > On Tue, Jul 20, 2021 at 01:54:38PM +0300, Mike Rapoport wrote:
> > > Most of the changelogs (at least at the first patches) mention reduction of the kernel size for your configuration on x86. I wonder, what happens if you build the kernel with a "non-distro" configuration, e.g. defconfig or tiny.config?
> >
> > I did an allnoconfig build and that reduced in size by ~2KiB.
> >
> > > Also, what is the difference on !x86 builds?
> >
> > I don't generally do non-x86 builds ... feel free to compare for yourself!
>
> I did allnoconfig and defconfig for arm64 and powerpc.
>
> All except arm64::defconfig show a decrease of ~1KiB, while arm64::defconfig actually increased by ~500 bytes.

Which patch did you go up to for that? If you're going past patch 50 or so, then you're starting to add functionality (ie support for arbitrary order pages), so a certain amount of extra code size might be expected. I measured 6KB at patch 32 or so, then between patch 32 & 50 was pretty much a wash.

> I didn't dig into objdumps yet.
>
> I also tried to build arm but it failed with:
>
>   CC      fs/remap_range.o
> fs/remap_range.c: In function 'vfs_dedupe_file_range_compare':
> fs/remap_range.c:250:3: error: implicit declaration of function 'flush_dcache_folio'; did you mean 'flush_cache_louis'? [-Werror=implicit-function-declaration]
>   250 |   flush_dcache_folio(src_folio);
>       |   ^~~~~~~~~~~~~~~~~~
>       |   flush_cache_louis
> cc1: some warnings being treated as errors

Already complained about by the build bot; already fixed. You should maybe look at the git tree if you're doing more than code review.
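The fix itself isn't shown in this thread. One plausible shape for it -- roughly what later mainline kernels provide for architectures such as arm that implement flush_dcache_page() -- is a generic helper that flushes a multi-page folio one constituent page at a time. This is a sketch of that idea, not the actual patch from the folio tree.

/*
 * Sketch of a generic fallback for architectures that only know how to
 * flush individual pages: walk the folio and flush each page.  Uses
 * folio_nr_pages()/folio_page() from the series; details of the real
 * fix may differ.
 */
void flush_dcache_folio(struct folio *folio)
{
	long i, nr = folio_nr_pages(folio);

	for (i = 0; i < nr; i++)
		flush_dcache_page(folio_page(folio, i));
}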
On Tue, Jul 20, 2021 at 04:23:29PM +0100, Matthew Wilcox wrote:
> On Tue, Jul 20, 2021 at 06:17:26PM +0300, Mike Rapoport wrote:
> > On Tue, Jul 20, 2021 at 01:41:15PM +0100, Matthew Wilcox wrote:
> > > On Tue, Jul 20, 2021 at 01:54:38PM +0300, Mike Rapoport wrote:
> > > > Most of the changelogs (at least at the first patches) mention reduction of the kernel size for your configuration on x86. I wonder, what happens if you build the kernel with a "non-distro" configuration, e.g. defconfig or tiny.config?
> > >
> > > I did an allnoconfig build and that reduced in size by ~2KiB.
> > >
> > > > Also, what is the difference on !x86 builds?
> > >
> > > I don't generally do non-x86 builds ... feel free to compare for yourself!
> >
> > I did allnoconfig and defconfig for arm64 and powerpc.
> >
> > All except arm64::defconfig show a decrease of ~1KiB, while arm64::defconfig actually increased by ~500 bytes.
>
> Which patch did you go up to for that? If you're going past patch 50 or so, then you're starting to add functionality (ie support for arbitrary order pages), so a certain amount of extra code size might be expected. I measured 6KB at patch 32 or so, then between patch 32 & 50 was pretty much a wash.

I've used the folio_14 tag:

commit 480552d0322d855d146c0fa6fdf1e89ca8569037 (HEAD, tag: folio_14)
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date:   Wed Feb 5 11:27:01 2020 -0500

    mm/readahead: Add multi-page folio readahead
On Tue, Jul 20, 2021 at 06:35:50PM +0300, Mike Rapoport wrote:
> On Tue, Jul 20, 2021 at 04:23:29PM +0100, Matthew Wilcox wrote:
> > Which patch did you go up to for that? If you're going past patch 50 or so, then you're starting to add functionality (ie support for arbitrary order pages), so a certain amount of extra code size might be expected. I measured 6KB at patch 32 or so, then between patch 32 & 50 was pretty much a wash.
>
> I've used the folio_14 tag:
>
> commit 480552d0322d855d146c0fa6fdf1e89ca8569037 (HEAD, tag: folio_14)
> Author: Matthew Wilcox (Oracle) <willy@infradead.org>
> Date:   Wed Feb 5 11:27:01 2020 -0500
>
>     mm/readahead: Add multi-page folio readahead

Probably worth trying the for-next tag instead to get a meaningful comparison of how much using folios saves over pages. I don't want to give the impression that this is all that can be saved by switching to folios. There are still hundreds of places that call PageFoo(), SetPageFoo(), ClearPageFoo(), put_page(), get_page(), lock_page() and so on. There's probably another 20KB of code that can be removed that way.
On Thu, Jul 15, 2021 at 04:34:46AM +0100, Matthew Wilcox (Oracle) wrote:
> Managing memory in 4KiB pages is a serious overhead. Many benchmarks benefit from a larger "page size". As an example, an earlier iteration of this idea which used compound pages (and wasn't particularly tuned) got a 7% performance boost when compiling the kernel.

I want to thank Michael Larabel for his benchmarking effort:
https://www.phoronix.com/scan.php?page=news_item&px=Folios-v14-Testing-AMD-Linux

I'm not too surprised by the lack of performance change on the majority of benchmarks. This patch series is only going to change things for heavy users of the page cache (ie it'll do nothing for anon memory users), and it's only really a benefit for programs that have good locality.

What blows me away is the 80% performance improvement for PostgreSQL. I know they use the page cache extensively, so it's plausibly real. I'm a bit surprised that it has such good locality, and the size of the win far exceeds my expectations. We should probably dive into it and figure out exactly what's going on.

Should we accelerate inclusion of this patchset? Right now, I have 89 mm patches queued up for the 5.15 merge window. My plan was to get the 17 iomap + block patches, plus another 18 page cache patches into 5.16 and then get the 14 multi-page folio patches into 5.17. But I'm mindful of the longterm release coming up "soon", and I'm not sure we're best served by multiple distros trying to backport the multi-page folio patches to either 5.15 or 5.16.
On Sat, 2021-07-24 at 18:27 +0100, Matthew Wilcox wrote:
> What blows me away is the 80% performance improvement for PostgreSQL. I know they use the page cache extensively, so it's plausibly real. I'm a bit surprised that it has such good locality, and the size of the win far exceeds my expectations. We should probably dive into it and figure out exactly what's going on.

Since none of the other tested databases showed more than a 3% improvement, this looks like an anomalous result specific to something in postgres ... although the next biggest db, mariadb, wasn't part of the tests so I'm not sure that's definitive. Perhaps the next step should be to test mariadb? Since they're fairly similar in domain (both full SQL), if mariadb shows this type of improvement, you can safely assume it's something in the way SQL databases handle paging, and if it doesn't, it's likely fixing a postgres inefficiency.

James
On Sat, Jul 24, 2021 at 11:09:02AM -0700, James Bottomley wrote:
> On Sat, 2021-07-24 at 18:27 +0100, Matthew Wilcox wrote:
> > What blows me away is the 80% performance improvement for PostgreSQL. I know they use the page cache extensively, so it's plausibly real. I'm a bit surprised that it has such good locality, and the size of the win far exceeds my expectations. We should probably dive into it and figure out exactly what's going on.
>
> Since none of the other tested databases showed more than a 3% improvement, this looks like an anomalous result specific to something in postgres ... although the next biggest db, mariadb, wasn't part of the tests so I'm not sure that's definitive. Perhaps the next step should be to test mariadb? Since they're fairly similar in domain (both full SQL), if mariadb shows this type of improvement, you can safely assume it's something in the way SQL databases handle paging, and if it doesn't, it's likely fixing a postgres inefficiency.

I think the thing that's specific to PostgreSQL is that it's a heavy user of the page cache. My understanding is that most databases use direct IO and manage their own page cache, while PostgreSQL trusts the kernel to get it right.

Regardless of whether postgres is "doing something wrong" or not, do you not think that an 80% performance win would exert a certain amount of pressure on distros to do the backport?
On Sat, 2021-07-24 at 19:14 +0100, Matthew Wilcox wrote:
> On Sat, Jul 24, 2021 at 11:09:02AM -0700, James Bottomley wrote:
> > On Sat, 2021-07-24 at 18:27 +0100, Matthew Wilcox wrote:
> > > What blows me away is the 80% performance improvement for PostgreSQL. I know they use the page cache extensively, so it's plausibly real. I'm a bit surprised that it has such good locality, and the size of the win far exceeds my expectations. We should probably dive into it and figure out exactly what's going on.
> >
> > Since none of the other tested databases showed more than a 3% improvement, this looks like an anomalous result specific to something in postgres ... although the next biggest db, mariadb, wasn't part of the tests so I'm not sure that's definitive. Perhaps the next step should be to test mariadb? Since they're fairly similar in domain (both full SQL), if mariadb shows this type of improvement, you can safely assume it's something in the way SQL databases handle paging, and if it doesn't, it's likely fixing a postgres inefficiency.
>
> I think the thing that's specific to PostgreSQL is that it's a heavy user of the page cache. My understanding is that most databases use direct IO and manage their own page cache, while PostgreSQL trusts the kernel to get it right.

That's testable with mariadb, at least for the innodb engine, since the flush_method is settable.

> Regardless of whether postgres is "doing something wrong" or not, do you not think that an 80% performance win would exert a certain amount of pressure on distros to do the backport?

Well, I cut the previous question deliberately, but if you're going to force me to answer, my experience with storage tells me that one test being 10x different from all the others usually indicates a problem with the benchmark test itself rather than a baseline improvement, so I'd wait for more data.

James
Hi,

On Sat, Jul 24, 2021, at 11:23, James Bottomley wrote:
> On Sat, 2021-07-24 at 19:14 +0100, Matthew Wilcox wrote:
> > On Sat, Jul 24, 2021 at 11:09:02AM -0700, James Bottomley wrote:
> > > On Sat, 2021-07-24 at 18:27 +0100, Matthew Wilcox wrote:
> > > > What blows me away is the 80% performance improvement for PostgreSQL. I know they use the page cache extensively, so it's plausibly real. I'm a bit surprised that it has such good locality, and the size of the win far exceeds my expectations. We should probably dive into it and figure out exactly what's going on.
> > >
> > > Since none of the other tested databases showed more than a 3% improvement, this looks like an anomalous result specific to something in postgres ... although the next biggest db, mariadb, wasn't part of the tests so I'm not sure that's definitive. Perhaps the next step should be to test mariadb? Since they're fairly similar in domain (both full SQL), if mariadb shows this type of improvement, you can safely assume it's something in the way SQL databases handle paging, and if it doesn't, it's likely fixing a postgres inefficiency.
> >
> > I think the thing that's specific to PostgreSQL is that it's a heavy user of the page cache. My understanding is that most databases use direct IO and manage their own page cache, while PostgreSQL trusts the kernel to get it right.
>
> That's testable with mariadb, at least for the innodb engine, since the flush_method is settable.
>
> > Regardless of whether postgres is "doing something wrong" or not, do you not think that an 80% performance win would exert a certain amount of pressure on distros to do the backport?
>
> Well, I cut the previous question deliberately, but if you're going to force me to answer, my experience with storage tells me that one test being 10x different from all the others usually indicates a problem with the benchmark test itself rather than a baseline improvement, so I'd wait for more data.

I have a similar reaction - the large improvements are for a read/write pgbench benchmark at a scale that fits in memory. That's typically purely bound by the speed at which the WAL can be synced to disk. As far as I recall mariadb also uses buffered IO for the WAL (but there was recent work in the area).

Is there a reason for fdatasync() of 16MB files to have got a lot faster? Or a chance that could be broken?

Some improvement for read-only wouldn't surprise me, particularly if the OS/postgres weren't configured for explicit huge pages. Pgbench has a uniform distribution so it's *very* TLB-miss heavy with 4k pages.

Regards,

Andres
On Sat, Jul 24, 2021 at 11:23:25AM -0700, James Bottomley wrote:
> On Sat, 2021-07-24 at 19:14 +0100, Matthew Wilcox wrote:
> > On Sat, Jul 24, 2021 at 11:09:02AM -0700, James Bottomley wrote:
> > > On Sat, 2021-07-24 at 18:27 +0100, Matthew Wilcox wrote:
> > > > What blows me away is the 80% performance improvement for PostgreSQL. I know they use the page cache extensively, so it's plausibly real. I'm a bit surprised that it has such good locality, and the size of the win far exceeds my expectations. We should probably dive into it and figure out exactly what's going on.
> > >
> > > Since none of the other tested databases showed more than a 3% improvement, this looks like an anomalous result specific to something in postgres ... although the next biggest db, mariadb, wasn't part of the tests so I'm not sure that's definitive. Perhaps the next step should be to test mariadb? Since they're fairly similar in domain (both full SQL), if mariadb shows this type of improvement, you can safely assume it's something in the way SQL databases handle paging, and if it doesn't, it's likely fixing a postgres inefficiency.
> >
> > I think the thing that's specific to PostgreSQL is that it's a heavy user of the page cache. My understanding is that most databases use direct IO and manage their own page cache, while PostgreSQL trusts the kernel to get it right.
>
> That's testable with mariadb, at least for the innodb engine, since the flush_method is settable.

We're still not communicating well. I'm not talking about writes, I'm talking about reads. Postgres uses the page cache for reads. InnoDB uses O_DIRECT (afaict). See articles like this one:
https://www.percona.com/blog/2018/02/08/fsync-performance-storage-devices/

: The first and most obvious type of IO are pages reads and writes from
: the tablespaces. The pages are most often read one at a time, as 16KB
: random read operations. Writes to the tablespaces are also typically
: 16KB random operations, but they are done in batches. After every batch,
: fsync is called on the tablespace file handle.

(the current folio patch set does not create multi-page folios for writes, only for reads)

I downloaded the mariadb source package that's in Debian, and from what I can glean, it does indeed set O_DIRECT on data files in Linux, through os_file_set_nocache().

> > Regardless of whether postgres is "doing something wrong" or not, do you not think that an 80% performance win would exert a certain amount of pressure on distros to do the backport?
>
> Well, I cut the previous question deliberately, but if you're going to force me to answer, my experience with storage tells me that one test being 10x different from all the others usually indicates a problem with the benchmark test itself rather than a baseline improvement, so I'd wait for more data.

... or the two benchmarks use Linux in completely different ways such that one sees a huge benefit while the other sees none. Which is what you'd expect for a patchset that improves the page cache and a benchmark that doesn't use the page cache.
On Sat, Jul 24, 2021 at 11:45:26AM -0700, Andres Freund wrote:
> On Sat, Jul 24, 2021, at 11:23, James Bottomley wrote:
> > Well, I cut the previous question deliberately, but if you're going to force me to answer, my experience with storage tells me that one test being 10x different from all the others usually indicates a problem with the benchmark test itself rather than a baseline improvement, so I'd wait for more data.
>
> I have a similar reaction - the large improvements are for a read/write pgbench benchmark at a scale that fits in memory. That's typically purely bound by the speed at which the WAL can be synced to disk. As far as I recall mariadb also uses buffered IO for the WAL (but there was recent work in the area).
>
> Is there a reason for fdatasync() of 16MB files to have got a lot faster? Or a chance that could be broken?
>
> Some improvement for read-only wouldn't surprise me, particularly if the OS/postgres weren't configured for explicit huge pages. Pgbench has a uniform distribution so it's *very* TLB-miss heavy with 4k pages.

It's going to depend substantially on the access pattern. If the 16MB file (oof, that's tiny!) was read in in large chunks or even in small chunks, but consecutively, the folio changes will allocate larger pages (16k, 64k, 256k, ...). Theoretically it might get up to 2MB pages and start using PMDs, but I've never seen that in my testing.

fdatasync() could indeed have got much faster. If we're writing back a 256kB page as a unit, we're handling 64 times less metadata than writing back 64x4kB pages. We'll track 64x less dirty bits. We'll find only 64 dirty pages per 16MB instead of 4096 dirty pages.

It's always possible I just broke something. The xfstests aren't exhaustive, and no regressions doesn't mean no problems.

Can you guide Michael towards parameters for pgbench that might give an indication of performance on a more realistic workload that doesn't entirely fit in memory?
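As a rough illustration of the "16k, 64k, 256k, ..." progression mentioned above: readahead can step the folio allocation order up as a stream proves itself sequential. The helper name and the order-9 cap below are illustrative assumptions, not the series' actual code, though the step of two orders per ramp-up matches the sizes quoted.

/*
 * Illustrative only: each successful readahead round bumps the folio
 * order by 2, i.e. 4KiB -> 16KiB -> 64KiB -> 256KiB -> ..., capped here
 * at order 9 (2MiB on x86) as an assumed upper bound.
 */
static unsigned int next_folio_order(unsigned int prev_order)
{
	unsigned int order = prev_order + 2;

	return order < 9 ? order : 9;
}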
Hi,

On Sat, Jul 24, 2021, at 12:01, Matthew Wilcox wrote:
> On Sat, Jul 24, 2021 at 11:45:26AM -0700, Andres Freund wrote:
> > On Sat, Jul 24, 2021, at 11:23, James Bottomley wrote:
> > > Well, I cut the previous question deliberately, but if you're going to force me to answer, my experience with storage tells me that one test being 10x different from all the others usually indicates a problem with the benchmark test itself rather than a baseline improvement, so I'd wait for more data.
> >
> > I have a similar reaction - the large improvements are for a read/write pgbench benchmark at a scale that fits in memory. That's typically purely bound by the speed at which the WAL can be synced to disk. As far as I recall mariadb also uses buffered IO for the WAL (but there was recent work in the area).
> >
> > Is there a reason for fdatasync() of 16MB files to have got a lot faster? Or a chance that could be broken?
> >
> > Some improvement for read-only wouldn't surprise me, particularly if the OS/postgres weren't configured for explicit huge pages. Pgbench has a uniform distribution so it's *very* TLB-miss heavy with 4k pages.
>
> It's going to depend substantially on the access pattern. If the 16MB file (oof, that's tiny!) was read in in large chunks or even in small chunks, but consecutively, the folio changes will allocate larger pages (16k, 64k, 256k, ...). Theoretically it might get up to 2MB pages and start using PMDs, but I've never seen that in my testing.

The 16MB files are just for the WAL/journal, and are write only in a benchmark like this. With pgbench it'll be written in small consecutive chunks (a few pages at a time, for each group commit). Each page is only written once, until after a checkpoint the entire file is "recycled" (renamed into the future of the WAL stream) and reused from the start.

The data files are 1GB.

> fdatasync() could indeed have got much faster. If we're writing back a 256kB page as a unit, we're handling 64 times less metadata than writing back 64x4kB pages. We'll track 64x less dirty bits. We'll find only 64 dirty pages per 16MB instead of 4096 dirty pages.

The dirty writes will be 8-32k or so in this workload - the constant commits require the WAL to constantly be flushed.

> It's always possible I just broke something. The xfstests aren't exhaustive, and no regressions doesn't mean no problems.
>
> Can you guide Michael towards parameters for pgbench that might give an indication of performance on a more realistic workload that doesn't entirely fit in memory?

Fitting in memory isn't bad - that's a large part of real workloads. It just makes it hard to believe the performance improvement, given that we expect to be bound by disk sync speed...

Michael, where do I find more details about the configuration used during the run?

Regards,

Andres
On Sat, 2021-07-24 at 19:50 +0100, Matthew Wilcox wrote:
> On Sat, Jul 24, 2021 at 11:23:25AM -0700, James Bottomley wrote:
> > On Sat, 2021-07-24 at 19:14 +0100, Matthew Wilcox wrote:
> > > On Sat, Jul 24, 2021 at 11:09:02AM -0700, James Bottomley wrote:
> > > > On Sat, 2021-07-24 at 18:27 +0100, Matthew Wilcox wrote:
> > > > > What blows me away is the 80% performance improvement for PostgreSQL. I know they use the page cache extensively, so it's plausibly real. I'm a bit surprised that it has such good locality, and the size of the win far exceeds my expectations. We should probably dive into it and figure out exactly what's going on.
> > > >
> > > > Since none of the other tested databases showed more than a 3% improvement, this looks like an anomalous result specific to something in postgres ... although the next biggest db, mariadb, wasn't part of the tests so I'm not sure that's definitive. Perhaps the next step should be to test mariadb? Since they're fairly similar in domain (both full SQL), if mariadb shows this type of improvement, you can safely assume it's something in the way SQL databases handle paging, and if it doesn't, it's likely fixing a postgres inefficiency.
> > >
> > > I think the thing that's specific to PostgreSQL is that it's a heavy user of the page cache. My understanding is that most databases use direct IO and manage their own page cache, while PostgreSQL trusts the kernel to get it right.
> >
> > That's testable with mariadb, at least for the innodb engine, since the flush_method is settable.
>
> We're still not communicating well. I'm not talking about writes, I'm talking about reads. Postgres uses the page cache for reads. InnoDB uses O_DIRECT (afaict). See articles like this one:
> https://www.percona.com/blog/2018/02/08/fsync-performance-storage-devices/

If it were all about reads, wouldn't the Phoronix pgbench read only test have shown a better improvement than 7%? I think the Phoronix data shows that whatever it is it's to do with writes ... that does imply something in the way the log syncs data.

James
Hi,

On 2021-07-24 12:12:36 -0700, Andres Freund wrote:
> On Sat, Jul 24, 2021, at 12:01, Matthew Wilcox wrote:
> > On Sat, Jul 24, 2021 at 11:45:26AM -0700, Andres Freund wrote:
> > It's always possible I just broke something. The xfstests aren't exhaustive, and no regressions doesn't mean no problems.
> >
> > Can you guide Michael towards parameters for pgbench that might give an indication of performance on a more realistic workload that doesn't entirely fit in memory?
>
> Fitting in memory isn't bad - that's a large part of real workloads. It just makes it hard to believe the performance improvement, given that we expect to be bound by disk sync speed...

I just tried to compare folio-14 vs its baseline, testing commit 8096acd7442e against 480552d0322d. In a VM however (but at least with its memory being backed by huge pages and storage being passed through). I got about a 7% improvement with just some baseline tuning of postgres applied. I think 1-2% of that is potentially runtime variance (I saw slightly different timings around checkpointing that lead to a bit of an "unfair" advantage for the folio run).

That's a *nice* win!

WRT the ~70% improvement:

> Michael, where do I find more details about the configuration used during the run?

After some digging I found https://github.com/phoronix-test-suite/phoronix-test-suite/blob/94562dd4a808637be526b639d220c7cd937e2aa1/ob-cache/test-profiles/pts/pgbench-1.10.1/install.sh

For one, the test says it's done on ext4, while I used xfs. But I think the bigger thing is the following:

The phoronix test uses postgres with only one relevant setting adjusted (increasing the max connection count). That will end up using a buffer pool of 128MB, no huge pages, and importantly is configured to aim for not more than 1GB for postgres' journal, which will lead to constant checkpointing. The test also only runs for 15 seconds, which likely isn't even enough to "warm up" (the creation of the data set here will take longer than the run).

Given that the dataset phoronix is using is about ~16GB of data (excluding WAL), and uses 256 concurrent clients running full tilt, using those limited postgres settings doesn't end up measuring something particularly interesting in my opinion.

Without changing the filesystem, using a configuration more similar to phoronix', I do get a bigger win. But the run-to-run variance is so high (largely due to the short test duration) that I don't trust those results much.

It does look like there's less of a slowdown due to checkpoints (i.e. fsyncing all data files postgres modified since the last checkpoints) on the folio branch, which does make some sense to me and would be a welcome improvement.

Greetings,

Andres Freund
On 7/24/21 4:44 PM, Andres Freund wrote:
> Hi,
>
> On 2021-07-24 12:12:36 -0700, Andres Freund wrote:
>> On Sat, Jul 24, 2021, at 12:01, Matthew Wilcox wrote:
>>> On Sat, Jul 24, 2021 at 11:45:26AM -0700, Andres Freund wrote:
>>> It's always possible I just broke something. The xfstests aren't exhaustive, and no regressions doesn't mean no problems.
>>>
>>> Can you guide Michael towards parameters for pgbench that might give an indication of performance on a more realistic workload that doesn't entirely fit in memory?
>>
>> Fitting in memory isn't bad - that's a large part of real workloads. It just makes it hard to believe the performance improvement, given that we expect to be bound by disk sync speed...
>
> I just tried to compare folio-14 vs its baseline, testing commit 8096acd7442e against 480552d0322d. In a VM however (but at least with its memory being backed by huge pages and storage being passed through). I got about a 7% improvement with just some baseline tuning of postgres applied. I think 1-2% of that is potentially runtime variance (I saw slightly different timings around checkpointing that lead to a bit of an "unfair" advantage for the folio run).
>
> That's a *nice* win!
>
> WRT the ~70% improvement:
>
>> Michael, where do I find more details about the configuration used during the run?
>
> After some digging I found https://github.com/phoronix-test-suite/phoronix-test-suite/blob/94562dd4a808637be526b639d220c7cd937e2aa1/ob-cache/test-profiles/pts/pgbench-1.10.1/install.sh
>
> For one, the test says it's done on ext4, while I used xfs. But I think the bigger thing is the following:

Yes, that is the run/setup script used. The additional pgbench arguments passed at run-time are outlined in https://github.com/phoronix-test-suite/phoronix-test-suite/blob/94562dd4a808637be526b639d220c7cd937e2aa1/ob-cache/test-profiles/pts/pgbench-1.10.1/test-definition.xml

Though in this case it is quite straightforward in corresponding to the relevant -s, -c options for pgbench and what is shown in turn on the pgbench graphs.

I have been running some more PostgreSQL tests on other hardware as well as via HammerDB and other databases. Will send that over when wrapped up, likely tomorrow.

Michael

> The phoronix test uses postgres with only one relevant setting adjusted (increasing the max connection count). That will end up using a buffer pool of 128MB, no huge pages, and importantly is configured to aim for not more than 1GB for postgres' journal, which will lead to constant checkpointing. The test also only runs for 15 seconds, which likely isn't even enough to "warm up" (the creation of the data set here will take longer than the run).
>
> Given that the dataset phoronix is using is about ~16GB of data (excluding WAL), and uses 256 concurrent clients running full tilt, using those limited postgres settings doesn't end up measuring something particularly interesting in my opinion.
>
> Without changing the filesystem, using a configuration more similar to phoronix', I do get a bigger win. But the run-to-run variance is so high (largely due to the short test duration) that I don't trust those results much.
>
> It does look like there's less of a slowdown due to checkpoints (i.e. fsyncing all data files postgres modified since the last checkpoints) on the folio branch, which does make some sense to me and would be a welcome improvement.
>
> Greetings,
>
> Andres Freund
On Sat, Jul 24, 2021 at 02:44:13PM -0700, Andres Freund wrote:
> The phoronix test uses postgres with only one relevant setting adjusted (increasing the max connection count). That will end up using a buffer pool of 128MB, no huge pages, and importantly is configured to aim for not more than 1GB for postgres' journal, which will lead to constant checkpointing. The test also only runs for 15 seconds, which likely isn't even enough to "warm up" (the creation of the data set here will take longer than the run).
>
> Given that the dataset phoronix is using is about ~16GB of data (excluding WAL), and uses 256 concurrent clients running full tilt, using those limited postgres settings doesn't end up measuring something particularly interesting in my opinion.

Hi Andres,

I tend to use the phoronix test suite for my performance runs when testing ext4 changes simply because it's convenient. Can you suggest a better set of configuration settings that I should perhaps use that might give more "real world" numbers that you would find more significant?

Thanks,

- Ted
Hi,

On 2021-07-26 10:19:11 -0400, Theodore Ts'o wrote:
> On Sat, Jul 24, 2021 at 02:44:13PM -0700, Andres Freund wrote:
> > The phoronix test uses postgres with only one relevant setting adjusted (increasing the max connection count). That will end up using a buffer pool of 128MB, no huge pages, and importantly is configured to aim for not more than 1GB for postgres' journal, which will lead to constant checkpointing. The test also only runs for 15 seconds, which likely isn't even enough to "warm up" (the creation of the data set here will take longer than the run).
> >
> > Given that the dataset phoronix is using is about ~16GB of data (excluding WAL), and uses 256 concurrent clients running full tilt, using those limited postgres settings doesn't end up measuring something particularly interesting in my opinion.
>
> I tend to use the phoronix test suite for my performance runs when testing ext4 changes simply because it's convenient. Can you suggest a better set of configuration settings that I should perhaps use that might give more "real world" numbers that you would find more significant?

It depends a bit on what you want to test, obviously... At the very least you should set 'max_wal_size = 32GB' or such (it'll only use that much if enough WAL is generated within the checkpoint timeout, which defaults to 5min). And unfortunately you're not going to get meaningful performance results for a read/write test within 10s; you need to run at least ~11min (so two checkpoints happen).

With the default shared_buffers setting of 128MB you are going to simulate a much-larger-than-postgres's-memory workload, albeit one where the page cache *is* big enough on most current machines, unless you limit the size of the page cache considerably. Doing so can be useful to approximate a workload that would take much longer to initialize due to the size.

I suggest *not* disabling autovacuum as currently done for performance testing - it's not something many real-world setups can afford to do, so benchmarking FS performance with it disabled doesn't seem like a good idea.

FWIW, depending on what kind of thing you want to test, it'd not be hard to come up with a test that takes less time to initialize, e.g. an insert-only workload without an initial dataset or such. As long as you *do* initialize 16GB of data, I think it'd make sense to measure the time that takes. There have definitely been filesystem-level performance changes in that area, and it's often going to be more IO intensive.

Greetings,

Andres Freund