Message ID | 20151102042952.6610.7185.stgit@dwillia2-desk3.amr.corp.intel.com
---|---
State | Superseded
On Sun, Nov 01, 2015 at 11:29:53PM -0500, Dan Williams wrote:
> dax_clear_blocks is currently performing a cond_resched() after every
> PAGE_SIZE memset. We need not check so frequently; for example, md-raid
> only calls cond_resched() at stripe granularity. Also, in preparation
> for introducing a dax_map_atomic() operation that temporarily pins a dax
> mapping, move the call to cond_resched() to the outer loop.
>
> Reviewed-by: Jan Kara <jack@suse.com>
> Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  fs/dax.c | 27 ++++++++++++---------------
>  1 file changed, 12 insertions(+), 15 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 5dc33d788d50..f8e543839e5c 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -28,6 +28,7 @@
>  #include <linux/sched.h>
>  #include <linux/uio.h>
>  #include <linux/vmstat.h>
> +#include <linux/sizes.h>
>
>  int dax_clear_blocks(struct inode *inode, sector_t block, long size)
>  {
> @@ -38,24 +39,20 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
>  	do {
>  		void __pmem *addr;
>  		unsigned long pfn;
> -		long count;
> +		long count, sz;
>
> -		count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
> +		sz = min_t(long, size, SZ_1M);
> +		count = bdev_direct_access(bdev, sector, &addr, &pfn, sz);
>  		if (count < 0)
>  			return count;
> -		BUG_ON(size < count);
> -		while (count > 0) {
> -			unsigned pgsz = PAGE_SIZE - offset_in_page(addr);
> -			if (pgsz > count)
> -				pgsz = count;
> -			clear_pmem(addr, pgsz);
> -			addr += pgsz;
> -			size -= pgsz;
> -			count -= pgsz;
> -			BUG_ON(pgsz & 511);
> -			sector += pgsz / 512;
> -			cond_resched();
> -		}
> +		if (count < sz)
> +			sz = count;
> +		clear_pmem(addr, sz);
> +		addr += sz;
> +		size -= sz;
> +		BUG_ON(sz & 511);
> +		sector += sz / 512;
> +		cond_resched();
>  	} while (size);
>
>  	wmb_pmem();

dax_clear_blocks() needs to go away and be replaced by a driver-level
implementation of blkdev_issue_zeroout(). This is effectively a block
device operation (we're taking sector addresses and zeroing them), so
it really belongs in the pmem drivers rather than the DAX code.

I suspect a REQ_WRITE_SAME implementation is the way to go here, as
then the filesystems can just call sb_issue_zeroout() and the block
layer zeroing will work on all types of storage without the filesystem
having to care whether DAX is in use or not.

Putting the implementation of the zeroing in the pmem drivers will
enable the drivers to optimise the caching behaviour of block zeroing.
The synchronous cache flushing behaviour of this function is a
performance killer: we are now block zeroing on allocation, which
results in two synchronous data writes (zero on alloc, commit; write
data, commit) for each page.

The zeroing (and the data, for that matter) doesn't need to be
committed to persistent store until the allocation is written and
committed to the journal - that will happen with a REQ_FLUSH|REQ_FUA
write, so it makes sense to deploy the big hammer and delay the
blocking CPU cache flushes until the last possible moment in cases
like this.

Cheers,

Dave.
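For concreteness, here is a minimal sketch of the filesystem-side call
Dave is suggesting. The call site (`fs_zero_new_extent`) is
hypothetical; `sb_issue_zeroout()` and `blk_queue_max_write_same_sectors()`
are existing block-layer interfaces:

```c
#include <linux/blkdev.h>

/*
 * Hypothetical call site in a filesystem's allocation path: zero a
 * newly allocated extent through the block layer instead of through
 * dax_clear_blocks().  The block layer can service this with
 * REQ_WRITE_SAME when the driver advertises support (via
 * blk_queue_max_write_same_sectors() at queue setup), and falls back
 * to plain zero-filled writes otherwise, so the filesystem does not
 * need to know whether DAX is in use.
 */
static int fs_zero_new_extent(struct super_block *sb, sector_t block,
			      sector_t nr_blocks)
{
	return sb_issue_zeroout(sb, block, nr_blocks, GFP_NOFS);
}
```

The point of routing through the block layer is that the same call
works whether the backing store is pmem, an SSD with WRITE SAME
support, or a plain disk.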
On Mon, Nov 2, 2015 at 4:51 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Sun, Nov 01, 2015 at 11:29:53PM -0500, Dan Williams wrote:
>> dax_clear_blocks is currently performing a cond_resched() after every
>> PAGE_SIZE memset. We need not check so frequently; for example, md-raid
>> only calls cond_resched() at stripe granularity. Also, in preparation
>> for introducing a dax_map_atomic() operation that temporarily pins a dax
>> mapping, move the call to cond_resched() to the outer loop.
>>
>> Reviewed-by: Jan Kara <jack@suse.com>
>> Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  fs/dax.c | 27 ++++++++++++---------------
>>  1 file changed, 12 insertions(+), 15 deletions(-)
>>
>> diff --git a/fs/dax.c b/fs/dax.c
>> index 5dc33d788d50..f8e543839e5c 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -28,6 +28,7 @@
>>  #include <linux/sched.h>
>>  #include <linux/uio.h>
>>  #include <linux/vmstat.h>
>> +#include <linux/sizes.h>
>>
>>  int dax_clear_blocks(struct inode *inode, sector_t block, long size)
>>  {
>> @@ -38,24 +39,20 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
>>  	do {
>>  		void __pmem *addr;
>>  		unsigned long pfn;
>> -		long count;
>> +		long count, sz;
>>
>> -		count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
>> +		sz = min_t(long, size, SZ_1M);
>> +		count = bdev_direct_access(bdev, sector, &addr, &pfn, sz);
>>  		if (count < 0)
>>  			return count;
>> -		BUG_ON(size < count);
>> -		while (count > 0) {
>> -			unsigned pgsz = PAGE_SIZE - offset_in_page(addr);
>> -			if (pgsz > count)
>> -				pgsz = count;
>> -			clear_pmem(addr, pgsz);
>> -			addr += pgsz;
>> -			size -= pgsz;
>> -			count -= pgsz;
>> -			BUG_ON(pgsz & 511);
>> -			sector += pgsz / 512;
>> -			cond_resched();
>> -		}
>> +		if (count < sz)
>> +			sz = count;
>> +		clear_pmem(addr, sz);
>> +		addr += sz;
>> +		size -= sz;
>> +		BUG_ON(sz & 511);
>> +		sector += sz / 512;
>> +		cond_resched();
>>  	} while (size);
>>
>>  	wmb_pmem();
>
> dax_clear_blocks() needs to go away and be replaced by a driver-level
> implementation of blkdev_issue_zeroout(). This is effectively a block
> device operation (we're taking sector addresses and zeroing them), so
> it really belongs in the pmem drivers rather than the DAX code.
>
> I suspect a REQ_WRITE_SAME implementation is the way to go here, as
> then the filesystems can just call sb_issue_zeroout() and the block
> layer zeroing will work on all types of storage without the filesystem
> having to care whether DAX is in use or not.
>
> Putting the implementation of the zeroing in the pmem drivers will
> enable the drivers to optimise the caching behaviour of block zeroing.
> The synchronous cache flushing behaviour of this function is a
> performance killer: we are now block zeroing on allocation, which
> results in two synchronous data writes (zero on alloc, commit; write
> data, commit) for each page.
>
> The zeroing (and the data, for that matter) doesn't need to be
> committed to persistent store until the allocation is written and
> committed to the journal - that will happen with a REQ_FLUSH|REQ_FUA
> write, so it makes sense to deploy the big hammer and delay the
> blocking CPU cache flushes until the last possible moment in cases
> like this.

In pmem terms that would be a non-temporal memset plus a delayed
wmb_pmem at REQ_FLUSH time. Better to write around the cache than to
loop over the dirty data issuing flushes after the fact. We'll bump
the priority of the non-temporal memset implementation.

I like the idea of pushing this down into the driver versus your other
feedback of pushing dirty-extent tracking up into the radix tree... but
more on that in the other thread.
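A user-space flavoured sketch of the non-temporal zeroing Dan has in
mind, using SSE2 intrinsics rather than the kernel's arch helpers (the
function name is hypothetical; in the kernel this logic would sit
behind clear_pmem()):

```c
#include <emmintrin.h>	/* SSE2: _mm_setzero_si128, _mm_stream_si128 */
#include <stddef.h>

/*
 * Hypothetical sketch: zero a range with non-temporal (write-around)
 * stores.  Assumes dst is 16-byte aligned and len is a multiple of 16.
 * Note the deliberate absence of a trailing sfence: per the discussion,
 * ordering/durability is deferred to the REQ_FLUSH path (wmb_pmem()).
 */
static void memset_nt_zero(void *dst, size_t len)
{
	__m128i zero = _mm_setzero_si128();
	char *p = dst;
	size_t i;

	for (i = 0; i < len; i += 16)
		_mm_stream_si128((__m128i *)(p + i), zero);
}
```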
On Mon, Nov 02, 2015 at 07:27:26PM -0800, Dan Williams wrote:
> On Mon, Nov 2, 2015 at 4:51 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Sun, Nov 01, 2015 at 11:29:53PM -0500, Dan Williams wrote:
> > The zeroing (and the data, for that matter) doesn't need to be
> > committed to persistent store until the allocation is written and
> > committed to the journal - that will happen with a REQ_FLUSH|REQ_FUA
> > write, so it makes sense to deploy the big hammer and delay the
> > blocking CPU cache flushes until the last possible moment in cases
> > like this.
>
> In pmem terms that would be a non-temporal memset plus a delayed
> wmb_pmem at REQ_FLUSH time. Better to write around the cache than to
> loop over the dirty data issuing flushes after the fact. We'll bump
> the priority of the non-temporal memset implementation.

Why is it better to do two synchronous physical writes to memory
within a couple of microseconds of CPU time rather than writing them
through the cache and, in most cases, only doing one physical write
to memory in a separate context that expects to wait for a flush to
complete?

Cheers,

Dave.
On Mon, Nov 2, 2015 at 8:48 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Nov 02, 2015 at 07:27:26PM -0800, Dan Williams wrote:
>> On Mon, Nov 2, 2015 at 4:51 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Sun, Nov 01, 2015 at 11:29:53PM -0500, Dan Williams wrote:
>> > The zeroing (and the data, for that matter) doesn't need to be
>> > committed to persistent store until the allocation is written and
>> > committed to the journal - that will happen with a REQ_FLUSH|REQ_FUA
>> > write, so it makes sense to deploy the big hammer and delay the
>> > blocking CPU cache flushes until the last possible moment in cases
>> > like this.
>>
>> In pmem terms that would be a non-temporal memset plus a delayed
>> wmb_pmem at REQ_FLUSH time. Better to write around the cache than to
>> loop over the dirty data issuing flushes after the fact. We'll bump
>> the priority of the non-temporal memset implementation.
>
> Why is it better to do two synchronous physical writes to memory
> within a couple of microseconds of CPU time rather than writing them
> through the cache and, in most cases, only doing one physical write
> to memory in a separate context that expects to wait for a flush
> to complete?

With a switch to non-temporal writes they wouldn't be synchronous,
although it's doubtful that the subsequent writes after zeroing would
also hit the store buffer.

If we had a method to flush by physical-cache-way rather than by
virtual address then it would indeed be better to save up for one
final flush, but when we need to resort to looping through all the
virtual addresses that might have been touched it gets expensive.
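The expensive loop Dan refers to is essentially one flush instruction
per cache line over every range that may be dirty. A hedged sketch
(hypothetical helper; assumes 64-byte cache lines; the kernel's
clflushopt()/clwb() wrappers would be the cheaper variants where the
CPU supports them):

```c
#include <stddef.h>

#define CACHELINE_SIZE	64	/* assumption: x86 cache-line size */

/*
 * Hypothetical sketch of the flush loop under discussion: one clflush
 * per cache line in the range.  The cost scales with the size of the
 * dirty range, which is why flushing after the fact loses to writing
 * around the cache in the first place.
 */
static void flush_range(void *addr, size_t len)
{
	unsigned long p = (unsigned long)addr & ~(CACHELINE_SIZE - 1UL);
	unsigned long end = (unsigned long)addr + len;

	for (; p < end; p += CACHELINE_SIZE)
		asm volatile("clflush %0" : "+m" (*(volatile char *)p));
}
```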
On Mon, Nov 02, 2015 at 09:31:11PM -0800, Dan Williams wrote:
> On Mon, Nov 2, 2015 at 8:48 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Nov 02, 2015 at 07:27:26PM -0800, Dan Williams wrote:
> >> On Mon, Nov 2, 2015 at 4:51 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > On Sun, Nov 01, 2015 at 11:29:53PM -0500, Dan Williams wrote:
> >> > The zeroing (and the data, for that matter) doesn't need to be
> >> > committed to persistent store until the allocation is written and
> >> > committed to the journal - that will happen with a REQ_FLUSH|REQ_FUA
> >> > write, so it makes sense to deploy the big hammer and delay the
> >> > blocking CPU cache flushes until the last possible moment in cases
> >> > like this.
> >>
> >> In pmem terms that would be a non-temporal memset plus a delayed
> >> wmb_pmem at REQ_FLUSH time. Better to write around the cache than to
> >> loop over the dirty data issuing flushes after the fact. We'll bump
> >> the priority of the non-temporal memset implementation.
> >
> > Why is it better to do two synchronous physical writes to memory
> > within a couple of microseconds of CPU time rather than writing them
> > through the cache and, in most cases, only doing one physical write
> > to memory in a separate context that expects to wait for a flush
> > to complete?
>
> With a switch to non-temporal writes they wouldn't be synchronous,
> although it's doubtful that the subsequent writes after zeroing would
> also hit the store buffer.
>
> If we had a method to flush by physical-cache-way rather than by
> virtual address then it would indeed be better to save up for one
> final flush, but when we need to resort to looping through all the
> virtual addresses that might have been touched it gets expensive.

msync() is for flushing userspace mmap'd address ranges back to
physical memory. fsync() is for flushing kernel addresses (i.e. as
returned by bdev_direct_access()) back to physical addresses. msync()
calls ->fsync() as part of its operation; fsync() does not care about
whether mmap has been sync'd first or not.

i.e. we don't care about random dirty userspace virtual mappings in
fsync() - if you have them then you need to call msync() first. So we
shouldn't ever be having to walk virtual addresses in fsync - just the
kaddr returned by bdev_direct_access() is all that fsync needs to
flush...

Cheers,

Dave.
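In user-space terms, the contract Dave describes looks like the sketch
below (illustrative only; the path and sizes are hypothetical): data
written through an mmap is the application's responsibility to flush
via msync(), which in turn invokes the filesystem's ->fsync():

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical example: update a record through a shared mapping and
 * make it durable.  Assumes the file is at least len bytes long. */
int update_record(const char *path, const char *buf, size_t len)
{
	int fd = open(path, O_RDWR);
	if (fd < 0)
		return -1;

	char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
			 fd, 0);
	if (map == MAP_FAILED) {
		close(fd);
		return -1;
	}

	memcpy(map, buf, len);		/* dirties the mapping (CPU cache) */
	int ret = msync(map, len, MS_SYNC);	/* flush the mapped range;
						 * calls ->fsync() underneath */
	munmap(map, len);
	close(fd);
	return ret;
}
```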
On Mon, Nov 2, 2015 at 9:52 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Nov 02, 2015 at 09:31:11PM -0800, Dan Williams wrote:
>> On Mon, Nov 2, 2015 at 8:48 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Mon, Nov 02, 2015 at 07:27:26PM -0800, Dan Williams wrote:
>> >> On Mon, Nov 2, 2015 at 4:51 PM, Dave Chinner <david@fromorbit.com> wrote:
>> >> > On Sun, Nov 01, 2015 at 11:29:53PM -0500, Dan Williams wrote:
>> >> > The zeroing (and the data, for that matter) doesn't need to be
>> >> > committed to persistent store until the allocation is written and
>> >> > committed to the journal - that will happen with a REQ_FLUSH|REQ_FUA
>> >> > write, so it makes sense to deploy the big hammer and delay the
>> >> > blocking CPU cache flushes until the last possible moment in cases
>> >> > like this.
>> >>
>> >> In pmem terms that would be a non-temporal memset plus a delayed
>> >> wmb_pmem at REQ_FLUSH time. Better to write around the cache than to
>> >> loop over the dirty data issuing flushes after the fact. We'll bump
>> >> the priority of the non-temporal memset implementation.
>> >
>> > Why is it better to do two synchronous physical writes to memory
>> > within a couple of microseconds of CPU time rather than writing them
>> > through the cache and, in most cases, only doing one physical write
>> > to memory in a separate context that expects to wait for a flush
>> > to complete?
>>
>> With a switch to non-temporal writes they wouldn't be synchronous,
>> although it's doubtful that the subsequent writes after zeroing would
>> also hit the store buffer.
>>
>> If we had a method to flush by physical-cache-way rather than by
>> virtual address then it would indeed be better to save up for one
>> final flush, but when we need to resort to looping through all the
>> virtual addresses that might have been touched it gets expensive.
>
> msync() is for flushing userspace mmap'd address ranges back to
> physical memory. fsync() is for flushing kernel addresses (i.e. as
> returned by bdev_direct_access()) back to physical addresses. msync()
> calls ->fsync() as part of its operation; fsync() does not care about
> whether mmap has been sync'd first or not.
>
> i.e. we don't care about random dirty userspace virtual mappings in
> fsync() - if you have them then you need to call msync() first. So we
> shouldn't ever be having to walk virtual addresses in fsync - just the
> kaddr returned by bdev_direct_access() is all that fsync needs to
> flush...

Neither Ross' solution nor mine uses userspace addresses. Which
comment of mine were you reacting to?
On Mon, Nov 02, 2015 at 09:31:11PM -0800, Dan Williams wrote:
> On Mon, Nov 2, 2015 at 8:48 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Nov 02, 2015 at 07:27:26PM -0800, Dan Williams wrote:
> >> On Mon, Nov 2, 2015 at 4:51 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > On Sun, Nov 01, 2015 at 11:29:53PM -0500, Dan Williams wrote:
> >> > The zeroing (and the data, for that matter) doesn't need to be
> >> > committed to persistent store until the allocation is written and
> >> > committed to the journal - that will happen with a REQ_FLUSH|REQ_FUA
> >> > write, so it makes sense to deploy the big hammer and delay the
> >> > blocking CPU cache flushes until the last possible moment in cases
> >> > like this.
> >>
> >> In pmem terms that would be a non-temporal memset plus a delayed
> >> wmb_pmem at REQ_FLUSH time. Better to write around the cache than to
> >> loop over the dirty data issuing flushes after the fact. We'll bump
> >> the priority of the non-temporal memset implementation.
> >
> > Why is it better to do two synchronous physical writes to memory
> > within a couple of microseconds of CPU time rather than writing them
> > through the cache and, in most cases, only doing one physical write
> > to memory in a separate context that expects to wait for a flush
> > to complete?
>
> With a switch to non-temporal writes they wouldn't be synchronous,
> although it's doubtful that the subsequent writes after zeroing would
> also hit the store buffer.
>
> If we had a method to flush by physical-cache-way rather than by
> virtual address then it would indeed be better to save up for one
> final flush, but when we need to resort to looping through all the
> virtual addresses that might have been touched it gets expensive.

I agree with the idea that we should avoid the "big hammer" flushing in
response to REQ_FLUSH. Here are the steps that are needed to make sure
that something is durable on media with PMEM/DAX:

1) Write, either with non-temporal stores or with stores that use the
   processor cache.

2) If you wrote using the processor cache, flush or write back the
   processor cache.

3) wmb_pmem(), synchronizing all non-temporal writes and flushes
   durably to media.

PMEM does all I/O using 1 and 3 with non-temporal stores, and mmaps
that go to userspace can use cached writes, so on fsync/msync we do a
bunch of flushes for step 2. In either case I think we should have the
PMEM driver just do step 3, the wmb_pmem(), in response to REQ_FLUSH.
This allows the zeroing code to just do non-temporal writes of zeros,
the DAX fsync/msync code to just do flushes (which is what my patch set
already does), and just leave the wmb_pmem() to the PMEM driver at
REQ_FLUSH time. This makes the burden of REQ_FLUSH bearable for the
PMEM driver, allowing us to avoid looping through potentially terabytes
of PMEM on each REQ_FLUSH bio.

This just means that the layers above the PMEM code either need to use
non-temporal writes for their I/Os, or do flushing, which I don't think
is too onerous.
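Expressed against the 4.3-era pmem API, Ross's recipe might look like
the sketch below. memcpy_to_pmem() and wmb_pmem() were in linux/pmem.h
at the time; wb_cache_pmem() is an assumption taken from the
fsync/msync patch set he mentions, not mainline:

```c
#include <linux/pmem.h>

/* Step 1 (kernel I/O and zeroing path): non-temporal stores write
 * around the CPU cache, so no step 2 is needed for this data. */
static void pmem_write(void __pmem *dst, const void *src, size_t n)
{
	memcpy_to_pmem(dst, src, n);
}

/* Step 2 (mmap path): userspace wrote through the CPU cache, so DAX
 * fsync/msync writes the dirty cache lines back.  wb_cache_pmem() is
 * hypothetical here, assumed from the fsync/msync patch set. */
static void pmem_writeback_range(void __pmem *addr, size_t n)
{
	wb_cache_pmem(addr, n);
}

/* Step 3: issued once by the pmem driver when it sees REQ_FLUSH,
 * rather than once per written range. */
static void pmem_req_flush(void)
{
	wmb_pmem();	/* sfence + pcommit on 4.3-era x86 */
}
```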
On Tue, Nov 03, 2015 at 10:57:57AM -0700, Ross Zwisler wrote:
> On Mon, Nov 02, 2015 at 09:31:11PM -0800, Dan Williams wrote:
> > On Mon, Nov 2, 2015 at 8:48 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > On Mon, Nov 02, 2015 at 07:27:26PM -0800, Dan Williams wrote:
> > >> On Mon, Nov 2, 2015 at 4:51 PM, Dave Chinner <david@fromorbit.com> wrote:
> > >> > On Sun, Nov 01, 2015 at 11:29:53PM -0500, Dan Williams wrote:
> > >> > The zeroing (and the data, for that matter) doesn't need to be
> > >> > committed to persistent store until the allocation is written and
> > >> > committed to the journal - that will happen with a REQ_FLUSH|REQ_FUA
> > >> > write, so it makes sense to deploy the big hammer and delay the
> > >> > blocking CPU cache flushes until the last possible moment in cases
> > >> > like this.
> > >>
> > >> In pmem terms that would be a non-temporal memset plus a delayed
> > >> wmb_pmem at REQ_FLUSH time. Better to write around the cache than to
> > >> loop over the dirty data issuing flushes after the fact. We'll bump
> > >> the priority of the non-temporal memset implementation.
> > >
> > > Why is it better to do two synchronous physical writes to memory
> > > within a couple of microseconds of CPU time rather than writing them
> > > through the cache and, in most cases, only doing one physical write
> > > to memory in a separate context that expects to wait for a flush
> > > to complete?
> >
> > With a switch to non-temporal writes they wouldn't be synchronous,
> > although it's doubtful that the subsequent writes after zeroing would
> > also hit the store buffer.
> >
> > If we had a method to flush by physical-cache-way rather than by
> > virtual address then it would indeed be better to save up for one
> > final flush, but when we need to resort to looping through all the
> > virtual addresses that might have been touched it gets expensive.
>
> I agree with the idea that we should avoid the "big hammer" flushing in
> response to REQ_FLUSH. Here are the steps that are needed to make sure
> that something is durable on media with PMEM/DAX:
>
> 1) Write, either with non-temporal stores or with stores that use the
>    processor cache.
>
> 2) If you wrote using the processor cache, flush or write back the
>    processor cache.
>
> 3) wmb_pmem(), synchronizing all non-temporal writes and flushes
>    durably to media.

Right, and when you look at buffered IO, we have:

1) write to the page cache, mark the page dirty
2) if you have dirty cached pages, flush the dirty pages to the device
3) REQ_FLUSH causes everything to be durable

> PMEM does all I/O using 1 and 3 with non-temporal stores, and mmaps
> that go to userspace can use cached writes, so on fsync/msync we do a
> bunch of flushes for step 2. In either case I think we should have the
> PMEM driver just do step 3, the wmb_pmem(), in response to REQ_FLUSH.
> This allows the zeroing code to just do non-temporal writes of zeros,
> the DAX fsync/msync code to just do flushes (which is what my patch set
> already does), and just leave the wmb_pmem() to the PMEM driver at
> REQ_FLUSH time.
>
> This just means that the layers above the PMEM code either need to use
> non-temporal writes for their I/Os, or do flushing, which I don't think
> is too onerous.

Agreed - it fits neatly into the existing infrastructure and
algorithms, and there's no evidence to suggest that using the existing
infrastructure is going to cause undue burden on PMEM-based workloads.
Hence I really think this is the right way to proceed...

Cheers,

Dave.
```diff
diff --git a/fs/dax.c b/fs/dax.c
index 5dc33d788d50..f8e543839e5c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -28,6 +28,7 @@
 #include <linux/sched.h>
 #include <linux/uio.h>
 #include <linux/vmstat.h>
+#include <linux/sizes.h>
 
 int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 {
@@ -38,24 +39,20 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 	do {
 		void __pmem *addr;
 		unsigned long pfn;
-		long count;
+		long count, sz;
 
-		count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
+		sz = min_t(long, size, SZ_1M);
+		count = bdev_direct_access(bdev, sector, &addr, &pfn, sz);
 		if (count < 0)
 			return count;
-		BUG_ON(size < count);
-		while (count > 0) {
-			unsigned pgsz = PAGE_SIZE - offset_in_page(addr);
-			if (pgsz > count)
-				pgsz = count;
-			clear_pmem(addr, pgsz);
-			addr += pgsz;
-			size -= pgsz;
-			count -= pgsz;
-			BUG_ON(pgsz & 511);
-			sector += pgsz / 512;
-			cond_resched();
-		}
+		if (count < sz)
+			sz = count;
+		clear_pmem(addr, sz);
+		addr += sz;
+		size -= sz;
+		BUG_ON(sz & 511);
+		sector += sz / 512;
+		cond_resched();
 	} while (size);
 
 	wmb_pmem();
```