[v3,02/15] dax: increase granularity of dax_clear_blocks() operations

Message ID 20151102042952.6610.7185.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
State Superseded

Commit Message

Dan Williams Nov. 2, 2015, 4:29 a.m. UTC
dax_clear_blocks() currently performs a cond_resched() after every
PAGE_SIZE memset.  We need not check so frequently; md-raid, for example,
only calls cond_resched() at stripe granularity.  Also, in preparation
for introducing a dax_map_atomic() operation that temporarily pins a dax
mapping, move the call to cond_resched() to the outer loop.

Reviewed-by: Jan Kara <jack@suse.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c |   27 ++++++++++++---------------
 1 file changed, 12 insertions(+), 15 deletions(-)

Comments

Dave Chinner Nov. 3, 2015, 12:51 a.m. UTC | #1
On Sun, Nov 01, 2015 at 11:29:53PM -0500, Dan Williams wrote:
> dax_clear_blocks() currently performs a cond_resched() after every
> PAGE_SIZE memset.  We need not check so frequently; md-raid, for example,
> only calls cond_resched() at stripe granularity.  Also, in preparation
> for introducing a dax_map_atomic() operation that temporarily pins a dax
> mapping, move the call to cond_resched() to the outer loop.
> 
> Reviewed-by: Jan Kara <jack@suse.com>
> Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  fs/dax.c |   27 ++++++++++++---------------
>  1 file changed, 12 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 5dc33d788d50..f8e543839e5c 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -28,6 +28,7 @@
>  #include <linux/sched.h>
>  #include <linux/uio.h>
>  #include <linux/vmstat.h>
> +#include <linux/sizes.h>
>  
>  int dax_clear_blocks(struct inode *inode, sector_t block, long size)
>  {
> @@ -38,24 +39,20 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
>  	do {
>  		void __pmem *addr;
>  		unsigned long pfn;
> -		long count;
> +		long count, sz;
>  
> -		count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
> +		sz = min_t(long, size, SZ_1M);
> +		count = bdev_direct_access(bdev, sector, &addr, &pfn, sz);
>  		if (count < 0)
>  			return count;
> -		BUG_ON(size < count);
> -		while (count > 0) {
> -			unsigned pgsz = PAGE_SIZE - offset_in_page(addr);
> -			if (pgsz > count)
> -				pgsz = count;
> -			clear_pmem(addr, pgsz);
> -			addr += pgsz;
> -			size -= pgsz;
> -			count -= pgsz;
> -			BUG_ON(pgsz & 511);
> -			sector += pgsz / 512;
> -			cond_resched();
> -		}
> +		if (count < sz)
> +			sz = count;
> +		clear_pmem(addr, sz);
> +		addr += sz;
> +		size -= sz;
> +		BUG_ON(sz & 511);
> +		sector += sz / 512;
> +		cond_resched();
>  	} while (size);
>  
>  	wmb_pmem();

dax_clear_blocks() needs to go away and be replaced by a driver
level implementation of blkdev_issue_zeroout(). This is effectively a
block device operation (we're taking sector addresses and zeroing
them), so it really belongs in the pmem drivers rather than the DAX
code.

I suspect a REQ_WRITE_SAME implementation is the way to go here, as
then the filesystems can just call sb_issue_zeroout() and the block
layer zeroing will work on all types of storage without the
filesystem having to care whether DAX is in use or not.
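
For illustration, a minimal sketch of the call site this would enable.
The helper name fs_zero_new_extent() is hypothetical; sb_issue_zeroout()
is the existing block layer interface:

#include <linux/fs.h>
#include <linux/blkdev.h>

/* Hypothetical filesystem call site: zero a newly allocated extent
 * through the generic block layer instead of calling dax_clear_blocks().
 * The block layer can route this to WRITE_SAME or a driver-level
 * zeroing fast path when one exists, and falls back to writing zero
 * pages otherwise. */
static int fs_zero_new_extent(struct super_block *sb, sector_t block,
			      sector_t nr_blocks)
{
	return sb_issue_zeroout(sb, block, nr_blocks, GFP_NOFS);
}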

Putting the implementation of the zeroing in the pmem drivers will
enable the drivers to optimise the caching behaviour of block
zeroing.  The synchronous cache flushing behaviour of this function
is a performance killer as we are now block zeroing on allocation
and that results in two synchronous data writes (zero on alloc,
commit, write data, commit) for each page.

The zeroing (and the data, for that matter) doesn't need to be
committed to persistent store until the allocation is written and
committed to the journal - that will happen with a REQ_FLUSH|REQ_FUA
write, so it makes sense to deploy the big hammer and delay the
blocking CPU cache flushes until the last possible moment in cases
like this.

Cheers,

Dave.
Dan Williams Nov. 3, 2015, 3:27 a.m. UTC | #2
On Mon, Nov 2, 2015 at 4:51 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Sun, Nov 01, 2015 at 11:29:53PM -0500, Dan Williams wrote:
>> dax_clear_blocks() currently performs a cond_resched() after every
>> PAGE_SIZE memset.  We need not check so frequently; md-raid, for example,
>> only calls cond_resched() at stripe granularity.  Also, in preparation
>> for introducing a dax_map_atomic() operation that temporarily pins a dax
>> mapping, move the call to cond_resched() to the outer loop.
>>
>> Reviewed-by: Jan Kara <jack@suse.com>
>> Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  fs/dax.c |   27 ++++++++++++---------------
>>  1 file changed, 12 insertions(+), 15 deletions(-)
>>
>> diff --git a/fs/dax.c b/fs/dax.c
>> index 5dc33d788d50..f8e543839e5c 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -28,6 +28,7 @@
>>  #include <linux/sched.h>
>>  #include <linux/uio.h>
>>  #include <linux/vmstat.h>
>> +#include <linux/sizes.h>
>>
>>  int dax_clear_blocks(struct inode *inode, sector_t block, long size)
>>  {
>> @@ -38,24 +39,20 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
>>       do {
>>               void __pmem *addr;
>>               unsigned long pfn;
>> -             long count;
>> +             long count, sz;
>>
>> -             count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
>> +             sz = min_t(long, size, SZ_1M);
>> +             count = bdev_direct_access(bdev, sector, &addr, &pfn, sz);
>>               if (count < 0)
>>                       return count;
>> -             BUG_ON(size < count);
>> -             while (count > 0) {
>> -                     unsigned pgsz = PAGE_SIZE - offset_in_page(addr);
>> -                     if (pgsz > count)
>> -                             pgsz = count;
>> -                     clear_pmem(addr, pgsz);
>> -                     addr += pgsz;
>> -                     size -= pgsz;
>> -                     count -= pgsz;
>> -                     BUG_ON(pgsz & 511);
>> -                     sector += pgsz / 512;
>> -                     cond_resched();
>> -             }
>> +             if (count < sz)
>> +                     sz = count;
>> +             clear_pmem(addr, sz);
>> +             addr += sz;
>> +             size -= sz;
>> +             BUG_ON(sz & 511);
>> +             sector += sz / 512;
>> +             cond_resched();
>>       } while (size);
>>
>>       wmb_pmem();
>
> dax_clear_blocks() needs to go away and be replaced by a driver
> level implementation of blkdev_issue_zeroout(). This is effectively a
> block device operation (we're taking sector addresses and zeroing
> them), so it really belongs in the pmem drivers rather than the DAX
> code.
>
> I suspect a REQ_WRITE_SAME implementation is the way to go here, as
> then the filesystems can just call sb_issue_zeroout() and the block
> layer zeroing will work on all types of storage without the
> filesystem having to care whether DAX is in use or not.
>
> Putting the implementation of the zeroing in the pmem drivers will
> enable the drivers to optimise the caching behaviour of block
> zeroing.  The synchronous cache flushing behaviour of this function
> is a performance killer as we are now block zeroing on allocation
> and that results in two synchronous data writes (zero on alloc,
> commit, write data, commit) for each page.
>
> The zeroing (and the data, for that matter) doesn't need to be
> committed to persistent store until the allocation is written and
> committed to the journal - that will happen with a REQ_FLUSH|REQ_FUA
> write, so it makes sense to deploy the big hammer and delay the
> blocking CPU cache flushes until the last possible moment in cases
> like this.

In pmem terms that would be a non-temporal memset plus a delayed
wmb_pmem at REQ_FLUSH time.  Better to write around the cache than
loop over the dirty-data issuing flushes after the fact.  We'll bump
the priority of the non-temporal memset implementation.
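
As a sketch of that split -- pmem_clear_range() and pmem_handle_flush()
are hypothetical names, while clear_pmem() and wmb_pmem() are the pmem
helpers of this era, assuming clear_pmem() gains a non-temporal
implementation:

#include <linux/pmem.h>

/* Zeroing path: with non-temporal stores there is no per-range cache
 * flushing; the stores bypass the CPU cache entirely. */
static void pmem_clear_range(void __pmem *addr, size_t len)
{
	clear_pmem(addr, len);
}

/* Deferred commit point, run from the driver's REQ_FLUSH handling: a
 * single store fence (sfence on x86) drains the non-temporal write
 * buffers and makes the prior stores durable. */
static void pmem_handle_flush(void)
{
	wmb_pmem();
}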

I like the idea of pushing this down into the driver vs your other
feedback of pushing dirty extent tracking up into the radix... but
more on that in the other thread.
Dave Chinner Nov. 3, 2015, 4:48 a.m. UTC | #3
On Mon, Nov 02, 2015 at 07:27:26PM -0800, Dan Williams wrote:
> On Mon, Nov 2, 2015 at 4:51 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Sun, Nov 01, 2015 at 11:29:53PM -0500, Dan Williams wrote:
> > The zeroing (and the data, for that matter) doesn't need to be
> > committed to persistent store until the allocation is written and
> > committed to the journal - that will happen with a REQ_FLUSH|REQ_FUA
> > write, so it makes sense to deploy the big hammer and delay the
> > blocking CPU cache flushes until the last possible moment in cases
> > like this.
> 
> In pmem terms that would be a non-temporal memset plus a delayed
> wmb_pmem at REQ_FLUSH time.  Better to write around the cache than
> loop over the dirty-data issuing flushes after the fact.  We'll bump
> the priority of the non-temporal memset implementation.

Why is it better to do two synchronous physical writes to memory
within a couple of microseconds of CPU time rather than writing them
through the cache and, in most cases, only doing one physical write
to memory in a separate context that expects to wait for a flush
to complete?

Cheers,

Dave.
Dan Williams Nov. 3, 2015, 5:31 a.m. UTC | #4
On Mon, Nov 2, 2015 at 8:48 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Nov 02, 2015 at 07:27:26PM -0800, Dan Williams wrote:
>> On Mon, Nov 2, 2015 at 4:51 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Sun, Nov 01, 2015 at 11:29:53PM -0500, Dan Williams wrote:
>> > The zeroing (and the data, for that matter) doesn't need to be
>> > committed to persistent store until the allocation is written and
>> > committed to the journal - that will happen with a REQ_FLUSH|REQ_FUA
>> > write, so it makes sense to deploy the big hammer and delay the
>> > blocking CPU cache flushes until the last possible moment in cases
>> > like this.
>>
>> In pmem terms that would be a non-temporal memset plus a delayed
>> wmb_pmem at REQ_FLUSH time.  Better to write around the cache than
>> loop over the dirty-data issuing flushes after the fact.  We'll bump
>> the priority of the non-temporal memset implementation.
>
> Why is it better to do two synchronous physical writes to memory
> within a couple of microseconds of CPU time rather than writing them
> through the cache and, in most cases, only doing one physical write
> to memory in a separate context that expects to wait for a flush
> to complete?

With a switch to non-temporal writes they wouldn't be synchronous,
although it's doubtful that the subsequent writes after zeroing would
also hit the store buffer.

If we had a method to flush by physical-cache-way rather than by
virtual address then it would indeed be better to save up for one
final flush, but when we need to resort to looping through all the
virtual addresses that might have been touched, it gets expensive.
Dave Chinner Nov. 3, 2015, 5:52 a.m. UTC | #5
On Mon, Nov 02, 2015 at 09:31:11PM -0800, Dan Williams wrote:
> On Mon, Nov 2, 2015 at 8:48 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Nov 02, 2015 at 07:27:26PM -0800, Dan Williams wrote:
> >> On Mon, Nov 2, 2015 at 4:51 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > On Sun, Nov 01, 2015 at 11:29:53PM -0500, Dan Williams wrote:
> >> > The zeroing (and the data, for that matter) doesn't need to be
> >> > committed to persistent store until the allocation is written and
> >> > committed to the journal - that will happen with a REQ_FLUSH|REQ_FUA
> >> > write, so it makes sense to deploy the big hammer and delay the
> >> > blocking CPU cache flushes until the last possible moment in cases
> >> > like this.
> >>
> >> In pmem terms that would be a non-temporal memset plus a delayed
> >> wmb_pmem at REQ_FLUSH time.  Better to write around the cache than
> >> loop over the dirty-data issuing flushes after the fact.  We'll bump
> >> the priority of the non-temporal memset implementation.
> >
> > Why is it better to do two synchronous physical writes to memory
> > within a couple of microseconds of CPU time rather than writing them
> > through the cache and, in most cases, only doing one physical write
> > to memory in a separate context that expects to wait for a flush
> > to complete?
> 
> With a switch to non-temporal writes they wouldn't be synchronous,
> although it's doubtful that the subsequent writes after zeroing would
> also hit the store buffer.
> 
> If we had a method to flush by physical-cache-way rather than by
> virtual address then it would indeed be better to save up for one
> final flush, but when we need to resort to looping through all the
> virtual addresses that might have been touched, it gets expensive.

msync() is for flushing userspace mmap address ranges back to
physical memory. fsync() is for flushing kernel addresses (i.e. as
returned by bdev_direct_access()) back to physical addresses.
msync() calls ->fsync() as part of its operation; fsync() does not
care about whether mmap has been sync'd first or not.

i.e. we don't care about random dirty userspace virtual mappings in
fsync() - if you have them then you need to call msync() first. So
we shouldn't ever have to walk virtual addresses in fsync -
just the kaddr returned by bdev_direct_access() is all that fsync
needs to flush...
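
As a sketch of what that implies -- dax_writeback_range() is a
hypothetical helper, and wb_cache_pmem() is the writeback primitive
proposed in Ross's fsync/msync series:

#include <linux/blkdev.h>
#include <linux/pmem.h>

/* Hypothetical fsync-side helper: write back the kernel alias of a
 * sector range.  Only the kaddr returned by bdev_direct_access() is
 * walked; dirty userspace virtual mappings are msync()'s problem. */
static int dax_writeback_range(struct block_device *bdev, sector_t sector,
			       long size)
{
	void __pmem *kaddr;
	unsigned long pfn;
	long len;

	len = bdev_direct_access(bdev, sector, &kaddr, &pfn, size);
	if (len < 0)
		return len;
	wb_cache_pmem(kaddr, len);	/* flush by kernel address */
	return 0;
}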

Cheers,

Dave.
Dan Williams Nov. 3, 2015, 7:24 a.m. UTC | #6
On Mon, Nov 2, 2015 at 9:52 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Nov 02, 2015 at 09:31:11PM -0800, Dan Williams wrote:
>> On Mon, Nov 2, 2015 at 8:48 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Mon, Nov 02, 2015 at 07:27:26PM -0800, Dan Williams wrote:
>> >> On Mon, Nov 2, 2015 at 4:51 PM, Dave Chinner <david@fromorbit.com> wrote:
>> >> > On Sun, Nov 01, 2015 at 11:29:53PM -0500, Dan Williams wrote:
>> >> > The zeroing (and the data, for that matter) doesn't need to be
>> >> > committed to persistent store until the allocation is written and
>> >> > committed to the journal - that will happen with a REQ_FLUSH|REQ_FUA
>> >> > write, so it makes sense to deploy the big hammer and delay the
>> >> > blocking CPU cache flushes until the last possible moment in cases
>> >> > like this.
>> >>
>> >> In pmem terms that would be a non-temporal memset plus a delayed
>> >> wmb_pmem at REQ_FLUSH time.  Better to write around the cache than
>> >> loop over the dirty-data issuing flushes after the fact.  We'll bump
>> >> the priority of the non-temporal memset implementation.
>> >
>> > Why is it better to do two synchronous physical writes to memory
>> > within a couple of microseconds of CPU time rather than writing them
>> > through the cache and, in most cases, only doing one physical write
>> > to memory in a separate context that expects to wait for a flush
>> > to complete?
>>
>> With a switch to non-temporal writes they wouldn't be synchronous,
>> although it's doubtful that the subsequent writes after zeroing would
>> also hit the store buffer.
>>
>> If we had a method to flush by physical-cache-way rather than by
>> virtual address then it would indeed be better to save up for one
>> final flush, but when we need to resort to looping through all the
>> virtual addresses that might have been touched, it gets expensive.
>
> msync() is for flushing userspace mmap address ranges back to
> physical memory. fsync() is for flushing kernel addresses (i.e. as
> returned by bdev_direct_access()) back to physical addresses.
> msync() calls ->fsync() as part of its operation; fsync() does not
> care about whether mmap has been sync'd first or not.
>
> i.e. we don't care about random dirty userspace virtual mappings in
> fsync() - if you have them then you need to call msync() first. So
> we shouldn't ever have to walk virtual addresses in fsync -
> just the kaddr returned by bdev_direct_access() is all that fsync
> needs to flush...
>

Neither Ross's solution nor mine uses userspace addresses.  Which
comment of mine were you reacting to?
Ross Zwisler Nov. 3, 2015, 5:57 p.m. UTC | #7
On Mon, Nov 02, 2015 at 09:31:11PM -0800, Dan Williams wrote:
> On Mon, Nov 2, 2015 at 8:48 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Nov 02, 2015 at 07:27:26PM -0800, Dan Williams wrote:
> >> On Mon, Nov 2, 2015 at 4:51 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > On Sun, Nov 01, 2015 at 11:29:53PM -0500, Dan Williams wrote:
> >> > The zeroing (and the data, for that matter) doesn't need to be
> >> > committed to persistent store until the allocation is written and
> >> > committed to the journal - that will happen with a REQ_FLUSH|REQ_FUA
> >> > write, so it makes sense to deploy the big hammer and delay the
> >> > blocking CPU cache flushes until the last possible moment in cases
> >> > like this.
> >>
> >> In pmem terms that would be a non-temporal memset plus a delayed
> >> wmb_pmem at REQ_FLUSH time.  Better to write around the cache than
> >> loop over the dirty-data issuing flushes after the fact.  We'll bump
> >> the priority of the non-temporal memset implementation.
> >
> > Why is it better to do two synchronous physical writes to memory
> > within a couple of microseconds of CPU time rather than writing them
> > through the cache and, in most cases, only doing one physical write
> > to memory in a separate context that expects to wait for a flush
> > to complete?
> 
> With a switch to non-temporal writes they wouldn't be synchronous,
> although it's doubtful that the subsequent writes after zeroing would
> also hit the store buffer.
> 
> If we had a method to flush by physical-cache-way rather than by
> virtual address then it would indeed be better to save up for one
> final flush, but when we need to resort to looping through all the
> virtual addresses that might have been touched, it gets expensive.

I agree with the idea that we should avoid the "big hammer" flushing in
response to REQ_FLUSH.  Here are the steps that are needed to make sure that
something is durable on media with PMEM/DAX:

1) Write, either with non-temporal stores or with stores that use the
processor cache

2) If you wrote using the processor cache, flush or write back the processor
cache

3) wmb_pmem(), synchronizing all non-temporal writes and flushes durably to
media.

PMEM does all I/O using 1 and 3 with non-temporal stores, and mmaps that go to
userspace can use cached writes, so on fsync/msync we do a bunch of flushes
for step 2.  In either case I think we should have the PMEM driver just do
step 3, the wmb_pmem(), in response to REQ_FLUSH.  This allows the zeroing
code to just do non-temporal writes of zeros, the DAX fsync/msync code to just
do flushes (which is what my patch set already does), and just leave the
wmb_pmem() to the PMEM driver at REQ_FLUSH time.

This makes the burden of REQ_FLUSH bearable for the PMEM driver, allowing us
to avoid looping through potentially terabytes of PMEM on each REQ_FLUSH bio.

This just means that the layers above the PMEM code either need to use
non-temporal writes for their I/Os, or do flushing, which I don't think is too
onerous.
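
Concretely, the two write paths reduce to something like this --
pmem_write_nt() and pmem_write_cached() are illustrative names;
memcpy_to_pmem() and wmb_pmem() are current pmem helpers, and
wb_cache_pmem() is from the proposed fsync/msync series:

#include <linux/pmem.h>
#include <linux/string.h>

/* Path A: non-temporal stores -- steps 1 and 3 only. */
static void pmem_write_nt(void __pmem *dst, const void *src, size_t n)
{
	memcpy_to_pmem(dst, src, n);	/* step 1: write around the cache */
	wmb_pmem();			/* step 3: fence; durable on media */
}

/* Path B: cached stores -- all three steps. */
static void pmem_write_cached(void __pmem *dst, const void *src, size_t n)
{
	memcpy((void __force *)dst, src, n);	/* step 1: cached stores */
	wb_cache_pmem(dst, n);			/* step 2: write back cachelines */
	wmb_pmem();				/* step 3: fence; durable on media */
}

Under the proposal above, the final wmb_pmem() in each path would move
into the PMEM driver's REQ_FLUSH handling rather than being issued
per-write.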
Dave Chinner Nov. 3, 2015, 8:59 p.m. UTC | #8
On Tue, Nov 03, 2015 at 10:57:57AM -0700, Ross Zwisler wrote:
> On Mon, Nov 02, 2015 at 09:31:11PM -0800, Dan Williams wrote:
> > On Mon, Nov 2, 2015 at 8:48 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > On Mon, Nov 02, 2015 at 07:27:26PM -0800, Dan Williams wrote:
> > >> On Mon, Nov 2, 2015 at 4:51 PM, Dave Chinner <david@fromorbit.com> wrote:
> > >> > On Sun, Nov 01, 2015 at 11:29:53PM -0500, Dan Williams wrote:
> > >> > The zeroing (and the data, for that matter) doesn't need to be
> > >> > committed to persistent store until the allocation is written and
> > >> > committed to the journal - that will happen with a REQ_FLUSH|REQ_FUA
> > >> > write, so it makes sense to deploy the big hammer and delay the
> > >> > blocking CPU cache flushes until the last possible moment in cases
> > >> > like this.
> > >>
> > >> In pmem terms that would be a non-temporal memset plus a delayed
> > >> wmb_pmem at REQ_FLUSH time.  Better to write around the cache than
> > >> loop over the dirty-data issuing flushes after the fact.  We'll bump
> > >> the priority of the non-temporal memset implementation.
> > >
> > > Why is it better to do two synchronous physical writes to memory
> > > within a couple of microseconds of CPU time rather than writing them
> > > through the cache and, in most cases, only doing one physical write
> > > to memory in a separate context that expects to wait for a flush
> > > to complete?
> > 
> > With a switch to non-temporal writes they wouldn't be synchronous,
> > although it's doubtful that the subsequent writes after zeroing would
> > also hit the store buffer.
> > 
> > If we had a method to flush by physical-cache-way rather than by
> > virtual address then it would indeed be better to save up for one
> > final flush, but when we need to resort to looping through all the
> > virtual addresses that might have been touched, it gets expensive.
> 
> I agree with the idea that we should avoid the "big hammer" flushing in
> response to REQ_FLUSH.  Here are the steps that are needed to make sure that
> something is durable on media with PMEM/DAX:
> 
> 1) Write, either with non-temporal stores or with stores that use the
> processor cache
> 
> 2) If you wrote using the processor cache, flush or write back the processor
> cache
> 
> 3) wmb_pmem(), synchronizing all non-temporal writes and flushes durably to
> media.

Right, and when you look at buffered IO, we have:

1) write to page cache, mark page dirty
2) if you have dirty cached pages, flush dirty pages to device
3) REQ_FLUSH causes everything to be durable.

> PMEM does all I/O using 1 and 3 with non-temporal stores, and mmaps that go to
> userspace can use cached writes, so on fsync/msync we do a bunch of flushes
> for step 2.  In either case I think we should have the PMEM driver just do
> step 3, the wmb_pmem(), in response to REQ_FLUSH.  This allows the zeroing
> code to just do non-temporal writes of zeros, the DAX fsync/msync code to just
> do flushes (which is what my patch set already does), and just leave the
> wmb_pmem() to the PMEM driver at REQ_FLUSH time.
>
> This just means that the layers above the PMEM code either need to use
> non-temporal writes for their I/Os, or do flushing, which I don't think is too
> onerous.

Agreed - it fits neatly into the existing infrastructure and
algorithms, and there's no evidence to suggest that using that
infrastructure will cause undue burden on PMEM based workloads.
Hence I really think this is the right way to proceed...

Cheers,

Dave.
Patch

diff --git a/fs/dax.c b/fs/dax.c
index 5dc33d788d50..f8e543839e5c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -28,6 +28,7 @@ 
 #include <linux/sched.h>
 #include <linux/uio.h>
 #include <linux/vmstat.h>
+#include <linux/sizes.h>
 
 int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 {
@@ -38,24 +39,20 @@  int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 	do {
 		void __pmem *addr;
 		unsigned long pfn;
-		long count;
+		long count, sz;
 
-		count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
+		sz = min_t(long, size, SZ_1M);
+		count = bdev_direct_access(bdev, sector, &addr, &pfn, sz);
 		if (count < 0)
 			return count;
-		BUG_ON(size < count);
-		while (count > 0) {
-			unsigned pgsz = PAGE_SIZE - offset_in_page(addr);
-			if (pgsz > count)
-				pgsz = count;
-			clear_pmem(addr, pgsz);
-			addr += pgsz;
-			size -= pgsz;
-			count -= pgsz;
-			BUG_ON(pgsz & 511);
-			sector += pgsz / 512;
-			cond_resched();
-		}
+		if (count < sz)
+			sz = count;
+		clear_pmem(addr, sz);
+		addr += sz;
+		size -= sz;
+		BUG_ON(sz & 511);
+		sector += sz / 512;
+		cond_resched();
 	} while (size);
 
 	wmb_pmem();