[v5,2/8] drivers/pmem: Allow pmem_clear_poison() to accept arbitrary offset and len
diff mbox series

Message ID 20200218214841.10076-3-vgoyal@redhat.com
State New
Headers show
Series
  • dax/pmem: Provide a dax operation to zero range of memory
Related show

Commit Message

Vivek Goyal Feb. 18, 2020, 9:48 p.m. UTC
Currently pmem_clear_poison() expects offset and len to be sector aligned.
Atleast that seems to be the assumption with which code has been written.
It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
and pmem_make_request() which will only passe sector aligned offset and len.

Soon we want use this function from dax_zero_page_range() code path which
can try to zero arbitrary range of memory with-in a page. So update this
function to assume that offset and length can be arbitrary and do the
necessary alignments as needed.

nvdimm_clear_poison() seems to assume offset and len to be aligned to
clear_err_unit boundary. But this is currently internal detail and is
not exported for others to use. So for now, continue to align offset and
length to SECTOR_SIZE boundary. Improving it further and to align it
to clear_err_unit boundary is a TODO item for future.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 drivers/nvdimm/pmem.c | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

Comments

hch@infradead.org Feb. 20, 2020, 4:17 p.m. UTC | #1
On Tue, Feb 18, 2020 at 04:48:35PM -0500, Vivek Goyal wrote:
> Currently pmem_clear_poison() expects offset and len to be sector aligned.
> Atleast that seems to be the assumption with which code has been written.
> It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
> and pmem_make_request() which will only passe sector aligned offset and len.
> 
> Soon we want use this function from dax_zero_page_range() code path which
> can try to zero arbitrary range of memory with-in a page. So update this
> function to assume that offset and length can be arbitrary and do the
> necessary alignments as needed.
> 
> nvdimm_clear_poison() seems to assume offset and len to be aligned to
> clear_err_unit boundary. But this is currently internal detail and is
> not exported for others to use. So for now, continue to align offset and
> length to SECTOR_SIZE boundary. Improving it further and to align it
> to clear_err_unit boundary is a TODO item for future.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>

This looks sensibel to me, but I'd really like to have Dan take at look.
Jeff Moyer Feb. 20, 2020, 9:35 p.m. UTC | #2
Vivek Goyal <vgoyal@redhat.com> writes:

> Currently pmem_clear_poison() expects offset and len to be sector aligned.
> Atleast that seems to be the assumption with which code has been written.
> It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
> and pmem_make_request() which will only passe sector aligned offset and len.
>
> Soon we want use this function from dax_zero_page_range() code path which
> can try to zero arbitrary range of memory with-in a page. So update this
> function to assume that offset and length can be arbitrary and do the
> necessary alignments as needed.

What caller will try to zero a range that is smaller than a sector?

> nvdimm_clear_poison() seems to assume offset and len to be aligned to
> clear_err_unit boundary. But this is currently internal detail and is
> not exported for others to use. So for now, continue to align offset and
> length to SECTOR_SIZE boundary. Improving it further and to align it
> to clear_err_unit boundary is a TODO item for future.

When there is a poisoned range of persistent memory, it is recorded by
the badblocks infrastructure, which currently operates on sectors.  So,
no matter what the error unit is for the hardware, we currently can't
record/report to userspace anything smaller than a sector, and so that
is what we expect when clearing errors.

Continuing on for completeness, we will currently not map a page with
badblocks into a process' address space.  So, let's say you have 256
bytes of bad pmem, we will tell you we've lost 512 bytes, and even if
you access a valid mmap()d address in the same page as the poisoned
memory, you will get a segfault.

Userspace can fix up the error by calling write(2) and friends to
provide new data, or by punching a hole and writing new data to the hole
(which may result in getting a new block, or reallocating the old block
and zeroing it, which will clear the error).

More comments below...

> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  drivers/nvdimm/pmem.c | 22 ++++++++++++++++++----
>  1 file changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 075b11682192..e72959203253 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -74,14 +74,28 @@ static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
>  	sector_t sector;
>  	long cleared;
>  	blk_status_t rc = BLK_STS_OK;
> +	phys_addr_t start_aligned, end_aligned;
> +	unsigned int len_aligned;
>  
> -	sector = (offset - pmem->data_offset) / 512;
> +	/*
> +	 * Callers can pass arbitrary offset and len. But nvdimm_clear_poison()
> +	 * expects memory offset and length to meet certain alignment
> +	 * restrction (clear_err_unit). Currently nvdimm does not export
                                                  ^^^^^^^^^^^^^^^^^^^^^^
> +	 * required alignment. So align offset and length to sector boundary

What is "nvdimm" in that sentence?  Because the nvdimm most certainly
does export the required alignment.  Perhaps you meant libnvdimm?

> +	 * before passing it to nvdimm_clear_poison().
> +	 */
> +	start_aligned = ALIGN(offset, SECTOR_SIZE);
> +	end_aligned = ALIGN_DOWN((offset + len), SECTOR_SIZE) - 1;
> +	len_aligned = end_aligned - start_aligned + 1;
> +
> +	sector = (start_aligned - pmem->data_offset) / 512;
>  
> -	cleared = nvdimm_clear_poison(dev, pmem->phys_addr + offset, len);
> -	if (cleared < len)
> +	cleared = nvdimm_clear_poison(dev, pmem->phys_addr + start_aligned,
> +				      len_aligned);
> +	if (cleared < len_aligned)
>  		rc = BLK_STS_IOERR;
>  	if (cleared > 0 && cleared / 512) {
> -		hwpoison_clear(pmem, pmem->phys_addr + offset, cleared);
> +		hwpoison_clear(pmem, pmem->phys_addr + start_aligned, cleared);
>  		cleared /= 512;
>  		dev_dbg(dev, "%#llx clear %ld sector%s\n",
>  				(unsigned long long) sector, cleared,

We could potentially support clearing less than a sector, but I'd have
to understand the use cases better before offerring implementation
suggestions.

-Jeff
Vivek Goyal Feb. 20, 2020, 9:57 p.m. UTC | #3
On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
> 
> > Currently pmem_clear_poison() expects offset and len to be sector aligned.
> > Atleast that seems to be the assumption with which code has been written.
> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
> > and pmem_make_request() which will only passe sector aligned offset and len.
> >
> > Soon we want use this function from dax_zero_page_range() code path which
> > can try to zero arbitrary range of memory with-in a page. So update this
> > function to assume that offset and length can be arbitrary and do the
> > necessary alignments as needed.
> 
> What caller will try to zero a range that is smaller than a sector?

Hi Jeff,

New dax zeroing interface (dax_zero_page_range()) can technically pass
a range which is less than a sector. Or which is bigger than a sector
but start and end are not aligned on sector boundaries.

At this point of time, all I care about is that case of an arbitrary
range is handeled well. So if a caller passes a range in, we figure
out subrange which is sector aligned in terms of start and end, and
clear poison on those sectors and ignore rest of the range. And
this itself will be an improvement over current behavior where 
nothing is cleared if I/O is not sector aligned.

> 
> > nvdimm_clear_poison() seems to assume offset and len to be aligned to
> > clear_err_unit boundary. But this is currently internal detail and is
> > not exported for others to use. So for now, continue to align offset and
> > length to SECTOR_SIZE boundary. Improving it further and to align it
> > to clear_err_unit boundary is a TODO item for future.
> 
> When there is a poisoned range of persistent memory, it is recorded by
> the badblocks infrastructure, which currently operates on sectors.  So,
> no matter what the error unit is for the hardware, we currently can't
> record/report to userspace anything smaller than a sector, and so that
> is what we expect when clearing errors.
> 
> Continuing on for completeness, we will currently not map a page with
> badblocks into a process' address space.  So, let's say you have 256
> bytes of bad pmem, we will tell you we've lost 512 bytes, and even if
> you access a valid mmap()d address in the same page as the poisoned
> memory, you will get a segfault.
> 
> Userspace can fix up the error by calling write(2) and friends to
> provide new data, or by punching a hole and writing new data to the hole
> (which may result in getting a new block, or reallocating the old block
> and zeroing it, which will clear the error).

Fair enough. I do not need poison clearing at finer granularity. It might
be needed once dev_dax path wants to clear poison. Not sure how exactly
that works.

> 
> More comments below...
> 
> > Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > ---
> >  drivers/nvdimm/pmem.c | 22 ++++++++++++++++++----
> >  1 file changed, 18 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > index 075b11682192..e72959203253 100644
> > --- a/drivers/nvdimm/pmem.c
> > +++ b/drivers/nvdimm/pmem.c
> > @@ -74,14 +74,28 @@ static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
> >  	sector_t sector;
> >  	long cleared;
> >  	blk_status_t rc = BLK_STS_OK;
> > +	phys_addr_t start_aligned, end_aligned;
> > +	unsigned int len_aligned;
> >  
> > -	sector = (offset - pmem->data_offset) / 512;
> > +	/*
> > +	 * Callers can pass arbitrary offset and len. But nvdimm_clear_poison()
> > +	 * expects memory offset and length to meet certain alignment
> > +	 * restrction (clear_err_unit). Currently nvdimm does not export
>                                                   ^^^^^^^^^^^^^^^^^^^^^^
> > +	 * required alignment. So align offset and length to sector boundary
> 
> What is "nvdimm" in that sentence?  Because the nvdimm most certainly
> does export the required alignment.  Perhaps you meant libnvdimm?

I meant nvdimm_clear_poison() function in drivers/nvdimm/bus.c. Whatever
it is called. It first queries alignement required (clear_err_unit) and
then makes sure range passed in meets that alignment requirement.

> 
> > +	 * before passing it to nvdimm_clear_poison().
> > +	 */
> > +	start_aligned = ALIGN(offset, SECTOR_SIZE);
> > +	end_aligned = ALIGN_DOWN((offset + len), SECTOR_SIZE) - 1;
> > +	len_aligned = end_aligned - start_aligned + 1;
> > +
> > +	sector = (start_aligned - pmem->data_offset) / 512;
> >  
> > -	cleared = nvdimm_clear_poison(dev, pmem->phys_addr + offset, len);
> > -	if (cleared < len)
> > +	cleared = nvdimm_clear_poison(dev, pmem->phys_addr + start_aligned,
> > +				      len_aligned);
> > +	if (cleared < len_aligned)
> >  		rc = BLK_STS_IOERR;
> >  	if (cleared > 0 && cleared / 512) {
> > -		hwpoison_clear(pmem, pmem->phys_addr + offset, cleared);
> > +		hwpoison_clear(pmem, pmem->phys_addr + start_aligned, cleared);
> >  		cleared /= 512;
> >  		dev_dbg(dev, "%#llx clear %ld sector%s\n",
> >  				(unsigned long long) sector, cleared,
> 
> We could potentially support clearing less than a sector, but I'd have
> to understand the use cases better before offerring implementation
> suggestions.

I don't need clearing less than a secotr. Once somebody needs it they
can implement it. All I am doing is making sure current logic is not
broken when dax_zero_page_range() starts using this logic and passes
an arbitrary range. We need to make sure we internally align I/O 
and carve out an aligned sub-range and pass that subrange to
nvdimm_clear_poison().

So if you can make sure I am not breaking things and new interface
will continue to clear poison on sector boundary, that will be great.

Thanks
Vivek
Jeff Moyer Feb. 21, 2020, 6:32 p.m. UTC | #4
Vivek Goyal <vgoyal@redhat.com> writes:

> On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
>> Vivek Goyal <vgoyal@redhat.com> writes:
>> 
>> > Currently pmem_clear_poison() expects offset and len to be sector aligned.
>> > Atleast that seems to be the assumption with which code has been written.
>> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
>> > and pmem_make_request() which will only passe sector aligned offset and len.
>> >
>> > Soon we want use this function from dax_zero_page_range() code path which
>> > can try to zero arbitrary range of memory with-in a page. So update this
>> > function to assume that offset and length can be arbitrary and do the
>> > necessary alignments as needed.
>> 
>> What caller will try to zero a range that is smaller than a sector?
>
> Hi Jeff,
>
> New dax zeroing interface (dax_zero_page_range()) can technically pass
> a range which is less than a sector. Or which is bigger than a sector
> but start and end are not aligned on sector boundaries.

Sure, but who will call it with misaligned ranges?

> At this point of time, all I care about is that case of an arbitrary
> range is handeled well. So if a caller passes a range in, we figure
> out subrange which is sector aligned in terms of start and end, and
> clear poison on those sectors and ignore rest of the range. And
> this itself will be an improvement over current behavior where 
> nothing is cleared if I/O is not sector aligned.

I don't think this makes sense.  The caller needs to know about the
blast radius of errors.  This is why I asked for a concrete example.
It might make more sense, for example, to return an error if not all of
the errors could be cleared.

>> > nvdimm_clear_poison() seems to assume offset and len to be aligned to
>> > clear_err_unit boundary. But this is currently internal detail and is
>> > not exported for others to use. So for now, continue to align offset and
>> > length to SECTOR_SIZE boundary. Improving it further and to align it
>> > to clear_err_unit boundary is a TODO item for future.
>> 
>> When there is a poisoned range of persistent memory, it is recorded by
>> the badblocks infrastructure, which currently operates on sectors.  So,
>> no matter what the error unit is for the hardware, we currently can't
>> record/report to userspace anything smaller than a sector, and so that
>> is what we expect when clearing errors.
>> 
>> Continuing on for completeness, we will currently not map a page with
>> badblocks into a process' address space.  So, let's say you have 256
>> bytes of bad pmem, we will tell you we've lost 512 bytes, and even if
>> you access a valid mmap()d address in the same page as the poisoned
>> memory, you will get a segfault.
>> 
>> Userspace can fix up the error by calling write(2) and friends to
>> provide new data, or by punching a hole and writing new data to the hole
>> (which may result in getting a new block, or reallocating the old block
>> and zeroing it, which will clear the error).
>
> Fair enough. I do not need poison clearing at finer granularity. It might
> be needed once dev_dax path wants to clear poison. Not sure how exactly
> that works.

It doesn't.  :)

>> > +	/*
>> > +	 * Callers can pass arbitrary offset and len. But nvdimm_clear_poison()
>> > +	 * expects memory offset and length to meet certain alignment
>> > +	 * restrction (clear_err_unit). Currently nvdimm does not export
>>                                                   ^^^^^^^^^^^^^^^^^^^^^^
>> > +	 * required alignment. So align offset and length to sector boundary
>> 
>> What is "nvdimm" in that sentence?  Because the nvdimm most certainly
>> does export the required alignment.  Perhaps you meant libnvdimm?
>
> I meant nvdimm_clear_poison() function in drivers/nvdimm/bus.c. Whatever
> it is called. It first queries alignement required (clear_err_unit) and
> then makes sure range passed in meets that alignment requirement.

My point was your comment is misleading.

>> We could potentially support clearing less than a sector, but I'd have
>> to understand the use cases better before offerring implementation
>> suggestions.
>
> I don't need clearing less than a secotr. Once somebody needs it they
> can implement it. All I am doing is making sure current logic is not
> broken when dax_zero_page_range() starts using this logic and passes
> an arbitrary range. We need to make sure we internally align I/O 

An arbitrary range is the same thing as less than a sector.  :)  Do you
know of an instance where the range will not be sector-aligned and sized?

> and carve out an aligned sub-range and pass that subrange to
> nvdimm_clear_poison().

And what happens to the rest?  The caller is left to trip over the
errors?  That sounds pretty terrible.  I really think there needs to be
an explicit contract here.

> So if you can make sure I am not breaking things and new interface
> will continue to clear poison on sector boundary, that will be great.

I think allowing arbitrary ranges /could/ break things.  How it breaks
things depends on what the caller is doing.

If ther eare no callers using the interface in this way, then I see no
need to relax the restriction.  I do think we could document it better.

-Jeff
Vivek Goyal Feb. 21, 2020, 8:17 p.m. UTC | #5
On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
> Vivek Goyal <vgoyal@redhat.com> writes:
> 
> > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> >> Vivek Goyal <vgoyal@redhat.com> writes:
> >> 
> >> > Currently pmem_clear_poison() expects offset and len to be sector aligned.
> >> > Atleast that seems to be the assumption with which code has been written.
> >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
> >> > and pmem_make_request() which will only passe sector aligned offset and len.
> >> >
> >> > Soon we want use this function from dax_zero_page_range() code path which
> >> > can try to zero arbitrary range of memory with-in a page. So update this
> >> > function to assume that offset and length can be arbitrary and do the
> >> > necessary alignments as needed.
> >> 
> >> What caller will try to zero a range that is smaller than a sector?
> >
> > Hi Jeff,
> >
> > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > a range which is less than a sector. Or which is bigger than a sector
> > but start and end are not aligned on sector boundaries.
> 
> Sure, but who will call it with misaligned ranges?

create a file foo.txt of size 4K and then truncate it.

"truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
4095.

I have also written a test for this.

https://github.com/rhvgoyal/misc/blob/master/pmem-tests/iomap-range-test.sh#L102

> 
> > At this point of time, all I care about is that case of an arbitrary
> > range is handeled well. So if a caller passes a range in, we figure
> > out subrange which is sector aligned in terms of start and end, and
> > clear poison on those sectors and ignore rest of the range. And
> > this itself will be an improvement over current behavior where 
> > nothing is cleared if I/O is not sector aligned.
> 
> I don't think this makes sense.  The caller needs to know about the
> blast radius of errors.  This is why I asked for a concrete example.
> It might make more sense, for example, to return an error if not all of
> the errors could be cleared.
> 
> >> > nvdimm_clear_poison() seems to assume offset and len to be aligned to
> >> > clear_err_unit boundary. But this is currently internal detail and is
> >> > not exported for others to use. So for now, continue to align offset and
> >> > length to SECTOR_SIZE boundary. Improving it further and to align it
> >> > to clear_err_unit boundary is a TODO item for future.
> >> 
> >> When there is a poisoned range of persistent memory, it is recorded by
> >> the badblocks infrastructure, which currently operates on sectors.  So,
> >> no matter what the error unit is for the hardware, we currently can't
> >> record/report to userspace anything smaller than a sector, and so that
> >> is what we expect when clearing errors.
> >> 
> >> Continuing on for completeness, we will currently not map a page with
> >> badblocks into a process' address space.  So, let's say you have 256
> >> bytes of bad pmem, we will tell you we've lost 512 bytes, and even if
> >> you access a valid mmap()d address in the same page as the poisoned
> >> memory, you will get a segfault.
> >> 
> >> Userspace can fix up the error by calling write(2) and friends to
> >> provide new data, or by punching a hole and writing new data to the hole
> >> (which may result in getting a new block, or reallocating the old block
> >> and zeroing it, which will clear the error).
> >
> > Fair enough. I do not need poison clearing at finer granularity. It might
> > be needed once dev_dax path wants to clear poison. Not sure how exactly
> > that works.
> 
> It doesn't.  :)
> 
> >> > +	/*
> >> > +	 * Callers can pass arbitrary offset and len. But nvdimm_clear_poison()
> >> > +	 * expects memory offset and length to meet certain alignment
> >> > +	 * restrction (clear_err_unit). Currently nvdimm does not export
> >>                                                   ^^^^^^^^^^^^^^^^^^^^^^
> >> > +	 * required alignment. So align offset and length to sector boundary
> >> 
> >> What is "nvdimm" in that sentence?  Because the nvdimm most certainly
> >> does export the required alignment.  Perhaps you meant libnvdimm?
> >
> > I meant nvdimm_clear_poison() function in drivers/nvdimm/bus.c. Whatever
> > it is called. It first queries alignement required (clear_err_unit) and
> > then makes sure range passed in meets that alignment requirement.
> 
> My point was your comment is misleading.
> 
> >> We could potentially support clearing less than a sector, but I'd have
> >> to understand the use cases better before offerring implementation
> >> suggestions.
> >
> > I don't need clearing less than a secotr. Once somebody needs it they
> > can implement it. All I am doing is making sure current logic is not
> > broken when dax_zero_page_range() starts using this logic and passes
> > an arbitrary range. We need to make sure we internally align I/O 
> 
> An arbitrary range is the same thing as less than a sector.  :)  Do you
> know of an instance where the range will not be sector-aligned and sized?
> 
> > and carve out an aligned sub-range and pass that subrange to
> > nvdimm_clear_poison().
> 
> And what happens to the rest?  The caller is left to trip over the
> errors?  That sounds pretty terrible.  I really think there needs to be
> an explicit contract here.

Ok, I think is is the contentious bit. Current interface
(__dax_zero_page_range()) either clears the poison (if I/O is aligned to
sector) or expects page to be free of poison.

So in above example, of "truncate -s 23 foo.txt", currently I get an error
because range being zeroed is not sector aligned. So
__dax_zero_page_range() falls back to calling direct_access(). Which
fails because there are poisoned sectors in the page.

With my patches, dax_zero_page_range(), clears the poison from sector 1 to
7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511
and returns success.

So question is, is this better behavior or worse behavior. If sector 0
was poisoned, it will continue to remain poisoned and caller will come
to know about it on next read and then it should try to truncate file
to length 0 or unlink file or restore that file to get rid of poison.

IOW, if a partial block is being zeroed and if it is poisoned, caller
will not be return an error and poison will not be cleared and memory
will be zeroed. What do we expect in such cases.

Do we expect an interface where if there are any bad blocks in the range
being zeroed, then they all should be cleared (and hence all I/O should
be aligned) otherwise error is returned. If yes, I could make that
change.

Downside of current interface is that it will clear as many blocks as
possible in the given range and leave starting and end blocks poisoned
(if it is unaligned) and not return error. That means a reader will
get error on these blocks again and they will have to try to clear it
again.

Thanks
Vivek
Dan Williams Feb. 21, 2020, 9 p.m. UTC | #6
On Fri, Feb 21, 2020 at 12:18 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
> > Vivek Goyal <vgoyal@redhat.com> writes:
> >
> > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> > >> Vivek Goyal <vgoyal@redhat.com> writes:
> > >>
> > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned.
> > >> > Atleast that seems to be the assumption with which code has been written.
> > >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
> > >> > and pmem_make_request() which will only passe sector aligned offset and len.
> > >> >
> > >> > Soon we want use this function from dax_zero_page_range() code path which
> > >> > can try to zero arbitrary range of memory with-in a page. So update this
> > >> > function to assume that offset and length can be arbitrary and do the
> > >> > necessary alignments as needed.
> > >>
> > >> What caller will try to zero a range that is smaller than a sector?
> > >
> > > Hi Jeff,
> > >
> > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > > a range which is less than a sector. Or which is bigger than a sector
> > > but start and end are not aligned on sector boundaries.
> >
> > Sure, but who will call it with misaligned ranges?
>
> create a file foo.txt of size 4K and then truncate it.
>
> "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> 4095.
>
> I have also written a test for this.
>
> https://github.com/rhvgoyal/misc/blob/master/pmem-tests/iomap-range-test.sh#L102
>
> >
> > > At this point of time, all I care about is that case of an arbitrary
> > > range is handeled well. So if a caller passes a range in, we figure
> > > out subrange which is sector aligned in terms of start and end, and
> > > clear poison on those sectors and ignore rest of the range. And
> > > this itself will be an improvement over current behavior where
> > > nothing is cleared if I/O is not sector aligned.
> >
> > I don't think this makes sense.  The caller needs to know about the
> > blast radius of errors.  This is why I asked for a concrete example.
> > It might make more sense, for example, to return an error if not all of
> > the errors could be cleared.
> >
> > >> > nvdimm_clear_poison() seems to assume offset and len to be aligned to
> > >> > clear_err_unit boundary. But this is currently internal detail and is
> > >> > not exported for others to use. So for now, continue to align offset and
> > >> > length to SECTOR_SIZE boundary. Improving it further and to align it
> > >> > to clear_err_unit boundary is a TODO item for future.
> > >>
> > >> When there is a poisoned range of persistent memory, it is recorded by
> > >> the badblocks infrastructure, which currently operates on sectors.  So,
> > >> no matter what the error unit is for the hardware, we currently can't
> > >> record/report to userspace anything smaller than a sector, and so that
> > >> is what we expect when clearing errors.
> > >>
> > >> Continuing on for completeness, we will currently not map a page with
> > >> badblocks into a process' address space.  So, let's say you have 256
> > >> bytes of bad pmem, we will tell you we've lost 512 bytes, and even if
> > >> you access a valid mmap()d address in the same page as the poisoned
> > >> memory, you will get a segfault.
> > >>
> > >> Userspace can fix up the error by calling write(2) and friends to
> > >> provide new data, or by punching a hole and writing new data to the hole
> > >> (which may result in getting a new block, or reallocating the old block
> > >> and zeroing it, which will clear the error).
> > >
> > > Fair enough. I do not need poison clearing at finer granularity. It might
> > > be needed once dev_dax path wants to clear poison. Not sure how exactly
> > > that works.
> >
> > It doesn't.  :)
> >
> > >> > +        /*
> > >> > +         * Callers can pass arbitrary offset and len. But nvdimm_clear_poison()
> > >> > +         * expects memory offset and length to meet certain alignment
> > >> > +         * restrction (clear_err_unit). Currently nvdimm does not export
> > >>                                                   ^^^^^^^^^^^^^^^^^^^^^^
> > >> > +         * required alignment. So align offset and length to sector boundary
> > >>
> > >> What is "nvdimm" in that sentence?  Because the nvdimm most certainly
> > >> does export the required alignment.  Perhaps you meant libnvdimm?
> > >
> > > I meant nvdimm_clear_poison() function in drivers/nvdimm/bus.c. Whatever
> > > it is called. It first queries alignement required (clear_err_unit) and
> > > then makes sure range passed in meets that alignment requirement.
> >
> > My point was your comment is misleading.
> >
> > >> We could potentially support clearing less than a sector, but I'd have
> > >> to understand the use cases better before offerring implementation
> > >> suggestions.
> > >
> > > I don't need clearing less than a secotr. Once somebody needs it they
> > > can implement it. All I am doing is making sure current logic is not
> > > broken when dax_zero_page_range() starts using this logic and passes
> > > an arbitrary range. We need to make sure we internally align I/O
> >
> > An arbitrary range is the same thing as less than a sector.  :)  Do you
> > know of an instance where the range will not be sector-aligned and sized?
> >
> > > and carve out an aligned sub-range and pass that subrange to
> > > nvdimm_clear_poison().
> >
> > And what happens to the rest?  The caller is left to trip over the
> > errors?  That sounds pretty terrible.  I really think there needs to be
> > an explicit contract here.
>
> Ok, I think is is the contentious bit. Current interface
> (__dax_zero_page_range()) either clears the poison (if I/O is aligned to
> sector) or expects page to be free of poison.
>
> So in above example, of "truncate -s 23 foo.txt", currently I get an error
> because range being zeroed is not sector aligned. So
> __dax_zero_page_range() falls back to calling direct_access(). Which
> fails because there are poisoned sectors in the page.
>
> With my patches, dax_zero_page_range(), clears the poison from sector 1 to
> 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511
> and returns success.
>
> So question is, is this better behavior or worse behavior. If sector 0
> was poisoned, it will continue to remain poisoned and caller will come
> to know about it on next read and then it should try to truncate file
> to length 0 or unlink file or restore that file to get rid of poison.
>
> IOW, if a partial block is being zeroed and if it is poisoned, caller
> will not be return an error and poison will not be cleared and memory
> will be zeroed. What do we expect in such cases.
>
> Do we expect an interface where if there are any bad blocks in the range
> being zeroed, then they all should be cleared (and hence all I/O should
> be aligned) otherwise error is returned. If yes, I could make that
> change.

This does not strike me as a good idea because it's a false security
compared to the latent poison case. If the writes to an unknown
poisoned location would otherwise succeed via a different I/O path
(dax), it's an unsymmetric surprise to start returning errors just
because you wrote zeroes as a side effect of truncate.

> Downside of current interface is that it will clear as many blocks as
> possible in the given range and leave starting and end blocks poisoned
> (if it is unaligned) and not return error. That means a reader will
> get error on these blocks again and they will have to try to clear it
> again.

I think what you have described in your truncate example is an
improvement on what we have currently because x86 does not communicate
write errors. Specifically, writing zeros via dax from userspace over
unknown poison behaves the same as writing unaligned zeros over known
poison. In both cases it's a best effort that always succeeds (no cpu
exception), and may inadvertently clear poison as a side-effect.
Otherwise, an error-block-aligned hole punch is the only way to
trigger the kernel to try to clear known poison when the full block is
reallocated.

On movdir64b capable cpus the error clearing unit becomes 64-bytes
rather than 256-bytes because that allows a cacheline to be written
without triggering a line fill read. So the error clearing granularity
gets better over time, but unfortunately not synchronous detection in
the I/O path.

I think a better way to improve poison handling is the long standing
idea to integrate the badblock tracking into the filesystem directly.
That way driver notifications of poison can be ingested into the
filesystem and notifications sent on filenames rather than the current
TOCTOU mess of trying to do a reverse lookup of badblock numbers to
files. If the application can efficiently list and be notified of
poison it can mitigate it immediately rather than trying to rely on
write side effects.
Vivek Goyal Feb. 21, 2020, 9:24 p.m. UTC | #7
On Fri, Feb 21, 2020 at 01:00:29PM -0800, Dan Williams wrote:
> On Fri, Feb 21, 2020 at 12:18 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
> > > Vivek Goyal <vgoyal@redhat.com> writes:
> > >
> > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> > > >> Vivek Goyal <vgoyal@redhat.com> writes:
> > > >>
> > > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned.
> > > >> > Atleast that seems to be the assumption with which code has been written.
> > > >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
> > > >> > and pmem_make_request() which will only passe sector aligned offset and len.
> > > >> >
> > > >> > Soon we want use this function from dax_zero_page_range() code path which
> > > >> > can try to zero arbitrary range of memory with-in a page. So update this
> > > >> > function to assume that offset and length can be arbitrary and do the
> > > >> > necessary alignments as needed.
> > > >>
> > > >> What caller will try to zero a range that is smaller than a sector?
> > > >
> > > > Hi Jeff,
> > > >
> > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > > > a range which is less than a sector. Or which is bigger than a sector
> > > > but start and end are not aligned on sector boundaries.
> > >
> > > Sure, but who will call it with misaligned ranges?
> >
> > create a file foo.txt of size 4K and then truncate it.
> >
> > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> > 4095.
> >
> > I have also written a test for this.
> >
> > https://github.com/rhvgoyal/misc/blob/master/pmem-tests/iomap-range-test.sh#L102
> >
> > >
> > > > At this point of time, all I care about is that case of an arbitrary
> > > > range is handeled well. So if a caller passes a range in, we figure
> > > > out subrange which is sector aligned in terms of start and end, and
> > > > clear poison on those sectors and ignore rest of the range. And
> > > > this itself will be an improvement over current behavior where
> > > > nothing is cleared if I/O is not sector aligned.
> > >
> > > I don't think this makes sense.  The caller needs to know about the
> > > blast radius of errors.  This is why I asked for a concrete example.
> > > It might make more sense, for example, to return an error if not all of
> > > the errors could be cleared.
> > >
> > > >> > nvdimm_clear_poison() seems to assume offset and len to be aligned to
> > > >> > clear_err_unit boundary. But this is currently internal detail and is
> > > >> > not exported for others to use. So for now, continue to align offset and
> > > >> > length to SECTOR_SIZE boundary. Improving it further and to align it
> > > >> > to clear_err_unit boundary is a TODO item for future.
> > > >>
> > > >> When there is a poisoned range of persistent memory, it is recorded by
> > > >> the badblocks infrastructure, which currently operates on sectors.  So,
> > > >> no matter what the error unit is for the hardware, we currently can't
> > > >> record/report to userspace anything smaller than a sector, and so that
> > > >> is what we expect when clearing errors.
> > > >>
> > > >> Continuing on for completeness, we will currently not map a page with
> > > >> badblocks into a process' address space.  So, let's say you have 256
> > > >> bytes of bad pmem, we will tell you we've lost 512 bytes, and even if
> > > >> you access a valid mmap()d address in the same page as the poisoned
> > > >> memory, you will get a segfault.
> > > >>
> > > >> Userspace can fix up the error by calling write(2) and friends to
> > > >> provide new data, or by punching a hole and writing new data to the hole
> > > >> (which may result in getting a new block, or reallocating the old block
> > > >> and zeroing it, which will clear the error).
> > > >
> > > > Fair enough. I do not need poison clearing at finer granularity. It might
> > > > be needed once dev_dax path wants to clear poison. Not sure how exactly
> > > > that works.
> > >
> > > It doesn't.  :)
> > >
> > > >> > +        /*
> > > >> > +         * Callers can pass arbitrary offset and len. But nvdimm_clear_poison()
> > > >> > +         * expects memory offset and length to meet certain alignment
> > > >> > +         * restrction (clear_err_unit). Currently nvdimm does not export
> > > >>                                                   ^^^^^^^^^^^^^^^^^^^^^^
> > > >> > +         * required alignment. So align offset and length to sector boundary
> > > >>
> > > >> What is "nvdimm" in that sentence?  Because the nvdimm most certainly
> > > >> does export the required alignment.  Perhaps you meant libnvdimm?
> > > >
> > > > I meant nvdimm_clear_poison() function in drivers/nvdimm/bus.c. Whatever
> > > > it is called. It first queries alignement required (clear_err_unit) and
> > > > then makes sure range passed in meets that alignment requirement.
> > >
> > > My point was your comment is misleading.
> > >
> > > >> We could potentially support clearing less than a sector, but I'd have
> > > >> to understand the use cases better before offerring implementation
> > > >> suggestions.
> > > >
> > > > I don't need clearing less than a secotr. Once somebody needs it they
> > > > can implement it. All I am doing is making sure current logic is not
> > > > broken when dax_zero_page_range() starts using this logic and passes
> > > > an arbitrary range. We need to make sure we internally align I/O
> > >
> > > An arbitrary range is the same thing as less than a sector.  :)  Do you
> > > know of an instance where the range will not be sector-aligned and sized?
> > >
> > > > and carve out an aligned sub-range and pass that subrange to
> > > > nvdimm_clear_poison().
> > >
> > > And what happens to the rest?  The caller is left to trip over the
> > > errors?  That sounds pretty terrible.  I really think there needs to be
> > > an explicit contract here.
> >
> > Ok, I think is is the contentious bit. Current interface
> > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to
> > sector) or expects page to be free of poison.
> >
> > So in above example, of "truncate -s 23 foo.txt", currently I get an error
> > because range being zeroed is not sector aligned. So
> > __dax_zero_page_range() falls back to calling direct_access(). Which
> > fails because there are poisoned sectors in the page.
> >
> > With my patches, dax_zero_page_range(), clears the poison from sector 1 to
> > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511
> > and returns success.
> >
> > So question is, is this better behavior or worse behavior. If sector 0
> > was poisoned, it will continue to remain poisoned and caller will come
> > to know about it on next read and then it should try to truncate file
> > to length 0 or unlink file or restore that file to get rid of poison.
> >
> > IOW, if a partial block is being zeroed and if it is poisoned, caller
> > will not be return an error and poison will not be cleared and memory
> > will be zeroed. What do we expect in such cases.
> >
> > Do we expect an interface where if there are any bad blocks in the range
> > being zeroed, then they all should be cleared (and hence all I/O should
> > be aligned) otherwise error is returned. If yes, I could make that
> > change.
> 
> This does not strike me as a good idea because it's a false security
> compared to the latent poison case. If the writes to an unknown
> poisoned location would otherwise succeed via a different I/O path
> (dax), it's an unsymmetric surprise to start returning errors just
> because you wrote zeroes as a side effect of truncate.
> 
> > Downside of current interface is that it will clear as many blocks as
> > possible in the given range and leave starting and end blocks poisoned
> > (if it is unaligned) and not return error. That means a reader will
> > get error on these blocks again and they will have to try to clear it
> > again.
> 
> I think what you have described in your truncate example is an
> improvement on what we have currently because x86 does not communicate
> write errors. Specifically, writing zeros via dax from userspace over
> unknown poison behaves the same as writing unaligned zeros over known
> poison. In both cases it's a best effort that always succeeds (no cpu
> exception), and may inadvertently clear poison as a side-effect.
> Otherwise, an error-block-aligned hole punch is the only way to
> trigger the kernel to try to clear known poison when the full block is
> reallocated.

Hi Dan,

Agreed. This new interface works uniformly for both known poison and latent
poison cases. Existing interface is asymmetric and that means if poison is
latent, unaligned zero range will succeed but if poison is known, unaligned
zero range will fail.

> 
> On movdir64b capable cpus the error clearing unit becomes 64-bytes
> rather than 256-bytes because that allows a cacheline to be written
> without triggering a line fill read. So the error clearing granularity
> gets better over time, but unfortunately not synchronous detection in
> the I/O path.
> 
> I think a better way to improve poison handling is the long standing
> idea to integrate the badblock tracking into the filesystem directly.
> That way driver notifications of poison can be ingested into the
> filesystem and notifications sent on filenames rather than the current
> TOCTOU mess of trying to do a reverse lookup of badblock numbers to
> files. If the application can efficiently list and be notified of
> poison it can mitigate it immediately rather than trying to rely on
> write side effects.

Moving badblocks infrastructure in filesystem sounds like a major
rework which should be taken up in a seprate patch series in future.

For now, can we please take these patches which are an improvement
over existing interface.

Thanks
Vivek
Dan Williams Feb. 21, 2020, 9:30 p.m. UTC | #8
On Fri, Feb 21, 2020 at 1:25 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Fri, Feb 21, 2020 at 01:00:29PM -0800, Dan Williams wrote:
> > On Fri, Feb 21, 2020 at 12:18 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
> > > > Vivek Goyal <vgoyal@redhat.com> writes:
> > > >
> > > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> > > > >> Vivek Goyal <vgoyal@redhat.com> writes:
> > > > >>
> > > > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned.
> > > > >> > Atleast that seems to be the assumption with which code has been written.
> > > > >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
> > > > >> > and pmem_make_request() which will only passe sector aligned offset and len.
> > > > >> >
> > > > >> > Soon we want use this function from dax_zero_page_range() code path which
> > > > >> > can try to zero arbitrary range of memory with-in a page. So update this
> > > > >> > function to assume that offset and length can be arbitrary and do the
> > > > >> > necessary alignments as needed.
> > > > >>
> > > > >> What caller will try to zero a range that is smaller than a sector?
> > > > >
> > > > > Hi Jeff,
> > > > >
> > > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > > > > a range which is less than a sector. Or which is bigger than a sector
> > > > > but start and end are not aligned on sector boundaries.
> > > >
> > > > Sure, but who will call it with misaligned ranges?
> > >
> > > create a file foo.txt of size 4K and then truncate it.
> > >
> > > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> > > 4095.
> > >
> > > I have also written a test for this.
> > >
> > > https://github.com/rhvgoyal/misc/blob/master/pmem-tests/iomap-range-test.sh#L102
> > >
> > > >
> > > > > At this point of time, all I care about is that case of an arbitrary
> > > > > range is handeled well. So if a caller passes a range in, we figure
> > > > > out subrange which is sector aligned in terms of start and end, and
> > > > > clear poison on those sectors and ignore rest of the range. And
> > > > > this itself will be an improvement over current behavior where
> > > > > nothing is cleared if I/O is not sector aligned.
> > > >
> > > > I don't think this makes sense.  The caller needs to know about the
> > > > blast radius of errors.  This is why I asked for a concrete example.
> > > > It might make more sense, for example, to return an error if not all of
> > > > the errors could be cleared.
> > > >
> > > > >> > nvdimm_clear_poison() seems to assume offset and len to be aligned to
> > > > >> > clear_err_unit boundary. But this is currently internal detail and is
> > > > >> > not exported for others to use. So for now, continue to align offset and
> > > > >> > length to SECTOR_SIZE boundary. Improving it further and to align it
> > > > >> > to clear_err_unit boundary is a TODO item for future.
> > > > >>
> > > > >> When there is a poisoned range of persistent memory, it is recorded by
> > > > >> the badblocks infrastructure, which currently operates on sectors.  So,
> > > > >> no matter what the error unit is for the hardware, we currently can't
> > > > >> record/report to userspace anything smaller than a sector, and so that
> > > > >> is what we expect when clearing errors.
> > > > >>
> > > > >> Continuing on for completeness, we will currently not map a page with
> > > > >> badblocks into a process' address space.  So, let's say you have 256
> > > > >> bytes of bad pmem, we will tell you we've lost 512 bytes, and even if
> > > > >> you access a valid mmap()d address in the same page as the poisoned
> > > > >> memory, you will get a segfault.
> > > > >>
> > > > >> Userspace can fix up the error by calling write(2) and friends to
> > > > >> provide new data, or by punching a hole and writing new data to the hole
> > > > >> (which may result in getting a new block, or reallocating the old block
> > > > >> and zeroing it, which will clear the error).
> > > > >
> > > > > Fair enough. I do not need poison clearing at finer granularity. It might
> > > > > be needed once dev_dax path wants to clear poison. Not sure how exactly
> > > > > that works.
> > > >
> > > > It doesn't.  :)
> > > >
> > > > >> > +        /*
> > > > >> > +         * Callers can pass arbitrary offset and len. But nvdimm_clear_poison()
> > > > >> > +         * expects memory offset and length to meet certain alignment
> > > > >> > +         * restrction (clear_err_unit). Currently nvdimm does not export
> > > > >>                                                   ^^^^^^^^^^^^^^^^^^^^^^
> > > > >> > +         * required alignment. So align offset and length to sector boundary
> > > > >>
> > > > >> What is "nvdimm" in that sentence?  Because the nvdimm most certainly
> > > > >> does export the required alignment.  Perhaps you meant libnvdimm?
> > > > >
> > > > > I meant nvdimm_clear_poison() function in drivers/nvdimm/bus.c. Whatever
> > > > > it is called. It first queries alignement required (clear_err_unit) and
> > > > > then makes sure range passed in meets that alignment requirement.
> > > >
> > > > My point was your comment is misleading.
> > > >
> > > > >> We could potentially support clearing less than a sector, but I'd have
> > > > >> to understand the use cases better before offerring implementation
> > > > >> suggestions.
> > > > >
> > > > > I don't need clearing less than a secotr. Once somebody needs it they
> > > > > can implement it. All I am doing is making sure current logic is not
> > > > > broken when dax_zero_page_range() starts using this logic and passes
> > > > > an arbitrary range. We need to make sure we internally align I/O
> > > >
> > > > An arbitrary range is the same thing as less than a sector.  :)  Do you
> > > > know of an instance where the range will not be sector-aligned and sized?
> > > >
> > > > > and carve out an aligned sub-range and pass that subrange to
> > > > > nvdimm_clear_poison().
> > > >
> > > > And what happens to the rest?  The caller is left to trip over the
> > > > errors?  That sounds pretty terrible.  I really think there needs to be
> > > > an explicit contract here.
> > >
> > > Ok, I think is is the contentious bit. Current interface
> > > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to
> > > sector) or expects page to be free of poison.
> > >
> > > So in above example, of "truncate -s 23 foo.txt", currently I get an error
> > > because range being zeroed is not sector aligned. So
> > > __dax_zero_page_range() falls back to calling direct_access(). Which
> > > fails because there are poisoned sectors in the page.
> > >
> > > With my patches, dax_zero_page_range(), clears the poison from sector 1 to
> > > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511
> > > and returns success.
> > >
> > > So question is, is this better behavior or worse behavior. If sector 0
> > > was poisoned, it will continue to remain poisoned and caller will come
> > > to know about it on next read and then it should try to truncate file
> > > to length 0 or unlink file or restore that file to get rid of poison.
> > >
> > > IOW, if a partial block is being zeroed and if it is poisoned, caller
> > > will not be return an error and poison will not be cleared and memory
> > > will be zeroed. What do we expect in such cases.
> > >
> > > Do we expect an interface where if there are any bad blocks in the range
> > > being zeroed, then they all should be cleared (and hence all I/O should
> > > be aligned) otherwise error is returned. If yes, I could make that
> > > change.
> >
> > This does not strike me as a good idea because it's a false security
> > compared to the latent poison case. If the writes to an unknown
> > poisoned location would otherwise succeed via a different I/O path
> > (dax), it's an unsymmetric surprise to start returning errors just
> > because you wrote zeroes as a side effect of truncate.
> >
> > > Downside of current interface is that it will clear as many blocks as
> > > possible in the given range and leave starting and end blocks poisoned
> > > (if it is unaligned) and not return error. That means a reader will
> > > get error on these blocks again and they will have to try to clear it
> > > again.
> >
> > I think what you have described in your truncate example is an
> > improvement on what we have currently because x86 does not communicate
> > write errors. Specifically, writing zeros via dax from userspace over
> > unknown poison behaves the same as writing unaligned zeros over known
> > poison. In both cases it's a best effort that always succeeds (no cpu
> > exception), and may inadvertently clear poison as a side-effect.
> > Otherwise, an error-block-aligned hole punch is the only way to
> > trigger the kernel to try to clear known poison when the full block is
> > reallocated.
>
> Hi Dan,
>
> Agreed. This new interface works uniformly for both known poison and latent
> poison cases. Existing interface is asymmetric and that means if poison is
> latent, unaligned zero range will succeed but if poison is known, unaligned
> zero range will fail.
>
> >
> > On movdir64b capable cpus the error clearing unit becomes 64-bytes
> > rather than 256-bytes because that allows a cacheline to be written
> > without triggering a line fill read. So the error clearing granularity
> > gets better over time, but unfortunately not synchronous detection in
> > the I/O path.
> >
> > I think a better way to improve poison handling is the long standing
> > idea to integrate the badblock tracking into the filesystem directly.
> > That way driver notifications of poison can be ingested into the
> > filesystem and notifications sent on filenames rather than the current
> > TOCTOU mess of trying to do a reverse lookup of badblock numbers to
> > files. If the application can efficiently list and be notified of
> > poison it can mitigate it immediately rather than trying to rely on
> > write side effects.
>
> Moving badblocks infrastructure in filesystem sounds like a major
> rework which should be taken up in a seprate patch series in future.
>
> For now, can we please take these patches which are an improvement
> over existing interface.

Oh you misunderstood my comment, the "move badblocks to filesystem"
proposal is long term / down the road thing to consider. In the near
term this unaligned block zeroing facility is an improvement.
Jeff Moyer Feb. 21, 2020, 9:33 p.m. UTC | #9
Dan Williams <dan.j.williams@intel.com> writes:

> Oh you misunderstood my comment, the "move badblocks to filesystem"
> proposal is long term / down the road thing to consider. In the near
> term this unaligned block zeroing facility is an improvement.

I'm not sure I agree.  I'm going to think about it and get back to you.

-Jeff
Dave Chinner Feb. 23, 2020, 11:03 p.m. UTC | #10
On Fri, Feb 21, 2020 at 03:17:59PM -0500, Vivek Goyal wrote:
> On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
> > Vivek Goyal <vgoyal@redhat.com> writes:
> > 
> > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> > >> Vivek Goyal <vgoyal@redhat.com> writes:
> > >> 
> > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned.
> > >> > Atleast that seems to be the assumption with which code has been written.
> > >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
> > >> > and pmem_make_request() which will only passe sector aligned offset and len.
> > >> >
> > >> > Soon we want use this function from dax_zero_page_range() code path which
> > >> > can try to zero arbitrary range of memory with-in a page. So update this
> > >> > function to assume that offset and length can be arbitrary and do the
> > >> > necessary alignments as needed.
> > >> 
> > >> What caller will try to zero a range that is smaller than a sector?
> > >
> > > Hi Jeff,
> > >
> > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > > a range which is less than a sector. Or which is bigger than a sector
> > > but start and end are not aligned on sector boundaries.
> > 
> > Sure, but who will call it with misaligned ranges?
> 
> create a file foo.txt of size 4K and then truncate it.
> 
> "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> 4095.

This should fail with EIO. Only full page writes should clear the
bad page state, and partial writes should therefore fail because
they do not guarantee the data in the filesystem block is all good.

If this zeroing was a buffered write to an address with a bad
sector, then the writeback will fail and the user will (eventually)
get an EIO on the file.

DAX should do the same thing, except because the zeroing is
synchronous (i.e. done directly by the truncate syscall) we can -
and should - return EIO immediately.

Indeed, with your code, if we then extend the file by truncating up
back to 4k, then the range between 23 and 512 is still bad, even
though we've successfully zeroed it and the user knows it. An
attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
will fail with EIO, but reading 10 bytes at offset 2000 will
succeed.

That's *awful* behaviour to expose to userspace, especially when
they look at the fs config and see that it's using both 4kB block
and sector sizes...

The only thing that makes sense from a filesystem perspective is
clearing bad page state when entire filesystem blocks are
overwritten. The data in a filesystem block is either good or bad,
and it doesn't matter how many internal (kernel or device) sectors
it has.

> > And what happens to the rest?  The caller is left to trip over the
> > errors?  That sounds pretty terrible.  I really think there needs to be
> > an explicit contract here.
> 
> Ok, I think is is the contentious bit. Current interface
> (__dax_zero_page_range()) either clears the poison (if I/O is aligned to
> sector) or expects page to be free of poison.
> 
> So in above example, of "truncate -s 23 foo.txt", currently I get an error
> because range being zeroed is not sector aligned. So
> __dax_zero_page_range() falls back to calling direct_access(). Which
> fails because there are poisoned sectors in the page.
> 
> With my patches, dax_zero_page_range(), clears the poison from sector 1 to
> 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511
> and returns success.

Ok, kernel sectors are not the unit of granularity bad page state
should be managed at. They don't match page state granularity, and
they don't match filesystem block granularity, and the whacky
"partial writes silently succeed, reads fail unpredictably"
assymetry it leads to will just cause problems for users.

> So question is, is this better behavior or worse behavior. If sector 0
> was poisoned, it will continue to remain poisoned and caller will come
> to know about it on next read and then it should try to truncate file
> to length 0 or unlink file or restore that file to get rid of poison.

Worse, because the filesystem can't track what sub-parts of the
block are bad and that leads to inconsistent data integrity status
being exposed to userspace.


> IOW, if a partial block is being zeroed and if it is poisoned, caller
> will not be return an error and poison will not be cleared and memory
> will be zeroed. What do we expect in such cases.
> 
> Do we expect an interface where if there are any bad blocks in the range
> being zeroed, then they all should be cleared (and hence all I/O should
> be aligned) otherwise error is returned. If yes, I could make that
> change.
> 
> Downside of current interface is that it will clear as many blocks as
> possible in the given range and leave starting and end blocks poisoned
> (if it is unaligned) and not return error. That means a reader will
> get error on these blocks again and they will have to try to clear it
> again.

Which is solved by having partial page writes always EIO on poisoned
memory.

Cheers,

Dave.
Dan Williams Feb. 24, 2020, 12:40 a.m. UTC | #11
On Sun, Feb 23, 2020 at 3:03 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Fri, Feb 21, 2020 at 03:17:59PM -0500, Vivek Goyal wrote:
> > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
> > > Vivek Goyal <vgoyal@redhat.com> writes:
> > >
> > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> > > >> Vivek Goyal <vgoyal@redhat.com> writes:
> > > >>
> > > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned.
> > > >> > Atleast that seems to be the assumption with which code has been written.
> > > >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
> > > >> > and pmem_make_request() which will only passe sector aligned offset and len.
> > > >> >
> > > >> > Soon we want use this function from dax_zero_page_range() code path which
> > > >> > can try to zero arbitrary range of memory with-in a page. So update this
> > > >> > function to assume that offset and length can be arbitrary and do the
> > > >> > necessary alignments as needed.
> > > >>
> > > >> What caller will try to zero a range that is smaller than a sector?
> > > >
> > > > Hi Jeff,
> > > >
> > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > > > a range which is less than a sector. Or which is bigger than a sector
> > > > but start and end are not aligned on sector boundaries.
> > >
> > > Sure, but who will call it with misaligned ranges?
> >
> > create a file foo.txt of size 4K and then truncate it.
> >
> > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> > 4095.
>
> This should fail with EIO. Only full page writes should clear the
> bad page state, and partial writes should therefore fail because
> they do not guarantee the data in the filesystem block is all good.
>
> If this zeroing was a buffered write to an address with a bad
> sector, then the writeback will fail and the user will (eventually)
> get an EIO on the file.
>
> DAX should do the same thing, except because the zeroing is
> synchronous (i.e. done directly by the truncate syscall) we can -
> and should - return EIO immediately.
>
> Indeed, with your code, if we then extend the file by truncating up
> back to 4k, then the range between 23 and 512 is still bad, even
> though we've successfully zeroed it and the user knows it. An
> attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
> will fail with EIO, but reading 10 bytes at offset 2000 will
> succeed.
>
> That's *awful* behaviour to expose to userspace, especially when
> they look at the fs config and see that it's using both 4kB block
> and sector sizes...
>
> The only thing that makes sense from a filesystem perspective is
> clearing bad page state when entire filesystem blocks are
> overwritten. The data in a filesystem block is either good or bad,
> and it doesn't matter how many internal (kernel or device) sectors
> it has.
>
> > > And what happens to the rest?  The caller is left to trip over the
> > > errors?  That sounds pretty terrible.  I really think there needs to be
> > > an explicit contract here.
> >
> > Ok, I think is is the contentious bit. Current interface
> > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to
> > sector) or expects page to be free of poison.
> >
> > So in above example, of "truncate -s 23 foo.txt", currently I get an error
> > because range being zeroed is not sector aligned. So
> > __dax_zero_page_range() falls back to calling direct_access(). Which
> > fails because there are poisoned sectors in the page.
> >
> > With my patches, dax_zero_page_range(), clears the poison from sector 1 to
> > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511
> > and returns success.
>
> Ok, kernel sectors are not the unit of granularity bad page state
> should be managed at. They don't match page state granularity, and
> they don't match filesystem block granularity, and the whacky
> "partial writes silently succeed, reads fail unpredictably"
> assymetry it leads to will just cause problems for users.
>
> > So question is, is this better behavior or worse behavior. If sector 0
> > was poisoned, it will continue to remain poisoned and caller will come
> > to know about it on next read and then it should try to truncate file
> > to length 0 or unlink file or restore that file to get rid of poison.
>
> Worse, because the filesystem can't track what sub-parts of the
> block are bad and that leads to inconsistent data integrity status
> being exposed to userspace.

The driver can't track it either. Latent poison isn't know until it is
consumed, and writes to latent poison are not guaranteed to clear it.

>
>
> > IOW, if a partial block is being zeroed and if it is poisoned, caller
> > will not be return an error and poison will not be cleared and memory
> > will be zeroed. What do we expect in such cases.
> >
> > Do we expect an interface where if there are any bad blocks in the range
> > being zeroed, then they all should be cleared (and hence all I/O should
> > be aligned) otherwise error is returned. If yes, I could make that
> > change.
> >
> > Downside of current interface is that it will clear as many blocks as
> > possible in the given range and leave starting and end blocks poisoned
> > (if it is unaligned) and not return error. That means a reader will
> > get error on these blocks again and they will have to try to clear it
> > again.
>
> Which is solved by having partial page writes always EIO on poisoned
> memory.

The problem with the above is that partial page writes can not be
guaranteed to return EIO. Poison is only detected on consumed reads,
or a periodic scrub, not writes. IFF poison detection was always
synchronous with poison creation then the above makes sense. However,
with asynchronous signaling, it's fundamentally a false security
blanket to assume even full block writes will clear poison unless a
callback to firmware is made for every block.
Jeff Moyer Feb. 24, 2020, 1:50 p.m. UTC | #12
Dan Williams <dan.j.williams@intel.com> writes:

> On Sun, Feb 23, 2020 at 3:03 PM Dave Chinner <david@fromorbit.com> wrote:
>>
>> On Fri, Feb 21, 2020 at 03:17:59PM -0500, Vivek Goyal wrote:
>> > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
>> > > Vivek Goyal <vgoyal@redhat.com> writes:
>> > >
>> > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
>> > > >> Vivek Goyal <vgoyal@redhat.com> writes:
>> > > >>
>> > > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned.
>> > > >> > Atleast that seems to be the assumption with which code has been written.
>> > > >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
>> > > >> > and pmem_make_request() which will only passe sector aligned offset and len.
>> > > >> >
>> > > >> > Soon we want use this function from dax_zero_page_range() code path which
>> > > >> > can try to zero arbitrary range of memory with-in a page. So update this
>> > > >> > function to assume that offset and length can be arbitrary and do the
>> > > >> > necessary alignments as needed.
>> > > >>
>> > > >> What caller will try to zero a range that is smaller than a sector?
>> > > >
>> > > > Hi Jeff,
>> > > >
>> > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
>> > > > a range which is less than a sector. Or which is bigger than a sector
>> > > > but start and end are not aligned on sector boundaries.
>> > >
>> > > Sure, but who will call it with misaligned ranges?
>> >
>> > create a file foo.txt of size 4K and then truncate it.
>> >
>> > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
>> > 4095.
>>
>> This should fail with EIO. Only full page writes should clear the
>> bad page state, and partial writes should therefore fail because
>> they do not guarantee the data in the filesystem block is all good.
>>
>> If this zeroing was a buffered write to an address with a bad
>> sector, then the writeback will fail and the user will (eventually)
>> get an EIO on the file.
>>
>> DAX should do the same thing, except because the zeroing is
>> synchronous (i.e. done directly by the truncate syscall) we can -
>> and should - return EIO immediately.
>>
>> Indeed, with your code, if we then extend the file by truncating up
>> back to 4k, then the range between 23 and 512 is still bad, even
>> though we've successfully zeroed it and the user knows it. An
>> attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
>> will fail with EIO, but reading 10 bytes at offset 2000 will
>> succeed.
>>
>> That's *awful* behaviour to expose to userspace, especially when
>> they look at the fs config and see that it's using both 4kB block
>> and sector sizes...
>>
>> The only thing that makes sense from a filesystem perspective is
>> clearing bad page state when entire filesystem blocks are
>> overwritten. The data in a filesystem block is either good or bad,
>> and it doesn't matter how many internal (kernel or device) sectors
>> it has.
>>
>> > > And what happens to the rest?  The caller is left to trip over the
>> > > errors?  That sounds pretty terrible.  I really think there needs to be
>> > > an explicit contract here.
>> >
>> > Ok, I think is is the contentious bit. Current interface
>> > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to
>> > sector) or expects page to be free of poison.
>> >
>> > So in above example, of "truncate -s 23 foo.txt", currently I get an error
>> > because range being zeroed is not sector aligned. So
>> > __dax_zero_page_range() falls back to calling direct_access(). Which
>> > fails because there are poisoned sectors in the page.
>> >
>> > With my patches, dax_zero_page_range(), clears the poison from sector 1 to
>> > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511
>> > and returns success.
>>
>> Ok, kernel sectors are not the unit of granularity bad page state
>> should be managed at. They don't match page state granularity, and
>> they don't match filesystem block granularity, and the whacky
>> "partial writes silently succeed, reads fail unpredictably"
>> assymetry it leads to will just cause problems for users.
>>
>> > So question is, is this better behavior or worse behavior. If sector 0
>> > was poisoned, it will continue to remain poisoned and caller will come
>> > to know about it on next read and then it should try to truncate file
>> > to length 0 or unlink file or restore that file to get rid of poison.
>>
>> Worse, because the filesystem can't track what sub-parts of the
>> block are bad and that leads to inconsistent data integrity status
>> being exposed to userspace.
>
> The driver can't track it either. Latent poison isn't know until it is
> consumed, and writes to latent poison are not guaranteed to clear it.

I believe we're discussing the case where we know there is a bad block.
Obviously we can't know about latent errors.

>> > IOW, if a partial block is being zeroed and if it is poisoned, caller
>> > will not be return an error and poison will not be cleared and memory
>> > will be zeroed. What do we expect in such cases.
>> >
>> > Do we expect an interface where if there are any bad blocks in the range
>> > being zeroed, then they all should be cleared (and hence all I/O should
>> > be aligned) otherwise error is returned. If yes, I could make that
>> > change.
>> >
>> > Downside of current interface is that it will clear as many blocks as
>> > possible in the given range and leave starting and end blocks poisoned
>> > (if it is unaligned) and not return error. That means a reader will
>> > get error on these blocks again and they will have to try to clear it
>> > again.
>>
>> Which is solved by having partial page writes always EIO on poisoned
>> memory.
>
> The problem with the above is that partial page writes can not be
> guaranteed to return EIO. Poison is only detected on consumed reads,
> or a periodic scrub, not writes. IFF poison detection was always
> synchronous with poison creation then the above makes sense. However,
> with asynchronous signaling, it's fundamentally a false security
> blanket to assume even full block writes will clear poison unless a
> callback to firmware is made for every block.

Let's just focus on reporting errors when we know we have them.

-Jeff
Vivek Goyal Feb. 24, 2020, 3:38 p.m. UTC | #13
On Mon, Feb 24, 2020 at 10:03:30AM +1100, Dave Chinner wrote:

[..]
> > > > Hi Jeff,
> > > >
> > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > > > a range which is less than a sector. Or which is bigger than a sector
> > > > but start and end are not aligned on sector boundaries.
> > > 
> > > Sure, but who will call it with misaligned ranges?
> > 
> > create a file foo.txt of size 4K and then truncate it.
> > 
> > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> > 4095.
> 
> This should fail with EIO. Only full page writes should clear the
> bad page state, and partial writes should therefore fail because
> they do not guarantee the data in the filesystem block is all good.
> 
> If this zeroing was a buffered write to an address with a bad
> sector, then the writeback will fail and the user will (eventually)
> get an EIO on the file.
> 
> DAX should do the same thing, except because the zeroing is
> synchronous (i.e. done directly by the truncate syscall) we can -
> and should - return EIO immediately.
> 
> Indeed, with your code, if we then extend the file by truncating up
> back to 4k, then the range between 23 and 512 is still bad, even
> though we've successfully zeroed it and the user knows it. An
> attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
> will fail with EIO, but reading 10 bytes at offset 2000 will
> succeed.

Hi Dave,

What is expected if I do "truncate -s 512 foo.txt". Say first sector (0 to
511) is poisoned and rest don't have poison. Should this fail with -EIO.

In current implementation it does not. Because all sector aligned I/O
we redirect through blkdev_issue_zeroout() and that will happly zero
out sector 2-8 without worrying about the state of sector 1. Hence user
which tries to read 10 bytes at offset 100, will still fail. This probably
should be fixed if we want to retain existing behavior.

Anyway, partial page truncate can't ensure that data in rest of the page can
be read back successfully. Memory can get poison after the write and
hence read after truncate will still fail.

Hence, all we are trying to ensure is that if a poison is known at the
time of writing partial page, then we should return error to user space.

Thanks
Vivek
Vivek Goyal Feb. 24, 2020, 8:13 p.m. UTC | #14
On Mon, Feb 24, 2020 at 10:03:30AM +1100, Dave Chinner wrote:
> On Fri, Feb 21, 2020 at 03:17:59PM -0500, Vivek Goyal wrote:
> > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
> > > Vivek Goyal <vgoyal@redhat.com> writes:
> > > 
> > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> > > >> Vivek Goyal <vgoyal@redhat.com> writes:
> > > >> 
> > > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned.
> > > >> > Atleast that seems to be the assumption with which code has been written.
> > > >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
> > > >> > and pmem_make_request() which will only passe sector aligned offset and len.
> > > >> >
> > > >> > Soon we want use this function from dax_zero_page_range() code path which
> > > >> > can try to zero arbitrary range of memory with-in a page. So update this
> > > >> > function to assume that offset and length can be arbitrary and do the
> > > >> > necessary alignments as needed.
> > > >> 
> > > >> What caller will try to zero a range that is smaller than a sector?
> > > >
> > > > Hi Jeff,
> > > >
> > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > > > a range which is less than a sector. Or which is bigger than a sector
> > > > but start and end are not aligned on sector boundaries.
> > > 
> > > Sure, but who will call it with misaligned ranges?
> > 
> > create a file foo.txt of size 4K and then truncate it.
> > 
> > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> > 4095.
> 
> This should fail with EIO. Only full page writes should clear the
> bad page state, and partial writes should therefore fail because
> they do not guarantee the data in the filesystem block is all good.
> 
> If this zeroing was a buffered write to an address with a bad
> sector, then the writeback will fail and the user will (eventually)
> get an EIO on the file.
> 
> DAX should do the same thing, except because the zeroing is
> synchronous (i.e. done directly by the truncate syscall) we can -
> and should - return EIO immediately.
> 
> Indeed, with your code, if we then extend the file by truncating up
> back to 4k, then the range between 23 and 512 is still bad, even
> though we've successfully zeroed it and the user knows it. An
> attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
> will fail with EIO, but reading 10 bytes at offset 2000 will
> succeed.
> 
> That's *awful* behaviour to expose to userspace, especially when
> they look at the fs config and see that it's using both 4kB block
> and sector sizes...
> 
> The only thing that makes sense from a filesystem perspective is
> clearing bad page state when entire filesystem blocks are
> overwritten. The data in a filesystem block is either good or bad,
> and it doesn't matter how many internal (kernel or device) sectors
> it has.
> 
> > > And what happens to the rest?  The caller is left to trip over the
> > > errors?  That sounds pretty terrible.  I really think there needs to be
> > > an explicit contract here.
> > 
> > Ok, I think is is the contentious bit. Current interface
> > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to
> > sector) or expects page to be free of poison.
> > 
> > So in above example, of "truncate -s 23 foo.txt", currently I get an error
> > because range being zeroed is not sector aligned. So
> > __dax_zero_page_range() falls back to calling direct_access(). Which
> > fails because there are poisoned sectors in the page.
> > 
> > With my patches, dax_zero_page_range(), clears the poison from sector 1 to
> > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511
> > and returns success.
> 
> Ok, kernel sectors are not the unit of granularity bad page state
> should be managed at. They don't match page state granularity, and
> they don't match filesystem block granularity, and the whacky
> "partial writes silently succeed, reads fail unpredictably"
> assymetry it leads to will just cause problems for users.
> 
> > So question is, is this better behavior or worse behavior. If sector 0
> > was poisoned, it will continue to remain poisoned and caller will come
> > to know about it on next read and then it should try to truncate file
> > to length 0 or unlink file or restore that file to get rid of poison.
> 
> Worse, because the filesystem can't track what sub-parts of the
> block are bad and that leads to inconsistent data integrity status
> being exposed to userspace.
> 
> 
> > IOW, if a partial block is being zeroed and if it is poisoned, caller
> > will not be return an error and poison will not be cleared and memory
> > will be zeroed. What do we expect in such cases.
> > 
> > Do we expect an interface where if there are any bad blocks in the range
> > being zeroed, then they all should be cleared (and hence all I/O should
> > be aligned) otherwise error is returned. If yes, I could make that
> > change.
> > 
> > Downside of current interface is that it will clear as many blocks as
> > possible in the given range and leave starting and end blocks poisoned
> > (if it is unaligned) and not return error. That means a reader will
> > get error on these blocks again and they will have to try to clear it
> > again.
> 
> Which is solved by having partial page writes always EIO on poisoned
> memory.

Ok, how about if I add one more patch to the series which will check
if unwritten portion of the page has known poison. If it has, then
-EIO is returned. 


Subject: pmem: zero page range return error if poisoned memory in unwritten area

Filesystems call into pmem_dax_zero_page_range() to zero partial page upon
truncate. If partial page is being zeroed, then at the end of operation
file systems expect that there is no poison in the whole page (atleast
known poison).

So make sure part of the partial page which is not being written, does not
have poison. If it does, return error. If there is poison in area of page
being written, it will be cleared.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 drivers/nvdimm/pmem.c |   13 +++++++++++++
 1 file changed, 13 insertions(+)

Index: redhat-linux/drivers/nvdimm/pmem.c
===================================================================
--- redhat-linux.orig/drivers/nvdimm/pmem.c	2020-02-24 10:57:48.520451478 -0500
+++ redhat-linux/drivers/nvdimm/pmem.c	2020-02-24 15:01:58.119854835 -0500
@@ -309,6 +309,19 @@ static int pmem_dax_zero_page_range(stru
 {
 	struct pmem_device *pmem = dax_get_private(dax_dev);
 
+	if (offset_in_page(offset)) {
+		sector_t unwritten_start = (offset & PAGE_MASK) >> SECTOR_SHIFT;
+		unsigned int unwritten_len = ALIGN(offset_in_page(offset), SECTOR_SIZE);
+		/* If partial page is being zeroed, make sure there is no
+		 * known poison in the area of page which is not being zeroed.
+		 * Filesystems expect full page to be free of *known* poison
+		 * at the end of partial page zeroing.
+		 */
+		if (unlikely(is_bad_pmem(&pmem->bb, unwritten_start,
+					 unwritten_len)))
+			return -EIO;
+	}
+
 	return blk_status_to_errno(pmem_do_write(pmem, ZERO_PAGE(0), 0, offset,
 				   len));
 }
Dan Williams Feb. 24, 2020, 8:48 p.m. UTC | #15
On Mon, Feb 24, 2020 at 5:50 AM Jeff Moyer <jmoyer@redhat.com> wrote:
>
> Dan Williams <dan.j.williams@intel.com> writes:
>
> > On Sun, Feb 23, 2020 at 3:03 PM Dave Chinner <david@fromorbit.com> wrote:
> >>
> >> On Fri, Feb 21, 2020 at 03:17:59PM -0500, Vivek Goyal wrote:
> >> > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
> >> > > Vivek Goyal <vgoyal@redhat.com> writes:
> >> > >
> >> > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> >> > > >> Vivek Goyal <vgoyal@redhat.com> writes:
> >> > > >>
> >> > > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned.
> >> > > >> > Atleast that seems to be the assumption with which code has been written.
> >> > > >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
> >> > > >> > and pmem_make_request() which will only passe sector aligned offset and len.
> >> > > >> >
> >> > > >> > Soon we want use this function from dax_zero_page_range() code path which
> >> > > >> > can try to zero arbitrary range of memory with-in a page. So update this
> >> > > >> > function to assume that offset and length can be arbitrary and do the
> >> > > >> > necessary alignments as needed.
> >> > > >>
> >> > > >> What caller will try to zero a range that is smaller than a sector?
> >> > > >
> >> > > > Hi Jeff,
> >> > > >
> >> > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> >> > > > a range which is less than a sector. Or which is bigger than a sector
> >> > > > but start and end are not aligned on sector boundaries.
> >> > >
> >> > > Sure, but who will call it with misaligned ranges?
> >> >
> >> > create a file foo.txt of size 4K and then truncate it.
> >> >
> >> > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> >> > 4095.
> >>
> >> This should fail with EIO. Only full page writes should clear the
> >> bad page state, and partial writes should therefore fail because
> >> they do not guarantee the data in the filesystem block is all good.
> >>
> >> If this zeroing was a buffered write to an address with a bad
> >> sector, then the writeback will fail and the user will (eventually)
> >> get an EIO on the file.
> >>
> >> DAX should do the same thing, except because the zeroing is
> >> synchronous (i.e. done directly by the truncate syscall) we can -
> >> and should - return EIO immediately.
> >>
> >> Indeed, with your code, if we then extend the file by truncating up
> >> back to 4k, then the range between 23 and 512 is still bad, even
> >> though we've successfully zeroed it and the user knows it. An
> >> attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
> >> will fail with EIO, but reading 10 bytes at offset 2000 will
> >> succeed.
> >>
> >> That's *awful* behaviour to expose to userspace, especially when
> >> they look at the fs config and see that it's using both 4kB block
> >> and sector sizes...
> >>
> >> The only thing that makes sense from a filesystem perspective is
> >> clearing bad page state when entire filesystem blocks are
> >> overwritten. The data in a filesystem block is either good or bad,
> >> and it doesn't matter how many internal (kernel or device) sectors
> >> it has.
> >>
> >> > > And what happens to the rest?  The caller is left to trip over the
> >> > > errors?  That sounds pretty terrible.  I really think there needs to be
> >> > > an explicit contract here.
> >> >
> >> > Ok, I think is is the contentious bit. Current interface
> >> > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to
> >> > sector) or expects page to be free of poison.
> >> >
> >> > So in above example, of "truncate -s 23 foo.txt", currently I get an error
> >> > because range being zeroed is not sector aligned. So
> >> > __dax_zero_page_range() falls back to calling direct_access(). Which
> >> > fails because there are poisoned sectors in the page.
> >> >
> >> > With my patches, dax_zero_page_range(), clears the poison from sector 1 to
> >> > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511
> >> > and returns success.
> >>
> >> Ok, kernel sectors are not the unit of granularity bad page state
> >> should be managed at. They don't match page state granularity, and
> >> they don't match filesystem block granularity, and the whacky
> >> "partial writes silently succeed, reads fail unpredictably"
> >> assymetry it leads to will just cause problems for users.
> >>
> >> > So question is, is this better behavior or worse behavior. If sector 0
> >> > was poisoned, it will continue to remain poisoned and caller will come
> >> > to know about it on next read and then it should try to truncate file
> >> > to length 0 or unlink file or restore that file to get rid of poison.
> >>
> >> Worse, because the filesystem can't track what sub-parts of the
> >> block are bad and that leads to inconsistent data integrity status
> >> being exposed to userspace.
> >
> > The driver can't track it either. Latent poison isn't know until it is
> > consumed, and writes to latent poison are not guaranteed to clear it.
>
> I believe we're discussing the case where we know there is a bad block.
> Obviously we can't know about latent errors.
>
> >> > IOW, if a partial block is being zeroed and if it is poisoned, caller
> >> > will not be return an error and poison will not be cleared and memory
> >> > will be zeroed. What do we expect in such cases.
> >> >
> >> > Do we expect an interface where if there are any bad blocks in the range
> >> > being zeroed, then they all should be cleared (and hence all I/O should
> >> > be aligned) otherwise error is returned. If yes, I could make that
> >> > change.
> >> >
> >> > Downside of current interface is that it will clear as many blocks as
> >> > possible in the given range and leave starting and end blocks poisoned
> >> > (if it is unaligned) and not return error. That means a reader will
> >> > get error on these blocks again and they will have to try to clear it
> >> > again.
> >>
> >> Which is solved by having partial page writes always EIO on poisoned
> >> memory.
> >
> > The problem with the above is that partial page writes can not be
> > guaranteed to return EIO. Poison is only detected on consumed reads,
> > or a periodic scrub, not writes. IFF poison detection was always
> > synchronous with poison creation then the above makes sense. However,
> > with asynchronous signaling, it's fundamentally a false security
> > blanket to assume even full block writes will clear poison unless a
> > callback to firmware is made for every block.
>
> Let's just focus on reporting errors when we know we have them.

That's the problem in my eyes. If software needs to contend with
latent error reporting then it should always contend otherwise
software has multiple error models to wrangle.

Setting that aside we can start with just treating zeroing the same as
the copy_from_iter() case and fail the I/O at the dax_direct_access()
step.

I'd rather have a separate op that filesystems can use to clear errors
at block allocation time that can be enforced to have the correct
alignment.
Dan Williams Feb. 24, 2020, 8:52 p.m. UTC | #16
On Mon, Feb 24, 2020 at 12:13 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Mon, Feb 24, 2020 at 10:03:30AM +1100, Dave Chinner wrote:
> > On Fri, Feb 21, 2020 at 03:17:59PM -0500, Vivek Goyal wrote:
> > > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
> > > > Vivek Goyal <vgoyal@redhat.com> writes:
> > > >
> > > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> > > > >> Vivek Goyal <vgoyal@redhat.com> writes:
> > > > >>
> > > > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned.
> > > > >> > Atleast that seems to be the assumption with which code has been written.
> > > > >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
> > > > >> > and pmem_make_request() which will only passe sector aligned offset and len.
> > > > >> >
> > > > >> > Soon we want use this function from dax_zero_page_range() code path which
> > > > >> > can try to zero arbitrary range of memory with-in a page. So update this
> > > > >> > function to assume that offset and length can be arbitrary and do the
> > > > >> > necessary alignments as needed.
> > > > >>
> > > > >> What caller will try to zero a range that is smaller than a sector?
> > > > >
> > > > > Hi Jeff,
> > > > >
> > > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > > > > a range which is less than a sector. Or which is bigger than a sector
> > > > > but start and end are not aligned on sector boundaries.
> > > >
> > > > Sure, but who will call it with misaligned ranges?
> > >
> > > create a file foo.txt of size 4K and then truncate it.
> > >
> > > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> > > 4095.
> >
> > This should fail with EIO. Only full page writes should clear the
> > bad page state, and partial writes should therefore fail because
> > they do not guarantee the data in the filesystem block is all good.
> >
> > If this zeroing was a buffered write to an address with a bad
> > sector, then the writeback will fail and the user will (eventually)
> > get an EIO on the file.
> >
> > DAX should do the same thing, except because the zeroing is
> > synchronous (i.e. done directly by the truncate syscall) we can -
> > and should - return EIO immediately.
> >
> > Indeed, with your code, if we then extend the file by truncating up
> > back to 4k, then the range between 23 and 512 is still bad, even
> > though we've successfully zeroed it and the user knows it. An
> > attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
> > will fail with EIO, but reading 10 bytes at offset 2000 will
> > succeed.
> >
> > That's *awful* behaviour to expose to userspace, especially when
> > they look at the fs config and see that it's using both 4kB block
> > and sector sizes...
> >
> > The only thing that makes sense from a filesystem perspective is
> > clearing bad page state when entire filesystem blocks are
> > overwritten. The data in a filesystem block is either good or bad,
> > and it doesn't matter how many internal (kernel or device) sectors
> > it has.
> >
> > > > And what happens to the rest?  The caller is left to trip over the
> > > > errors?  That sounds pretty terrible.  I really think there needs to be
> > > > an explicit contract here.
> > >
> > > Ok, I think is is the contentious bit. Current interface
> > > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to
> > > sector) or expects page to be free of poison.
> > >
> > > So in above example, of "truncate -s 23 foo.txt", currently I get an error
> > > because range being zeroed is not sector aligned. So
> > > __dax_zero_page_range() falls back to calling direct_access(). Which
> > > fails because there are poisoned sectors in the page.
> > >
> > > With my patches, dax_zero_page_range(), clears the poison from sector 1 to
> > > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511
> > > and returns success.
> >
> > Ok, kernel sectors are not the unit of granularity bad page state
> > should be managed at. They don't match page state granularity, and
> > they don't match filesystem block granularity, and the whacky
> > "partial writes silently succeed, reads fail unpredictably"
> > assymetry it leads to will just cause problems for users.
> >
> > > So question is, is this better behavior or worse behavior. If sector 0
> > > was poisoned, it will continue to remain poisoned and caller will come
> > > to know about it on next read and then it should try to truncate file
> > > to length 0 or unlink file or restore that file to get rid of poison.
> >
> > Worse, because the filesystem can't track what sub-parts of the
> > block are bad and that leads to inconsistent data integrity status
> > being exposed to userspace.
> >
> >
> > > IOW, if a partial block is being zeroed and if it is poisoned, caller
> > > will not be return an error and poison will not be cleared and memory
> > > will be zeroed. What do we expect in such cases.
> > >
> > > Do we expect an interface where if there are any bad blocks in the range
> > > being zeroed, then they all should be cleared (and hence all I/O should
> > > be aligned) otherwise error is returned. If yes, I could make that
> > > change.
> > >
> > > Downside of current interface is that it will clear as many blocks as
> > > possible in the given range and leave starting and end blocks poisoned
> > > (if it is unaligned) and not return error. That means a reader will
> > > get error on these blocks again and they will have to try to clear it
> > > again.
> >
> > Which is solved by having partial page writes always EIO on poisoned
> > memory.
>
> Ok, how about if I add one more patch to the series which will check
> if unwritten portion of the page has known poison. If it has, then
> -EIO is returned.
>
>
> Subject: pmem: zero page range return error if poisoned memory in unwritten area
>
> Filesystems call into pmem_dax_zero_page_range() to zero partial page upon
> truncate. If partial page is being zeroed, then at the end of operation
> file systems expect that there is no poison in the whole page (atleast
> known poison).
>
> So make sure part of the partial page which is not being written, does not
> have poison. If it does, return error. If there is poison in area of page
> being written, it will be cleared.

No, I don't like that the zero operation is special cased compared to
the write case. I'd say let's make them identical for now. I.e. fail
the I/O at dax_direct_access() time. I think the error clearing
interface should be an explicit / separate op rather than a
side-effect. What about an explicit interface for initializing newly
allocated blocks, and the only reliable way to destroy poison through
the filesystem is to free the block?
Vivek Goyal Feb. 24, 2020, 9:15 p.m. UTC | #17
On Mon, Feb 24, 2020 at 12:52:13PM -0800, Dan Williams wrote:
> On Mon, Feb 24, 2020 at 12:13 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Mon, Feb 24, 2020 at 10:03:30AM +1100, Dave Chinner wrote:
> > > On Fri, Feb 21, 2020 at 03:17:59PM -0500, Vivek Goyal wrote:
> > > > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
> > > > > Vivek Goyal <vgoyal@redhat.com> writes:
> > > > >
> > > > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> > > > > >> Vivek Goyal <vgoyal@redhat.com> writes:
> > > > > >>
> > > > > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned.
> > > > > >> > Atleast that seems to be the assumption with which code has been written.
> > > > > >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
> > > > > >> > and pmem_make_request() which will only passe sector aligned offset and len.
> > > > > >> >
> > > > > >> > Soon we want use this function from dax_zero_page_range() code path which
> > > > > >> > can try to zero arbitrary range of memory with-in a page. So update this
> > > > > >> > function to assume that offset and length can be arbitrary and do the
> > > > > >> > necessary alignments as needed.
> > > > > >>
> > > > > >> What caller will try to zero a range that is smaller than a sector?
> > > > > >
> > > > > > Hi Jeff,
> > > > > >
> > > > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > > > > > a range which is less than a sector. Or which is bigger than a sector
> > > > > > but start and end are not aligned on sector boundaries.
> > > > >
> > > > > Sure, but who will call it with misaligned ranges?
> > > >
> > > > create a file foo.txt of size 4K and then truncate it.
> > > >
> > > > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> > > > 4095.
> > >
> > > This should fail with EIO. Only full page writes should clear the
> > > bad page state, and partial writes should therefore fail because
> > > they do not guarantee the data in the filesystem block is all good.
> > >
> > > If this zeroing was a buffered write to an address with a bad
> > > sector, then the writeback will fail and the user will (eventually)
> > > get an EIO on the file.
> > >
> > > DAX should do the same thing, except because the zeroing is
> > > synchronous (i.e. done directly by the truncate syscall) we can -
> > > and should - return EIO immediately.
> > >
> > > Indeed, with your code, if we then extend the file by truncating up
> > > back to 4k, then the range between 23 and 512 is still bad, even
> > > though we've successfully zeroed it and the user knows it. An
> > > attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
> > > will fail with EIO, but reading 10 bytes at offset 2000 will
> > > succeed.
> > >
> > > That's *awful* behaviour to expose to userspace, especially when
> > > they look at the fs config and see that it's using both 4kB block
> > > and sector sizes...
> > >
> > > The only thing that makes sense from a filesystem perspective is
> > > clearing bad page state when entire filesystem blocks are
> > > overwritten. The data in a filesystem block is either good or bad,
> > > and it doesn't matter how many internal (kernel or device) sectors
> > > it has.
> > >
> > > > > And what happens to the rest?  The caller is left to trip over the
> > > > > errors?  That sounds pretty terrible.  I really think there needs to be
> > > > > an explicit contract here.
> > > >
> > > > Ok, I think is is the contentious bit. Current interface
> > > > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to
> > > > sector) or expects page to be free of poison.
> > > >
> > > > So in above example, of "truncate -s 23 foo.txt", currently I get an error
> > > > because range being zeroed is not sector aligned. So
> > > > __dax_zero_page_range() falls back to calling direct_access(). Which
> > > > fails because there are poisoned sectors in the page.
> > > >
> > > > With my patches, dax_zero_page_range(), clears the poison from sector 1 to
> > > > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511
> > > > and returns success.
> > >
> > > Ok, kernel sectors are not the unit of granularity bad page state
> > > should be managed at. They don't match page state granularity, and
> > > they don't match filesystem block granularity, and the whacky
> > > "partial writes silently succeed, reads fail unpredictably"
> > > assymetry it leads to will just cause problems for users.
> > >
> > > > So question is, is this better behavior or worse behavior. If sector 0
> > > > was poisoned, it will continue to remain poisoned and caller will come
> > > > to know about it on next read and then it should try to truncate file
> > > > to length 0 or unlink file or restore that file to get rid of poison.
> > >
> > > Worse, because the filesystem can't track what sub-parts of the
> > > block are bad and that leads to inconsistent data integrity status
> > > being exposed to userspace.
> > >
> > >
> > > > IOW, if a partial block is being zeroed and if it is poisoned, caller
> > > > will not be return an error and poison will not be cleared and memory
> > > > will be zeroed. What do we expect in such cases.
> > > >
> > > > Do we expect an interface where if there are any bad blocks in the range
> > > > being zeroed, then they all should be cleared (and hence all I/O should
> > > > be aligned) otherwise error is returned. If yes, I could make that
> > > > change.
> > > >
> > > > Downside of current interface is that it will clear as many blocks as
> > > > possible in the given range and leave starting and end blocks poisoned
> > > > (if it is unaligned) and not return error. That means a reader will
> > > > get error on these blocks again and they will have to try to clear it
> > > > again.
> > >
> > > Which is solved by having partial page writes always EIO on poisoned
> > > memory.
> >
> > Ok, how about if I add one more patch to the series which will check
> > if unwritten portion of the page has known poison. If it has, then
> > -EIO is returned.
> >
> >
> > Subject: pmem: zero page range return error if poisoned memory in unwritten area
> >
> > Filesystems call into pmem_dax_zero_page_range() to zero partial page upon
> > truncate. If partial page is being zeroed, then at the end of operation
> > file systems expect that there is no poison in the whole page (atleast
> > known poison).
> >
> > So make sure part of the partial page which is not being written, does not
> > have poison. If it does, return error. If there is poison in area of page
> > being written, it will be cleared.
> 
> No, I don't like that the zero operation is special cased compared to
> the write case. I'd say let's make them identical for now. I.e. fail
> the I/O at dax_direct_access() time.

So basically __dax_zero_page_range() will only write zeros (and not
try to clear any poison). Right?

> I think the error clearing
> interface should be an explicit / separate op rather than a
> side-effect. What about an explicit interface for initializing newly
> allocated blocks, and the only reliable way to destroy poison through
> the filesystem is to free the block?

Effectively pmem_make_request() is already that interface filesystems
use to initialize blocks and clear poison. So we don't really have to
introduce a new interface?

Or you are suggesting separate dax_zero_page_range() interface which will
always call into firmware to clear poison. And that will make sure latent
poison is cleared as well and filesystem should use that for block
initialization instead? I do like the idea of not having to differentiate
between known poison and latent poison. Once a block has been initialized
all poison should be cleared (known/latent). I am worried though that
on large devices this might slowdown filesystem initialization a lot
if they are zeroing large range of blocks.

If yes, this sounds like two different patch series. First patch series
takes care of removing blkdev_issue_zeroout() from
__dax_zero_page_range() and couple of iomap related cleans christoph
wanted.

And second patch series for adding new dax operation to zero a range
and always call info firmware to clear poison and modify filesystems
accordingly.

Thanks
Vivek
Dan Williams Feb. 24, 2020, 9:32 p.m. UTC | #18
On Mon, Feb 24, 2020 at 1:17 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Mon, Feb 24, 2020 at 12:52:13PM -0800, Dan Williams wrote:
> > On Mon, Feb 24, 2020 at 12:13 PM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > On Mon, Feb 24, 2020 at 10:03:30AM +1100, Dave Chinner wrote:
> > > > On Fri, Feb 21, 2020 at 03:17:59PM -0500, Vivek Goyal wrote:
> > > > > On Fri, Feb 21, 2020 at 01:32:48PM -0500, Jeff Moyer wrote:
> > > > > > Vivek Goyal <vgoyal@redhat.com> writes:
> > > > > >
> > > > > > > On Thu, Feb 20, 2020 at 04:35:17PM -0500, Jeff Moyer wrote:
> > > > > > >> Vivek Goyal <vgoyal@redhat.com> writes:
> > > > > > >>
> > > > > > >> > Currently pmem_clear_poison() expects offset and len to be sector aligned.
> > > > > > >> > Atleast that seems to be the assumption with which code has been written.
> > > > > > >> > It is called only from pmem_do_bvec() which is called only from pmem_rw_page()
> > > > > > >> > and pmem_make_request() which will only passe sector aligned offset and len.
> > > > > > >> >
> > > > > > >> > Soon we want use this function from dax_zero_page_range() code path which
> > > > > > >> > can try to zero arbitrary range of memory with-in a page. So update this
> > > > > > >> > function to assume that offset and length can be arbitrary and do the
> > > > > > >> > necessary alignments as needed.
> > > > > > >>
> > > > > > >> What caller will try to zero a range that is smaller than a sector?
> > > > > > >
> > > > > > > Hi Jeff,
> > > > > > >
> > > > > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > > > > > > a range which is less than a sector. Or which is bigger than a sector
> > > > > > > but start and end are not aligned on sector boundaries.
> > > > > >
> > > > > > Sure, but who will call it with misaligned ranges?
> > > > >
> > > > > create a file foo.txt of size 4K and then truncate it.
> > > > >
> > > > > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> > > > > 4095.
> > > >
> > > > This should fail with EIO. Only full page writes should clear the
> > > > bad page state, and partial writes should therefore fail because
> > > > they do not guarantee the data in the filesystem block is all good.
> > > >
> > > > If this zeroing was a buffered write to an address with a bad
> > > > sector, then the writeback will fail and the user will (eventually)
> > > > get an EIO on the file.
> > > >
> > > > DAX should do the same thing, except because the zeroing is
> > > > synchronous (i.e. done directly by the truncate syscall) we can -
> > > > and should - return EIO immediately.
> > > >
> > > > Indeed, with your code, if we then extend the file by truncating up
> > > > back to 4k, then the range between 23 and 512 is still bad, even
> > > > though we've successfully zeroed it and the user knows it. An
> > > > attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
> > > > will fail with EIO, but reading 10 bytes at offset 2000 will
> > > > succeed.
> > > >
> > > > That's *awful* behaviour to expose to userspace, especially when
> > > > they look at the fs config and see that it's using both 4kB block
> > > > and sector sizes...
> > > >
> > > > The only thing that makes sense from a filesystem perspective is
> > > > clearing bad page state when entire filesystem blocks are
> > > > overwritten. The data in a filesystem block is either good or bad,
> > > > and it doesn't matter how many internal (kernel or device) sectors
> > > > it has.
> > > >
> > > > > > And what happens to the rest?  The caller is left to trip over the
> > > > > > errors?  That sounds pretty terrible.  I really think there needs to be
> > > > > > an explicit contract here.
> > > > >
> > > > > Ok, I think is is the contentious bit. Current interface
> > > > > (__dax_zero_page_range()) either clears the poison (if I/O is aligned to
> > > > > sector) or expects page to be free of poison.
> > > > >
> > > > > So in above example, of "truncate -s 23 foo.txt", currently I get an error
> > > > > because range being zeroed is not sector aligned. So
> > > > > __dax_zero_page_range() falls back to calling direct_access(). Which
> > > > > fails because there are poisoned sectors in the page.
> > > > >
> > > > > With my patches, dax_zero_page_range(), clears the poison from sector 1 to
> > > > > 7 but leaves sector 0 untouched and just writes zeroes from byte 0 to 511
> > > > > and returns success.
> > > >
> > > > Ok, kernel sectors are not the unit of granularity bad page state
> > > > should be managed at. They don't match page state granularity, and
> > > > they don't match filesystem block granularity, and the whacky
> > > > "partial writes silently succeed, reads fail unpredictably"
> > > > assymetry it leads to will just cause problems for users.
> > > >
> > > > > So question is, is this better behavior or worse behavior. If sector 0
> > > > > was poisoned, it will continue to remain poisoned and caller will come
> > > > > to know about it on next read and then it should try to truncate file
> > > > > to length 0 or unlink file or restore that file to get rid of poison.
> > > >
> > > > Worse, because the filesystem can't track what sub-parts of the
> > > > block are bad and that leads to inconsistent data integrity status
> > > > being exposed to userspace.
> > > >
> > > >
> > > > > IOW, if a partial block is being zeroed and if it is poisoned, caller
> > > > > will not be return an error and poison will not be cleared and memory
> > > > > will be zeroed. What do we expect in such cases.
> > > > >
> > > > > Do we expect an interface where if there are any bad blocks in the range
> > > > > being zeroed, then they all should be cleared (and hence all I/O should
> > > > > be aligned) otherwise error is returned. If yes, I could make that
> > > > > change.
> > > > >
> > > > > Downside of current interface is that it will clear as many blocks as
> > > > > possible in the given range and leave starting and end blocks poisoned
> > > > > (if it is unaligned) and not return error. That means a reader will
> > > > > get error on these blocks again and they will have to try to clear it
> > > > > again.
> > > >
> > > > Which is solved by having partial page writes always EIO on poisoned
> > > > memory.
> > >
> > > Ok, how about if I add one more patch to the series which will check
> > > if unwritten portion of the page has known poison. If it has, then
> > > -EIO is returned.
> > >
> > >
> > > Subject: pmem: zero page range return error if poisoned memory in unwritten area
> > >
> > > Filesystems call into pmem_dax_zero_page_range() to zero partial page upon
> > > truncate. If partial page is being zeroed, then at the end of operation
> > > file systems expect that there is no poison in the whole page (atleast
> > > known poison).
> > >
> > > So make sure part of the partial page which is not being written, does not
> > > have poison. If it does, return error. If there is poison in area of page
> > > being written, it will be cleared.
> >
> > No, I don't like that the zero operation is special cased compared to
> > the write case. I'd say let's make them identical for now. I.e. fail
> > the I/O at dax_direct_access() time.
>
> So basically __dax_zero_page_range() will only write zeros (and not
> try to clear any poison). Right?

Yes, the zero operation would have already failed at the
dax_direct_access() step if there was present poison.

> > I think the error clearing
> > interface should be an explicit / separate op rather than a
> > side-effect. What about an explicit interface for initializing newly
> > allocated blocks, and the only reliable way to destroy poison through
> > the filesystem is to free the block?
>
> Effectively pmem_make_request() is already that interface filesystems
> use to initialize blocks and clear poison. So we don't really have to
> introduce a new interface?

pmem_make_request() is shared with the I/O path and is too low in the
stack to understand intent. DAX intercepts the I/O path closer to the
filesystem and can understand zeroing vs writing today. I'm proposing
we go a step further and make DAX understand free-to-allocated-block
initialization instead of just zeroing. Inject the error clearing into
that initialization interface.

> Or you are suggesting separate dax_zero_page_range() interface which will
> always call into firmware to clear poison. And that will make sure latent
> poison is cleared as well and filesystem should use that for block
> initialization instead?

Yes, except latent poison would not be cleared until the zeroing is
implemented with movdir64b instead of callouts to firmware. It's
otherwise too slow to call out to firmware unconditionally.

> I do like the idea of not having to differentiate
> between known poison and latent poison. Once a block has been initialized
> all poison should be cleared (known/latent). I am worried though that
> on large devices this might slowdown filesystem initialization a lot
> if they are zeroing large range of blocks.
>
> If yes, this sounds like two different patch series. First patch series
> takes care of removing blkdev_issue_zeroout() from
> __dax_zero_page_range() and couple of iomap related cleans christoph
> wanted.
>
> And second patch series for adding new dax operation to zero a range
> and always call info firmware to clear poison and modify filesystems
> accordingly.

Yes, but they may need to be merged together. I don't want to regress
the ability of a block-aligned hole-punch to clear errors.
Jeff Moyer Feb. 24, 2020, 9:53 p.m. UTC | #19
Dan Williams <dan.j.williams@intel.com> writes:

>> Let's just focus on reporting errors when we know we have them.
>
> That's the problem in my eyes. If software needs to contend with
> latent error reporting then it should always contend otherwise
> software has multiple error models to wrangle.

The only way for an application to know that the data has been written
successfully would be to issue a read after every write.  That's not a
performance hit most applications are willing to take.  And, of course,
the media can still go bad at a later time, so it only guarantees the
data is accessible immediately after having been written.

What I'm suggesting is that we should not complete a write successfully
if we know that the data will not be retrievable.  I wouldn't call this
adding an extra error model to contend with.  Applications should
already be checking for errors on write.

Does that make sense?  Are we talking past each other?

> Setting that aside we can start with just treating zeroing the same as
> the copy_from_iter() case and fail the I/O at the dax_direct_access()
> step.

OK.

> I'd rather have a separate op that filesystems can use to clear errors
> at block allocation time that can be enforced to have the correct
> alignment.

So would file systems always call that routine instead of zeroing, or
would they first check to see if there are badblocks?

-Jeff
Dan Williams Feb. 25, 2020, 12:26 a.m. UTC | #20
On Mon, Feb 24, 2020 at 1:53 PM Jeff Moyer <jmoyer@redhat.com> wrote:
>
> Dan Williams <dan.j.williams@intel.com> writes:
>
> >> Let's just focus on reporting errors when we know we have them.
> >
> > That's the problem in my eyes. If software needs to contend with
> > latent error reporting then it should always contend otherwise
> > software has multiple error models to wrangle.
>
> The only way for an application to know that the data has been written
> successfully would be to issue a read after every write.  That's not a
> performance hit most applications are willing to take.  And, of course,
> the media can still go bad at a later time, so it only guarantees the
> data is accessible immediately after having been written.
>
> What I'm suggesting is that we should not complete a write successfully
> if we know that the data will not be retrievable.  I wouldn't call this
> adding an extra error model to contend with.  Applications should
> already be checking for errors on write.
>
> Does that make sense? Are we talking past each other?

The badblock list is late to update in both directions, late to add
entries that the scrub needs to find and late to delete entries that
were inadvertently cleared by cache-line writes that did not first
ingest the poison for a read-modify-write. So I see the above as being
wishful in using the error list as the hard source of truth and
unfortunate to up-level all sub-sector error entries into full
PAGE_SIZE data offline events.

I'm hoping we can find a way to make the error handling more fine
grained over time, but for the current patch, managing the blast
radius as PAGE_SIZE granularity at least matches the zero path with
the write path.

> > Setting that aside we can start with just treating zeroing the same as
> > the copy_from_iter() case and fail the I/O at the dax_direct_access()
> > step.
>
> OK.
>
> > I'd rather have a separate op that filesystems can use to clear errors
> > at block allocation time that can be enforced to have the correct
> > alignment.
>
> So would file systems always call that routine instead of zeroing, or
> would they first check to see if there are badblocks?

The proposal is that filesystems distinguish zeroing from free-block
allocation/initialization such that the fsdax implementation directs
initialization to a driver callback. This "initialization op" would
take care to check for poison and clear it. All other dax paths would
not consult the badblocks list.
Vivek Goyal Feb. 25, 2020, 1:36 p.m. UTC | #21
On Mon, Feb 24, 2020 at 01:32:58PM -0800, Dan Williams wrote:

[..]
> > > > Ok, how about if I add one more patch to the series which will check
> > > > if unwritten portion of the page has known poison. If it has, then
> > > > -EIO is returned.
> > > >
> > > >
> > > > Subject: pmem: zero page range return error if poisoned memory in unwritten area
> > > >
> > > > Filesystems call into pmem_dax_zero_page_range() to zero partial page upon
> > > > truncate. If partial page is being zeroed, then at the end of operation
> > > > file systems expect that there is no poison in the whole page (atleast
> > > > known poison).
> > > >
> > > > So make sure part of the partial page which is not being written, does not
> > > > have poison. If it does, return error. If there is poison in area of page
> > > > being written, it will be cleared.
> > >
> > > No, I don't like that the zero operation is special cased compared to
> > > the write case. I'd say let's make them identical for now. I.e. fail
> > > the I/O at dax_direct_access() time.
> >
> > So basically __dax_zero_page_range() will only write zeros (and not
> > try to clear any poison). Right?
> 
> Yes, the zero operation would have already failed at the
> dax_direct_access() step if there was present poison.
> 
> > > I think the error clearing
> > > interface should be an explicit / separate op rather than a
> > > side-effect. What about an explicit interface for initializing newly
> > > allocated blocks, and the only reliable way to destroy poison through
> > > the filesystem is to free the block?
> >
> > Effectively pmem_make_request() is already that interface filesystems
> > use to initialize blocks and clear poison. So we don't really have to
> > introduce a new interface?
> 
> pmem_make_request() is shared with the I/O path and is too low in the
> stack to understand intent. DAX intercepts the I/O path closer to the
> filesystem and can understand zeroing vs writing today. I'm proposing
> we go a step further and make DAX understand free-to-allocated-block
> initialization instead of just zeroing. Inject the error clearing into
> that initialization interface.
> 
> > Or you are suggesting separate dax_zero_page_range() interface which will
> > always call into firmware to clear poison. And that will make sure latent
> > poison is cleared as well and filesystem should use that for block
> > initialization instead?
> 
> Yes, except latent poison would not be cleared until the zeroing is
> implemented with movdir64b instead of callouts to firmware. It's
> otherwise too slow to call out to firmware unconditionally.
> 
> > I do like the idea of not having to differentiate
> > between known poison and latent poison. Once a block has been initialized
> > all poison should be cleared (known/latent). I am worried though that
> > on large devices this might slowdown filesystem initialization a lot
> > if they are zeroing large range of blocks.
> >
> > If yes, this sounds like two different patch series. First patch series
> > takes care of removing blkdev_issue_zeroout() from
> > __dax_zero_page_range() and couple of iomap related cleans christoph
> > wanted.
> >
> > And second patch series for adding new dax operation to zero a range
> > and always call info firmware to clear poison and modify filesystems
> > accordingly.
> 
> Yes, but they may need to be merged together. I don't want to regress
> the ability of a block-aligned hole-punch to clear errors.

Hi Dan,

IIUC, block aligned hole punch don't go through __dax_zero_page_range()
path. Instead they call blkdev_issue_zeroout() at later point of time.

Only partial block zeroing path is taking __dax_zero_page_range(). So
even if we remove poison clearing code from __dax_zero_page_range(),
there should not be a regression w.r.t full block zeroing. Only possible
regression will be if somebody was doing partial block zeroing on sector
boundary, then poison will not be cleared.

We now seem to be discussing too many issues w.r.t poison clearing
and dax. Atleast 3 issues are mentioned in this thread.

A. Get rid of dependency on block device in dax zeroing path.
   (__dax_zero_page_range)

B. Provide a way to clear latent poison. And possibly use movdir64b to
   do that and make filesystems use that interface for initialization
   of blocks.

C. Dax zero operation is clearing known poison while copy_from_iter() is
   not. I guess this ship has already sailed. If we change it now,
   somebody will complain of some regression.

For issue A, there are two possible ways to deal with it.

1. Implement a dax method to zero page. And this method will also clear
   known poison. This is what my patch series is doing.

2. Just get rid of blkdev_issue_zeroout() from __dax_zero_page_range()
   so that no poison will be cleared in __dax_zero_page_range() path. This
   path is currently used in partial page zeroing path and full filesystem
   block zeroing happens with blkdev_issue_zeroout(). There is a small
   chance of regression here in case of sector aligned partial block
   zeroing.

My patch series takes care of issue A without any regressions. In fact it
improves current interface. For example, currently "truncate -s 512
foo.txt" will succeed even if first sector in the block is poisoned. My
patch series fixes it. Current implementation will return error on if any
non sector aligned truncate is done and any of the sector is poisoned. My
implementation will not return error if poisoned can be cleared as part
of zeroing. It will return only if poison is present in non-zeoring part.

Why don't we solve one issue A now and deal with issue B and C later in
a sepaprate patch series. This patch series gets rid of dependency on 
block device in dax path and also makes current zeroing interface better.

Thanks
Vivek
Dan Williams Feb. 25, 2020, 4:25 p.m. UTC | #22
On Tue, Feb 25, 2020 at 5:37 AM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Mon, Feb 24, 2020 at 01:32:58PM -0800, Dan Williams wrote:
>
> [..]
> > > > > Ok, how about if I add one more patch to the series which will check
> > > > > if unwritten portion of the page has known poison. If it has, then
> > > > > -EIO is returned.
> > > > >
> > > > >
> > > > > Subject: pmem: zero page range return error if poisoned memory in unwritten area
> > > > >
> > > > > Filesystems call into pmem_dax_zero_page_range() to zero partial page upon
> > > > > truncate. If partial page is being zeroed, then at the end of operation
> > > > > file systems expect that there is no poison in the whole page (atleast
> > > > > known poison).
> > > > >
> > > > > So make sure part of the partial page which is not being written, does not
> > > > > have poison. If it does, return error. If there is poison in area of page
> > > > > being written, it will be cleared.
> > > >
> > > > No, I don't like that the zero operation is special cased compared to
> > > > the write case. I'd say let's make them identical for now. I.e. fail
> > > > the I/O at dax_direct_access() time.
> > >
> > > So basically __dax_zero_page_range() will only write zeros (and not
> > > try to clear any poison). Right?
> >
> > Yes, the zero operation would have already failed at the
> > dax_direct_access() step if there was present poison.
> >
> > > > I think the error clearing
> > > > interface should be an explicit / separate op rather than a
> > > > side-effect. What about an explicit interface for initializing newly
> > > > allocated blocks, and the only reliable way to destroy poison through
> > > > the filesystem is to free the block?
> > >
> > > Effectively pmem_make_request() is already that interface filesystems
> > > use to initialize blocks and clear poison. So we don't really have to
> > > introduce a new interface?
> >
> > pmem_make_request() is shared with the I/O path and is too low in the
> > stack to understand intent. DAX intercepts the I/O path closer to the
> > filesystem and can understand zeroing vs writing today. I'm proposing
> > we go a step further and make DAX understand free-to-allocated-block
> > initialization instead of just zeroing. Inject the error clearing into
> > that initialization interface.
> >
> > > Or you are suggesting separate dax_zero_page_range() interface which will
> > > always call into firmware to clear poison. And that will make sure latent
> > > poison is cleared as well and filesystem should use that for block
> > > initialization instead?
> >
> > Yes, except latent poison would not be cleared until the zeroing is
> > implemented with movdir64b instead of callouts to firmware. It's
> > otherwise too slow to call out to firmware unconditionally.
> >
> > > I do like the idea of not having to differentiate
> > > between known poison and latent poison. Once a block has been initialized
> > > all poison should be cleared (known/latent). I am worried though that
> > > on large devices this might slowdown filesystem initialization a lot
> > > if they are zeroing large range of blocks.
> > >
> > > If yes, this sounds like two different patch series. First patch series
> > > takes care of removing blkdev_issue_zeroout() from
> > > __dax_zero_page_range() and couple of iomap related cleans christoph
> > > wanted.
> > >
> > > And second patch series for adding new dax operation to zero a range
> > > and always call info firmware to clear poison and modify filesystems
> > > accordingly.
> >
> > Yes, but they may need to be merged together. I don't want to regress
> > the ability of a block-aligned hole-punch to clear errors.
>
> Hi Dan,
>
> IIUC, block aligned hole punch don't go through __dax_zero_page_range()
> path. Instead they call blkdev_issue_zeroout() at later point of time.
>
> Only partial block zeroing path is taking __dax_zero_page_range(). So
> even if we remove poison clearing code from __dax_zero_page_range(),
> there should not be a regression w.r.t full block zeroing. Only possible
> regression will be if somebody was doing partial block zeroing on sector
> boundary, then poison will not be cleared.
>
> We now seem to be discussing too many issues w.r.t poison clearing
> and dax. Atleast 3 issues are mentioned in this thread.
>
> A. Get rid of dependency on block device in dax zeroing path.
>    (__dax_zero_page_range)
>
> B. Provide a way to clear latent poison. And possibly use movdir64b to
>    do that and make filesystems use that interface for initialization
>    of blocks.
>
> C. Dax zero operation is clearing known poison while copy_from_iter() is
>    not. I guess this ship has already sailed. If we change it now,
>    somebody will complain of some regression.
>
> For issue A, there are two possible ways to deal with it.
>
> 1. Implement a dax method to zero page. And this method will also clear
>    known poison. This is what my patch series is doing.
>
> 2. Just get rid of blkdev_issue_zeroout() from __dax_zero_page_range()
>    so that no poison will be cleared in __dax_zero_page_range() path. This
>    path is currently used in partial page zeroing path and full filesystem
>    block zeroing happens with blkdev_issue_zeroout(). There is a small
>    chance of regression here in case of sector aligned partial block
>    zeroing.
>
> My patch series takes care of issue A without any regressions. In fact it
> improves current interface. For example, currently "truncate -s 512
> foo.txt" will succeed even if first sector in the block is poisoned. My
> patch series fixes it. Current implementation will return error on if any
> non sector aligned truncate is done and any of the sector is poisoned. My
> implementation will not return error if poisoned can be cleared as part
> of zeroing. It will return only if poison is present in non-zeoring part.

That asymmetry makes the implementation too much of a special case. If
the dax mapping path forces error boundaries on PAGE_SIZE blocks then
so should zeroing.

>
> Why don't we solve one issue A now and deal with issue B and C later in
> a sepaprate patch series. This patch series gets rid of dependency on
> block device in dax path and also makes current zeroing interface better.

I'm ok with replacing blkdev_issue_zeroout() with a dax operation
callback that deals with page aligned entries. That change at least
makes the error boundary symmetric across copy_from_iter() and the
zeroing path.
Vivek Goyal Feb. 25, 2020, 8:08 p.m. UTC | #23
On Tue, Feb 25, 2020 at 08:25:27AM -0800, Dan Williams wrote:
> On Tue, Feb 25, 2020 at 5:37 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > On Mon, Feb 24, 2020 at 01:32:58PM -0800, Dan Williams wrote:
> >
> > [..]
> > > > > > Ok, how about if I add one more patch to the series which will check
> > > > > > if unwritten portion of the page has known poison. If it has, then
> > > > > > -EIO is returned.
> > > > > >
> > > > > >
> > > > > > Subject: pmem: zero page range return error if poisoned memory in unwritten area
> > > > > >
> > > > > > Filesystems call into pmem_dax_zero_page_range() to zero partial page upon
> > > > > > truncate. If partial page is being zeroed, then at the end of operation
> > > > > > file systems expect that there is no poison in the whole page (atleast
> > > > > > known poison).
> > > > > >
> > > > > > So make sure part of the partial page which is not being written, does not
> > > > > > have poison. If it does, return error. If there is poison in area of page
> > > > > > being written, it will be cleared.
> > > > >
> > > > > No, I don't like that the zero operation is special cased compared to
> > > > > the write case. I'd say let's make them identical for now. I.e. fail
> > > > > the I/O at dax_direct_access() time.
> > > >
> > > > So basically __dax_zero_page_range() will only write zeros (and not
> > > > try to clear any poison). Right?
> > >
> > > Yes, the zero operation would have already failed at the
> > > dax_direct_access() step if there was present poison.
> > >
> > > > > I think the error clearing
> > > > > interface should be an explicit / separate op rather than a
> > > > > side-effect. What about an explicit interface for initializing newly
> > > > > allocated blocks, and the only reliable way to destroy poison through
> > > > > the filesystem is to free the block?
> > > >
> > > > Effectively pmem_make_request() is already that interface filesystems
> > > > use to initialize blocks and clear poison. So we don't really have to
> > > > introduce a new interface?
> > >
> > > pmem_make_request() is shared with the I/O path and is too low in the
> > > stack to understand intent. DAX intercepts the I/O path closer to the
> > > filesystem and can understand zeroing vs writing today. I'm proposing
> > > we go a step further and make DAX understand free-to-allocated-block
> > > initialization instead of just zeroing. Inject the error clearing into
> > > that initialization interface.
> > >
> > > > Or you are suggesting separate dax_zero_page_range() interface which will
> > > > always call into firmware to clear poison. And that will make sure latent
> > > > poison is cleared as well and filesystem should use that for block
> > > > initialization instead?
> > >
> > > Yes, except latent poison would not be cleared until the zeroing is
> > > implemented with movdir64b instead of callouts to firmware. It's
> > > otherwise too slow to call out to firmware unconditionally.
> > >
> > > > I do like the idea of not having to differentiate
> > > > between known poison and latent poison. Once a block has been initialized
> > > > all poison should be cleared (known/latent). I am worried though that
> > > > on large devices this might slowdown filesystem initialization a lot
> > > > if they are zeroing large range of blocks.
> > > >
> > > > If yes, this sounds like two different patch series. First patch series
> > > > takes care of removing blkdev_issue_zeroout() from
> > > > __dax_zero_page_range() and couple of iomap related cleans christoph
> > > > wanted.
> > > >
> > > > And second patch series for adding new dax operation to zero a range
> > > > and always call info firmware to clear poison and modify filesystems
> > > > accordingly.
> > >
> > > Yes, but they may need to be merged together. I don't want to regress
> > > the ability of a block-aligned hole-punch to clear errors.
> >
> > Hi Dan,
> >
> > IIUC, block aligned hole punch don't go through __dax_zero_page_range()
> > path. Instead they call blkdev_issue_zeroout() at later point of time.
> >
> > Only partial block zeroing path is taking __dax_zero_page_range(). So
> > even if we remove poison clearing code from __dax_zero_page_range(),
> > there should not be a regression w.r.t full block zeroing. Only possible
> > regression will be if somebody was doing partial block zeroing on sector
> > boundary, then poison will not be cleared.
> >
> > We now seem to be discussing too many issues w.r.t poison clearing
> > and dax. Atleast 3 issues are mentioned in this thread.
> >
> > A. Get rid of dependency on block device in dax zeroing path.
> >    (__dax_zero_page_range)
> >
> > B. Provide a way to clear latent poison. And possibly use movdir64b to
> >    do that and make filesystems use that interface for initialization
> >    of blocks.
> >
> > C. Dax zero operation is clearing known poison while copy_from_iter() is
> >    not. I guess this ship has already sailed. If we change it now,
> >    somebody will complain of some regression.
> >
> > For issue A, there are two possible ways to deal with it.
> >
> > 1. Implement a dax method to zero page. And this method will also clear
> >    known poison. This is what my patch series is doing.
> >
> > 2. Just get rid of blkdev_issue_zeroout() from __dax_zero_page_range()
> >    so that no poison will be cleared in __dax_zero_page_range() path. This
> >    path is currently used in partial page zeroing path and full filesystem
> >    block zeroing happens with blkdev_issue_zeroout(). There is a small
> >    chance of regression here in case of sector aligned partial block
> >    zeroing.
> >
> > My patch series takes care of issue A without any regressions. In fact it
> > improves current interface. For example, currently "truncate -s 512
> > foo.txt" will succeed even if first sector in the block is poisoned. My
> > patch series fixes it. Current implementation will return error on if any
> > non sector aligned truncate is done and any of the sector is poisoned. My
> > implementation will not return error if poisoned can be cleared as part
> > of zeroing. It will return only if poison is present in non-zeoring part.
> 
> That asymmetry makes the implementation too much of a special case. If
> the dax mapping path forces error boundaries on PAGE_SIZE blocks then
> so should zeroing.
> 
> >
> > Why don't we solve one issue A now and deal with issue B and C later in
> > a sepaprate patch series. This patch series gets rid of dependency on
> > block device in dax path and also makes current zeroing interface better.
> 
> I'm ok with replacing blkdev_issue_zeroout() with a dax operation
> callback that deals with page aligned entries. That change at least
> makes the error boundary symmetric across copy_from_iter() and the
> zeroing path.

IIUC, you are suggesting that modify dax_zero_page_range() to take page
aligned start and size and call this interface from
__dax_zero_page_range() and get rid of blkdev_issue_zeroout() in that
path?

Something like.

__dax_zero_page_range() {
  if(page_aligned_io)
  	call_dax_page_zero_range()
  else
   	use_direct_access_and_memcpy;
}

And other callers of blkdev_issue_zeroout() in filesystems can migrate
to calling dax_zero_page_range() instead.

If yes, I am not seeing what advantage do we get by this change.

- __dax_zero_page_range() seems to be called by only partial block
  zeroing code. So dax_zero_page_range() call will remain unused.

- dax_zero_page_range() will be exact replacement of
  blkdev_issue_zeroout() so filesystems will not gain anything. Just that
  it will create a dax specific hook.

In that case it might be simpler to just get rid of blkdev_issue_zeroout()
call from __dax_zero_page_range() and make sure there are no callers of
full block zeroing from this path.

Thanks
Vivek
Jeff Moyer Feb. 25, 2020, 8:32 p.m. UTC | #24
Dan Williams <dan.j.williams@intel.com> writes:

> On Mon, Feb 24, 2020 at 1:53 PM Jeff Moyer <jmoyer@redhat.com> wrote:
>>
>> Dan Williams <dan.j.williams@intel.com> writes:
>>
>> >> Let's just focus on reporting errors when we know we have them.
>> >
>> > That's the problem in my eyes. If software needs to contend with
>> > latent error reporting then it should always contend otherwise
>> > software has multiple error models to wrangle.
>>
>> The only way for an application to know that the data has been written
>> successfully would be to issue a read after every write.  That's not a
>> performance hit most applications are willing to take.  And, of course,
>> the media can still go bad at a later time, so it only guarantees the
>> data is accessible immediately after having been written.
>>
>> What I'm suggesting is that we should not complete a write successfully
>> if we know that the data will not be retrievable.  I wouldn't call this
>> adding an extra error model to contend with.  Applications should
>> already be checking for errors on write.
>>
>> Does that make sense? Are we talking past each other?
>
> The badblock list is late to update in both directions, late to add
> entries that the scrub needs to find and late to delete entries that
> were inadvertently cleared by cache-line writes that did not first
> ingest the poison for a read-modify-write.

We aren't looking for perfection.  If the badblocks list is populated,
then report the error instead of letting the user write data that we
know they won't be able to access later.

You have confused me, though, since I thought that stores to bad media
would not clear errors.  Perhaps you are talking about some future
hardware implementation that doesn't yet exist?

> So I see the above as being wishful in using the error list as the
> hard source of truth and unfortunate to up-level all sub-sector error
> entries into full PAGE_SIZE data offline events.

The page size granularity is only an issue for mmap().  If you are using
read/write, then 512 byte granularity can be achieved.  Even with mmap,
if you encounter an error on a 4k page, you can query the status of each
sector in that page to isolate the error.  So I'm not quite sure I
understand what you're getting at.

> I'm hoping we can find a way to make the error handling more fine
> grained over time, but for the current patch, managing the blast
> radius as PAGE_SIZE granularity at least matches the zero path with
> the write path.

I think the write path can do 512 byte granularity, not page size.
Anyway, I think we've gone far enough off into the weeds that more
patches will have to be posted for debate.  :)

>> > Setting that aside we can start with just treating zeroing the same as
>> > the copy_from_iter() case and fail the I/O at the dax_direct_access()
>> > step.
>>
>> OK.
>>
>> > I'd rather have a separate op that filesystems can use to clear errors
>> > at block allocation time that can be enforced to have the correct
>> > alignment.
>>
>> So would file systems always call that routine instead of zeroing, or
>> would they first check to see if there are badblocks?
>
> The proposal is that filesystems distinguish zeroing from free-block
> allocation/initialization such that the fsdax implementation directs
> initialization to a driver callback. This "initialization op" would
> take care to check for poison and clear it. All other dax paths would
> not consult the badblocks list.

What do you mean by "all other dax paths?"  Would that include
mmap/direct_access?  Because that definitely should still consult the
badblocks list.

I'm not against having a separate operation for clearing errors, but I
guess I'm not convinced it's cleaner, either.

-Jeff
Dan Williams Feb. 25, 2020, 9:52 p.m. UTC | #25
On Tue, Feb 25, 2020 at 12:33 PM Jeff Moyer <jmoyer@redhat.com> wrote:
>
> Dan Williams <dan.j.williams@intel.com> writes:
>
> > On Mon, Feb 24, 2020 at 1:53 PM Jeff Moyer <jmoyer@redhat.com> wrote:
> >>
> >> Dan Williams <dan.j.williams@intel.com> writes:
> >>
> >> >> Let's just focus on reporting errors when we know we have them.
> >> >
> >> > That's the problem in my eyes. If software needs to contend with
> >> > latent error reporting then it should always contend otherwise
> >> > software has multiple error models to wrangle.
> >>
> >> The only way for an application to know that the data has been written
> >> successfully would be to issue a read after every write.  That's not a
> >> performance hit most applications are willing to take.  And, of course,
> >> the media can still go bad at a later time, so it only guarantees the
> >> data is accessible immediately after having been written.
> >>
> >> What I'm suggesting is that we should not complete a write successfully
> >> if we know that the data will not be retrievable.  I wouldn't call this
> >> adding an extra error model to contend with.  Applications should
> >> already be checking for errors on write.
> >>
> >> Does that make sense? Are we talking past each other?
> >
> > The badblock list is late to update in both directions, late to add
> > entries that the scrub needs to find and late to delete entries that
> > were inadvertently cleared by cache-line writes that did not first
> > ingest the poison for a read-modify-write.
>
> We aren't looking for perfection.  If the badblocks list is populated,
> then report the error instead of letting the user write data that we
> know they won't be able to access later.
>
> You have confused me, though, since I thought that stores to bad media
> would not clear errors.  Perhaps you are talking about some future
> hardware implementation that doesn't yet exist?

No, today cacheline aligned DMA, and cpu cachelines that are fully
dirtied without a demand fill can end up clearing poison. The
movdir64b instruction provides a way to force this behavior from the
cpu where it was previously luck.

> > So I see the above as being wishful in using the error list as the
> > hard source of truth and unfortunate to up-level all sub-sector error
> > entries into full PAGE_SIZE data offline events.
>
> The page size granularity is only an issue for mmap().  If you are using
> read/write, then 512 byte granularity can be achieved.  Even with mmap,
> if you encounter an error on a 4k page, you can query the status of each
> sector in that page to isolate the error.  So I'm not quite sure I
> understand what you're getting at.

I'm getting at the fact that we don't allow mmap on PAGE_SIZE
granularity in the presence of errors, and don't allow dax I/O to the
page when an error is present. Only non-dax direct-I/O can read /
write at sub-page granularity in the presence of errors.

The interface to query the status is not coordinated with the
filesystem, but that's a whole other discussion. Yes, we're getting a
bit off in the weeds.

> > I'm hoping we can find a way to make the error handling more fine
> > grained over time, but for the current patch, managing the blast
> > radius as PAGE_SIZE granularity at least matches the zero path with
> > the write path.
>
> I think the write path can do 512 byte granularity, not page size.
> Anyway, I think we've gone far enough off into the weeds that more
> patches will have to be posted for debate.  :)
>

It can't, see dax_iomap_actor(). That call to dax_direct_access() will
fail if it hits a known badblock.

> >> > Setting that aside we can start with just treating zeroing the same as
> >> > the copy_from_iter() case and fail the I/O at the dax_direct_access()
> >> > step.
> >>
> >> OK.
> >>
> >> > I'd rather have a separate op that filesystems can use to clear errors
> >> > at block allocation time that can be enforced to have the correct
> >> > alignment.
> >>
> >> So would file systems always call that routine instead of zeroing, or
> >> would they first check to see if there are badblocks?
> >
> > The proposal is that filesystems distinguish zeroing from free-block
> > allocation/initialization such that the fsdax implementation directs
> > initialization to a driver callback. This "initialization op" would
> > take care to check for poison and clear it. All other dax paths would
> > not consult the badblocks list.
>
> What do you mean by "all other dax paths?"  Would that include
> mmap/direct_access?  Because that definitely should still consult the
> badblocks list.

dax_direct_access() consults the badblock list,
dax_copy_{to,from}_iter() do not, and badblocks discovered after a
page is mapped do not invalidate the page unless the poison is
consumed.

> I'm not against having a separate operation for clearing errors, but I
> guess I'm not convinced it's cleaner, either.

The idea with a separate op is to have an explicit ABI to clear errors
like page-aligned-hole-punch (FALLOC_FL_PUNCH_ERROR?) vs some implicit
side effect of direct-I/O writes.

I also feel like we're heading in a direction that tries to avoid
relying on the cpu machine-check recovery path at all costs. That's
the kernel's prerogative, of course, but it seems like at a minimum
we're not quite on the same page about what role machine-check
recovery plays in the error handling model.
Dan Williams Feb. 25, 2020, 10:49 p.m. UTC | #26
On Tue, Feb 25, 2020 at 12:08 PM Vivek Goyal <vgoyal@redhat.com> wrote:
>
> On Tue, Feb 25, 2020 at 08:25:27AM -0800, Dan Williams wrote:
> > On Tue, Feb 25, 2020 at 5:37 AM Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > On Mon, Feb 24, 2020 at 01:32:58PM -0800, Dan Williams wrote:
> > >
> > > [..]
> > > > > > > Ok, how about if I add one more patch to the series which will check
> > > > > > > if unwritten portion of the page has known poison. If it has, then
> > > > > > > -EIO is returned.
> > > > > > >
> > > > > > >
> > > > > > > Subject: pmem: zero page range return error if poisoned memory in unwritten area
> > > > > > >
> > > > > > > Filesystems call into pmem_dax_zero_page_range() to zero partial page upon
> > > > > > > truncate. If partial page is being zeroed, then at the end of operation
> > > > > > > file systems expect that there is no poison in the whole page (atleast
> > > > > > > known poison).
> > > > > > >
> > > > > > > So make sure part of the partial page which is not being written, does not
> > > > > > > have poison. If it does, return error. If there is poison in area of page
> > > > > > > being written, it will be cleared.
> > > > > >
> > > > > > No, I don't like that the zero operation is special cased compared to
> > > > > > the write case. I'd say let's make them identical for now. I.e. fail
> > > > > > the I/O at dax_direct_access() time.
> > > > >
> > > > > So basically __dax_zero_page_range() will only write zeros (and not
> > > > > try to clear any poison). Right?
> > > >
> > > > Yes, the zero operation would have already failed at the
> > > > dax_direct_access() step if there was present poison.
> > > >
> > > > > > I think the error clearing
> > > > > > interface should be an explicit / separate op rather than a
> > > > > > side-effect. What about an explicit interface for initializing newly
> > > > > > allocated blocks, and the only reliable way to destroy poison through
> > > > > > the filesystem is to free the block?
> > > > >
> > > > > Effectively pmem_make_request() is already that interface filesystems
> > > > > use to initialize blocks and clear poison. So we don't really have to
> > > > > introduce a new interface?
> > > >
> > > > pmem_make_request() is shared with the I/O path and is too low in the
> > > > stack to understand intent. DAX intercepts the I/O path closer to the
> > > > filesystem and can understand zeroing vs writing today. I'm proposing
> > > > we go a step further and make DAX understand free-to-allocated-block
> > > > initialization instead of just zeroing. Inject the error clearing into
> > > > that initialization interface.
> > > >
> > > > > Or you are suggesting separate dax_zero_page_range() interface which will
> > > > > always call into firmware to clear poison. And that will make sure latent
> > > > > poison is cleared as well and filesystem should use that for block
> > > > > initialization instead?
> > > >
> > > > Yes, except latent poison would not be cleared until the zeroing is
> > > > implemented with movdir64b instead of callouts to firmware. It's
> > > > otherwise too slow to call out to firmware unconditionally.
> > > >
> > > > > I do like the idea of not having to differentiate
> > > > > between known poison and latent poison. Once a block has been initialized
> > > > > all poison should be cleared (known/latent). I am worried though that
> > > > > on large devices this might slowdown filesystem initialization a lot
> > > > > if they are zeroing large range of blocks.
> > > > >
> > > > > If yes, this sounds like two different patch series. First patch series
> > > > > takes care of removing blkdev_issue_zeroout() from
> > > > > __dax_zero_page_range() and couple of iomap related cleans christoph
> > > > > wanted.
> > > > >
> > > > > And second patch series for adding new dax operation to zero a range
> > > > > and always call info firmware to clear poison and modify filesystems
> > > > > accordingly.
> > > >
> > > > Yes, but they may need to be merged together. I don't want to regress
> > > > the ability of a block-aligned hole-punch to clear errors.
> > >
> > > Hi Dan,
> > >
> > > IIUC, block aligned hole punch don't go through __dax_zero_page_range()
> > > path. Instead they call blkdev_issue_zeroout() at later point of time.
> > >
> > > Only partial block zeroing path is taking __dax_zero_page_range(). So
> > > even if we remove poison clearing code from __dax_zero_page_range(),
> > > there should not be a regression w.r.t full block zeroing. Only possible
> > > regression will be if somebody was doing partial block zeroing on sector
> > > boundary, then poison will not be cleared.
> > >
> > > We now seem to be discussing too many issues w.r.t poison clearing
> > > and dax. Atleast 3 issues are mentioned in this thread.
> > >
> > > A. Get rid of dependency on block device in dax zeroing path.
> > >    (__dax_zero_page_range)
> > >
> > > B. Provide a way to clear latent poison. And possibly use movdir64b to
> > >    do that and make filesystems use that interface for initialization
> > >    of blocks.
> > >
> > > C. Dax zero operation is clearing known poison while copy_from_iter() is
> > >    not. I guess this ship has already sailed. If we change it now,
> > >    somebody will complain of some regression.
> > >
> > > For issue A, there are two possible ways to deal with it.
> > >
> > > 1. Implement a dax method to zero page. And this method will also clear
> > >    known poison. This is what my patch series is doing.
> > >
> > > 2. Just get rid of blkdev_issue_zeroout() from __dax_zero_page_range()
> > >    so that no poison will be cleared in __dax_zero_page_range() path. This
> > >    path is currently used in partial page zeroing path and full filesystem
> > >    block zeroing happens with blkdev_issue_zeroout(). There is a small
> > >    chance of regression here in case of sector aligned partial block
> > >    zeroing.
> > >
> > > My patch series takes care of issue A without any regressions. In fact it
> > > improves current interface. For example, currently "truncate -s 512
> > > foo.txt" will succeed even if first sector in the block is poisoned. My
> > > patch series fixes it. Current implementation will return error on if any
> > > non sector aligned truncate is done and any of the sector is poisoned. My
> > > implementation will not return error if poisoned can be cleared as part
> > > of zeroing. It will return only if poison is present in non-zeoring part.
> >
> > That asymmetry makes the implementation too much of a special case. If
> > the dax mapping path forces error boundaries on PAGE_SIZE blocks then
> > so should zeroing.
> >
> > >
> > > Why don't we solve one issue A now and deal with issue B and C later in
> > > a sepaprate patch series. This patch series gets rid of dependency on
> > > block device in dax path and also makes current zeroing interface better.
> >
> > I'm ok with replacing blkdev_issue_zeroout() with a dax operation
> > callback that deals with page aligned entries. That change at least
> > makes the error boundary symmetric across copy_from_iter() and the
> > zeroing path.
>
> IIUC, you are suggesting that modify dax_zero_page_range() to take page
> aligned start and size and call this interface from
> __dax_zero_page_range() and get rid of blkdev_issue_zeroout() in that
> path?
>
> Something like.
>
> __dax_zero_page_range() {
>   if(page_aligned_io)
>         call_dax_page_zero_range()
>   else
>         use_direct_access_and_memcpy;
> }
>
> And other callers of blkdev_issue_zeroout() in filesystems can migrate
> to calling dax_zero_page_range() instead.
>
> If yes, I am not seeing what advantage do we get by this change.
>
> - __dax_zero_page_range() seems to be called by only partial block
>   zeroing code. So dax_zero_page_range() call will remain unused.
>
>
> - dax_zero_page_range() will be exact replacement of
>   blkdev_issue_zeroout() so filesystems will not gain anything. Just that
>   it will create a dax specific hook.
>
> In that case it might be simpler to just get rid of blkdev_issue_zeroout()
> call from __dax_zero_page_range() and make sure there are no callers of
> full block zeroing from this path.

I think you're right. The path I'm concerned about not regressing is
the error clearing on new block allocation and we get that already via
xfs_zero_extent() and sb_issue_zeroout(). For your fs we'll want a
dax-device equivalent  for that path, but that does mean that
__dax_zero_page_range() stays out of the error clearing game.
Jane Chu Feb. 25, 2020, 11:26 p.m. UTC | #27
On 2/24/2020 4:26 PM, Dan Williams wrote:
> On Mon, Feb 24, 2020 at 1:53 PM Jeff Moyer <jmoyer@redhat.com> wrote:
>>
>> Dan Williams <dan.j.williams@intel.com> writes:
>>
>>>> Let's just focus on reporting errors when we know we have them.
>>>
>>> That's the problem in my eyes. If software needs to contend with
>>> latent error reporting then it should always contend otherwise
>>> software has multiple error models to wrangle.
>>
>> The only way for an application to know that the data has been written
>> successfully would be to issue a read after every write.  That's not a
>> performance hit most applications are willing to take.  And, of course,
>> the media can still go bad at a later time, so it only guarantees the
>> data is accessible immediately after having been written.
>>
>> What I'm suggesting is that we should not complete a write successfully
>> if we know that the data will not be retrievable.  I wouldn't call this
>> adding an extra error model to contend with.  Applications should
>> already be checking for errors on write.
>>
>> Does that make sense? Are we talking past each other?
> 
> The badblock list is late to update in both directions, late to add
> entries that the scrub needs to find and late to delete entries that
> were inadvertently cleared by cache-line writes that did not first
> ingest the poison for a read-modify-write. So I see the above as being
> wishful in using the error list as the hard source of truth and
> unfortunate to up-level all sub-sector error entries into full
> PAGE_SIZE data offline events.

Sorry, don't mean to distract the discussion, but I'm wondering if
anyone has noticed SIGBUS with si_code = MCEERR_AO in a single process 
poison test over a dax-xfs file?  There is only 1 poison in the
file which has been consumed, it's the recovery code path (hole punch/
munmap/mmap/pwrite/read) that encounters the _AO. I'm confident that
latent error isn't the scenario per ARS scrub. Also, the _AO appears
rarely. This is un-explainable given the kernel MCE pmem handling
implementation.

> 
> I'm hoping we can find a way to make the error handling more fine
> grained over time, but for the current patch, managing the blast
> radius as PAGE_SIZE granularity at least matches the zero path with
> the write path.

Maybe the new filesystem op for clearing pmem poison should insist on
4K alignment? because in hwpoison_clear() the starting pfn is given
by PHYS_PFN which rounds down to the nearest page, so we might inadvertently
clear the poison bit and 'noce' bit from a page when we only cleared a
poison e.g. in the second half of the page.

BTW, set_mce_nospec() doesn't seem to work in 5.5 release,
   [ 2321.209382] Could not invalidate pfn=0x1850600 from 1:1 map
I will see if I can find more information.

> 
>>> Setting that aside we can start with just treating zeroing the same as
>>> the copy_from_iter() case and fail the I/O at the dax_direct_access()
>>> step.
>>
>> OK.
>>
>>> I'd rather have a separate op that filesystems can use to clear errors
>>> at block allocation time that can be enforced to have the correct
>>> alignment.
>>
>> So would file systems always call that routine instead of zeroing, or
>> would they first check to see if there are badblocks?
> 
> The proposal is that filesystems distinguish zeroing from free-block
> allocation/initialization such that the fsdax implementation directs
> initialization to a driver callback. This "initialization op" would
> take care to check for poison and clear it. All other dax paths would
> not consult the badblocks list.

thanks!
-jane


> _______________________________________________
> Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
> To unsubscribe send an email to linux-nvdimm-leave@lists.01.org
>
Vivek Goyal Feb. 26, 2020, 1:51 p.m. UTC | #28
On Tue, Feb 25, 2020 at 02:49:30PM -0800, Dan Williams wrote:
[..]
> > > > Hi Dan,
> > > >
> > > > IIUC, block aligned hole punch don't go through __dax_zero_page_range()
> > > > path. Instead they call blkdev_issue_zeroout() at later point of time.
> > > >
> > > > Only partial block zeroing path is taking __dax_zero_page_range(). So
> > > > even if we remove poison clearing code from __dax_zero_page_range(),
> > > > there should not be a regression w.r.t full block zeroing. Only possible
> > > > regression will be if somebody was doing partial block zeroing on sector
> > > > boundary, then poison will not be cleared.
> > > >
> > > > We now seem to be discussing too many issues w.r.t poison clearing
> > > > and dax. Atleast 3 issues are mentioned in this thread.
> > > >
> > > > A. Get rid of dependency on block device in dax zeroing path.
> > > >    (__dax_zero_page_range)
> > > >
> > > > B. Provide a way to clear latent poison. And possibly use movdir64b to
> > > >    do that and make filesystems use that interface for initialization
> > > >    of blocks.
> > > >
> > > > C. Dax zero operation is clearing known poison while copy_from_iter() is
> > > >    not. I guess this ship has already sailed. If we change it now,
> > > >    somebody will complain of some regression.
> > > >
> > > > For issue A, there are two possible ways to deal with it.
> > > >
> > > > 1. Implement a dax method to zero page. And this method will also clear
> > > >    known poison. This is what my patch series is doing.
> > > >
> > > > 2. Just get rid of blkdev_issue_zeroout() from __dax_zero_page_range()
> > > >    so that no poison will be cleared in __dax_zero_page_range() path. This
> > > >    path is currently used in partial page zeroing path and full filesystem
> > > >    block zeroing happens with blkdev_issue_zeroout(). There is a small
> > > >    chance of regression here in case of sector aligned partial block
> > > >    zeroing.
> > > >
> > > > My patch series takes care of issue A without any regressions. In fact it
> > > > improves current interface. For example, currently "truncate -s 512
> > > > foo.txt" will succeed even if first sector in the block is poisoned. My
> > > > patch series fixes it. Current implementation will return error on if any
> > > > non sector aligned truncate is done and any of the sector is poisoned. My
> > > > implementation will not return error if poisoned can be cleared as part
> > > > of zeroing. It will return only if poison is present in non-zeoring part.
> > >
> > > That asymmetry makes the implementation too much of a special case. If
> > > the dax mapping path forces error boundaries on PAGE_SIZE blocks then
> > > so should zeroing.
> > >
> > > >
> > > > Why don't we solve one issue A now and deal with issue B and C later in
> > > > a sepaprate patch series. This patch series gets rid of dependency on
> > > > block device in dax path and also makes current zeroing interface better.
> > >
> > > I'm ok with replacing blkdev_issue_zeroout() with a dax operation
> > > callback that deals with page aligned entries. That change at least
> > > makes the error boundary symmetric across copy_from_iter() and the
> > > zeroing path.
> >
> > IIUC, you are suggesting that modify dax_zero_page_range() to take page
> > aligned start and size and call this interface from
> > __dax_zero_page_range() and get rid of blkdev_issue_zeroout() in that
> > path?
> >
> > Something like.
> >
> > __dax_zero_page_range() {
> >   if(page_aligned_io)
> >         call_dax_page_zero_range()
> >   else
> >         use_direct_access_and_memcpy;
> > }
> >
> > And other callers of blkdev_issue_zeroout() in filesystems can migrate
> > to calling dax_zero_page_range() instead.
> >
> > If yes, I am not seeing what advantage do we get by this change.
> >
> > - __dax_zero_page_range() seems to be called by only partial block
> >   zeroing code. So dax_zero_page_range() call will remain unused.
> >
> >
> > - dax_zero_page_range() will be exact replacement of
> >   blkdev_issue_zeroout() so filesystems will not gain anything. Just that
> >   it will create a dax specific hook.
> >
> > In that case it might be simpler to just get rid of blkdev_issue_zeroout()
> > call from __dax_zero_page_range() and make sure there are no callers of
> > full block zeroing from this path.
> 
> I think you're right. The path I'm concerned about not regressing is
> the error clearing on new block allocation and we get that already via
> xfs_zero_extent() and sb_issue_zeroout(). For your fs we'll want a
> dax-device equivalent  for that path, but that does mean that
> __dax_zero_page_range() stays out of the error clearing game.

In virtiofs we do not manage our own blocks. We let host filesystem
do that and we are just passthrough filesystem passing fuse messages
around. So I have not seen need of block zeroing interface yet.

I just happened to carry a patch in my patch series in this area because
we wanted to get rid of this assumption that dax always has a block
device associated. Apart from that, I don't need __dax_zero_page_range()
for virtiofs. 

I am doing this cleanup so that we dont even try to use block device
in this dax zeroing path.

Anyway, I will cleanup this patch series and get rid of
blkdev_issue_zeroout() call from __dax_zero_page_range() and post again
for review and where does it go from there.

Thanks
Vivek
Vivek Goyal Feb. 26, 2020, 4:57 p.m. UTC | #29
On Tue, Feb 25, 2020 at 02:49:30PM -0800, Dan Williams wrote:
[..]
> > > I'm ok with replacing blkdev_issue_zeroout() with a dax operation
> > > callback that deals with page aligned entries. That change at least
> > > makes the error boundary symmetric across copy_from_iter() and the
> > > zeroing path.
> >
> > IIUC, you are suggesting that modify dax_zero_page_range() to take page
> > aligned start and size and call this interface from
> > __dax_zero_page_range() and get rid of blkdev_issue_zeroout() in that
> > path?
> >
> > Something like.
> >
> > __dax_zero_page_range() {
> >   if(page_aligned_io)
> >         call_dax_page_zero_range()
> >   else
> >         use_direct_access_and_memcpy;
> > }
> >
> > And other callers of blkdev_issue_zeroout() in filesystems can migrate
> > to calling dax_zero_page_range() instead.
> >
> > If yes, I am not seeing what advantage do we get by this change.
> >
> > - __dax_zero_page_range() seems to be called by only partial block
> >   zeroing code. So dax_zero_page_range() call will remain unused.
> >
> >
> > - dax_zero_page_range() will be exact replacement of
> >   blkdev_issue_zeroout() so filesystems will not gain anything. Just that
> >   it will create a dax specific hook.
> >
> > In that case it might be simpler to just get rid of blkdev_issue_zeroout()
> > call from __dax_zero_page_range() and make sure there are no callers of
> > full block zeroing from this path.
> 
> I think you're right. The path I'm concerned about not regressing is
> the error clearing on new block allocation and we get that already via
> xfs_zero_extent() and sb_issue_zeroout().

Well I was wrong. I found atleast one user which uses __dax_zero_page_range()
to zero full PAGE_SIZE blocks.

xfs_io -c "allocsp 32K 0" foo.txt

In that case, I will add a new dax method say dax_zero_page_range() which will
only take PAGE_SIZE aligned range will clear known poison and call that
from __dax_zero_page_range() if I/O is PAGE_SIZE aligned.

For now I will limit this interface to take only single page at a time
because otherwise implementation becomes complex in dm/md stack where
range has to be broken across multiple devices and there are no users
because current iomap API passes one page at a time.

So once we have grown users, then one can also tackle the complexity of
modifying dm/md to break a page range across multiple devices.

Will post a patch series for this.

Thanks
Vivek
Dave Chinner Feb. 27, 2020, 3:02 a.m. UTC | #30
On Mon, Feb 24, 2020 at 10:38:44AM -0500, Vivek Goyal wrote:
> On Mon, Feb 24, 2020 at 10:03:30AM +1100, Dave Chinner wrote:
> 
> [..]
> > > > > Hi Jeff,
> > > > >
> > > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > > > > a range which is less than a sector. Or which is bigger than a sector
> > > > > but start and end are not aligned on sector boundaries.
> > > > 
> > > > Sure, but who will call it with misaligned ranges?
> > > 
> > > create a file foo.txt of size 4K and then truncate it.
> > > 
> > > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> > > 4095.
> > 
> > This should fail with EIO. Only full page writes should clear the
> > bad page state, and partial writes should therefore fail because
> > they do not guarantee the data in the filesystem block is all good.
> > 
> > If this zeroing was a buffered write to an address with a bad
> > sector, then the writeback will fail and the user will (eventually)
> > get an EIO on the file.
> > 
> > DAX should do the same thing, except because the zeroing is
> > synchronous (i.e. done directly by the truncate syscall) we can -
> > and should - return EIO immediately.
> > 
> > Indeed, with your code, if we then extend the file by truncating up
> > back to 4k, then the range between 23 and 512 is still bad, even
> > though we've successfully zeroed it and the user knows it. An
> > attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
> > will fail with EIO, but reading 10 bytes at offset 2000 will
> > succeed.
> 
> Hi Dave,
> 
> What is expected if I do "truncate -s 512 foo.txt". Say first sector (0 to
> 511) is poisoned and rest don't have poison. Should this fail with -EIO.

Yes - the filesystem block still contains bad data.

> In current implementation it does not.

I'm not surprised - the whole hardware error handling architecture
for FS-DAX is fundamentally broken. It was designed for device-dax,
and it just doesn't work for FS-DAX.

For example, to get the hardware error handling to be able to kill
userspace applications, a 1:1 physical-to-logical association
constraint was added to fs/dax.c:

/*
 * TODO: for reflink+dax we need a way to associate a single page with
 * multiple address_space instances at different linear_page_index()
 * offsets.
 */
static void dax_associate_entry(void *entry, struct address_space *mapping,
                struct vm_area_struct *vma, unsigned long address)

because device-dax only has *linear mappings* and so has a fixed
reverse mapping architecture.

i.e. the whole hardware error handling infrastructure was designed
around the constraints of device-dax. device-dax does not having any
structure to serialise access to the physical storage, so locking
was added to the mapping tree. THe mapping tree locking is accessed
on hardware error via the reverse mappingi association in the struct
page and that's how device-dax serialises direct physical storage
access against hardware error processing.  And while the page index
is locked in the mapping tree, it can walk the process vmas that
have the page mapped to kill them so that they don't try to access
the bad page.

That bad physical storage state is carried in a volatile struct page
flag, hence requiring some mechanism to make it persistent (the
device bad blocks register) and some other mechanism to clear the
poison state (direct IO, IIRC).

It's also a big, nasty, solid roadblock to implementing shared
data extents in FS-DAX. We basically have to completely re-architect
the hardware error handling for FS-DAX so that such errors are
reported to the filesystem first and then the filesystem does what
is needed to handle the error.

None of this works for filesystems because they need to perform
different operations depending on what the page that went bad
contains. FS-DAX should never trip over an unexpected poisoned page;
we do so now because such physical storage errors are completely
hidden form the fielsystem.

What you are trying to do is slap a band-aid over what to do when we
hit an unexpected page containing bad data. Filesystems expect to
find out about bad data in storage when they marshall the data into
or out of memory. They make the assumption that once it is in memory
it remains valid on the physical storage. Hence if an in-memory
error occurs, we can just toss it away and re-read it from storage,
and all is good.

FS-DAX changes that - we are no longer marshalling data into and out
of memory so we don't have a mechanism to get EIO when reading the
page into the page cache or writing it back to disk. We also don't
have an in-memory copy of the data - the physical storage is the
in-memory copy, and so we can't just toss it away when an error
occurs.

What we actually require is instantaneous notification of physical
storage errors so we can handle the error immediately. And that, in
turn, means we should never poison or see poisoned pages during
direct access operations because the filesystem doesn't need to
poison pages to prevent user access - it controls how the storage is
accessed directly.

e.g. if this error is in filesystem metadata, we might be able to
recover from it as that metadata might have a full copy in memory
(metadata is buffered in both XFS and ext4) or we might be able to
reconstruct it from other sources. Worst case, we have shut the
filesystem down completely so the admin can repair the damage the
lost page has caused.

e.g. The physical page may be located in free space, in which case
we don't care and can just zero it so all the bad hardware state is
cleared. The page never goes bad or gets poisoned in that case.

e.g. The physical page may be user data, in which case it may be
buffered in the page cache (non-dax) and so can easily be recovered.
It may not be recoverable, in which case we need to issue log
messages indicating that data has been lost (path, offset, length),
and do the VMA walk and kill processes that map that page. Then we
can zero the page to clear the bad state.

If, at any point we can't clear the bad state (e.g. the zeroing or
the read-back verification fails), then we need to make sure that
filesystem block is marked as allocated in the free space map, then
tell the reverse map that it's owner is now "bad storage" so it
never gets used again. i.e. this is the persistent bad block
storage, but doing it this way results in the rest of the filesystem
code never, ever seeing a poisoned page. And users never see it
either, because it will never be returned to the free space pool.

Of course, this relies of the filesystem having reverse mapping
capability. XFS already has this funcitonality available as a mkfs
option (mkfs.xfs -m rmapbt=1 ...), and we really need this so we can
get rid of the association of a physical page with a mapping and
file offset that device-dax requires for hardware page error
handling.  This means we don't need the physical storage to try to
hold filesystem layer reverse mapping information for us, and this
also removes the roadblock that the hardware error handling has
placed on implementing reflink w/ FS-DAX.

IOWs, the problem you are trying to solve is a direct result of
filesysetms not being informed when a physical pmem page goes bad
and the current error handling being implemented at entirely the
wrong layer for FS-DAX. It may work for device-dax, but it's most
definitely insufficient for correct error handling for filesystems.

> Anyway, partial page truncate can't ensure that data in rest of the page can
> be read back successfully. Memory can get poison after the write and
> hence read after truncate will still fail.

Which is where the notification requirement comes in. Yes, we may
still get errors on read or write, but if memory at rest goes bad,
we want to handle that and correct it ASAP, not wait days or months
for something to trip over the poisoned page before we find out
about it.

> Hence, all we are trying to ensure is that if a poison is known at the
> time of writing partial page, then we should return error to user space.

I think within FS-DAX infrastructure, any access to the data (read
or write) within a poisoned page or a page marked with PageError()
should return EIO to the caller, unless it's the specific command to
clear the error/poison state on the page. What happens with that
error state is then up to the caller.

Cheers,

Dave.
Dave Chinner Feb. 27, 2020, 3:11 a.m. UTC | #31
On Wed, Feb 26, 2020 at 11:57:56AM -0500, Vivek Goyal wrote:
> On Tue, Feb 25, 2020 at 02:49:30PM -0800, Dan Williams wrote:
> [..]
> > > > I'm ok with replacing blkdev_issue_zeroout() with a dax operation
> > > > callback that deals with page aligned entries. That change at least
> > > > makes the error boundary symmetric across copy_from_iter() and the
> > > > zeroing path.
> > >
> > > IIUC, you are suggesting that modify dax_zero_page_range() to take page
> > > aligned start and size and call this interface from
> > > __dax_zero_page_range() and get rid of blkdev_issue_zeroout() in that
> > > path?
> > >
> > > Something like.
> > >
> > > __dax_zero_page_range() {
> > >   if(page_aligned_io)
> > >         call_dax_page_zero_range()
> > >   else
> > >         use_direct_access_and_memcpy;
> > > }
> > >
> > > And other callers of blkdev_issue_zeroout() in filesystems can migrate
> > > to calling dax_zero_page_range() instead.
> > >
> > > If yes, I am not seeing what advantage do we get by this change.
> > >
> > > - __dax_zero_page_range() seems to be called by only partial block
> > >   zeroing code. So dax_zero_page_range() call will remain unused.
> > >
> > >
> > > - dax_zero_page_range() will be exact replacement of
> > >   blkdev_issue_zeroout() so filesystems will not gain anything. Just that
> > >   it will create a dax specific hook.
> > >
> > > In that case it might be simpler to just get rid of blkdev_issue_zeroout()
> > > call from __dax_zero_page_range() and make sure there are no callers of
> > > full block zeroing from this path.
> > 
> > I think you're right. The path I'm concerned about not regressing is
> > the error clearing on new block allocation and we get that already via
> > xfs_zero_extent() and sb_issue_zeroout().
> 
> Well I was wrong. I found atleast one user which uses __dax_zero_page_range()
> to zero full PAGE_SIZE blocks.
> 
> xfs_io -c "allocsp 32K 0" foo.txt

That ioctl interface is deprecated and likely unused by any new
application written since 1999. It predates unwritten extents (1998)
and I don't think any native linux applications have ever used it. A
newly written DAX aware application would almost certainly not use
this interface.

IOWs, I wouldn't use it as justification for some special case
behaviour; I'm more likely to say "rip that ancient ioctl out" than
to jump through hoops because it's implementation behaviour.

Cheers,

Dave.
Dan Williams Feb. 27, 2020, 4:19 a.m. UTC | #32
On Wed, Feb 26, 2020 at 7:03 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Feb 24, 2020 at 10:38:44AM -0500, Vivek Goyal wrote:
> > On Mon, Feb 24, 2020 at 10:03:30AM +1100, Dave Chinner wrote:
> >
> > [..]
> > > > > > Hi Jeff,
> > > > > >
> > > > > > New dax zeroing interface (dax_zero_page_range()) can technically pass
> > > > > > a range which is less than a sector. Or which is bigger than a sector
> > > > > > but start and end are not aligned on sector boundaries.
> > > > >
> > > > > Sure, but who will call it with misaligned ranges?
> > > >
> > > > create a file foo.txt of size 4K and then truncate it.
> > > >
> > > > "truncate -s 23 foo.txt". Filesystems try to zero the bytes from 24 to
> > > > 4095.
> > >
> > > This should fail with EIO. Only full page writes should clear the
> > > bad page state, and partial writes should therefore fail because
> > > they do not guarantee the data in the filesystem block is all good.
> > >
> > > If this zeroing was a buffered write to an address with a bad
> > > sector, then the writeback will fail and the user will (eventually)
> > > get an EIO on the file.
> > >
> > > DAX should do the same thing, except because the zeroing is
> > > synchronous (i.e. done directly by the truncate syscall) we can -
> > > and should - return EIO immediately.
> > >
> > > Indeed, with your code, if we then extend the file by truncating up
> > > back to 4k, then the range between 23 and 512 is still bad, even
> > > though we've successfully zeroed it and the user knows it. An
> > > attempt to read anywhere in this range (e.g. 10 bytes at offset 100)
> > > will fail with EIO, but reading 10 bytes at offset 2000 will
> > > succeed.
> >
> > Hi Dave,
> >
> > What is expected if I do "truncate -s 512 foo.txt". Say first sector (0 to
> > 511) is poisoned and rest don't have poison. Should this fail with -EIO.
>
> Yes - the filesystem block still contains bad data.
>
> > In current implementation it does not.
>
> I'm not surprised - the whole hardware error handling architecture
> for FS-DAX is fundamentally broken. It was designed for device-dax,
> and it just doesn't work for FS-DAX.
>
> For example, to get the hardware error handling to be able to kill
> userspace applications, a 1:1 physical-to-logical association
> constraint was added to fs/dax.c:
>
> /*
>  * TODO: for reflink+dax we need a way to associate a single page with
>  * multiple address_space instances at different linear_page_index()
>  * offsets.
>  */
> static void dax_associate_entry(void *entry, struct address_space *mapping,
>                 struct vm_area_struct *vma, unsigned long address)
>
> because device-dax only has *linear mappings* and so has a fixed
> reverse mapping architecture.
>
> i.e. the whole hardware error handling infrastructure was designed
> around the constraints of device-dax. device-dax does not having any
> structure to serialise access to the physical storage, so locking
> was added to the mapping tree. THe mapping tree locking is accessed
> on hardware error via the reverse mappingi association in the struct
> page and that's how device-dax serialises direct physical storage
> access against hardware error processing.  And while the page index
> is locked in the mapping tree, it can walk the process vmas that
> have the page mapped to kill them so that they don't try to access
> the bad page.
>
> That bad physical storage state is carried in a volatile struct page
> flag, hence requiring some mechanism to make it persistent (the
> device bad blocks register) and some other mechanism to clear the
> poison state (direct IO, IIRC).
>
> It's also a big, nasty, solid roadblock to implementing shared
> data extents in FS-DAX. We basically have to completely re-architect
> the hardware error handling for FS-DAX so that such errors are
> reported to the filesystem first and then the filesystem does what
> is needed to handle the error.
>
> None of this works for filesystems because they need to perform
> different operations depending on what the page that went bad
> contains. FS-DAX should never trip over an unexpected poisoned page;
> we do so now because such physical storage errors are completely
> hidden form the fielsystem.
>
> What you are trying to do is slap a band-aid over what to do when we
> hit an unexpected page containing bad data. Filesystems expect to
> find out about bad data in storage when they marshall the data into
> or out of memory. They make the assumption that once it is in memory
> it remains valid on the physical storage. Hence if an in-memory
> error occurs, we can just toss it away and re-read it from storage,
> and all is good.
>
> FS-DAX changes that - we are no longer marshalling data into and out
> of memory so we don't have a mechanism to get EIO when reading the
> page into the page cache or writing it back to disk. We also don't
> have an in-memory copy of the data - the physical storage is the
> in-memory copy, and so we can't just toss it away when an error
> occurs.
>
> What we actually require is instantaneous notification of physical
> storage errors so we can handle the error immediately. And that, in
> turn, means we should never poison or see poisoned pages during
> direct access operations because the filesystem doesn't need to
> poison pages to prevent user access - it controls how the storage is
> accessed directly.
>
> e.g. if this error is in filesystem metadata, we might be able to
> recover from it as that metadata might have a full copy in memory
> (metadata is buffered in both XFS and ext4) or we might be able to
> reconstruct it from other sources. Worst case, we have shut the
> filesystem down completely so the admin can repair the damage the
> lost page has caused.
>
> e.g. The physical page may be located in free space, in which case
> we don't care and can just zero it so all the bad hardware state is
> cleared. The page never goes bad or gets poisoned in that case.
>
> e.g. The physical page may be user data, in which case it may be
> buffered in the page cache (non-dax) and so can easily be recovered.
> It may not be recoverable, in which case we need to issue log
> messages indicating that data has been lost (path, offset, length),
> and do the VMA walk and kill processes that map that page. Then we
> can zero the page to clear the bad state.
>
> If, at any point we can't clear the bad state (e.g. the zeroing or
> the read-back verification fails), then we need to make sure that
> filesystem block is marked as allocated in the free space map, then
> tell the reverse map that it's owner is now "bad storage" so it
> never gets used again. i.e. this is the persistent bad block
> storage, but doing it this way results in the rest of the filesystem
> code never, ever seeing a poisoned page. And users never see it
> either, because it will never be returned to the free space pool.
>
> Of course, this relies of the filesystem having reverse mapping
> capability. XFS already has this funcitonality available as a mkfs
> option (mkfs.xfs -m rmapbt=1 ...), and we really need this so we can
> get rid of the association of a physical page with a mapping and
> file offset that device-dax requires for hardware page error
> handling.  This means we don't need the physical storage to try to
> hold filesystem layer reverse mapping information for us, and this
> also removes the roadblock that the hardware error handling has
> placed on implementing reflink w/ FS-DAX.
>
> IOWs, the problem you are trying to solve is a direct result of
> filesysetms not being informed when a physical pmem page goes bad
> and the current error handling being implemented at entirely the
> wrong layer for FS-DAX. It may work for device-dax, but it's most
> definitely insufficient for correct error handling for filesystems.
>
> > Anyway, partial page truncate can't ensure that data in rest of the page can
> > be read back successfully. Memory can get poison after the write and
> > hence read after truncate will still fail.
>
> Which is where the notification requirement comes in. Yes, we may
> still get errors on read or write, but if memory at rest goes bad,
> we want to handle that and correct it ASAP, not wait days or months
> for something to trip over the poisoned page before we find out
> about it.
>
> > Hence, all we are trying to ensure is that if a poison is known at the
> > time of writing partial page, then we should return error to user space.
>
> I think within FS-DAX infrastructure, any access to the data (read
> or write) within a poisoned page or a page marked with PageError()
> should return EIO to the caller, unless it's the specific command to
> clear the error/poison state on the page. What happens with that
> error state is then up to the caller.
>

I agree with most of the above if you replace "device-dax error
handling" with "System RAM error handling". It's typical memory error
handling that injects the page->index and page->mappping dependency.
In fact it was difficult to map this to device-dax without the hack
that device-dax does not need to take a page lock. I do think that the
FS needs this error information, at the same time it's also true that
historically no FS pushed for this information for page-cache and
in-memory metadata prior to FS-DAX. So, the design was not "device-dax
first" it was "existing memory error handling first" and all the warts
that implied.

So you want the FS to have error handling for just pmem errors or all
memory errors? And you want this to be done without the mm core using
page->index to identify what to unmap when the error happens? Memory
error scanning is not universally available on all pmem
implementations, so FS will need to subscribe for memory-error
handling events. The implementation is clearer for pmem, if the FS is
only interested in memory error handling events for memory it
ostensibly owns (mounted pmem) vs other pages that are only
ephemerally related to the FS (typical RAM).
Vivek Goyal Feb. 27, 2020, 3:25 p.m. UTC | #33
On Thu, Feb 27, 2020 at 02:11:43PM +1100, Dave Chinner wrote:
> On Wed, Feb 26, 2020 at 11:57:56AM -0500, Vivek Goyal wrote:
> > On Tue, Feb 25, 2020 at 02:49:30PM -0800, Dan Williams wrote:
> > [..]
> > > > > I'm ok with replacing blkdev_issue_zeroout() with a dax operation
> > > > > callback that deals with page aligned entries. That change at least
> > > > > makes the error boundary symmetric across copy_from_iter() and the
> > > > > zeroing path.
> > > >
> > > > IIUC, you are suggesting that modify dax_zero_page_range() to take page
> > > > aligned start and size and call this interface from
> > > > __dax_zero_page_range() and get rid of blkdev_issue_zeroout() in that
> > > > path?
> > > >
> > > > Something like.
> > > >
> > > > __dax_zero_page_range() {
> > > >   if(page_aligned_io)
> > > >         call_dax_page_zero_range()
> > > >   else
> > > >         use_direct_access_and_memcpy;
> > > > }
> > > >
> > > > And other callers of blkdev_issue_zeroout() in filesystems can migrate
> > > > to calling dax_zero_page_range() instead.
> > > >
> > > > If yes, I am not seeing what advantage do we get by this change.
> > > >
> > > > - __dax_zero_page_range() seems to be called by only partial block
> > > >   zeroing code. So dax_zero_page_range() call will remain unused.
> > > >
> > > >
> > > > - dax_zero_page_range() will be exact replacement of
> > > >   blkdev_issue_zeroout() so filesystems will not gain anything. Just that
> > > >   it will create a dax specific hook.
> > > >
> > > > In that case it might be simpler to just get rid of blkdev_issue_zeroout()
> > > > call from __dax_zero_page_range() and make sure there are no callers of
> > > > full block zeroing from this path.
> > > 
> > > I think you're right. The path I'm concerned about not regressing is
> > > the error clearing on new block allocation and we get that already via
> > > xfs_zero_extent() and sb_issue_zeroout().
> > 
> > Well I was wrong. I found atleast one user which uses __dax_zero_page_range()
> > to zero full PAGE_SIZE blocks.
> > 
> > xfs_io -c "allocsp 32K 0" foo.txt
> 
> That ioctl interface is deprecated and likely unused by any new
> application written since 1999. It predates unwritten extents (1998)
> and I don't think any native linux applications have ever used it. A
> newly written DAX aware application would almost certainly not use
> this interface.
> 
> IOWs, I wouldn't use it as justification for some special case
> behaviour; I'm more likely to say "rip that ancient ioctl out" than
> to jump through hoops because it's implementation behaviour.

Hi Dave,

Do you see any other path in xfs using iomap_zero_range() and zeroing
full block. iomap_zero_range() already skips IOMAP_HOLE and
IOMAP_UNWRITTEN. So it has to be a full block zeroing which is of not type
IOMAP_HOLE and IOMAP_UNWRITTEN.

My understanding is that ext4 and xfs both are initializing full blocks
using blkdev_issue_zeroout(). Only partial blocks are being zeroed using
this dax zeroing path.

If there are no callers of full block zeroing through
__dax_zero_page_range(), then I can simply get rid of
blkdev_issue_zerout() call from __dax_zero_page_range().

Thanks
Vivek
Dave Chinner Feb. 28, 2020, 1:30 a.m. UTC | #34
On Wed, Feb 26, 2020 at 08:19:37PM -0800, Dan Williams wrote:
> On Wed, Feb 26, 2020 at 7:03 PM Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Feb 24, 2020 at 10:38:44AM -0500, Vivek Goyal wrote:
> > > Anyway, partial page truncate can't ensure that data in rest of the page can
> > > be read back successfully. Memory can get poison after the write and
> > > hence read after truncate will still fail.
> >
> > Which is where the notification requirement comes in. Yes, we may
> > still get errors on read or write, but if memory at rest goes bad,
> > we want to handle that and correct it ASAP, not wait days or months
> > for something to trip over the poisoned page before we find out
> > about it.
> >
> > > Hence, all we are trying to ensure is that if a poison is known at the
> > > time of writing partial page, then we should return error to user space.
> >
> > I think within FS-DAX infrastructure, any access to the data (read
> > or write) within a poisoned page or a page marked with PageError()
> > should return EIO to the caller, unless it's the specific command to
> > clear the error/poison state on the page. What happens with that
> > error state is then up to the caller.
> >
> 
> I agree with most of the above if you replace "device-dax error
> handling" with "System RAM error handling". It's typical memory error
> handling that injects the page->index and page->mappping dependency.

I disagree, but that's beside the point and not worth arguing.

> So you want the FS to have error handling for just pmem errors or all
> memory errors?

Just pmem errors in the specific range the filesystem manages - we
really only care storage errors because those are the only errors
the filesystem is responsible for handling correctly.

Somebody else can worry about errors that hit page cache pages -
page cache pages require mapping/index pointers on each page anyway,
so a generic mechanism for handling those errors can be built into
the page cache. And if the error is in general kernel memory, then
it's game over for the entire kernel at that point, not just the
filesystem.

> And you want this to be done without the mm core using
> page->index to identify what to unmap when the error happens?

Isn't that exactly what I just said? We get the page address that
failed, the daxdev can turn that into a sector address and call into
the filesystem with a {sector, len, errno} tuple. We then do a
reverse mapping lookup on {sector, len} to find all the owners of
that range in the filesystem. If it's file data, that owner record
gives us the inode and the offset into the file, which then gives us
a {mapping, index} tuple.

Further, the filesytem reverse map is aware that it's blocks can be
shared by multiple owners, so it will have a record for every inode
and file offset that maps to that page. Hence we can simply iterate
the reverse map and do that whacky collect_procs/kill_procs dance
for every {mapping, index} pair that references the the bad range.

Nothing ever needs to be stored on the struct page...

> Memory
> error scanning is not universally available on all pmem
> implementations, so FS will need to subscribe for memory-error
> handling events.

No. Filesystems interact with the underlying device abstraction, not
the physical storage that lies behind that device abstraction.  The
filesystem should not interface with pmem directly in any way (all
direct accesses are hidden inside fs/dax.c!), nor should it care
about the quirks of the pmem implementation it is sitting on. That's
what the daxdev abstraction is supposed to hide from the users of
the pmem.

IOWs, the daxdev infrastructure subscribes to memory-error event
subsystem, calls out to the filesystem when an error in a page in
the daxdev is reported. The filesystem only needs to know the
{sector, len, errno} tuple related to the error; it is the device's
responsibility to handle the physical mechanics of listening,
filtering and translating MCEs to a format the filesystem
understands....

Another reason it should be provided by the daxdev as a {sector,
len, errno} tuple is that it also allows non-dax block devices to
implement the similar error notifications and provide filesystems
with exactly the same information so the filesystem can start
auto-recovery processes....

Cheers,

Dave.
Dave Chinner Feb. 28, 2020, 1:50 a.m. UTC | #35
On Thu, Feb 27, 2020 at 10:25:17AM -0500, Vivek Goyal wrote:
> On Thu, Feb 27, 2020 at 02:11:43PM +1100, Dave Chinner wrote:
> > On Wed, Feb 26, 2020 at 11:57:56AM -0500, Vivek Goyal wrote:
> > > On Tue, Feb 25, 2020 at 02:49:30PM -0800, Dan Williams wrote:
> > > [..]
> > > > > > I'm ok with replacing blkdev_issue_zeroout() with a dax operation
> > > > > > callback that deals with page aligned entries. That change at least
> > > > > > makes the error boundary symmetric across copy_from_iter() and the
> > > > > > zeroing path.
> > > > >
> > > > > IIUC, you are suggesting that modify dax_zero_page_range() to take page
> > > > > aligned start and size and call this interface from
> > > > > __dax_zero_page_range() and get rid of blkdev_issue_zeroout() in that
> > > > > path?
> > > > >
> > > > > Something like.
> > > > >
> > > > > __dax_zero_page_range() {
> > > > >   if(page_aligned_io)
> > > > >         call_dax_page_zero_range()
> > > > >   else
> > > > >         use_direct_access_and_memcpy;
> > > > > }
> > > > >
> > > > > And other callers of blkdev_issue_zeroout() in filesystems can migrate
> > > > > to calling dax_zero_page_range() instead.
> > > > >
> > > > > If yes, I am not seeing what advantage do we get by this change.
> > > > >
> > > > > - __dax_zero_page_range() seems to be called by only partial block
> > > > >   zeroing code. So dax_zero_page_range() call will remain unused.
> > > > >
> > > > >
> > > > > - dax_zero_page_range() will be exact replacement of
> > > > >   blkdev_issue_zeroout() so filesystems will not gain anything. Just that
> > > > >   it will create a dax specific hook.
> > > > >
> > > > > In that case it might be simpler to just get rid of blkdev_issue_zeroout()
> > > > > call from __dax_zero_page_range() and make sure there are no callers of
> > > > > full block zeroing from this path.
> > > > 
> > > > I think you're right. The path I'm concerned about not regressing is
> > > > the error clearing on new block allocation and we get that already via
> > > > xfs_zero_extent() and sb_issue_zeroout().
> > > 
> > > Well I was wrong. I found atleast one user which uses __dax_zero_page_range()
> > > to zero full PAGE_SIZE blocks.
> > > 
> > > xfs_io -c "allocsp 32K 0" foo.txt
> > 
> > That ioctl interface is deprecated and likely unused by any new
> > application written since 1999. It predates unwritten extents (1998)
> > and I don't think any native linux applications have ever used it. A
> > newly written DAX aware application would almost certainly not use
> > this interface.
> > 
> > IOWs, I wouldn't use it as justification for some special case
> > behaviour; I'm more likely to say "rip that ancient ioctl out" than
> > to jump through hoops because it's implementation behaviour.
> 
> Hi Dave,
> 
> Do you see any other path in xfs using iomap_zero_range() and zeroing
> full block.

Yes:

- xfs_file_aio_write_checks() for zeroing blocks between the
  existing EOF and the start of the incoming write beyond EOF
- xfs_setattr_size() on truncate up for zeroing blocks between the
  existing EOF and the new EOF.
- xfs_reflink_zero_posteof() for zeroing blocks between the old EOF
  and where the new reflinked extents are going to land beyond EOF

And don't think that blocks beyond EOF can't exist when DAX is
enabled. We can turn DAX on and off, we can crash between allocation
and file size extension, etc. Hence this code must be able to handle
zeroing large ranges of blocks beyond EOF...

> iomap_zero_range() already skips IOMAP_HOLE and
> IOMAP_UNWRITTEN. So it has to be a full block zeroing which is of not type
> IOMAP_HOLE and IOMAP_UNWRITTEN.
> 
> My understanding is that ext4 and xfs both are initializing full blocks
> using blkdev_issue_zeroout(). Only partial blocks are being zeroed using
> this dax zeroing path.

Look at the API, not the callers: iomap_zero_range takes a 64 bit
length parameter. It can be asked to zero blocks across petabytes of
a file. If there's a single block somewhere in that range, it will
only zero that block. If the entire range is allocated, it will zero
that entire range (yes, it will take forever!) as that it what it
is intended to do.

It should be pretty clear that needs to be able to zero entire
pages, regardless of how it is currently used/called by filesystems.

Cheers,

Dave.
Dan Williams Feb. 28, 2020, 3:28 a.m. UTC | #36
On Thu, Feb 27, 2020 at 5:31 PM Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Feb 26, 2020 at 08:19:37PM -0800, Dan Williams wrote:
[..]
> > So you want the FS to have error handling for just pmem errors or all
> > memory errors?
>
> Just pmem errors in the specific range the filesystem manages - we
> really only care storage errors because those are the only errors
> the filesystem is responsible for handling correctly.
>
> Somebody else can worry about errors that hit page cache pages -
> page cache pages require mapping/index pointers on each page anyway,
> so a generic mechanism for handling those errors can be built into
> the page cache. And if the error is in general kernel memory, then
> it's game over for the entire kernel at that point, not just the
> filesystem.
>
> > And you want this to be done without the mm core using
> > page->index to identify what to unmap when the error happens?
>
> Isn't that exactly what I just said? We get the page address that
> failed, the daxdev can turn that into a sector address and call into
> the filesystem with a {sector, len, errno} tuple. We then do a
> reverse mapping lookup on {sector, len} to find all the owners of
> that range in the filesystem. If it's file data, that owner record
> gives us the inode and the offset into the file, which then gives us
> a {mapping, index} tuple.
>
> Further, the filesytem reverse map is aware that it's blocks can be
> shared by multiple owners, so it will have a record for every inode
> and file offset that maps to that page. Hence we can simply iterate
> the reverse map and do that whacky collect_procs/kill_procs dance
> for every {mapping, index} pair that references the the bad range.
>
> Nothing ever needs to be stored on the struct page...

Ok, so fs/dax.c needs to coordinate with mm/memory-failure.c to say
"don't perform generic memory-error-handling, there's an fs that owns
this daxdev and wants to disposition errors". The fs/dax.c
infrastructure that sets up ->index and ->mapping would still need to
be there for ext4 until its ready to take on the same responsibility.
Last I checked the ext4 reverse mapping implementation was not as
capable as XFS. This goes back to why the initial design avoided
relying on not universally available / stable reverse-mapping
facilities and opted for extending the generic mm/memory-failure.c
implementation.

If XFS optionally supplants mm/memory-failure.c I would expect we'd
want to do better than the "whacky collect_procs/kill_procs"
implementation and let applications register for a notification format
better than BUS_MCEERR_AO signals.

> > Memory
> > error scanning is not universally available on all pmem
> > implementations, so FS will need to subscribe for memory-error
> > handling events.
>
> No. Filesystems interact with the underlying device abstraction, not
> the physical storage that lies behind that device abstraction.  The
> filesystem should not interface with pmem directly in any way (all
> direct accesses are hidden inside fs/dax.c!), nor should it care
> about the quirks of the pmem implementation it is sitting on. That's
> what the daxdev abstraction is supposed to hide from the users of
> the pmem.

I wasn't proposing that XFS have a machine-check handler, I was trying
to get to the next level detail of the async notification interface to
the fs.

>
> IOWs, the daxdev infrastructure subscribes to memory-error event
> subsystem, calls out to the filesystem when an error in a page in
> the daxdev is reported. The filesystem only needs to know the
> {sector, len, errno} tuple related to the error; it is the device's
> responsibility to handle the physical mechanics of listening,
> filtering and translating MCEs to a format the filesystem
> understands....
>
> Another reason it should be provided by the daxdev as a {sector,
> len, errno} tuple is that it also allows non-dax block devices to
> implement the similar error notifications and provide filesystems
> with exactly the same information so the filesystem can start
> auto-recovery processes....

The driver layer does already have 'struct badblocks' that both pmem
and md use, just no current way for the FS to get a reference to it.
However, my hope was to get away from the interface being sector based
because the error granularity is already smaller than a sector in the
daxdev case as compared to a bdev. A daxdev native error record should
be a daxdev relative byte offset, not a sector. If the fs wants to
align the blast radius of the error to sectors or fs-blocks that's its
choice, but I don't think the driver interface should preclude
sub-sector error handling. Essentially I don't want to add more bdev
semantics to fs/dax.c while Vivek is busy removing them.
hch@infradead.org Feb. 28, 2020, 2:05 p.m. UTC | #37
On Thu, Feb 27, 2020 at 07:28:41PM -0800, Dan Williams wrote:
> "don't perform generic memory-error-handling, there's an fs that owns
> this daxdev and wants to disposition errors". The fs/dax.c
> infrastructure that sets up ->index and ->mapping would still need to
> be there for ext4 until its ready to take on the same responsibility.
> Last I checked the ext4 reverse mapping implementation was not as
> capable as XFS. This goes back to why the initial design avoided
> relying on not universally available / stable reverse-mapping
> facilities and opted for extending the generic mm/memory-failure.c
> implementation.

The important but is that we stop using ->index and ->mapping when XFS
is used as that enables using DAX+reflinks, which actually is the most
requested DAX feature on XFS (way more than silly runtime flag switches
for example).
Dan Williams Feb. 28, 2020, 4:26 p.m. UTC | #38
On Fri, Feb 28, 2020 at 6:05 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Thu, Feb 27, 2020 at 07:28:41PM -0800, Dan Williams wrote:
> > "don't perform generic memory-error-handling, there's an fs that owns
> > this daxdev and wants to disposition errors". The fs/dax.c
> > infrastructure that sets up ->index and ->mapping would still need to
> > be there for ext4 until its ready to take on the same responsibility.
> > Last I checked the ext4 reverse mapping implementation was not as
> > capable as XFS. This goes back to why the initial design avoided
> > relying on not universally available / stable reverse-mapping
> > facilities and opted for extending the generic mm/memory-failure.c
> > implementation.
>
> The important but is that we stop using ->index and ->mapping when XFS
> is used as that enables using DAX+reflinks, which actually is the most
> requested DAX feature on XFS (way more than silly runtime flag switches
> for example).

Understood. To be clear the plan we are marching to is knock down all
the known objections to the removal of the "experimental" designation.
reflink is on that list and so is per-file dax. The thought was that
pef-file dax was a nearer term goal than reflink.

Patch
diff mbox series

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 075b11682192..e72959203253 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -74,14 +74,28 @@  static blk_status_t pmem_clear_poison(struct pmem_device *pmem,
 	sector_t sector;
 	long cleared;
 	blk_status_t rc = BLK_STS_OK;
+	phys_addr_t start_aligned, end_aligned;
+	unsigned int len_aligned;
 
-	sector = (offset - pmem->data_offset) / 512;
+	/*
+	 * Callers can pass arbitrary offset and len. But nvdimm_clear_poison()
+	 * expects memory offset and length to meet certain alignment
+	 * restrction (clear_err_unit). Currently nvdimm does not export
+	 * required alignment. So align offset and length to sector boundary
+	 * before passing it to nvdimm_clear_poison().
+	 */
+	start_aligned = ALIGN(offset, SECTOR_SIZE);
+	end_aligned = ALIGN_DOWN((offset + len), SECTOR_SIZE) - 1;
+	len_aligned = end_aligned - start_aligned + 1;
+
+	sector = (start_aligned - pmem->data_offset) / 512;
 
-	cleared = nvdimm_clear_poison(dev, pmem->phys_addr + offset, len);
-	if (cleared < len)
+	cleared = nvdimm_clear_poison(dev, pmem->phys_addr + start_aligned,
+				      len_aligned);
+	if (cleared < len_aligned)
 		rc = BLK_STS_IOERR;
 	if (cleared > 0 && cleared / 512) {
-		hwpoison_clear(pmem, pmem->phys_addr + offset, cleared);
+		hwpoison_clear(pmem, pmem->phys_addr + start_aligned, cleared);
 		cleared /= 512;
 		dev_dbg(dev, "%#llx clear %ld sector%s\n",
 				(unsigned long long) sector, cleared,