Message ID: 152850187437.38390.2257981090761438811.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
State: New, archived
On Fri 08-06-18 16:51:14, Dan Williams wrote:
> In preparation for implementing support for memory poison (media error)
> handling via dax mappings, implement a lock_page() equivalent. Poison
> error handling requires rmap and needs guarantees that the page->mapping
> association is maintained / valid (inode not freed) for the duration of
> the lookup.
>
> In the device-dax case it is sufficient to simply hold a dev_pagemap
> reference. In the filesystem-dax case we need to use the entry lock.
>
> Export the entry lock via dax_lock_page() that uses rcu_read_lock() to
> protect against the inode being freed, and revalidates the page->mapping
> association under xa_lock().
>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>

Some comments below...

> diff --git a/fs/dax.c b/fs/dax.c
> index cccf6cad1a7a..b7e71b108fcf 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -361,6 +361,82 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
>  	}
>  }
>
> +struct page *dax_lock_page(unsigned long pfn)
> +{

Why do you return struct page here? Any reason behind that? Because struct
page exists and can be accessed through pfn_to_page() regardless of result
of this function so it looks a bit confusing. Also dax_lock_page() name
seems a bit confusing. Maybe dax_lock_pfn_mapping_entry()?

> +	pgoff_t index;
> +	struct inode *inode;
> +	wait_queue_head_t *wq;
> +	void *entry = NULL, **slot;
> +	struct address_space *mapping;
> +	struct wait_exceptional_entry_queue ewait;
> +	struct page *ret = NULL, *page = pfn_to_page(pfn);
> +
> +	rcu_read_lock();
> +	for (;;) {
> +		mapping = READ_ONCE(page->mapping);
> +
> +		if (!mapping || !IS_DAX(mapping->host))
> +			break;
> +
> +		/*
> +		 * In the device-dax case there's no need to lock, a
> +		 * struct dev_pagemap pin is sufficient to keep the
> +		 * inode alive.
> +		 */
> +		inode = mapping->host;
> +		if (S_ISCHR(inode->i_mode)) {
> +			ret = page;
> +			break;
> +		}
> +
> +		xa_lock_irq(&mapping->i_pages);
> +		if (mapping != page->mapping) {
> +			xa_unlock_irq(&mapping->i_pages);
> +			continue;
> +		}
> +		index = page->index;
> +
> +		init_wait(&ewait.wait);
> +		ewait.wait.func = wake_exceptional_entry_func;

This initialization could be before the loop.

> +
> +		entry = __radix_tree_lookup(&mapping->i_pages, index, NULL,
> +				&slot);
> +		if (!entry ||
> +		    WARN_ON_ONCE(!radix_tree_exceptional_entry(entry))) {
> +			xa_unlock_irq(&mapping->i_pages);
> +			break;
> +		} else if (!slot_locked(mapping, slot)) {
> +			lock_slot(mapping, slot);
> +			ret = page;
> +			xa_unlock_irq(&mapping->i_pages);
> +			break;
> +		}
> +
> +		wq = dax_entry_waitqueue(mapping, index, entry, &ewait.key);
> +		prepare_to_wait_exclusive(wq, &ewait.wait,
> +				TASK_UNINTERRUPTIBLE);
> +		xa_unlock_irq(&mapping->i_pages);
> +		rcu_read_unlock();
> +		schedule();
> +		finish_wait(wq, &ewait.wait);
> +		rcu_read_lock();
> +	}
> +	rcu_read_unlock();

I don't like how this duplicates a lot of get_unlocked_mapping_entry().
Can we possibly factor this out similarly as done for wait_event()?

								Honza
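For reference, one possible shape of the factoring Jan is asking for,
sketched against the code above. This is illustrative only: the helper
name dax_wait_entry() and its drop_rcu parameter are not from the patch.

	/* Sketch: shared sleep-on-locked-entry path. Called with
	 * xa_lock held; drops it before sleeping. */
	static void dax_wait_entry(struct address_space *mapping,
			pgoff_t index, void *entry, bool drop_rcu)
	{
		struct wait_exceptional_entry_queue ewait;
		wait_queue_head_t *wq;

		init_wait(&ewait.wait);
		ewait.wait.func = wake_exceptional_entry_func;

		wq = dax_entry_waitqueue(mapping, index, entry, &ewait.key);
		prepare_to_wait_exclusive(wq, &ewait.wait,
				TASK_UNINTERRUPTIBLE);
		xa_unlock_irq(&mapping->i_pages);
		/* dax_lock_page() cannot sleep inside its RCU read-side
		 * critical section, so it would pass drop_rcu = true;
		 * get_unlocked_mapping_entry() would pass false. */
		if (drop_rcu)
			rcu_read_unlock();
		schedule();
		finish_wait(wq, &ewait.wait);
		if (drop_rcu)
			rcu_read_lock();
	}

Both callers would then retry their radix tree lookup after the helper
returns.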
On Mon, Jun 11, 2018 at 8:41 AM, Jan Kara <jack@suse.cz> wrote:
> On Fri 08-06-18 16:51:14, Dan Williams wrote:
[...]
> I don't like how this duplicates a lot of get_unlocked_mapping_entry().
> Can we possibly factor this out similarly as done for wait_event()?

Ok, I'll give that a shot.
On Mon, Jun 11, 2018 at 05:41:46PM +0200, Jan Kara wrote:
> On Fri 08-06-18 16:51:14, Dan Williams wrote:
[...]
> > +struct page *dax_lock_page(unsigned long pfn)
> > +{
>
> Why do you return struct page here? Any reason behind that? Because struct
> page exists and can be accessed through pfn_to_page() regardless of result
> of this function so it looks a bit confusing. Also dax_lock_page() name
> seems a bit confusing. Maybe dax_lock_pfn_mapping_entry()?

It's also a bit awkward that the functions are asymmetric in their
arguments:

	dax_lock_page(pfn)  vs  dax_unlock_page(struct page)

Looking at dax_lock_page(), we only use 'pfn' to get 'page', so maybe it
would be cleaner to just always deal with struct page, i.e.:

	void dax_lock_page(struct page *page);
	void dax_unlock_page(struct page *page);
On Fri, Jun 08, 2018 at 04:51:14PM -0700, Dan Williams wrote:
> In preparation for implementing support for memory poison (media error)
> handling via dax mappings, implement a lock_page() equivalent.
[...]
> +struct page *dax_lock_page(unsigned long pfn)
> +{
> +	pgoff_t index;
> +	struct inode *inode;
> +	wait_queue_head_t *wq;
> +	void *entry = NULL, **slot;
> +	struct address_space *mapping;
> +	struct wait_exceptional_entry_queue ewait;
> +	struct page *ret = NULL, *page = pfn_to_page(pfn);
> +
> +	rcu_read_lock();
> +	for (;;) {
> +		mapping = READ_ONCE(page->mapping);

Why the READ_ONCE()?

> +
> +		if (!mapping || !IS_DAX(mapping->host))

Might read better using the dax_mapping() helper.

Also, forgive my ignorance, but this implies that dev dax has
page->mapping set up and that that inode will have IS_DAX set, right?
This will let us get past this point for device DAX, and we'll bail out
at the S_ISCHR() check?

> +			break;
> +
> +		/*
> +		 * In the device-dax case there's no need to lock, a
> +		 * struct dev_pagemap pin is sufficient to keep the
> +		 * inode alive.
> +		 */
> +		inode = mapping->host;
> +		if (S_ISCHR(inode->i_mode)) {
> +			ret = page;

'ret' isn't actually used for anything in this function, we just
unconditionally return 'page'.

> +			break;
> +		}
> +
[...]
> +		entry = __radix_tree_lookup(&mapping->i_pages, index, NULL,
> +				&slot);
> +		if (!entry ||

So if we do a lookup and there is no entry in the tree, we won't add an
empty entry and lock it, we'll just return with no entry in the tree and
nothing locked.

Then, when we call dax_unlock_page(), we'll eventually hit a
WARN_ON_ONCE() in dax_unlock_mapping_entry() when we see entry is 0. And,
in that gap we've got nothing locked so page faults could have happened,
etc... (which would mean that instead of WARN_ON_ONCE() for an empty
entry, we'd hit it instead for an unlocked entry).

Is that okay? Or do we need to insert a locked empty entry here?
On Tue, Jun 12, 2018 at 11:15 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Fri, Jun 08, 2018 at 04:51:14PM -0700, Dan Williams wrote:
[...]
>> +	rcu_read_lock();
>> +	for (;;) {
>> +		mapping = READ_ONCE(page->mapping);
>
> Why the READ_ONCE()?

We're potentially racing inode teardown, so the READ_ONCE() prevents the
compiler from trying to de-reference page->mapping twice and getting
inconsistent answers.

>> +		if (!mapping || !IS_DAX(mapping->host))
>
> Might read better using the dax_mapping() helper.

Sure.

> Also, forgive my ignorance, but this implies that dev dax has
> page->mapping set up and that that inode will have IS_DAX set, right?
> This will let us get past this point for device DAX, and we'll bail out
> at the S_ISCHR() check?

Yes.

>> +		inode = mapping->host;
>> +		if (S_ISCHR(inode->i_mode)) {
>> +			ret = page;
>
> 'ret' isn't actually used for anything in this function, we just
> unconditionally return 'page'.

Yes, bug.

>> +		entry = __radix_tree_lookup(&mapping->i_pages, index, NULL,
>> +				&slot);
>> +		if (!entry ||
>
> So if we do a lookup and there is no entry in the tree, we won't add an
> empty entry and lock it, we'll just return with no entry in the tree and
> nothing locked.
>
> Then, when we call dax_unlock_page(), we'll eventually hit a
> WARN_ON_ONCE() in dax_unlock_mapping_entry() when we see entry is 0.
> And, in that gap we've got nothing locked so page faults could have
> happened, etc... (which would mean that instead of WARN_ON_ONCE() for an
> empty entry, we'd hit it instead for an unlocked entry).
>
> Is that okay? Or do we need to insert a locked empty entry here?

No, the intent was to return NULL and fail the lock, but I messed up and
unconditionally returned the page.
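To spell out the READ_ONCE() hazard Dan describes: with a plain load the
compiler is free to discard the local copy and re-read page->mapping, so
a sequence like the following (illustrative sketch, not code from the
patch) can observe two different values if the inode is concurrently torn
down:

	mapping = page->mapping;		/* first load */
	if (!mapping || !IS_DAX(mapping->host))	/* compiler may re-read
						 * page->mapping here */
		break;

The NULL check can then pass against one value while the dereference uses
another. READ_ONCE(page->mapping) forces a single load into the local
variable.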
On Mon, Jun 11, 2018 at 8:41 AM, Jan Kara <jack@suse.cz> wrote:
> On Fri 08-06-18 16:51:14, Dan Williams wrote:
[...]
>> +struct page *dax_lock_page(unsigned long pfn)
>> +{
>
> Why do you return struct page here? Any reason behind that?

Unlike lock_page() there is no guarantee that we can lock a mapping entry
given a pfn. There is a chance that we lose a race and can't validate the
pfn to take the lock. So returning 'struct page *' was there to indicate
that we successfully validated the pfn and were able to take the lock.
I'll rework it to just return bool.

> Because struct page exists and can be accessed through pfn_to_page()
> regardless of result of this function so it looks a bit confusing. Also
> dax_lock_page() name seems a bit confusing. Maybe
> dax_lock_pfn_mapping_entry()?

Ok.
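A sketch of what that reworked convention could look like, using the
dax_lock_pfn_mapping_entry() name Jan suggested (illustrative only, not
the final interface):

	/* returns false if the pfn could not be validated and locked */
	bool dax_lock_pfn_mapping_entry(unsigned long pfn);
	void dax_unlock_pfn_mapping_entry(unsigned long pfn);

	/* hypothetical caller */
	if (!dax_lock_pfn_mapping_entry(pfn))
		return -EBUSY;	/* lost the race, mapping went away */
	/* ... page->mapping is stable here ... */
	dax_unlock_pfn_mapping_entry(pfn);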
On Tue, Jun 12, 2018 at 11:07 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Mon, Jun 11, 2018 at 05:41:46PM +0200, Jan Kara wrote:
[...]
> It's also a bit awkward that the functions are asymmetric in their
> arguments:
>
> 	dax_lock_page(pfn)  vs  dax_unlock_page(struct page)
>
> Looking at dax_lock_page(), we only use 'pfn' to get 'page', so maybe it
> would be cleaner to just always deal with struct page, i.e.:
>
> 	void dax_lock_page(struct page *page);
> 	void dax_unlock_page(struct page *page);

No, intent was to have the locking routine return the object that it
validated and then deal with that same object at unlock. dax_lock_page()
can fail to acquire a lock.
diff --git a/fs/dax.c b/fs/dax.c
index cccf6cad1a7a..b7e71b108fcf 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -361,6 +361,82 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
 	}
 }
 
+struct page *dax_lock_page(unsigned long pfn)
+{
+	pgoff_t index;
+	struct inode *inode;
+	wait_queue_head_t *wq;
+	void *entry = NULL, **slot;
+	struct address_space *mapping;
+	struct wait_exceptional_entry_queue ewait;
+	struct page *ret = NULL, *page = pfn_to_page(pfn);
+
+	rcu_read_lock();
+	for (;;) {
+		mapping = READ_ONCE(page->mapping);
+
+		if (!mapping || !IS_DAX(mapping->host))
+			break;
+
+		/*
+		 * In the device-dax case there's no need to lock, a
+		 * struct dev_pagemap pin is sufficient to keep the
+		 * inode alive.
+		 */
+		inode = mapping->host;
+		if (S_ISCHR(inode->i_mode)) {
+			ret = page;
+			break;
+		}
+
+		xa_lock_irq(&mapping->i_pages);
+		if (mapping != page->mapping) {
+			xa_unlock_irq(&mapping->i_pages);
+			continue;
+		}
+		index = page->index;
+
+		init_wait(&ewait.wait);
+		ewait.wait.func = wake_exceptional_entry_func;
+
+		entry = __radix_tree_lookup(&mapping->i_pages, index, NULL,
+				&slot);
+		if (!entry ||
+		    WARN_ON_ONCE(!radix_tree_exceptional_entry(entry))) {
+			xa_unlock_irq(&mapping->i_pages);
+			break;
+		} else if (!slot_locked(mapping, slot)) {
+			lock_slot(mapping, slot);
+			ret = page;
+			xa_unlock_irq(&mapping->i_pages);
+			break;
+		}
+
+		wq = dax_entry_waitqueue(mapping, index, entry, &ewait.key);
+		prepare_to_wait_exclusive(wq, &ewait.wait,
+				TASK_UNINTERRUPTIBLE);
+		xa_unlock_irq(&mapping->i_pages);
+		rcu_read_unlock();
+		schedule();
+		finish_wait(wq, &ewait.wait);
+		rcu_read_lock();
+	}
+	rcu_read_unlock();
+
+	return page;
+}
+
+void dax_unlock_page(struct page *page)
+{
+	struct address_space *mapping = page->mapping;
+	struct inode *inode = mapping->host;
+
+	if (S_ISCHR(inode->i_mode))
+		return;
+
+	dax_unlock_mapping_entry(mapping, page->index);
+}
+
 /*
  * Find radix tree entry at given index. If it points to an exceptional entry,
  * return it with the radix tree entry locked. If the radix tree doesn't
diff --git a/include/linux/dax.h b/include/linux/dax.h
index f9eb22ad341e..641cab7e1fa7 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -83,6 +83,8 @@ static inline void fs_put_dax(struct dax_device *dax_dev)
 struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
 int dax_writeback_mapping_range(struct address_space *mapping,
 		struct block_device *bdev, struct writeback_control *wbc);
+struct page *dax_lock_page(unsigned long pfn);
+void dax_unlock_page(struct page *page);
 #else
 static inline int bdev_dax_supported(struct super_block *sb, int blocksize)
 {
@@ -108,6 +110,19 @@ static inline int dax_writeback_mapping_range(struct address_space *mapping,
 {
 	return -EOPNOTSUPP;
 }
+
+static inline struct page *dax_lock_page(unsigned long pfn)
+{
+	struct page *page = pfn_to_page(pfn);
+
+	if (IS_DAX(page->mapping->host))
+		return page;
+	return NULL;
+}
+
+static inline void dax_unlock_page(struct page *page)
+{
+}
 #endif
 
 int dax_read_lock(void);
In preparation for implementing support for memory poison (media error)
handling via dax mappings, implement a lock_page() equivalent. Poison
error handling requires rmap and needs guarantees that the page->mapping
association is maintained / valid (inode not freed) for the duration of
the lookup.

In the device-dax case it is sufficient to simply hold a dev_pagemap
reference. In the filesystem-dax case we need to use the entry lock.

Export the entry lock via dax_lock_page() that uses rcu_read_lock() to
protect against the inode being freed, and revalidates the page->mapping
association under xa_lock().

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c            |   76 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/dax.h |   15 ++++++++++
 2 files changed, 91 insertions(+)
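To make the intended use concrete, here is a hedged sketch of the memory
poison consumer this patch prepares for; the function below is
hypothetical and not part of this series:

	/* hypothetical caller in the memory_failure() path */
	static int mf_dax_collect_procs(unsigned long pfn)
	{
		struct page *page = dax_lock_page(pfn);

		if (!page)
			return -EBUSY;	/* raced truncate / inode teardown */

		/*
		 * With the entry lock held, page->mapping and page->index
		 * are stable, so an rmap walk can find and signal every
		 * process mapping the poisoned pfn.
		 */

		dax_unlock_page(page);
		return 0;
	}

(As noted in the discussion above, dax_lock_page() as posted
unconditionally returns the page; NULL-on-failure is the stated intent.)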