From patchwork Wed May 11 19:26:32 2016
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Ross Zwisler <ross.zwisler@linux.intel.com>
X-Patchwork-Id: 9074031
Return-Path: <linux-fsdevel-owner@kernel.org>
X-Original-To: patchwork-linux-fsdevel@patchwork.kernel.org
Delivered-To: patchwork-parsemail@patchwork1.web.kernel.org
Received: from mail.kernel.org (mail.kernel.org [198.145.29.136])
	by patchwork1.web.kernel.org (Postfix) with ESMTP id 484129F1C1
	for <patchwork-linux-fsdevel@patchwork.kernel.org>;
	Wed, 11 May 2016 19:27:08 +0000 (UTC)
Received: from mail.kernel.org (localhost [127.0.0.1])
	by mail.kernel.org (Postfix) with ESMTP id 0E9262012D
	for <patchwork-linux-fsdevel@patchwork.kernel.org>;
	Wed, 11 May 2016 19:27:07 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.kernel.org (Postfix) with ESMTP id C14042010F
	for <patchwork-linux-fsdevel@patchwork.kernel.org>;
	Wed, 11 May 2016 19:27:05 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751670AbcEKT1D (ORCPT
	<rfc822;patchwork-linux-fsdevel@patchwork.kernel.org>);
	Wed, 11 May 2016 15:27:03 -0400
Received: from mga03.intel.com ([134.134.136.65]:10935 "EHLO mga03.intel.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751400AbcEKT1C (ORCPT <rfc822;linux-fsdevel@vger.kernel.org>);
	Wed, 11 May 2016 15:27:02 -0400
Received: from fmsmga001.fm.intel.com ([10.253.24.23])
	by orsmga103.jf.intel.com with ESMTP; 11 May 2016 12:26:33 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.24,609,1455004800"; d="scan'208";a="963880586"
Received: from rzwisler-desk.amr.corp.intel.com (HELO linux.intel.com)
	([10.254.59.181])
	by fmsmga001.fm.intel.com with ESMTP; 11 May 2016 12:26:32 -0700
Date: Wed, 11 May 2016 13:26:32 -0600
From: Ross Zwisler <ross.zwisler@linux.intel.com>
To: Jan Kara <jack@suse.cz>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>,
	linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org,
	linux-mm@kvack.org, Dan Williams <dan.j.williams@intel.com>,
	linux-nvdimm@lists.01.org, Matthew Wilcox <willy@linux.intel.com>
Subject: Re: [PATCH 16/18] dax: New fault locking
Message-ID: <20160511192632.GA8841@linux.intel.com>
Mail-Followup-To: Ross Zwisler <ross.zwisler@linux.intel.com>,
	Jan Kara <jack@suse.cz>, linux-fsdevel@vger.kernel.org,
	linux-ext4@vger.kernel.org, linux-mm@kvack.org,
	Dan Williams <dan.j.williams@intel.com>, linux-nvdimm@lists.01.org,
	Matthew Wilcox <willy@linux.intel.com>
References: <1461015341-20153-1-git-send-email-jack@suse.cz>
	<1461015341-20153-17-git-send-email-jack@suse.cz>
	<20160506041350.GA29628@linux.intel.com>
	<20160510122715.GK11897@quack2.suse.cz>
MIME-Version: 1.0
Content-Disposition: inline
In-Reply-To: <20160510122715.GK11897@quack2.suse.cz>
User-Agent: Mutt/1.6.0 (2016-04-01)
Sender: linux-fsdevel-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-fsdevel.vger.kernel.org>
X-Mailing-List: linux-fsdevel@vger.kernel.org
X-Spam-Status: No, score=-8.3 required=5.0 tests=BAYES_00, RCVD_IN_DNSWL_HI,
	RP_MATCHES_RCVD,
	UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mail.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

On Tue, May 10, 2016 at 02:27:15PM +0200, Jan Kara wrote:
> On Thu 05-05-16 22:13:50, Ross Zwisler wrote:
> > On Mon, Apr 18, 2016 at 11:35:39PM +0200, Jan Kara wrote:
> > > +/*
> > > + * Find radix tree entry at given index. If it points to a page, return with
> > > + * the page locked. If it points to the exceptional entry, return with the
> > > + * radix tree entry locked. If the radix tree doesn't contain given index,
> > > + * create empty exceptional entry for the index and return with it locked.
> > > + *
> > > + * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
> > > + * persistent memory the benefit is doubtful. We can add that later if we can
> > > + * show it helps.
> > > + */
> > > +static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
> > > +{
> > > +	void *ret, **slot;
> > > +
> > > +restart:
> > > +	spin_lock_irq(&mapping->tree_lock);
> > > +	ret = get_unlocked_mapping_entry(mapping, index, &slot);
> > > +	/* No entry for given index? Make sure radix tree is big enough. */
> > > +	if (!ret) {
> > > +		int err;
> > > +
> > > +		spin_unlock_irq(&mapping->tree_lock);
> > > +		err = radix_tree_preload(
> > > +				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
> > 
> > In the conversation about v2 of this series you said:
> > 
> > > Note that we take the hit for dropping the lock only if we really need to
> > > allocate new radix tree node so about once per 64 new entries. So it is not
> > > too bad.
> > 
> > I think this is incorrect.  We get here whenever we get a NULL return from
> > __radix_tree_lookup().  I believe that this happens if we don't have a node,
> > in which case we need an allocation, but I think it also happens in the case
> > where we do have a node and we just have a NULL slot in that node.
> > 
> > For the behavior you're looking for (only preload if you need to do an
> > allocation), you probably need to check the 'slot' we get back from
> > get_unlocked_mapping_entry(), yea?
> 
> You are correct. However currently __radix_tree_lookup() doesn't return a
> slot pointer if entry was not found so it is not easy to fix. So I'd leave
> the code as is for now and we can later optimize the case where we don't
> need to grow the radix tree...

Ah, you're right.  Sure, that plan sounds good.

> > > +/*
> > > + * Delete exceptional DAX entry at @index from @mapping. Wait for radix tree
> > > + * entry to get unlocked before deleting it.
> > > + */
> > > +int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
> > > +{
> > > +	void *entry;
> > > +
> > > +	spin_lock_irq(&mapping->tree_lock);
> > > +	entry = get_unlocked_mapping_entry(mapping, index, NULL);
> > > +	/*
> > > +	 * Caller should make sure radix tree modifications don't race and
> > > +	 * we have seen exceptional entry here before.
> > > +	 */
> > > +	if (WARN_ON_ONCE(!entry || !radix_tree_exceptional_entry(entry))) {
> > 
> > dax_delete_mapping_entry() is only called from clear_exceptional_entry().
> > With this new code we've changed the behavior of that call path a little.
> > 
> > In the various places where clear_exceptional_entry() is called, the code
> > batches up a bunch of entries in a pvec via pagevec_lookup_entries().  We
> > don't hold the mapping->tree_lock between the time this lookup happens and the
> > time that the entry is passed to clear_exceptional_entry(). This is why the
> > old code did a verification that the entry passed in matched what was still
> > currently present in the radix tree.  This was done in the DAX case via
> > radix_tree_delete_item(), and it was open coded in clear_exceptional_entry()
> > for the page cache case.  In both cases if the entry didn't match what was
> > currently in the tree, we bailed without doing anything.
> > 
> > This new code doesn't verify against the 'entry' passed to
> > clear_exceptional_entry(), but instead makes sure it is an exceptional entry
> > before removing, and if not it does a WARN_ON_ONCE().
> > 
> > This changes things because:
> > 
> > a) If the exceptional entry changed, say from a plain lock entry to an actual
> > DAX entry, we wouldn't notice, and we would just clear the latter out.  My
> > guess is that this is fine, I just wanted to call it out.
> > 
> > b) If we have a non-exceptional entry here now, say because our lock entry has
> > been swapped out for a zero page, we will WARN_ON_ONCE() and return without a
> > removal.  I think we may want to silence the WARN_ON_ONCE(), as I believe this
> > could happen during normal operation and we don't want to scare anyone. :)
> 
> So your concerns are exactly why I have added a comment to
> dax_delete_mapping_entry() that:
> 
> 	/*
> 	 * Caller should make sure radix tree modifications don't race and
> 	 * we have seen exceptional entry here before.
> 	 */
> 
> The thing is dax_delete_mapping_entry() is called only from truncate /
> punch hole path. Those should hold i_mmap_sem for writing and thus there
> should be no modifications of the radix tree. If anything changes, between
> what truncate_inode_pages() (or similar functions) finds and what
> dax_delete_mapping_entry() sees, we have a locking bug and I want to know
> about it :). Any suggestion how I should expand the comment so that this is
> clearer?

Ah, I didn't understand all that.  :)  Given a bit more context the comment
seems fine - if anything it could be a bit more specific, and include the
text: "dax_delete_mapping_entry() is called only from truncate / punch hole
path. Those should hold i_mmap_sem for writing and thus there should be no
modifications of the radix tree."  Either way - thanks for explaining.

> > > +/*
> > >   * The user has performed a load from a hole in the file.  Allocating
> > >   * a new page in the file would cause excessive storage usage for
> > >   * workloads with sparse files.  We allocate a page cache page instead.
> > > @@ -307,15 +584,24 @@ EXPORT_SYMBOL_GPL(dax_do_io);
> > >   * otherwise it will simply fall out of the page cache under memory
> > >   * pressure without ever having been dirtied.
> > >   */
> > > -static int dax_load_hole(struct address_space *mapping, struct page *page,
> > > -							struct vm_fault *vmf)
> > > +static int dax_load_hole(struct address_space *mapping, void *entry,
> > > +			 struct vm_fault *vmf)
> > >  {
> > > -	if (!page)
> > > -		page = find_or_create_page(mapping, vmf->pgoff,
> > > -						GFP_KERNEL | __GFP_ZERO);
> > > -	if (!page)
> > > -		return VM_FAULT_OOM;
> > > +	struct page *page;
> > > +
> > > +	/* Hole page already exists? Return it...  */
> > > +	if (!radix_tree_exceptional_entry(entry)) {
> > > +		vmf->page = entry;
> > > +		return VM_FAULT_LOCKED;
> > > +	}
> > >  
> > > +	/* This will replace locked radix tree entry with a hole page */
> > > +	page = find_or_create_page(mapping, vmf->pgoff,
> > > +				   vmf->gfp_mask | __GFP_ZERO);
> > 
> > This replacement happens via page_cache_tree_insert(), correct?  In this case,
> > who wakes up anyone waiting on the old lock entry that we just killed?  In the
> > non-hole case we would traverse through put_locked_mapping_entry(), but I
> > don't see that in the hole case.
> 
> Ha, good catch. We miss the wakeup. Fixed.
> 
> Attached is the diff resulting from your review of this patch. I still have
> to hunt down that strange interaction with workingset code you've reported...

At the end of this mail I've attached one small fixup for the incremental diff
you sent.  Aside from that, I think that you've addressed all my review
feedback, thanks!

Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>

I'm going to try and get more info on the working set test failure.
---

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

diff --git a/fs/dax.c b/fs/dax.c
index f496854..c4cb69b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -406,7 +406,7 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
 
        init_wait(&ewait.wait);
        ewait.wait.func = wake_exceptional_entry_func;
-       ewait.key.root = &mapping->page_tree;
+       ewait.key.mapping = mapping;
        ewait.key.index = index;
 
        for (;;) {