dax: allow DAX to look up an inode's block device

Message ID	1454454702-11889-1-git-send-email-ross.zwisler@linux.intel.com (mailing list archive)
State	New, archived
Headers
Return-Path: <linux-nvdimm-bounces@lists.01.org>
From: Ross Zwisler <ross.zwisler@linux.intel.com>
To: linux-kernel@vger.kernel.org
Subject: [PATCH] dax: allow DAX to look up an inode's block device
Date: Tue, 2 Feb 2016 16:11:42 -0700
Message-Id: <1454454702-11889-1-git-send-email-ross.zwisler@linux.intel.com>
Cc: Jeff Layton <jlayton@poochiereds.net>, Andrew Morton <akpm@linux-foundation.org>, linux-nvdimm@lists.01.org, Dave Chinner <david@fromorbit.com>, xfs@oss.sgi.com, "J. Bruce Fields" <bfields@fieldses.org>, Alexander Viro <viro@zeniv.linux.org.uk>, Jan Kara <jack@suse.com>, linux-fsdevel@vger.kernel.org
Precedence: list
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Errors-To: linux-nvdimm-bounces@lists.01.org
Sender: "Linux-nvdimm" <linux-nvdimm-bounces@lists.01.org>

Ross Zwisler Feb. 2, 2016, 11:11 p.m. UTC

There are a number of places in dax.c that look up the struct block_device
associated with an inode.  Previously this was done by just using
inode->i_sb->s_bdev.  This is correct in some cases, such as when using
ext2 and ext4.

However, for raw block devices and for XFS with a real-time device, the
value in inode->i_sb->s_bdev is not correct.  With the code as it is
currently written, an fsync or msync to a DAX enabled raw block device will
cause a NULL pointer dereference kernel BUG.  For this to work correctly we
need to ask the block device or filesystem what struct block_device is
appropriate for our inode.

To that end, add a get_bdev(struct inode *) entry point to struct
super_operations.  If this function pointer is non-NULL, this notifies DAX
that it needs to use it to look up the correct block_device.  If
i_sb->get_bdev() is NULL DAX will default to inode->i_sb->s_bdev.
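
The lookup rule described above can be sketched roughly as follows.  This
is only an illustration, not the patch itself: the structures are
simplified userspace stand-ins for the kernel ones, and the names
dax_lookup_bdev() and example_get_bdev() are hypothetical.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; only the fields
 * needed for the lookup are modelled here. */
struct block_device { int id; };
struct inode;

struct super_operations {
	/* Optional hook described in the commit message; NULL means
	 * "fall back to s_bdev". */
	struct block_device *(*get_bdev)(struct inode *inode);
};

struct super_block {
	const struct super_operations *s_op;
	struct block_device *s_bdev;
};

struct inode {
	struct super_block *i_sb;
};

/* Hypothetical helper: pick the block device DAX should use. */
static struct block_device *dax_lookup_bdev(struct inode *inode)
{
	struct super_block *sb = inode->i_sb;

	if (sb->s_op && sb->s_op->get_bdev)
		return sb->s_op->get_bdev(inode);
	return sb->s_bdev;
}

/* Hypothetical hook for a filesystem whose data lives on a second
 * device (for example a real-time device). */
static struct block_device example_rt_bdev = { .id = 2 };

static struct block_device *example_get_bdev(struct inode *inode)
{
	(void)inode;	/* a real hook would inspect the inode */
	return &example_rt_bdev;
}
```

A filesystem that leaves get_bdev NULL keeps the old behaviour; one that
sets it (as example_get_bdev does here) overrides s_bdev for its inodes.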

I added the function to super_operations instead of another alternative
like inode_operations because the function pointer varies by filesystem or
block device, not per inode.  I believe that this will also save memory
because there is only one struct super_operations per mounted filesystem
but there could be many struct inode_operations and there is no need to
keep many copies of the same function pointer in memory.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/block_dev.c     |  6 ++++++
 fs/dax.c           | 20 ++++++++++++++------
 fs/xfs/xfs_aops.c  |  2 +-
 fs/xfs/xfs_aops.h  |  1 +
 fs/xfs/xfs_super.c |  1 +
 include/linux/fs.h |  1 +
 6 files changed, 24 insertions(+), 7 deletions(-)

Al Viro Feb. 2, 2016, 11:19 p.m. UTC | #1

On Tue, Feb 02, 2016 at 04:11:42PM -0700, Ross Zwisler wrote:

> However, for raw block devices and for XFS with a real-time device, the
> value in inode->i_sb->s_bdev is not correct.  With the code as it is
> currently written, an fsync or msync to a DAX enabled raw block device will
> cause a NULL pointer dereference kernel BUG.  For this to work correctly we
> need to ask the block device or filesystem what struct block_device is
> appropriate for our inode.
> 
> To that end, add a get_bdev(struct inode *) entry point to struct
> super_operations.  If this function pointer is non-NULL, this notifies DAX
> that it needs to use it to look up the correct block_device.  If
> i_sb->get_bdev() is NULL DAX will default to inode->i_sb->s_bdev.

Umm...  It assumes that bdev will stay pinned for as long as inode is
referenced, presumably?  If so, that needs to be documented (and verified
for existing fs instances).  In principle, multi-disk fs might want to
support things like "silently move the inodes backed by that disk to other
ones"...

Jared Hulbert Feb. 2, 2016, 11:36 p.m. UTC | #2

On Tue, Feb 2, 2016 at 3:19 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Tue, Feb 02, 2016 at 04:11:42PM -0700, Ross Zwisler wrote:
>
> > However, for raw block devices and for XFS with a real-time device, the
> > value in inode->i_sb->s_bdev is not correct.  With the code as it is
> > currently written, an fsync or msync to a DAX enabled raw block device will
> > cause a NULL pointer dereference kernel BUG.  For this to work correctly we
> > need to ask the block device or filesystem what struct block_device is
> > appropriate for our inode.
> >
> > To that end, add a get_bdev(struct inode *) entry point to struct
> > super_operations.  If this function pointer is non-NULL, this notifies DAX
> > that it needs to use it to look up the correct block_device.  If
> > i_sb->get_bdev() is NULL DAX will default to inode->i_sb->s_bdev.
>
> Umm...  It assumes that bdev will stay pinned for as long as inode is
> referenced, presumably?  If so, that needs to be documented (and verified
> for existing fs instances).  In principle, multi-disk fs might want to
> support things like "silently move the inodes backed by that disk to other
> ones"...

Dan, This is exactly the kind of thing I'm talking about WRT the
weirder device models and directly calling bdev_direct_access().
Filesystems don't have the monogamous relationship with a device that
is implicitly assumed in DAX, you have to ask the filesystem what the
relationship is and is migrating to, and allow the filesystem to
update DAX when the relationship is changing.  As we start to see
systems with many DIMMs and tens of TiB of pmem, this is going to be
an even bigger deal, as load balancing, wear leveling, and fault
tolerance concerns are inevitably driven by the filesystem.

Dan Williams Feb. 2, 2016, 11:38 p.m. UTC | #3

[ adding btrfs ]

On Tue, Feb 2, 2016 at 3:19 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Feb 02, 2016 at 04:11:42PM -0700, Ross Zwisler wrote:
>
>> However, for raw block devices and for XFS with a real-time device, the
>> value in inode->i_sb->s_bdev is not correct.  With the code as it is
>> currently written, an fsync or msync to a DAX enabled raw block device will
>> cause a NULL pointer dereference kernel BUG.  For this to work correctly we
>> need to ask the block device or filesystem what struct block_device is
>> appropriate for our inode.
>>
>> To that end, add a get_bdev(struct inode *) entry point to struct
>> super_operations.  If this function pointer is non-NULL, this notifies DAX
>> that it needs to use it to look up the correct block_device.  If
>> i_sb->get_bdev() is NULL DAX will default to inode->i_sb->s_bdev.
>
> Umm...  It assumes that bdev will stay pinned for as long as inode is
> referenced, presumably?  If so, that needs to be documented (and verified
> for existing fs instances).  In principle, multi-disk fs might want to
> support things like "silently move the inodes backed by that disk to other
> ones"...

I assume btrfs is the only fs we have that might reassign the bdev for
a given inode on the fly?  Hopefully we don't need anything stronger
than rcu_read_lock() to pin the result as valid.

At least in this case the initial user is dax-fsync where the
->get_bdev() answer should be static for the life of the inode, and
btrfs does not currently interface with dax.  But yes, we need to get
the expected semantics clear.

Dan Williams Feb. 2, 2016, 11:39 p.m. UTC | #4

[ adding btrfs, resend with the correct list address  ]

On Tue, Feb 2, 2016 at 3:19 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Feb 02, 2016 at 04:11:42PM -0700, Ross Zwisler wrote:
>
>> However, for raw block devices and for XFS with a real-time device, the
>> value in inode->i_sb->s_bdev is not correct.  With the code as it is
>> currently written, an fsync or msync to a DAX enabled raw block device will
>> cause a NULL pointer dereference kernel BUG.  For this to work correctly we
>> need to ask the block device or filesystem what struct block_device is
>> appropriate for our inode.
>>
>> To that end, add a get_bdev(struct inode *) entry point to struct
>> super_operations.  If this function pointer is non-NULL, this notifies DAX
>> that it needs to use it to look up the correct block_device.  If
>> i_sb->get_bdev() is NULL DAX will default to inode->i_sb->s_bdev.
>
> Umm...  It assumes that bdev will stay pinned for as long as inode is
> referenced, presumably?  If so, that needs to be documented (and verified
> for existing fs instances).  In principle, multi-disk fs might want to
> support things like "silently move the inodes backed by that disk to other
> ones"...

I assume btrfs is the only fs we have that might reassign the bdev for
a given inode on the fly?  Hopefully we don't need anything stronger
than rcu_read_lock() to pin the result as valid.

At least in this case the initial user is dax-fsync where the
->get_bdev() answer should be static for the life of the inode, and
btrfs does not currently interface with dax.  But yes, we need to get
the expected semantics clear.

Dan Williams Feb. 2, 2016, 11:41 p.m. UTC | #5

On Tue, Feb 2, 2016 at 3:36 PM, Jared Hulbert <jaredeh@gmail.com> wrote:
> On Tue, Feb 2, 2016 at 3:19 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>> On Tue, Feb 02, 2016 at 04:11:42PM -0700, Ross Zwisler wrote:
>>
>> > However, for raw block devices and for XFS with a real-time device, the
>> > value in inode->i_sb->s_bdev is not correct.  With the code as it is
>> > currently written, an fsync or msync to a DAX enabled raw block device will
>> > cause a NULL pointer dereference kernel BUG.  For this to work correctly we
>> > need to ask the block device or filesystem what struct block_device is
>> > appropriate for our inode.
>> >
>> > To that end, add a get_bdev(struct inode *) entry point to struct
>> > super_operations.  If this function pointer is non-NULL, this notifies DAX
>> > that it needs to use it to look up the correct block_device.  If
>> > i_sb->get_bdev() is NULL DAX will default to inode->i_sb->s_bdev.
>>
>> Umm...  It assumes that bdev will stay pinned for as long as inode is
>> referenced, presumably?  If so, that needs to be documented (and verified
>> for existing fs instances).  In principle, multi-disk fs might want to
>> support things like "silently move the inodes backed by that disk to other
>> ones"...
>
> Dan, This is exactly the kind of thing I'm talking about WRT the
> weirder device models and directly calling bdev_direct_access().
> Filesystems don't have the monogamous relationship with a device that
> is implicitly assumed in DAX, you have to ask the filesystem what the
> relationship is and is migrating to, and allow the filesystem to
> update DAX when the relationship is changing.

That's precisely what ->get_bdev() does.  When the answer from the
inode->i_sb->s_bdev lookup is invalid, use ->get_bdev().

> As we start to see systems with many DIMMs and tens of TiB of pmem,
> this is going to be an even bigger deal, as load balancing, wear
> leveling, and fault tolerance concerns are inevitably driven by the
> filesystem.

No, there are no plans on the horizon for an fs to manage these media
specific concerns for persistent memory.

Matthew Wilcox Feb. 2, 2016, 11:52 p.m. UTC | #6

On Tue, Feb 02, 2016 at 03:39:15PM -0800, Dan Williams wrote:
> On Tue, Feb 2, 2016 at 3:19 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > On Tue, Feb 02, 2016 at 04:11:42PM -0700, Ross Zwisler wrote:
> >> However, for raw block devices and for XFS with a real-time device, the
> >> value in inode->i_sb->s_bdev is not correct.  With the code as it is
> >> currently written, an fsync or msync to a DAX enabled raw block device will
> >> cause a NULL pointer dereference kernel BUG.  For this to work correctly we
> >> need to ask the block device or filesystem what struct block_device is
> >> appropriate for our inode.
> >>
> >> To that end, add a get_bdev(struct inode *) entry point to struct
> >> super_operations.  If this function pointer is non-NULL, this notifies DAX
> >> that it needs to use it to look up the correct block_device.  If
> >> i_sb->get_bdev() is NULL DAX will default to inode->i_sb->s_bdev.
> >
> > Umm...  It assumes that bdev will stay pinned for as long as inode is
> > referenced, presumably?  If so, that needs to be documented (and verified
> > for existing fs instances).  In principle, multi-disk fs might want to
> > support things like "silently move the inodes backed by that disk to other
> > ones"...
> 
> I assume btrfs is the only fs we have that might reassign the bdev for
> a given inode on the fly?  Hopefully we don't need anything stronger
> than rcu_read_lock() to pin the result as valid.
> 
> At least in this case the initial user is dax-fsync where the
> ->get_bdev() answer should be static for the life of the inode, and
> btrfs does not currently interface with dax.  But yes, we need to get
> the expected semantics clear.

Let's be clear though.  ->get_bdev is a temporary hack.  The need for
it goes away when DAX doesn't rely on being on a block_device any more.
I don't expect it to live longer than six months.

Jared Hulbert Feb. 3, 2016, 12:33 a.m. UTC | #7

On Tue, Feb 2, 2016 at 3:41 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Tue, Feb 2, 2016 at 3:36 PM, Jared Hulbert <jaredeh@gmail.com> wrote:
>> On Tue, Feb 2, 2016 at 3:19 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>>>
>>> On Tue, Feb 02, 2016 at 04:11:42PM -0700, Ross Zwisler wrote:
>>>
>>> > However, for raw block devices and for XFS with a real-time device, the
>>> > value in inode->i_sb->s_bdev is not correct.  With the code as it is
>>> > currently written, an fsync or msync to a DAX enabled raw block device will
>>> > cause a NULL pointer dereference kernel BUG.  For this to work correctly we
>>> > need to ask the block device or filesystem what struct block_device is
>>> > appropriate for our inode.
>>> >
>>> > To that end, add a get_bdev(struct inode *) entry point to struct
>>> > super_operations.  If this function pointer is non-NULL, this notifies DAX
>>> > that it needs to use it to look up the correct block_device.  If
>>> > i_sb->get_bdev() is NULL DAX will default to inode->i_sb->s_bdev.
>>>
>>> Umm...  It assumes that bdev will stay pinned for as long as inode is
>>> referenced, presumably?  If so, that needs to be documented (and verified
>>> for existing fs instances).  In principle, multi-disk fs might want to
>>> support things like "silently move the inodes backed by that disk to other
>>> ones"...
>>
>> Dan, This is exactly the kind of thing I'm talking about WRT the
>> weirder device models and directly calling bdev_direct_access().
>> Filesystems don't have the monogamous relationship with a device that
>> is implicitly assumed in DAX, you have to ask the filesystem what the
>> relationship is and is migrating to, and allow the filesystem to
>> update DAX when the relationship is changing.
>
> That's precisely what ->get_bdev() does.  When the answer from the
> inode->i_sb->s_bdev lookup is invalid, use ->get_bdev().
>
>> As we start to see systems with many DIMMs and tens of TiB of pmem,
>> this is going to be an even bigger deal, as load balancing, wear
>> leveling, and fault tolerance concerns are inevitably driven by the
>> filesystem.
>
> No, there are no plans on the horizon for an fs to manage these media
> specific concerns for persistent memory.

So the filesystem is now directly in charge of mapping user pages to
physical memory.  The filesystem is effectively bypassing NUMA and
zones and all that stuff that tries to balance memory bus and QPI
traffic etc.  You don't think the filesystem will therefore be in
charge of memory bus hotspots?

Alright.  We can just agree to disagree on that point.

Dave Chinner Feb. 3, 2016, 7:54 a.m. UTC | #8

On Tue, Feb 02, 2016 at 04:33:16PM -0800, Jared Hulbert wrote:
> On Tue, Feb 2, 2016 at 3:41 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> > On Tue, Feb 2, 2016 at 3:36 PM, Jared Hulbert <jaredeh@gmail.com> wrote:
> >> On Tue, Feb 2, 2016 at 3:19 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >>>
> >>> On Tue, Feb 02, 2016 at 04:11:42PM -0700, Ross Zwisler wrote:
> >>>
> >>> > However, for raw block devices and for XFS with a real-time device, the
> >>> > value in inode->i_sb->s_bdev is not correct.  With the code as it is
> >>> > currently written, an fsync or msync to a DAX enabled raw block device will
> >>> > cause a NULL pointer dereference kernel BUG.  For this to work correctly we
> >>> > need to ask the block device or filesystem what struct block_device is
> >>> > appropriate for our inode.
> >>> >
> >>> > To that end, add a get_bdev(struct inode *) entry point to struct
> >>> > super_operations.  If this function pointer is non-NULL, this notifies DAX
> >>> > that it needs to use it to look up the correct block_device.  If
> >>> > i_sb->get_bdev() is NULL DAX will default to inode->i_sb->s_bdev.
> >>>
> >>> Umm...  It assumes that bdev will stay pinned for as long as inode is
> >>> referenced, presumably?  If so, that needs to be documented (and verified
> >>> for existing fs instances).  In principle, multi-disk fs might want to
> >>> support things like "silently move the inodes backed by that disk to other
> >>> ones"...
> >>
> >> Dan, This is exactly the kind of thing I'm talking about WRT the
> >> weirder device models and directly calling bdev_direct_access().
> >> Filesystems don't have the monogamous relationship with a device that
> >> is implicitly assumed in DAX, you have to ask the filesystem what the
> >> relationship is and is migrating to, and allow the filesystem to
> >> update DAX when the relationship is changing.
> >
> > That's precisely what ->get_bdev() does.  When the answer from the
> > inode->i_sb->s_bdev lookup is invalid, use ->get_bdev().
> >
> >> As we start to see systems with many DIMMs and tens of TiB of pmem,
> >> this is going to be an even bigger deal, as load balancing, wear
> >> leveling, and fault tolerance concerns are inevitably driven by the
> >> filesystem.
> >
> > No, there are no plans on the horizon for an fs to manage these media
> > specific concerns for persistent memory.
> 
> So the filesystem is now directly in charge of mapping user pages to
> physical memory.  The filesystem is effectively bypassing NUMA and
> zones and all that stuff that tries to balance memory bus and QPI
> traffic etc.

No, it isn't bypassing NUMA, zones, etc.

The pmem block device can linearise a typical NUMA layout quite
sanely.  i.e. if there is 10GB of pmem per node, the pmem device
would need to map that as:

	node	   block device offsets
	 0		 0..10GB
	 1		10..20GB
	 2		20..30GB
	 ....
	 n		10n..10(n+1)GB

i.e. present a *linear concatenation* of discrete nodes in a linear
address space.
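
As a rough illustration of that layout, here is a minimal sketch of the
offset arithmetic, assuming every node contributes exactly 10GB (taken
here as 10 GiB) and nodes are concatenated in node order.  The names
PMEM_PER_NODE, pmem_offset_to_node() and pmem_node_start() are
hypothetical, not existing kernel interfaces.

```c
#include <stdint.h>

#define GiB		(1024ULL * 1024 * 1024)
#define PMEM_PER_NODE	(10 * GiB)	/* assumed: equal share per node */

/* Node that backs a given byte offset into the concatenated device. */
static unsigned int pmem_offset_to_node(uint64_t offset)
{
	return (unsigned int)(offset / PMEM_PER_NODE);
}

/* First byte offset that belongs to a given node. */
static uint64_t pmem_node_start(unsigned int node)
{
	return (uint64_t)node * PMEM_PER_NODE;
}
```

With these assumptions, node 1 covers offsets from pmem_node_start(1) up
to, but not including, pmem_node_start(2), which matches the 10..20GB
row in the table.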

Then, we can use the fact that XFS has a piecewise address space
architecture that can map linear chunks of the block device address
space to different logical domains. Each piece of an XFS filesystem
is an allocation group. Hence we tell mkfs.xfs to set the allocation
group size to 10GB, thereby mapping each individual allocation group
to a different physical node of pmem.  Suddenly all the filesystem
allocation control algorithms become physical device locality
control algorithms.

Then we simply map process locality control (cpusets or
memcgs or whatever is being used for that now) to the allocator -
instead of selecting AGs for allocation based on inode/parent inode
locality, we select AGs based on the allowed CPU/numa node mask of
the process that is running...
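
A minimal sketch of that selection step, assuming one allocation group
per node (so AG index equals node index) and at most 64 nodes so the
allowed set fits in one 64-bit mask.  pick_ag_for_mask() is a
hypothetical name, not an XFS function, and a real allocator would
weigh free space and other factors as well.

```c
#include <stdint.h>

/*
 * Return the first allocation group, scanning from 'start' and wrapping
 * around, whose node is set in 'allowed_nodes'.  Returns -1 if no AG
 * sits on an allowed node or if ag_count is out of range.
 */
static int pick_ag_for_mask(uint64_t allowed_nodes, unsigned int ag_count,
			    unsigned int start)
{
	unsigned int i;

	if (ag_count == 0 || ag_count > 64)
		return -1;

	start %= ag_count;
	for (i = 0; i < ag_count; i++) {
		unsigned int ag = (start + i) % ag_count;

		if (allowed_nodes & (UINT64_C(1) << ag))
			return (int)ag;
	}
	return -1;
}
```

The 'start' argument lets the caller rotate through the allowed AGs
rather than always filling the lowest-numbered one first.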

An even better architecture would be to present a pmem device per
discrete node and then use DM to build the concat as required.  That
would also let us stripe across nodes for higher performance in large
concurrent applications, or configure RAID mirrors in physically
separate parts of the NUMA topology for redundancy, and then
concat/stripe those mirrors together, etc.  With the mirror copies in
different racks (i.e. in different failure domains), a water leak that
physically destroys one rack doesn't cause data loss.

IOWs, we've already got all the pieces in place that we need to
handle pmem in just about any way you can imagine in NUMA machines;
the filesystem is just one of the pieces.

This is just another example of how yet another new-fangled storage
technology maps precisely to a well known, long serving storage
architecture, one that many, many experts out there already know how
to build reliable, performant storage from... :)

> You don't think the filesystem will therefore be in
> charge of memory bus hotspots?

Filesystems and DM are already in charge of avoiding hotspots on
disks, RAID arrays or in storage fabrics that can sustain tens of
GB/s throughput. This really is a solved problem - pmem on NUMA
systems is not very different to having tens of GB/s available on a
multi-pathed SAN.

Cheers,

Dave.
