[v2,15/30] xfs: Define usercopy region in xfs_inode slab cache

Message ID 1503956111-36652-16-git-send-email-keescook@chromium.org (mailing list archive)
State New, archived

Commit Message

Kees Cook Aug. 28, 2017, 9:34 p.m. UTC
From: David Windsor <dave@nullcore.net>

The XFS inline inode data, stored in struct xfs_inode_t field
i_df.if_u2.if_inline_data and therefore contained in the xfs_inode slab
cache, needs to be copied to/from userspace.

cache object allocation:
    fs/xfs/xfs_icache.c:
        xfs_inode_alloc(...):
            ...
            ip = kmem_zone_alloc(xfs_inode_zone, KM_SLEEP);

    fs/xfs/libxfs/xfs_inode_fork.c:
        xfs_init_local_fork(...):
            ...
            if (mem_size <= sizeof(ifp->if_u2.if_inline_data))
                    ifp->if_u1.if_data = ifp->if_u2.if_inline_data;
            ...

    fs/xfs/xfs_symlink.c:
        xfs_symlink(...):
            ...
            xfs_init_local_fork(ip, XFS_DATA_FORK, target_path, pathlen);

example usage trace:
    readlink_copy+0x43/0x70
    vfs_readlink+0x62/0x110
    SyS_readlinkat+0x100/0x130

    fs/xfs/xfs_iops.c:
        (via inode->i_op->get_link)
        xfs_vn_get_link_inline(...):
            ...
            return XFS_I(inode)->i_df.if_u1.if_data;

    fs/namei.c:
        readlink_copy(..., link):
            ...
            copy_to_user(..., link, len);

        generic_readlink(dentry, ...):
            struct inode *inode = d_inode(dentry);
            const char *link = inode->i_link;
            ...
            if (!link) {
                    link = inode->i_op->get_link(dentry, inode, &done);
            ...
            readlink_copy(..., link);

In support of usercopy hardening, this patch defines a region in the
xfs_inode slab cache in which userspace copy operations are allowed.

This region is known as the slab cache's usercopy region. Slab caches can
now check that each copy operation involving cache-managed memory falls
entirely within the slab's usercopy region.
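
As a rough illustration of what that check amounts to (this sketch is
not part of the patch, and the names are invented), a copy of n bytes
at a given offset within a slab object must fall inside the cache's
[useroffset, useroffset + usersize) window:

    #include <linux/types.h>

    /*
     * Minimal sketch of the bounds test only; the real check lives in
     * the usercopy hardening code and also covers non-slab memory.
     */
    static inline bool usercopy_in_region(size_t useroffset, size_t usersize,
                                          size_t offset, size_t n)
    {
            return offset >= useroffset &&
                   n <= usersize &&
                   offset - useroffset <= usersize - n;
    }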

This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
whitelisting code in the last public patch of grsecurity/PaX based on my
understanding of the code. Changes or omissions from the original code are
mine and don't reflect the original grsecurity/PaX code.

Signed-off-by: David Windsor <dave@nullcore.net>
[kees: adjust commit log, provide usage trace]
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: linux-xfs@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 fs/xfs/kmem.h      | 10 ++++++++++
 fs/xfs/xfs_super.c |  7 +++++--
 2 files changed, 15 insertions(+), 2 deletions(-)

Comments

Darrick J. Wong Aug. 28, 2017, 9:49 p.m. UTC | #1
On Mon, Aug 28, 2017 at 02:34:56PM -0700, Kees Cook wrote:
> From: David Windsor <dave@nullcore.net>
> 
> The XFS inline inode data, stored in struct xfs_inode_t field
> i_df.if_u2.if_inline_data and therefore contained in the xfs_inode slab
> cache, needs to be copied to/from userspace.
> 
> cache object allocation:
>     fs/xfs/xfs_icache.c:
>         xfs_inode_alloc(...):
>             ...
>             ip = kmem_zone_alloc(xfs_inode_zone, KM_SLEEP);
> 
>     fs/xfs/libxfs/xfs_inode_fork.c:
>         xfs_init_local_fork(...):
>             ...
>             if (mem_size <= sizeof(ifp->if_u2.if_inline_data))
>                     ifp->if_u1.if_data = ifp->if_u2.if_inline_data;

Hmm, what happens when mem_size > sizeof(if_inline_data)?  A slab object
will be allocated for ifp->if_u1.if_data which can then be used for
readlink in the same manner as the example usage trace below.  Does
that allocated object have a need for a usercopy annotation like
the one we're adding for if_inline_data?  Or is that already covered
elsewhere?

--D

>             ...
> 
>     fs/xfs/xfs_symlink.c:
>         xfs_symlink(...):
>             ...
>             xfs_init_local_fork(ip, XFS_DATA_FORK, target_path, pathlen);
> 
> example usage trace:
>     readlink_copy+0x43/0x70
>     vfs_readlink+0x62/0x110
>     SyS_readlinkat+0x100/0x130
> 
>     fs/xfs/xfs_iops.c:
>         (via inode->i_op->get_link)
>         xfs_vn_get_link_inline(...):
>             ...
>             return XFS_I(inode)->i_df.if_u1.if_data;
> 
>     fs/namei.c:
>         readlink_copy(..., link):
>             ...
>             copy_to_user(..., link, len);
> 
>         generic_readlink(dentry, ...):
>             struct inode *inode = d_inode(dentry);
>             const char *link = inode->i_link;
>             ...
>             if (!link) {
>                     link = inode->i_op->get_link(dentry, inode, &done);
>             ...
>             readlink_copy(..., link);
> 
> In support of usercopy hardening, this patch defines a region in the
> xfs_inode slab cache in which userspace copy operations are allowed.
> 
> This region is known as the slab cache's usercopy region. Slab caches can
> now check that each copy operation involving cache-managed memory falls
> entirely within the slab's usercopy region.
> 
> This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
> whitelisting code in the last public patch of grsecurity/PaX based on my
> understanding of the code. Changes or omissions from the original code are
> mine and don't reflect the original grsecurity/PaX code.
> 
> Signed-off-by: David Windsor <dave@nullcore.net>
> [kees: adjust commit log, provide usage trace]
> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> Cc: linux-xfs@vger.kernel.org
> Signed-off-by: Kees Cook <keescook@chromium.org>
> ---
>  fs/xfs/kmem.h      | 10 ++++++++++
>  fs/xfs/xfs_super.c |  7 +++++--
>  2 files changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> index 4d85992d75b2..08358f38dee6 100644
> --- a/fs/xfs/kmem.h
> +++ b/fs/xfs/kmem.h
> @@ -110,6 +110,16 @@ kmem_zone_init_flags(int size, char *zone_name, unsigned long flags,
>  	return kmem_cache_create(zone_name, size, 0, flags, construct);
>  }
>  
> +static inline kmem_zone_t *
> +kmem_zone_init_flags_usercopy(int size, char *zone_name, unsigned long flags,
> +				size_t useroffset, size_t usersize,
> +				void (*construct)(void *))
> +{
> +	return kmem_cache_create_usercopy(zone_name, size, 0, flags,
> +				useroffset, usersize, construct);
> +}
> +
> +
>  static inline void
>  kmem_zone_free(kmem_zone_t *zone, void *ptr)
>  {
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 38aaacdbb8b3..6ca428c6f943 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1829,9 +1829,12 @@ xfs_init_zones(void)
>  		goto out_destroy_efd_zone;
>  
>  	xfs_inode_zone =
> -		kmem_zone_init_flags(sizeof(xfs_inode_t), "xfs_inode",
> +		kmem_zone_init_flags_usercopy(sizeof(xfs_inode_t), "xfs_inode",
>  			KM_ZONE_HWALIGN | KM_ZONE_RECLAIM | KM_ZONE_SPREAD |
> -			KM_ZONE_ACCOUNT, xfs_fs_inode_init_once);
> +				KM_ZONE_ACCOUNT,
> +			offsetof(xfs_inode_t, i_df.if_u2.if_inline_data),
> +			sizeof_field(xfs_inode_t, i_df.if_u2.if_inline_data),
> +			xfs_fs_inode_init_once);
>  	if (!xfs_inode_zone)
>  		goto out_destroy_efi_zone;
>  
> -- 
> 2.7.4
> 
Kees Cook Aug. 28, 2017, 9:57 p.m. UTC | #2
On Mon, Aug 28, 2017 at 2:49 PM, Darrick J. Wong
<darrick.wong@oracle.com> wrote:
> On Mon, Aug 28, 2017 at 02:34:56PM -0700, Kees Cook wrote:
>> From: David Windsor <dave@nullcore.net>
>>
>> The XFS inline inode data, stored in struct xfs_inode_t field
>> i_df.if_u2.if_inline_data and therefore contained in the xfs_inode slab
>> cache, needs to be copied to/from userspace.
>>
>> cache object allocation:
>>     fs/xfs/xfs_icache.c:
>>         xfs_inode_alloc(...):
>>             ...
>>             ip = kmem_zone_alloc(xfs_inode_zone, KM_SLEEP);
>>
>>     fs/xfs/libxfs/xfs_inode_fork.c:
>>         xfs_init_local_fork(...):
>>             ...
>>             if (mem_size <= sizeof(ifp->if_u2.if_inline_data))
>>                     ifp->if_u1.if_data = ifp->if_u2.if_inline_data;
>
> Hmm, what happens when mem_size > sizeof(if_inline_data)?  A slab object
> will be allocated for ifp->if_u1.if_data which can then be used for
> readlink in the same manner as the example usage trace below.  Does
> that allocated object have a need for a usercopy annotation like
> the one we're adding for if_inline_data?  Or is that already covered
> elsewhere?

Yeah, the xfs helper kmem_alloc() is used in the other case, which
ultimately boils down to a call to kmalloc(), which is entirely
whitelisted by an earlier patch in the series:

https://lkml.org/lkml/2017/8/28/1026

(It's possible that at some future time we can start segregating
kernel-only kmallocs from usercopy-able kmallocs, but for now, there
are no plans for this.)
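
For reference, the non-inline branch of xfs_init_local_fork() looks
roughly like this (paraphrased from the source of that era, so details
may differ slightly):

    fs/xfs/libxfs/xfs_inode_fork.c:
        xfs_init_local_fork(...):
            ...
            else {
                    real_size = roundup(mem_size, 4);
                    ifp->if_u1.if_data = kmem_alloc(real_size, KM_SLEEP);
            }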

-Kees
Darrick J. Wong Aug. 29, 2017, 4:47 a.m. UTC | #3
On Mon, Aug 28, 2017 at 02:57:14PM -0700, Kees Cook wrote:
> On Mon, Aug 28, 2017 at 2:49 PM, Darrick J. Wong
> <darrick.wong@oracle.com> wrote:
> > On Mon, Aug 28, 2017 at 02:34:56PM -0700, Kees Cook wrote:
> >> From: David Windsor <dave@nullcore.net>
> >>
> >> The XFS inline inode data, stored in struct xfs_inode_t field
> >> i_df.if_u2.if_inline_data and therefore contained in the xfs_inode slab
> >> cache, needs to be copied to/from userspace.
> >>
> >> cache object allocation:
> >>     fs/xfs/xfs_icache.c:
> >>         xfs_inode_alloc(...):
> >>             ...
> >>             ip = kmem_zone_alloc(xfs_inode_zone, KM_SLEEP);
> >>
> >>     fs/xfs/libxfs/xfs_inode_fork.c:
> >>         xfs_init_local_fork(...):
> >>             ...
> >>             if (mem_size <= sizeof(ifp->if_u2.if_inline_data))
> >>                     ifp->if_u1.if_data = ifp->if_u2.if_inline_data;
> >
> > Hmm, what happens when mem_size > sizeof(if_inline_data)?  A slab object
> > will be allocated for ifp->if_u1.if_data which can then be used for
> > readlink in the same manner as the example usage trace below.  Does
> > that allocated object have a need for a usercopy annotation like
> > the one we're adding for if_inline_data?  Or is that already covered
> > elsewhere?
> 
> Yeah, the xfs helper kmem_alloc() is used in the other case, which
> ultimately boils down to a call to kmalloc(), which is entirely
> whitelisted by an earlier patch in the series:
> 
> https://lkml.org/lkml/2017/8/28/1026

Ah.  It would've been helpful to have the first three patches cc'd to
the xfs list.  So basically this series establishes the ability to set
regions within a slab object into which copy_to_user can copy memory
contents, and vice versa.  Have you seen any runtime performance impact?
The overhead looks like it ought to be minimal.

> (It's possible that at some future time we can start segregating
> kernel-only kmallocs from usercopy-able kmallocs, but for now, there
> are no plans for this.)

A pity.  It would be interesting to create no-usercopy versions of the
kmalloc-* slabs and see how much of XFS' memory consumption never
touches userspace buffers. :)

--D

> 
> -Kees
> 
> -- 
> Kees Cook
> Pixel Security
Christoph Hellwig Aug. 29, 2017, 8:14 a.m. UTC | #4
One thing I've been wondering is whether we should actually just
get rid of the inline area.  Compared to reading an inode from
disk, a single additional kmalloc is negligible, and not having the
inline data / extent list would allow us to reduce the inode size
significantly.

Kees/David:  how many of these patches are for file systems with some
sort of inline data?  Given that it's only about 30 patches, declaring
allocations either entirely valid for user copy or not might end up
being nicer in many ways than these offsets.
Dave Chinner Aug. 29, 2017, 12:31 p.m. UTC | #5
On Tue, Aug 29, 2017 at 01:14:53AM -0700, Christoph Hellwig wrote:
> One thing I've been wondering is whether we should actually just
> get rid of the inline area.  Compared to reading an inode from
> disk, a single additional kmalloc is negligible, and not having the
> inline data / extent list would allow us to reduce the inode size
> significantly.

Probably should.  I've already been looking at killing the inline
extents array to simplify the management of the extent list (much
simpler to index by rbtree when we don't have direct/indirect
structures), so killing the inline data would get rid of the other
part of the union the inline data sits in.

OTOH, if we're going to have to dynamically allocate the memory for
the extent/inline data for the data fork, it may just be easier to
make the entire data fork a dynamic allocation (like the attr fork).

Cheers,

Dave.
Christoph Hellwig Aug. 29, 2017, 12:45 p.m. UTC | #6
On Tue, Aug 29, 2017 at 10:31:26PM +1000, Dave Chinner wrote:
> Probably should.  I've already been looking at killing the inline
> extents array to simplify the management of the extent list (much
> simpler to index by rbtree when we don't have direct/indirect
> structures), so killing the inline data would get rid of the other
> part of the union the inline data sits in.

That's exactly where I came from with my extent list work.  Although
the rbtree performance was horrible due to the memory overhead and
I've switched to a modified b+tree at the moment..

> OTOH, if we're going to have to dynamically allocate the memory for
> the extent/inline data for the data fork, it may just be easier to
> make the entire data fork a dynamic allocation (like the attr fork).

I thought about this a bit, but it turned out that we basically
always need the data anyway, so I don't think it's going to buy
us much unless we shrink the inode enough that it fits better
into a page.
Kees Cook Aug. 29, 2017, 6:48 p.m. UTC | #7
On Mon, Aug 28, 2017 at 9:47 PM, Darrick J. Wong
<darrick.wong@oracle.com> wrote:
> On Mon, Aug 28, 2017 at 02:57:14PM -0700, Kees Cook wrote:
>> On Mon, Aug 28, 2017 at 2:49 PM, Darrick J. Wong
>> <darrick.wong@oracle.com> wrote:
>> > On Mon, Aug 28, 2017 at 02:34:56PM -0700, Kees Cook wrote:
>> >> From: David Windsor <dave@nullcore.net>
>> >>
>> >> The XFS inline inode data, stored in struct xfs_inode_t field
>> >> i_df.if_u2.if_inline_data and therefore contained in the xfs_inode slab
>> >> cache, needs to be copied to/from userspace.
>> >>
>> >> cache object allocation:
>> >>     fs/xfs/xfs_icache.c:
>> >>         xfs_inode_alloc(...):
>> >>             ...
>> >>             ip = kmem_zone_alloc(xfs_inode_zone, KM_SLEEP);
>> >>
>> >>     fs/xfs/libxfs/xfs_inode_fork.c:
>> >>         xfs_init_local_fork(...):
>> >>             ...
>> >>             if (mem_size <= sizeof(ifp->if_u2.if_inline_data))
>> >>                     ifp->if_u1.if_data = ifp->if_u2.if_inline_data;
>> >
>> > Hmm, what happens when mem_size > sizeof(if_inline_data)?  A slab object
>> > will be allocated for ifp->if_u1.if_data which can then be used for
>> > readlink in the same manner as the example usage trace below.  Does
>> > that allocated object have a need for a usercopy annotation like
>> > the one we're adding for if_inline_data?  Or is that already covered
>> > elsewhere?
>>
>> Yeah, the xfs helper kmem_alloc() is used in the other case, which
>> ultimately boils down to a call to kmalloc(), which is entirely
>> whitelisted by an earlier patch in the series:
>>
>> https://lkml.org/lkml/2017/8/28/1026
>
> Ah.  It would've been helpful to have the first three patches cc'd to
> the xfs list.  So basically this series establishes the ability to set

I went back and forth on that, and given all the things it touched, it
seemed like too large a CC list. :) I can explicitly add the xfs list
to the first three for any future versions.

> regions within a slab object into which copy_to_user can copy memory
> contents, and vice versa.  Have you seen any runtime performance impact?
> The overhead looks like it ought to be minimal.

Under CONFIG_HARDENED_USERCOPY, there's no difference in performance
between the earlier bounds checking (of the whole slab object) vs the
new bounds checking (of the useroffset/usersize portion of the slab
object). Perf difference of CONFIG_HARDENED_USERCOPY itself has proven
hard to measure, which likely means it's very minimal.

>> (It's possible that at some future time we can start segregating
>> kernel-only kmallocs from usercopy-able kmallocs, but for now, there
>> are no plans for this.)
>
> A pity.  It would be interesting to create no-usercopy versions of the
> kmalloc-* slabs and see how much of XFS' memory consumption never
> touches userspace buffers. :)

There are plans for building either a new helper (kmalloc_usercopy())
or adding a new flag (GFP_USERCOPY), but I haven't had time yet to
come back around to it. I wanted to land this step first, and we could
then move forward on the rest in future.

-Kees
Kees Cook Aug. 29, 2017, 6:55 p.m. UTC | #8
On Tue, Aug 29, 2017 at 1:14 AM, Christoph Hellwig <hch@infradead.org> wrote:
> One thing I've been wondering is whether we should actually just
> get rid of the inline area.  Compared to reading an inode from
> disk, a single additional kmalloc is negligible, and not having the
> inline data / extent list would allow us to reduce the inode size
> significantly.
>
> Kees/David:  how many of these patches are for file systems with some
> sort of inline data?  Given that it's only about 30 patches, declaring
> allocations either entirely valid for user copy or not might end up
> being nicer in many ways than these offsets.

9 filesystems use some form of inline data: xfs, vxfs, ufs, orangefs,
exofs, befs, jfs, ext2, and ext4. How much of each slab is whitelisted
varies by filesystem (e.g. ext2/4 uses i_data for other things, but
ufs and orangefs have a dedicated field for symlink names).

-Kees
Darrick J. Wong Aug. 29, 2017, 7 p.m. UTC | #9
On Tue, Aug 29, 2017 at 11:48:49AM -0700, Kees Cook wrote:
> On Mon, Aug 28, 2017 at 9:47 PM, Darrick J. Wong
> <darrick.wong@oracle.com> wrote:
> > On Mon, Aug 28, 2017 at 02:57:14PM -0700, Kees Cook wrote:
> >> On Mon, Aug 28, 2017 at 2:49 PM, Darrick J. Wong
> >> <darrick.wong@oracle.com> wrote:
> >> > On Mon, Aug 28, 2017 at 02:34:56PM -0700, Kees Cook wrote:
> >> >> From: David Windsor <dave@nullcore.net>
> >> >>
> >> >> The XFS inline inode data, stored in struct xfs_inode_t field
> >> >> i_df.if_u2.if_inline_data and therefore contained in the xfs_inode slab
> >> >> cache, needs to be copied to/from userspace.
> >> >>
> >> >> cache object allocation:
> >> >>     fs/xfs/xfs_icache.c:
> >> >>         xfs_inode_alloc(...):
> >> >>             ...
> >> >>             ip = kmem_zone_alloc(xfs_inode_zone, KM_SLEEP);
> >> >>
> >> >>     fs/xfs/libxfs/xfs_inode_fork.c:
> >> >>         xfs_init_local_fork(...):
> >> >>             ...
> >> >>             if (mem_size <= sizeof(ifp->if_u2.if_inline_data))
> >> >>                     ifp->if_u1.if_data = ifp->if_u2.if_inline_data;
> >> >
> >> > Hmm, what happens when mem_size > sizeof(if_inline_data)?  A slab object
> >> > will be allocated for ifp->if_u1.if_data which can then be used for
> >> > readlink in the same manner as the example usage trace below.  Does
> >> > that allocated object have a need for a usercopy annotation like
> >> > the one we're adding for if_inline_data?  Or is that already covered
> >> > elsewhere?
> >>
> >> Yeah, the xfs helper kmem_alloc() is used in the other case, which
> >> ultimately boils down to a call to kmalloc(), which is entirely
> >> whitelisted by an earlier patch in the series:
> >>
> >> https://lkml.org/lkml/2017/8/28/1026
> >
> > Ah.  It would've been helpful to have the first three patches cc'd to
> > the xfs list.  So basically this series establishes the ability to set
> 
> I went back and forth on that, and given all the things it touched, it
> seemed like too large a CC list. :) I can explicitly add the xfs list
> to the first three for any future versions.
> 
> > regions within a slab object into which copy_to_user can copy memory
> > contents, and vice versa.  Have you seen any runtime performance impact?
> > The overhead looks like it ought to be minimal.
> 
> Under CONFIG_HARDENED_USERCOPY, there's no difference in performance
> between the earlier bounds checking (of the whole slab object) vs the
> new bounds checking (of the useroffset/usersize portion of the slab
> object). Perf difference of CONFIG_HARDENED_USERCOPY itself has proven
> hard to measure, which likely means it's very minimal.
> 
> >> (It's possible that at some future time we can start segregating
> >> kernel-only kmallocs from usercopy-able kmallocs, but for now, there
> >> are no plans for this.)
> >
> > A pity.  It would be interesting to create no-usercopy versions of the
> > kmalloc-* slabs and see how much of XFS' memory consumption never
> > touches userspace buffers. :)
> 
> There are plans for building either a new helper (kmalloc_usercopy())
> or adding a new flag (GFP_USERCOPY), but I haven't had time yet to
> come back around to it. I wanted to land this step first, and we could
> then move forward on the rest in future.

Heh, fair enough.

For the XFS bits,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> 
> -Kees
> 
> -- 
> Kees Cook
> Pixel Security
Dave Chinner Aug. 29, 2017, 9:51 p.m. UTC | #10
On Tue, Aug 29, 2017 at 05:45:36AM -0700, Christoph Hellwig wrote:
> On Tue, Aug 29, 2017 at 10:31:26PM +1000, Dave Chinner wrote:
> > Probably should.  I've already been looking at killing the inline
> > extents array to simplify the management of the extent list (much
> > simpler to index by rbtree when we don't have direct/indirect
> > structures), so killing the inline data would get rid of the other
> > part of the union the inline data sits in.
> 
> That's exactly where I came from with my extent list work.  Although
> the rbtree performance was horrible due to the memory overhead and
> I've switched to a modified b+tree at the moment..

Right, I've looked at btrees, too, but it's more complex than just
using an rbtree. I originally looked at using Peter Z's old
RCU-aware btree code, but it doesn't hold data in the tree leaves.
So that needed significant modification to make it work without a
memory alloc per extent, and that didn't work with the original aim of
RCU-safe extent lookups.  I also looked at that "generic" btree
stuff that came from logfs, and after a little while ran away
screaming. So if we are going to use a b+tree, it sounds like you
are probably going the right way.

As it is, I've been looking at using an interval tree - I have kinda
working code - which basically leaves the page based extent arrays
intact but adds an rbnode/interval state header to the start of each
page to track the offsets within the node and propagate them back up
to the root for fast offset based extent lookups. With a lookaside
cache on the root, it should behave and perform almost identically
to the current indirect array and should have very little extra
overhead....
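
To make that concrete, the per-page header might look something like
this (a purely hypothetical sketch - none of these names exist in the
tree):

    /* hypothetical: header at the start of each extent array page */
    struct xfs_iext_page_hdr {
            struct rb_node  rb;          /* per-fork interval tree linkage */
            xfs_fileoff_t   start;       /* lowest file offset in this page */
            xfs_fileoff_t   subtree_max; /* propagated up for offset lookups */
    };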

The sticking point, IMO, is the extent array index based lookups in
all the bmbt code.  I've been looking at converting all that to use
offset based lookups and a cursor w/ lookup/inc/dec/insert/delete
operations wrapping xfs_iext_lookup_ext() and friends. This means
the modifications are pretty much identical to the on-disk extent
btree, so they can be abstracted out into a single extent update
interface for both trees.  Have you planned/done any cleanup/changes
with this code?

> > OTOH, if we're going to have to dynamically allocate the memory for
> > the extent/inline data for the data fork, it may just be easier to
> > make the entire data fork a dynamic allocation (like the attr fork).
> 
> I thought about this a bit, but it turned out that we basically
> always need the data anyway, so I don't think it's going to buy
> us much unless we shrink the inode enough that it fits better
> into a page.

True. Keep it mind for when we've shrunk the inode by another
100 bytes...

Cheers,

Dave.
Dave Chinner Aug. 29, 2017, 10:15 p.m. UTC | #11
On Tue, Aug 29, 2017 at 11:48:49AM -0700, Kees Cook wrote:
> On Mon, Aug 28, 2017 at 9:47 PM, Darrick J. Wong
> <darrick.wong@oracle.com> wrote:
> > On Mon, Aug 28, 2017 at 02:57:14PM -0700, Kees Cook wrote:
> >> On Mon, Aug 28, 2017 at 2:49 PM, Darrick J. Wong
> >> <darrick.wong@oracle.com> wrote:
> >> > On Mon, Aug 28, 2017 at 02:34:56PM -0700, Kees Cook wrote:
> >> >> From: David Windsor <dave@nullcore.net>
> >> >>
> >> >> The XFS inline inode data, stored in struct xfs_inode_t field
> >> >> i_df.if_u2.if_inline_data and therefore contained in the xfs_inode slab
> >> >> cache, needs to be copied to/from userspace.
> >> >>
> >> >> cache object allocation:
> >> >>     fs/xfs/xfs_icache.c:
> >> >>         xfs_inode_alloc(...):
> >> >>             ...
> >> >>             ip = kmem_zone_alloc(xfs_inode_zone, KM_SLEEP);
> >> >>
> >> >>     fs/xfs/libxfs/xfs_inode_fork.c:
> >> >>         xfs_init_local_fork(...):
> >> >>             ...
> >> >>             if (mem_size <= sizeof(ifp->if_u2.if_inline_data))
> >> >>                     ifp->if_u1.if_data = ifp->if_u2.if_inline_data;
> >> >
> >> > Hmm, what happens when mem_size > sizeof(if_inline_data)?  A slab object
> >> > will be allocated for ifp->if_u1.if_data which can then be used for
> >> > readlink in the same manner as the example usage trace below.  Does
> >> > that allocated object have a need for a usercopy annotation like
> >> > the one we're adding for if_inline_data?  Or is that already covered
> >> > elsewhere?
> >>
> >> Yeah, the xfs helper kmem_alloc() is used in the other case, which
> >> ultimately boils down to a call to kmalloc(), which is entirely
> >> whitelisted by an earlier patch in the series:
> >>
> >> https://lkml.org/lkml/2017/8/28/1026
> >
> > Ah.  It would've been helpful to have the first three patches cc'd to
> > the xfs list.  So basically this series establishes the ability to set
> 
> I went back and forth on that, and given all the things it touched, it
> seemed like too large a CC list. :) I can explicitly add the xfs list
> to the first three for any future versions.

If you are touching multiple filesystems, you really should cc the
entire patchset to linux-fsdevel, similar to how you sent the entire
patchset to lkml. That way the entire series will end up on a list
that almost all fs developers read. LKML is not a list you can rely
on all filesystem developers reading (or developers in any other
subsystem, for that matter)...

Cheers,

Dave.
Kees Cook Aug. 29, 2017, 10:25 p.m. UTC | #12
On Tue, Aug 29, 2017 at 3:15 PM, Dave Chinner <david@fromorbit.com> wrote:
> If you are touching multiple filesystems, you really should cc the
> entire patchset to linux-fsdevel, similar to how you sent the entire
> patchset to lkml. That way the entire series will end up on a list
> that almost all fs developers read. LKML is not a list you can rely
> on all filesystem developers reading (or developers in any other
> subsystem, for that matter)...

Okay, sounds good. Thanks!

-Kees
Christoph Hellwig Aug. 30, 2017, 7:14 a.m. UTC | #13
On Wed, Aug 30, 2017 at 07:51:57AM +1000, Dave Chinner wrote:
> Right, I've looked at btrees, too, but it's more complex than just
> using an rbtree. I originally looked at using Peter Z's old
> RCU-aware btree code, but it doesn't hold data in the tree leaves.
> So that needed significant modification to make it work without a
> memory alloc per extent, and that didn't work with the original aim of
> RCU-safe extent lookups.  I also looked at that "generic" btree
> stuff that came from logfs, and after a little while ran away
> screaming.

I started with the latter, but it's not really looking like it any more:
the nodes are formatted as a series of u64s instead of all the
long magic, and the data is stored inline - in fact I use a cute
trick to keep the size down, derived from our "compressed" on disk
extent format:

Key:

 +-------+----------------------------+
 | 00:51 | all 52 bits of startoff    |
 | 52:63 | low 12 bits of startblock  |
 +-------+----------------------------+

Value

 +-------+----------------------------+
 | 00:20 | all 21 bits of length      |
 |    21 | unwritten extent bit       |
 | 22:63 | high 42 bits of startblock |
 +-------+----------------------------+

So we only need a 64-bit key and a 64-bit value by abusing parts
of the key to store bits of the startblock.
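
In C, the packing might look roughly like this (a sketch only - the
helper names are invented, and the real code may differ):

    #include <linux/types.h>

    /* key: bits 0:51 startoff, bits 52:63 low 12 bits of startblock */
    static inline u64 iext_key(u64 startoff, u64 startblock)
    {
            return (startoff & ((1ULL << 52) - 1)) |
                   ((startblock & 0xfffULL) << 52);
    }

    /* value: bits 0:20 length, bit 21 unwritten bit, bits 22:63 high
     * 42 bits of startblock */
    static inline u64 iext_value(u64 length, bool unwritten, u64 startblock)
    {
            return (length & ((1ULL << 21) - 1)) |
                   ((u64)unwritten << 21) |
                   ((startblock >> 12) << 22);
    }

    /* reassemble startblock: high 42 bits from the value, low 12 bits
     * from the key */
    static inline u64 iext_startblock(u64 key, u64 value)
    {
            return ((value >> 22) << 12) | (key >> 52);
    }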

For non-leaf nodes we iterate through the keys only, never touching
the cache lines for the value.  For the leaf nodes we have to touch
the value anyway because we have to do a range lookup to find the
exact record.

This works fine so far in an isolated simulator, and now I'm amending
it to be a b+tree with pointers to the previous and next node so
that we can nicely implement our extent iterators instead of doing
full lookups.

> The sticking point, IMO, is the extent array index based lookups in
> all the bmbt code.  I've been looking at converting all that to use
> offset based lookups and a cursor w/ lookup/inc/dec/insert/delete
> operations wrapping xfs_iext_lookup_ext() and friends. This means
> the modifications are pretty much identical to the on-disk extent
> btree, so they can be abstracted out into a single extent update
> interface for both trees.  Have you planned/done any cleanup/changes
> with this code?

I've done various cleanups, but I've not yet consolidated the two.
Basically step one at the moment is to move everyone to
xfs_iext_lookup_extent + xfs_iext_get_extent that removes all the
bad intrusion.

Once we move to the actual b+trees the extnum_t cursor will be replaced
with a real cursor structure that contains a pointer to the current
b+tree leaf node, and an index inside that, which will allow us very
efficient iteration.  The xfs_iext_get_extent calls will be replaced
with more specific xfs_iext_prev_extent, xfs_iext_next_extent calls
that include the now slightly more complex cursor decrement, increment
as well as a new xfs_iext_last_extent helper for the last extent
that we need in a few places.
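
As a rough sketch - assuming names that may not match the eventual
patches - such a cursor could be as small as:

    struct xfs_iext_cursor {
            struct xfs_iext_leaf    *leaf;  /* current b+tree leaf node */
            int                     pos;    /* record index within that leaf */
    };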

insert/delete remain very similar to what they do right now, they'll
get a different cursor type, and the manual xfs_iext_add calls will
go away.  The new xfs_iext_update_extent helper I posted to the list
yesterday will become a bit more complex, as changing the startoff
will have to be propagated up the tree.
Dave Chinner Aug. 30, 2017, 8:05 a.m. UTC | #14
On Wed, Aug 30, 2017 at 12:14:03AM -0700, Christoph Hellwig wrote:
> On Wed, Aug 30, 2017 at 07:51:57AM +1000, Dave Chinner wrote:
> > Right, I've looked at btrees, too, but it's more complex than just
> > using an rbtree. I originally looked at using Peter Z's old
> > RCU-aware btree code, but it doesn't hold data in the tree leaves.
> > So that needed significant modification to make it work without a
> > memory alloc per extent, and that didn't work with the original aim of
> > RCU-safe extent lookups.  I also looked at that "generic" btree
> > stuff that came from logfs, and after a little while ran away
> > screaming.
> 
> I started with the latter, but it's not really looking like it any more:
> the nodes are formatted as a series of u64s instead of all the
> long magic,

Yeah, that was about where I started to run away and look for
something nicer....

> and the data is stored inline - in fact I use a cute
> trick to keep the size down, derived from our "compressed" on disk
> extent format:
> 
> Key:
> 
>  +-------+----------------------------+
>  | 00:51 | all 52 bits of startoff    |
>  | 52:63 | low 12 bits of startblock  |
>  +-------+----------------------------+
> 
> Value
> 
>  +-------+----------------------------+
>  | 00:20 | all 21 bits of length      |
>  |    21 | unwritten extent bit       |
>  | 22:63 | high 42 bits of startblock |
>  +-------+----------------------------+
> 
> So we only need a 64-bit key and a 64-bit value by abusing parts
> of the key to store bits of the startblock.

Neat! :)

> For non-leaf nodes we iterate through the keys only, never touching
> the cache lines for the value.  For the leaf nodes we have to touch
> the value anyway because we have to do a range lookup to find the
> exact record.
> 
> This works fine so far in an isolated simulator, and now I'm amending
> it to be a b+tree with pointers to the previous and next node so
> that we can nicely implement our extent iterators instead of doing
> full lookups.

Ok, that sounds exactly what I have been looking towards....

> > The sticking point, IMO, is the extent array index based lookups in
> > all the bmbt code.  I've been looking at converting all that to use
> > offset based lookups and a cursor w/ lookup/inc/dec/insert/delete
> > operations wrapping xfs_iext_lookup_ext() and friends. This means
> > the modifications are pretty much identical to the on-disk extent
> > btree, so they can be abstracted out into a single extent update
> > interface for both trees.  Have you planned/done any cleanup/changes
> > with this code?
> 
> I've done various cleanups, but I've not yet consolidated the two.
> Basically step one at the moment is to move everyone to
> xfs_iext_lookup_extent + xfs_iext_get_extent that removes all the
> bad intrusion.

Yup.

> Once we move to the actual b+trees the extnum_t cursor will be replaced
> with a real cursor structure that contains a pointer to the current
> b+tree leaf node, and an index inside that, which will allow us very
> efficient iteration.  The xfs_iext_get_extent calls will be replaced
> with more specific xfs_iext_prev_extent, xfs_iext_next_extent calls
> that include the now slightly more complex cursor decrement, increment
> as well as a new xfs_iext_last_extent helper for the last extent
> that we need in a few places.

Ok, that sounds like it'll fit right in with what I've been
prototyping for the extent code in xfs_bmap.c. I can make that work
with a cursor-based lookup/inc/dec/ins/del API similar to the bmbt
API. I've been looking to abstract the extent manipulations out into
functions that modify both trees like this:

[note: just put template code in to get my thoughts straight, it's
not working code]

+static int
+xfs_bmex_delete(
+       struct xfs_iext_cursor          *icur,
+       struct xfs_btree_cursor         *cur,
+       int                             *nextents)
+{
+       int                             i;
+       int                             error;
+
+       xfs_iext_remove(bma->ip, bma->idx + 1, 2, state);
+       if (nextents)
+               (*nextents)--;
+       if (!cur)
+               return 0;
+       error = xfs_btree_delete(cur, &i);
+       if (error)
+               return error;
+       XFS_WANT_CORRUPTED_RETURN(cur->bc_mp, i == 1);
+       return 0;
+}
+
+static int
+xfs_bmex_increment(
+       struct xfs_iext_cursor          *icur,
+       struct xfs_btree_cursor         *cur)
+{
+       int                             i;
+       int                             error;
+
+       icur->ep = xfs_iext_get_right_ext(icur->ep);
+       if (!cur)
+               return 0;
+       error = xfs_btree_increment(cur, 0, &i);
+       if (error)
+               return error;
+       XFS_WANT_CORRUPTED_RETURN(cur->bc_mp, i == 1);
+       return 0;
+}
+
+static int
+xfs_bmex_decrement(
+       struct xfs_iext_cursor          *icur,
+       struct xfs_btree_cursor         *cur)
+{
+       int                             i;
+       int                             error;
+
+       icur->ep = xfs_iext_get_left_ext(icur->ep);
+       if (!cur)
+               return 0;
+       error = xfs_btree_decrement(cur, 0, &i);
+       if (error)
+               return error;
+       XFS_WANT_CORRUPTED_RETURN(cur->bc_mp, i == 1);
+       return 0;
+}

And so what you're doing would fit straight into that. What I'm
ending up with are extent operations that look like this:

xfs_bmap_add_extent_delay_real()
.....
	case BMAP_LEFT_FILLING | BMAP_LEFT_CONTIG |
             BMAP_RIGHT_FILLING | BMAP_RIGHT_CONTIG:
                /*
                 * Filling in all of a previously delayed allocation extent.
                 * The left and right neighbors are both contiguous with new.
                 */
+               rval |= XFS_ILOG_CORE;
+
+               /* remove the incore delalloc extent first */
+               error = xfs_bmex_delete(&icur, NULL, nextents);
+               if (error)
+                       goto done;
+
+               /*
+                * update incore and bmap extent trees
+                *      1. set cursors to the right extent
+                *      2. remove the right extent
+                *      3. update the left extent to span all 3 extent ranges
+                */
+               error = xfs_bmex_lookup_eq(&icur, bma->cur, RIGHT.br_startoff,
+                               RIGHT.br_startblock, RIGHT.br_blockcount, 1);
+               if (error)
+                       goto done;
+               error = xfs_bmex_delete(&icur, bma->cur, NULL);
+               if (error)
+                       goto done;
+               error = xfs_bmex_decrement(&icur, bma->cur);
+               if (error)
+                       goto done;
+               error = xfs_bmex_update(&icur, bma->cur, LEFT.br_startoff,
+                               LEFT.br_startblock,
+                               LEFT.br_blockcount + PREV.br_blockcount +
+                                       RIGHT.br_blockcount,
+                               LEFT.br_state);
+               if (error)
+                       goto done;
 		break;
....

And I'm starting to see where there are common extent manipulations
being done so there's probably a fair amount of further factoring
that can be done on top of this....

> insert/delete remain very similar to what they do right now, they'll
> get a different cursor type, and the manual xfs_iext_add calls will
> go away.  The new xfs_iext_update_extent helper I posted to the list
> yesterday will become a bit more complex, as changing the startoff
> will have to be propagated up the tree.

I've had a quick look at them and pulled it down into my tree for
testing (which had a cpu burning hang on xfs/020 a few minutes ago),
but I'll spend more time grokking them tomorrow.

Cheers,

Dave.
Christoph Hellwig Aug. 30, 2017, 8:33 a.m. UTC | #15
On Wed, Aug 30, 2017 at 06:05:58PM +1000, Dave Chinner wrote:
> Ok, that's sounds like it'll fit right in with what I've been
> prototyping for the extent code in xfs_bmap.c. I can make that work
> with a cursor-based lookup/inc/dec/ins/del API similar to the bmbt
> API. I've been looking to abstract the extent manipulations out into
> functions that modify both trees like this:
> 
> [note: just put template code in to get my thoughts straight, it's
> not working code]

FYI, I've got somewhat working changes in that area (still has bugs
but a few tests pass :)), what I'm doing is to make sure all of
the xfs_bmap_{add,del}_extent_* routines fully operate on xfs_bmbt_irec
structures that they acquire through the xfs_bmalloca structure or
from xfs_iext_get_extent and update using xfs_iext_update_extent.
A nice fallout from that is that we can change the prototypes for
xfs_bmbt_lookup_* and xfs_bmbt_update to take a xfs_bmbt_irec
as well instead of taking the individual arguments.  That should
help with your next step cleanups a bit.
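
Presumably the prototypes would end up looking something like this
(a sketch of the change described above, not the actual patches):

    int xfs_bmbt_lookup_eq(struct xfs_btree_cur *cur,
                           struct xfs_bmbt_irec *irec, int *stat);
    int xfs_bmbt_update(struct xfs_btree_cur *cur,
                        struct xfs_bmbt_irec *irec);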

Patch

diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index 4d85992d75b2..08358f38dee6 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -110,6 +110,16 @@  kmem_zone_init_flags(int size, char *zone_name, unsigned long flags,
 	return kmem_cache_create(zone_name, size, 0, flags, construct);
 }
 
+static inline kmem_zone_t *
+kmem_zone_init_flags_usercopy(int size, char *zone_name, unsigned long flags,
+				size_t useroffset, size_t usersize,
+				void (*construct)(void *))
+{
+	return kmem_cache_create_usercopy(zone_name, size, 0, flags,
+				useroffset, usersize, construct);
+}
+
+
 static inline void
 kmem_zone_free(kmem_zone_t *zone, void *ptr)
 {
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 38aaacdbb8b3..6ca428c6f943 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1829,9 +1829,12 @@  xfs_init_zones(void)
 		goto out_destroy_efd_zone;
 
 	xfs_inode_zone =
-		kmem_zone_init_flags(sizeof(xfs_inode_t), "xfs_inode",
+		kmem_zone_init_flags_usercopy(sizeof(xfs_inode_t), "xfs_inode",
 			KM_ZONE_HWALIGN | KM_ZONE_RECLAIM | KM_ZONE_SPREAD |
-			KM_ZONE_ACCOUNT, xfs_fs_inode_init_once);
+				KM_ZONE_ACCOUNT,
+			offsetof(xfs_inode_t, i_df.if_u2.if_inline_data),
+			sizeof_field(xfs_inode_t, i_df.if_u2.if_inline_data),
+			xfs_fs_inode_init_once);
 	if (!xfs_inode_zone)
 		goto out_destroy_efi_zone;