Message ID | 1503956111-36652-16-git-send-email-keescook@chromium.org (mailing list archive)
---|---
State | New, archived
On Mon, Aug 28, 2017 at 02:34:56PM -0700, Kees Cook wrote:
> From: David Windsor <dave@nullcore.net>
>
> The XFS inline inode data, stored in struct xfs_inode_t field
> i_df.if_u2.if_inline_data and therefore contained in the xfs_inode slab
> cache, needs to be copied to/from userspace.
>
> cache object allocation:
>     fs/xfs/xfs_icache.c:
>         xfs_inode_alloc(...):
>             ...
>             ip = kmem_zone_alloc(xfs_inode_zone, KM_SLEEP);
>
>     fs/xfs/libxfs/xfs_inode_fork.c:
>         xfs_init_local_fork(...):
>             ...
>             if (mem_size <= sizeof(ifp->if_u2.if_inline_data))
>                 ifp->if_u1.if_data = ifp->if_u2.if_inline_data;

Hmm, what happens when mem_size > sizeof(if_inline_data)?  A slab object
will be allocated for ifp->if_u1.if_data which can then be used for
readlink in the same manner as the example usage trace below.  Does
that allocated object have a need for a usercopy annotation like
the one we're adding for if_inline_data?  Or is that already covered
elsewhere?

--D

>             ...
>
>     fs/xfs/xfs_symlink.c:
>         xfs_symlink(...):
>             ...
>             xfs_init_local_fork(ip, XFS_DATA_FORK, target_path, pathlen);
>
> example usage trace:
>     readlink_copy+0x43/0x70
>     vfs_readlink+0x62/0x110
>     SyS_readlinkat+0x100/0x130
>
>     fs/xfs/xfs_iops.c:
>         (via inode->i_op->get_link)
>         xfs_vn_get_link_inline(...):
>             ...
>             return XFS_I(inode)->i_df.if_u1.if_data;
>
>     fs/namei.c:
>         readlink_copy(..., link):
>             ...
>             copy_to_user(..., link, len);
>
>         generic_readlink(dentry, ...):
>             struct inode *inode = d_inode(dentry);
>             const char *link = inode->i_link;
>             ...
>             if (!link) {
>                 link = inode->i_op->get_link(dentry, inode, &done);
>                 ...
>             readlink_copy(..., link);
>
> In support of usercopy hardening, this patch defines a region in the
> xfs_inode slab cache in which userspace copy operations are allowed.
>
> This region is known as the slab cache's usercopy region. Slab caches can
> now check that each copy operation involving cache-managed memory falls
> entirely within the slab's usercopy region.
>
> This patch is modified from Brad Spengler/PaX Team's PAX_USERCOPY
> whitelisting code in the last public patch of grsecurity/PaX based on my
> understanding of the code. Changes or omissions from the original code are
> mine and don't reflect the original grsecurity/PaX code.
>
> Signed-off-by: David Windsor <dave@nullcore.net>
> [kees: adjust commit log, provide usage trace]
> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> Cc: linux-xfs@vger.kernel.org
> Signed-off-by: Kees Cook <keescook@chromium.org>
> ---
>  fs/xfs/kmem.h      | 10 ++++++++++
>  fs/xfs/xfs_super.c |  7 +++++--
>  2 files changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> index 4d85992d75b2..08358f38dee6 100644
> --- a/fs/xfs/kmem.h
> +++ b/fs/xfs/kmem.h
> @@ -110,6 +110,16 @@ kmem_zone_init_flags(int size, char *zone_name, unsigned long flags,
>  	return kmem_cache_create(zone_name, size, 0, flags, construct);
>  }
>
> +static inline kmem_zone_t *
> +kmem_zone_init_flags_usercopy(int size, char *zone_name, unsigned long flags,
> +		size_t useroffset, size_t usersize,
> +		void (*construct)(void *))
> +{
> +	return kmem_cache_create_usercopy(zone_name, size, 0, flags,
> +			useroffset, usersize, construct);
> +}
> +
> +
>  static inline void
>  kmem_zone_free(kmem_zone_t *zone, void *ptr)
>  {
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 38aaacdbb8b3..6ca428c6f943 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1829,9 +1829,12 @@ xfs_init_zones(void)
>  		goto out_destroy_efd_zone;
>
>  	xfs_inode_zone =
> -		kmem_zone_init_flags(sizeof(xfs_inode_t), "xfs_inode",
> +		kmem_zone_init_flags_usercopy(sizeof(xfs_inode_t), "xfs_inode",
>  			KM_ZONE_HWALIGN | KM_ZONE_RECLAIM | KM_ZONE_SPREAD |
> -			KM_ZONE_ACCOUNT, xfs_fs_inode_init_once);
> +			KM_ZONE_ACCOUNT,
> +			offsetof(xfs_inode_t, i_df.if_u2.if_inline_data),
> +			sizeof_field(xfs_inode_t, i_df.if_u2.if_inline_data),
> +			xfs_fs_inode_init_once);
>  	if (!xfs_inode_zone)
>  		goto out_destroy_efi_zone;
>
> --
> 2.7.4
On Mon, Aug 28, 2017 at 2:49 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> On Mon, Aug 28, 2017 at 02:34:56PM -0700, Kees Cook wrote:
>> From: David Windsor <dave@nullcore.net>
>>
>> The XFS inline inode data, stored in struct xfs_inode_t field
>> i_df.if_u2.if_inline_data and therefore contained in the xfs_inode slab
>> cache, needs to be copied to/from userspace.
>>
>> cache object allocation:
>>     fs/xfs/xfs_icache.c:
>>         xfs_inode_alloc(...):
>>             ...
>>             ip = kmem_zone_alloc(xfs_inode_zone, KM_SLEEP);
>>
>>     fs/xfs/libxfs/xfs_inode_fork.c:
>>         xfs_init_local_fork(...):
>>             ...
>>             if (mem_size <= sizeof(ifp->if_u2.if_inline_data))
>>                 ifp->if_u1.if_data = ifp->if_u2.if_inline_data;
>
> Hmm, what happens when mem_size > sizeof(if_inline_data)?  A slab object
> will be allocated for ifp->if_u1.if_data which can then be used for
> readlink in the same manner as the example usage trace below.  Does
> that allocated object have a need for a usercopy annotation like
> the one we're adding for if_inline_data?  Or is that already covered
> elsewhere?

Yeah, the xfs helper kmem_alloc() is used in the other case, which
ultimately boils down to a call to kmalloc(), which is entirely
whitelisted by an earlier patch in the series:

https://lkml.org/lkml/2017/8/28/1026

(It's possible that at some future time we can start segregating
kernel-only kmallocs from usercopy-able kmallocs, but for now, there
are no plans for this.)

-Kees
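The branch Darrick is asking about is the else side of the quoted test. Condensed from fs/xfs/libxfs/xfs_inode_fork.c of that era (paraphrased, so the exact sizing variable and allocation flags may differ), the two storage paths are:

	/* xfs_init_local_fork(), condensed: */
	if (mem_size <= sizeof(ifp->if_u2.if_inline_data))
		/* small fork: lives inside the xfs_inode slab object,
		 * hence the usercopy region this patch adds */
		ifp->if_u1.if_data = ifp->if_u2.if_inline_data;
	else
		/* larger fork: separate heap object via kmem_alloc(),
		 * i.e. kmalloc(), covered by the kmalloc whitelist patch */
		ifp->if_u1.if_data = kmem_alloc(real_size, KM_SLEEP | KM_NOFS);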
On Mon, Aug 28, 2017 at 02:57:14PM -0700, Kees Cook wrote:
> On Mon, Aug 28, 2017 at 2:49 PM, Darrick J. Wong
> <darrick.wong@oracle.com> wrote:
> > On Mon, Aug 28, 2017 at 02:34:56PM -0700, Kees Cook wrote:
> >> From: David Windsor <dave@nullcore.net>
> >>
> >> The XFS inline inode data, stored in struct xfs_inode_t field
> >> i_df.if_u2.if_inline_data and therefore contained in the xfs_inode slab
> >> cache, needs to be copied to/from userspace.
> >>
> >> cache object allocation:
> >>     fs/xfs/xfs_icache.c:
> >>         xfs_inode_alloc(...):
> >>             ...
> >>             ip = kmem_zone_alloc(xfs_inode_zone, KM_SLEEP);
> >>
> >>     fs/xfs/libxfs/xfs_inode_fork.c:
> >>         xfs_init_local_fork(...):
> >>             ...
> >>             if (mem_size <= sizeof(ifp->if_u2.if_inline_data))
> >>                 ifp->if_u1.if_data = ifp->if_u2.if_inline_data;
> >
> > Hmm, what happens when mem_size > sizeof(if_inline_data)?  A slab object
> > will be allocated for ifp->if_u1.if_data which can then be used for
> > readlink in the same manner as the example usage trace below.  Does
> > that allocated object have a need for a usercopy annotation like
> > the one we're adding for if_inline_data?  Or is that already covered
> > elsewhere?
>
> Yeah, the xfs helper kmem_alloc() is used in the other case, which
> ultimately boils down to a call to kmalloc(), which is entirely
> whitelisted by an earlier patch in the series:
>
> https://lkml.org/lkml/2017/8/28/1026

Ah.  It would've been helpful to have the first three patches cc'd to
the xfs list.  So basically this series establishes the ability to set
regions within a slab object into which copy_to_user can copy memory
contents, and vice versa.  Have you seen any runtime performance impact?
The overhead looks like it ought to be minimal.

> (It's possible that at some future time we can start segregating
> kernel-only kmallocs from usercopy-able kmallocs, but for now, there
> are no plans for this.)

A pity.  It would be interesting to create no-usercopy versions of the
kmalloc-* slabs and see how much of XFS' memory consumption never
touches userspace buffers. :)

--D

> -Kees
>
> --
> Kees Cook
> Pixel Security
One thing I've been wondering is whether we should actually just
get rid of the inline area.  Compared to reading an inode from
disk, a single additional kmalloc is negligible, and not having the
inline data / extent list would allow us to reduce the inode size
significantly.

Kees/David: how many of these patches are for file systems with some
sort of inline data?  Given that it's only about 30 patches, declaring
allocations either entirely valid for user copy or not might end up
being nicer in many ways than these offsets.
On Tue, Aug 29, 2017 at 01:14:53AM -0700, Christoph Hellwig wrote:
> One thing I've been wondering is whether we should actually just
> get rid of the inline area.  Compared to reading an inode from
> disk, a single additional kmalloc is negligible, and not having the
> inline data / extent list would allow us to reduce the inode size
> significantly.

Probably should. I've already been looking at killing the inline
extents array to simplify the management of the extent list (much
simpler to index by rbtree when we don't have direct/indirect
structures), so killing the inline data would get rid of the other
part of the union the inline data sits in.

OTOH, if we're going to have to dynamically allocate the memory for
the extent/inline data for the data fork, it may just be easier to
make the entire data fork a dynamic allocation (like the attr fork).

Cheers,

Dave.
On Tue, Aug 29, 2017 at 10:31:26PM +1000, Dave Chinner wrote:
> Probably should. I've already been looking at killing the inline
> extents array to simplify the management of the extent list (much
> simpler to index by rbtree when we don't have direct/indirect
> structures), so killing the inline data would get rid of the other
> part of the union the inline data sits in.

That's exactly where I came from with my extent list work, although
the rbtree performance was horrible due to the memory overhead, and
I've switched to a modified b+tree at the moment.

> OTOH, if we're going to have to dynamically allocate the memory for
> the extent/inline data for the data fork, it may just be easier to
> make the entire data fork a dynamic allocation (like the attr fork).

I thought about this a bit, but it turned out that we basically
always need the data anyway, so I don't think it's going to buy
us much unless we shrink the inode enough so that they better fit
into a page.
On Mon, Aug 28, 2017 at 9:47 PM, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> On Mon, Aug 28, 2017 at 02:57:14PM -0700, Kees Cook wrote:
>> On Mon, Aug 28, 2017 at 2:49 PM, Darrick J. Wong
>> <darrick.wong@oracle.com> wrote:
>> > On Mon, Aug 28, 2017 at 02:34:56PM -0700, Kees Cook wrote:
>> >> From: David Windsor <dave@nullcore.net>
>> >>
>> >> The XFS inline inode data, stored in struct xfs_inode_t field
>> >> i_df.if_u2.if_inline_data and therefore contained in the xfs_inode slab
>> >> cache, needs to be copied to/from userspace.
>> >>
>> >> cache object allocation:
>> >>     fs/xfs/xfs_icache.c:
>> >>         xfs_inode_alloc(...):
>> >>             ...
>> >>             ip = kmem_zone_alloc(xfs_inode_zone, KM_SLEEP);
>> >>
>> >>     fs/xfs/libxfs/xfs_inode_fork.c:
>> >>         xfs_init_local_fork(...):
>> >>             ...
>> >>             if (mem_size <= sizeof(ifp->if_u2.if_inline_data))
>> >>                 ifp->if_u1.if_data = ifp->if_u2.if_inline_data;
>> >
>> > Hmm, what happens when mem_size > sizeof(if_inline_data)?  A slab object
>> > will be allocated for ifp->if_u1.if_data which can then be used for
>> > readlink in the same manner as the example usage trace below.  Does
>> > that allocated object have a need for a usercopy annotation like
>> > the one we're adding for if_inline_data?  Or is that already covered
>> > elsewhere?
>>
>> Yeah, the xfs helper kmem_alloc() is used in the other case, which
>> ultimately boils down to a call to kmalloc(), which is entirely
>> whitelisted by an earlier patch in the series:
>>
>> https://lkml.org/lkml/2017/8/28/1026
>
> Ah.  It would've been helpful to have the first three patches cc'd to
> the xfs list.  So basically this series establishes the ability to set

I went back and forth on that, and given all the things it touched, it
seemed like too large a CC list. :) I can explicitly add the xfs list
to the first three for any future versions.

> regions within a slab object into which copy_to_user can copy memory
> contents, and vice versa.  Have you seen any runtime performance impact?
> The overhead looks like it ought to be minimal.

Under CONFIG_HARDENED_USERCOPY, there's no difference in performance
between the earlier bounds checking (of the whole slab object) vs the
new bounds checking (of the useroffset/usersize portion of the slab
object). Perf difference of CONFIG_HARDENED_USERCOPY itself has proven
hard to measure, which likely means it's very minimal.

>> (It's possible that at some future time we can start segregating
>> kernel-only kmallocs from usercopy-able kmallocs, but for now, there
>> are no plans for this.)
>
> A pity.  It would be interesting to create no-usercopy versions of the
> kmalloc-* slabs and see how much of XFS' memory consumption never
> touches userspace buffers. :)

There are plans for building either a new helper (kmalloc_usercopy())
or adding a new flag (GFP_USERCOPY), but I haven't had time yet to
come back around to it. I wanted to land this step first, and we could
then move forward on the rest in future.

-Kees
On Tue, Aug 29, 2017 at 1:14 AM, Christoph Hellwig <hch@infradead.org> wrote:
> One thing I've been wondering is whether we should actually just
> get rid of the inline area.  Compared to reading an inode from
> disk, a single additional kmalloc is negligible, and not having the
> inline data / extent list would allow us to reduce the inode size
> significantly.
>
> Kees/David: how many of these patches are for file systems with some
> sort of inline data?  Given that it's only about 30 patches, declaring
> allocations either entirely valid for user copy or not might end up
> being nicer in many ways than these offsets.

9 filesystems use some form of inline data: xfs, vxfs, ufs, orangefs,
exofs, befs, jfs, ext2, and ext4. How much of each slab is whitelisted
varies by filesystem (e.g. ext2/4 uses i_data for other things, but
ufs and orangefs have a dedicated field for symlink names).

-Kees
On Tue, Aug 29, 2017 at 11:48:49AM -0700, Kees Cook wrote:
> On Mon, Aug 28, 2017 at 9:47 PM, Darrick J. Wong
> <darrick.wong@oracle.com> wrote:
> > On Mon, Aug 28, 2017 at 02:57:14PM -0700, Kees Cook wrote:
> >> On Mon, Aug 28, 2017 at 2:49 PM, Darrick J. Wong
> >> <darrick.wong@oracle.com> wrote:
> >> > On Mon, Aug 28, 2017 at 02:34:56PM -0700, Kees Cook wrote:
> >> >> From: David Windsor <dave@nullcore.net>
> >> >>
> >> >> The XFS inline inode data, stored in struct xfs_inode_t field
> >> >> i_df.if_u2.if_inline_data and therefore contained in the xfs_inode slab
> >> >> cache, needs to be copied to/from userspace.
> >> >>
> >> >> cache object allocation:
> >> >>     fs/xfs/xfs_icache.c:
> >> >>         xfs_inode_alloc(...):
> >> >>             ...
> >> >>             ip = kmem_zone_alloc(xfs_inode_zone, KM_SLEEP);
> >> >>
> >> >>     fs/xfs/libxfs/xfs_inode_fork.c:
> >> >>         xfs_init_local_fork(...):
> >> >>             ...
> >> >>             if (mem_size <= sizeof(ifp->if_u2.if_inline_data))
> >> >>                 ifp->if_u1.if_data = ifp->if_u2.if_inline_data;
> >> >
> >> > Hmm, what happens when mem_size > sizeof(if_inline_data)?  A slab object
> >> > will be allocated for ifp->if_u1.if_data which can then be used for
> >> > readlink in the same manner as the example usage trace below.  Does
> >> > that allocated object have a need for a usercopy annotation like
> >> > the one we're adding for if_inline_data?  Or is that already covered
> >> > elsewhere?
> >>
> >> Yeah, the xfs helper kmem_alloc() is used in the other case, which
> >> ultimately boils down to a call to kmalloc(), which is entirely
> >> whitelisted by an earlier patch in the series:
> >>
> >> https://lkml.org/lkml/2017/8/28/1026
> >
> > Ah.  It would've been helpful to have the first three patches cc'd to
> > the xfs list.  So basically this series establishes the ability to set
>
> I went back and forth on that, and given all the things it touched, it
> seemed like too large a CC list. :) I can explicitly add the xfs list
> to the first three for any future versions.
>
> > regions within a slab object into which copy_to_user can copy memory
> > contents, and vice versa.  Have you seen any runtime performance impact?
> > The overhead looks like it ought to be minimal.
>
> Under CONFIG_HARDENED_USERCOPY, there's no difference in performance
> between the earlier bounds checking (of the whole slab object) vs the
> new bounds checking (of the useroffset/usersize portion of the slab
> object). Perf difference of CONFIG_HARDENED_USERCOPY itself has proven
> hard to measure, which likely means it's very minimal.
>
> >> (It's possible that at some future time we can start segregating
> >> kernel-only kmallocs from usercopy-able kmallocs, but for now, there
> >> are no plans for this.)
> >
> > A pity.  It would be interesting to create no-usercopy versions of the
> > kmalloc-* slabs and see how much of XFS' memory consumption never
> > touches userspace buffers. :)
>
> There are plans for building either a new helper (kmalloc_usercopy())
> or adding a new flag (GFP_USERCOPY), but I haven't had time yet to
> come back around to it. I wanted to land this step first, and we could
> then move forward on the rest in future.

Heh, fair enough.  For the XFS bits,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> -Kees
>
> --
> Kees Cook
> Pixel Security
On Tue, Aug 29, 2017 at 05:45:36AM -0700, Christoph Hellwig wrote:
> On Tue, Aug 29, 2017 at 10:31:26PM +1000, Dave Chinner wrote:
> > Probably should. I've already been looking at killing the inline
> > extents array to simplify the management of the extent list (much
> > simpler to index by rbtree when we don't have direct/indirect
> > structures), so killing the inline data would get rid of the other
> > part of the union the inline data sits in.
>
> That's exactly where I came from with my extent list work, although
> the rbtree performance was horrible due to the memory overhead, and
> I've switched to a modified b+tree at the moment.

Right, I've looked at btrees, too, but it's more complex than just
using an rbtree. I originally looked at using Peter Z's old
RCU-aware btree code, but it doesn't hold data in the tree leaves.
So that needed significant modification to make it work without a
memory alloc per extent, and that didn't work with the original aim
of RCU-safe extent lookups.  I also looked at that "generic" btree
stuff that came from logfs, and after a little while ran away
screaming.  So if we are going to use a b+tree, it sounds like you
are probably going the right way.

As it is, I've been looking at using an interval tree - I have kinda
working code - which basically leaves the page based extent arrays
intact but adds an rbnode/interval state header to the start of each
page to track the offsets within the node and propagate them back up
to the root for fast offset based extent lookups. With a lookaside
cache on the root, it should behave and perform almost identically
to the current indirect array and should have very little extra
overhead....

The sticking point, IMO, is the extent array index based lookups in
all the bmbt code. I've been looking at converting all that to use
offset based lookups and a cursor w/ lookup/inc/dec/insert/delete
operations wrapping xfs_iext_lookup_ext() and friends. This means
the modifications are pretty much identical to the on-disk extent
btree, so they can be abstracted out into a single extent update
interface for both trees. Have you planned/done any cleanup/changes
with this code?

> > OTOH, if we're going to have to dynamically allocate the memory for
> > the extent/inline data for the data fork, it may just be easier to
> > make the entire data fork a dynamic allocation (like the attr fork).
>
> I thought about this a bit, but it turned out that we basically
> always need the data anyway, so I don't think it's going to buy
> us much unless we shrink the inode enough so that they better fit
> into a page.

True. Keep it in mind for when we've shrunk the inode by another
100 bytes...

Cheers,

Dave.
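To make the interval-tree idea above concrete, here is a hypothetical layout for the per-page header Dave describes. None of these names exist in the tree; this is purely a sketch of the idea, not his prototype code:

	/*
	 * Hypothetical header at the start of each extent-array page:
	 * an rbtree node plus interval state, so offset-based lookups
	 * can walk an augmented rbtree instead of the indirect array.
	 */
	struct xfs_iext_page_hdr {
		struct rb_node	rb;	/* linkage in the per-fork tree */
		xfs_fileoff_t	start;	/* lowest startoff in this page */
		xfs_fileoff_t	max_end; /* subtree maximum, propagated
					    toward the root for interval
					    search */
		int		nr_recs; /* extent records in this page */
		/* the existing packed extent records fill the rest */
	};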
On Tue, Aug 29, 2017 at 11:48:49AM -0700, Kees Cook wrote:
> On Mon, Aug 28, 2017 at 9:47 PM, Darrick J. Wong
> <darrick.wong@oracle.com> wrote:
> > On Mon, Aug 28, 2017 at 02:57:14PM -0700, Kees Cook wrote:
> >> On Mon, Aug 28, 2017 at 2:49 PM, Darrick J. Wong
> >> <darrick.wong@oracle.com> wrote:
> >> > On Mon, Aug 28, 2017 at 02:34:56PM -0700, Kees Cook wrote:
> >> >> From: David Windsor <dave@nullcore.net>
> >> >>
> >> >> The XFS inline inode data, stored in struct xfs_inode_t field
> >> >> i_df.if_u2.if_inline_data and therefore contained in the xfs_inode slab
> >> >> cache, needs to be copied to/from userspace.
> >> >>
> >> >> cache object allocation:
> >> >>     fs/xfs/xfs_icache.c:
> >> >>         xfs_inode_alloc(...):
> >> >>             ...
> >> >>             ip = kmem_zone_alloc(xfs_inode_zone, KM_SLEEP);
> >> >>
> >> >>     fs/xfs/libxfs/xfs_inode_fork.c:
> >> >>         xfs_init_local_fork(...):
> >> >>             ...
> >> >>             if (mem_size <= sizeof(ifp->if_u2.if_inline_data))
> >> >>                 ifp->if_u1.if_data = ifp->if_u2.if_inline_data;
> >> >
> >> > Hmm, what happens when mem_size > sizeof(if_inline_data)?  A slab object
> >> > will be allocated for ifp->if_u1.if_data which can then be used for
> >> > readlink in the same manner as the example usage trace below.  Does
> >> > that allocated object have a need for a usercopy annotation like
> >> > the one we're adding for if_inline_data?  Or is that already covered
> >> > elsewhere?
> >>
> >> Yeah, the xfs helper kmem_alloc() is used in the other case, which
> >> ultimately boils down to a call to kmalloc(), which is entirely
> >> whitelisted by an earlier patch in the series:
> >>
> >> https://lkml.org/lkml/2017/8/28/1026
> >
> > Ah.  It would've been helpful to have the first three patches cc'd to
> > the xfs list.  So basically this series establishes the ability to set
>
> I went back and forth on that, and given all the things it touched, it
> seemed like too large a CC list. :) I can explicitly add the xfs list
> to the first three for any future versions.

If you are touching multiple filesystems, you really should cc the
entire patchset to linux-fsdevel, similar to how you sent the entire
patchset to lkml. That way the entire series will end up on a list
that almost all fs developers read. LKML is not a list you can rely
on all filesystem developers reading (or developers in any other
subsystem, for that matter)...

Cheers,

Dave.
On Tue, Aug 29, 2017 at 3:15 PM, Dave Chinner <david@fromorbit.com> wrote:
> If you are touching multiple filesystems, you really should cc the
> entire patchset to linux-fsdevel, similar to how you sent the entire
> patchset to lkml. That way the entire series will end up on a list
> that almost all fs developers read. LKML is not a list you can rely
> on all filesystem developers reading (or developers in any other
> subsystem, for that matter)...

Okay, sounds good. Thanks!

-Kees
On Wed, Aug 30, 2017 at 07:51:57AM +1000, Dave Chinner wrote:
> Right, I've looked at btrees, too, but it's more complex than just
> using an rbtree. I originally looked at using Peter Z's old
> RCU-aware btree code, but it doesn't hold data in the tree leaves.
> So that needed significant modification to make it work without a
> memory alloc per extent, and that didn't work with the original aim
> of RCU-safe extent lookups.  I also looked at that "generic" btree
> stuff that came from logfs, and after a little while ran away
> screaming.

I started with the latter, but it's not really looking like it any
more: the nodes are formatted as a series of u64s instead of all the
long magic, and the data is stored inline - in fact I use a cute
trick to keep the size down, derived from our "compressed" on disk
extent format:

Key:

+-------+----------------------------+
| 00:51 | all 52 bits of startoff    |
| 52:63 | low 12 bits of startblock  |
+-------+----------------------------+

Value:

+-------+----------------------------+
| 00:20 | all 21 bits of length      |
| 21    | unwritten extent bit       |
| 22:63 | high 42 bits of startblock |
+-------+----------------------------+

So we only need a 64-bit key and a 64-bit value by abusing parts
of the key to store bits of the startblock.

For non-leaf nodes we iterate through the keys only, never touching
the cache lines for the value.  For the leaf nodes we have to touch
the value anyway because we have to do a range lookup to find the
exact record.

This works fine so far in an isolated simulator, and now I'm amending
it to be a b+tree with pointers to the previous and next node so
that we can nicely implement our extent iterators instead of doing
full lookups.

> The sticking point, IMO, is the extent array index based lookups in
> all the bmbt code. I've been looking at converting all that to use
> offset based lookups and a cursor w/ lookup/inc/dec/insert/delete
> operations wrapping xfs_iext_lookup_ext() and friends. This means
> the modifications are pretty much identical to the on-disk extent
> btree, so they can be abstracted out into a single extent update
> interface for both trees. Have you planned/done any cleanup/changes
> with this code?

I've done various cleanups, but I've not yet consolidated the two.
Basically step one at the moment is to move everyone to
xfs_iext_lookup_extent + xfs_iext_get_extent that removes all the
bad intrusion.

Once we move to the actual b+trees the extnum_t cursor will be replaced
with a real cursor structure that contains a pointer to the current
b+tree leaf node, and an index inside that, which will allow us very
efficient iteration.  The xfs_iext_get_extent calls will be replaced
with more specific xfs_iext_prev_extent, xfs_iext_next_extent calls
that include the now slightly more complex cursor decrement, increment
as well as a new xfs_iext_last_extent helper for the last extent
that we need in a few places.

insert/delete remain very similar to what they do right now, they'll
get a different cursor type, and the manual xfs_iext_add calls will
go away.  The new xfs_iext_update_extent helper I posted to the list
yesterday will become a bit more complex, as changing the startoff
will have to be propagated up the tree.
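The layout above packs a 52-bit startoff, a 54-bit startblock (split 12 low / 42 high bits), a 21-bit length and the unwritten flag into one 64-bit key and one 64-bit value. A sketch of pack/unpack helpers for that layout; the helper names are made up, only the bit layout comes from the mail:

	#include <stdbool.h>
	#include <stdint.h>

	static inline uint64_t pack_key(uint64_t startoff, uint64_t startblock)
	{
		return (startoff & ((1ULL << 52) - 1)) |	  /* bits 00:51 */
		       ((startblock & ((1ULL << 12) - 1)) << 52); /* bits 52:63 */
	}

	static inline uint64_t pack_val(uint64_t length, bool unwritten,
					uint64_t startblock)
	{
		return (length & ((1ULL << 21) - 1)) |	/* bits 00:20 */
		       ((uint64_t)unwritten << 21) |	/* bit 21 */
		       ((startblock >> 12) << 22);	/* bits 22:63 */
	}

	static inline uint64_t unpack_startblock(uint64_t key, uint64_t val)
	{
		/* low 12 bits from the key, high 42 from the value */
		return (key >> 52) | ((val >> 22) << 12);
	}

Note that because the startblock bits sit above the startoff bits, an offset-ordered key comparison has to mask them off first, i.e. compare key & ((1ULL << 52) - 1) rather than the raw key.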
On Wed, Aug 30, 2017 at 12:14:03AM -0700, Christoph Hellwig wrote:
> On Wed, Aug 30, 2017 at 07:51:57AM +1000, Dave Chinner wrote:
> > Right, I've looked at btrees, too, but it's more complex than just
> > using an rbtree. I originally looked at using Peter Z's old
> > RCU-aware btree code, but it doesn't hold data in the tree leaves.
> > So that needed significant modification to make it work without a
> > memory alloc per extent, and that didn't work with the original aim
> > of RCU-safe extent lookups.  I also looked at that "generic" btree
> > stuff that came from logfs, and after a little while ran away
> > screaming.
>
> I started with the latter, but it's not really looking like it any
> more: the nodes are formatted as a series of u64s instead of all the
> long magic,

Yeah, that was about where I started to run away and look for
something nicer....

> and the data is stored inline - in fact I use a cute
> trick to keep the size down, derived from our "compressed" on disk
> extent format:
>
> Key:
>
> +-------+----------------------------+
> | 00:51 | all 52 bits of startoff    |
> | 52:63 | low 12 bits of startblock  |
> +-------+----------------------------+
>
> Value:
>
> +-------+----------------------------+
> | 00:20 | all 21 bits of length      |
> | 21    | unwritten extent bit       |
> | 22:63 | high 42 bits of startblock |
> +-------+----------------------------+
>
> So we only need a 64-bit key and a 64-bit value by abusing parts
> of the key to store bits of the startblock.

Neat! :)

> For non-leaf nodes we iterate through the keys only, never touching
> the cache lines for the value.  For the leaf nodes we have to touch
> the value anyway because we have to do a range lookup to find the
> exact record.
>
> This works fine so far in an isolated simulator, and now I'm amending
> it to be a b+tree with pointers to the previous and next node so
> that we can nicely implement our extent iterators instead of doing
> full lookups.

Ok, that sounds exactly like what I have been looking towards....

> > The sticking point, IMO, is the extent array index based lookups in
> > all the bmbt code. I've been looking at converting all that to use
> > offset based lookups and a cursor w/ lookup/inc/dec/insert/delete
> > operations wrapping xfs_iext_lookup_ext() and friends. This means
> > the modifications are pretty much identical to the on-disk extent
> > btree, so they can be abstracted out into a single extent update
> > interface for both trees. Have you planned/done any cleanup/changes
> > with this code?
>
> I've done various cleanups, but I've not yet consolidated the two.
> Basically step one at the moment is to move everyone to
> xfs_iext_lookup_extent + xfs_iext_get_extent that removes all the
> bad intrusion.

Yup.

> Once we move to the actual b+trees the extnum_t cursor will be replaced
> with a real cursor structure that contains a pointer to the current
> b+tree leaf node, and an index inside that, which will allow us very
> efficient iteration.  The xfs_iext_get_extent calls will be replaced
> with more specific xfs_iext_prev_extent, xfs_iext_next_extent calls
> that include the now slightly more complex cursor decrement, increment
> as well as a new xfs_iext_last_extent helper for the last extent
> that we need in a few places.

Ok, that sounds like it'll fit right in with what I've been
prototyping for the extent code in xfs_bmap.c. I can make that work
with a cursor-based lookup/inc/dec/ins/del API similar to the bmbt
API.
I've been looking to abstract the extent manipulations out into
functions that modify both trees like this:

[note: just put template code in to get my thoughts straight, it's
not working code]

+static int
+xfs_bmex_delete(
+	struct xfs_iext_cursor	*icur,
+	struct xfs_btree_cursor	*cur,
+	int			*nextents)
+{
+	int			i;
+
+	xfs_iext_remove(bma->ip, bma->idx + 1, 2, state);
+	if (nextents)
+		(*nextents)--;
+	if (!cur)
+		return 0;
+	error = xfs_btree_delete(cur, &i);
+	if (error)
+		return error;
+	XFS_WANT_CORRUPTED_RETURN(cur->bc_mp, i == 1);
+	return 0;
+}
+
+static int
+xfs_bmex_increment(
+	struct xfs_iext_cursor	*icur,
+	struct xfs_btree_cursor	*cur)
+{
+	int			i;
+
+	icur->ep = xfs_iext_get_right_ext(icur->ep);
+	if (!cur)
+		return 0;
+	error = xfs_btree_increment(cur, 0, &i);
+	if (error)
+		return error;
+	XFS_WANT_CORRUPTED_RETURN(cur->bc_mp, i == 1);
+	return 0;
+}
+
+static int
+xfs_bmex_decrement(
+	struct xfs_iext_cursor	*icur,
+	struct xfs_btree_cursor	*cur)
+{
+	int			i;
+
+	icur->ep = xfs_iext_get_left_ext(icur->ep);
+	if (!cur)
+		return 0;
+	error = xfs_btree_decrement(cur, 0, &i);
+	if (error)
+		return error;
+	XFS_WANT_CORRUPTED_RETURN(cur->bc_mp, i == 1);
+	return 0;
+}

And so what you're doing would fit straight into that. What I'm
ending up with is extent operations that look like this:

xfs_bmap_add_extent_delay_real()
.....
	case BMAP_LEFT_FILLING | BMAP_LEFT_CONTIG |
	     BMAP_RIGHT_FILLING | BMAP_RIGHT_CONTIG:
		/*
		 * Filling in all of a previously delayed allocation extent.
		 * The left and right neighbors are both contiguous with new.
		 */
+		rval |= XFS_ILOG_CORE;
+
+		/* remove the incore delalloc extent first */
+		error = xfs_bmex_delete(&icur, NULL, nextents);
+		if (error)
+			goto done;
+
+		/*
+		 * update incore and bmap extent trees
+		 * 1. set cursors to the right extent
+		 * 2. remove the right extent
+		 * 3. update the left extent to span all 3 extent ranges
+		 */
+		error = xfs_bmex_lookup_eq(&icur, bma->cur, RIGHT.br_startoff,
+				RIGHT.br_startblock, RIGHT.br_blockcount, 1);
+		if (error)
+			goto done;
+		error = xfs_bmex_delete(&icur, bma->cur, NULL);
+		if (error)
+			goto done;
+		error = xfs_bmex_decrement(&icur, bma->cur);
+		if (error)
+			goto done;
+		error = xfs_bmex_update(&icur, bma->cur, LEFT.br_startoff,
+				LEFT.br_startblock,
+				LEFT.br_blockcount + PREV.br_blockcount +
+					RIGHT.br_blockcount,
+				LEFT.br_state);
+		if (error)
+			goto done;
		break;
....

And I'm starting to see where there are common extent manipulations
being done, so there's probably a fair amount of further factoring
that can be done on top of this....

> insert/delete remain very similar to what they do right now, they'll
> get a different cursor type, and the manual xfs_iext_add calls will
> go away.  The new xfs_iext_update_extent helper I posted to the list
> yesterday will become a bit more complex, as changing the startoff
> will have to be propagated up the tree.

I've had a quick look at them and pulled them down into my tree for
testing (which had a cpu-burning hang on xfs/020 a few minutes ago),
but I'll spend more time grokking them tomorrow.

Cheers,

Dave.
On Wed, Aug 30, 2017 at 06:05:58PM +1000, Dave Chinner wrote:
> Ok, that sounds like it'll fit right in with what I've been
> prototyping for the extent code in xfs_bmap.c. I can make that work
> with a cursor-based lookup/inc/dec/ins/del API similar to the bmbt
> API. I've been looking to abstract the extent manipulations out into
> functions that modify both trees like this:
>
> [note: just put template code in to get my thoughts straight, it's
> not working code]

FYI, I've got somewhat working changes in that area (still has bugs,
but a few tests pass :)).  What I'm doing is to make sure all of the
xfs_bmap_{add,del}_extent_* routines fully operate on xfs_bmbt_irec
structures that they acquire through the xfs_bmalloca structure or
from xfs_iext_get_extent, and update using xfs_iext_update_extent.

A nice fallout from that is that we can change the prototypes for
xfs_bmbt_lookup_* and xfs_bmbt_update to take an xfs_bmbt_irec as
well, instead of taking the individual arguments.  That should help
with your next-step cleanups a bit.
diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index 4d85992d75b2..08358f38dee6 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -110,6 +110,16 @@ kmem_zone_init_flags(int size, char *zone_name, unsigned long flags,
 	return kmem_cache_create(zone_name, size, 0, flags, construct);
 }
 
+static inline kmem_zone_t *
+kmem_zone_init_flags_usercopy(int size, char *zone_name, unsigned long flags,
+		size_t useroffset, size_t usersize,
+		void (*construct)(void *))
+{
+	return kmem_cache_create_usercopy(zone_name, size, 0, flags,
+			useroffset, usersize, construct);
+}
+
+
 static inline void
 kmem_zone_free(kmem_zone_t *zone, void *ptr)
 {
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 38aaacdbb8b3..6ca428c6f943 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1829,9 +1829,12 @@ xfs_init_zones(void)
 		goto out_destroy_efd_zone;
 
 	xfs_inode_zone =
-		kmem_zone_init_flags(sizeof(xfs_inode_t), "xfs_inode",
+		kmem_zone_init_flags_usercopy(sizeof(xfs_inode_t), "xfs_inode",
 			KM_ZONE_HWALIGN | KM_ZONE_RECLAIM | KM_ZONE_SPREAD |
-			KM_ZONE_ACCOUNT, xfs_fs_inode_init_once);
+			KM_ZONE_ACCOUNT,
+			offsetof(xfs_inode_t, i_df.if_u2.if_inline_data),
+			sizeof_field(xfs_inode_t, i_df.if_u2.if_inline_data),
+			xfs_fs_inode_init_once);
 	if (!xfs_inode_zone)
 		goto out_destroy_efi_zone;