| Message ID | 1447800381-20167-1-git-send-email-octavian.purdila@intel.com (mailing list archive) |
|---|---|
| State | New, archived |
On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote: > Naive implementation for non-mmu architectures: allocate physically > contiguous xfs buffers with alloc_pages. Terribly inefficient with > memory and fragmentation on high I/O loads but it may be good enough > for basic usage (which most non-mmu architectures will need). > > This patch was tested with lklfuse [1] and basic operations seems to > work even with 16MB allocated for LKL. > > [1] https://github.com/lkl/linux > > Signed-off-by: Octavian Purdila <octavian.purdila@intel.com> > --- Interesting, though this makes me wonder why we couldn't have a new _XBF_VMEM (for example) buffer type that uses vmalloc(). I'm not familiar with mmu-less context, but I see that mm/nommu.c has a __vmalloc() interface that looks like it ultimately translates into an alloc_pages() call. Would that accomplish what this patch is currently trying to do? I ask because it seems like that would help clean up the code a bit, for one. It might also facilitate some degree of testing of the XFS bits (even if utilized sparingly in DEBUG mode if it weren't suitable enough for generic/mmu use). We currently allocate and map the buffer pages separately and I'm not sure if there's any particular reasons for doing that outside of some congestion handling in the allocation code and XBF_UNMAPPED buffers, the latter probably being irrelevant for nommu. Any other thoughts on that? > fs/xfs/xfs_buf.c | 27 +++++++++++++++++++++++++++ > 1 file changed, 27 insertions(+) > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > index 8ecffb3..50b5246 100644 > --- a/fs/xfs/xfs_buf.c > +++ b/fs/xfs/xfs_buf.c ... > @@ -816,11 +835,19 @@ xfs_buf_get_uncached( > if (error) > goto fail_free_buf; > > +#ifdef CONFIG_MMU > for (i = 0; i < page_count; i++) { > bp->b_pages[i] = alloc_page(xb_to_gfp(flags)); > if (!bp->b_pages[i]) > goto fail_free_mem; > } > +#else > + bp->b_pages[0] = alloc_pages(flags, order_base_2(page_count)); > + if (!bp->b_pages[0]) > + goto fail_free_buf; > + for (i = 1; i < page_count; i++) > + bp->b_pages[i] = bp->b_pages[i-1] + 1; > +#endif We still have a path into __free_page() further down in this function if _xfs_buf_map_pages() fails. Granted, _xfs_buf_map_pages() should probably never fail in this case, but it still looks like a land mine at the very least. Brian > bp->b_flags |= _XBF_PAGES; > > error = _xfs_buf_map_pages(bp, 0); > -- > 1.9.1 > > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
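For concreteness, here is a minimal sketch of the _XBF_VMEM idea floated above. The flag and helper names are hypothetical and do not exist in XFS; the point is only that __vmalloc() is provided by both mm/vmalloc.c and mm/nommu.c (where it is backed by alloc_pages()), so a single allocation path could serve both MMU and non-MMU builds:

```c
/*
 * Hypothetical sketch only: _XBF_VMEM and these helpers are not real XFS
 * code.  __vmalloc() exists on both MMU and !MMU kernels, so one path
 * could cover both configurations.
 */
static int
xfs_buf_alloc_vmem(
	struct xfs_buf		*bp,
	xfs_buf_flags_t		flags)
{
	size_t			size = BBTOB(bp->b_length);

	bp->b_addr = __vmalloc(size, xb_to_gfp(flags), PAGE_KERNEL);
	if (!bp->b_addr)
		return -ENOMEM;

	bp->b_flags |= _XBF_VMEM;		/* hypothetical flag */
	return 0;
}

static void
xfs_buf_free_vmem(
	struct xfs_buf		*bp)
{
	vfree(bp->b_addr);
	bp->b_addr = NULL;
}
```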
On Thu, Nov 19, 2015 at 5:55 PM, Brian Foster <bfoster@redhat.com> wrote: > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote: >> Naive implementation for non-mmu architectures: allocate physically >> contiguous xfs buffers with alloc_pages. Terribly inefficient with >> memory and fragmentation on high I/O loads but it may be good enough >> for basic usage (which most non-mmu architectures will need). >> >> This patch was tested with lklfuse [1] and basic operations seems to >> work even with 16MB allocated for LKL. >> >> [1] https://github.com/lkl/linux >> >> Signed-off-by: Octavian Purdila <octavian.purdila@intel.com> >> --- > > Interesting, though this makes me wonder why we couldn't have a new > _XBF_VMEM (for example) buffer type that uses vmalloc(). I'm not > familiar with mmu-less context, but I see that mm/nommu.c has a > __vmalloc() interface that looks like it ultimately translates into an > alloc_pages() call. Would that accomplish what this patch is currently > trying to do? > Hi Brian, Ah, that sounds nice! We could get rid of the #ifdefs and use a common path in that case. vmalloc should work on non-mmu and it seems that there is a vmalloc_to_page() we can use to get the physical pages. I'll give it a try. Is there a central place where we could enable the new _XBF_VMEM to be the default for non-mmu arches? > I ask because it seems like that would help clean up the code a bit, for > one. It might also facilitate some degree of testing of the XFS bits > (even if utilized sparingly in DEBUG mode if it weren't suitable enough > for generic/mmu use). We currently allocate and map the buffer pages > separately and I'm not sure if there's any particular reasons for doing > that outside of some congestion handling in the allocation code and > XBF_UNMAPPED buffers, the latter probably being irrelevant for nommu. > Any other thoughts on that? > >> fs/xfs/xfs_buf.c | 27 +++++++++++++++++++++++++++ >> 1 file changed, 27 insertions(+) >> >> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c >> index 8ecffb3..50b5246 100644 >> --- a/fs/xfs/xfs_buf.c >> +++ b/fs/xfs/xfs_buf.c > ... >> @@ -816,11 +835,19 @@ xfs_buf_get_uncached( >> if (error) >> goto fail_free_buf; >> >> +#ifdef CONFIG_MMU >> for (i = 0; i < page_count; i++) { >> bp->b_pages[i] = alloc_page(xb_to_gfp(flags)); >> if (!bp->b_pages[i]) >> goto fail_free_mem; >> } >> +#else >> + bp->b_pages[0] = alloc_pages(flags, order_base_2(page_count)); >> + if (!bp->b_pages[0]) >> + goto fail_free_buf; >> + for (i = 1; i < page_count; i++) >> + bp->b_pages[i] = bp->b_pages[i-1] + 1; >> +#endif > > We still have a path into __free_page() further down in this function if > _xfs_buf_map_pages() fails. Granted, _xfs_buf_map_pages() should > probably never fail in this case, but it still looks like a land mine at > the very least. > OK. Adding a i = 1; right before the #endif should take care of that, if the vmalloc approach does not work. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
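A rough sketch of the approach Octavian describes here, assuming the page array still has to be populated for the existing I/O path; the helper name and error handling are illustrative only, and bp->b_pages[] is assumed to have already been sized for page_count:

```c
/*
 * Illustrative only: back the buffer with vmalloc() and recover the page
 * array with vmalloc_to_page() so code that walks bp->b_pages keeps
 * working.  vmalloc_to_page() is available in both mm/vmalloc.c and
 * mm/nommu.c.
 */
static int
xfs_buf_alloc_vmalloc_pages(
	struct xfs_buf		*bp,
	int			page_count)
{
	int			i;

	bp->b_addr = vmalloc(page_count << PAGE_SHIFT);
	if (!bp->b_addr)
		return -ENOMEM;

	for (i = 0; i < page_count; i++)
		bp->b_pages[i] = vmalloc_to_page(bp->b_addr + (i << PAGE_SHIFT));

	bp->b_page_count = page_count;
	bp->b_flags |= _XBF_PAGES;
	return 0;
}
```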
On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote: > Naive implementation for non-mmu architectures: allocate physically > contiguous xfs buffers with alloc_pages. Terribly inefficient with > memory and fragmentation on high I/O loads but it may be good enough > for basic usage (which most non-mmu architectures will need). Can you please explain why you want to use XFS on low end, basic non-MMU devices? XFS is a high performance, enterprise/HPC level filesystem - it's not a filesystem designed for small IoT level devices - so I'm struggling to see why we'd want to expend any effort to make XFS work on such devices.... Cheers, Dave.
On Thu, Nov 19, 2015 at 10:55:25AM -0500, Brian Foster wrote: > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote: > > Naive implementation for non-mmu architectures: allocate physically > > contiguous xfs buffers with alloc_pages. Terribly inefficient with > > memory and fragmentation on high I/O loads but it may be good enough > > for basic usage (which most non-mmu architectures will need). > > > > This patch was tested with lklfuse [1] and basic operations seems to > > work even with 16MB allocated for LKL. > > > > [1] https://github.com/lkl/linux > > > > Signed-off-by: Octavian Purdila <octavian.purdila@intel.com> > > --- > > Interesting, though this makes me wonder why we couldn't have a new > _XBF_VMEM (for example) buffer type that uses vmalloc(). I'm not > familiar with mmu-less context, but I see that mm/nommu.c has a > __vmalloc() interface that looks like it ultimately translates into an > alloc_pages() call. Would that accomplish what this patch is currently > trying to do? vmalloc is always a last resort. vmalloc space on 32 bit systems is extremely limited and it is easy to exhaust with XFS. Also, vmalloc limits the control we have over allocation context (e.g. the hoops we jump through in kmem_alloc_large() to maintain GFP_NOFS contexts), so just using vmalloc doesn't make things much simpler from an XFS perspective. > I ask because it seems like that would help clean up the code a bit, for > one. It might also facilitate some degree of testing of the XFS bits > (even if utilized sparingly in DEBUG mode if it weren't suitable enough > for generic/mmu use). We currently allocate and map the buffer pages > separately and I'm not sure if there's any particular reasons for doing > that outside of some congestion handling in the allocation code and > XBF_UNMAPPED buffers, the latter probably being irrelevant for nommu. > Any other thoughts on that? We could probably clean the code up more (the allocation logic is now largely a historic relic) but I'm not convinced yet that we should be spending any time trying to specifically support mmu-less hardware. Cheers, Dave.
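For readers unfamiliar with the "hoops" Dave mentions: the vmalloc fallback in fs/xfs/kmem.c has to mark the task with PF_MEMALLOC_NOIO because __vmalloc() allocates its page tables with GFP_KERNEL internally, regardless of the gfp mask passed in. A simplified paraphrase follows (the helper name is invented and the real kmem_zalloc_large() has more to it):

```c
/*
 * Simplified paraphrase of the GFP_NOFS workaround in kmem_zalloc_large();
 * helper name invented, details differ from the real code.
 */
static void *
xfs_vmalloc_nofs(size_t size, gfp_t lflags)
{
	unsigned int	noio_flag;
	void		*ptr;

	/*
	 * Tell reclaim not to recurse into the filesystem while __vmalloc()
	 * builds its page tables with GFP_KERNEL under the hood.
	 */
	noio_flag = memalloc_noio_save();
	ptr = __vmalloc(size, lflags | __GFP_ZERO, PAGE_KERNEL);
	memalloc_noio_restore(noio_flag);

	return ptr;
}
```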
On Fri, Nov 20, 2015 at 12:24 AM, Dave Chinner <david@fromorbit.com> wrote: > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote: >> Naive implementation for non-mmu architectures: allocate physically >> contiguous xfs buffers with alloc_pages. Terribly inefficient with >> memory and fragmentation on high I/O loads but it may be good enough >> for basic usage (which most non-mmu architectures will need). > > Can you please explain why you want to use XFS on low end, basic > non-MMU devices? XFS is a high performance, enterprise/HPC level > filesystem - it's not a filesystem designed for small IoT level > devices - so I'm struggling to see why we'd want to expend any > effort to make XFS work on such devices.... The use case is the Linux Kernel Library: https://lkml.org/lkml/2015/11/3/706 Using LKL and fuse you can mount any kernel filesystem using fuse as non-root.
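For context, this is roughly how lklfuse is used; the option syntax below is an assumption based on the LKL tree of this era and may differ between versions:

```sh
# Mount an XFS image as an unprivileged user via lklfuse (option names are
# assumptions, not verified against a specific LKL release):
lklfuse -o type=xfs xfs-disk.img /tmp/mnt
ls /tmp/mnt
fusermount -u /tmp/mnt    # unmount when done
```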
On Fri, Nov 20, 2015 at 12:54:02AM +0100, Richard Weinberger wrote: > On Fri, Nov 20, 2015 at 12:24 AM, Dave Chinner <david@fromorbit.com> wrote: > > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote: > >> Naive implementation for non-mmu architectures: allocate physically > >> contiguous xfs buffers with alloc_pages. Terribly inefficient with > >> memory and fragmentation on high I/O loads but it may be good enough > >> for basic usage (which most non-mmu architectures will need). > > > > Can you please explain why you want to use XFS on low end, basic > > non-MMU devices? XFS is a high performance, enterprise/HPC level > > filesystem - it's not a filesystem designed for small IoT level > > devices - so I'm struggling to see why we'd want to expend any > > effort to make XFS work on such devices.... > > The use case is the Linux Kernel Library: > https://lkml.org/lkml/2015/11/3/706 > > Using LKL and fuse you can mount any kernel filesystem using fuse > as non-root. IOWs, because we said no to unprivileged mounts, instead the proposal is to linking all the kernel code into userspace so you can do unprivielged mounts that way? IOWs, you get to say "it secure because it's in userspace" and leave us filesystem people with all the shit that comes with allowing users to mount random untrusted filesystem images using code that was never designed to allow that to happen? Fmeh. Not a convincing argument by far. Cheers, Dave.
On Fri, Nov 20, 2015 at 1:24 AM, Dave Chinner <david@fromorbit.com> wrote: > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote: >> Naive implementation for non-mmu architectures: allocate physically >> contiguous xfs buffers with alloc_pages. Terribly inefficient with >> memory and fragmentation on high I/O loads but it may be good enough >> for basic usage (which most non-mmu architectures will need). > > Can you please explain why you want to use XFS on low end, basic > non-MMU devices? XFS is a high performance, enterprise/HPC level > filesystem - it's not a filesystem designed for small IoT level > devices - so I'm struggling to see why we'd want to expend any > effort to make XFS work on such devices.... > Hi David, Yes XFS as the main fs on this type of devices does not make sense, but does it hurt to be able to perform basic operation on XFS from these devices? Perhaps accessing an external medium formatted with XFS? Another example is accessing VM images that are formatted with XFS. Currently we can do that with tools like libguestfs that use a VM in the background. I am working on a lighter solution for that where we compile the Linux kernel as a library [1]. This allows access to the filesystem without the need to use a full VM. And a final example is linking the bootloader code with LKL to access the filesystem. This has a hard requirement on non-mmu. So, IMHO there are usecases for using XFS on non-mmu architectures. I think it all boils down to: is the patch simple enough to not put an unreasonable maintenance burden on developers? [1] https://github.com/lkl/linux Thanks, Tavi -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Nov 20, 2015 at 1:35 AM, Dave Chinner <david@fromorbit.com> wrote: > On Thu, Nov 19, 2015 at 10:55:25AM -0500, Brian Foster wrote: >> On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote: >> > Naive implementation for non-mmu architectures: allocate physically >> > contiguous xfs buffers with alloc_pages. Terribly inefficient with >> > memory and fragmentation on high I/O loads but it may be good enough >> > for basic usage (which most non-mmu architectures will need). >> > >> > This patch was tested with lklfuse [1] and basic operations seems to >> > work even with 16MB allocated for LKL. >> > >> > [1] https://github.com/lkl/linux >> > >> > Signed-off-by: Octavian Purdila <octavian.purdila@intel.com> >> > --- >> >> Interesting, though this makes me wonder why we couldn't have a new >> _XBF_VMEM (for example) buffer type that uses vmalloc(). I'm not >> familiar with mmu-less context, but I see that mm/nommu.c has a >> __vmalloc() interface that looks like it ultimately translates into an >> alloc_pages() call. Would that accomplish what this patch is currently >> trying to do? > > vmalloc is always a last resort. vmalloc space on 32 bit systems is > extremely limited and it is easy to exhaust with XFS. Doesn't vm_map_ram use the same space as vmalloc? > > Also, vmalloc limits the control we have over allocation context > (e.g. the hoops we jump through in kmem_alloc_large() to maintain > GFP_NOFS contexts), so just using vmalloc doesn't make things much > simpler from an XFS perspective. > I have zero experience with XFS, sorry if I ask obvious questions. AFAICS there are no memalloc_noio_save() hoops in the page allocation part, just in the vm_map_ram part. Could we preserve the memalloc_noio_save() part and use vmalloc instead of vm_map_ram? >> I ask because it seems like that would help clean up the code a bit, for >> one. It might also facilitate some degree of testing of the XFS bits >> (even if utilized sparingly in DEBUG mode if it weren't suitable enough >> for generic/mmu use). We currently allocate and map the buffer pages >> separately and I'm not sure if there's any particular reasons for doing >> that outside of some congestion handling in the allocation code and >> XBF_UNMAPPED buffers, the latter probably being irrelevant for nommu. >> Any other thoughts on that? > > We could probably clean the code up more (the allocation logic > is now largely a historic relic) but I'm not convinced yet that we > should be spending any time trying to specifically support mmu-less > hardware. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
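For reference, the "vm_map_ram part" being discussed is the mapping step in _xfs_buf_map_pages(). Condensed and paraphrased (not verbatim, and wrapped in an invented function name), it looks roughly like this, which is why preserving the memalloc_noio_save() bracket around a vmalloc-style call seems plausible:

```c
/*
 * Condensed paraphrase of the mapping step in _xfs_buf_map_pages(); see
 * fs/xfs/xfs_buf.c for the real thing.
 */
static int
xfs_buf_map_sketch(struct xfs_buf *bp)
{
	unsigned int	noio_flag;
	int		retried = 0;

	/* vm_map_ram() allocates page tables with GFP_KERNEL internally */
	noio_flag = memalloc_noio_save();
	do {
		bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
					-1, PAGE_KERNEL);
		if (bp->b_addr)
			break;
		vm_unmap_aliases();
	} while (retried++ <= 1);
	memalloc_noio_restore(noio_flag);

	return bp->b_addr ? 0 : -ENOMEM;
}
```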
On Fri, Nov 20, 2015 at 2:58 AM, Dave Chinner <david@fromorbit.com> wrote: > On Fri, Nov 20, 2015 at 12:54:02AM +0100, Richard Weinberger wrote: >> On Fri, Nov 20, 2015 at 12:24 AM, Dave Chinner <david@fromorbit.com> wrote: >> > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote: >> >> Naive implementation for non-mmu architectures: allocate physically >> >> contiguous xfs buffers with alloc_pages. Terribly inefficient with >> >> memory and fragmentation on high I/O loads but it may be good enough >> >> for basic usage (which most non-mmu architectures will need). >> > >> > Can you please explain why you want to use XFS on low end, basic >> > non-MMU devices? XFS is a high performance, enterprise/HPC level >> > filesystem - it's not a filesystem designed for small IoT level >> > devices - so I'm struggling to see why we'd want to expend any >> > effort to make XFS work on such devices.... >> >> The use case is the Linux Kernel Library: >> https://lkml.org/lkml/2015/11/3/706 >> >> Using LKL and fuse you can mount any kernel filesystem using fuse >> as non-root. > > IOWs, because we said no to unprivileged mounts, instead the > proposal is to linking all the kernel code into userspace so you can > do unprivielged mounts that way? > LKL's goal is to make it easy for various applications to reuse Linux kernel code instead of re-implementing it. Mounting filesystem images is just one of the applications. > IOWs, you get to say "it secure because it's in userspace" and leave > us filesystem people with all the shit that comes with allowing > users to mount random untrusted filesystem images using code that > was never designed to allow that to happen? > It is already possible to mount arbitrary filesystem images in userspace using VMs . LKL doesn't change that, it just reduces the amount of dependencies you need to do so. Could you expand of what burden does this use-case put on fs developers? I am sure that, if needed, we can put restrictions in LKL to avoid that. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Nov 19, 2015 at 10:54:34PM +0200, Octavian Purdila wrote: > On Thu, Nov 19, 2015 at 5:55 PM, Brian Foster <bfoster@redhat.com> wrote: > > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote: > >> Naive implementation for non-mmu architectures: allocate physically > >> contiguous xfs buffers with alloc_pages. Terribly inefficient with > >> memory and fragmentation on high I/O loads but it may be good enough > >> for basic usage (which most non-mmu architectures will need). > >> > >> This patch was tested with lklfuse [1] and basic operations seems to > >> work even with 16MB allocated for LKL. > >> > >> [1] https://github.com/lkl/linux > >> > >> Signed-off-by: Octavian Purdila <octavian.purdila@intel.com> > >> --- > > > > Interesting, though this makes me wonder why we couldn't have a new > > _XBF_VMEM (for example) buffer type that uses vmalloc(). I'm not > > familiar with mmu-less context, but I see that mm/nommu.c has a > > __vmalloc() interface that looks like it ultimately translates into an > > alloc_pages() call. Would that accomplish what this patch is currently > > trying to do? > > > > Hi Brian, > > Ah, that sounds nice! We could get rid of the #ifdefs and use a common > path in that case. vmalloc should work on non-mmu and it seems that > there is a vmalloc_to_page() we can use to get the physical pages. > I'll give it a try. > > Is there a central place where we could enable the new _XBF_VMEM to be > the default for non-mmu arches? > I think a generic mechanism is the appropriate short term goal as opposed to a common codepath. IOWs, add the ability to support a new XBF_VMEM buffer type and enforce use of that type when !CONFIG_MMU. So you'd still have an ifdef somewhere, but it wouldn't be shoehorned into the current separate alloc/map implementation. I don't think we want to change the existing buffer mechanism straight away to use this based on some of the things Dave calls out. My hope is more that we can do the following for the generic (CONFIG_MMU) case: 1.) Enable this selectively/randomly when CONFIG_XFS_DEBUG is enabled to provide some test coverage of the XFS buffer type without the need for a nommu setup (not a replacement for nommu testing, of course). 2.) Over time, move the current buffer mechanism over to this type if we can eventually clean up some of the warts that pin us to our current implementation. I'd suggest to let some of the parallel discussions play out before going too far down this path, however... Brian > > I ask because it seems like that would help clean up the code a bit, for > > one. It might also facilitate some degree of testing of the XFS bits > > (even if utilized sparingly in DEBUG mode if it weren't suitable enough > > for generic/mmu use). We currently allocate and map the buffer pages > > separately and I'm not sure if there's any particular reasons for doing > > that outside of some congestion handling in the allocation code and > > XBF_UNMAPPED buffers, the latter probably being irrelevant for nommu. > > Any other thoughts on that? > > > >> fs/xfs/xfs_buf.c | 27 +++++++++++++++++++++++++++ > >> 1 file changed, 27 insertions(+) > >> > >> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > >> index 8ecffb3..50b5246 100644 > >> --- a/fs/xfs/xfs_buf.c > >> +++ b/fs/xfs/xfs_buf.c > > ... 
> >> @@ -816,11 +835,19 @@ xfs_buf_get_uncached( > >> if (error) > >> goto fail_free_buf; > >> > >> +#ifdef CONFIG_MMU > >> for (i = 0; i < page_count; i++) { > >> bp->b_pages[i] = alloc_page(xb_to_gfp(flags)); > >> if (!bp->b_pages[i]) > >> goto fail_free_mem; > >> } > >> +#else > >> + bp->b_pages[0] = alloc_pages(flags, order_base_2(page_count)); > >> + if (!bp->b_pages[0]) > >> + goto fail_free_buf; > >> + for (i = 1; i < page_count; i++) > >> + bp->b_pages[i] = bp->b_pages[i-1] + 1; > >> +#endif > > > > We still have a path into __free_page() further down in this function if > > _xfs_buf_map_pages() fails. Granted, _xfs_buf_map_pages() should > > probably never fail in this case, but it still looks like a land mine at > > the very least. > > > > OK. Adding a i = 1; right before the #endif should take care of that, > if the vmalloc approach does not work. > > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
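A sketch of the one-place switch Brian describes above; _XBF_VMEM and the helper are hypothetical, and the DEBUG branch only illustrates the "selectively/randomly" testing idea:

```c
/*
 * Hypothetical policy helper: decide whether a buffer should use the
 * (equally hypothetical) _XBF_VMEM allocation path.  Keeps the nommu
 * decision in one place instead of spreading #ifdefs through the
 * allocation and mapping code.
 */
static bool
xfs_buf_use_vmem(
	struct xfs_buf		*bp)
{
#ifndef CONFIG_MMU
	return true;			/* only option without an MMU */
#elif defined(DEBUG)
	return prandom_u32() % 4 == 0;	/* exercise the path occasionally */
#else
	return false;
#endif
}
```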
On Fri, Nov 20, 2015 at 10:35:47AM +1100, Dave Chinner wrote: > On Thu, Nov 19, 2015 at 10:55:25AM -0500, Brian Foster wrote: > > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote: > > > Naive implementation for non-mmu architectures: allocate physically > > > contiguous xfs buffers with alloc_pages. Terribly inefficient with > > > memory and fragmentation on high I/O loads but it may be good enough > > > for basic usage (which most non-mmu architectures will need). > > > > > > This patch was tested with lklfuse [1] and basic operations seems to > > > work even with 16MB allocated for LKL. > > > > > > [1] https://github.com/lkl/linux > > > > > > Signed-off-by: Octavian Purdila <octavian.purdila@intel.com> > > > --- > > > > Interesting, though this makes me wonder why we couldn't have a new > > _XBF_VMEM (for example) buffer type that uses vmalloc(). I'm not > > familiar with mmu-less context, but I see that mm/nommu.c has a > > __vmalloc() interface that looks like it ultimately translates into an > > alloc_pages() call. Would that accomplish what this patch is currently > > trying to do? > > vmalloc is always a last resort. vmalloc space on 32 bit systems is > extremely limited and it is easy to exhaust with XFS. > Sure, but my impression is that a vmalloc() buffer is roughly equivalent in this regard to a current !XBF_UNMAPPED && size > PAGE_SIZE buffer. We just do the allocation and mapping separately (presumably for other reasons). > Also, vmalloc limits the control we have over allocation context > (e.g. the hoops we jump through in kmem_alloc_large() to maintain > GFP_NOFS contexts), so just using vmalloc doesn't make things much > simpler from an XFS perspective. > The comment in kmem_zalloc_large() calls out some apparent hardcoded allocation flags down in the depths of vmalloc(). It looks to me that page allocation (__vmalloc_area_node()) actually uses the provided flags, so I'm not following the "data page" part of that comment. Indeed, I do see that this is not the case down in calls like pmd_alloc_one(), pte_alloc_one_kernel(), etc., associated with page table management. Those latter calls are all from following down through the map_vm_area()->vmap_page_range() codepath from __vmalloc_area_node(). We call vm_map_ram() directly from _xfs_buf_map_pages(), which itself calls down into the same code. Indeed, we already protect ourselves here via the same memalloc_noio_save() mechanism that kmem_zalloc_large() uses. I suspect there's more to it than that because it does look like vm_map_ram() has a different mechanism for managing vmalloc space for certain (smaller) allocations, either of which I'm not really familiar with. That aside, I don't see how vmalloc() introduces any new allocation context issues for those buffers where we already set up a multi-page mapping. We still have the somewhat customized page allocation code in xfs_buf_allocate_memory() to contend with. I actually think it would be useful to have a DEBUG sysfs tunable to turn on vmalloc() buffers and actually test how effective some of this code is. > > I ask because it seems like that would help clean up the code a bit, for > > one. It might also facilitate some degree of testing of the XFS bits > > (even if utilized sparingly in DEBUG mode if it weren't suitable enough > > for generic/mmu use). 
We currently allocate and map the buffer pages > > separately and I'm not sure if there's any particular reasons for doing > > that outside of some congestion handling in the allocation code and > > XBF_UNMAPPED buffers, the latter probably being irrelevant for nommu. > > Any other thoughts on that? > > We could probably clean the code up more (the allocation logic > is now largely a historic relic) but I'm not convinced yet that we > should be spending any time trying to specifically support mmu-less > hardware. > Fair point, we'll see where the use case discussion goes. That said, I was a little surprised that this is all that was required to enable nommu support. If that is indeed the case and we aren't in for a series of subsequent nommu specific changes (Octavian?) by letting this through, what's the big deal? This seems fairly harmless to me as is, particularly if it can be semi-tested via DEBUG mode and has potential generic use down the road. Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
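One low-effort way to get the DEBUG-only knob Brian suggests would be a module parameter rather than a dedicated sysfs attribute (it still shows up under /sys/module/xfs/parameters/). This is purely illustrative; no such parameter exists in XFS:

```c
#ifdef DEBUG
/*
 * Illustrative only: a DEBUG-build switch to force vmalloc-backed buffers
 * so the alternate allocation path gets test coverage on MMU systems.
 */
static bool xfs_buf_force_vmem;
module_param_named(force_vmem_buffers, xfs_buf_force_vmem, bool, 0644);
MODULE_PARM_DESC(force_vmem_buffers,
		 "Force vmalloc-backed xfs_buf allocations (testing only)");
#endif
```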
On Fri, Nov 20, 2015 at 04:26:28PM +0200, Octavian Purdila wrote: > On Fri, Nov 20, 2015 at 2:58 AM, Dave Chinner <david@fromorbit.com> wrote: > > On Fri, Nov 20, 2015 at 12:54:02AM +0100, Richard Weinberger wrote: > >> On Fri, Nov 20, 2015 at 12:24 AM, Dave Chinner <david@fromorbit.com> wrote: > >> > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote: > >> >> Naive implementation for non-mmu architectures: allocate physically > >> >> contiguous xfs buffers with alloc_pages. Terribly inefficient with > >> >> memory and fragmentation on high I/O loads but it may be good enough > >> >> for basic usage (which most non-mmu architectures will need). > >> > > >> > Can you please explain why you want to use XFS on low end, basic > >> > non-MMU devices? XFS is a high performance, enterprise/HPC level > >> > filesystem - it's not a filesystem designed for small IoT level > >> > devices - so I'm struggling to see why we'd want to expend any > >> > effort to make XFS work on such devices.... > >> > >> The use case is the Linux Kernel Library: > >> https://lkml.org/lkml/2015/11/3/706 > >> > >> Using LKL and fuse you can mount any kernel filesystem using fuse > >> as non-root. > > > > IOWs, because we said no to unprivileged mounts, instead the > > proposal is to linking all the kernel code into userspace so you can > > do unprivielged mounts that way? > > > > LKL's goal is to make it easy for various applications to reuse Linux > kernel code instead of re-implementing it. Mounting filesystem images > is just one of the applications. > > > IOWs, you get to say "it secure because it's in userspace" and leave > > us filesystem people with all the shit that comes with allowing > > users to mount random untrusted filesystem images using code that > > was never designed to allow that to happen? > > > > It is already possible to mount arbitrary filesystem images in > userspace using VMs . LKL doesn't change that, it just reduces the > amount of dependencies you need to do so. > Perhaps a dumb question, but I'm not quite putting 2+2 together here. When I see nommu, I'm generally thinking hardware characteristics, but we're talking about a userspace kernel library here. So can you elaborate on how this relates to nommu? Does this library emulate kernel mechanisms in userspace via nommu mode or something of that nature? Thanks. Brian > Could you expand of what burden does this use-case put on fs > developers? I am sure that, if needed, we can put restrictions in LKL > to avoid that. > > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Nov 20, 2015 at 5:24 PM, Brian Foster <bfoster@redhat.com> wrote: > On Fri, Nov 20, 2015 at 04:26:28PM +0200, Octavian Purdila wrote: >> On Fri, Nov 20, 2015 at 2:58 AM, Dave Chinner <david@fromorbit.com> wrote: >> > On Fri, Nov 20, 2015 at 12:54:02AM +0100, Richard Weinberger wrote: >> >> On Fri, Nov 20, 2015 at 12:24 AM, Dave Chinner <david@fromorbit.com> wrote: >> >> > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote: >> >> >> Naive implementation for non-mmu architectures: allocate physically >> >> >> contiguous xfs buffers with alloc_pages. Terribly inefficient with >> >> >> memory and fragmentation on high I/O loads but it may be good enough >> >> >> for basic usage (which most non-mmu architectures will need). >> >> > >> >> > Can you please explain why you want to use XFS on low end, basic >> >> > non-MMU devices? XFS is a high performance, enterprise/HPC level >> >> > filesystem - it's not a filesystem designed for small IoT level >> >> > devices - so I'm struggling to see why we'd want to expend any >> >> > effort to make XFS work on such devices.... >> >> >> >> The use case is the Linux Kernel Library: >> >> https://lkml.org/lkml/2015/11/3/706 >> >> >> >> Using LKL and fuse you can mount any kernel filesystem using fuse >> >> as non-root. >> > >> > IOWs, because we said no to unprivileged mounts, instead the >> > proposal is to linking all the kernel code into userspace so you can >> > do unprivielged mounts that way? >> > >> >> LKL's goal is to make it easy for various applications to reuse Linux >> kernel code instead of re-implementing it. Mounting filesystem images >> is just one of the applications. >> >> > IOWs, you get to say "it secure because it's in userspace" and leave >> > us filesystem people with all the shit that comes with allowing >> > users to mount random untrusted filesystem images using code that >> > was never designed to allow that to happen? >> > >> >> It is already possible to mount arbitrary filesystem images in >> userspace using VMs . LKL doesn't change that, it just reduces the >> amount of dependencies you need to do so. >> > > Perhaps a dumb question, but I'm not quite putting 2+2 together here. > When I see nommu, I'm generally thinking hardware characteristics, but > we're talking about a userspace kernel library here. So can you > elaborate on how this relates to nommu? Does this library emulate kernel > mechanisms in userspace via nommu mode or something of that nature? > LKL is currently implemented as a virtual non-mmu architecture. That makes it simpler and it will also allow us to support environments where it is not possible to emulate paging (e.g. bootloaders). -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Nov 20, 2015 at 5:11 PM, Brian Foster <bfoster@redhat.com> wrote: > On Fri, Nov 20, 2015 at 10:35:47AM +1100, Dave Chinner wrote: >> On Thu, Nov 19, 2015 at 10:55:25AM -0500, Brian Foster wrote: >> > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote: >> > > Naive implementation for non-mmu architectures: allocate physically >> > > contiguous xfs buffers with alloc_pages. Terribly inefficient with >> > > memory and fragmentation on high I/O loads but it may be good enough >> > > for basic usage (which most non-mmu architectures will need). >> > > >> > > This patch was tested with lklfuse [1] and basic operations seems to >> > > work even with 16MB allocated for LKL. >> > > >> > > [1] https://github.com/lkl/linux >> > > >> > > Signed-off-by: Octavian Purdila <octavian.purdila@intel.com> >> > > --- >> > >> > Interesting, though this makes me wonder why we couldn't have a new >> > _XBF_VMEM (for example) buffer type that uses vmalloc(). I'm not >> > familiar with mmu-less context, but I see that mm/nommu.c has a >> > __vmalloc() interface that looks like it ultimately translates into an >> > alloc_pages() call. Would that accomplish what this patch is currently >> > trying to do? >> >> vmalloc is always a last resort. vmalloc space on 32 bit systems is >> extremely limited and it is easy to exhaust with XFS. >> > > Sure, but my impression is that a vmalloc() buffer is roughly equivalent > in this regard to a current !XBF_UNMAPPED && size > PAGE_SIZE buffer. We > just do the allocation and mapping separately (presumably for other > reasons). > >> Also, vmalloc limits the control we have over allocation context >> (e.g. the hoops we jump through in kmem_alloc_large() to maintain >> GFP_NOFS contexts), so just using vmalloc doesn't make things much >> simpler from an XFS perspective. >> > > The comment in kmem_zalloc_large() calls out some apparent hardcoded > allocation flags down in the depths of vmalloc(). It looks to me that > page allocation (__vmalloc_area_node()) actually uses the provided > flags, so I'm not following the "data page" part of that comment. > Indeed, I do see that this is not the case down in calls like > pmd_alloc_one(), pte_alloc_one_kernel(), etc., associated with page > table management. > > Those latter calls are all from following down through the > map_vm_area()->vmap_page_range() codepath from __vmalloc_area_node(). We > call vm_map_ram() directly from _xfs_buf_map_pages(), which itself calls > down into the same code. Indeed, we already protect ourselves here via > the same memalloc_noio_save() mechanism that kmem_zalloc_large() uses. > > I suspect there's more to it than that because it does look like > vm_map_ram() has a different mechanism for managing vmalloc space for > certain (smaller) allocations, either of which I'm not really familiar > with. That aside, I don't see how vmalloc() introduces any new > allocation context issues for those buffers where we already set up a > multi-page mapping. > > We still have the somewhat customized page allocation code in > xfs_buf_allocate_memory() to contend with. I actually think it would be > useful to have a DEBUG sysfs tunable to turn on vmalloc() buffers and > actually test how effective some of this code is. > >> > I ask because it seems like that would help clean up the code a bit, for >> > one. It might also facilitate some degree of testing of the XFS bits >> > (even if utilized sparingly in DEBUG mode if it weren't suitable enough >> > for generic/mmu use). 
We currently allocate and map the buffer pages >> > separately and I'm not sure if there's any particular reasons for doing >> > that outside of some congestion handling in the allocation code and >> > XBF_UNMAPPED buffers, the latter probably being irrelevant for nommu. >> > Any other thoughts on that? >> >> We could probably clean the code up more (the allocation logic >> is now largely a historic relic) but I'm not convinced yet that we >> should be spending any time trying to specifically support mmu-less >> hardware. >> > > Fair point, we'll see where the use case discussion goes. That said, I > was a little surprised that this is all that was required to enable > nommu support. If that is indeed the case and we aren't in for a series > of subsequent nommu specific changes (Octavian?) by letting this > through, what's the big deal? This seems fairly harmless to me as is, > particularly if it can be semi-tested via DEBUG mode and has potential > generic use down the road. > I don't foresee additional patches. I was able to use lklfuse to mount an XFS image and perform basic operations. Are there any xfs specific tests coverage tools I can use to make sure I am not missing anything? -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Nov 20, 2015 at 05:35:55PM +0200, Octavian Purdila wrote: > On Fri, Nov 20, 2015 at 5:11 PM, Brian Foster <bfoster@redhat.com> wrote: > > On Fri, Nov 20, 2015 at 10:35:47AM +1100, Dave Chinner wrote: > >> On Thu, Nov 19, 2015 at 10:55:25AM -0500, Brian Foster wrote: > >> > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote: > >> > > Naive implementation for non-mmu architectures: allocate physically > >> > > contiguous xfs buffers with alloc_pages. Terribly inefficient with > >> > > memory and fragmentation on high I/O loads but it may be good enough > >> > > for basic usage (which most non-mmu architectures will need). > >> > > > >> > > This patch was tested with lklfuse [1] and basic operations seems to > >> > > work even with 16MB allocated for LKL. > >> > > > >> > > [1] https://github.com/lkl/linux > >> > > > >> > > Signed-off-by: Octavian Purdila <octavian.purdila@intel.com> > >> > > --- > >> > > >> > Interesting, though this makes me wonder why we couldn't have a new > >> > _XBF_VMEM (for example) buffer type that uses vmalloc(). I'm not > >> > familiar with mmu-less context, but I see that mm/nommu.c has a > >> > __vmalloc() interface that looks like it ultimately translates into an > >> > alloc_pages() call. Would that accomplish what this patch is currently > >> > trying to do? > >> > >> vmalloc is always a last resort. vmalloc space on 32 bit systems is > >> extremely limited and it is easy to exhaust with XFS. > >> > > > > Sure, but my impression is that a vmalloc() buffer is roughly equivalent > > in this regard to a current !XBF_UNMAPPED && size > PAGE_SIZE buffer. We > > just do the allocation and mapping separately (presumably for other > > reasons). > > > >> Also, vmalloc limits the control we have over allocation context > >> (e.g. the hoops we jump through in kmem_alloc_large() to maintain > >> GFP_NOFS contexts), so just using vmalloc doesn't make things much > >> simpler from an XFS perspective. > >> > > > > The comment in kmem_zalloc_large() calls out some apparent hardcoded > > allocation flags down in the depths of vmalloc(). It looks to me that > > page allocation (__vmalloc_area_node()) actually uses the provided > > flags, so I'm not following the "data page" part of that comment. > > Indeed, I do see that this is not the case down in calls like > > pmd_alloc_one(), pte_alloc_one_kernel(), etc., associated with page > > table management. > > > > Those latter calls are all from following down through the > > map_vm_area()->vmap_page_range() codepath from __vmalloc_area_node(). We > > call vm_map_ram() directly from _xfs_buf_map_pages(), which itself calls > > down into the same code. Indeed, we already protect ourselves here via > > the same memalloc_noio_save() mechanism that kmem_zalloc_large() uses. > > > > I suspect there's more to it than that because it does look like > > vm_map_ram() has a different mechanism for managing vmalloc space for > > certain (smaller) allocations, either of which I'm not really familiar > > with. That aside, I don't see how vmalloc() introduces any new > > allocation context issues for those buffers where we already set up a > > multi-page mapping. > > > > We still have the somewhat customized page allocation code in > > xfs_buf_allocate_memory() to contend with. I actually think it would be > > useful to have a DEBUG sysfs tunable to turn on vmalloc() buffers and > > actually test how effective some of this code is. 
> > > >> > I ask because it seems like that would help clean up the code a bit, for > >> > one. It might also facilitate some degree of testing of the XFS bits > >> > (even if utilized sparingly in DEBUG mode if it weren't suitable enough > >> > for generic/mmu use). We currently allocate and map the buffer pages > >> > separately and I'm not sure if there's any particular reasons for doing > >> > that outside of some congestion handling in the allocation code and > >> > XBF_UNMAPPED buffers, the latter probably being irrelevant for nommu. > >> > Any other thoughts on that? > >> > >> We could probably clean the code up more (the allocation logic > >> is now largely a historic relic) but I'm not convinced yet that we > >> should be spending any time trying to specifically support mmu-less > >> hardware. > >> > > > > Fair point, we'll see where the use case discussion goes. That said, I > > was a little surprised that this is all that was required to enable > > nommu support. If that is indeed the case and we aren't in for a series > > of subsequent nommu specific changes (Octavian?) by letting this > > through, what's the big deal? This seems fairly harmless to me as is, > > particularly if it can be semi-tested via DEBUG mode and has potential > > generic use down the road. > > > > I don't foresee additional patches. I was able to use lklfuse to mount > an XFS image and perform basic operations. Are there any xfs specific > tests coverage tools I can use to make sure I am not missing anything? > Ok, well you probably want to run some of the tests in xfstests and see if anything falls over: https://git.kernel.org/cgit/fs/xfs/xfstests-dev.git/ Note that this has generic as well as fs-specific tests for other filesystems (ext4, btrfs, etc.). Brian > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
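For completeness, a typical xfstests run against an XFS setup looks roughly like this; the device names and mount points below are placeholders:

```sh
# After cloning the repository linked above:
cd xfstests-dev && make
cat > local.config <<EOF
export TEST_DEV=/dev/vdb
export TEST_DIR=/mnt/test
export SCRATCH_DEV=/dev/vdc
export SCRATCH_MNT=/mnt/scratch
EOF
sudo ./check -g quick      # quick smoke test; "-g auto" runs the full suite
```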
On Fri, Nov 20, 2015 at 05:31:59PM +0200, Octavian Purdila wrote: > On Fri, Nov 20, 2015 at 5:24 PM, Brian Foster <bfoster@redhat.com> wrote: > > On Fri, Nov 20, 2015 at 04:26:28PM +0200, Octavian Purdila wrote: > >> On Fri, Nov 20, 2015 at 2:58 AM, Dave Chinner <david@fromorbit.com> wrote: > >> > On Fri, Nov 20, 2015 at 12:54:02AM +0100, Richard Weinberger wrote: > >> >> On Fri, Nov 20, 2015 at 12:24 AM, Dave Chinner <david@fromorbit.com> wrote: > >> >> > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote: > >> >> >> Naive implementation for non-mmu architectures: allocate physically > >> >> >> contiguous xfs buffers with alloc_pages. Terribly inefficient with > >> >> >> memory and fragmentation on high I/O loads but it may be good enough > >> >> >> for basic usage (which most non-mmu architectures will need). > >> >> > > >> >> > Can you please explain why you want to use XFS on low end, basic > >> >> > non-MMU devices? XFS is a high performance, enterprise/HPC level > >> >> > filesystem - it's not a filesystem designed for small IoT level > >> >> > devices - so I'm struggling to see why we'd want to expend any > >> >> > effort to make XFS work on such devices.... > >> >> > >> >> The use case is the Linux Kernel Library: > >> >> https://lkml.org/lkml/2015/11/3/706 > >> >> > >> >> Using LKL and fuse you can mount any kernel filesystem using fuse > >> >> as non-root. > >> > > >> > IOWs, because we said no to unprivileged mounts, instead the > >> > proposal is to linking all the kernel code into userspace so you can > >> > do unprivielged mounts that way? > >> > > >> > >> LKL's goal is to make it easy for various applications to reuse Linux > >> kernel code instead of re-implementing it. Mounting filesystem images > >> is just one of the applications. > >> > >> > IOWs, you get to say "it secure because it's in userspace" and leave > >> > us filesystem people with all the shit that comes with allowing > >> > users to mount random untrusted filesystem images using code that > >> > was never designed to allow that to happen? > >> > > >> > >> It is already possible to mount arbitrary filesystem images in > >> userspace using VMs . LKL doesn't change that, it just reduces the > >> amount of dependencies you need to do so. > >> > > > > Perhaps a dumb question, but I'm not quite putting 2+2 together here. > > When I see nommu, I'm generally thinking hardware characteristics, but > > we're talking about a userspace kernel library here. So can you > > elaborate on how this relates to nommu? Does this library emulate kernel > > mechanisms in userspace via nommu mode or something of that nature? > > > > LKL is currently implemented as a virtual non-mmu architecture. That > makes it simpler and it will also allow us to support environments > where it is not possible to emulate paging (e.g. bootloaders). > Ok, so we aren't necessarily talking about running on typically limited, mmu-less hardware. Thanks! Brian > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Nov 20, 2015 at 04:26:28PM +0200, Octavian Purdila wrote: > It is already possible to mount arbitrary filesystem images in > userspace using VMs . LKL doesn't change that, it just reduces the > amount of dependencies you need to do so. It is true that you can mount arbitrary file systems in userspace using VM's. But the kvm binary is typically not run with root privileges in the host OS --- at least, not if the system administrator is smart. So a root compromise does not cause a catastrophic security vulnerability, and if the guest OS crashes --- again, not a real problem. In the case where people are trying to claim that containers are just as secure as VM's, and plan to give container "guest" system administrators root-like powers, the question which immediately comes to mind is whether the LKL/fuse daemon is running inside or outside the container. If it is outside the container, a potential security compromise of the running binary will be catastrophic to the overall security of the host system and all of its containers. If it is inside the container, you will be partially breaking the illusion that the container works just like a VM (since a user running "ps" will see all of these mysterious userspace processes that could be killed, etc.), but it significantly reduces the security problems if a maliciously crafted (or maliciously modified) block device is mounted. > Could you expand of what burden does this use-case put on fs > developers? I am sure that, if needed, we can put restrictions in LKL > to avoid that. The bottom line is who is going to get all of the support calls and nasty e-mails explaining how our "crappy" code has caused some silly container VM system administrator's customers $$$ worth of losses. As long as we can make sure that it's been underlined that the code is being used well outside of its intended scope, and if it breaks, the user gets to keep both pieces, and all complaints, threatened lawsuits, etc. should go to the LKL maintainers or the purveyors of said container-based products, I suppose that should be OK. :-) - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Nov 20, 2015 at 10:11:19AM -0500, Brian Foster wrote: > On Fri, Nov 20, 2015 at 10:35:47AM +1100, Dave Chinner wrote: > > On Thu, Nov 19, 2015 at 10:55:25AM -0500, Brian Foster wrote: > > > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote: > > > > Naive implementation for non-mmu architectures: allocate physically > > > > contiguous xfs buffers with alloc_pages. Terribly inefficient with > > > > memory and fragmentation on high I/O loads but it may be good enough > > > > for basic usage (which most non-mmu architectures will need). > > > > > > > > This patch was tested with lklfuse [1] and basic operations seems to > > > > work even with 16MB allocated for LKL. > > > > > > > > [1] https://github.com/lkl/linux > > > > > > > > Signed-off-by: Octavian Purdila <octavian.purdila@intel.com> > > > > --- > > > > > > Interesting, though this makes me wonder why we couldn't have a new > > > _XBF_VMEM (for example) buffer type that uses vmalloc(). I'm not > > > familiar with mmu-less context, but I see that mm/nommu.c has a > > > __vmalloc() interface that looks like it ultimately translates into an > > > alloc_pages() call. Would that accomplish what this patch is currently > > > trying to do? > > > > vmalloc is always a last resort. vmalloc space on 32 bit systems is > > extremely limited and it is easy to exhaust with XFS. > > > > Sure, but my impression is that a vmalloc() buffer is roughly equivalent > in this regard to a current !XBF_UNMAPPED && size > PAGE_SIZE buffer. We > just do the allocation and mapping separately (presumably for other > reasons). Yes, it'a always a last resort. We don't use vmap'd buffers very much on block size <= page size filesystems (e.g. iclog buffers are the main user in such cases, IIRC), so the typical 32 bit system doesn't have major problems with vmalloc space. However, the moment you increase the directory block size > block size, that all goes out the window... > > Also, vmalloc limits the control we have over allocation context > > (e.g. the hoops we jump through in kmem_alloc_large() to maintain > > GFP_NOFS contexts), so just using vmalloc doesn't make things much > > simpler from an XFS perspective. > > > > The comment in kmem_zalloc_large() calls out some apparent hardcoded > allocation flags down in the depths of vmalloc(). It looks to me that > page allocation (__vmalloc_area_node()) actually uses the provided > flags, so I'm not following the "data page" part of that comment. You can pass gfp flags for the page allocation part of vmalloc, but not the pte allocation part of it. That's what the hacks in kmem_zalloc_large() are doing. > Indeed, I do see that this is not the case down in calls like > pmd_alloc_one(), pte_alloc_one_kernel(), etc., associated with page > table management. Right. > Those latter calls are all from following down through the > map_vm_area()->vmap_page_range() codepath from __vmalloc_area_node(). We > call vm_map_ram() directly from _xfs_buf_map_pages(), which itself calls > down into the same code. Indeed, we already protect ourselves here via > the same memalloc_noio_save() mechanism that kmem_zalloc_large() uses. Yes, we do, but that is separately handled to the allocation of the pages, which we have to do for all types of buffers, mapped or unmapped, because xfs_buf_ioapply_map() requires direct access to the underlying pages to build the bio for IO. 
If we delegate the allocation of pages to vmalloc, we don't have direct reference to the underlying pages and so we have to do something completely diffferent to build the bios for the buffer.... > I suspect there's more to it than that because it does look like > vm_map_ram() has a different mechanism for managing vmalloc space for > certain (smaller) allocations, either of which I'm not really familiar > with. Yes, it manages vmalloc space quite differently, and there are different scalability aspects to consider as well - vm_map_ram was pretty much written for the use XFS has in xfs_buf.c... Cheers, Dave.
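The dependency Dave describes, shown as a heavily condensed sketch (not the real xfs_buf_ioapply_map(), which handles multiple maps, partial pages, completion and error cases): the submission path needs the page array because each page is handed to bio_add_page():

```c
/*
 * Heavily condensed sketch of why the buffer cache wants direct access to
 * the underlying pages: the I/O submission path builds bios page by page.
 * Not the real xfs_buf_ioapply_map().
 */
static void
xfs_buf_submit_sketch(
	struct xfs_buf		*bp,
	sector_t		sector,
	int			rw)
{
	int			page_index = 0;
	int			size = BBTOB(bp->b_length);
	struct bio		*bio;

	bio = bio_alloc(GFP_NOIO, bp->b_page_count);
	bio->bi_iter.bi_sector = sector;
	bio->bi_bdev = bp->b_target->bt_bdev;

	while (size > 0) {
		int nbytes = min_t(int, size, PAGE_SIZE);

		if (bio_add_page(bio, bp->b_pages[page_index], nbytes, 0) < nbytes)
			break;
		size -= nbytes;
		page_index++;
	}
	submit_bio(rw, bio);	/* submit_bio() signature of this era */
}
```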
On Fri, Nov 20, 2015 at 03:43:20PM +0200, Octavian Purdila wrote: > On Fri, Nov 20, 2015 at 1:24 AM, Dave Chinner <david@fromorbit.com> wrote: > > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote: > >> Naive implementation for non-mmu architectures: allocate physically > >> contiguous xfs buffers with alloc_pages. Terribly inefficient with > >> memory and fragmentation on high I/O loads but it may be good enough > >> for basic usage (which most non-mmu architectures will need). > > > > Can you please explain why you want to use XFS on low end, basic > > non-MMU devices? XFS is a high performance, enterprise/HPC level > > filesystem - it's not a filesystem designed for small IoT level > > devices - so I'm struggling to see why we'd want to expend any > > effort to make XFS work on such devices.... > > > > Hi David, > > Yes XFS as the main fs on this type of devices does not make sense, > but does it hurt to be able to perform basic operation on XFS from > these devices? Perhaps accessing an external medium formatted with > XFS? > > Another example is accessing VM images that are formatted with XFS. > Currently we can do that with tools like libguestfs that use a VM in > the background. I am working on a lighter solution for that where we > compile the Linux kernel as a library [1]. This allows access to the > filesystem without the need to use a full VM. That's hardly a "lighter solution". I'm kinda tired of the ongoing "hack random shit" approach to container development. If you need a XFS-FUSE module to allow safe userspace access to XFS fielsystems then maybe, just maybe, it makes sense to ask the XFS developers how to best go about providing a reliable, up-to-date, tested, maintained and supported XFS-FUSE module? IOWs, a "lighter solution" is to use the libxfs code base that we already maintain across kernel and userspace in the xfsprogs package and write a FUSE wrapper around that. That, immediately, will give you full read-only access to XFS filesystem images via FUSE. Then we (the XFS developers) can test the XFS-FUSE module under normal development conditions as we modify the xfsprogs code base (e.g. via xfstests) and ensure we always release a working, up-to-date FUSE wrapper with each xfsprogs release. And then once a proper read-only FUSE wrapper has been written, then we can discuss what is necessary to enable write access via porting the necessary parts of the kernel code across to the userspace libxfs codebase and hooking them up to the FUSE API... Hmmm? > And a final example is linking the bootloader code with LKL to access > the filesystem. This has a hard requirement on non-mmu. No way. We *can't* support filesystems that have had bootloaders make arbitrary changes to the filesystem without the knowlege of the OS that *owns the filesystem*. Similarly, we cannot support random applications that internally mount and modify filesystem images in ways we can't see, control, test or simulate. Sure, they use the kernel code, but that doesn't stop them from doing stupid shit that could corrupt the filesystem image. So, no, we are not going to support giving random applications direct access to XFS filesystem images, even via LKL. > So, IMHO there are usecases for using XFS on non-mmu architectures. I > think it all boils down to: is the patch simple enough to not put an > unreasonable maintenance burden on developers? 
Look at it this way: a user tries to mount their XFS-on-LKL-on-FUSE, XFS throughs a memory allocation deadlock warning because of a high order allocation during mount failing (i.e. due to the vmalloc no-mmu hacks). Who has to spend effort to find out why the error is being thrown? It's not the FUSE developers. It's not the LKL developers. XFS is going to be blamed, because it's an XFS error message, and it will be up to the XFS developers to /prove/ that it's not an XFS problem. And then google will find these complaints about XFS causing all sorts of problems, and we're back to the bad old days of everyone blaming XFS for shortcomings and problems in other code we have no control over.. I really don't see how using LKL to give userspace access to XFS filesystems is a better solution than actually writing a proper, supported XFS-FUSE module. LKL is so full of compromises that it's going to be unworkable and unsupportable in practice... Cheers, Dave.
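The shape of what Dave proposes is a small libfuse front end whose callbacks are backed by xfsprogs' libxfs. A bare skeleton follows, with the libxfs plumbing left as stubs; the xfsfuse_* names are invented for illustration:

```c
/*
 * Skeleton of a read-only XFS-FUSE wrapper as proposed above.  The FUSE
 * boilerplate is standard libfuse 2.x API; the callbacks are stubs that
 * would need to be implemented on top of xfsprogs' libxfs.
 */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <errno.h>
#include <sys/stat.h>

static int xfsfuse_getattr(const char *path, struct stat *st)
{
	/* look up the inode via libxfs and fill *st */
	return -ENOSYS;
}

static int xfsfuse_readdir(const char *path, void *buf, fuse_fill_dir_t fill,
			   off_t offset, struct fuse_file_info *fi)
{
	/* walk the XFS directory format and call fill() for each entry */
	return -ENOSYS;
}

static int xfsfuse_read(const char *path, char *buf, size_t size, off_t off,
			struct fuse_file_info *fi)
{
	/* map the file offset to extents via libxfs and copy the data out */
	return -ENOSYS;
}

static struct fuse_operations xfsfuse_ops = {
	.getattr	= xfsfuse_getattr,
	.readdir	= xfsfuse_readdir,
	.read		= xfsfuse_read,
};

int main(int argc, char *argv[])
{
	/* the filesystem image path would be parsed out of argv here */
	return fuse_main(argc, argv, &xfsfuse_ops, NULL);
}
```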
On Fri, Nov 20, 2015 at 11:08 PM, Dave Chinner <david@fromorbit.com> wrote: > On Fri, Nov 20, 2015 at 03:43:20PM +0200, Octavian Purdila wrote: >> On Fri, Nov 20, 2015 at 1:24 AM, Dave Chinner <david@fromorbit.com> wrote: >> > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote: >> >> Naive implementation for non-mmu architectures: allocate physically >> >> contiguous xfs buffers with alloc_pages. Terribly inefficient with >> >> memory and fragmentation on high I/O loads but it may be good enough >> >> for basic usage (which most non-mmu architectures will need). >> > >> > Can you please explain why you want to use XFS on low end, basic >> > non-MMU devices? XFS is a high performance, enterprise/HPC level >> > filesystem - it's not a filesystem designed for small IoT level >> > devices - so I'm struggling to see why we'd want to expend any >> > effort to make XFS work on such devices.... >> > >> >> Hi David, >> >> Yes XFS as the main fs on this type of devices does not make sense, >> but does it hurt to be able to perform basic operation on XFS from >> these devices? Perhaps accessing an external medium formatted with >> XFS? >> >> Another example is accessing VM images that are formatted with XFS. >> Currently we can do that with tools like libguestfs that use a VM in >> the background. I am working on a lighter solution for that where we >> compile the Linux kernel as a library [1]. This allows access to the >> filesystem without the need to use a full VM. > > That's hardly a "lighter solution" > > I'm kinda tired of the ongoing "hack random shit" approach to > container development. Since apparently there is a container devs hunting party going on right now, let me quickly confess that LKL has nothing to do with (them be damned) containers :) On a more serious note, LKL was not developed for containers or to try to circumvent privileged mounts. It was developed to allow the Linux kernel code to be reused in things like simple tools that allows one to modify a filesystem image. > If you need a XFS-FUSE module to allow safe > userspace access to XFS fielsystems then maybe, just maybe, it makes > sense to ask the XFS developers how to best go about providing a > reliable, up-to-date, tested, maintained and supported XFS-FUSE > module? > > IOWs, a "lighter solution" is to use the libxfs code base that we > already maintain across kernel and userspace in the xfsprogs package > and write a FUSE wrapper around that. That, immediately, will give > you full read-only access to XFS filesystem images via FUSE. Then we > (the XFS developers) can test the XFS-FUSE module under normal > development conditions as we modify the xfsprogs code base (e.g. via > xfstests) and ensure we always release a working, up-to-date FUSE > wrapper with each xfsprogs release. > > And then once a proper read-only FUSE wrapper has been written, then > we can discuss what is necessary to enable write access via porting > the necessary parts of the kernel code across to the userspace > libxfs codebase and hooking them up to the FUSE API... > > Hmmm? > What about ext4, vfat, btrfs and other filesystems? Also why duplicate the whole thing if you could reuse it? >> And a final example is linking the bootloader code with LKL to access >> the filesystem. This has a hard requirement on non-mmu. > > No way. We *can't* support filesystems that have had bootloaders > make arbitrary changes to the filesystem without the knowlege of the > OS that *owns the filesystem*. 
Similarly, we cannot support random > applications that internally mount and modify filesystem images in > ways we can't see, control, test or simulate. Sure, they use the > kernel code, but that doesn't stop them from doing stupid shit that > could corrupt the filesystem image. So, no, we are not going to > support giving random applications direct access to XFS filesystem > images, even via LKL. > LKL only exports the Linux kernel system calls and nothing else to applications. Because of that, there should not be any loss of control or visibility to the XFS fs driver. >> So, IMHO there are usecases for using XFS on non-mmu architectures. I >> think it all boils down to: is the patch simple enough to not put an >> unreasonable maintenance burden on developers? > > Look at it this way: a user tries to mount their > XFS-on-LKL-on-FUSE, XFS throughs a memory allocation deadlock > warning because of a high order allocation during mount failing > (i.e. due to the vmalloc no-mmu hacks). > > Who has to spend effort to find out why the error is being thrown? > It's not the FUSE developers. It's not the LKL developers. XFS is > going to be blamed, because it's an XFS error message, and it will > be up to the XFS developers to /prove/ that it's not an XFS problem. > And then google will find these complaints about XFS causing all > sorts of problems, and we're back to the bad old days of everyone > blaming XFS for shortcomings and problems in other code we have no > control over.. > > I really don't see how using LKL to give userspace access to XFS > filesystems is a better solution than actually writing a proper, > supported XFS-FUSE module. LKL is so full of compromises that it's > going to be unworkable and unsupportable in practice... > Could you elaborate on some of these issues? -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Nov 21, 2015 at 07:36:02AM +1100, Dave Chinner wrote: > On Fri, Nov 20, 2015 at 10:11:19AM -0500, Brian Foster wrote: > > On Fri, Nov 20, 2015 at 10:35:47AM +1100, Dave Chinner wrote: > > > On Thu, Nov 19, 2015 at 10:55:25AM -0500, Brian Foster wrote: > > > > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote: > > > > > Naive implementation for non-mmu architectures: allocate physically > > > > > contiguous xfs buffers with alloc_pages. Terribly inefficient with > > > > > memory and fragmentation on high I/O loads but it may be good enough > > > > > for basic usage (which most non-mmu architectures will need). > > > > > > > > > > This patch was tested with lklfuse [1] and basic operations seems to > > > > > work even with 16MB allocated for LKL. > > > > > > > > > > [1] https://github.com/lkl/linux > > > > > > > > > > Signed-off-by: Octavian Purdila <octavian.purdila@intel.com> > > > > > --- > > > > > > > > Interesting, though this makes me wonder why we couldn't have a new > > > > _XBF_VMEM (for example) buffer type that uses vmalloc(). I'm not > > > > familiar with mmu-less context, but I see that mm/nommu.c has a > > > > __vmalloc() interface that looks like it ultimately translates into an > > > > alloc_pages() call. Would that accomplish what this patch is currently > > > > trying to do? > > > > > > vmalloc is always a last resort. vmalloc space on 32 bit systems is > > > extremely limited and it is easy to exhaust with XFS. > > > > > > > Sure, but my impression is that a vmalloc() buffer is roughly equivalent > > in this regard to a current !XBF_UNMAPPED && size > PAGE_SIZE buffer. We > > just do the allocation and mapping separately (presumably for other > > reasons). > > Yes, it'a always a last resort. We don't use vmap'd buffers very > much on block size <= page size filesystems (e.g. iclog buffers are > the main user in such cases, IIRC), so the typical 32 bit > system doesn't have major problems with vmalloc space. However, the > moment you increase the directory block size > block size, that all > goes out the window... > Ok. It's really only the pre-existing mapped buffer cases we care about here. > > > Also, vmalloc limits the control we have over allocation context > > > (e.g. the hoops we jump through in kmem_alloc_large() to maintain > > > GFP_NOFS contexts), so just using vmalloc doesn't make things much > > > simpler from an XFS perspective. > > > > > > > The comment in kmem_zalloc_large() calls out some apparent hardcoded > > allocation flags down in the depths of vmalloc(). It looks to me that > > page allocation (__vmalloc_area_node()) actually uses the provided > > flags, so I'm not following the "data page" part of that comment. > > You can pass gfp flags for the page allocation part of vmalloc, but > not the pte allocation part of it. That's what the hacks in > kmem_zalloc_large() are doing. > > > Indeed, I do see that this is not the case down in calls like > > pmd_alloc_one(), pte_alloc_one_kernel(), etc., associated with page > > table management. > > Right. > > > Those latter calls are all from following down through the > > map_vm_area()->vmap_page_range() codepath from __vmalloc_area_node(). We > > call vm_map_ram() directly from _xfs_buf_map_pages(), which itself calls > > down into the same code. Indeed, we already protect ourselves here via > > the same memalloc_noio_save() mechanism that kmem_zalloc_large() uses. 
> > Yes, we do, but that is separately handled to the allocation of the > pages, which we have to do for all types of buffers, mapped or > unmapped, because xfs_buf_ioapply_map() requires direct access to > the underlying pages to build the bio for IO. If we delegate the > allocation of pages to vmalloc, we don't have direct reference to > the underlying pages and so we have to do something completely > diffferent to build the bios for the buffer.... > Octavian points out virt_to_page() in a previous mail. I'm not sure that's the right interface solely based on looking at some current callers, but there is vmalloc_to_page() so I'd expect we can gain access to the pages one way or another. Given that, the buffer allocation code would fully populate the xfs_buf as it is today. The buffer I/O submission code wouldn't really know the difference and shouldn't have to change at all. > > I suspect there's more to it than that because it does look like > > vm_map_ram() has a different mechanism for managing vmalloc space for > > certain (smaller) allocations, either of which I'm not really familiar > > with. > > Yes, it manages vmalloc space quite differently, and there are > different scalability aspects to consider as well - vm_map_ram was > pretty much written for the use XFS has in xfs_buf.c... > Indeed. Looking closer, it appears to have a percpu vmalloc space allocation strategy for smaller allocations. We clearly can't just switch mapped buffer cases over and expect it to work/perform just the same. That said, the vm_map_ram() comment does call out fragmentation concerns for objects with mixed lifetimes. I'd be curious whether our buffer caching can trigger any badness there. OTOH, it's also not clear that this mechanism couldn't extend to vmalloc (or some variant thereof) in the future. Either way, it would require significantly more investigation/testing to enable generic usage. The core point was really just to abstract the nommu changes into something that potentially has generic use. Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
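For readers following the archive, the kmem_zalloc_large() "hoops" referred to in this exchange are the noio save/restore pattern around __vmalloc(): the gfp mask only covers the data pages, so the task is marked NOIO to stop the page table allocations from recursing into the filesystem. A sketch modeled on fs/xfs/kmem.c of roughly this vintage (reconstructed from memory, so details such as the PF_FSTRANS check and the exact gfp bits may differ):

void *
kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
{
	unsigned	noio_flag = 0;
	void		*ptr;
	gfp_t		lflags;

	/* Try a regular kmalloc-backed allocation first. */
	ptr = kmem_zalloc(size, flags | KM_MAYFAIL);
	if (ptr)
		return ptr;

	/*
	 * __vmalloc() allocates its page table pages with GFP_KERNEL
	 * regardless of the gfp mask passed for the data pages, so in a
	 * GFP_NOFS context mark the whole task NOIO to prevent reclaim
	 * from re-entering the filesystem underneath us.
	 */
	if (flags & KM_NOFS)
		noio_flag = memalloc_noio_save();

	lflags = kmem_flags_convert(flags);
	ptr = __vmalloc(size, lflags | __GFP_ZERO, PAGE_KERNEL);

	if (flags & KM_NOFS)
		memalloc_noio_restore(noio_flag);

	return ptr;
}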
On Fri, Nov 20, 2015 at 05:47:34PM -0500, Brian Foster wrote: > On Sat, Nov 21, 2015 at 07:36:02AM +1100, Dave Chinner wrote: > > On Fri, Nov 20, 2015 at 10:11:19AM -0500, Brian Foster wrote: > > > On Fri, Nov 20, 2015 at 10:35:47AM +1100, Dave Chinner wrote: > > > Those latter calls are all from following down through the > > > map_vm_area()->vmap_page_range() codepath from __vmalloc_area_node(). We > > > call vm_map_ram() directly from _xfs_buf_map_pages(), which itself calls > > > down into the same code. Indeed, we already protect ourselves here via > > > the same memalloc_noio_save() mechanism that kmem_zalloc_large() uses. > > > > Yes, we do, but that is separately handled to the allocation of the > > pages, which we have to do for all types of buffers, mapped or > > unmapped, because xfs_buf_ioapply_map() requires direct access to > > the underlying pages to build the bio for IO. If we delegate the > > allocation of pages to vmalloc, we don't have direct reference to > > the underlying pages and so we have to do something completely > > diffferent to build the bios for the buffer.... > > > > Octavian points out virt_to_page() in a previous mail. I'm not sure > that's the right interface solely based on looking at some current > callers, but there is vmalloc_to_page() so I'd expect we can gain access > to the pages one way or another. Sure, but these are not zero cost operations.... > Given that, the buffer allocation code > would fully populate the xfs_buf as it is today. The buffer I/O > submission code wouldn't really know the difference and shouldn't have > to change at all. The abstraction results in more expensive/complex setup and teardown of buffers and/or IO submission. i.e. the use of vmalloc() based abstractions has an additional cost over what we do now. [...] > Either way, it would require significantly more investigation/testing to > enable generic usage. The core point was really just to abstract the > nommu changes into something that potentially has generic use. I'm not saying that it is impossible to do this, just trying to work out if making any changes to support nommu architectures is worth the potential future trouble making such changes could bring us. i.e. before working out how to do something, we have to decide whether it is worth doing in the first place. Just because you can do something doesn't automatically make it a good idea.... Cheers, Dave.
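To make that trade-off concrete, here is what the hypothetical _XBF_VMEM allocation path sketched in this subthread could look like. This is not existing XFS code: _XBF_VMEM is an invented flag, the function name is made up, and it assumes _xfs_buf_get_pages() has already sized bp->b_pages[]. The per-page vmalloc_to_page() lookups are the extra setup cost being pointed out above.

static int
xfs_buf_allocate_vmem(struct xfs_buf *bp, size_t size, gfp_t gfp_mask)
{
	unsigned int	i;

	/* vmalloc provides both the pages and the virtual mapping... */
	bp->b_addr = __vmalloc(size, gfp_mask | __GFP_ZERO, PAGE_KERNEL);
	if (!bp->b_addr)
		return -ENOMEM;

	/*
	 * ...then b_pages[] is filled in so xfs_buf_ioapply_map() can
	 * keep building bios from bp->b_pages exactly as it does today.
	 */
	bp->b_page_count = DIV_ROUND_UP(size, PAGE_SIZE);
	for (i = 0; i < bp->b_page_count; i++)
		bp->b_pages[i] = vmalloc_to_page(bp->b_addr + i * PAGE_SIZE);

	bp->b_flags |= _XBF_VMEM;	/* hypothetical flag */
	return 0;
}

On nommu kernels __vmalloc() falls back to a physically contiguous allocation anyway, which is why a path like this could in principle serve both configurations.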
On Sat, Nov 21, 2015 at 12:26:47AM +0200, Octavian Purdila wrote: > On Fri, Nov 20, 2015 at 11:08 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Fri, Nov 20, 2015 at 03:43:20PM +0200, Octavian Purdila wrote: > >> On Fri, Nov 20, 2015 at 1:24 AM, Dave Chinner <david@fromorbit.com> wrote: > >> > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote: > >> >> Naive implementation for non-mmu architectures: allocate physically > >> >> contiguous xfs buffers with alloc_pages. Terribly inefficient with > >> >> memory and fragmentation on high I/O loads but it may be good enough > >> >> for basic usage (which most non-mmu architectures will need). > >> > > >> > Can you please explain why you want to use XFS on low end, basic > >> > non-MMU devices? XFS is a high performance, enterprise/HPC level > >> > filesystem - it's not a filesystem designed for small IoT level > >> > devices - so I'm struggling to see why we'd want to expend any > >> > effort to make XFS work on such devices.... > >> > > >> > >> Hi David, > >> > >> Yes XFS as the main fs on this type of devices does not make sense, > >> but does it hurt to be able to perform basic operation on XFS from > >> these devices? Perhaps accessing an external medium formatted with > >> XFS? > >> > >> Another example is accessing VM images that are formatted with XFS. > >> Currently we can do that with tools like libguestfs that use a VM in > >> the background. I am working on a lighter solution for that where we > >> compile the Linux kernel as a library [1]. This allows access to the > >> filesystem without the need to use a full VM. > > > > That's hardly a "lighter solution" > > > > I'm kinda tired of the ongoing "hack random shit" approach to > > container development. > > Since apparently there is a container devs hunting party going on > right now, let me quickly confess that LKL has nothing to do with > (them be damned) containers :) > > On a more serious note, LKL was not developed for containers or to try > to circumvent privileged mounts. It was developed to allow the Linux > kernel code to be reused in things like simple tools that allows one > to modify a filesystem image. Anything tool that modifies an XFS filesystem that is not directly maintained by the XFS developers voids any kind of support we can supply. Just like the fact we don't support tainted kernels because the 3rd party binary code is unknowable (and usually crap), having the kernel code linked with random 3rd party userspace application code is completely unsupportable by us. Remember that with most kernel code a bug just results in a panic and reboot, and everything just continues on again after the system comes up again. In contrast, a bug in the storage code can cause *persistent damage* that can cause data loss or corruption that cannot be fixed without data loss of some kind. Ultimately, as the maintainer I'm responsible for XFS not eating our users' data, and part of that responsibility involves telling people who want to do daft things that "no, that's a *bad idea*". > > If you need a XFS-FUSE module to allow safe > > userspace access to XFS fielsystems then maybe, just maybe, it makes > > sense to ask the XFS developers how to best go about providing a > > reliable, up-to-date, tested, maintained and supported XFS-FUSE > > module? > > > > IOWs, a "lighter solution" is to use the libxfs code base that we > > already maintain across kernel and userspace in the xfsprogs package > > and write a FUSE wrapper around that. 
That, immediately, will give > > you full read-only access to XFS filesystem images via FUSE. Then we > > (the XFS developers) can test the XFS-FUSE module under normal > > development conditions as we modify the xfsprogs code base (e.g. via > > xfstests) and ensure we always release a working, up-to-date FUSE > > wrapper with each xfsprogs release. > > > > And then once a proper read-only FUSE wrapper has been written, then > > we can discuss what is necessary to enable write access via porting > > the necessary parts of the kernel code across to the userspace > > libxfs codebase and hooking them up to the FUSE API... > > > > Hmmm? > > > > What about ext4, vfat, btrfs and other filesystems? Ted has also raised exactly the same issues w.r.t. ext4. > Also why duplicate > the whole thing if you could reuse it? Do you use a hammer when you need to tighten a screw? Yes, you can "reuse a hammer" for this purpose, but there's going to be collateral damage because using the screw outside it's original design and architecture constraints presents a high risk of things going wrong. > >> And a final example is linking the bootloader code with LKL to access > >> the filesystem. This has a hard requirement on non-mmu. > > > > No way. We *can't* support filesystems that have had bootloaders > > make arbitrary changes to the filesystem without the knowlege of the > > OS that *owns the filesystem*. Similarly, we cannot support random > > applications that internally mount and modify filesystem images in > > ways we can't see, control, test or simulate. Sure, they use the > > kernel code, but that doesn't stop them from doing stupid shit that > > could corrupt the filesystem image. So, no, we are not going to > > support giving random applications direct access to XFS filesystem > > images, even via LKL. > > > > LKL only exports the Linux kernel system calls and nothing else to > applications. Because of that, there should not be any loss of control > or visibility to the XFS fs driver. It runs in the same address space as the user application, yes? And hence application bugs can cause the kernel code to malfunction, yes? > > I really don't see how using LKL to give userspace access to XFS > > filesystems is a better solution than actually writing a proper, > > supported XFS-FUSE module. LKL is so full of compromises that it's > > going to be unworkable and unsupportable in practice... > > Could you elaborate on some of these issues? Start with "is a no-mmu architecture" and all the compromises that means the kernel code needs to make, add a topping of "runs in the same address space as the application", add a new flavour of kernel binary taint and finish it off with "LKL linked applications will never be tested by their developers over the full functionality the LKL provides them with". Cheers, Dave.
On Mon, Nov 23, 2015 at 12:44 AM, Dave Chinner <david@fromorbit.com> wrote: > On Sat, Nov 21, 2015 at 12:26:47AM +0200, Octavian Purdila wrote: >> On Fri, Nov 20, 2015 at 11:08 PM, Dave Chinner <david@fromorbit.com> wrote: >> > On Fri, Nov 20, 2015 at 03:43:20PM +0200, Octavian Purdila wrote: >> >> On Fri, Nov 20, 2015 at 1:24 AM, Dave Chinner <david@fromorbit.com> wrote: >> >> > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote: >> >> >> Naive implementation for non-mmu architectures: allocate physically >> >> >> contiguous xfs buffers with alloc_pages. Terribly inefficient with >> >> >> memory and fragmentation on high I/O loads but it may be good enough >> >> >> for basic usage (which most non-mmu architectures will need). >> >> > >> >> > Can you please explain why you want to use XFS on low end, basic >> >> > non-MMU devices? XFS is a high performance, enterprise/HPC level >> >> > filesystem - it's not a filesystem designed for small IoT level >> >> > devices - so I'm struggling to see why we'd want to expend any >> >> > effort to make XFS work on such devices.... >> >> > >> >> >> >> Hi David, >> >> >> >> Yes XFS as the main fs on this type of devices does not make sense, >> >> but does it hurt to be able to perform basic operation on XFS from >> >> these devices? Perhaps accessing an external medium formatted with >> >> XFS? >> >> >> >> Another example is accessing VM images that are formatted with XFS. >> >> Currently we can do that with tools like libguestfs that use a VM in >> >> the background. I am working on a lighter solution for that where we >> >> compile the Linux kernel as a library [1]. This allows access to the >> >> filesystem without the need to use a full VM. >> > >> > That's hardly a "lighter solution" >> > >> > I'm kinda tired of the ongoing "hack random shit" approach to >> > container development. >> >> Since apparently there is a container devs hunting party going on >> right now, let me quickly confess that LKL has nothing to do with >> (them be damned) containers :) >> >> On a more serious note, LKL was not developed for containers or to try >> to circumvent privileged mounts. It was developed to allow the Linux >> kernel code to be reused in things like simple tools that allows one >> to modify a filesystem image. > > Anything tool that modifies an XFS filesystem that is not directly > maintained by the XFS developers voids any kind of support we can > supply. Just like the fact we don't support tainted kernels because > the 3rd party binary code is unknowable (and usually crap), having > the kernel code linked with random 3rd party userspace application > code is completely unsupportable by us. > Perhaps tainting the kernel is a solution when running unknown applications linked with LKL. I would argue that applications that are maintained together with LKL (e.g. lklfuse in tools/lkl) should not taint the kernel because those applications will be under the control of kernel developers. I would also argue that mounting a filesystem read-only should not taint the kernel either. > Remember that with most kernel code a bug just results in a panic > and reboot, and everything just continues on again after the system > comes up again. In contrast, a bug in the storage code can cause > *persistent damage* that can cause data loss or corruption that > cannot be fixed without data loss of some kind. 
> > Ultimately, as the maintainer I'm responsible for XFS not eating our > users' data, and part of that responsibility involves telling people > who want to do daft things that "no, that's a *bad idea*". > I understand how critical filesystem issues are and I appreciate your feedback. Sorry to drag you deeper into this but as you know, no good deed goes unpunished :) >> > If you need a XFS-FUSE module to allow safe >> > userspace access to XFS fielsystems then maybe, just maybe, it makes >> > sense to ask the XFS developers how to best go about providing a >> > reliable, up-to-date, tested, maintained and supported XFS-FUSE >> > module? >> > >> > IOWs, a "lighter solution" is to use the libxfs code base that we >> > already maintain across kernel and userspace in the xfsprogs package >> > and write a FUSE wrapper around that. That, immediately, will give >> > you full read-only access to XFS filesystem images via FUSE. Then we >> > (the XFS developers) can test the XFS-FUSE module under normal >> > development conditions as we modify the xfsprogs code base (e.g. via >> > xfstests) and ensure we always release a working, up-to-date FUSE >> > wrapper with each xfsprogs release. >> > >> > And then once a proper read-only FUSE wrapper has been written, then >> > we can discuss what is necessary to enable write access via porting >> > the necessary parts of the kernel code across to the userspace >> > libxfs codebase and hooking them up to the FUSE API... >> > >> > Hmmm? >> > >> >> What about ext4, vfat, btrfs and other filesystems? > > Ted has also raised exactly the same issues w.r.t. ext4. > >> Also why duplicate >> the whole thing if you could reuse it? > > Do you use a hammer when you need to tighten a screw? Yes, you can > "reuse a hammer" for this purpose, but there's going to be > collateral damage because using the screw outside it's original > design and architecture constraints presents a high risk of things > going wrong. > First, lets try to discuss the potential collateral damage before calling it a hammer. It may just be just an unfamiliar screw-driver :) >> >> And a final example is linking the bootloader code with LKL to access >> >> the filesystem. This has a hard requirement on non-mmu. >> > >> > No way. We *can't* support filesystems that have had bootloaders >> > make arbitrary changes to the filesystem without the knowlege of the >> > OS that *owns the filesystem*. Similarly, we cannot support random >> > applications that internally mount and modify filesystem images in >> > ways we can't see, control, test or simulate. Sure, they use the >> > kernel code, but that doesn't stop them from doing stupid shit that >> > could corrupt the filesystem image. So, no, we are not going to >> > support giving random applications direct access to XFS filesystem >> > images, even via LKL. >> > >> >> LKL only exports the Linux kernel system calls and nothing else to >> applications. Because of that, there should not be any loss of control >> or visibility to the XFS fs driver. > > It runs in the same address space as the user application, yes? And > hence application bugs can cause the kernel code to malfunction, > yes? > Most non-mmu architecture have the same issue and nevertheless non-mmu is still supported in Linux (including most filesystems). Also, filesystem code runs in the same address space with other kernel code and drivers and a bug anywhere in the kernel can cause filesystem code to malfunction. 
Applications maintained together with LKL and in the kernel tree will be as safe as drivers and other kernel code with regard to filesystem malfunctions. We can taint the kernel when LKL is linked with unknown applications. >> > I really don't see how using LKL to give userspace access to XFS >> > filesystems is a better solution than actually writing a proper, >> > supported XFS-FUSE module. LKL is so full of compromises that it's >> > going to be unworkable and unsupportable in practice... >> >> Could you elaborate on some of these issues? > > Start with "is a no-mmu architecture" and all the compromises that > means the kernel code needs to make, I don't see non-mmu as a compromise. It is supported by Linux and most filesystems work fine on non-mmu architectures. LKL can be implemented as a mmu architecture. Having it as a non-mmu architecture has the advantages of allowing it to run in more constrained environments like bootloaders. > add a topping of "runs in the > same address space as the application", I've addressed this concern above. > add a new flavour of kernel > binary taint I am not sure I understand, are you saying that it is an issue to add a new taint flavor? > and finish it off with "LKL linked applications will > never be tested by their developers over the full functionality the > LKL provides them with". > You lost me here. Why does an application developer have to test the full functionality of a library it is linked with? -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Nov 23, 2015 at 09:04:00AM +1100, Dave Chinner wrote: > On Fri, Nov 20, 2015 at 05:47:34PM -0500, Brian Foster wrote: > > On Sat, Nov 21, 2015 at 07:36:02AM +1100, Dave Chinner wrote: > > > On Fri, Nov 20, 2015 at 10:11:19AM -0500, Brian Foster wrote: > > > > On Fri, Nov 20, 2015 at 10:35:47AM +1100, Dave Chinner wrote: > > > > Those latter calls are all from following down through the > > > > map_vm_area()->vmap_page_range() codepath from __vmalloc_area_node(). We > > > > call vm_map_ram() directly from _xfs_buf_map_pages(), which itself calls > > > > down into the same code. Indeed, we already protect ourselves here via > > > > the same memalloc_noio_save() mechanism that kmem_zalloc_large() uses. > > > > > > Yes, we do, but that is separately handled to the allocation of the > > > pages, which we have to do for all types of buffers, mapped or > > > unmapped, because xfs_buf_ioapply_map() requires direct access to > > > the underlying pages to build the bio for IO. If we delegate the > > > allocation of pages to vmalloc, we don't have direct reference to > > > the underlying pages and so we have to do something completely > > > diffferent to build the bios for the buffer.... > > > > > > > Octavian points out virt_to_page() in a previous mail. I'm not sure > > that's the right interface solely based on looking at some current > > callers, but there is vmalloc_to_page() so I'd expect we can gain access > > to the pages one way or another. > > Sure, but these are not zero cost operations.... > > > Given that, the buffer allocation code > > would fully populate the xfs_buf as it is today. The buffer I/O > > submission code wouldn't really know the difference and shouldn't have > > to change at all. > > The abstraction results in more expensive/complex setup and teardown > of buffers and/or IO submission. i.e. the use of vmalloc() based > abstractions has an additional cost over what we do now. > > [...] > Yeah, most likely. The vmalloc based lookup certainly looks more involved than if the pages are already readily accessible. How much more expensive (and whether it's noticeable enough to care) is probably a matter of testing. I think the code itself could end up much more simple, however. There's still a lot of duplication in our current implementation that afaiu right now only continues to exist due to the performance advantages of vm_map_ram(). > > Either way, it would require significantly more investigation/testing to > > enable generic usage. The core point was really just to abstract the > > nommu changes into something that potentially has generic use. > > I'm not saying that it is impossible to do this, just trying to work > out if making any changes to support nommu architectures is worth > the potential future trouble making such changes could bring us. > i.e. before working out how to do something, we have to decide > whether it is worth doing in the first place. > > Just because you can do something doesn't automatically make it a > good idea.... > Sure, good point. I've intentionally been sticking to the technical points that have been raised (and I think we've covered it thoroughly to this point ;). FWIW, I agree with most of the concerns that have been called out in the thread so far, but the support question ultimately sounds moot to me. We can reject the "configuration" enabled by this particular patch, but afaik this LKL thing could just go and implement an mmu mode and fundamentally do what it's trying to do today. 
If it's useful, users will use it, ultimately hit bugs, and we'll have to deal with it one way or another. Note that I'm not saying we have to support LKL. Rather, I view it as equivalent to somebody off running some new non-standard/uncommon architecture (or maybe a hypervisor is a better example in this case). If the user runs into some low-level filesystem issue, the user probably needs to get through whatever levels of support exist for that special architecture/environment first and/or otherwise work with us to find a reproducer on something more standard that we all have easy access to. Just my .02, though. :) Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com > > _______________________________________________ > xfs mailing list > xfs@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Nov 23, 2015 at 07:50:00AM -0500, Brian Foster wrote: > On Mon, Nov 23, 2015 at 09:04:00AM +1100, Dave Chinner wrote: > > On Fri, Nov 20, 2015 at 05:47:34PM -0500, Brian Foster wrote: > > > On Sat, Nov 21, 2015 at 07:36:02AM +1100, Dave Chinner wrote: > > > > On Fri, Nov 20, 2015 at 10:11:19AM -0500, Brian Foster wrote: > > > > > On Fri, Nov 20, 2015 at 10:35:47AM +1100, Dave Chinner wrote: > > > > > Those latter calls are all from following down through the > > > > > map_vm_area()->vmap_page_range() codepath from __vmalloc_area_node(). We > > > > > call vm_map_ram() directly from _xfs_buf_map_pages(), which itself calls > > > > > down into the same code. Indeed, we already protect ourselves here via > > > > > the same memalloc_noio_save() mechanism that kmem_zalloc_large() uses. > > > > > > > > Yes, we do, but that is separately handled to the allocation of the > > > > pages, which we have to do for all types of buffers, mapped or > > > > unmapped, because xfs_buf_ioapply_map() requires direct access to > > > > the underlying pages to build the bio for IO. If we delegate the > > > > allocation of pages to vmalloc, we don't have direct reference to > > > > the underlying pages and so we have to do something completely > > > > diffferent to build the bios for the buffer.... > > > > > > > > > > Octavian points out virt_to_page() in a previous mail. I'm not sure > > > that's the right interface solely based on looking at some current > > > callers, but there is vmalloc_to_page() so I'd expect we can gain access > > > to the pages one way or another. > > > > Sure, but these are not zero cost operations.... > > > > > Given that, the buffer allocation code > > > would fully populate the xfs_buf as it is today. The buffer I/O > > > submission code wouldn't really know the difference and shouldn't have > > > to change at all. > > > > The abstraction results in more expensive/complex setup and teardown > > of buffers and/or IO submission. i.e. the use of vmalloc() based > > abstractions has an additional cost over what we do now. > > > > [...] > > > > Yeah, most likely. The vmalloc based lookup certainly looks more > involved than if the pages are already readily accessible. How much more > expensive (and whether it's noticeable enough to care) is probably a > matter of testing. I think the code itself could end up much more > simple, however. There's still a lot of duplication in our current > implementation that afaiu right now only continues to exist due to the > performance advantages of vm_map_ram(). Yup. The historical problem here is that the mm/ people are not really interested in making vmalloc() scale/perform better. We are always told that if you need vmalloc() you are doing something wrong, and so vmalloc() doesn't need improving. The same reason is generally given for not making vmalloc handle GFP flags properly, so we're really stuck between a rock and a hard place here. > > > Either way, it would require significantly more investigation/testing to > > > enable generic usage. The core point was really just to abstract the > > > nommu changes into something that potentially has generic use. > > > > I'm not saying that it is impossible to do this, just trying to work > > out if making any changes to support nommu architectures is worth > > the potential future trouble making such changes could bring us. > > i.e. before working out how to do something, we have to decide > > whether it is worth doing in the first place. 
> > > > Just because you can do something doesn't automatically make it a > > good idea.... > > Sure, good point. I've intentionally been sticking to the technical > points that have been raised (and I think we've covered it thoroughly to > this point ;). *nod* > FWIW, I agree with most of the concerns that have been called out in the > thread so far, but the support question ultimately sounds moot to me. We > can reject the "configuration" enabled by this particular patch, but > afaik this LKL thing could just go and implement an mmu mode and > fundamentally do what it's trying to do today. If it's useful, users > will use it, ultimately hit bugs, and we'll have to deal with it one way > or another. Right, but we need to set expectations appropriately now, before we have users that depend on something we can't sanely support.... > Note that I'm not saying we have to support LKL. Rather, I view it as > equivalent to somebody off running some new non-standard/uncommon > architecture (or maybe a hypervisor is a better example in this case). Right, but we don't support every possible kernel/device out there with XFS. Up to now, no-mmu devices have been small, embedded systems that can't/won't fit XFS in their flash/RAM because of space/cost/need and so we've been able to completely ignore the issues such platforms have. We really don't pay much attention to 32 bit systems anymore, and no-mmu is much, much further down the list of "users who need XFS" than 32 bit arches.... > If the user runs into some low-level filesystem issue, the user probably > needs to get through whatever levels of support exist for that special > architecture/environment first and/or otherwise work with us to find a > reproducer on something more standard that we all have easy access to. > Just my .02, though. :) The biggest problem we have is that the kernel might be tripping over a problem introduced by some broken version of an app linked against some old version of LKL. That's my biggest worry here - it's not that LKL uses kernel code, it's that we no longer have any idea of what code has been modifying the filesystem nor any easy way of finding out. And we all know how hard it can be to get a straight answer to "explain exactly what you did" from bug reporters... Sure, once we *know* that the filesystem was modified by a random userspace application, we can say "please reproduce with a vanilla kernel", but it's getting to that point that can be problematic. This is why I think a proper FUSE module implemented on top of the userspace libxfs and is built as part of xfsprogs is a much better way give users userspace access to XFS filesystems. We can develop and support that directly, knowing exactly what code users are running and having a relatively controlled environment the filesystem code is running in.... Cheers, Dave. > > Brian > > > Cheers, > > > > Dave. > > -- > > Dave Chinner > > david@fromorbit.com > > > > _______________________________________________ > > xfs mailing list > > xfs@oss.sgi.com > > http://oss.sgi.com/mailman/listinfo/xfs >
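To give the proposal above a concrete shape: it is an ordinary FUSE daemon whose operations call into the userspace libxfs shipped with xfsprogs instead of into LKL. A bare skeleton follows; the libfuse wiring is the standard libfuse 2.x API, while the libxfs hooks are left as placeholder comments because porting and wiring up that code is exactly the work being proposed, and the file and function names here are invented for illustration.

/* xfsfuse.c: skeletal read-only XFS-FUSE wrapper (sketch only) */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <sys/stat.h>
#include <string.h>
#include <errno.h>

static int
xfsfuse_getattr(const char *path, struct stat *st)
{
	/* TODO: resolve 'path' with the userspace libxfs directory and
	 * inode code from xfsprogs and fill in *st from the on-disk
	 * inode. */
	memset(st, 0, sizeof(*st));
	return -ENOSYS;
}

static int
xfsfuse_read(const char *path, char *buf, size_t size, off_t off,
	     struct fuse_file_info *fi)
{
	/* TODO: map the file's extents via libxfs and copy data from
	 * the filesystem image into 'buf'. */
	return -ENOSYS;
}

static struct fuse_operations xfsfuse_ops = {
	.getattr	= xfsfuse_getattr,
	.read		= xfsfuse_read,
};

int
main(int argc, char *argv[])
{
	/* libxfs initialisation (libxfs_init()/libxfs_mount()) would go
	 * here before handing control to the FUSE event loop. */
	return fuse_main(argc, argv, &xfsfuse_ops, NULL);
}

Each xfsprogs release would then carry a matching, xfstests-covered version of such a wrapper, which is the maintenance model being argued for.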
On Mon, Nov 23, 2015 at 03:41:49AM +0200, Octavian Purdila wrote: > On Mon, Nov 23, 2015 at 12:44 AM, Dave Chinner <david@fromorbit.com> wrote: > > On Sat, Nov 21, 2015 at 12:26:47AM +0200, Octavian Purdila wrote: > >> On Fri, Nov 20, 2015 at 11:08 PM, Dave Chinner <david@fromorbit.com> wrote: > >> > On Fri, Nov 20, 2015 at 03:43:20PM +0200, Octavian Purdila wrote: > >> >> On Fri, Nov 20, 2015 at 1:24 AM, Dave Chinner <david@fromorbit.com> wrote: > >> >> > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote: > >> >> >> Naive implementation for non-mmu architectures: allocate physically > >> >> >> contiguous xfs buffers with alloc_pages. Terribly inefficient with > >> >> >> memory and fragmentation on high I/O loads but it may be good enough > >> >> >> for basic usage (which most non-mmu architectures will need). > >> >> > > >> >> > Can you please explain why you want to use XFS on low end, basic > >> >> > non-MMU devices? XFS is a high performance, enterprise/HPC level > >> >> > filesystem - it's not a filesystem designed for small IoT level > >> >> > devices - so I'm struggling to see why we'd want to expend any > >> >> > effort to make XFS work on such devices.... > >> >> > > >> >> > >> >> Hi David, > >> >> > >> >> Yes XFS as the main fs on this type of devices does not make sense, > >> >> but does it hurt to be able to perform basic operation on XFS from > >> >> these devices? Perhaps accessing an external medium formatted with > >> >> XFS? > >> >> > >> >> Another example is accessing VM images that are formatted with XFS. > >> >> Currently we can do that with tools like libguestfs that use a VM in > >> >> the background. I am working on a lighter solution for that where we > >> >> compile the Linux kernel as a library [1]. This allows access to the > >> >> filesystem without the need to use a full VM. > >> > > >> > That's hardly a "lighter solution" > >> > > >> > I'm kinda tired of the ongoing "hack random shit" approach to > >> > container development. > >> > >> Since apparently there is a container devs hunting party going on > >> right now, let me quickly confess that LKL has nothing to do with > >> (them be damned) containers :) > >> > >> On a more serious note, LKL was not developed for containers or to try > >> to circumvent privileged mounts. It was developed to allow the Linux > >> kernel code to be reused in things like simple tools that allows one > >> to modify a filesystem image. > > > > Anything tool that modifies an XFS filesystem that is not directly > > maintained by the XFS developers voids any kind of support we can > > supply. Just like the fact we don't support tainted kernels because > > the 3rd party binary code is unknowable (and usually crap), having > > the kernel code linked with random 3rd party userspace application > > code is completely unsupportable by us. > > > > Perhaps tainting the kernel is a solution when running unknown > applications linked with LKL. It's not the *kernel* that is the problem - it is LKL that is the tainted code! > I would argue that applications that are maintained together with LKL > (e.g. lklfuse in tools/lkl) should not taint the kernel because those > applications will be under the control of kernel developers. I completely disagree. Just because the code is in the kernel tree, it doesn't mean it's controlled, reviewed, tested or maintained by the relevant subsystem maintainers. 
If the subsystem maintainers are not actively maintaining/testing those tools, then users can't expect the subsystem maintainers to support them. > I would > also argue that mounting a filesystem read-only should not taint the > kernel either. LKL != kernel. > >> >> And a final example is linking the bootloader code with LKL to access > >> >> the filesystem. This has a hard requirement on non-mmu. > >> > > >> > No way. We *can't* support filesystems that have had bootloaders > >> > make arbitrary changes to the filesystem without the knowlege of the > >> > OS that *owns the filesystem*. Similarly, we cannot support random > >> > applications that internally mount and modify filesystem images in > >> > ways we can't see, control, test or simulate. Sure, they use the > >> > kernel code, but that doesn't stop them from doing stupid shit that > >> > could corrupt the filesystem image. So, no, we are not going to > >> > support giving random applications direct access to XFS filesystem > >> > images, even via LKL. > >> > > >> > >> LKL only exports the Linux kernel system calls and nothing else to > >> applications. Because of that, there should not be any loss of control > >> or visibility to the XFS fs driver. > > > > It runs in the same address space as the user application, yes? And > > hence application bugs can cause the kernel code to malfunction, > > yes? > > > > Most non-mmu architecture have the same issue and nevertheless non-mmu > is still supported in Linux (including most filesystems). That doesn't mean all subsystems in the kernel support users on non-mmu systems. > Also, filesystem code runs in the same address space with other kernel > code and drivers and a bug anywhere in the kernel can cause filesystem > code to malfunction. Well, yes, but we have trust other kernel developers keep their ship in good shape, too. But we don't trust *out of tree code* - that taints the kernel and most upstream kernel developers will notice such taints when there are weird errors being reported.... But you want to extend that trust to whatever random code gets chucked into tools/lkl. I'm stretched far enough already having to keep up with mm, VFS, locking and filesystem developments that I don't have time to keep up with what LKL is doing or what applications people are writing. You're not going to get subsystem maintainers being able to find the time to review and test applications that get stuffed into tools/lkl, so from that perspective it's still a complete crapshoot. > >> > I really don't see how using LKL to give userspace access to XFS > >> > filesystems is a better solution than actually writing a proper, > >> > supported XFS-FUSE module. LKL is so full of compromises that it's > >> > going to be unworkable and unsupportable in practice... > >> > >> Could you elaborate on some of these issues? > > > > Start with "is a no-mmu architecture" and all the compromises that > > means the kernel code needs to make, > > I don't see non-mmu as a compromise. It is supported by Linux and most > filesystems work fine on non-mmu architectures. most != all. XFS is unashamedly aimed and developed for high performance systems. i.e. enterprise servers, HPC, fileservers, cloud storage infrastructure, etc. IOWs, we don't even develop for desktop machines, even though XFS performs adequately for most desktop workloads. We've never cared about non-mmu systems because of the requirements we have on virtually mapped buffers and the fact they don't exist in the target markets XFS aimed at. 
And, really, LKL doesn't change that... > LKL can be implemented > as a mmu architecture. Having it as a non-mmu architecture has the > advantages of allowing it to run in more constrained environments like > bootloaders. The OS owns the filesystem, not the bootloader. If the bootloader is modifying filesystems (e.g. by running log recovery to mount the fs internally to find the kernel/initrd files), then the bootloader compromises the filesystem integrity. We do not support such configurations at all. > > and finish it off with "LKL linked applications will > > never be tested by their developers over the full functionality the > > LKL provides them with". > > You lost me here. Why does an application developer have to test the > full functionality of a library it is linked with? How much filesystem testing did you actually do with lklfuse? What about load testing? Data integrity/crash/power fail and recovery testing? Scalability testing? Or have you just assumed that it'll all just work fine because the kernel fs developers test this stuff? Yes, we've tested the kernel code along these lines *in kernel space*. That doesn't mean that an LKL application that uses the kernel functionality *in user space* will behave the same way or, indeed, function correctly in adverse circumstances. Application developers are not going to be aware of such issues, and they aren't going to test for such situations - they are simply going to assume "this works just fine" until it doesn't.... Cheers, Dave.
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 8ecffb3..50b5246 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -261,6 +261,7 @@ xfs_buf_free(
 	ASSERT(list_empty(&bp->b_lru));
 
 	if (bp->b_flags & _XBF_PAGES) {
+#ifdef CONFIG_MMU
 		uint		i;
 
 		if (xfs_buf_is_vmapped(bp))
@@ -272,6 +273,10 @@ xfs_buf_free(
 
 			__free_page(page);
 		}
+#else
+		free_pages((unsigned long)page_to_virt(bp->b_pages[0]),
+			   order_base_2(bp->b_page_count));
+#endif
 	} else if (bp->b_flags & _XBF_KMEM)
 		kmem_free(bp->b_addr);
 	_xfs_buf_free_pages(bp);
@@ -338,7 +343,14 @@ use_alloc_page:
 		struct page	*page;
 		uint		retries = 0;
 retry:
+#ifdef CONFIG_MMU
 		page = alloc_page(gfp_mask);
+#else
+		if (i == 0)
+			page = alloc_pages(gfp_mask, order_base_2(page_count));
+		else
+			page = bp->b_pages[0] + i;
+#endif
 		if (unlikely(page == NULL)) {
 			if (flags & XBF_READ_AHEAD) {
 				bp->b_page_count = i;
@@ -372,8 +384,11 @@ retry:
 	return 0;
 
 out_free_pages:
+#ifdef CONFIG_MMU
 	for (i = 0; i < bp->b_page_count; i++)
 		__free_page(bp->b_pages[i]);
+#endif
+
 	return error;
 }
 
@@ -392,6 +407,7 @@ _xfs_buf_map_pages(
 	} else if (flags & XBF_UNMAPPED) {
 		bp->b_addr = NULL;
 	} else {
+#ifdef CONFIG_MMU
 		int retried = 0;
 		unsigned noio_flag;
 
@@ -412,6 +428,9 @@ _xfs_buf_map_pages(
 			vm_unmap_aliases();
 		} while (retried++ <= 1);
 		memalloc_noio_restore(noio_flag);
+#else
+		bp->b_addr = page_to_virt(bp->b_pages[0]);
+#endif
 
 		if (!bp->b_addr)
 			return -ENOMEM;
@@ -816,11 +835,19 @@ xfs_buf_get_uncached(
 	if (error)
 		goto fail_free_buf;
 
+#ifdef CONFIG_MMU
 	for (i = 0; i < page_count; i++) {
 		bp->b_pages[i] = alloc_page(xb_to_gfp(flags));
 		if (!bp->b_pages[i])
 			goto fail_free_mem;
 	}
+#else
+	bp->b_pages[0] = alloc_pages(flags, order_base_2(page_count));
+	if (!bp->b_pages[0])
+		goto fail_free_buf;
+	for (i = 1; i < page_count; i++)
+		bp->b_pages[i] = bp->b_pages[i-1] + 1;
+#endif
 	bp->b_flags |= _XBF_PAGES;
 
 	error = _xfs_buf_map_pages(bp, 0);
Naive implementation for non-mmu architectures: allocate physically contiguous xfs buffers with alloc_pages. Terribly inefficient with memory and fragmentation on high I/O loads but it may be good enough for basic usage (which most non-mmu architectures will need).

This patch was tested with lklfuse [1] and basic operations seem to work even with 16MB allocated for LKL.

[1] https://github.com/lkl/linux

Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
---
 fs/xfs/xfs_buf.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)