
[RFC] xfs: support for non-mmu architectures

Message ID 1447800381-20167-1-git-send-email-octavian.purdila@intel.com (mailing list archive)
State New, archived

Commit Message

Octavian Purdila Nov. 17, 2015, 10:46 p.m. UTC
Naive implementation for non-mmu architectures: allocate physically
contiguous xfs buffers with alloc_pages. Terribly inefficient with
memory and fragmentation on high I/O loads but it may be good enough
for basic usage (which most non-mmu architectures will need).

This patch was tested with lklfuse [1] and basic operations seem to
work even with 16MB allocated for LKL.

[1] https://github.com/lkl/linux

Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
---
 fs/xfs/xfs_buf.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

Comments

Brian Foster Nov. 19, 2015, 3:55 p.m. UTC | #1
On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote:
> Naive implementation for non-mmu architectures: allocate physically
> contiguous xfs buffers with alloc_pages. Terribly inefficient with
> memory and fragmentation on high I/O loads but it may be good enough
> for basic usage (which most non-mmu architectures will need).
> 
> This patch was tested with lklfuse [1] and basic operations seems to
> work even with 16MB allocated for LKL.
> 
> [1] https://github.com/lkl/linux
> 
> Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
> ---

Interesting, though this makes me wonder why we couldn't have a new
_XBF_VMEM (for example) buffer type that uses vmalloc(). I'm not
familiar with mmu-less context, but I see that mm/nommu.c has a
__vmalloc() interface that looks like it ultimately translates into an
alloc_pages() call. Would that accomplish what this patch is currently
trying to do?

I ask because it seems like that would help clean up the code a bit, for
one. It might also facilitate some degree of testing of the XFS bits
(even if utilized sparingly in DEBUG mode if it weren't suitable enough
for generic/mmu use). We currently allocate and map the buffer pages
separately and I'm not sure if there's any particular reason for doing
that outside of some congestion handling in the allocation code and
XBF_UNMAPPED buffers, the latter probably being irrelevant for nommu.
Any other thoughts on that?

>  fs/xfs/xfs_buf.c | 27 +++++++++++++++++++++++++++
>  1 file changed, 27 insertions(+)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 8ecffb3..50b5246 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
...
> @@ -816,11 +835,19 @@ xfs_buf_get_uncached(
>  	if (error)
>  		goto fail_free_buf;
>  
> +#ifdef CONFIG_MMU
>  	for (i = 0; i < page_count; i++) {
>  		bp->b_pages[i] = alloc_page(xb_to_gfp(flags));
>  		if (!bp->b_pages[i])
>  			goto fail_free_mem;
>  	}
> +#else
> +	bp->b_pages[0] = alloc_pages(flags, order_base_2(page_count));
> +	if (!bp->b_pages[0])
> +		goto fail_free_buf;
> +	for (i = 1; i < page_count; i++)
> +		bp->b_pages[i] = bp->b_pages[i-1] + 1;
> +#endif

We still have a path into __free_page() further down in this function if
_xfs_buf_map_pages() fails. Granted, _xfs_buf_map_pages() should
probably never fail in this case, but it still looks like a land mine at
the very least.
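
For reference, the error handling at the bottom of
xfs_buf_get_uncached() is roughly this (paraphrased, not the exact
code):

        error = _xfs_buf_map_pages(bp, 0);
        if (unlikely(error)) {
                xfs_warn(target->bt_mount,
                        "%s: failed to map pages", __func__);
                goto fail_free_mem;
        }
        ...
 fail_free_mem:
        /*
         * Frees every page individually. With the #else branch above,
         * pages 1..N-1 were never allocated on their own (they are just
         * offsets into one higher-order allocation), so this would free
         * pages that were never individually allocated.
         */
        while (--i >= 0)
                __free_page(bp->b_pages[i]);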

Brian

>  	bp->b_flags |= _XBF_PAGES;
>  
>  	error = _xfs_buf_map_pages(bp, 0);
> -- 
> 1.9.1
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
Octavian Purdila Nov. 19, 2015, 8:54 p.m. UTC | #2
On Thu, Nov 19, 2015 at 5:55 PM, Brian Foster <bfoster@redhat.com> wrote:
> On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote:
>> Naive implementation for non-mmu architectures: allocate physically
>> contiguous xfs buffers with alloc_pages. Terribly inefficient with
>> memory and fragmentation on high I/O loads but it may be good enough
>> for basic usage (which most non-mmu architectures will need).
>>
>> This patch was tested with lklfuse [1] and basic operations seems to
>> work even with 16MB allocated for LKL.
>>
>> [1] https://github.com/lkl/linux
>>
>> Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
>> ---
>
> Interesting, though this makes me wonder why we couldn't have a new
> _XBF_VMEM (for example) buffer type that uses vmalloc(). I'm not
> familiar with mmu-less context, but I see that mm/nommu.c has a
> __vmalloc() interface that looks like it ultimately translates into an
> alloc_pages() call. Would that accomplish what this patch is currently
> trying to do?
>

Hi Brian,

Ah, that sounds nice! We could get rid of the #ifdefs and use a common
path in that case. vmalloc should work on non-mmu and it seems that
there is a vmalloc_to_page() we can use to get the physical pages.
I'll give it a try.
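
Something like this, perhaps (completely untested sketch, with
_XBF_VMEM as the hypothetical new flag):

        /*
         * Allocate and map in one step; on !CONFIG_MMU __vmalloc()
         * falls back to a plain contiguous page allocation internally.
         */
        bp->b_addr = __vmalloc(page_count << PAGE_SHIFT,
                               xb_to_gfp(flags) | __GFP_ZERO, PAGE_KERNEL);
        if (!bp->b_addr)
                goto fail_free_buf;
        bp->b_flags |= _XBF_VMEM;

        /* recover the backing pages so the bio code can keep using b_pages[] */
        for (i = 0; i < page_count; i++)
                bp->b_pages[i] = vmalloc_to_page(bp->b_addr + i * PAGE_SIZE);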

Is there a central place where we could enable the new _XBF_VMEM to be
the default for non-mmu arches?

> I ask because it seems like that would help clean up the code a bit, for
> one. It might also facilitate some degree of testing of the XFS bits
> (even if utilized sparingly in DEBUG mode if it weren't suitable enough
> for generic/mmu use). We currently allocate and map the buffer pages
> separately and I'm not sure if there's any particular reasons for doing
> that outside of some congestion handling in the allocation code and
> XBF_UNMAPPED buffers, the latter probably being irrelevant for nommu.
> Any other thoughts on that?
>
>>  fs/xfs/xfs_buf.c | 27 +++++++++++++++++++++++++++
>>  1 file changed, 27 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
>> index 8ecffb3..50b5246 100644
>> --- a/fs/xfs/xfs_buf.c
>> +++ b/fs/xfs/xfs_buf.c
> ...
>> @@ -816,11 +835,19 @@ xfs_buf_get_uncached(
>>       if (error)
>>               goto fail_free_buf;
>>
>> +#ifdef CONFIG_MMU
>>       for (i = 0; i < page_count; i++) {
>>               bp->b_pages[i] = alloc_page(xb_to_gfp(flags));
>>               if (!bp->b_pages[i])
>>                       goto fail_free_mem;
>>       }
>> +#else
>> +     bp->b_pages[0] = alloc_pages(flags, order_base_2(page_count));
>> +     if (!bp->b_pages[0])
>> +             goto fail_free_buf;
>> +     for (i = 1; i < page_count; i++)
>> +             bp->b_pages[i] = bp->b_pages[i-1] + 1;
>> +#endif
>
> We still have a path into __free_page() further down in this function if
> _xfs_buf_map_pages() fails. Granted, _xfs_buf_map_pages() should
> probably never fail in this case, but it still looks like a land mine at
> the very least.
>

OK. Adding an i = 1; right before the #endif should take care of that,
if the vmalloc approach does not work.
Dave Chinner Nov. 19, 2015, 11:24 p.m. UTC | #3
On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote:
> Naive implementation for non-mmu architectures: allocate physically
> contiguous xfs buffers with alloc_pages. Terribly inefficient with
> memory and fragmentation on high I/O loads but it may be good enough
> for basic usage (which most non-mmu architectures will need).

Can you please explain why you want to use XFS on low end, basic
non-MMU devices? XFS is a high performance, enterprise/HPC level
filesystem - it's not a filesystem designed for small IoT level
devices - so I'm struggling to see why we'd want to expend any
effort to make XFS work on such devices....

Cheers,

Dave.
Dave Chinner Nov. 19, 2015, 11:35 p.m. UTC | #4
On Thu, Nov 19, 2015 at 10:55:25AM -0500, Brian Foster wrote:
> On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote:
> > Naive implementation for non-mmu architectures: allocate physically
> > contiguous xfs buffers with alloc_pages. Terribly inefficient with
> > memory and fragmentation on high I/O loads but it may be good enough
> > for basic usage (which most non-mmu architectures will need).
> > 
> > This patch was tested with lklfuse [1] and basic operations seems to
> > work even with 16MB allocated for LKL.
> > 
> > [1] https://github.com/lkl/linux
> > 
> > Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
> > ---
> 
> Interesting, though this makes me wonder why we couldn't have a new
> _XBF_VMEM (for example) buffer type that uses vmalloc(). I'm not
> familiar with mmu-less context, but I see that mm/nommu.c has a
> __vmalloc() interface that looks like it ultimately translates into an
> alloc_pages() call. Would that accomplish what this patch is currently
> trying to do?

vmalloc is always a last resort.  vmalloc space on 32 bit systems is
extremely limited and it is easy to exhaust with XFS.

Also, vmalloc limits the control we have over allocation context
(e.g. the hoops we jump through in kmem_alloc_large() to maintain
GFP_NOFS contexts), so just using vmalloc doesn't make things much
simpler from an XFS perspective.
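
The "hoops" being, roughly (paraphrased from kmem_zalloc_large(), not
the exact code):

        /*
         * __vmalloc() allocates its page tables with GFP_KERNEL no
         * matter what gfp mask is passed in, so if we might be inside
         * a transaction the whole task is flipped into NOIO mode
         * around the call to stop reclaim recursing back into the
         * filesystem.
         */
        if (current->flags & PF_FSTRANS)
                noio_flag = memalloc_noio_save();

        ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO,
                        PAGE_KERNEL);

        if (current->flags & PF_FSTRANS)
                memalloc_noio_restore(noio_flag);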

> I ask because it seems like that would help clean up the code a bit, for
> one. It might also facilitate some degree of testing of the XFS bits
> (even if utilized sparingly in DEBUG mode if it weren't suitable enough
> for generic/mmu use). We currently allocate and map the buffer pages
> separately and I'm not sure if there's any particular reasons for doing
> that outside of some congestion handling in the allocation code and
> XBF_UNMAPPED buffers, the latter probably being irrelevant for nommu.
> Any other thoughts on that?

We could probably clean the code up more (the allocation logic
is now largely a historic relic) but I'm not convinced yet that we
should be spending any time trying to specifically support mmu-less
hardware.

Cheers,

Dave.
Richard Weinberger Nov. 19, 2015, 11:54 p.m. UTC | #5
On Fri, Nov 20, 2015 at 12:24 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote:
>> Naive implementation for non-mmu architectures: allocate physically
>> contiguous xfs buffers with alloc_pages. Terribly inefficient with
>> memory and fragmentation on high I/O loads but it may be good enough
>> for basic usage (which most non-mmu architectures will need).
>
> Can you please explain why you want to use XFS on low end, basic
> non-MMU devices? XFS is a high performance, enterprise/HPC level
> filesystem - it's not a filesystem designed for small IoT level
> devices - so I'm struggling to see why we'd want to expend any
> effort to make XFS work on such devices....

The use case is the Linux Kernel Library:
https://lkml.org/lkml/2015/11/3/706

Using LKL and fuse you can mount any kernel filesystem using fuse
as non-root.
Dave Chinner Nov. 20, 2015, 12:58 a.m. UTC | #6
On Fri, Nov 20, 2015 at 12:54:02AM +0100, Richard Weinberger wrote:
> On Fri, Nov 20, 2015 at 12:24 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote:
> >> Naive implementation for non-mmu architectures: allocate physically
> >> contiguous xfs buffers with alloc_pages. Terribly inefficient with
> >> memory and fragmentation on high I/O loads but it may be good enough
> >> for basic usage (which most non-mmu architectures will need).
> >
> > Can you please explain why you want to use XFS on low end, basic
> > non-MMU devices? XFS is a high performance, enterprise/HPC level
> > filesystem - it's not a filesystem designed for small IoT level
> > devices - so I'm struggling to see why we'd want to expend any
> > effort to make XFS work on such devices....
> 
> The use case is the Linux Kernel Library:
> https://lkml.org/lkml/2015/11/3/706
> 
> Using LKL and fuse you can mount any kernel filesystem using fuse
> as non-root.

IOWs, because we said no to unprivileged mounts, instead the
proposal is to link all the kernel code into userspace so you can
do unprivileged mounts that way?

IOWs, you get to say "it's secure because it's in userspace" and leave
us filesystem people with all the shit that comes with allowing
users to mount random untrusted filesystem images using code that
was never designed to allow that to happen?

Fmeh. Not a convincing argument by far.

Cheers,

Dave.
Octavian Purdila Nov. 20, 2015, 1:43 p.m. UTC | #7
On Fri, Nov 20, 2015 at 1:24 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote:
>> Naive implementation for non-mmu architectures: allocate physically
>> contiguous xfs buffers with alloc_pages. Terribly inefficient with
>> memory and fragmentation on high I/O loads but it may be good enough
>> for basic usage (which most non-mmu architectures will need).
>
> Can you please explain why you want to use XFS on low end, basic
> non-MMU devices? XFS is a high performance, enterprise/HPC level
> filesystem - it's not a filesystem designed for small IoT level
> devices - so I'm struggling to see why we'd want to expend any
> effort to make XFS work on such devices....
>

Hi David,

Yes, XFS as the main fs on this type of device does not make sense,
but does it hurt to be able to perform basic operations on XFS from
these devices? Perhaps accessing an external medium formatted with
XFS?

Another example is accessing VM images that are formatted with XFS.
Currently we can do that with tools like libguestfs that use a VM in
the background. I am working on a lighter solution for that where we
compile the Linux kernel as a library [1]. This allows access to the
filesystem without the need to use a full VM.

And a final example is linking the bootloader code with LKL to access
the filesystem. This has a hard requirement on non-mmu.

So, IMHO there are use cases for using XFS on non-mmu architectures. I
think it all boils down to: is the patch simple enough to not put an
unreasonable maintenance burden on developers?

[1] https://github.com/lkl/linux

Thanks,
Tavi
Octavian Purdila Nov. 20, 2015, 2:09 p.m. UTC | #8
On Fri, Nov 20, 2015 at 1:35 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Nov 19, 2015 at 10:55:25AM -0500, Brian Foster wrote:
>> On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote:
>> > Naive implementation for non-mmu architectures: allocate physically
>> > contiguous xfs buffers with alloc_pages. Terribly inefficient with
>> > memory and fragmentation on high I/O loads but it may be good enough
>> > for basic usage (which most non-mmu architectures will need).
>> >
>> > This patch was tested with lklfuse [1] and basic operations seems to
>> > work even with 16MB allocated for LKL.
>> >
>> > [1] https://github.com/lkl/linux
>> >
>> > Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
>> > ---
>>
>> Interesting, though this makes me wonder why we couldn't have a new
>> _XBF_VMEM (for example) buffer type that uses vmalloc(). I'm not
>> familiar with mmu-less context, but I see that mm/nommu.c has a
>> __vmalloc() interface that looks like it ultimately translates into an
>> alloc_pages() call. Would that accomplish what this patch is currently
>> trying to do?
>
> vmalloc is always a last resort.  vmalloc space on 32 bit systems is
> extremely limited and it is easy to exhaust with XFS.

Doesn't vm_map_ram use the same space as vmalloc?

>
> Also, vmalloc limits the control we have over allocation context
> (e.g. the hoops we jump through in kmem_alloc_large() to maintain
> GFP_NOFS contexts), so just using vmalloc doesn't make things much
> simpler from an XFS perspective.
>

I have zero experience with XFS, sorry if I ask obvious questions.
AFAICS there are no memalloc_noio_save() hoops in the page allocation
part, just in the vm_map_ram part. Could we preserve the
memalloc_noio_save() part and use vmalloc instead of vm_map_ram?

>> I ask because it seems like that would help clean up the code a bit, for
>> one. It might also facilitate some degree of testing of the XFS bits
>> (even if utilized sparingly in DEBUG mode if it weren't suitable enough
>> for generic/mmu use). We currently allocate and map the buffer pages
>> separately and I'm not sure if there's any particular reasons for doing
>> that outside of some congestion handling in the allocation code and
>> XBF_UNMAPPED buffers, the latter probably being irrelevant for nommu.
>> Any other thoughts on that?
>
> We could probably clean the code up more (the allocation logic
> is now largely a historic relic) but I'm not convinced yet that we
> should be spending any time trying to specifically support mmu-less
> hardware.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
Octavian Purdila Nov. 20, 2015, 2:26 p.m. UTC | #9
On Fri, Nov 20, 2015 at 2:58 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Fri, Nov 20, 2015 at 12:54:02AM +0100, Richard Weinberger wrote:
>> On Fri, Nov 20, 2015 at 12:24 AM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote:
>> >> Naive implementation for non-mmu architectures: allocate physically
>> >> contiguous xfs buffers with alloc_pages. Terribly inefficient with
>> >> memory and fragmentation on high I/O loads but it may be good enough
>> >> for basic usage (which most non-mmu architectures will need).
>> >
>> > Can you please explain why you want to use XFS on low end, basic
>> > non-MMU devices? XFS is a high performance, enterprise/HPC level
>> > filesystem - it's not a filesystem designed for small IoT level
>> > devices - so I'm struggling to see why we'd want to expend any
>> > effort to make XFS work on such devices....
>>
>> The use case is the Linux Kernel Library:
>> https://lkml.org/lkml/2015/11/3/706
>>
>> Using LKL and fuse you can mount any kernel filesystem using fuse
>> as non-root.
>
> IOWs, because we said no to unprivileged mounts, instead the
> proposal is to linking all the kernel code into userspace so you can
> do unprivielged mounts that way?
>

LKL's goal is to make it easy for various applications to reuse Linux
kernel code instead of re-implementing it. Mounting filesystem images
is just one of the applications.

> IOWs, you get to say "it secure because it's in userspace" and leave
> us filesystem people with all the shit that comes with allowing
> users to mount random untrusted filesystem images using code that
> was never designed to allow that to happen?
>

It is already possible to mount arbitrary filesystem images in
userspace using VMs. LKL doesn't change that, it just reduces the
number of dependencies you need to do so.

Could you expand on what burden this use case puts on fs
developers? I am sure that, if needed, we can put restrictions in LKL
to avoid that.
Brian Foster Nov. 20, 2015, 3:11 p.m. UTC | #10
On Thu, Nov 19, 2015 at 10:54:34PM +0200, Octavian Purdila wrote:
> On Thu, Nov 19, 2015 at 5:55 PM, Brian Foster <bfoster@redhat.com> wrote:
> > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote:
> >> Naive implementation for non-mmu architectures: allocate physically
> >> contiguous xfs buffers with alloc_pages. Terribly inefficient with
> >> memory and fragmentation on high I/O loads but it may be good enough
> >> for basic usage (which most non-mmu architectures will need).
> >>
> >> This patch was tested with lklfuse [1] and basic operations seems to
> >> work even with 16MB allocated for LKL.
> >>
> >> [1] https://github.com/lkl/linux
> >>
> >> Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
> >> ---
> >
> > Interesting, though this makes me wonder why we couldn't have a new
> > _XBF_VMEM (for example) buffer type that uses vmalloc(). I'm not
> > familiar with mmu-less context, but I see that mm/nommu.c has a
> > __vmalloc() interface that looks like it ultimately translates into an
> > alloc_pages() call. Would that accomplish what this patch is currently
> > trying to do?
> >
> 
> Hi Brian,
> 
> Ah, that sounds nice! We could get rid of the #ifdefs and use a common
> path in that case. vmalloc should work on non-mmu and it seems that
> there is a vmalloc_to_page() we can use to get the physical pages.
> I'll give it a try.
> 
> Is there a central place where we could enable the new _XBF_VMEM to be
> the default for non-mmu arches?
> 

I think a generic mechanism is the appropriate short term goal as
opposed to a common codepath. IOWs, add the ability to support a new
XBF_VMEM buffer type and enforce use of that type when !CONFIG_MMU. So
you'd still have an ifdef somewhere, but it wouldn't be shoehorned into
the current separate alloc/map implementation.
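
i.e. at buffer setup time something as simple as (hypothetical flag
name, sketch only):

#ifndef CONFIG_MMU
        /*
         * nommu: there is no way to map discontiguous pages, so force
         * the vmalloc-backed (and there, physically contiguous)
         * buffer type for everything.
         */
        flags |= _XBF_VMEM;
#endif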

I don't think we want to change the existing buffer mechanism straight
away to use this based on some of the things Dave calls out. My hope is
more that we can do the following for the generic (CONFIG_MMU) case:

1.) Enable this selectively/randomly when CONFIG_XFS_DEBUG is enabled to
provide some test coverage of the XFS buffer type without the need for a
nommu setup (not a replacement for nommu testing, of course).

2.) Over time, move the current buffer mechanism over to this type if we
can eventually clean up some of the warts that pin us to our current
implementation.

I'd suggest letting some of the parallel discussions play out before
going too far down this path, however...

Brian

> > I ask because it seems like that would help clean up the code a bit, for
> > one. It might also facilitate some degree of testing of the XFS bits
> > (even if utilized sparingly in DEBUG mode if it weren't suitable enough
> > for generic/mmu use). We currently allocate and map the buffer pages
> > separately and I'm not sure if there's any particular reasons for doing
> > that outside of some congestion handling in the allocation code and
> > XBF_UNMAPPED buffers, the latter probably being irrelevant for nommu.
> > Any other thoughts on that?
> >
> >>  fs/xfs/xfs_buf.c | 27 +++++++++++++++++++++++++++
> >>  1 file changed, 27 insertions(+)
> >>
> >> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> >> index 8ecffb3..50b5246 100644
> >> --- a/fs/xfs/xfs_buf.c
> >> +++ b/fs/xfs/xfs_buf.c
> > ...
> >> @@ -816,11 +835,19 @@ xfs_buf_get_uncached(
> >>       if (error)
> >>               goto fail_free_buf;
> >>
> >> +#ifdef CONFIG_MMU
> >>       for (i = 0; i < page_count; i++) {
> >>               bp->b_pages[i] = alloc_page(xb_to_gfp(flags));
> >>               if (!bp->b_pages[i])
> >>                       goto fail_free_mem;
> >>       }
> >> +#else
> >> +     bp->b_pages[0] = alloc_pages(flags, order_base_2(page_count));
> >> +     if (!bp->b_pages[0])
> >> +             goto fail_free_buf;
> >> +     for (i = 1; i < page_count; i++)
> >> +             bp->b_pages[i] = bp->b_pages[i-1] + 1;
> >> +#endif
> >
> > We still have a path into __free_page() further down in this function if
> > _xfs_buf_map_pages() fails. Granted, _xfs_buf_map_pages() should
> > probably never fail in this case, but it still looks like a land mine at
> > the very least.
> >
> 
> OK. Adding a i = 1; right before the #endif should take care of that,
> if the vmalloc approach does not work.
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
Brian Foster Nov. 20, 2015, 3:11 p.m. UTC | #11
On Fri, Nov 20, 2015 at 10:35:47AM +1100, Dave Chinner wrote:
> On Thu, Nov 19, 2015 at 10:55:25AM -0500, Brian Foster wrote:
> > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote:
> > > Naive implementation for non-mmu architectures: allocate physically
> > > contiguous xfs buffers with alloc_pages. Terribly inefficient with
> > > memory and fragmentation on high I/O loads but it may be good enough
> > > for basic usage (which most non-mmu architectures will need).
> > > 
> > > This patch was tested with lklfuse [1] and basic operations seems to
> > > work even with 16MB allocated for LKL.
> > > 
> > > [1] https://github.com/lkl/linux
> > > 
> > > Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
> > > ---
> > 
> > Interesting, though this makes me wonder why we couldn't have a new
> > _XBF_VMEM (for example) buffer type that uses vmalloc(). I'm not
> > familiar with mmu-less context, but I see that mm/nommu.c has a
> > __vmalloc() interface that looks like it ultimately translates into an
> > alloc_pages() call. Would that accomplish what this patch is currently
> > trying to do?
> 
> vmalloc is always a last resort.  vmalloc space on 32 bit systems is
> extremely limited and it is easy to exhaust with XFS.
> 

Sure, but my impression is that a vmalloc() buffer is roughly equivalent
in this regard to a current !XBF_UNMAPPED && size > PAGE_SIZE buffer. We
just do the allocation and mapping separately (presumably for other
reasons).

> Also, vmalloc limits the control we have over allocation context
> (e.g. the hoops we jump through in kmem_alloc_large() to maintain
> GFP_NOFS contexts), so just using vmalloc doesn't make things much
> simpler from an XFS perspective.
> 

The comment in kmem_zalloc_large() calls out some apparent hardcoded
allocation flags down in the depths of vmalloc(). It looks to me that
page allocation (__vmalloc_area_node()) actually uses the provided
flags, so I'm not following the "data page" part of that comment.
Indeed, I do see that this is not the case down in calls like
pmd_alloc_one(), pte_alloc_one_kernel(), etc., associated with page
table management.

Those latter calls are all from following down through the
map_vm_area()->vmap_page_range() codepath from __vmalloc_area_node(). We
call vm_map_ram() directly from _xfs_buf_map_pages(), which itself calls
down into the same code. Indeed, we already protect ourselves here via
the same memalloc_noio_save() mechanism that kmem_zalloc_large() uses.
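
That protection being, roughly (trimmed from _xfs_buf_map_pages()):

        noio_flag = memalloc_noio_save();
        do {
                bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
                                        -1, PAGE_KERNEL);
                if (bp->b_addr)
                        break;
                vm_unmap_aliases();
        } while (retried++ <= 1);
        memalloc_noio_restore(noio_flag);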

I suspect there's more to it than that because it does look like
vm_map_ram() has a different mechanism for managing vmalloc space for
certain (smaller) allocations, either of which I'm not really familiar
with. That aside, I don't see how vmalloc() introduces any new
allocation context issues for those buffers where we already set up a
multi-page mapping.

We still have the somewhat customized page allocation code in
xfs_buf_allocate_memory() to contend with. I actually think it would be
useful to have a DEBUG sysfs tunable to turn on vmalloc() buffers and
actually test how effective some of this code is.

> > I ask because it seems like that would help clean up the code a bit, for
> > one. It might also facilitate some degree of testing of the XFS bits
> > (even if utilized sparingly in DEBUG mode if it weren't suitable enough
> > for generic/mmu use). We currently allocate and map the buffer pages
> > separately and I'm not sure if there's any particular reasons for doing
> > that outside of some congestion handling in the allocation code and
> > XBF_UNMAPPED buffers, the latter probably being irrelevant for nommu.
> > Any other thoughts on that?
> 
> We could probably clean the code up more (the allocation logic
> is now largely a historic relic) but I'm not convinced yet that we
> should be spending any time trying to specifically support mmu-less
> hardware.
> 

Fair point, we'll see where the use case discussion goes. That said, I
was a little surprised that this is all that was required to enable
nommu support. If that is indeed the case and we aren't in for a series
of subsequent nommu specific changes (Octavian?) by letting this
through, what's the big deal? This seems fairly harmless to me as is,
particularly if it can be semi-tested via DEBUG mode and has potential
generic use down the road.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
Brian Foster Nov. 20, 2015, 3:24 p.m. UTC | #12
On Fri, Nov 20, 2015 at 04:26:28PM +0200, Octavian Purdila wrote:
> On Fri, Nov 20, 2015 at 2:58 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Fri, Nov 20, 2015 at 12:54:02AM +0100, Richard Weinberger wrote:
> >> On Fri, Nov 20, 2015 at 12:24 AM, Dave Chinner <david@fromorbit.com> wrote:
> >> > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote:
> >> >> Naive implementation for non-mmu architectures: allocate physically
> >> >> contiguous xfs buffers with alloc_pages. Terribly inefficient with
> >> >> memory and fragmentation on high I/O loads but it may be good enough
> >> >> for basic usage (which most non-mmu architectures will need).
> >> >
> >> > Can you please explain why you want to use XFS on low end, basic
> >> > non-MMU devices? XFS is a high performance, enterprise/HPC level
> >> > filesystem - it's not a filesystem designed for small IoT level
> >> > devices - so I'm struggling to see why we'd want to expend any
> >> > effort to make XFS work on such devices....
> >>
> >> The use case is the Linux Kernel Library:
> >> https://lkml.org/lkml/2015/11/3/706
> >>
> >> Using LKL and fuse you can mount any kernel filesystem using fuse
> >> as non-root.
> >
> > IOWs, because we said no to unprivileged mounts, instead the
> > proposal is to linking all the kernel code into userspace so you can
> > do unprivielged mounts that way?
> >
> 
> LKL's goal is to make it easy for various applications to reuse Linux
> kernel code instead of re-implementing it. Mounting filesystem images
> is just one of the applications.
> 
> > IOWs, you get to say "it secure because it's in userspace" and leave
> > us filesystem people with all the shit that comes with allowing
> > users to mount random untrusted filesystem images using code that
> > was never designed to allow that to happen?
> >
> 
> It is already possible to mount arbitrary filesystem images in
> userspace using VMs . LKL doesn't change that, it just reduces the
> amount of dependencies you need to do so.
> 

Perhaps a dumb question, but I'm not quite putting 2+2 together here.
When I see nommu, I'm generally thinking hardware characteristics, but
we're talking about a userspace kernel library here. So can you
elaborate on how this relates to nommu? Does this library emulate kernel
mechanisms in userspace via nommu mode or something of that nature?
Thanks.

Brian

> Could you expand of what burden does this use-case put on fs
> developers? I am sure that, if needed, we can put restrictions in LKL
> to avoid that.
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
Octavian Purdila Nov. 20, 2015, 3:31 p.m. UTC | #13
On Fri, Nov 20, 2015 at 5:24 PM, Brian Foster <bfoster@redhat.com> wrote:
> On Fri, Nov 20, 2015 at 04:26:28PM +0200, Octavian Purdila wrote:
>> On Fri, Nov 20, 2015 at 2:58 AM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Fri, Nov 20, 2015 at 12:54:02AM +0100, Richard Weinberger wrote:
>> >> On Fri, Nov 20, 2015 at 12:24 AM, Dave Chinner <david@fromorbit.com> wrote:
>> >> > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote:
>> >> >> Naive implementation for non-mmu architectures: allocate physically
>> >> >> contiguous xfs buffers with alloc_pages. Terribly inefficient with
>> >> >> memory and fragmentation on high I/O loads but it may be good enough
>> >> >> for basic usage (which most non-mmu architectures will need).
>> >> >
>> >> > Can you please explain why you want to use XFS on low end, basic
>> >> > non-MMU devices? XFS is a high performance, enterprise/HPC level
>> >> > filesystem - it's not a filesystem designed for small IoT level
>> >> > devices - so I'm struggling to see why we'd want to expend any
>> >> > effort to make XFS work on such devices....
>> >>
>> >> The use case is the Linux Kernel Library:
>> >> https://lkml.org/lkml/2015/11/3/706
>> >>
>> >> Using LKL and fuse you can mount any kernel filesystem using fuse
>> >> as non-root.
>> >
>> > IOWs, because we said no to unprivileged mounts, instead the
>> > proposal is to linking all the kernel code into userspace so you can
>> > do unprivielged mounts that way?
>> >
>>
>> LKL's goal is to make it easy for various applications to reuse Linux
>> kernel code instead of re-implementing it. Mounting filesystem images
>> is just one of the applications.
>>
>> > IOWs, you get to say "it secure because it's in userspace" and leave
>> > us filesystem people with all the shit that comes with allowing
>> > users to mount random untrusted filesystem images using code that
>> > was never designed to allow that to happen?
>> >
>>
>> It is already possible to mount arbitrary filesystem images in
>> userspace using VMs . LKL doesn't change that, it just reduces the
>> amount of dependencies you need to do so.
>>
>
> Perhaps a dumb question, but I'm not quite putting 2+2 together here.
> When I see nommu, I'm generally thinking hardware characteristics, but
> we're talking about a userspace kernel library here. So can you
> elaborate on how this relates to nommu? Does this library emulate kernel
> mechanisms in userspace via nommu mode or something of that nature?
>

LKL is currently implemented as a virtual non-mmu architecture. That
makes it simpler and it will also allow us to support environments
where it is not possible to emulate paging (e.g.  bootloaders).
Octavian Purdila Nov. 20, 2015, 3:35 p.m. UTC | #14
On Fri, Nov 20, 2015 at 5:11 PM, Brian Foster <bfoster@redhat.com> wrote:
> On Fri, Nov 20, 2015 at 10:35:47AM +1100, Dave Chinner wrote:
>> On Thu, Nov 19, 2015 at 10:55:25AM -0500, Brian Foster wrote:
>> > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote:
>> > > Naive implementation for non-mmu architectures: allocate physically
>> > > contiguous xfs buffers with alloc_pages. Terribly inefficient with
>> > > memory and fragmentation on high I/O loads but it may be good enough
>> > > for basic usage (which most non-mmu architectures will need).
>> > >
>> > > This patch was tested with lklfuse [1] and basic operations seems to
>> > > work even with 16MB allocated for LKL.
>> > >
>> > > [1] https://github.com/lkl/linux
>> > >
>> > > Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
>> > > ---
>> >
>> > Interesting, though this makes me wonder why we couldn't have a new
>> > _XBF_VMEM (for example) buffer type that uses vmalloc(). I'm not
>> > familiar with mmu-less context, but I see that mm/nommu.c has a
>> > __vmalloc() interface that looks like it ultimately translates into an
>> > alloc_pages() call. Would that accomplish what this patch is currently
>> > trying to do?
>>
>> vmalloc is always a last resort.  vmalloc space on 32 bit systems is
>> extremely limited and it is easy to exhaust with XFS.
>>
>
> Sure, but my impression is that a vmalloc() buffer is roughly equivalent
> in this regard to a current !XBF_UNMAPPED && size > PAGE_SIZE buffer. We
> just do the allocation and mapping separately (presumably for other
> reasons).
>
>> Also, vmalloc limits the control we have over allocation context
>> (e.g. the hoops we jump through in kmem_alloc_large() to maintain
>> GFP_NOFS contexts), so just using vmalloc doesn't make things much
>> simpler from an XFS perspective.
>>
>
> The comment in kmem_zalloc_large() calls out some apparent hardcoded
> allocation flags down in the depths of vmalloc(). It looks to me that
> page allocation (__vmalloc_area_node()) actually uses the provided
> flags, so I'm not following the "data page" part of that comment.
> Indeed, I do see that this is not the case down in calls like
> pmd_alloc_one(), pte_alloc_one_kernel(), etc., associated with page
> table management.
>
> Those latter calls are all from following down through the
> map_vm_area()->vmap_page_range() codepath from __vmalloc_area_node(). We
> call vm_map_ram() directly from _xfs_buf_map_pages(), which itself calls
> down into the same code. Indeed, we already protect ourselves here via
> the same memalloc_noio_save() mechanism that kmem_zalloc_large() uses.
>
> I suspect there's more to it than that because it does look like
> vm_map_ram() has a different mechanism for managing vmalloc space for
> certain (smaller) allocations, either of which I'm not really familiar
> with. That aside, I don't see how vmalloc() introduces any new
> allocation context issues for those buffers where we already set up a
> multi-page mapping.
>
> We still have the somewhat customized page allocation code in
> xfs_buf_allocate_memory() to contend with. I actually think it would be
> useful to have a DEBUG sysfs tunable to turn on vmalloc() buffers and
> actually test how effective some of this code is.
>
>> > I ask because it seems like that would help clean up the code a bit, for
>> > one. It might also facilitate some degree of testing of the XFS bits
>> > (even if utilized sparingly in DEBUG mode if it weren't suitable enough
>> > for generic/mmu use). We currently allocate and map the buffer pages
>> > separately and I'm not sure if there's any particular reasons for doing
>> > that outside of some congestion handling in the allocation code and
>> > XBF_UNMAPPED buffers, the latter probably being irrelevant for nommu.
>> > Any other thoughts on that?
>>
>> We could probably clean the code up more (the allocation logic
>> is now largely a historic relic) but I'm not convinced yet that we
>> should be spending any time trying to specifically support mmu-less
>> hardware.
>>
>
> Fair point, we'll see where the use case discussion goes. That said, I
> was a little surprised that this is all that was required to enable
> nommu support. If that is indeed the case and we aren't in for a series
> of subsequent nommu specific changes (Octavian?) by letting this
> through, what's the big deal? This seems fairly harmless to me as is,
> particularly if it can be semi-tested via DEBUG mode and has potential
> generic use down the road.
>

I don't foresee additional patches. I was able to use lklfuse to mount
an XFS image and perform basic operations. Are there any xfs-specific
test coverage tools I can use to make sure I am not missing anything?
Brian Foster Nov. 20, 2015, 3:40 p.m. UTC | #15
On Fri, Nov 20, 2015 at 05:35:55PM +0200, Octavian Purdila wrote:
> On Fri, Nov 20, 2015 at 5:11 PM, Brian Foster <bfoster@redhat.com> wrote:
> > On Fri, Nov 20, 2015 at 10:35:47AM +1100, Dave Chinner wrote:
> >> On Thu, Nov 19, 2015 at 10:55:25AM -0500, Brian Foster wrote:
> >> > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote:
> >> > > Naive implementation for non-mmu architectures: allocate physically
> >> > > contiguous xfs buffers with alloc_pages. Terribly inefficient with
> >> > > memory and fragmentation on high I/O loads but it may be good enough
> >> > > for basic usage (which most non-mmu architectures will need).
> >> > >
> >> > > This patch was tested with lklfuse [1] and basic operations seems to
> >> > > work even with 16MB allocated for LKL.
> >> > >
> >> > > [1] https://github.com/lkl/linux
> >> > >
> >> > > Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
> >> > > ---
> >> >
> >> > Interesting, though this makes me wonder why we couldn't have a new
> >> > _XBF_VMEM (for example) buffer type that uses vmalloc(). I'm not
> >> > familiar with mmu-less context, but I see that mm/nommu.c has a
> >> > __vmalloc() interface that looks like it ultimately translates into an
> >> > alloc_pages() call. Would that accomplish what this patch is currently
> >> > trying to do?
> >>
> >> vmalloc is always a last resort.  vmalloc space on 32 bit systems is
> >> extremely limited and it is easy to exhaust with XFS.
> >>
> >
> > Sure, but my impression is that a vmalloc() buffer is roughly equivalent
> > in this regard to a current !XBF_UNMAPPED && size > PAGE_SIZE buffer. We
> > just do the allocation and mapping separately (presumably for other
> > reasons).
> >
> >> Also, vmalloc limits the control we have over allocation context
> >> (e.g. the hoops we jump through in kmem_alloc_large() to maintain
> >> GFP_NOFS contexts), so just using vmalloc doesn't make things much
> >> simpler from an XFS perspective.
> >>
> >
> > The comment in kmem_zalloc_large() calls out some apparent hardcoded
> > allocation flags down in the depths of vmalloc(). It looks to me that
> > page allocation (__vmalloc_area_node()) actually uses the provided
> > flags, so I'm not following the "data page" part of that comment.
> > Indeed, I do see that this is not the case down in calls like
> > pmd_alloc_one(), pte_alloc_one_kernel(), etc., associated with page
> > table management.
> >
> > Those latter calls are all from following down through the
> > map_vm_area()->vmap_page_range() codepath from __vmalloc_area_node(). We
> > call vm_map_ram() directly from _xfs_buf_map_pages(), which itself calls
> > down into the same code. Indeed, we already protect ourselves here via
> > the same memalloc_noio_save() mechanism that kmem_zalloc_large() uses.
> >
> > I suspect there's more to it than that because it does look like
> > vm_map_ram() has a different mechanism for managing vmalloc space for
> > certain (smaller) allocations, either of which I'm not really familiar
> > with. That aside, I don't see how vmalloc() introduces any new
> > allocation context issues for those buffers where we already set up a
> > multi-page mapping.
> >
> > We still have the somewhat customized page allocation code in
> > xfs_buf_allocate_memory() to contend with. I actually think it would be
> > useful to have a DEBUG sysfs tunable to turn on vmalloc() buffers and
> > actually test how effective some of this code is.
> >
> >> > I ask because it seems like that would help clean up the code a bit, for
> >> > one. It might also facilitate some degree of testing of the XFS bits
> >> > (even if utilized sparingly in DEBUG mode if it weren't suitable enough
> >> > for generic/mmu use). We currently allocate and map the buffer pages
> >> > separately and I'm not sure if there's any particular reasons for doing
> >> > that outside of some congestion handling in the allocation code and
> >> > XBF_UNMAPPED buffers, the latter probably being irrelevant for nommu.
> >> > Any other thoughts on that?
> >>
> >> We could probably clean the code up more (the allocation logic
> >> is now largely a historic relic) but I'm not convinced yet that we
> >> should be spending any time trying to specifically support mmu-less
> >> hardware.
> >>
> >
> > Fair point, we'll see where the use case discussion goes. That said, I
> > was a little surprised that this is all that was required to enable
> > nommu support. If that is indeed the case and we aren't in for a series
> > of subsequent nommu specific changes (Octavian?) by letting this
> > through, what's the big deal? This seems fairly harmless to me as is,
> > particularly if it can be semi-tested via DEBUG mode and has potential
> > generic use down the road.
> >
> 
> I don't foresee additional patches. I was able to use lklfuse to mount
> an XFS image and perform basic operations. Are there any xfs specific
> tests coverage tools I can use to make sure I am not missing anything?
> 

Ok, well you probably want to run some of the tests in xfstests and see
if anything falls over:

https://git.kernel.org/cgit/fs/xfs/xfstests-dev.git/

Note that this has generic as well as fs-specific tests for other
filesystems (ext4, btrfs, etc.).

Brian

> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
Brian Foster Nov. 20, 2015, 3:43 p.m. UTC | #16
On Fri, Nov 20, 2015 at 05:31:59PM +0200, Octavian Purdila wrote:
> On Fri, Nov 20, 2015 at 5:24 PM, Brian Foster <bfoster@redhat.com> wrote:
> > On Fri, Nov 20, 2015 at 04:26:28PM +0200, Octavian Purdila wrote:
> >> On Fri, Nov 20, 2015 at 2:58 AM, Dave Chinner <david@fromorbit.com> wrote:
> >> > On Fri, Nov 20, 2015 at 12:54:02AM +0100, Richard Weinberger wrote:
> >> >> On Fri, Nov 20, 2015 at 12:24 AM, Dave Chinner <david@fromorbit.com> wrote:
> >> >> > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote:
> >> >> >> Naive implementation for non-mmu architectures: allocate physically
> >> >> >> contiguous xfs buffers with alloc_pages. Terribly inefficient with
> >> >> >> memory and fragmentation on high I/O loads but it may be good enough
> >> >> >> for basic usage (which most non-mmu architectures will need).
> >> >> >
> >> >> > Can you please explain why you want to use XFS on low end, basic
> >> >> > non-MMU devices? XFS is a high performance, enterprise/HPC level
> >> >> > filesystem - it's not a filesystem designed for small IoT level
> >> >> > devices - so I'm struggling to see why we'd want to expend any
> >> >> > effort to make XFS work on such devices....
> >> >>
> >> >> The use case is the Linux Kernel Library:
> >> >> https://lkml.org/lkml/2015/11/3/706
> >> >>
> >> >> Using LKL and fuse you can mount any kernel filesystem using fuse
> >> >> as non-root.
> >> >
> >> > IOWs, because we said no to unprivileged mounts, instead the
> >> > proposal is to linking all the kernel code into userspace so you can
> >> > do unprivielged mounts that way?
> >> >
> >>
> >> LKL's goal is to make it easy for various applications to reuse Linux
> >> kernel code instead of re-implementing it. Mounting filesystem images
> >> is just one of the applications.
> >>
> >> > IOWs, you get to say "it secure because it's in userspace" and leave
> >> > us filesystem people with all the shit that comes with allowing
> >> > users to mount random untrusted filesystem images using code that
> >> > was never designed to allow that to happen?
> >> >
> >>
> >> It is already possible to mount arbitrary filesystem images in
> >> userspace using VMs . LKL doesn't change that, it just reduces the
> >> amount of dependencies you need to do so.
> >>
> >
> > Perhaps a dumb question, but I'm not quite putting 2+2 together here.
> > When I see nommu, I'm generally thinking hardware characteristics, but
> > we're talking about a userspace kernel library here. So can you
> > elaborate on how this relates to nommu? Does this library emulate kernel
> > mechanisms in userspace via nommu mode or something of that nature?
> >
> 
> LKL is currently implemented as a virtual non-mmu architecture. That
> makes it simpler and it will also allow us to support environments
> where it is not possible to emulate paging (e.g.  bootloaders).
> 

Ok, so we aren't necessarily talking about running on typically limited,
mmu-less hardware. Thanks!

Brian

> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
Theodore Ts'o Nov. 20, 2015, 8:07 p.m. UTC | #17
On Fri, Nov 20, 2015 at 04:26:28PM +0200, Octavian Purdila wrote:
> It is already possible to mount arbitrary filesystem images in
> userspace using VMs . LKL doesn't change that, it just reduces the
> amount of dependencies you need to do so.

It is true that you can mount arbitrary file systems in userspace
using VMs.  But the kvm binary is typically not run with root
privileges in the host OS --- at least, not if the system
administrator is smart.  So a root compromise does not cause a
catastrophic security vulnerability, and if the guest OS crashes ---
again, not a real problem.

In the case where people are trying to claim that containers are just
as secure as VMs, and plan to give container "guest" system
administrators root-like powers, the question which immediately comes
to mind is whether the LKL/fuse daemon is running inside or outside
the container.  If it is outside the container, a potential security
compromise of that binary will be catastrophic to the overall security
of the host system and all of its containers.  If it is inside the
container, you will be partially breaking the illusion that the
container works just like a VM (since a user running "ps" will see all
of these mysterious userspace processes that could be killed, etc.),
but it significantly reduces the security problems if a maliciously
crafted (or maliciously modified) block device is mounted.

> Could you expand of what burden does this use-case put on fs
> developers? I am sure that, if needed, we can put restrictions in LKL
> to avoid that.

The bottom line is who is going to get all of the support calls and
nasty e-mails explaining how our "crappy" code has caused some silly
container VM system administrator's customers $$$ worth of losses.  As
long as we can make sure that it's been underlined that the code is
being used well outside of its intended scope, and if it breaks, the
user gets to keep both pieces, and all complaints, threatened
lawsuits, etc. should go to the LKL maintainers or the purveyors of
said container-based products, I suppose that should be OK.  :-)

						- Ted
Dave Chinner Nov. 20, 2015, 8:36 p.m. UTC | #18
On Fri, Nov 20, 2015 at 10:11:19AM -0500, Brian Foster wrote:
> On Fri, Nov 20, 2015 at 10:35:47AM +1100, Dave Chinner wrote:
> > On Thu, Nov 19, 2015 at 10:55:25AM -0500, Brian Foster wrote:
> > > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote:
> > > > Naive implementation for non-mmu architectures: allocate physically
> > > > contiguous xfs buffers with alloc_pages. Terribly inefficient with
> > > > memory and fragmentation on high I/O loads but it may be good enough
> > > > for basic usage (which most non-mmu architectures will need).
> > > > 
> > > > This patch was tested with lklfuse [1] and basic operations seems to
> > > > work even with 16MB allocated for LKL.
> > > > 
> > > > [1] https://github.com/lkl/linux
> > > > 
> > > > Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
> > > > ---
> > > 
> > > Interesting, though this makes me wonder why we couldn't have a new
> > > _XBF_VMEM (for example) buffer type that uses vmalloc(). I'm not
> > > familiar with mmu-less context, but I see that mm/nommu.c has a
> > > __vmalloc() interface that looks like it ultimately translates into an
> > > alloc_pages() call. Would that accomplish what this patch is currently
> > > trying to do?
> > 
> > vmalloc is always a last resort.  vmalloc space on 32 bit systems is
> > extremely limited and it is easy to exhaust with XFS.
> > 
> 
> Sure, but my impression is that a vmalloc() buffer is roughly equivalent
> in this regard to a current !XBF_UNMAPPED && size > PAGE_SIZE buffer. We
> just do the allocation and mapping separately (presumably for other
> reasons).

Yes, it's always a last resort. We don't use vmap'd buffers very
much on block size <= page size filesystems (e.g. iclog buffers are
the main user in such cases, IIRC), so the typical 32 bit
system doesn't have major problems with vmalloc space. However, the
moment you increase the directory block size beyond the filesystem
block size, that all goes out the window...

> > Also, vmalloc limits the control we have over allocation context
> > (e.g. the hoops we jump through in kmem_alloc_large() to maintain
> > GFP_NOFS contexts), so just using vmalloc doesn't make things much
> > simpler from an XFS perspective.
> > 
> 
> The comment in kmem_zalloc_large() calls out some apparent hardcoded
> allocation flags down in the depths of vmalloc(). It looks to me that
> page allocation (__vmalloc_area_node()) actually uses the provided
> flags, so I'm not following the "data page" part of that comment.

You can pass gfp flags for the page allocation part of vmalloc, but
not the pte allocation part of it. That's what the hacks in
kmem_zalloc_large() are doing.

> Indeed, I do see that this is not the case down in calls like
> pmd_alloc_one(), pte_alloc_one_kernel(), etc., associated with page
> table management.

Right.

> Those latter calls are all from following down through the
> map_vm_area()->vmap_page_range() codepath from __vmalloc_area_node(). We
> call vm_map_ram() directly from _xfs_buf_map_pages(), which itself calls
> down into the same code. Indeed, we already protect ourselves here via
> the same memalloc_noio_save() mechanism that kmem_zalloc_large() uses.

Yes, we do, but that is handled separately from the allocation of the
pages, which we have to do for all types of buffers, mapped or
unmapped, because xfs_buf_ioapply_map() requires direct access to
the underlying pages to build the bio for IO.  If we delegate the
allocation of pages to vmalloc, we don't have a direct reference to
the underlying pages and so we have to do something completely
different to build the bios for the buffer....
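
The bio construction loop works straight off the page array, i.e.
roughly (paraphrased):

        for (; size && nr_pages; nr_pages--, page_index++) {
                int nbytes = min_t(int, PAGE_SIZE - offset, size);

                if (bio_add_page(bio, bp->b_pages[page_index], nbytes,
                                 offset) < nbytes)
                        break;  /* bio full, submit and start another */

                offset = 0;
                size -= nbytes;
        }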

> I suspect there's more to it than that because it does look like
> vm_map_ram() has a different mechanism for managing vmalloc space for
> certain (smaller) allocations, either of which I'm not really familiar
> with.

Yes, it manages vmalloc space quite differently, and there are
different scalability aspects to consider as well - vm_map_ram was
pretty much written for the use XFS has in xfs_buf.c...

Cheers,

Dave.
Dave Chinner Nov. 20, 2015, 9:08 p.m. UTC | #19
On Fri, Nov 20, 2015 at 03:43:20PM +0200, Octavian Purdila wrote:
> On Fri, Nov 20, 2015 at 1:24 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote:
> >> Naive implementation for non-mmu architectures: allocate physically
> >> contiguous xfs buffers with alloc_pages. Terribly inefficient with
> >> memory and fragmentation on high I/O loads but it may be good enough
> >> for basic usage (which most non-mmu architectures will need).
> >
> > Can you please explain why you want to use XFS on low end, basic
> > non-MMU devices? XFS is a high performance, enterprise/HPC level
> > filesystem - it's not a filesystem designed for small IoT level
> > devices - so I'm struggling to see why we'd want to expend any
> > effort to make XFS work on such devices....
> >
> 
> Hi David,
> 
> Yes XFS as the main fs on this type of devices does not make sense,
> but does it hurt to be able to perform basic operation on XFS from
> these devices? Perhaps accessing an external medium formatted with
> XFS?
> 
> Another example is accessing VM images that are formatted with XFS.
> Currently we can do that with tools like libguestfs that use a VM in
> the background. I am working on a lighter solution for that where we
> compile the Linux kernel as a library [1]. This allows access to the
> filesystem without the need to use a full VM.

That's hardly a "lighter solution".

I'm kinda tired of the ongoing "hack random shit" approach to
container development. If you need an XFS-FUSE module to allow safe
userspace access to XFS filesystems then maybe, just maybe, it makes
sense to ask the XFS developers how to best go about providing a
reliable, up-to-date, tested, maintained and supported XFS-FUSE
module?

IOWs, a "lighter solution" is to use the libxfs code base that we
already maintain across kernel and userspace in the xfsprogs package
and write a FUSE wrapper around that. That, immediately, will give
you full read-only access to XFS filesystem images via FUSE. Then we
(the XFS developers) can test the XFS-FUSE module under normal
development conditions as we modify the xfsprogs code base (e.g. via
xfstests) and ensure we always release a working, up-to-date FUSE
wrapper with each xfsprogs release.

And then once a proper read-only FUSE wrapper has been written, then
we can discuss what is necessary to enable write access via porting
the necessary parts of the kernel code across to the userspace
libxfs codebase and hooking them up to the FUSE API...

Hmmm?

> And a final example is linking the bootloader code with LKL to access
> the filesystem. This has a hard requirement on non-mmu.

No way. We *can't* support filesystems that have had bootloaders
make arbitrary changes to the filesystem without the knowledge of the
OS that *owns the filesystem*. Similarly, we cannot support random
applications that internally mount and modify filesystem images in
ways we can't see, control, test or simulate.  Sure, they use the
kernel code, but that doesn't stop them from doing stupid shit that
could corrupt the filesystem image. So, no, we are not going to
support giving random applications direct access to XFS filesystem
images, even via LKL.

> So, IMHO there are usecases for using XFS on non-mmu architectures. I
> think it all boils down to: is the patch simple enough to not put an
> unreasonable maintenance burden on developers?

Look at it this way: a user tries to mount their
XFS-on-LKL-on-FUSE, XFS throws a memory allocation deadlock
warning because a high order allocation fails during mount
(i.e. due to the vmalloc no-mmu hacks).

Who has to spend effort to find out why the error is being thrown?
It's not the FUSE developers. It's not the LKL developers. XFS is
going to be blamed, because it's an XFS error message, and it will
be up to the XFS developers to /prove/ that it's not an XFS problem.
And then google will find these complaints about XFS causing all
sorts of problems, and we're back to the bad old days of everyone
blaming XFS for shortcomings and problems in other code we have no
control over..

I really don't see how using LKL to give userspace access to XFS
filesystems is a better solution than actually writing a proper,
supported XFS-FUSE module. LKL is so full of compromises that it's
going to be unworkable and unsupportable in practice...

Cheers,

Dave.
Octavian Purdila Nov. 20, 2015, 10:26 p.m. UTC | #20
On Fri, Nov 20, 2015 at 11:08 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Fri, Nov 20, 2015 at 03:43:20PM +0200, Octavian Purdila wrote:
>> On Fri, Nov 20, 2015 at 1:24 AM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote:
>> >> Naive implementation for non-mmu architectures: allocate physically
>> >> contiguous xfs buffers with alloc_pages. Terribly inefficient with
>> >> memory and fragmentation on high I/O loads but it may be good enough
>> >> for basic usage (which most non-mmu architectures will need).
>> >
>> > Can you please explain why you want to use XFS on low end, basic
>> > non-MMU devices? XFS is a high performance, enterprise/HPC level
>> > filesystem - it's not a filesystem designed for small IoT level
>> > devices - so I'm struggling to see why we'd want to expend any
>> > effort to make XFS work on such devices....
>> >
>>
>> Hi David,
>>
>> Yes XFS as the main fs on this type of devices does not make sense,
>> but does it hurt to be able to perform basic operation on XFS from
>> these devices? Perhaps accessing an external medium formatted with
>> XFS?
>>
>> Another example is accessing VM images that are formatted with XFS.
>> Currently we can do that with tools like libguestfs that use a VM in
>> the background. I am working on a lighter solution for that where we
>> compile the Linux kernel as a library [1]. This allows access to the
>> filesystem without the need to use a full VM.
>
> That's hardly a "lighter solution"
>
> I'm kinda tired of the ongoing "hack random shit" approach to
> container development.

Since apparently there is a container devs hunting party going on
right now, let me quickly confess that LKL has nothing to do with
(them be damned) containers :)

On a more serious note, LKL was not developed for containers or to try
to circumvent privileged mounts. It was developed to allow the Linux
kernel code to be reused in things like simple tools that allow one
to modify a filesystem image.

> If you need a XFS-FUSE module to allow safe
> userspace access to XFS fielsystems then maybe, just maybe, it makes
> sense to ask the XFS developers how to best go about providing a
> reliable, up-to-date, tested, maintained and supported XFS-FUSE
> module?
>
> IOWs, a "lighter solution" is to use the libxfs code base that we
> already maintain across kernel and userspace in the xfsprogs package
> and write a FUSE wrapper around that. That, immediately, will give
> you full read-only access to XFS filesystem images via FUSE. Then we
> (the XFS developers) can test the XFS-FUSE module under normal
> development conditions as we modify the xfsprogs code base (e.g. via
> xfstests) and ensure we always release a working, up-to-date FUSE
> wrapper with each xfsprogs release.
>
> And then once a proper read-only FUSE wrapper has been written, then
> we can discuss what is necessary to enable write access via porting
> the necessary parts of the kernel code across to the userspace
> libxfs codebase and hooking them up to the FUSE API...
>
> Hmmm?
>

What about ext4, vfat, btrfs and other filesystems? Also, why duplicate
the whole thing if you could reuse it?

>> And a final example is linking the bootloader code with LKL to access
>> the filesystem. This has a hard requirement on non-mmu.
>
> No way. We *can't* support filesystems that have had bootloaders
> make arbitrary changes to the filesystem without the knowlege of the
> OS that *owns the filesystem*. Similarly, we cannot support random
> applications that internally mount and modify filesystem images in
> ways we can't see, control, test or simulate.  Sure, they use the
> kernel code, but that doesn't stop them from doing stupid shit that
> could corrupt the filesystem image. So, no, we are not going to
> support giving random applications direct access to XFS filesystem
> images, even via LKL.
>

LKL only exports the Linux kernel system calls and nothing else to
applications. Because of that, there should not be any loss of control
or visibility to the XFS fs driver.

>> So, IMHO there are usecases for using XFS on non-mmu architectures. I
>> think it all boils down to: is the patch simple enough to not put an
>> unreasonable maintenance burden on developers?
>
> Look at it this way: a user tries to mount their
> XFS-on-LKL-on-FUSE, XFS throughs a memory allocation deadlock
> warning because of a high order allocation during mount failing
> (i.e. due to the vmalloc no-mmu hacks).
>
> Who has to spend effort to find out why the error is being thrown?
> It's not the FUSE developers. It's not the LKL developers. XFS is
> going to be blamed, because it's an XFS error message, and it will
> be up to the XFS developers to /prove/ that it's not an XFS problem.
> And then google will find these complaints about XFS causing all
> sorts of problems, and we're back to the bad old days of everyone
> blaming XFS for shortcomings and problems in other code we have no
> control over..
>
> I really don't see how using LKL to give userspace access to XFS
> filesystems is a better solution than actually writing a proper,
> supported XFS-FUSE module. LKL is so full of compromises that it's
> going to be unworkable and unsupportable in practice...
>

Could you elaborate on some of these issues?
Brian Foster Nov. 20, 2015, 10:47 p.m. UTC | #21
On Sat, Nov 21, 2015 at 07:36:02AM +1100, Dave Chinner wrote:
> On Fri, Nov 20, 2015 at 10:11:19AM -0500, Brian Foster wrote:
> > On Fri, Nov 20, 2015 at 10:35:47AM +1100, Dave Chinner wrote:
> > > On Thu, Nov 19, 2015 at 10:55:25AM -0500, Brian Foster wrote:
> > > > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote:
> > > > > Naive implementation for non-mmu architectures: allocate physically
> > > > > contiguous xfs buffers with alloc_pages. Terribly inefficient with
> > > > > memory and fragmentation on high I/O loads but it may be good enough
> > > > > for basic usage (which most non-mmu architectures will need).
> > > > > 
> > > > > This patch was tested with lklfuse [1] and basic operations seems to
> > > > > work even with 16MB allocated for LKL.
> > > > > 
> > > > > [1] https://github.com/lkl/linux
> > > > > 
> > > > > Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
> > > > > ---
> > > > 
> > > > Interesting, though this makes me wonder why we couldn't have a new
> > > > _XBF_VMEM (for example) buffer type that uses vmalloc(). I'm not
> > > > familiar with mmu-less context, but I see that mm/nommu.c has a
> > > > __vmalloc() interface that looks like it ultimately translates into an
> > > > alloc_pages() call. Would that accomplish what this patch is currently
> > > > trying to do?
> > > 
> > > vmalloc is always a last resort.  vmalloc space on 32 bit systems is
> > > extremely limited and it is easy to exhaust with XFS.
> > > 
> > 
> > Sure, but my impression is that a vmalloc() buffer is roughly equivalent
> > in this regard to a current !XBF_UNMAPPED && size > PAGE_SIZE buffer. We
> > just do the allocation and mapping separately (presumably for other
> > reasons).
> 
> Yes, it'a always a last resort. We don't use vmap'd buffers very
> much on block size <= page size filesystems (e.g. iclog buffers are
> the main user in such cases, IIRC), so the typical 32 bit
> system doesn't have major problems with vmalloc space. However, the
> moment you increase the directory block size > block size, that all
> goes out the window...
> 

Ok. It's really only the pre-existing mapped buffer cases we care about
here.

> > > Also, vmalloc limits the control we have over allocation context
> > > (e.g. the hoops we jump through in kmem_alloc_large() to maintain
> > > GFP_NOFS contexts), so just using vmalloc doesn't make things much
> > > simpler from an XFS perspective.
> > > 
> > 
> > The comment in kmem_zalloc_large() calls out some apparent hardcoded
> > allocation flags down in the depths of vmalloc(). It looks to me that
> > page allocation (__vmalloc_area_node()) actually uses the provided
> > flags, so I'm not following the "data page" part of that comment.
> 
> You can pass gfp flags for the page allocation part of vmalloc, but
> not the pte allocation part of it. That's what the hacks in
> kmem_zalloc_large() are doing.
> 
> > Indeed, I do see that this is not the case down in calls like
> > pmd_alloc_one(), pte_alloc_one_kernel(), etc., associated with page
> > table management.
> 
> Right.
> 
> > Those latter calls are all from following down through the
> > map_vm_area()->vmap_page_range() codepath from __vmalloc_area_node(). We
> > call vm_map_ram() directly from _xfs_buf_map_pages(), which itself calls
> > down into the same code. Indeed, we already protect ourselves here via
> > the same memalloc_noio_save() mechanism that kmem_zalloc_large() uses.
> 
> Yes, we do, but that is separately handled to the allocation of the
> pages, which we have to do for all types of buffers, mapped or
> unmapped, because xfs_buf_ioapply_map() requires direct access to
> the underlying pages to build the bio for IO.  If we delegate the
> allocation of pages to vmalloc, we don't have direct reference to
> the underlying pages and so we have to do something completely
> diffferent to build the bios for the buffer....
> 

Octavian points out virt_to_page() in a previous mail. I'm not sure
that's the right interface solely based on looking at some current
callers, but there is vmalloc_to_page() so I'd expect we can gain access
to the pages one way or another. Given that, the buffer allocation code
would fully populate the xfs_buf as it is today. The buffer I/O
submission code wouldn't really know the difference and shouldn't have
to change at all.
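
A minimal sketch of that idea, assuming vmalloc_to_page() is suitable
here and that bp->b_pages[] has already been sized as it is today (the
helper name and the GFP flags are illustrative only, not actual
xfs_buf code):

static int
xfs_buf_alloc_vmem_sketch(struct xfs_buf *bp, size_t size)
{
	unsigned int	i;

	/* one virtually contiguous allocation instead of per-page alloc_page() */
	bp->b_addr = __vmalloc(size, GFP_NOFS | __GFP_ZERO, PAGE_KERNEL);
	if (!bp->b_addr)
		return -ENOMEM;

	/*
	 * Recover the struct page pointers so the existing bio-building
	 * code can keep walking a plain page array.  Each lookup walks
	 * the vmalloc page tables, so it is not free.
	 */
	bp->b_page_count = DIV_ROUND_UP(size, PAGE_SIZE);
	for (i = 0; i < bp->b_page_count; i++)
		bp->b_pages[i] = vmalloc_to_page(bp->b_addr + i * PAGE_SIZE);

	return 0;
}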

> > I suspect there's more to it than that because it does look like
> > vm_map_ram() has a different mechanism for managing vmalloc space for
> > certain (smaller) allocations, either of which I'm not really familiar
> > with.
> 
> Yes, it manages vmalloc space quite differently, and there are
> different scalability aspects to consider as well - vm_map_ram was
> pretty much written for the use XFS has in xfs_buf.c...
> 

Indeed. Looking closer, it appears to have a percpu vmalloc space
allocation strategy for smaller allocations. We clearly can't just
switch mapped buffer cases over and expect it to work/perform just the
same.

That said, the vm_map_ram() comment does call out fragmentation concerns
for objects with mixed lifetimes. I'd be curious whether our buffer
caching can trigger any badness there. OTOH, it's also not clear that
this mechanism couldn't extend to vmalloc (or some variant thereof) in
the future.

Either way, it would require significantly more investigation/testing to
enable generic usage. The core point was really just to abstract the
nommu changes into something that potentially has generic use.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Nov. 22, 2015, 10:04 p.m. UTC | #22
On Fri, Nov 20, 2015 at 05:47:34PM -0500, Brian Foster wrote:
> On Sat, Nov 21, 2015 at 07:36:02AM +1100, Dave Chinner wrote:
> > On Fri, Nov 20, 2015 at 10:11:19AM -0500, Brian Foster wrote:
> > > On Fri, Nov 20, 2015 at 10:35:47AM +1100, Dave Chinner wrote:
> > > Those latter calls are all from following down through the
> > > map_vm_area()->vmap_page_range() codepath from __vmalloc_area_node(). We
> > > call vm_map_ram() directly from _xfs_buf_map_pages(), which itself calls
> > > down into the same code. Indeed, we already protect ourselves here via
> > > the same memalloc_noio_save() mechanism that kmem_zalloc_large() uses.
> > 
> > Yes, we do, but that is separately handled to the allocation of the
> > pages, which we have to do for all types of buffers, mapped or
> > unmapped, because xfs_buf_ioapply_map() requires direct access to
> > the underlying pages to build the bio for IO.  If we delegate the
> > allocation of pages to vmalloc, we don't have direct reference to
> > the underlying pages and so we have to do something completely
> > diffferent to build the bios for the buffer....
> > 
> 
> Octavian points out virt_to_page() in a previous mail. I'm not sure
> that's the right interface solely based on looking at some current
> callers, but there is vmalloc_to_page() so I'd expect we can gain access
> to the pages one way or another.

Sure, but these are not zero cost operations....

> Given that, the buffer allocation code
> would fully populate the xfs_buf as it is today. The buffer I/O
> submission code wouldn't really know the difference and shouldn't have
> to change at all.

The abstraction results in more expensive/complex setup and teardown
of buffers and/or IO submission. i.e. the use of vmalloc() based
abstractions has an additional cost over what we do now.

[...]

> Either way, it would require significantly more investigation/testing to
> enable generic usage. The core point was really just to abstract the
> nommu changes into something that potentially has generic use.

I'm not saying that it is impossible to do this, just trying to work
out if making any changes to support nommu architectures is worth
the potential future trouble making such changes could bring us.
i.e. before working out how to do something, we have to decide
whether it is worth doing in the first place.

Just because you can do something doesn't automatically make it a
good idea....

Cheers,

Dave.
Dave Chinner Nov. 22, 2015, 10:44 p.m. UTC | #23
On Sat, Nov 21, 2015 at 12:26:47AM +0200, Octavian Purdila wrote:
> On Fri, Nov 20, 2015 at 11:08 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Fri, Nov 20, 2015 at 03:43:20PM +0200, Octavian Purdila wrote:
> >> On Fri, Nov 20, 2015 at 1:24 AM, Dave Chinner <david@fromorbit.com> wrote:
> >> > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote:
> >> >> Naive implementation for non-mmu architectures: allocate physically
> >> >> contiguous xfs buffers with alloc_pages. Terribly inefficient with
> >> >> memory and fragmentation on high I/O loads but it may be good enough
> >> >> for basic usage (which most non-mmu architectures will need).
> >> >
> >> > Can you please explain why you want to use XFS on low end, basic
> >> > non-MMU devices? XFS is a high performance, enterprise/HPC level
> >> > filesystem - it's not a filesystem designed for small IoT level
> >> > devices - so I'm struggling to see why we'd want to expend any
> >> > effort to make XFS work on such devices....
> >> >
> >>
> >> Hi David,
> >>
> >> Yes XFS as the main fs on this type of devices does not make sense,
> >> but does it hurt to be able to perform basic operation on XFS from
> >> these devices? Perhaps accessing an external medium formatted with
> >> XFS?
> >>
> >> Another example is accessing VM images that are formatted with XFS.
> >> Currently we can do that with tools like libguestfs that use a VM in
> >> the background. I am working on a lighter solution for that where we
> >> compile the Linux kernel as a library [1]. This allows access to the
> >> filesystem without the need to use a full VM.
> >
> > That's hardly a "lighter solution"
> >
> > I'm kinda tired of the ongoing "hack random shit" approach to
> > container development.
> 
> Since apparently there is a container devs hunting party going on
> right now, let me quickly confess that LKL has nothing to do with
> (them be damned) containers :)
> 
> On a more serious note, LKL was not developed for containers or to try
> to circumvent privileged mounts. It was developed to allow the Linux
> kernel code to be reused in things like simple tools that allows one
> to modify a filesystem image.

Any tool that modifies an XFS filesystem that is not directly
maintained by the XFS developers voids any kind of support we can
supply. Just like the fact we don't support tainted kernels because
the 3rd party binary code is unknowable (and usually crap), having
the kernel code linked with random 3rd party userspace application
code is completely unsupportable by us.

Remember that with most kernel code a bug just results in a panic
and reboot, and everything just continues on again after the system
comes up again. In contrast, a bug in the storage code can cause
*persistent damage* that can cause data loss or corruption that
cannot be fixed without data loss of some kind.

Ultimately, as the maintainer I'm responsible for XFS not eating our
users' data, and part of that responsibility involves telling people
who want to do daft things that "no, that's a *bad idea*".

> > If you need a XFS-FUSE module to allow safe
> > userspace access to XFS fielsystems then maybe, just maybe, it makes
> > sense to ask the XFS developers how to best go about providing a
> > reliable, up-to-date, tested, maintained and supported XFS-FUSE
> > module?
> >
> > IOWs, a "lighter solution" is to use the libxfs code base that we
> > already maintain across kernel and userspace in the xfsprogs package
> > and write a FUSE wrapper around that. That, immediately, will give
> > you full read-only access to XFS filesystem images via FUSE. Then we
> > (the XFS developers) can test the XFS-FUSE module under normal
> > development conditions as we modify the xfsprogs code base (e.g. via
> > xfstests) and ensure we always release a working, up-to-date FUSE
> > wrapper with each xfsprogs release.
> >
> > And then once a proper read-only FUSE wrapper has been written, then
> > we can discuss what is necessary to enable write access via porting
> > the necessary parts of the kernel code across to the userspace
> > libxfs codebase and hooking them up to the FUSE API...
> >
> > Hmmm?
> >
> 
> What about ext4, vfat, btrfs and other filesystems?

Ted has also raised exactly the same issues w.r.t. ext4.

> Also why duplicate
> the whole thing if you could reuse it?

Do you use a hammer when you need to tighten a screw? Yes, you can
"reuse a hammer" for this purpose, but there's going to be
collateral damage because using the tool outside its original
design and architecture constraints presents a high risk of things
going wrong.

> >> And a final example is linking the bootloader code with LKL to access
> >> the filesystem. This has a hard requirement on non-mmu.
> >
> > No way. We *can't* support filesystems that have had bootloaders
> > make arbitrary changes to the filesystem without the knowlege of the
> > OS that *owns the filesystem*. Similarly, we cannot support random
> > applications that internally mount and modify filesystem images in
> > ways we can't see, control, test or simulate.  Sure, they use the
> > kernel code, but that doesn't stop them from doing stupid shit that
> > could corrupt the filesystem image. So, no, we are not going to
> > support giving random applications direct access to XFS filesystem
> > images, even via LKL.
> >
> 
> LKL only exports the Linux kernel system calls and nothing else to
> applications. Because of that, there should not be any loss of control
> or visibility to the XFS fs driver.

It runs in the same address space as the user application, yes? And
hence application bugs can cause the kernel code to malfunction,
yes?

> > I really don't see how using LKL to give userspace access to XFS
> > filesystems is a better solution than actually writing a proper,
> > supported XFS-FUSE module. LKL is so full of compromises that it's
> > going to be unworkable and unsupportable in practice...
> 
> Could you elaborate on some of these issues?

Start with "is a no-mmu architecture" and all the compromises that
means the kernel code needs to make, add a topping of "runs in the
same address space as the application", add a new flavour of kernel
binary taint and finish it off with "LKL linked applications will
never be tested by their developers over the full functionality the
LKL provides them with".

Cheers,

Dave.
Octavian Purdila Nov. 23, 2015, 1:41 a.m. UTC | #24
On Mon, Nov 23, 2015 at 12:44 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Sat, Nov 21, 2015 at 12:26:47AM +0200, Octavian Purdila wrote:
>> On Fri, Nov 20, 2015 at 11:08 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Fri, Nov 20, 2015 at 03:43:20PM +0200, Octavian Purdila wrote:
>> >> On Fri, Nov 20, 2015 at 1:24 AM, Dave Chinner <david@fromorbit.com> wrote:
>> >> > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote:
>> >> >> Naive implementation for non-mmu architectures: allocate physically
>> >> >> contiguous xfs buffers with alloc_pages. Terribly inefficient with
>> >> >> memory and fragmentation on high I/O loads but it may be good enough
>> >> >> for basic usage (which most non-mmu architectures will need).
>> >> >
>> >> > Can you please explain why you want to use XFS on low end, basic
>> >> > non-MMU devices? XFS is a high performance, enterprise/HPC level
>> >> > filesystem - it's not a filesystem designed for small IoT level
>> >> > devices - so I'm struggling to see why we'd want to expend any
>> >> > effort to make XFS work on such devices....
>> >> >
>> >>
>> >> Hi David,
>> >>
>> >> Yes XFS as the main fs on this type of devices does not make sense,
>> >> but does it hurt to be able to perform basic operation on XFS from
>> >> these devices? Perhaps accessing an external medium formatted with
>> >> XFS?
>> >>
>> >> Another example is accessing VM images that are formatted with XFS.
>> >> Currently we can do that with tools like libguestfs that use a VM in
>> >> the background. I am working on a lighter solution for that where we
>> >> compile the Linux kernel as a library [1]. This allows access to the
>> >> filesystem without the need to use a full VM.
>> >
>> > That's hardly a "lighter solution"
>> >
>> > I'm kinda tired of the ongoing "hack random shit" approach to
>> > container development.
>>
>> Since apparently there is a container devs hunting party going on
>> right now, let me quickly confess that LKL has nothing to do with
>> (them be damned) containers :)
>>
>> On a more serious note, LKL was not developed for containers or to try
>> to circumvent privileged mounts. It was developed to allow the Linux
>> kernel code to be reused in things like simple tools that allows one
>> to modify a filesystem image.
>
> Anything tool that modifies an XFS filesystem that is not directly
> maintained by the XFS developers voids any kind of support we can
> supply. Just like the fact we don't support tainted kernels because
> the 3rd party binary code is unknowable (and usually crap), having
> the kernel code linked with random 3rd party userspace application
> code is completely unsupportable by us.
>

Perhaps tainting the kernel is a solution when running unknown
applications linked with LKL.

I would argue that applications that are maintained together with LKL
(e.g. lklfuse in tools/lkl) should not taint the kernel because those
applications will be under the control of kernel developers. I would
also argue that mounting a filesystem read-only should not taint the
kernel either.

> Remember that with most kernel code a bug just results in a panic
> and reboot, and everything just continues on again after the system
> comes up again. In contrast, a bug in the storage code can cause
> *persistent damage* that can cause data loss or corruption that
> cannot be fixed without data loss of some kind.
>
> Ultimately, as the maintainer I'm responsible for XFS not eating our
> users' data, and part of that responsibility involves telling people
> who want to do daft things that "no, that's a *bad idea*".
>

I understand how critical filesystem issues are and I appreciate your
feedback. Sorry to drag you deeper into this but as you know, no good
deed goes unpunished :)

>> > If you need a XFS-FUSE module to allow safe
>> > userspace access to XFS fielsystems then maybe, just maybe, it makes
>> > sense to ask the XFS developers how to best go about providing a
>> > reliable, up-to-date, tested, maintained and supported XFS-FUSE
>> > module?
>> >
>> > IOWs, a "lighter solution" is to use the libxfs code base that we
>> > already maintain across kernel and userspace in the xfsprogs package
>> > and write a FUSE wrapper around that. That, immediately, will give
>> > you full read-only access to XFS filesystem images via FUSE. Then we
>> > (the XFS developers) can test the XFS-FUSE module under normal
>> > development conditions as we modify the xfsprogs code base (e.g. via
>> > xfstests) and ensure we always release a working, up-to-date FUSE
>> > wrapper with each xfsprogs release.
>> >
>> > And then once a proper read-only FUSE wrapper has been written, then
>> > we can discuss what is necessary to enable write access via porting
>> > the necessary parts of the kernel code across to the userspace
>> > libxfs codebase and hooking them up to the FUSE API...
>> >
>> > Hmmm?
>> >
>>
>> What about ext4, vfat, btrfs and other filesystems?
>
> Ted has also raised exactly the same issues w.r.t. ext4.
>
>> Also why duplicate
>> the whole thing if you could reuse it?
>
> Do you use a hammer when you need to tighten a screw? Yes, you can
> "reuse a hammer" for this purpose, but there's going to be
> collateral damage because using the screw outside it's original
> design and architecture constraints presents a high risk of things
> going wrong.
>

First, let's try to discuss the potential collateral damage before
calling it a hammer. It may just be an unfamiliar screwdriver :)

>> >> And a final example is linking the bootloader code with LKL to access
>> >> the filesystem. This has a hard requirement on non-mmu.
>> >
>> > No way. We *can't* support filesystems that have had bootloaders
>> > make arbitrary changes to the filesystem without the knowlege of the
>> > OS that *owns the filesystem*. Similarly, we cannot support random
>> > applications that internally mount and modify filesystem images in
>> > ways we can't see, control, test or simulate.  Sure, they use the
>> > kernel code, but that doesn't stop them from doing stupid shit that
>> > could corrupt the filesystem image. So, no, we are not going to
>> > support giving random applications direct access to XFS filesystem
>> > images, even via LKL.
>> >
>>
>> LKL only exports the Linux kernel system calls and nothing else to
>> applications. Because of that, there should not be any loss of control
>> or visibility to the XFS fs driver.
>
> It runs in the same address space as the user application, yes? And
> hence application bugs can cause the kernel code to malfunction,
> yes?
>

Most non-mmu architectures have the same issue and nevertheless non-mmu
is still supported in Linux (including most filesystems).

Also, filesystem code runs in the same address space with other kernel
code and drivers and a bug anywhere in the kernel can cause filesystem
code to malfunction. Applications maintained together with LKL and in
the kernel tree will be as safe as drivers and other kernel code with
regard to filesystem malfunctions. We can taint the kernel when LKL is
linked with unknown applications.

>> > I really don't see how using LKL to give userspace access to XFS
>> > filesystems is a better solution than actually writing a proper,
>> > supported XFS-FUSE module. LKL is so full of compromises that it's
>> > going to be unworkable and unsupportable in practice...
>>
>> Could you elaborate on some of these issues?
>
> Start with "is a no-mmu architecture" and all the compromises that
> means the kernel code needs to make,

I don't see non-mmu as a compromise. It is supported by Linux and most
filesystems work fine on non-mmu architectures. LKL can be implemented
as an mmu architecture. Having it as a non-mmu architecture has the
advantage of allowing it to run in more constrained environments like
bootloaders.

> add a topping of "runs in the
> same address space as the application",

I've addressed this concern above.

> add a new flavour of kernel
> binary taint

I am not sure I understand - are you saying that it is an issue to add
a new taint flavor?

> and finish it off with "LKL linked applications will
> never be tested by their developers over the full functionality the
> LKL provides them with".
>

You lost me here. Why does an application developer have to test the
full functionality of a library they link against?
Brian Foster Nov. 23, 2015, 12:50 p.m. UTC | #25
On Mon, Nov 23, 2015 at 09:04:00AM +1100, Dave Chinner wrote:
> On Fri, Nov 20, 2015 at 05:47:34PM -0500, Brian Foster wrote:
> > On Sat, Nov 21, 2015 at 07:36:02AM +1100, Dave Chinner wrote:
> > > On Fri, Nov 20, 2015 at 10:11:19AM -0500, Brian Foster wrote:
> > > > On Fri, Nov 20, 2015 at 10:35:47AM +1100, Dave Chinner wrote:
> > > > Those latter calls are all from following down through the
> > > > map_vm_area()->vmap_page_range() codepath from __vmalloc_area_node(). We
> > > > call vm_map_ram() directly from _xfs_buf_map_pages(), which itself calls
> > > > down into the same code. Indeed, we already protect ourselves here via
> > > > the same memalloc_noio_save() mechanism that kmem_zalloc_large() uses.
> > > 
> > > Yes, we do, but that is separately handled to the allocation of the
> > > pages, which we have to do for all types of buffers, mapped or
> > > unmapped, because xfs_buf_ioapply_map() requires direct access to
> > > the underlying pages to build the bio for IO.  If we delegate the
> > > allocation of pages to vmalloc, we don't have direct reference to
> > > the underlying pages and so we have to do something completely
> > > diffferent to build the bios for the buffer....
> > > 
> > 
> > Octavian points out virt_to_page() in a previous mail. I'm not sure
> > that's the right interface solely based on looking at some current
> > callers, but there is vmalloc_to_page() so I'd expect we can gain access
> > to the pages one way or another.
> 
> Sure, but these are not zero cost operations....
> 
> > Given that, the buffer allocation code
> > would fully populate the xfs_buf as it is today. The buffer I/O
> > submission code wouldn't really know the difference and shouldn't have
> > to change at all.
> 
> The abstraction results in more expensive/complex setup and teardown
> of buffers and/or IO submission. i.e. the use of vmalloc() based
> abstractions has an additional cost over what we do now.
> 
> [...]
> 

Yeah, most likely. The vmalloc based lookup certainly looks more
involved than if the pages are already readily accessible. How much more
expensive (and whether it's noticeable enough to care) is probably a
matter of testing. I think the code itself could end up much
simpler, however. There's still a lot of duplication in our current
implementation that afaiu right now only continues to exist due to the
performance advantages of vm_map_ram().

> > Either way, it would require significantly more investigation/testing to
> > enable generic usage. The core point was really just to abstract the
> > nommu changes into something that potentially has generic use.
> 
> I'm not saying that it is impossible to do this, just trying to work
> out if making any changes to support nommu architectures is worth
> the potential future trouble making such changes could bring us.
> i.e. before working out how to do something, we have to decide
> whether it is worth doing  in the first place.
> 
> Just because you can do something doesn't automatically make it a
> good idea....
> 

Sure, good point. I've intentionally been sticking to the technical
points that have been raised (and I think we've covered them thoroughly to
this point ;).

FWIW, I agree with most of the concerns that have been called out in the
thread so far, but the support question ultimately sounds moot to me. We
can reject the "configuration" enabled by this particular patch, but
afaik this LKL thing could just go and implement an mmu mode and
fundamentally do what it's trying to do today. If it's useful, users
will use it, ultimately hit bugs, and we'll have to deal with it one way
or another.

Note that I'm not saying we have to support LKL. Rather, I view it as
equivalent to somebody off running some new non-standard/uncommon
architecture (or maybe a hypervisor is a better example in this case).
If the user runs into some low-level filesystem issue, the user probably
needs to get through whatever levels of support exist for that special
architecture/environment first and/or otherwise work with us to find a
reproducer on something more standard that we all have easy access to.
Just my .02, though. :)

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
Dave Chinner Nov. 23, 2015, 9 p.m. UTC | #26
On Mon, Nov 23, 2015 at 07:50:00AM -0500, Brian Foster wrote:
> On Mon, Nov 23, 2015 at 09:04:00AM +1100, Dave Chinner wrote:
> > On Fri, Nov 20, 2015 at 05:47:34PM -0500, Brian Foster wrote:
> > > On Sat, Nov 21, 2015 at 07:36:02AM +1100, Dave Chinner wrote:
> > > > On Fri, Nov 20, 2015 at 10:11:19AM -0500, Brian Foster wrote:
> > > > > On Fri, Nov 20, 2015 at 10:35:47AM +1100, Dave Chinner wrote:
> > > > > Those latter calls are all from following down through the
> > > > > map_vm_area()->vmap_page_range() codepath from __vmalloc_area_node(). We
> > > > > call vm_map_ram() directly from _xfs_buf_map_pages(), which itself calls
> > > > > down into the same code. Indeed, we already protect ourselves here via
> > > > > the same memalloc_noio_save() mechanism that kmem_zalloc_large() uses.
> > > > 
> > > > Yes, we do, but that is separately handled to the allocation of the
> > > > pages, which we have to do for all types of buffers, mapped or
> > > > unmapped, because xfs_buf_ioapply_map() requires direct access to
> > > > the underlying pages to build the bio for IO.  If we delegate the
> > > > allocation of pages to vmalloc, we don't have direct reference to
> > > > the underlying pages and so we have to do something completely
> > > > diffferent to build the bios for the buffer....
> > > > 
> > > 
> > > Octavian points out virt_to_page() in a previous mail. I'm not sure
> > > that's the right interface solely based on looking at some current
> > > callers, but there is vmalloc_to_page() so I'd expect we can gain access
> > > to the pages one way or another.
> > 
> > Sure, but these are not zero cost operations....
> > 
> > > Given that, the buffer allocation code
> > > would fully populate the xfs_buf as it is today. The buffer I/O
> > > submission code wouldn't really know the difference and shouldn't have
> > > to change at all.
> > 
> > The abstraction results in more expensive/complex setup and teardown
> > of buffers and/or IO submission. i.e. the use of vmalloc() based
> > abstractions has an additional cost over what we do now.
> > 
> > [...]
> > 
> 
> Yeah, most likely. The vmalloc based lookup certainly looks more
> involved than if the pages are already readily accessible. How much more
> expensive (and whether it's noticeable enough to care) is probably a
> matter of testing. I think the code itself could end up much more
> simple, however. There's still a lot of duplication in our current
> implementation that afaiu right now only continues to exist due to the
> performance advantages of vm_map_ram().

Yup. The historical problem here is that the mm/ people are not
really interested in making vmalloc() scale/perform better. We are
always told that if you need vmalloc() you are doing something
wrong, and so vmalloc() doesn't need improving. The same reason is
generally given for not making vmalloc handle GFP flags properly,
so we're really stuck between a rock and a hard place here.

> > > Either way, it would require significantly more investigation/testing to
> > > enable generic usage. The core point was really just to abstract the
> > > nommu changes into something that potentially has generic use.
> > 
> > I'm not saying that it is impossible to do this, just trying to work
> > out if making any changes to support nommu architectures is worth
> > the potential future trouble making such changes could bring us.
> > i.e. before working out how to do something, we have to decide
> > whether it is worth doing  in the first place.
> > 
> > Just because you can do something doesn't automatically make it a
> > good idea....
> 
> Sure, good point. I've intentionally been sticking to the technical
> points that have been raised (and I think we've covered it thoroughly to
> this point ;).

*nod*

> FWIW, I agree with most of the concerns that have been called out in the
> thread so far, but the support question ultimately sounds moot to me. We
> can reject the "configuration" enabled by this particular patch, but
> afaik this LKL thing could just go and implement an mmu mode and
> fundamentally do what it's trying to do today. If it's useful, users
> will use it, ultimately hit bugs, and we'll have to deal with it one way
> or another.

Right, but we need to set expectations appropriately now, before we
have users that depend on something we can't sanely support....

> Note that I'm not saying we have to support LKL. Rather, I view it as
> equivalent to somebody off running some new non-standard/uncommon
> architecture (or maybe a hypervisor is a better example in this case).

Right, but we don't support every possible kernel/device out there
with XFS. Up to now, no-mmu devices have been small, embedded
systems that can't/won't fit XFS in their flash/RAM because of
space/cost/need and so we've been able to completely ignore the
issues such platforms have. We really don't pay much attention to 32
bit systems anymore, and no-mmu is much, much further down the list
of "users who need XFS" than 32 bit arches....

> If the user runs into some low-level filesystem issue, the user probably
> needs to get through whatever levels of support exist for that special
> architecture/environment first and/or otherwise work with us to find a
> reproducer on something more standard that we all have easy access to.
> Just my .02, though. :)

The biggest problem we have is that the kernel might be tripping
over a problem introduced by some broken version of an app linked against
some old version of LKL. That's my biggest worry here - it's not
that LKL uses kernel code, it's that we no longer have any idea of
what code has been modifying the filesystem nor any easy way of
finding out. And we all know how hard it can be to get a straight
answer to "explain exactly what you did" from bug reporters...

Sure, once we *know* that the filesystem was modified by a random
userspace application, we can say "please reproduce with a vanilla
kernel", but it's getting to that point that can be problematic.

This is why I think a proper FUSE module implemented on top of the
userspace libxfs and is built as part of xfsprogs is a much better
way give users userspace access to XFS filesystems. We can develop
and support that directly, knowing exactly what code users are
running and having a relatively controlled environment the
filesystem code is running in....

Cheers,

Dave.
> 
> Brian
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
Dave Chinner Nov. 23, 2015, 9:46 p.m. UTC | #27
On Mon, Nov 23, 2015 at 03:41:49AM +0200, Octavian Purdila wrote:
> On Mon, Nov 23, 2015 at 12:44 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Sat, Nov 21, 2015 at 12:26:47AM +0200, Octavian Purdila wrote:
> >> On Fri, Nov 20, 2015 at 11:08 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> > On Fri, Nov 20, 2015 at 03:43:20PM +0200, Octavian Purdila wrote:
> >> >> On Fri, Nov 20, 2015 at 1:24 AM, Dave Chinner <david@fromorbit.com> wrote:
> >> >> > On Wed, Nov 18, 2015 at 12:46:21AM +0200, Octavian Purdila wrote:
> >> >> >> Naive implementation for non-mmu architectures: allocate physically
> >> >> >> contiguous xfs buffers with alloc_pages. Terribly inefficient with
> >> >> >> memory and fragmentation on high I/O loads but it may be good enough
> >> >> >> for basic usage (which most non-mmu architectures will need).
> >> >> >
> >> >> > Can you please explain why you want to use XFS on low end, basic
> >> >> > non-MMU devices? XFS is a high performance, enterprise/HPC level
> >> >> > filesystem - it's not a filesystem designed for small IoT level
> >> >> > devices - so I'm struggling to see why we'd want to expend any
> >> >> > effort to make XFS work on such devices....
> >> >> >
> >> >>
> >> >> Hi David,
> >> >>
> >> >> Yes XFS as the main fs on this type of devices does not make sense,
> >> >> but does it hurt to be able to perform basic operation on XFS from
> >> >> these devices? Perhaps accessing an external medium formatted with
> >> >> XFS?
> >> >>
> >> >> Another example is accessing VM images that are formatted with XFS.
> >> >> Currently we can do that with tools like libguestfs that use a VM in
> >> >> the background. I am working on a lighter solution for that where we
> >> >> compile the Linux kernel as a library [1]. This allows access to the
> >> >> filesystem without the need to use a full VM.
> >> >
> >> > That's hardly a "lighter solution"
> >> >
> >> > I'm kinda tired of the ongoing "hack random shit" approach to
> >> > container development.
> >>
> >> Since apparently there is a container devs hunting party going on
> >> right now, let me quickly confess that LKL has nothing to do with
> >> (them be damned) containers :)
> >>
> >> On a more serious note, LKL was not developed for containers or to try
> >> to circumvent privileged mounts. It was developed to allow the Linux
> >> kernel code to be reused in things like simple tools that allows one
> >> to modify a filesystem image.
> >
> > Anything tool that modifies an XFS filesystem that is not directly
> > maintained by the XFS developers voids any kind of support we can
> > supply. Just like the fact we don't support tainted kernels because
> > the 3rd party binary code is unknowable (and usually crap), having
> > the kernel code linked with random 3rd party userspace application
> > code is completely unsupportable by us.
> >
> 
> Perhaps tainting the kernel is a solution when running unknown
> applications linked with LKL.

It's not the *kernel* that is the problem - it is LKL that is the
tainted code!

> I would argue that applications that are maintained together with LKL
> (e.g. lklfuse in tools/lkl) should not taint the kernel because those
> applications will be under the control of kernel developers.

I completely disagree. Just because the code is in the kernel tree,
it doesn't mean it's controlled, reviewed, tested or maintained by
the relevant subsystem maintainers. If the subsystem maintainers are
not actively maintaining/testing those tools, then users can't
expect the subsystem maintainers to support them.


> I would
> also argue that mounting a filesystem read-only should not taint the
> kernel either.

LKL != kernel.

> >> >> And a final example is linking the bootloader code with LKL to access
> >> >> the filesystem. This has a hard requirement on non-mmu.
> >> >
> >> > No way. We *can't* support filesystems that have had bootloaders
> >> > make arbitrary changes to the filesystem without the knowlege of the
> >> > OS that *owns the filesystem*. Similarly, we cannot support random
> >> > applications that internally mount and modify filesystem images in
> >> > ways we can't see, control, test or simulate.  Sure, they use the
> >> > kernel code, but that doesn't stop them from doing stupid shit that
> >> > could corrupt the filesystem image. So, no, we are not going to
> >> > support giving random applications direct access to XFS filesystem
> >> > images, even via LKL.
> >> >
> >>
> >> LKL only exports the Linux kernel system calls and nothing else to
> >> applications. Because of that, there should not be any loss of control
> >> or visibility to the XFS fs driver.
> >
> > It runs in the same address space as the user application, yes? And
> > hence application bugs can cause the kernel code to malfunction,
> > yes?
> >
> 
> Most non-mmu architecture have the same issue and nevertheless non-mmu
> is still supported in Linux (including most filesystems).

That doesn't mean all subsystems in the kernel support users on
non-mmu systems.

> Also, filesystem code runs in the same address space with other kernel
> code and drivers and a bug anywhere in the kernel can cause filesystem
> code to malfunction.

Well, yes, but we have to trust other kernel developers to keep their
ship in good shape, too. But we don't trust *out of tree code* - that
taints the kernel and most upstream kernel developers will notice
such taints when there are weird errors being reported....

But you want to extend that trust to whatever random code gets
chucked into tools/lkl. I'm stretched far enough already having to
keep up with mm, VFS, locking and filesystem developments that I
don't have time to keep up with what LKL is doing or what
applications people are writing. You're not going to get subsystem
maintainers being able to find the time to review and test
applications that get stuffed into tools/lkl, so from that
perspective it's still a complete crapshoot.

> >> > I really don't see how using LKL to give userspace access to XFS
> >> > filesystems is a better solution than actually writing a proper,
> >> > supported XFS-FUSE module. LKL is so full of compromises that it's
> >> > going to be unworkable and unsupportable in practice...
> >>
> >> Could you elaborate on some of these issues?
> >
> > Start with "is a no-mmu architecture" and all the compromises that
> > means the kernel code needs to make,
> 
> I don't see non-mmu as a compromise. It is supported by Linux and most
> filesystems work fine on non-mmu architectures.

most != all.

XFS is unashamedly aimed at and developed for high performance systems,
i.e. enterprise servers, HPC, fileservers, cloud storage
infrastructure, etc. IOWs, we don't even develop for desktop
machines, even though XFS performs adequately for most desktop
workloads.  We've never cared about non-mmu systems because of the
requirements we have on virtually mapped buffers and the fact they
don't exist in the target markets XFS is aimed at. And, really, LKL
doesn't change that...

> LKL can be implemented
> as a mmu architecture. Having it as a non-mmu architecture has the
> advantages of allowing it to run in more constrained environments like
> bootloaders.

The OS owns the filesystem, not the bootloader. If the bootloader is
modifying filesystems (e.g. by running log recovery to mount the fs
internally to find the kernel/initrd files), then the bootloader
compromises the filesystem integrity. We do not support such
configurations at all.


> > and finish it off with "LKL linked applications will
> > never be tested by their developers over the full functionality the
> > LKL provides them with".
> 
> You lost me here. Why does an application developer have to test the
> full functionality of a library it is linked with?

How much filesystem testing did you actually do with lklfuse? What
about load testing? Data integrity/crash/power fail and recovery
testing?  Scalability testing?  Or have you just assumed that it'll
all just work fine because the kernel fs developers test this stuff?

Yes, we've tested the kernel code along these lines *in kernel
space*. That doesn't mean that an LKL application that uses the
kernel functionality *in user space* will behave the same way or,
indeed, function correctly in adverse circumstances. Application
developers are not going to be aware of such issues, and they aren't
going to test for such situations - they are simply going to assume
"this works just fine" until it doesn't....

Cheers,

Dave.
diff mbox

Patch

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 8ecffb3..50b5246 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -261,6 +261,7 @@  xfs_buf_free(
 	ASSERT(list_empty(&bp->b_lru));
 
 	if (bp->b_flags & _XBF_PAGES) {
+#ifdef CONFIG_MMU
 		uint		i;
 
 		if (xfs_buf_is_vmapped(bp))
@@ -272,6 +273,10 @@  xfs_buf_free(
 
 			__free_page(page);
 		}
+#else
+		free_pages((unsigned long)page_to_virt(bp->b_pages[0]),
+			   order_base_2(bp->b_page_count));
+#endif
 	} else if (bp->b_flags & _XBF_KMEM)
 		kmem_free(bp->b_addr);
 	_xfs_buf_free_pages(bp);
@@ -338,7 +343,14 @@  use_alloc_page:
 		struct page	*page;
 		uint		retries = 0;
 retry:
+#ifdef CONFIG_MMU
 		page = alloc_page(gfp_mask);
+#else
+		if (i == 0)
+			page = alloc_pages(gfp_mask, order_base_2(page_count));
+		else
+			page = bp->b_pages[0] + i;
+#endif
 		if (unlikely(page == NULL)) {
 			if (flags & XBF_READ_AHEAD) {
 				bp->b_page_count = i;
@@ -372,8 +384,11 @@  retry:
 	return 0;
 
 out_free_pages:
+#ifdef CONFIG_MMU
 	for (i = 0; i < bp->b_page_count; i++)
 		__free_page(bp->b_pages[i]);
+#endif
+
 	return error;
 }
 
@@ -392,6 +407,7 @@  _xfs_buf_map_pages(
 	} else if (flags & XBF_UNMAPPED) {
 		bp->b_addr = NULL;
 	} else {
+#ifdef CONFIG_MMU
 		int retried = 0;
 		unsigned noio_flag;
 
@@ -412,6 +428,9 @@  _xfs_buf_map_pages(
 			vm_unmap_aliases();
 		} while (retried++ <= 1);
 		memalloc_noio_restore(noio_flag);
+#else
+		bp->b_addr = page_to_virt(bp->b_pages[0]);
+#endif
 
 		if (!bp->b_addr)
 			return -ENOMEM;
@@ -816,11 +835,19 @@  xfs_buf_get_uncached(
 	if (error)
 		goto fail_free_buf;
 
+#ifdef CONFIG_MMU
 	for (i = 0; i < page_count; i++) {
 		bp->b_pages[i] = alloc_page(xb_to_gfp(flags));
 		if (!bp->b_pages[i])
 			goto fail_free_mem;
 	}
+#else
+	bp->b_pages[0] = alloc_pages(flags, order_base_2(page_count));
+	if (!bp->b_pages[0])
+		goto fail_free_buf;
+	for (i = 1; i < page_count; i++)
+		bp->b_pages[i] = bp->b_pages[i-1] + 1;
+#endif
 	bp->b_flags |= _XBF_PAGES;
 
 	error = _xfs_buf_map_pages(bp, 0);