diff mbox series

[v4,1/2] hugetlb: use f_mode & FMODE_HUGETLBFS to identify hugetlbfs files

Message ID 20200612004644.255692-1-mike.kravetz@oracle.com (mailing list archive)
State New, archived
Headers show
Series [v4,1/2] hugetlb: use f_mode & FMODE_HUGETLBFS to identify hugetlbfs files | expand

Commit Message

Mike Kravetz June 12, 2020, 12:46 a.m. UTC
The routine is_file_hugepages() checks f_op == hugetlbfs_file_operations
to determine if the file resides in hugetlbfs.  This is problematic when
the file is on a union or overlay.  Instead, define a new file mode
FMODE_HUGETLBFS which is set when a hugetlbfs file is opened.  The mode
can easily be copied to other 'files' derived from the original hugetlbfs
file.

With this change hugetlbfs_file_operations can be static as it should be.

There is also a (duplicate) set of shm file operations used for the routine
is_file_shm_hugepages().  Instead of setting/using special f_op's, just
propagate the FMODE_HUGETLBFS mode.  This means is_file_shm_hugepages() and
the duplicate f_ops can be removed.

While cleaning things up, change the name of is_file_hugepages() to
is_file_hugetlbfs().  The term hugepages is a bit ambiguous.

A subsequent patch will propagate FMODE_HUGETLBFS in overlayfs.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c    |  7 +++++++
 fs/io_uring.c           |  2 +-
 include/linux/fs.h      |  3 +++
 include/linux/hugetlb.h | 10 ++++------
 include/linux/shm.h     |  5 -----
 ipc/shm.c               | 34 ++++++++--------------------------
 mm/memfd.c              |  2 +-
 mm/mmap.c               |  8 ++++----
 8 files changed, 28 insertions(+), 43 deletions(-)

Comments

Matthew Wilcox June 12, 2020, 1:53 a.m. UTC | #1
On Thu, Jun 11, 2020 at 05:46:43PM -0700, Mike Kravetz wrote:
> The routine is_file_hugepages() checks f_op == hugetlbfs_file_operations
> to determine if the file resides in hugetlbfs.  This is problematic when
> the file is on a union or overlay.  Instead, define a new file mode
> FMODE_HUGETLBFS which is set when a hugetlbfs file is opened.  The mode
> can easily be copied to other 'files' derived from the original hugetlbfs
> file.
> 
> With this change hugetlbfs_file_operations can be static as it should be.
> 
> There is also a (duplicate) set of shm file operations used for the routine
> is_file_shm_hugepages().  Instead of setting/using special f_op's, just
> propagate the FMODE_HUGETLBFS mode.  This means is_file_shm_hugepages() and
> the duplicate f_ops can be removed.
> 
> While cleaning things up, change the name of is_file_hugepages() to
> is_file_hugetlbfs().  The term hugepages is a bit ambiguous.

I was going to have objections to this before I read it more carefully
and realised that the "shm" here is sysvipc and doesn't have anything
to do with the huge page support in shmfs.

> A subsequent patch will propagate FMODE_HUGETLBFS in overlayfs.
> 
> Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>

I might have suggested splitting the rename of is_file_hugetlbfs() from
the rest of this patch, but I wouldn't resend to change that.
Al Viro June 12, 2020, 1:58 a.m. UTC | #2
On Thu, Jun 11, 2020 at 05:46:43PM -0700, Mike Kravetz wrote:
> The routine is_file_hugepages() checks f_op == hugetlbfs_file_operations
> to determine if the file resides in hugetlbfs.  This is problematic when
> the file is on a union or overlay.  Instead, define a new file mode
> FMODE_HUGETLBFS which is set when a hugetlbfs file is opened.  The mode
> can easily be copied to other 'files' derived from the original hugetlbfs
> file.
> 
> With this change hugetlbfs_file_operations can be static as it should be.
> 
> There is also a (duplicate) set of shm file operations used for the routine
> is_file_shm_hugepages().  Instead of setting/using special f_op's, just
> propagate the FMODE_HUGETLBFS mode.  This means is_file_shm_hugepages() and
> the duplicate f_ops can be removed.

s/HUGETLBFS/HUGEPAGES/, please.

> While cleaning things up, change the name of is_file_hugepages() to
> is_file_hugetlbfs().  The term hugepages is a bit ambiguous.

Don't, especially since the very next patch adds such on overlayfs...

Incidentally, can a hugetlbfs be a lower layer, while the upper one
is a normal filesystem?  What should happen on copyup?
Mike Kravetz June 12, 2020, 9:51 p.m. UTC | #3
On 6/11/20 6:58 PM, Al Viro wrote:
> On Thu, Jun 11, 2020 at 05:46:43PM -0700, Mike Kravetz wrote:
>> The routine is_file_hugepages() checks f_op == hugetlbfs_file_operations
>> to determine if the file resides in hugetlbfs.  This is problematic when
>> the file is on a union or overlay.  Instead, define a new file mode
>> FMODE_HUGETLBFS which is set when a hugetlbfs file is opened.  The mode
>> can easily be copied to other 'files' derived from the original hugetlbfs
>> file.
>>
>> With this change hugetlbfs_file_operations can be static as it should be.
>>
>> There is also a (duplicate) set of shm file operations used for the routine
>> is_file_shm_hugepages().  Instead of setting/using special f_op's, just
>> propagate the FMODE_HUGETLBFS mode.  This means is_file_shm_hugepages() and
>> the duplicate f_ops can be removed.
> 
> s/HUGETLBFS/HUGEPAGES/, please.
> 
>> While cleaning things up, change the name of is_file_hugepages() to
>> is_file_hugetlbfs().  The term hugepages is a bit ambiguous.
> 
> Don't, especially since the very next patch adds such on overlayfs...

Ok. This is just something I thought might clarify things.  I seem to
recall questions about 'huge page' routines such as "is that for THP or
hugetlb huge pages"?  That was my motivation for the change.  Since this
is only about hugetlbfs, make it explicit.

> Incidentally, can a hugetlbfs be a lower layer, while the upper one
> is a normal filesystem?  What should happen on copyup?

Yes, that seems to work as expected.  When accessed for write the hugetlb
file is copied to the normal filesystem.

The BUG found by syzbot actually has a single hugetlbfs as both lower and
upper.  With the BUG 'fixed', I am not exactly sure what the expected
behavior is in this case.  I may be wrong, but I would expect any operations
that can be performed on a stand alone hugetlbfs to also be performed on
the overlay.  However, mmap() still fails.  I will look into it.

I also looked at normal filesystem lower and hugetlbfs upper.  Yes, overlayfs
allows this.  This is somewhat 'interesting' as write() is not supported in
hugetlbfs.  Writing to files in the overlay actually ended up writing to
files in the lower filesystem.  That seems wrong, but overlayfs is new to me.

Earlier in the discussion of these issues, Colin Walters asked "Is there any
actual valid use case for mounting an overlayfs on top of hugetlbfs?"  I can
not think of one.  Perhaps we should consider limiting the ways in which
hugetlbfs can be used in overlayfs?  Preventing it from being an upper
filesystem might be a good start?  Or, do people think making hugetlbfs and
overlayfs play nice together is useful?
Amir Goldstein June 13, 2020, 6:53 a.m. UTC | #4
> > Incidentally, can a hugetlbfs be a lower layer, while the upper one
> > is a normal filesystem?  What should happen on copyup?
>
> Yes, that seems to work as expected.  When accessed for write the hugetlb
> file is copied to the normal filesystem.
>
> The BUG found by syzbot actually has a single hugetlbfs as both lower and
> upper.  With the BUG 'fixed', I am not exactly sure what the expected
> behavior is in this case.  I may be wrong, but I would expect any operations
> that can be performed on a stand alone hugetlbfs to also be performed on
> the overlay.  However, mmap() still fails.  I will look into it.
>
> I also looked at normal filesystem lower and hugetlbfs upper.  Yes, overlayfs
> allows this.  This is somewhat 'interesting' as write() is not supported in
> hugetlbfs.  Writing to files in the overlay actually ended up writing to
> files in the lower filesystem.  That seems wrong, but overlayfs is new to me.
>

I am not sure how that happened, but I think that ovl_open_realfile()
needs to fixup f_mode flags FMODE_CAN_WRITE | FMODE_CAN_READ
after open_with_fake_path().

> Earlier in the discussion of these issues, Colin Walters asked "Is there any
> actual valid use case for mounting an overlayfs on top of hugetlbfs?"  I can
> not think of one.  Perhaps we should consider limiting the ways in which
> hugetlbfs can be used in overlayfs?  Preventing it from being an upper
> filesystem might be a good start?  Or, do people think making hugetlbfs and
> overlayfs play nice together is useful?

If people think that making hugetlbfs and overlayfs play nice together maybe
they should work on this problem. It doesn't look like either
hugetlbfs developers
nor overlayfs developers care much about the combination.
Your concern, I assume, is fixing the syzbot issue.

I agree with Colin's remark about adding limitations, but it would be a shame
if overlay had to special case hugetlbfs. It would have been better if we could
find a property of hugetlbfs that makes it inapplicable for overlayfs
upper/lower
or stacking fs in general.

The simplest thing for you to do in order to shush syzbot is what procfs does:
        /*
         * procfs isn't actually a stacking filesystem; however, there is
         * too much magic going on inside it to permit stacking things on
         * top of it
         */
        s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;

Currently, the only in-tree stacking fs are overlayfs and ecryptfs, but there
are some out of tree implementations as well (shiftfs).
So you may only take that option if you do not care about the combination
of hugetlbfs with any of the above.

overlayfs support of mmap is not as good as one might hope.
overlayfs.rst says:
"If a file residing on a lower layer is opened for read-only and then
 memory mapped with MAP_SHARED, then subsequent changes to
 the file are not reflected in the memory mapping."

So if I were you, I wouldn't go trying to fix overlayfs-huguetlb interop...

Thanks,
Amir.
Matthew Wilcox June 13, 2020, 2:38 p.m. UTC | #5
On Sat, Jun 13, 2020 at 09:53:24AM +0300, Amir Goldstein wrote:
> Currently, the only in-tree stacking fs are overlayfs and ecryptfs, but there
> are some out of tree implementations as well (shiftfs).
> So you may only take that option if you do not care about the combination
> of hugetlbfs with any of the above.

I could see shiftfs being interesting, maybe.  I don't really see
the usecase for layering overlayfs or ecryptfs on top of a ram-based
filesystem.
Mike Kravetz June 13, 2020, 7:12 p.m. UTC | #6
On 6/12/20 11:53 PM, Amir Goldstein wrote:
>>> Incidentally, can a hugetlbfs be a lower layer, while the upper one
>>> is a normal filesystem?  What should happen on copyup?
>>
>> Yes, that seems to work as expected.  When accessed for write the hugetlb
>> file is copied to the normal filesystem.
>>
>> The BUG found by syzbot actually has a single hugetlbfs as both lower and
>> upper.  With the BUG 'fixed', I am not exactly sure what the expected
>> behavior is in this case.  I may be wrong, but I would expect any operations
>> that can be performed on a stand alone hugetlbfs to also be performed on
>> the overlay.  However, mmap() still fails.  I will look into it.
>>
>> I also looked at normal filesystem lower and hugetlbfs upper.  Yes, overlayfs
>> allows this.  This is somewhat 'interesting' as write() is not supported in
>> hugetlbfs.  Writing to files in the overlay actually ended up writing to
>> files in the lower filesystem.  That seems wrong, but overlayfs is new to me.
>>
> 
> I am not sure how that happened, but I think that ovl_open_realfile()
> needs to fixup f_mode flags FMODE_CAN_WRITE | FMODE_CAN_READ
> after open_with_fake_path().
> 
>> Earlier in the discussion of these issues, Colin Walters asked "Is there any
>> actual valid use case for mounting an overlayfs on top of hugetlbfs?"  I can
>> not think of one.  Perhaps we should consider limiting the ways in which
>> hugetlbfs can be used in overlayfs?  Preventing it from being an upper
>> filesystem might be a good start?  Or, do people think making hugetlbfs and
>> overlayfs play nice together is useful?
> 
> If people think that making hugetlbfs and overlayfs play nice together maybe
> they should work on this problem. It doesn't look like either
> hugetlbfs developers
> nor overlayfs developers care much about the combination.

Thanks Amir,

As a hugetlbfs developer, I do not know of a use case for interoperability
with overlayfs.  So yes, I am not too interested in making them work well
together.  However, if there was an actual use case I would be more than
happy to consider doing the work.  Just hate to put effort into fixing up
two 'special' filesystems for functionality that may not be used.

I can't speak for overlayfs developers.

> Your concern, I assume, is fixing the syzbot issue.

That is the primary concern.  We should not BUG!  After fixing that up, Al
asked how these things worked together.  I honestly did not look at
interoperability before that.  I am not sure if anyone has done that in the
past.

> I agree with Colin's remark about adding limitations, but it would be a shame
> if overlay had to special case hugetlbfs. It would have been better if we could
> find a property of hugetlbfs that makes it inapplicable for overlayfs
> upper/lower
> or stacking fs in general.
> 
> The simplest thing for you to do in order to shush syzbot is what procfs does:
>         /*
>          * procfs isn't actually a stacking filesystem; however, there is
>          * too much magic going on inside it to permit stacking things on
>          * top of it
>          */
>         s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
> 
> Currently, the only in-tree stacking fs are overlayfs and ecryptfs, but there
> are some out of tree implementations as well (shiftfs).
> So you may only take that option if you do not care about the combination
> of hugetlbfs with any of the above.
> 
> overlayfs support of mmap is not as good as one might hope.
> overlayfs.rst says:
> "If a file residing on a lower layer is opened for read-only and then
>  memory mapped with MAP_SHARED, then subsequent changes to
>  the file are not reflected in the memory mapping."
> 
> So if I were you, I wouldn't go trying to fix overlayfs-huguetlb interop...

Thanks again,

I'll look at something as simple as s_stack_depth.
Miklos Szeredi June 15, 2020, 7:53 a.m. UTC | #7
On Sat, Jun 13, 2020 at 9:12 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 6/12/20 11:53 PM, Amir Goldstein wrote:

> As a hugetlbfs developer, I do not know of a use case for interoperability
> with overlayfs.  So yes, I am not too interested in making them work well
> together.  However, if there was an actual use case I would be more than
> happy to consider doing the work.  Just hate to put effort into fixing up
> two 'special' filesystems for functionality that may not be used.
>
> I can't speak for overlayfs developers.

As I said, I only know of tmpfs being upper layer as a valid use case.
   Does that work with hugepages?  How would I go about testing that?

> > I agree with Colin's remark about adding limitations, but it would be a shame
> > if overlay had to special case hugetlbfs. It would have been better if we could
> > find a property of hugetlbfs that makes it inapplicable for overlayfs
> > upper/lower
> > or stacking fs in general.
> >
> > The simplest thing for you to do in order to shush syzbot is what procfs does:
> >         /*
> >          * procfs isn't actually a stacking filesystem; however, there is
> >          * too much magic going on inside it to permit stacking things on
> >          * top of it
> >          */
> >         s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
> >
> > Currently, the only in-tree stacking fs are overlayfs and ecryptfs, but there
> > are some out of tree implementations as well (shiftfs).
> > So you may only take that option if you do not care about the combination
> > of hugetlbfs with any of the above.
> >
> > overlayfs support of mmap is not as good as one might hope.
> > overlayfs.rst says:
> > "If a file residing on a lower layer is opened for read-only and then
> >  memory mapped with MAP_SHARED, then subsequent changes to
> >  the file are not reflected in the memory mapping."
> >
> > So if I were you, I wouldn't go trying to fix overlayfs-huguetlb interop...
>
> Thanks again,
>
> I'll look at something as simple as s_stack_depth.

Agree.

Thanks,
Miklos
Miklos Szeredi June 15, 2020, 8:24 a.m. UTC | #8
On Sat, Jun 13, 2020 at 8:53 AM Amir Goldstein <amir73il@gmail.com> wrote:

> > I also looked at normal filesystem lower and hugetlbfs upper.  Yes, overlayfs
> > allows this.  This is somewhat 'interesting' as write() is not supported in
> > hugetlbfs.  Writing to files in the overlay actually ended up writing to
> > files in the lower filesystem.  That seems wrong, but overlayfs is new to me.

Yes, this very definitely should not happen.

> I am not sure how that happened, but I think that ovl_open_realfile()
> needs to fixup f_mode flags FMODE_CAN_WRITE | FMODE_CAN_READ
> after open_with_fake_path().

Okay, but how did the write actually get to the lower layer?

I failed to reproduce this.  Mike, how did you trigger this?

Thanks,
Miklos
Amir Goldstein June 15, 2020, 10:05 a.m. UTC | #9
On Mon, Jun 15, 2020 at 10:53 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> On Sat, Jun 13, 2020 at 9:12 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >
> > On 6/12/20 11:53 PM, Amir Goldstein wrote:
>
> > As a hugetlbfs developer, I do not know of a use case for interoperability
> > with overlayfs.  So yes, I am not too interested in making them work well
> > together.  However, if there was an actual use case I would be more than
> > happy to consider doing the work.  Just hate to put effort into fixing up
> > two 'special' filesystems for functionality that may not be used.
> >
> > I can't speak for overlayfs developers.
>
> As I said, I only know of tmpfs being upper layer as a valid use case.
>    Does that work with hugepages?  How would I go about testing that?

Simple, after enabling CONFIG_HUGETLBFS:

diff --git a/mount_union.py b/mount_union.py
index fae8899..4070c70 100644
--- a/mount_union.py
+++ b/mount_union.py
@@ -15,7 +15,7 @@ def mount_union(ctx):
         snapshot_mntroot = cfg.snapshot_mntroot()
         if cfg.should_mount_upper():
             system("mount " + upper_mntroot + " 2>/dev/null"
-                    " || mount -t tmpfs upper_layer " + upper_mntroot)
+                    " || mount -t hugetlbfs upper_layer " + upper_mntroot)
         layer_mntroot = upper_mntroot + "/" + ctx.curr_layer()
         upperdir = layer_mntroot + "/u"
         workdir = layer_mntroot + "/w"

It fails colossally, because hugetlbfs, does not have write_iter().
It is only meant as an interface to create named maps of huge pages.
So I don't really see the use case for using it as upper.

Thanks,
Amir.
Miklos Szeredi June 15, 2020, 1:01 p.m. UTC | #10
On Mon, Jun 15, 2020 at 12:05 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Mon, Jun 15, 2020 at 10:53 AM Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > On Sat, Jun 13, 2020 at 9:12 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > >
> > > On 6/12/20 11:53 PM, Amir Goldstein wrote:
> >
> > > As a hugetlbfs developer, I do not know of a use case for interoperability
> > > with overlayfs.  So yes, I am not too interested in making them work well
> > > together.  However, if there was an actual use case I would be more than
> > > happy to consider doing the work.  Just hate to put effort into fixing up
> > > two 'special' filesystems for functionality that may not be used.
> > >
> > > I can't speak for overlayfs developers.
> >
> > As I said, I only know of tmpfs being upper layer as a valid use case.
> >    Does that work with hugepages?  How would I go about testing that?
>
> Simple, after enabling CONFIG_HUGETLBFS:
>
> diff --git a/mount_union.py b/mount_union.py
> index fae8899..4070c70 100644
> --- a/mount_union.py
> +++ b/mount_union.py
> @@ -15,7 +15,7 @@ def mount_union(ctx):
>          snapshot_mntroot = cfg.snapshot_mntroot()
>          if cfg.should_mount_upper():
>              system("mount " + upper_mntroot + " 2>/dev/null"
> -                    " || mount -t tmpfs upper_layer " + upper_mntroot)
> +                    " || mount -t hugetlbfs upper_layer " + upper_mntroot)
>          layer_mntroot = upper_mntroot + "/" + ctx.curr_layer()
>          upperdir = layer_mntroot + "/u"
>          workdir = layer_mntroot + "/w"
>
> It fails colossally, because hugetlbfs, does not have write_iter().
> It is only meant as an interface to create named maps of huge pages.
> So I don't really see the use case for using it as upper.

Right.

I was actually asking about the tmpfs+hugepages, not the hugetlbfs case.

In the tmpfs case it looks like the lack of ->get_unmapped_area() in
overlayfs could still be an issue.   But I'm not sure how to trigger
that.

Thanks,
Miklos
Mike Kravetz June 15, 2020, 5:48 p.m. UTC | #11
On 6/15/20 1:24 AM, Miklos Szeredi wrote:
> On Sat, Jun 13, 2020 at 8:53 AM Amir Goldstein <amir73il@gmail.com> wrote:
> 
>>> I also looked at normal filesystem lower and hugetlbfs upper.  Yes, overlayfs
>>> allows this.  This is somewhat 'interesting' as write() is not supported in
>>> hugetlbfs.  Writing to files in the overlay actually ended up writing to
>>> files in the lower filesystem.  That seems wrong, but overlayfs is new to me.
> 
> Yes, this very definitely should not happen.
> 
>> I am not sure how that happened, but I think that ovl_open_realfile()
>> needs to fixup f_mode flags FMODE_CAN_WRITE | FMODE_CAN_READ
>> after open_with_fake_path().
> 
> Okay, but how did the write actually get to the lower layer?
> 
> I failed to reproduce this.  Mike, how did you trigger this?

My apologies!!!

I reviewed my testing and found that it was incorrectly writing to the
lower filesystem.  Writing to any file in the union will fail.
Mike Kravetz June 15, 2020, 11:45 p.m. UTC | #12
On 6/15/20 12:53 AM, Miklos Szeredi wrote:
> On Sat, Jun 13, 2020 at 9:12 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>> On 6/12/20 11:53 PM, Amir Goldstein wrote:
>>>
>>> The simplest thing for you to do in order to shush syzbot is what procfs does:
>>>         /*
>>>          * procfs isn't actually a stacking filesystem; however, there is
>>>          * too much magic going on inside it to permit stacking things on
>>>          * top of it
>>>          */
>>>         s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
>>>
>>> Currently, the only in-tree stacking fs are overlayfs and ecryptfs, but there
>>> are some out of tree implementations as well (shiftfs).
>>> So you may only take that option if you do not care about the combination
>>> of hugetlbfs with any of the above.
>>>
>>> overlayfs support of mmap is not as good as one might hope.
>>> overlayfs.rst says:
>>> "If a file residing on a lower layer is opened for read-only and then
>>>  memory mapped with MAP_SHARED, then subsequent changes to
>>>  the file are not reflected in the memory mapping."
>>>
>>> So if I were you, I wouldn't go trying to fix overlayfs-huguetlb interop...
>>
>> Thanks again,
>>
>> I'll look at something as simple as s_stack_depth.
> 
> Agree.

Apologies again for in the incorrect information about writing to lower
filesystem.

Stacking ecryptfs on hugetlbfs does not work either.  Here is what happens
when trying to create a new file.

[ 1188.863425] ecryptfs_write_metadata_to_contents: Error attempting to write header information to lower file; rc = [-22]
[ 1188.865469] ecryptfs_write_metadata: Error writing metadata out to lower file; rc = [-22]
[ 1188.867022] Error writing headers; rc = [-22]

I like Amir's idea of just setting s_stack_depth in hugetlbfs to prevent
stacking.

From 0fbed66b37c18919ea7edd47b113c97644f49362 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Mon, 15 Jun 2020 14:37:52 -0700
Subject: [PATCH] hugetlbfs: prevent filesystem stacking of hugetlbfs

syzbot found issues with having hugetlbfs on a union/overlay as reported
in [1].  Due to the limitations (no write) and special functionality of
hugetlbfs, it does not work well in filesystem stacking.  There are no
know use cases for hugetlbfs stacking.  Rather than making modifications
to get hugetlbfs working in such environments, simply prevent stacking.

[1] https://lore.kernel.org/linux-mm/000000000000b4684e05a2968ca6@google.com/

Reported-by: syzbot+d6ec23007e951dadf3de@syzkaller.appspotmail.com
Suggested-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 991c60c7ffe0..f32759c8e84d 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -1313,6 +1313,12 @@ hugetlbfs_fill_super(struct super_block *sb, struct fs_context *fc)
 	sb->s_magic = HUGETLBFS_MAGIC;
 	sb->s_op = &hugetlbfs_ops;
 	sb->s_time_gran = 1;
+
+	/*
+	 * Due to the special and limited functionality of hugetlbfs, it does
+	 * not work well as a stacking filesystem.
+	 */
+	sb->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
 	sb->s_root = d_make_root(hugetlbfs_get_root(sb, ctx));
 	if (!sb->s_root)
 		goto out_free;
Miklos Szeredi June 16, 2020, 9:01 a.m. UTC | #13
On Tue, Jun 16, 2020 at 1:45 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 6/15/20 12:53 AM, Miklos Szeredi wrote:
> > On Sat, Jun 13, 2020 at 9:12 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >> On 6/12/20 11:53 PM, Amir Goldstein wrote:
> >>>
> >>> The simplest thing for you to do in order to shush syzbot is what procfs does:
> >>>         /*
> >>>          * procfs isn't actually a stacking filesystem; however, there is
> >>>          * too much magic going on inside it to permit stacking things on
> >>>          * top of it
> >>>          */
> >>>         s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH;
> >>>
> >>> Currently, the only in-tree stacking fs are overlayfs and ecryptfs, but there
> >>> are some out of tree implementations as well (shiftfs).
> >>> So you may only take that option if you do not care about the combination
> >>> of hugetlbfs with any of the above.
> >>>
> >>> overlayfs support of mmap is not as good as one might hope.
> >>> overlayfs.rst says:
> >>> "If a file residing on a lower layer is opened for read-only and then
> >>>  memory mapped with MAP_SHARED, then subsequent changes to
> >>>  the file are not reflected in the memory mapping."
> >>>
> >>> So if I were you, I wouldn't go trying to fix overlayfs-huguetlb interop...
> >>
> >> Thanks again,
> >>
> >> I'll look at something as simple as s_stack_depth.
> >
> > Agree.
>
> Apologies again for in the incorrect information about writing to lower
> filesystem.
>
> Stacking ecryptfs on hugetlbfs does not work either.  Here is what happens
> when trying to create a new file.
>
> [ 1188.863425] ecryptfs_write_metadata_to_contents: Error attempting to write header information to lower file; rc = [-22]
> [ 1188.865469] ecryptfs_write_metadata: Error writing metadata out to lower file; rc = [-22]
> [ 1188.867022] Error writing headers; rc = [-22]
>
> I like Amir's idea of just setting s_stack_depth in hugetlbfs to prevent
> stacking.
>
> From 0fbed66b37c18919ea7edd47b113c97644f49362 Mon Sep 17 00:00:00 2001
> From: Mike Kravetz <mike.kravetz@oracle.com>
> Date: Mon, 15 Jun 2020 14:37:52 -0700
> Subject: [PATCH] hugetlbfs: prevent filesystem stacking of hugetlbfs
>
> syzbot found issues with having hugetlbfs on a union/overlay as reported
> in [1].  Due to the limitations (no write) and special functionality of
> hugetlbfs, it does not work well in filesystem stacking.  There are no
> know use cases for hugetlbfs stacking.  Rather than making modifications
> to get hugetlbfs working in such environments, simply prevent stacking.
>
> [1] https://lore.kernel.org/linux-mm/000000000000b4684e05a2968ca6@google.com/
>
> Reported-by: syzbot+d6ec23007e951dadf3de@syzkaller.appspotmail.com
> Suggested-by: Amir Goldstein <amir73il@gmail.com>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Acked-by: Miklos Szeredi <mszeredi@redhat.com>

Thanks,
Miklos
diff mbox series

Patch

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 991c60c7ffe0..5c0c50a88c84 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -324,6 +324,12 @@  static ssize_t hugetlbfs_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	return retval;
 }
 
+static int hugetlbfs_open(struct inode *inode, struct file *file)
+{
+	file->f_mode |= FMODE_HUGETLBFS;
+	return 0;
+}
+
 static int hugetlbfs_write_begin(struct file *file,
 			struct address_space *mapping,
 			loff_t pos, unsigned len, unsigned flags,
@@ -1112,6 +1118,7 @@  static void init_once(void *foo)
 
 const struct file_operations hugetlbfs_file_operations = {
 	.read_iter		= hugetlbfs_read_iter,
+	.open			= hugetlbfs_open,
 	.mmap			= hugetlbfs_file_mmap,
 	.fsync			= noop_fsync,
 	.get_unmapped_area	= hugetlb_get_unmapped_area,
diff --git a/fs/io_uring.c b/fs/io_uring.c
index bb25e3997d41..96e8a4bb610a 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7123,7 +7123,7 @@  static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 				struct vm_area_struct *vma = vmas[j];
 
 				if (vma->vm_file &&
-				    !is_file_hugepages(vma->vm_file)) {
+				    !is_file_hugetlbfs(vma->vm_file)) {
 					ret = -EOPNOTSUPP;
 					break;
 				}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 45cc10cdf6dd..99af9513f9ab 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -175,6 +175,9 @@  typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 /* File does not contribute to nr_files count */
 #define FMODE_NOACCOUNT		((__force fmode_t)0x20000000)
 
+/* File is in hugetlbfs filesystem */
+#define FMODE_HUGETLBFS		((__force fmode_t)0x40000000)
+
 /*
  * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector
  * that indicates that they should check the contents of the iovec are
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 43a1cef8f0f1..aa3408775464 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -429,18 +429,16 @@  static inline struct hugetlbfs_inode_info *HUGETLBFS_I(struct inode *inode)
 	return container_of(inode, struct hugetlbfs_inode_info, vfs_inode);
 }
 
-extern const struct file_operations hugetlbfs_file_operations;
 extern const struct vm_operations_struct hugetlb_vm_ops;
 struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acct,
 				struct user_struct **user, int creat_flags,
 				int page_size_log);
 
-static inline bool is_file_hugepages(struct file *file)
+static inline bool is_file_hugetlbfs(struct file *file)
 {
-	if (file->f_op == &hugetlbfs_file_operations)
+	if (unlikely(file->f_mode & FMODE_HUGETLBFS))
 		return true;
-
-	return is_file_shm_hugepages(file);
+	return false;
 }
 
 static inline struct hstate *hstate_inode(struct inode *i)
@@ -449,7 +447,7 @@  static inline struct hstate *hstate_inode(struct inode *i)
 }
 #else /* !CONFIG_HUGETLBFS */
 
-#define is_file_hugepages(file)			false
+#define is_file_hugetlbfs(file)			false
 static inline struct file *
 hugetlb_file_setup(const char *name, size_t size, vm_flags_t acctflag,
 		struct user_struct **user, int creat_flags,
diff --git a/include/linux/shm.h b/include/linux/shm.h
index d8e69aed3d32..1ab62d7b334f 100644
--- a/include/linux/shm.h
+++ b/include/linux/shm.h
@@ -16,7 +16,6 @@  struct sysv_shm {
 
 long do_shmat(int shmid, char __user *shmaddr, int shmflg, unsigned long *addr,
 	      unsigned long shmlba);
-bool is_file_shm_hugepages(struct file *file);
 void exit_shm(struct task_struct *task);
 #define shm_init_task(task) INIT_LIST_HEAD(&(task)->sysvshm.shm_clist)
 #else
@@ -30,10 +29,6 @@  static inline long do_shmat(int shmid, char __user *shmaddr,
 {
 	return -ENOSYS;
 }
-static inline bool is_file_shm_hugepages(struct file *file)
-{
-	return false;
-}
 static inline void exit_shm(struct task_struct *task)
 {
 }
diff --git a/ipc/shm.c b/ipc/shm.c
index 0ba6add05b35..8f119b1d6170 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -285,7 +285,7 @@  static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp)
 	ns->shm_tot -= (shp->shm_segsz + PAGE_SIZE - 1) >> PAGE_SHIFT;
 	shm_rmid(ns, shp);
 	shm_unlock(shp);
-	if (!is_file_hugepages(shm_file))
+	if (!is_file_hugetlbfs(shm_file))
 		shmem_lock(shm_file, 0, shp->mlock_user);
 	else if (shp->mlock_user)
 		user_shm_unlock(i_size_read(file_inode(shm_file)),
@@ -560,24 +560,6 @@  static const struct file_operations shm_file_operations = {
 	.fallocate	= shm_fallocate,
 };
 
-/*
- * shm_file_operations_huge is now identical to shm_file_operations,
- * but we keep it distinct for the sake of is_file_shm_hugepages().
- */
-static const struct file_operations shm_file_operations_huge = {
-	.mmap		= shm_mmap,
-	.fsync		= shm_fsync,
-	.release	= shm_release,
-	.get_unmapped_area	= shm_get_unmapped_area,
-	.llseek		= noop_llseek,
-	.fallocate	= shm_fallocate,
-};
-
-bool is_file_shm_hugepages(struct file *file)
-{
-	return file->f_op == &shm_file_operations_huge;
-}
-
 static const struct vm_operations_struct shm_vm_ops = {
 	.open	= shm_open,	/* callback for a new vm-area open */
 	.close	= shm_close,	/* callback for when the vm-area is released */
@@ -698,7 +680,7 @@  static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
 no_id:
 	ipc_update_pid(&shp->shm_cprid, NULL);
 	ipc_update_pid(&shp->shm_lprid, NULL);
-	if (is_file_hugepages(file) && shp->mlock_user)
+	if (is_file_hugetlbfs(file) && shp->mlock_user)
 		user_shm_unlock(size, shp->mlock_user);
 	fput(file);
 	ipc_rcu_putref(&shp->shm_perm, shm_rcu_free);
@@ -836,7 +818,7 @@  static void shm_add_rss_swap(struct shmid_kernel *shp,
 
 	inode = file_inode(shp->shm_file);
 
-	if (is_file_hugepages(shp->shm_file)) {
+	if (is_file_hugetlbfs(shp->shm_file)) {
 		struct address_space *mapping = inode->i_mapping;
 		struct hstate *h = hstate_file(shp->shm_file);
 		*rss_add += pages_per_huge_page(h) * mapping->nrpages;
@@ -1102,7 +1084,7 @@  static int shmctl_do_lock(struct ipc_namespace *ns, int shmid, int cmd)
 	}
 
 	shm_file = shp->shm_file;
-	if (is_file_hugepages(shm_file))
+	if (is_file_hugetlbfs(shm_file))
 		goto out_unlock0;
 
 	if (cmd == SHM_LOCK) {
@@ -1523,10 +1505,7 @@  long do_shmat(int shmid, char __user *shmaddr, int shmflg,
 		goto out_nattch;
 	}
 
-	file = alloc_file_clone(base, f_flags,
-			  is_file_hugepages(base) ?
-				&shm_file_operations_huge :
-				&shm_file_operations);
+	file = alloc_file_clone(base, f_flags, &shm_file_operations);
 	err = PTR_ERR(file);
 	if (IS_ERR(file)) {
 		kfree(sfd);
@@ -1534,6 +1513,9 @@  long do_shmat(int shmid, char __user *shmaddr, int shmflg,
 		goto out_nattch;
 	}
 
+	/* copy hugetlbfs mode for is_file_hugetlbfs() */
+	file->f_mode |= (base->f_mode & FMODE_HUGETLBFS);
+
 	sfd->id = shp->shm_perm.id;
 	sfd->ns = get_ipc_ns(ns);
 	sfd->file = base;
diff --git a/mm/memfd.c b/mm/memfd.c
index 2647c898990c..e6c16b6bf3f6 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -123,7 +123,7 @@  static unsigned int *memfd_file_seals_ptr(struct file *file)
 		return &SHMEM_I(file_inode(file))->seals;
 
 #ifdef CONFIG_HUGETLBFS
-	if (is_file_hugepages(file))
+	if (is_file_hugetlbfs(file))
 		return &HUGETLBFS_I(file_inode(file))->seals;
 #endif
 
diff --git a/mm/mmap.c b/mm/mmap.c
index f609e9ec4a25..703a9680a937 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1538,7 +1538,7 @@  unsigned long do_mmap(struct file *file, unsigned long addr,
 			vm_flags |= VM_NORESERVE;
 
 		/* hugetlb applies strict overcommit unless MAP_NORESERVE */
-		if (file && is_file_hugepages(file))
+		if (file && is_file_hugetlbfs(file))
 			vm_flags |= VM_NORESERVE;
 	}
 
@@ -1562,10 +1562,10 @@  unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
 		file = fget(fd);
 		if (!file)
 			return -EBADF;
-		if (is_file_hugepages(file))
+		if (is_file_hugetlbfs(file))
 			len = ALIGN(len, huge_page_size(hstate_file(file)));
 		retval = -EINVAL;
-		if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file)))
+		if (unlikely(flags & MAP_HUGETLB && !is_file_hugetlbfs(file)))
 			goto out_fput;
 	} else if (flags & MAP_HUGETLB) {
 		struct user_struct *user = NULL;
@@ -1678,7 +1678,7 @@  static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags)
 	 * hugetlb has its own accounting separate from the core VM
 	 * VM_HUGETLB may not be set yet so we cannot check for that flag.
 	 */
-	if (file && is_file_hugepages(file))
+	if (file && is_file_hugetlbfs(file))
 		return 0;
 
 	return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;