Message ID | 20200612004644.255692-1-mike.kravetz@oracle.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v4,1/2] hugetlb: use f_mode & FMODE_HUGETLBFS to identify hugetlbfs files | expand |
On Thu, Jun 11, 2020 at 05:46:43PM -0700, Mike Kravetz wrote: > The routine is_file_hugepages() checks f_op == hugetlbfs_file_operations > to determine if the file resides in hugetlbfs. This is problematic when > the file is on a union or overlay. Instead, define a new file mode > FMODE_HUGETLBFS which is set when a hugetlbfs file is opened. The mode > can easily be copied to other 'files' derived from the original hugetlbfs > file. > > With this change hugetlbfs_file_operations can be static as it should be. > > There is also a (duplicate) set of shm file operations used for the routine > is_file_shm_hugepages(). Instead of setting/using special f_op's, just > propagate the FMODE_HUGETLBFS mode. This means is_file_shm_hugepages() and > the duplicate f_ops can be removed. > > While cleaning things up, change the name of is_file_hugepages() to > is_file_hugetlbfs(). The term hugepages is a bit ambiguous. I was going to have objections to this before I read it more carefully and realised that the "shm" here is sysvipc and doesn't have anything to do with the huge page support in shmfs. > A subsequent patch will propagate FMODE_HUGETLBFS in overlayfs. > > Suggested-by: Al Viro <viro@zeniv.linux.org.uk> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> I might have suggested splitting the rename of is_file_hugetlbfs() from the rest of this patch, but I wouldn't resend to change that.
On Thu, Jun 11, 2020 at 05:46:43PM -0700, Mike Kravetz wrote: > The routine is_file_hugepages() checks f_op == hugetlbfs_file_operations > to determine if the file resides in hugetlbfs. This is problematic when > the file is on a union or overlay. Instead, define a new file mode > FMODE_HUGETLBFS which is set when a hugetlbfs file is opened. The mode > can easily be copied to other 'files' derived from the original hugetlbfs > file. > > With this change hugetlbfs_file_operations can be static as it should be. > > There is also a (duplicate) set of shm file operations used for the routine > is_file_shm_hugepages(). Instead of setting/using special f_op's, just > propagate the FMODE_HUGETLBFS mode. This means is_file_shm_hugepages() and > the duplicate f_ops can be removed. s/HUGETLBFS/HUGEPAGES/, please. > While cleaning things up, change the name of is_file_hugepages() to > is_file_hugetlbfs(). The term hugepages is a bit ambiguous. Don't, especially since the very next patch adds such on overlayfs... Incidentally, can a hugetlbfs be a lower layer, while the upper one is a normal filesystem? What should happen on copyup?
On 6/11/20 6:58 PM, Al Viro wrote: > On Thu, Jun 11, 2020 at 05:46:43PM -0700, Mike Kravetz wrote: >> The routine is_file_hugepages() checks f_op == hugetlbfs_file_operations >> to determine if the file resides in hugetlbfs. This is problematic when >> the file is on a union or overlay. Instead, define a new file mode >> FMODE_HUGETLBFS which is set when a hugetlbfs file is opened. The mode >> can easily be copied to other 'files' derived from the original hugetlbfs >> file. >> >> With this change hugetlbfs_file_operations can be static as it should be. >> >> There is also a (duplicate) set of shm file operations used for the routine >> is_file_shm_hugepages(). Instead of setting/using special f_op's, just >> propagate the FMODE_HUGETLBFS mode. This means is_file_shm_hugepages() and >> the duplicate f_ops can be removed. > > s/HUGETLBFS/HUGEPAGES/, please. > >> While cleaning things up, change the name of is_file_hugepages() to >> is_file_hugetlbfs(). The term hugepages is a bit ambiguous. > > Don't, especially since the very next patch adds such on overlayfs... Ok. This is just something I thought might clarify things. I seem to recall questions about 'huge page' routines such as "is that for THP or hugetlb huge pages"? That was my motivation for the change. Since this is only about hugetlbfs, make it explicit. > Incidentally, can a hugetlbfs be a lower layer, while the upper one > is a normal filesystem? What should happen on copyup? Yes, that seems to work as expected. When accessed for write the hugetlb file is copied to the normal filesystem. The BUG found by syzbot actually has a single hugetlbfs as both lower and upper. With the BUG 'fixed', I am not exactly sure what the expected behavior is in this case. I may be wrong, but I would expect any operations that can be performed on a stand alone hugetlbfs to also be performed on the overlay. However, mmap() still fails. I will look into it. I also looked at normal filesystem lower and hugetlbfs upper. Yes, overlayfs allows this. This is somewhat 'interesting' as write() is not supported in hugetlbfs. Writing to files in the overlay actually ended up writing to files in the lower filesystem. That seems wrong, but overlayfs is new to me. Earlier in the discussion of these issues, Colin Walters asked "Is there any actual valid use case for mounting an overlayfs on top of hugetlbfs?" I can not think of one. Perhaps we should consider limiting the ways in which hugetlbfs can be used in overlayfs? Preventing it from being an upper filesystem might be a good start? Or, do people think making hugetlbfs and overlayfs play nice together is useful?
> > Incidentally, can a hugetlbfs be a lower layer, while the upper one > > is a normal filesystem? What should happen on copyup? > > Yes, that seems to work as expected. When accessed for write the hugetlb > file is copied to the normal filesystem. > > The BUG found by syzbot actually has a single hugetlbfs as both lower and > upper. With the BUG 'fixed', I am not exactly sure what the expected > behavior is in this case. I may be wrong, but I would expect any operations > that can be performed on a stand alone hugetlbfs to also be performed on > the overlay. However, mmap() still fails. I will look into it. > > I also looked at normal filesystem lower and hugetlbfs upper. Yes, overlayfs > allows this. This is somewhat 'interesting' as write() is not supported in > hugetlbfs. Writing to files in the overlay actually ended up writing to > files in the lower filesystem. That seems wrong, but overlayfs is new to me. > I am not sure how that happened, but I think that ovl_open_realfile() needs to fixup f_mode flags FMODE_CAN_WRITE | FMODE_CAN_READ after open_with_fake_path(). > Earlier in the discussion of these issues, Colin Walters asked "Is there any > actual valid use case for mounting an overlayfs on top of hugetlbfs?" I can > not think of one. Perhaps we should consider limiting the ways in which > hugetlbfs can be used in overlayfs? Preventing it from being an upper > filesystem might be a good start? Or, do people think making hugetlbfs and > overlayfs play nice together is useful? If people think that making hugetlbfs and overlayfs play nice together maybe they should work on this problem. It doesn't look like either hugetlbfs developers nor overlayfs developers care much about the combination. Your concern, I assume, is fixing the syzbot issue. I agree with Colin's remark about adding limitations, but it would be a shame if overlay had to special case hugetlbfs. It would have been better if we could find a property of hugetlbfs that makes it inapplicable for overlayfs upper/lower or stacking fs in general. The simplest thing for you to do in order to shush syzbot is what procfs does: /* * procfs isn't actually a stacking filesystem; however, there is * too much magic going on inside it to permit stacking things on * top of it */ s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH; Currently, the only in-tree stacking fs are overlayfs and ecryptfs, but there are some out of tree implementations as well (shiftfs). So you may only take that option if you do not care about the combination of hugetlbfs with any of the above. overlayfs support of mmap is not as good as one might hope. overlayfs.rst says: "If a file residing on a lower layer is opened for read-only and then memory mapped with MAP_SHARED, then subsequent changes to the file are not reflected in the memory mapping." So if I were you, I wouldn't go trying to fix overlayfs-huguetlb interop... Thanks, Amir.
On Sat, Jun 13, 2020 at 09:53:24AM +0300, Amir Goldstein wrote: > Currently, the only in-tree stacking fs are overlayfs and ecryptfs, but there > are some out of tree implementations as well (shiftfs). > So you may only take that option if you do not care about the combination > of hugetlbfs with any of the above. I could see shiftfs being interesting, maybe. I don't really see the usecase for layering overlayfs or ecryptfs on top of a ram-based filesystem.
On 6/12/20 11:53 PM, Amir Goldstein wrote: >>> Incidentally, can a hugetlbfs be a lower layer, while the upper one >>> is a normal filesystem? What should happen on copyup? >> >> Yes, that seems to work as expected. When accessed for write the hugetlb >> file is copied to the normal filesystem. >> >> The BUG found by syzbot actually has a single hugetlbfs as both lower and >> upper. With the BUG 'fixed', I am not exactly sure what the expected >> behavior is in this case. I may be wrong, but I would expect any operations >> that can be performed on a stand alone hugetlbfs to also be performed on >> the overlay. However, mmap() still fails. I will look into it. >> >> I also looked at normal filesystem lower and hugetlbfs upper. Yes, overlayfs >> allows this. This is somewhat 'interesting' as write() is not supported in >> hugetlbfs. Writing to files in the overlay actually ended up writing to >> files in the lower filesystem. That seems wrong, but overlayfs is new to me. >> > > I am not sure how that happened, but I think that ovl_open_realfile() > needs to fixup f_mode flags FMODE_CAN_WRITE | FMODE_CAN_READ > after open_with_fake_path(). > >> Earlier in the discussion of these issues, Colin Walters asked "Is there any >> actual valid use case for mounting an overlayfs on top of hugetlbfs?" I can >> not think of one. Perhaps we should consider limiting the ways in which >> hugetlbfs can be used in overlayfs? Preventing it from being an upper >> filesystem might be a good start? Or, do people think making hugetlbfs and >> overlayfs play nice together is useful? > > If people think that making hugetlbfs and overlayfs play nice together maybe > they should work on this problem. It doesn't look like either > hugetlbfs developers > nor overlayfs developers care much about the combination. Thanks Amir, As a hugetlbfs developer, I do not know of a use case for interoperability with overlayfs. So yes, I am not too interested in making them work well together. However, if there was an actual use case I would be more than happy to consider doing the work. Just hate to put effort into fixing up two 'special' filesystems for functionality that may not be used. I can't speak for overlayfs developers. > Your concern, I assume, is fixing the syzbot issue. That is the primary concern. We should not BUG! After fixing that up, Al asked how these things worked together. I honestly did not look at interoperability before that. I am not sure if anyone has done that in the past. > I agree with Colin's remark about adding limitations, but it would be a shame > if overlay had to special case hugetlbfs. It would have been better if we could > find a property of hugetlbfs that makes it inapplicable for overlayfs > upper/lower > or stacking fs in general. > > The simplest thing for you to do in order to shush syzbot is what procfs does: > /* > * procfs isn't actually a stacking filesystem; however, there is > * too much magic going on inside it to permit stacking things on > * top of it > */ > s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH; > > Currently, the only in-tree stacking fs are overlayfs and ecryptfs, but there > are some out of tree implementations as well (shiftfs). > So you may only take that option if you do not care about the combination > of hugetlbfs with any of the above. > > overlayfs support of mmap is not as good as one might hope. > overlayfs.rst says: > "If a file residing on a lower layer is opened for read-only and then > memory mapped with MAP_SHARED, then subsequent changes to > the file are not reflected in the memory mapping." > > So if I were you, I wouldn't go trying to fix overlayfs-huguetlb interop... Thanks again, I'll look at something as simple as s_stack_depth.
On Sat, Jun 13, 2020 at 9:12 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > On 6/12/20 11:53 PM, Amir Goldstein wrote: > As a hugetlbfs developer, I do not know of a use case for interoperability > with overlayfs. So yes, I am not too interested in making them work well > together. However, if there was an actual use case I would be more than > happy to consider doing the work. Just hate to put effort into fixing up > two 'special' filesystems for functionality that may not be used. > > I can't speak for overlayfs developers. As I said, I only know of tmpfs being upper layer as a valid use case. Does that work with hugepages? How would I go about testing that? > > I agree with Colin's remark about adding limitations, but it would be a shame > > if overlay had to special case hugetlbfs. It would have been better if we could > > find a property of hugetlbfs that makes it inapplicable for overlayfs > > upper/lower > > or stacking fs in general. > > > > The simplest thing for you to do in order to shush syzbot is what procfs does: > > /* > > * procfs isn't actually a stacking filesystem; however, there is > > * too much magic going on inside it to permit stacking things on > > * top of it > > */ > > s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH; > > > > Currently, the only in-tree stacking fs are overlayfs and ecryptfs, but there > > are some out of tree implementations as well (shiftfs). > > So you may only take that option if you do not care about the combination > > of hugetlbfs with any of the above. > > > > overlayfs support of mmap is not as good as one might hope. > > overlayfs.rst says: > > "If a file residing on a lower layer is opened for read-only and then > > memory mapped with MAP_SHARED, then subsequent changes to > > the file are not reflected in the memory mapping." > > > > So if I were you, I wouldn't go trying to fix overlayfs-huguetlb interop... > > Thanks again, > > I'll look at something as simple as s_stack_depth. Agree. Thanks, Miklos
On Sat, Jun 13, 2020 at 8:53 AM Amir Goldstein <amir73il@gmail.com> wrote: > > I also looked at normal filesystem lower and hugetlbfs upper. Yes, overlayfs > > allows this. This is somewhat 'interesting' as write() is not supported in > > hugetlbfs. Writing to files in the overlay actually ended up writing to > > files in the lower filesystem. That seems wrong, but overlayfs is new to me. Yes, this very definitely should not happen. > I am not sure how that happened, but I think that ovl_open_realfile() > needs to fixup f_mode flags FMODE_CAN_WRITE | FMODE_CAN_READ > after open_with_fake_path(). Okay, but how did the write actually get to the lower layer? I failed to reproduce this. Mike, how did you trigger this? Thanks, Miklos
On Mon, Jun 15, 2020 at 10:53 AM Miklos Szeredi <miklos@szeredi.hu> wrote: > > On Sat, Jun 13, 2020 at 9:12 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > On 6/12/20 11:53 PM, Amir Goldstein wrote: > > > As a hugetlbfs developer, I do not know of a use case for interoperability > > with overlayfs. So yes, I am not too interested in making them work well > > together. However, if there was an actual use case I would be more than > > happy to consider doing the work. Just hate to put effort into fixing up > > two 'special' filesystems for functionality that may not be used. > > > > I can't speak for overlayfs developers. > > As I said, I only know of tmpfs being upper layer as a valid use case. > Does that work with hugepages? How would I go about testing that? Simple, after enabling CONFIG_HUGETLBFS: diff --git a/mount_union.py b/mount_union.py index fae8899..4070c70 100644 --- a/mount_union.py +++ b/mount_union.py @@ -15,7 +15,7 @@ def mount_union(ctx): snapshot_mntroot = cfg.snapshot_mntroot() if cfg.should_mount_upper(): system("mount " + upper_mntroot + " 2>/dev/null" - " || mount -t tmpfs upper_layer " + upper_mntroot) + " || mount -t hugetlbfs upper_layer " + upper_mntroot) layer_mntroot = upper_mntroot + "/" + ctx.curr_layer() upperdir = layer_mntroot + "/u" workdir = layer_mntroot + "/w" It fails colossally, because hugetlbfs, does not have write_iter(). It is only meant as an interface to create named maps of huge pages. So I don't really see the use case for using it as upper. Thanks, Amir.
On Mon, Jun 15, 2020 at 12:05 PM Amir Goldstein <amir73il@gmail.com> wrote: > > On Mon, Jun 15, 2020 at 10:53 AM Miklos Szeredi <miklos@szeredi.hu> wrote: > > > > On Sat, Jun 13, 2020 at 9:12 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > > > On 6/12/20 11:53 PM, Amir Goldstein wrote: > > > > > As a hugetlbfs developer, I do not know of a use case for interoperability > > > with overlayfs. So yes, I am not too interested in making them work well > > > together. However, if there was an actual use case I would be more than > > > happy to consider doing the work. Just hate to put effort into fixing up > > > two 'special' filesystems for functionality that may not be used. > > > > > > I can't speak for overlayfs developers. > > > > As I said, I only know of tmpfs being upper layer as a valid use case. > > Does that work with hugepages? How would I go about testing that? > > Simple, after enabling CONFIG_HUGETLBFS: > > diff --git a/mount_union.py b/mount_union.py > index fae8899..4070c70 100644 > --- a/mount_union.py > +++ b/mount_union.py > @@ -15,7 +15,7 @@ def mount_union(ctx): > snapshot_mntroot = cfg.snapshot_mntroot() > if cfg.should_mount_upper(): > system("mount " + upper_mntroot + " 2>/dev/null" > - " || mount -t tmpfs upper_layer " + upper_mntroot) > + " || mount -t hugetlbfs upper_layer " + upper_mntroot) > layer_mntroot = upper_mntroot + "/" + ctx.curr_layer() > upperdir = layer_mntroot + "/u" > workdir = layer_mntroot + "/w" > > It fails colossally, because hugetlbfs, does not have write_iter(). > It is only meant as an interface to create named maps of huge pages. > So I don't really see the use case for using it as upper. Right. I was actually asking about the tmpfs+hugepages, not the hugetlbfs case. In the tmpfs case it looks like the lack of ->get_unmapped_area() in overlayfs could still be an issue. But I'm not sure how to trigger that. Thanks, Miklos
On 6/15/20 1:24 AM, Miklos Szeredi wrote: > On Sat, Jun 13, 2020 at 8:53 AM Amir Goldstein <amir73il@gmail.com> wrote: > >>> I also looked at normal filesystem lower and hugetlbfs upper. Yes, overlayfs >>> allows this. This is somewhat 'interesting' as write() is not supported in >>> hugetlbfs. Writing to files in the overlay actually ended up writing to >>> files in the lower filesystem. That seems wrong, but overlayfs is new to me. > > Yes, this very definitely should not happen. > >> I am not sure how that happened, but I think that ovl_open_realfile() >> needs to fixup f_mode flags FMODE_CAN_WRITE | FMODE_CAN_READ >> after open_with_fake_path(). > > Okay, but how did the write actually get to the lower layer? > > I failed to reproduce this. Mike, how did you trigger this? My apologies!!! I reviewed my testing and found that it was incorrectly writing to the lower filesystem. Writing to any file in the union will fail.
On 6/15/20 12:53 AM, Miklos Szeredi wrote: > On Sat, Jun 13, 2020 at 9:12 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: >> On 6/12/20 11:53 PM, Amir Goldstein wrote: >>> >>> The simplest thing for you to do in order to shush syzbot is what procfs does: >>> /* >>> * procfs isn't actually a stacking filesystem; however, there is >>> * too much magic going on inside it to permit stacking things on >>> * top of it >>> */ >>> s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH; >>> >>> Currently, the only in-tree stacking fs are overlayfs and ecryptfs, but there >>> are some out of tree implementations as well (shiftfs). >>> So you may only take that option if you do not care about the combination >>> of hugetlbfs with any of the above. >>> >>> overlayfs support of mmap is not as good as one might hope. >>> overlayfs.rst says: >>> "If a file residing on a lower layer is opened for read-only and then >>> memory mapped with MAP_SHARED, then subsequent changes to >>> the file are not reflected in the memory mapping." >>> >>> So if I were you, I wouldn't go trying to fix overlayfs-huguetlb interop... >> >> Thanks again, >> >> I'll look at something as simple as s_stack_depth. > > Agree. Apologies again for in the incorrect information about writing to lower filesystem. Stacking ecryptfs on hugetlbfs does not work either. Here is what happens when trying to create a new file. [ 1188.863425] ecryptfs_write_metadata_to_contents: Error attempting to write header information to lower file; rc = [-22] [ 1188.865469] ecryptfs_write_metadata: Error writing metadata out to lower file; rc = [-22] [ 1188.867022] Error writing headers; rc = [-22] I like Amir's idea of just setting s_stack_depth in hugetlbfs to prevent stacking. From 0fbed66b37c18919ea7edd47b113c97644f49362 Mon Sep 17 00:00:00 2001 From: Mike Kravetz <mike.kravetz@oracle.com> Date: Mon, 15 Jun 2020 14:37:52 -0700 Subject: [PATCH] hugetlbfs: prevent filesystem stacking of hugetlbfs syzbot found issues with having hugetlbfs on a union/overlay as reported in [1]. Due to the limitations (no write) and special functionality of hugetlbfs, it does not work well in filesystem stacking. There are no know use cases for hugetlbfs stacking. Rather than making modifications to get hugetlbfs working in such environments, simply prevent stacking. [1] https://lore.kernel.org/linux-mm/000000000000b4684e05a2968ca6@google.com/ Reported-by: syzbot+d6ec23007e951dadf3de@syzkaller.appspotmail.com Suggested-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> --- fs/hugetlbfs/inode.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 991c60c7ffe0..f32759c8e84d 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -1313,6 +1313,12 @@ hugetlbfs_fill_super(struct super_block *sb, struct fs_context *fc) sb->s_magic = HUGETLBFS_MAGIC; sb->s_op = &hugetlbfs_ops; sb->s_time_gran = 1; + + /* + * Due to the special and limited functionality of hugetlbfs, it does + * not work well as a stacking filesystem. + */ + sb->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH; sb->s_root = d_make_root(hugetlbfs_get_root(sb, ctx)); if (!sb->s_root) goto out_free;
On Tue, Jun 16, 2020 at 1:45 AM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > On 6/15/20 12:53 AM, Miklos Szeredi wrote: > > On Sat, Jun 13, 2020 at 9:12 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > >> On 6/12/20 11:53 PM, Amir Goldstein wrote: > >>> > >>> The simplest thing for you to do in order to shush syzbot is what procfs does: > >>> /* > >>> * procfs isn't actually a stacking filesystem; however, there is > >>> * too much magic going on inside it to permit stacking things on > >>> * top of it > >>> */ > >>> s->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH; > >>> > >>> Currently, the only in-tree stacking fs are overlayfs and ecryptfs, but there > >>> are some out of tree implementations as well (shiftfs). > >>> So you may only take that option if you do not care about the combination > >>> of hugetlbfs with any of the above. > >>> > >>> overlayfs support of mmap is not as good as one might hope. > >>> overlayfs.rst says: > >>> "If a file residing on a lower layer is opened for read-only and then > >>> memory mapped with MAP_SHARED, then subsequent changes to > >>> the file are not reflected in the memory mapping." > >>> > >>> So if I were you, I wouldn't go trying to fix overlayfs-huguetlb interop... > >> > >> Thanks again, > >> > >> I'll look at something as simple as s_stack_depth. > > > > Agree. > > Apologies again for in the incorrect information about writing to lower > filesystem. > > Stacking ecryptfs on hugetlbfs does not work either. Here is what happens > when trying to create a new file. > > [ 1188.863425] ecryptfs_write_metadata_to_contents: Error attempting to write header information to lower file; rc = [-22] > [ 1188.865469] ecryptfs_write_metadata: Error writing metadata out to lower file; rc = [-22] > [ 1188.867022] Error writing headers; rc = [-22] > > I like Amir's idea of just setting s_stack_depth in hugetlbfs to prevent > stacking. > > From 0fbed66b37c18919ea7edd47b113c97644f49362 Mon Sep 17 00:00:00 2001 > From: Mike Kravetz <mike.kravetz@oracle.com> > Date: Mon, 15 Jun 2020 14:37:52 -0700 > Subject: [PATCH] hugetlbfs: prevent filesystem stacking of hugetlbfs > > syzbot found issues with having hugetlbfs on a union/overlay as reported > in [1]. Due to the limitations (no write) and special functionality of > hugetlbfs, it does not work well in filesystem stacking. There are no > know use cases for hugetlbfs stacking. Rather than making modifications > to get hugetlbfs working in such environments, simply prevent stacking. > > [1] https://lore.kernel.org/linux-mm/000000000000b4684e05a2968ca6@google.com/ > > Reported-by: syzbot+d6ec23007e951dadf3de@syzkaller.appspotmail.com > Suggested-by: Amir Goldstein <amir73il@gmail.com> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Miklos Szeredi <mszeredi@redhat.com> Thanks, Miklos
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 991c60c7ffe0..5c0c50a88c84 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -324,6 +324,12 @@ static ssize_t hugetlbfs_read_iter(struct kiocb *iocb, struct iov_iter *to) return retval; } +static int hugetlbfs_open(struct inode *inode, struct file *file) +{ + file->f_mode |= FMODE_HUGETLBFS; + return 0; +} + static int hugetlbfs_write_begin(struct file *file, struct address_space *mapping, loff_t pos, unsigned len, unsigned flags, @@ -1112,6 +1118,7 @@ static void init_once(void *foo) const struct file_operations hugetlbfs_file_operations = { .read_iter = hugetlbfs_read_iter, + .open = hugetlbfs_open, .mmap = hugetlbfs_file_mmap, .fsync = noop_fsync, .get_unmapped_area = hugetlb_get_unmapped_area, diff --git a/fs/io_uring.c b/fs/io_uring.c index bb25e3997d41..96e8a4bb610a 100644 --- a/fs/io_uring.c +++ b/fs/io_uring.c @@ -7123,7 +7123,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg, struct vm_area_struct *vma = vmas[j]; if (vma->vm_file && - !is_file_hugepages(vma->vm_file)) { + !is_file_hugetlbfs(vma->vm_file)) { ret = -EOPNOTSUPP; break; } diff --git a/include/linux/fs.h b/include/linux/fs.h index 45cc10cdf6dd..99af9513f9ab 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -175,6 +175,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset, /* File does not contribute to nr_files count */ #define FMODE_NOACCOUNT ((__force fmode_t)0x20000000) +/* File is in hugetlbfs filesystem */ +#define FMODE_HUGETLBFS ((__force fmode_t)0x40000000) + /* * Flag for rw_copy_check_uvector and compat_rw_copy_check_uvector * that indicates that they should check the contents of the iovec are diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 43a1cef8f0f1..aa3408775464 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -429,18 +429,16 @@ static inline struct hugetlbfs_inode_info *HUGETLBFS_I(struct inode *inode) return container_of(inode, struct hugetlbfs_inode_info, vfs_inode); } -extern const struct file_operations hugetlbfs_file_operations; extern const struct vm_operations_struct hugetlb_vm_ops; struct file *hugetlb_file_setup(const char *name, size_t size, vm_flags_t acct, struct user_struct **user, int creat_flags, int page_size_log); -static inline bool is_file_hugepages(struct file *file) +static inline bool is_file_hugetlbfs(struct file *file) { - if (file->f_op == &hugetlbfs_file_operations) + if (unlikely(file->f_mode & FMODE_HUGETLBFS)) return true; - - return is_file_shm_hugepages(file); + return false; } static inline struct hstate *hstate_inode(struct inode *i) @@ -449,7 +447,7 @@ static inline struct hstate *hstate_inode(struct inode *i) } #else /* !CONFIG_HUGETLBFS */ -#define is_file_hugepages(file) false +#define is_file_hugetlbfs(file) false static inline struct file * hugetlb_file_setup(const char *name, size_t size, vm_flags_t acctflag, struct user_struct **user, int creat_flags, diff --git a/include/linux/shm.h b/include/linux/shm.h index d8e69aed3d32..1ab62d7b334f 100644 --- a/include/linux/shm.h +++ b/include/linux/shm.h @@ -16,7 +16,6 @@ struct sysv_shm { long do_shmat(int shmid, char __user *shmaddr, int shmflg, unsigned long *addr, unsigned long shmlba); -bool is_file_shm_hugepages(struct file *file); void exit_shm(struct task_struct *task); #define shm_init_task(task) INIT_LIST_HEAD(&(task)->sysvshm.shm_clist) #else @@ -30,10 +29,6 @@ static inline long do_shmat(int shmid, char __user *shmaddr, { return -ENOSYS; } -static inline bool is_file_shm_hugepages(struct file *file) -{ - return false; -} static inline void exit_shm(struct task_struct *task) { } diff --git a/ipc/shm.c b/ipc/shm.c index 0ba6add05b35..8f119b1d6170 100644 --- a/ipc/shm.c +++ b/ipc/shm.c @@ -285,7 +285,7 @@ static void shm_destroy(struct ipc_namespace *ns, struct shmid_kernel *shp) ns->shm_tot -= (shp->shm_segsz + PAGE_SIZE - 1) >> PAGE_SHIFT; shm_rmid(ns, shp); shm_unlock(shp); - if (!is_file_hugepages(shm_file)) + if (!is_file_hugetlbfs(shm_file)) shmem_lock(shm_file, 0, shp->mlock_user); else if (shp->mlock_user) user_shm_unlock(i_size_read(file_inode(shm_file)), @@ -560,24 +560,6 @@ static const struct file_operations shm_file_operations = { .fallocate = shm_fallocate, }; -/* - * shm_file_operations_huge is now identical to shm_file_operations, - * but we keep it distinct for the sake of is_file_shm_hugepages(). - */ -static const struct file_operations shm_file_operations_huge = { - .mmap = shm_mmap, - .fsync = shm_fsync, - .release = shm_release, - .get_unmapped_area = shm_get_unmapped_area, - .llseek = noop_llseek, - .fallocate = shm_fallocate, -}; - -bool is_file_shm_hugepages(struct file *file) -{ - return file->f_op == &shm_file_operations_huge; -} - static const struct vm_operations_struct shm_vm_ops = { .open = shm_open, /* callback for a new vm-area open */ .close = shm_close, /* callback for when the vm-area is released */ @@ -698,7 +680,7 @@ static int newseg(struct ipc_namespace *ns, struct ipc_params *params) no_id: ipc_update_pid(&shp->shm_cprid, NULL); ipc_update_pid(&shp->shm_lprid, NULL); - if (is_file_hugepages(file) && shp->mlock_user) + if (is_file_hugetlbfs(file) && shp->mlock_user) user_shm_unlock(size, shp->mlock_user); fput(file); ipc_rcu_putref(&shp->shm_perm, shm_rcu_free); @@ -836,7 +818,7 @@ static void shm_add_rss_swap(struct shmid_kernel *shp, inode = file_inode(shp->shm_file); - if (is_file_hugepages(shp->shm_file)) { + if (is_file_hugetlbfs(shp->shm_file)) { struct address_space *mapping = inode->i_mapping; struct hstate *h = hstate_file(shp->shm_file); *rss_add += pages_per_huge_page(h) * mapping->nrpages; @@ -1102,7 +1084,7 @@ static int shmctl_do_lock(struct ipc_namespace *ns, int shmid, int cmd) } shm_file = shp->shm_file; - if (is_file_hugepages(shm_file)) + if (is_file_hugetlbfs(shm_file)) goto out_unlock0; if (cmd == SHM_LOCK) { @@ -1523,10 +1505,7 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg, goto out_nattch; } - file = alloc_file_clone(base, f_flags, - is_file_hugepages(base) ? - &shm_file_operations_huge : - &shm_file_operations); + file = alloc_file_clone(base, f_flags, &shm_file_operations); err = PTR_ERR(file); if (IS_ERR(file)) { kfree(sfd); @@ -1534,6 +1513,9 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg, goto out_nattch; } + /* copy hugetlbfs mode for is_file_hugetlbfs() */ + file->f_mode |= (base->f_mode & FMODE_HUGETLBFS); + sfd->id = shp->shm_perm.id; sfd->ns = get_ipc_ns(ns); sfd->file = base; diff --git a/mm/memfd.c b/mm/memfd.c index 2647c898990c..e6c16b6bf3f6 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -123,7 +123,7 @@ static unsigned int *memfd_file_seals_ptr(struct file *file) return &SHMEM_I(file_inode(file))->seals; #ifdef CONFIG_HUGETLBFS - if (is_file_hugepages(file)) + if (is_file_hugetlbfs(file)) return &HUGETLBFS_I(file_inode(file))->seals; #endif diff --git a/mm/mmap.c b/mm/mmap.c index f609e9ec4a25..703a9680a937 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1538,7 +1538,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr, vm_flags |= VM_NORESERVE; /* hugetlb applies strict overcommit unless MAP_NORESERVE */ - if (file && is_file_hugepages(file)) + if (file && is_file_hugetlbfs(file)) vm_flags |= VM_NORESERVE; } @@ -1562,10 +1562,10 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len, file = fget(fd); if (!file) return -EBADF; - if (is_file_hugepages(file)) + if (is_file_hugetlbfs(file)) len = ALIGN(len, huge_page_size(hstate_file(file))); retval = -EINVAL; - if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file))) + if (unlikely(flags & MAP_HUGETLB && !is_file_hugetlbfs(file))) goto out_fput; } else if (flags & MAP_HUGETLB) { struct user_struct *user = NULL; @@ -1678,7 +1678,7 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags) * hugetlb has its own accounting separate from the core VM * VM_HUGETLB may not be set yet so we cannot check for that flag. */ - if (file && is_file_hugepages(file)) + if (file && is_file_hugetlbfs(file)) return 0; return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
The routine is_file_hugepages() checks f_op == hugetlbfs_file_operations to determine if the file resides in hugetlbfs. This is problematic when the file is on a union or overlay. Instead, define a new file mode FMODE_HUGETLBFS which is set when a hugetlbfs file is opened. The mode can easily be copied to other 'files' derived from the original hugetlbfs file. With this change hugetlbfs_file_operations can be static as it should be. There is also a (duplicate) set of shm file operations used for the routine is_file_shm_hugepages(). Instead of setting/using special f_op's, just propagate the FMODE_HUGETLBFS mode. This means is_file_shm_hugepages() and the duplicate f_ops can be removed. While cleaning things up, change the name of is_file_hugepages() to is_file_hugetlbfs(). The term hugepages is a bit ambiguous. A subsequent patch will propagate FMODE_HUGETLBFS in overlayfs. Suggested-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> --- fs/hugetlbfs/inode.c | 7 +++++++ fs/io_uring.c | 2 +- include/linux/fs.h | 3 +++ include/linux/hugetlb.h | 10 ++++------ include/linux/shm.h | 5 ----- ipc/shm.c | 34 ++++++++-------------------------- mm/memfd.c | 2 +- mm/mmap.c | 8 ++++---- 8 files changed, 28 insertions(+), 43 deletions(-)