diff mbox series

files: rcu free files_struct

Message ID 87o8j2svnt.fsf_-_@x220.int.ebiederm.org (mailing list archive)
State New, archived
Headers show
Series files: rcu free files_struct | expand

Commit Message

Eric W. Biederman Dec. 9, 2020, 6:04 p.m. UTC
This makes it safe to deference task->files with just rcu_read_lock
held, removing the need to take task_lock.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---

I have cleaned up my patch a little more and made this a little
more explicitly rcu.  I took a completley different approach to
documenting the use of rcu_head than we described that seems a little
more robust.

 fs/file.c               | 53 ++++++++++++++++++++++++++---------------
 include/linux/fdtable.h |  1 +
 kernel/fork.c           |  4 ++--
 3 files changed, 37 insertions(+), 21 deletions(-)

Comments

Linus Torvalds Dec. 9, 2020, 7:13 p.m. UTC | #1
On Wed, Dec 9, 2020 at 10:05 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> -                               struct file * file = xchg(&fdt->fd[i], NULL);
> +                               struct file * file = fdt->fd[i];
>                                 if (file) {
> +                                       rcu_assign_pointer(fdt->fd[i], NULL);

This makes me nervous. Why did we use to do that xchg() there? That
has atomicity guarantees that now are gone.

Now, this whole thing should be called for just the last ref of the fd
table, so presumably that atomicity was never needed in the first
place. But the fact that we did that very expensive xchg() then makes
me go "there's some reason for it".

Is this xchg() just bogus historical leftover? It kind of looks that
way. But maybe that change should be done separately?

              Linus
Matthew Wilcox (Oracle) Dec. 9, 2020, 7:49 p.m. UTC | #2
On Wed, Dec 09, 2020 at 12:04:38PM -0600, Eric W. Biederman wrote:
> @@ -397,8 +397,9 @@ static struct fdtable *close_files(struct files_struct * files)
>  		set = fdt->open_fds[j++];
>  		while (set) {
>  			if (set & 1) {
> -				struct file * file = xchg(&fdt->fd[i], NULL);
> +				struct file * file = fdt->fd[i];
>  				if (file) {
> +					rcu_assign_pointer(fdt->fd[i], NULL);

Assuming this is safe, you can use RCU_INIT_POINTER() here because you're
storing NULL, so you don't need the wmb() before storing the pointer.
Al Viro Dec. 9, 2020, 7:50 p.m. UTC | #3
On Wed, Dec 09, 2020 at 11:13:38AM -0800, Linus Torvalds wrote:
> On Wed, Dec 9, 2020 at 10:05 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >
> > -                               struct file * file = xchg(&fdt->fd[i], NULL);
> > +                               struct file * file = fdt->fd[i];
> >                                 if (file) {
> > +                                       rcu_assign_pointer(fdt->fd[i], NULL);
> 
> This makes me nervous. Why did we use to do that xchg() there? That
> has atomicity guarantees that now are gone.
> 
> Now, this whole thing should be called for just the last ref of the fd
> table, so presumably that atomicity was never needed in the first
> place. But the fact that we did that very expensive xchg() then makes
> me go "there's some reason for it".
> 
> Is this xchg() just bogus historical leftover? It kind of looks that
> way. But maybe that change should be done separately?

I'm still not convinced that exposing close_files() to parallel
3rd-party accesses is safe in all cases, so this patch still needs
more analysis.  And I'm none too happy about "we'll fix the things
up at the tail of the series" - the changes are subtle enough and
the area affected is rather fundamental.  So if we end up returning
to that several years from now while debugging something, I would
very much prefer to have the transformation series as clean and
understandable as possible.  It's not just about bisect hazard -
asking yourself "WTF had it been done that way, is there anything
subtle I'm missing here?" can cost many hours of head-scratching,
IME.

Eric, I understand that you want to avoid reordering/folding, but
in this case it _is_ needed.  It's not as if there had been any
serious objections to the overall direction of changes; it's
just that we need to get that as understandable as possible.
Eric W. Biederman Dec. 9, 2020, 9:32 p.m. UTC | #4
Al Viro <viro@zeniv.linux.org.uk> writes:

> On Wed, Dec 09, 2020 at 11:13:38AM -0800, Linus Torvalds wrote:
>> On Wed, Dec 9, 2020 at 10:05 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>> >
>> > -                               struct file * file = xchg(&fdt->fd[i], NULL);
>> > +                               struct file * file = fdt->fd[i];
>> >                                 if (file) {
>> > +                                       rcu_assign_pointer(fdt->fd[i], NULL);
>> 
>> This makes me nervous. Why did we use to do that xchg() there? That
>> has atomicity guarantees that now are gone.
>> 
>> Now, this whole thing should be called for just the last ref of the fd
>> table, so presumably that atomicity was never needed in the first
>> place. But the fact that we did that very expensive xchg() then makes
>> me go "there's some reason for it".
>> 
>> Is this xchg() just bogus historical leftover? It kind of looks that
>> way. But maybe that change should be done separately?
>
> I'm still not convinced that exposing close_files() to parallel
> 3rd-party accesses is safe in all cases, so this patch still needs
> more analysis.

That is fine.  I just wanted to post the latest version so we could
continue the discussion.  Especially with comments etc.

> And I'm none too happy about "we'll fix the things
> up at the tail of the series" - the changes are subtle enough and
> the area affected is rather fundamental.  So if we end up returning
> to that several years from now while debugging something, I would
> very much prefer to have the transformation series as clean and
> understandable as possible.  It's not just about bisect hazard -
> asking yourself "WTF had it been done that way, is there anything
> subtle I'm missing here?" can cost many hours of head-scratching,
> IME.

Fair enough.  I don't expect anyone is basing anything on that branch,
so a rebase is possible.

Now removing the pounding on task_lock isn't about correctness, and it
is not fixing a performance problem anyone has measured at this point.
So I do think it should be a follow on.  If for no other reason than to
keep the problem small enough it can fit in heads.

Similarly the dnotify stuff.  Your description certain makes it look
fishy but that the questionable parts are orthogonal to my patches.

> Eric, I understand that you want to avoid reordering/folding, but
> in this case it _is_ needed.  It's not as if there had been any
> serious objections to the overall direction of changes; it's
> just that we need to get that as understandable as possible.

I will post the patch that will become -1/24 in a moment.

Eric
Eric W. Biederman Dec. 9, 2020, 10:04 p.m. UTC | #5
Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Wed, Dec 9, 2020 at 10:05 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>>
>> -                               struct file * file = xchg(&fdt->fd[i], NULL);
>> +                               struct file * file = fdt->fd[i];
>>                                 if (file) {
>> +                                       rcu_assign_pointer(fdt->fd[i], NULL);
>
> This makes me nervous. Why did we use to do that xchg() there? That
> has atomicity guarantees that now are gone.
>
> Now, this whole thing should be called for just the last ref of the fd
> table, so presumably that atomicity was never needed in the first
> place. But the fact that we did that very expensive xchg() then makes
> me go "there's some reason for it".
>
> Is this xchg() just bogus historical leftover? It kind of looks that
> way. But maybe that change should be done separately?

Removing the xchg in a separate patch seems reasonable.  Just to make
the review easier.

I traced the xchg back to 7cf4dc3c8dbf ("move files_struct-related bits
from kernel/exit.c to fs/file.c") when put_files_struct was introduced.
The xchg did not exist before that change.

There were many other xchgs in the code back then so I suspect was left
over from some way an earlier version of the change worked and simply
was not removed when the patch was updated.

Eric
Eric W. Biederman Dec. 9, 2020, 10:06 p.m. UTC | #6
Matthew Wilcox <willy@infradead.org> writes:

> On Wed, Dec 09, 2020 at 12:04:38PM -0600, Eric W. Biederman wrote:
>> @@ -397,8 +397,9 @@ static struct fdtable *close_files(struct files_struct * files)
>>  		set = fdt->open_fds[j++];
>>  		while (set) {
>>  			if (set & 1) {
>> -				struct file * file = xchg(&fdt->fd[i], NULL);
>> +				struct file * file = fdt->fd[i];
>>  				if (file) {
>> +					rcu_assign_pointer(fdt->fd[i], NULL);
>
> Assuming this is safe, you can use RCU_INIT_POINTER() here because you're
> storing NULL, so you don't need the wmb() before storing the pointer.

Thanks.  I was remembering there was a special case like and I had
forgotten what the rule was.

Eric
Al Viro Dec. 9, 2020, 10:58 p.m. UTC | #7
On Wed, Dec 09, 2020 at 07:49:38PM +0000, Matthew Wilcox wrote:
> On Wed, Dec 09, 2020 at 12:04:38PM -0600, Eric W. Biederman wrote:
> > @@ -397,8 +397,9 @@ static struct fdtable *close_files(struct files_struct * files)
> >  		set = fdt->open_fds[j++];
> >  		while (set) {
> >  			if (set & 1) {
> > -				struct file * file = xchg(&fdt->fd[i], NULL);
> > +				struct file * file = fdt->fd[i];
> >  				if (file) {
> > +					rcu_assign_pointer(fdt->fd[i], NULL);
> 
> Assuming this is safe, you can use RCU_INIT_POINTER() here because you're
> storing NULL, so you don't need the wmb() before storing the pointer.

fs/file.c:pick_file() would make more interesting target for the same treatment...
Linus Torvalds Dec. 9, 2020, 11:01 p.m. UTC | #8
On Wed, Dec 9, 2020 at 2:58 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Dec 09, 2020 at 07:49:38PM +0000, Matthew Wilcox wrote:
> >
> > Assuming this is safe, you can use RCU_INIT_POINTER() here because you're
> > storing NULL, so you don't need the wmb() before storing the pointer.
>
> fs/file.c:pick_file() would make more interesting target for the same treatment...

Actually, don't.

rcu_assign_pointer() itself already does the optimization for the case
of a constant NULL pointer assignment.

So there's no need to manually change things to RCU_INIT_POINTER().

           Linus
Linus Torvalds Dec. 9, 2020, 11:04 p.m. UTC | #9
On Wed, Dec 9, 2020 at 3:01 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> rcu_assign_pointer() itself already does the optimization for the case
> of a constant NULL pointer assignment.
>
> So there's no need to manually change things to RCU_INIT_POINTER().

Side note: what should be done instead is to delete the stale comment.
It should have been removed back when the optimization was done in
2016 - see commit 3a37f7275cda ("rcu: No ordering for
rcu_assign_pointer() of NULL")

            Linus
Matthew Wilcox (Oracle) Dec. 9, 2020, 11:07 p.m. UTC | #10
On Wed, Dec 09, 2020 at 03:01:36PM -0800, Linus Torvalds wrote:
> On Wed, Dec 9, 2020 at 2:58 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > On Wed, Dec 09, 2020 at 07:49:38PM +0000, Matthew Wilcox wrote:
> > >
> > > Assuming this is safe, you can use RCU_INIT_POINTER() here because you're
> > > storing NULL, so you don't need the wmb() before storing the pointer.
> >
> > fs/file.c:pick_file() would make more interesting target for the same treatment...
> 
> Actually, don't.
> 
> rcu_assign_pointer() itself already does the optimization for the case
> of a constant NULL pointer assignment.
> 
> So there's no need to manually change things to RCU_INIT_POINTER().

I missed that, and the documentation wasn't updated by
3a37f7275cda5ad25c1fe9be8f20c76c60d175fa.

Paul, how about this?

+++ b/Documentation/RCU/Design/Requirements/Requirements.rst
@@ -1668,8 +1668,10 @@ against mishaps and misuse:
    this purpose.
 #. It is not necessary to use rcu_assign_pointer() when creating
    linked structures that are to be published via a single external
-   pointer. The RCU_INIT_POINTER() macro is provided for this task
-   and also for assigning ``NULL`` pointers at runtime.
+   pointer. The RCU_INIT_POINTER() macro is provided for this task.
+   It used to be more efficient to use RCU_INIT_POINTER() to store a
+   ``NULL`` pointer, but rcu_assign_pointer() now optimises for a constant
+   ``NULL`` pointer itself.
 
 This not a hard-and-fast list: RCU's diagnostic capabilities will
 continue to be guided by the number and type of usage bugs found in
Paul E. McKenney Dec. 9, 2020, 11:26 p.m. UTC | #11
On Wed, Dec 09, 2020 at 11:07:55PM +0000, Matthew Wilcox wrote:
> On Wed, Dec 09, 2020 at 03:01:36PM -0800, Linus Torvalds wrote:
> > On Wed, Dec 9, 2020 at 2:58 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > On Wed, Dec 09, 2020 at 07:49:38PM +0000, Matthew Wilcox wrote:
> > > >
> > > > Assuming this is safe, you can use RCU_INIT_POINTER() here because you're
> > > > storing NULL, so you don't need the wmb() before storing the pointer.
> > >
> > > fs/file.c:pick_file() would make more interesting target for the same treatment...
> > 
> > Actually, don't.
> > 
> > rcu_assign_pointer() itself already does the optimization for the case
> > of a constant NULL pointer assignment.
> > 
> > So there's no need to manually change things to RCU_INIT_POINTER().
> 
> I missed that, and the documentation wasn't updated by
> 3a37f7275cda5ad25c1fe9be8f20c76c60d175fa.

Can't trust the author of that patch!  ;-)

> Paul, how about this?
> 
> +++ b/Documentation/RCU/Design/Requirements/Requirements.rst
> @@ -1668,8 +1668,10 @@ against mishaps and misuse:
>     this purpose.
>  #. It is not necessary to use rcu_assign_pointer() when creating
>     linked structures that are to be published via a single external
> -   pointer. The RCU_INIT_POINTER() macro is provided for this task
> -   and also for assigning ``NULL`` pointers at runtime.
> +   pointer. The RCU_INIT_POINTER() macro is provided for this task.
> +   It used to be more efficient to use RCU_INIT_POINTER() to store a
> +   ``NULL`` pointer, but rcu_assign_pointer() now optimises for a constant
> +   ``NULL`` pointer itself.
>  
>  This not a hard-and-fast list: RCU's diagnostic capabilities will
>  continue to be guided by the number and type of usage bugs found in

Looks good to me!  If you send a complete patch, I will be happy to pull
it in.

							Thanx, Paul
Linus Torvalds Dec. 9, 2020, 11:29 p.m. UTC | #12
On Wed, Dec 9, 2020 at 3:07 PM Matthew Wilcox <willy@infradead.org> wrote:
>
>  #. It is not necessary to use rcu_assign_pointer() when creating
>     linked structures that are to be published via a single external
> -   pointer. The RCU_INIT_POINTER() macro is provided for this task
> -   and also for assigning ``NULL`` pointers at runtime.
> +   pointer. The RCU_INIT_POINTER() macro is provided for this task.
> +   It used to be more efficient to use RCU_INIT_POINTER() to store a
> +   ``NULL`` pointer, but rcu_assign_pointer() now optimises for a constant
> +   ``NULL`` pointer itself.

I would just remove the historical note about "It used to be more
efficient ..". It's not really helpful.

If somebody wants to dig into the history, it's there in git.

Maybe the note could be part of the commit message for the comment
update, explaining that it used to be more efficient but no longer is.
Because commit messages are about the explanation for the change.

But I don't feel it makes any sense in the source code.

             Linus
Paul E. McKenney Dec. 10, 2020, 12:39 a.m. UTC | #13
On Wed, Dec 09, 2020 at 03:29:09PM -0800, Linus Torvalds wrote:
> On Wed, Dec 9, 2020 at 3:07 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> >  #. It is not necessary to use rcu_assign_pointer() when creating
> >     linked structures that are to be published via a single external
> > -   pointer. The RCU_INIT_POINTER() macro is provided for this task
> > -   and also for assigning ``NULL`` pointers at runtime.
> > +   pointer. The RCU_INIT_POINTER() macro is provided for this task.
> > +   It used to be more efficient to use RCU_INIT_POINTER() to store a
> > +   ``NULL`` pointer, but rcu_assign_pointer() now optimises for a constant
> > +   ``NULL`` pointer itself.
> 
> I would just remove the historical note about "It used to be more
> efficient ..". It's not really helpful.
> 
> If somebody wants to dig into the history, it's there in git.
> 
> Maybe the note could be part of the commit message for the comment
> update, explaining that it used to be more efficient but no longer is.
> Because commit messages are about the explanation for the change.
> 
> But I don't feel it makes any sense in the source code.

Like this, then?

							Thanx, Paul

 #. It is not necessary to use rcu_assign_pointer() when creating
    linked structures that are to be published via a single external
-   pointer. The RCU_INIT_POINTER() macro is provided for this task
-   and also for assigning ``NULL`` pointers at runtime.
+   pointer. The RCU_INIT_POINTER() macro is provided for this task.
Linus Torvalds Dec. 10, 2020, 12:41 a.m. UTC | #14
On Wed, Dec 9, 2020 at 4:39 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> Like this, then?

Ack.

           Linus
Paul E. McKenney Dec. 10, 2020, 12:57 a.m. UTC | #15
On Wed, Dec 09, 2020 at 04:41:06PM -0800, Linus Torvalds wrote:
> On Wed, Dec 9, 2020 at 4:39 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > Like this, then?
> 
> Ack.

Queued with Matthew's Reported-by and your Acked-by, thank you all!

							Thanx, Paul
Al Viro Dec. 10, 2020, 6:13 a.m. UTC | #16
On Wed, Dec 09, 2020 at 03:32:38PM -0600, Eric W. Biederman wrote:
> Al Viro <viro@zeniv.linux.org.uk> writes:
> 
> > On Wed, Dec 09, 2020 at 11:13:38AM -0800, Linus Torvalds wrote:
> >> On Wed, Dec 9, 2020 at 10:05 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
> >> >
> >> > -                               struct file * file = xchg(&fdt->fd[i], NULL);
> >> > +                               struct file * file = fdt->fd[i];
> >> >                                 if (file) {
> >> > +                                       rcu_assign_pointer(fdt->fd[i], NULL);
> >> 
> >> This makes me nervous. Why did we use to do that xchg() there? That
> >> has atomicity guarantees that now are gone.
> >> 
> >> Now, this whole thing should be called for just the last ref of the fd
> >> table, so presumably that atomicity was never needed in the first
> >> place. But the fact that we did that very expensive xchg() then makes
> >> me go "there's some reason for it".
> >> 
> >> Is this xchg() just bogus historical leftover? It kind of looks that
> >> way. But maybe that change should be done separately?
> >
> > I'm still not convinced that exposing close_files() to parallel
> > 3rd-party accesses is safe in all cases, so this patch still needs
> > more analysis.
> 
> That is fine.  I just wanted to post the latest version so we could
> continue the discussion.  Especially with comments etc.

It's probably safe.  I've spent today digging through the mess in
fs/notify and kernel/bpf, and while I'm disgusted with both, at
that point I believe that close_files() exposure is not going to
create problems with either.  And xchg() in there _is_ useless.

Said that, BPF "file iterator" stuff is potentially very unpleasant -
it allows to pin a struct file found in any process' descriptor table
indefinitely long.  Temporary struct file references grabbed by procfs
code, while unfortunate, are at least short-lived; with this stuff sky's
the limit.

I'm not happy about having that available, especially if it's a user-visible
primitive we can't withdraw at zero notice ;-/

What are the users of that thing and is there any chance to replace it
with something saner?  IOW, what *is* realistically called for each
struct file by the users of that iterator?
Eric W. Biederman Dec. 10, 2020, 7:29 p.m. UTC | #17
Al Viro <viro@zeniv.linux.org.uk> writes:

> On Wed, Dec 09, 2020 at 03:32:38PM -0600, Eric W. Biederman wrote:
>> Al Viro <viro@zeniv.linux.org.uk> writes:
>> 
>> > On Wed, Dec 09, 2020 at 11:13:38AM -0800, Linus Torvalds wrote:
>> >> On Wed, Dec 9, 2020 at 10:05 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>> >> >
>> >> > -                               struct file * file = xchg(&fdt->fd[i], NULL);
>> >> > +                               struct file * file = fdt->fd[i];
>> >> >                                 if (file) {
>> >> > +                                       rcu_assign_pointer(fdt->fd[i], NULL);
>> >> 
>> >> This makes me nervous. Why did we use to do that xchg() there? That
>> >> has atomicity guarantees that now are gone.
>> >> 
>> >> Now, this whole thing should be called for just the last ref of the fd
>> >> table, so presumably that atomicity was never needed in the first
>> >> place. But the fact that we did that very expensive xchg() then makes
>> >> me go "there's some reason for it".
>> >> 
>> >> Is this xchg() just bogus historical leftover? It kind of looks that
>> >> way. But maybe that change should be done separately?
>> >
>> > I'm still not convinced that exposing close_files() to parallel
>> > 3rd-party accesses is safe in all cases, so this patch still needs
>> > more analysis.
>> 
>> That is fine.  I just wanted to post the latest version so we could
>> continue the discussion.  Especially with comments etc.
>
> It's probably safe.  I've spent today digging through the mess in
> fs/notify and kernel/bpf, and while I'm disgusted with both, at
> that point I believe that close_files() exposure is not going to
> create problems with either.  And xchg() in there _is_ useless.

Then I will work on a cleaned up version.

> Said that, BPF "file iterator" stuff is potentially very unpleasant -
> it allows to pin a struct file found in any process' descriptor table
> indefinitely long.  Temporary struct file references grabbed by procfs
> code, while unfortunate, are at least short-lived; with this stuff sky's
> the limit.
>
> I'm not happy about having that available, especially if it's a user-visible
> primitive we can't withdraw at zero notice ;-/
>
> What are the users of that thing and is there any chance to replace it
> with something saner?  IOW, what *is* realistically called for each
> struct file by the users of that iterator?

The bpf guys are no longer Cc'd and they can probably answer better than
I.

In a previous conversation it was mentioned that task_iter was supposed
to be a high performance interface for getting proc like data out of the
kernel using bpf.

If so I think that handles the lifetime issues as bpf programs are
supposed to be short-lived and can not pass references anywhere.

On the flip side it raises the question did the BPF guys just make the
current layout of task_struct and struct file part of the linux kernel
user space ABI?

Eric
Al Viro Dec. 10, 2020, 9:36 p.m. UTC | #18
On Thu, Dec 10, 2020 at 01:29:01PM -0600, Eric W. Biederman wrote:
> Al Viro <viro@zeniv.linux.org.uk> writes:

> > What are the users of that thing and is there any chance to replace it
> > with something saner?  IOW, what *is* realistically called for each
> > struct file by the users of that iterator?
> 
> The bpf guys are no longer Cc'd and they can probably answer better than
> I.
> 
> In a previous conversation it was mentioned that task_iter was supposed
> to be a high performance interface for getting proc like data out of the
> kernel using bpf.
> 
> If so I think that handles the lifetime issues as bpf programs are
> supposed to be short-lived and can not pass references anywhere.
> 
> On the flip side it raises the question did the BPF guys just make the
> current layout of task_struct and struct file part of the linux kernel
> user space ABI?

An interesting question, that...  For the record: anybody coming to
complain about a removed/renamed/replaced with something else field
in struct file will be refered to Figure 1.

None of the VFS data structures has any layout stability warranties.
If BPF folks want access to something in that, they are welcome to come
and discuss the set of accessors; so far nothing of that sort has happened.

Direct access to any fields of any of those structures is subject to
being broken at zero notice.

IMO we need some notation for a structure being off-limits for BPF, tracing,
etc., along the lines of "don't access any field directly".
Christian Brauner Dec. 10, 2020, 10:30 p.m. UTC | #19
On Thu, Dec 10, 2020 at 09:36:24PM +0000, Al Viro wrote:
> On Thu, Dec 10, 2020 at 01:29:01PM -0600, Eric W. Biederman wrote:
> > Al Viro <viro@zeniv.linux.org.uk> writes:
> 
> > > What are the users of that thing and is there any chance to replace it
> > > with something saner?  IOW, what *is* realistically called for each
> > > struct file by the users of that iterator?
> > 
> > The bpf guys are no longer Cc'd and they can probably answer better than
> > I.
> > 
> > In a previous conversation it was mentioned that task_iter was supposed
> > to be a high performance interface for getting proc like data out of the
> > kernel using bpf.
> > 
> > If so I think that handles the lifetime issues as bpf programs are
> > supposed to be short-lived and can not pass references anywhere.
> > 
> > On the flip side it raises the question did the BPF guys just make the
> > current layout of task_struct and struct file part of the linux kernel
> > user space ABI?
> 
> An interesting question, that...  For the record: anybody coming to

Imho, they did. An example from the BPF LSM: a few weeks ago someone
asked me whether it would be possible to use the BPF LSM to enforce you
can't open files when they are on a given filesystem. Sine this bpf lsm
allows to attach to lsm hooks, say security_file_open(), you can get at
the superblock and check the filesyste type in a bpf program
(requiring btf), i.e. security_file_open, then follow
file->f_inode->i_sb->s_type->s_magic. If we change the say struct
super_block I'd expect these bpf programs to break. I'm sure there's
something clever that they came up with but it is nonetheless
uncomfortably close to making internal kernel structures part of
userspace ABI indeed.

> complain about a removed/renamed/replaced with something else field
> in struct file will be refered to Figure 1.
> 
> None of the VFS data structures has any layout stability warranties.
> If BPF folks want access to something in that, they are welcome to come
> and discuss the set of accessors; so far nothing of that sort has happened.
> 
> Direct access to any fields of any of those structures is subject to
> being broken at zero notice.
> 
> IMO we need some notation for a structure being off-limits for BPF, tracing,
> etc., along the lines of "don't access any field directly".

Indeed. I would also like to see a list where changes need to be sent
that are technically specific to a subsystem but will necessarily have
kernel-wide impact prime example: a lot of bpf.

Christian
Al Viro Dec. 10, 2020, 10:54 p.m. UTC | #20
On Thu, Dec 10, 2020 at 11:30:24PM +0100, Christian Brauner wrote:
> (requiring btf), i.e. security_file_open, then follow
> file->f_inode->i_sb->s_type->s_magic. If we change the say struct
> super_block I'd expect these bpf programs to break.

To break they would need to have compiled in the first place;
->s_type is struct file_system_type and it contains no ->s_magic
(nor would it be possible, really - ->s_magic can vary between
filesystems that *do* share ->s_type).
Al Viro Dec. 10, 2020, 11:10 p.m. UTC | #21
On Thu, Dec 10, 2020 at 10:54:05PM +0000, Al Viro wrote:
> On Thu, Dec 10, 2020 at 11:30:24PM +0100, Christian Brauner wrote:
> > (requiring btf), i.e. security_file_open, then follow
> > file->f_inode->i_sb->s_type->s_magic. If we change the say struct
> > super_block I'd expect these bpf programs to break.
> 
> To break they would need to have compiled in the first place;
> ->s_type is struct file_system_type and it contains no ->s_magic
> (nor would it be possible, really - ->s_magic can vary between
> filesystems that *do* share ->s_type).

Incidentally, a lot of things in e.g. struct dentry need care when
accessing; the fields are there, but e.g. blind access to name or
parent really can oops.  Moreover, blindly following a chain of
->d_parent pointers without taking appropriate precautions might
end up reading from arbitrary kernel address, including iomem ones.
I don't see anything that would prevent that...

TAINT_BPF would probably be too impractical, since there's a lot
of boxen using it more reasonably on the networking side.  But
it really looks like we *do* need annotations with their violation
triggering a taint, so that BS bug reports could be discarded.
diff mbox series

Patch

diff --git a/fs/file.c b/fs/file.c
index 412033d8cfdf..88064f185560 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -379,7 +379,7 @@  struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int
 	return NULL;
 }
 
-static struct fdtable *close_files(struct files_struct * files)
+static void close_files(struct files_struct * files)
 {
 	/*
 	 * It is safe to dereference the fd table without RCU or
@@ -397,8 +397,9 @@  static struct fdtable *close_files(struct files_struct * files)
 		set = fdt->open_fds[j++];
 		while (set) {
 			if (set & 1) {
-				struct file * file = xchg(&fdt->fd[i], NULL);
+				struct file * file = fdt->fd[i];
 				if (file) {
+					rcu_assign_pointer(fdt->fd[i], NULL);
 					filp_close(file, files);
 					cond_resched();
 				}
@@ -407,19 +408,35 @@  static struct fdtable *close_files(struct files_struct * files)
 			set >>= 1;
 		}
 	}
+}
 
-	return fdt;
+static void free_files_struct_rcu(struct rcu_head *rcu)
+{
+	struct files_struct *files =
+		container_of(rcu, struct files_struct, fdtab.rcu);
+	struct fdtable *fdt = rcu_dereference_raw(files->fdt);
+
+	/* free the arrays if they are not embedded */
+	if (fdt != &files->fdtab)
+		__free_fdtable(fdt);
+	kmem_cache_free(files_cachep, files);
 }
 
 void put_files_struct(struct files_struct *files)
 {
 	if (atomic_dec_and_test(&files->count)) {
-		struct fdtable *fdt = close_files(files);
-
-		/* free the arrays if they are not embedded */
-		if (fdt != &files->fdtab)
-			__free_fdtable(fdt);
-		kmem_cache_free(files_cachep, files);
+		close_files(files);
+		/*
+		 * The rules for using the rcu_head in fdtable:
+		 *
+		 * If the fdtable is being freed independently
+		 * of the files_struct fdtable->rcu is used.
+		 *
+		 * When the fdtable and files_struct are freed
+		 * together the rcu_head from the fdtable embedded in
+		 * files_struct is used.
+		 */
+		call_rcu(&files->fdtab.rcu, free_files_struct_rcu);
 	}
 }
 
@@ -822,12 +839,14 @@  EXPORT_SYMBOL(fget_raw);
 
 struct file *fget_task(struct task_struct *task, unsigned int fd)
 {
+	struct files_struct *files;
 	struct file *file = NULL;
 
-	task_lock(task);
-	if (task->files)
-		file = __fget_files(task->files, fd, 0, 1);
-	task_unlock(task);
+	rcu_read_lock();
+	files = rcu_dereference(task->files);
+	if (files)
+		file = __fget_files(files, fd, 0, 1);
+	rcu_read_unlock();
 
 	return file;
 }
@@ -838,11 +857,9 @@  struct file *task_lookup_fd_rcu(struct task_struct *task, unsigned int fd)
 	struct files_struct *files;
 	struct file *file = NULL;
 
-	task_lock(task);
-	files = task->files;
+	files = rcu_dereference(task->files);
 	if (files)
 		file = files_lookup_fd_rcu(files, fd);
-	task_unlock(task);
 
 	return file;
 }
@@ -854,8 +871,7 @@  struct file *task_lookup_next_fd_rcu(struct task_struct *task, unsigned int *ret
 	unsigned int fd = *ret_fd;
 	struct file *file = NULL;
 
-	task_lock(task);
-	files = task->files;
+	files = rcu_dereference(task->files);
 	if (files) {
 		for (; fd < files_fdtable(files)->max_fds; fd++) {
 			file = files_lookup_fd_rcu(files, fd);
@@ -863,7 +879,6 @@  struct file *task_lookup_next_fd_rcu(struct task_struct *task, unsigned int *ret
 				break;
 		}
 	}
-	task_unlock(task);
 	*ret_fd = fd;
 	return file;
 }
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index d0e78174874a..37dcface020f 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -30,6 +30,7 @@  struct fdtable {
 	unsigned long *close_on_exec;
 	unsigned long *open_fds;
 	unsigned long *full_fds_bits;
+	/* Used for freeing fdtable and any containing files_struct */
 	struct rcu_head rcu;
 };
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 4d0ae6f827df..eca474a1586a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2982,7 +2982,7 @@  int ksys_unshare(unsigned long unshare_flags)
 
 		if (new_fd) {
 			fd = current->files;
-			current->files = new_fd;
+			rcu_assign_pointer(current->files, new_fd);
 			new_fd = fd;
 		}
 
@@ -3035,7 +3035,7 @@  int unshare_files(void)
 
 	old = task->files;
 	task_lock(task);
-	task->files = copy;
+	rcu_assign_pointer(task->files, copy);
 	task_unlock(task);
 	put_files_struct(old);
 	return 0;