diff mbox series

[RFC,v2,01/18] vfs: export new_inode_pseudo

Message ID 20200904160537.76663-2-jlayton@kernel.org (mailing list archive)
State Superseded
Headers show
Series ceph+fscrypt: context, filename and symlink support | expand

Commit Message

Jeff Layton Sept. 4, 2020, 4:05 p.m. UTC
Ceph needs to be able to allocate inodes ahead of a create that might
involve a fscrypt-encrypted inode. new_inode() almost fits the bill,
but it puts the inode on the sb->s_inodes list, and we don't want to
do that until we're ready to insert it into the hash.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 fs/inode.c | 1 +
 1 file changed, 1 insertion(+)

Comments

Eric Biggers Sept. 8, 2020, 3:38 a.m. UTC | #1
On Fri, Sep 04, 2020 at 12:05:20PM -0400, Jeff Layton wrote:
> Ceph needs to be able to allocate inodes ahead of a create that might
> involve a fscrypt-encrypted inode. new_inode() almost fits the bill,
> but it puts the inode on the sb->s_inodes list, and we don't want to
> do that until we're ready to insert it into the hash.
> 
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---
>  fs/inode.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index 72c4c347afb7..61554c2477ab 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -935,6 +935,7 @@ struct inode *new_inode_pseudo(struct super_block *sb)
>  	}
>  	return inode;
>  }
> +EXPORT_SYMBOL(new_inode_pseudo);
>  

What's the problem with putting the new inode on sb->s_inodes already?
That's what all the other filesystems do.

- Eric
Jeff Layton Sept. 8, 2020, 11:27 a.m. UTC | #2
On Mon, 2020-09-07 at 20:38 -0700, Eric Biggers wrote:
> On Fri, Sep 04, 2020 at 12:05:20PM -0400, Jeff Layton wrote:
> > Ceph needs to be able to allocate inodes ahead of a create that might
> > involve a fscrypt-encrypted inode. new_inode() almost fits the bill,
> > but it puts the inode on the sb->s_inodes list, and we don't want to
> > do that until we're ready to insert it into the hash.
> > 
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > ---
> >  fs/inode.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 72c4c347afb7..61554c2477ab 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -935,6 +935,7 @@ struct inode *new_inode_pseudo(struct super_block *sb)
> >  	}
> >  	return inode;
> >  }
> > +EXPORT_SYMBOL(new_inode_pseudo);
> >  
> 
> What's the problem with putting the new inode on sb->s_inodes already?
> That's what all the other filesystems do.
> 

The existing ones are all local filesystems that use
insert_inode_locked() and similar paths. Ceph needs to use the '5'
variants of those functions (inode_insert5(), iget5_locked(), etc.).

When we go to insert it into the hash in inode_insert5(), we'd need to
set I_CREATING if allocated from new_inode(). But, if you do _that_,
then you'll get back ESTALE from find_inode() if (eg.) someone calls
iget5_locked() before you can clear I_CREATING.

Hitting that race is easy with an asynchronous create. The simplest
scheme to avoid that is to just export new_inode_pseudo and keep it off
of s_inodes until we're ready to do the insert. The only real issue here
is that this inode won't be findable by evict_inodes during umount, but
that shouldn't be happening during an active syscall anyway.
Eric Biggers Sept. 8, 2020, 10:31 p.m. UTC | #3
On Tue, Sep 08, 2020 at 07:27:58AM -0400, Jeff Layton wrote:
> On Mon, 2020-09-07 at 20:38 -0700, Eric Biggers wrote:
> > On Fri, Sep 04, 2020 at 12:05:20PM -0400, Jeff Layton wrote:
> > > Ceph needs to be able to allocate inodes ahead of a create that might
> > > involve a fscrypt-encrypted inode. new_inode() almost fits the bill,
> > > but it puts the inode on the sb->s_inodes list, and we don't want to
> > > do that until we're ready to insert it into the hash.
> > > 
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > ---
> > >  fs/inode.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > > 
> > > diff --git a/fs/inode.c b/fs/inode.c
> > > index 72c4c347afb7..61554c2477ab 100644
> > > --- a/fs/inode.c
> > > +++ b/fs/inode.c
> > > @@ -935,6 +935,7 @@ struct inode *new_inode_pseudo(struct super_block *sb)
> > >  	}
> > >  	return inode;
> > >  }
> > > +EXPORT_SYMBOL(new_inode_pseudo);
> > >  
> > 
> > What's the problem with putting the new inode on sb->s_inodes already?
> > That's what all the other filesystems do.
> > 
> 
> The existing ones are all local filesystems that use
> insert_inode_locked() and similar paths. Ceph needs to use the '5'
> variants of those functions (inode_insert5(), iget5_locked(), etc.).
> 
> When we go to insert it into the hash in inode_insert5(), we'd need to
> set I_CREATING if allocated from new_inode(). But, if you do _that_,
> then you'll get back ESTALE from find_inode() if (eg.) someone calls
> iget5_locked() before you can clear I_CREATING.
> 
> Hitting that race is easy with an asynchronous create. The simplest
> scheme to avoid that is to just export new_inode_pseudo and keep it off
> of s_inodes until we're ready to do the insert. The only real issue here
> is that this inode won't be findable by evict_inodes during umount, but
> that shouldn't be happening during an active syscall anyway.

Is your concern the following scenario?

1. ceph successfully created a new file on the server
2. inode_insert5() is called for the new file's inode
3. error occurs in ceph_fill_inode()
4. discard_new_inode() is called
5. another thread looks up the inode and gets ESTALE
6. iput() is finally called

And the claim is that the ESTALE in (5) is unexpected?  I'm not sure that it's
unexpected, given that the file allegedly failed to be created...  Also, it
seems that maybe (3) isn't something that should actually happen, since after
(1) it's too late to fail the file creation.

- Eric
Jeff Layton Sept. 9, 2020, 10:47 a.m. UTC | #4
On Tue, 2020-09-08 at 15:31 -0700, Eric Biggers wrote:
> On Tue, Sep 08, 2020 at 07:27:58AM -0400, Jeff Layton wrote:
> > On Mon, 2020-09-07 at 20:38 -0700, Eric Biggers wrote:
> > > On Fri, Sep 04, 2020 at 12:05:20PM -0400, Jeff Layton wrote:
> > > > Ceph needs to be able to allocate inodes ahead of a create that might
> > > > involve a fscrypt-encrypted inode. new_inode() almost fits the bill,
> > > > but it puts the inode on the sb->s_inodes list, and we don't want to
> > > > do that until we're ready to insert it into the hash.
> > > > 
> > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > ---
> > > >  fs/inode.c | 1 +
> > > >  1 file changed, 1 insertion(+)
> > > > 
> > > > diff --git a/fs/inode.c b/fs/inode.c
> > > > index 72c4c347afb7..61554c2477ab 100644
> > > > --- a/fs/inode.c
> > > > +++ b/fs/inode.c
> > > > @@ -935,6 +935,7 @@ struct inode *new_inode_pseudo(struct super_block *sb)
> > > >  	}
> > > >  	return inode;
> > > >  }
> > > > +EXPORT_SYMBOL(new_inode_pseudo);
> > > >  
> > > 
> > > What's the problem with putting the new inode on sb->s_inodes already?
> > > That's what all the other filesystems do.
> > > 
> > 
> > The existing ones are all local filesystems that use
> > insert_inode_locked() and similar paths. Ceph needs to use the '5'
> > variants of those functions (inode_insert5(), iget5_locked(), etc.).
> > 
> > When we go to insert it into the hash in inode_insert5(), we'd need to
> > set I_CREATING if allocated from new_inode(). But, if you do _that_,
> > then you'll get back ESTALE from find_inode() if (eg.) someone calls
> > iget5_locked() before you can clear I_CREATING.
> > 
> > Hitting that race is easy with an asynchronous create. The simplest
> > scheme to avoid that is to just export new_inode_pseudo and keep it off
> > of s_inodes until we're ready to do the insert. The only real issue here
> > is that this inode won't be findable by evict_inodes during umount, but
> > that shouldn't be happening during an active syscall anyway.
> 
> Is your concern the following scenario?
> 
> 1. ceph successfully created a new file on the server
> 2. inode_insert5() is called for the new file's inode
> 3. error occurs in ceph_fill_inode()
> 4. discard_new_inode() is called
> 5. another thread looks up the inode and gets ESTALE
> 6. iput() is finally called
> 
> And the claim is that the ESTALE in (5) is unexpected?  I'm not sure that it's
> unexpected, given that the file allegedly failed to be created...  Also, it
> seems that maybe (3) isn't something that should actually happen, since after
> (1) it's too late to fail the file creation.
> 

No, more like:

Syscall					Workqueue
------------------------------------------------------------------------------
1. allocate an inode
2. determine we can do an async create
   and allocate an inode number for it
3. hash the inode (must set I_CREATING
   if we allocated with new_inode()) 
4. issue the call to the MDS
5. finish filling out the inode()
6.					MDS reply comes in, and workqueue thread
					looks up new inode (-ESTALE)
7. unlock_new_inode()


Because 6 happens before 7 in this case, we get an ESTALE on that
lookup. By using new_inode_pseudo() and not setting I_CREATING, 6 ends
up waiting on the inode to be unlocked rather than giving up.
Eric Biggers Sept. 9, 2020, 4:12 p.m. UTC | #5
On Wed, Sep 09, 2020 at 06:47:28AM -0400, Jeff Layton wrote:
> On Tue, 2020-09-08 at 15:31 -0700, Eric Biggers wrote:
> > On Tue, Sep 08, 2020 at 07:27:58AM -0400, Jeff Layton wrote:
> > > On Mon, 2020-09-07 at 20:38 -0700, Eric Biggers wrote:
> > > > On Fri, Sep 04, 2020 at 12:05:20PM -0400, Jeff Layton wrote:
> > > > > Ceph needs to be able to allocate inodes ahead of a create that might
> > > > > involve a fscrypt-encrypted inode. new_inode() almost fits the bill,
> > > > > but it puts the inode on the sb->s_inodes list, and we don't want to
> > > > > do that until we're ready to insert it into the hash.
> > > > > 
> > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > > ---
> > > > >  fs/inode.c | 1 +
> > > > >  1 file changed, 1 insertion(+)
> > > > > 
> > > > > diff --git a/fs/inode.c b/fs/inode.c
> > > > > index 72c4c347afb7..61554c2477ab 100644
> > > > > --- a/fs/inode.c
> > > > > +++ b/fs/inode.c
> > > > > @@ -935,6 +935,7 @@ struct inode *new_inode_pseudo(struct super_block *sb)
> > > > >  	}
> > > > >  	return inode;
> > > > >  }
> > > > > +EXPORT_SYMBOL(new_inode_pseudo);
> > > > >  
> > > > 
> > > > What's the problem with putting the new inode on sb->s_inodes already?
> > > > That's what all the other filesystems do.
> > > > 
> > > 
> > > The existing ones are all local filesystems that use
> > > insert_inode_locked() and similar paths. Ceph needs to use the '5'
> > > variants of those functions (inode_insert5(), iget5_locked(), etc.).
> > > 
> > > When we go to insert it into the hash in inode_insert5(), we'd need to
> > > set I_CREATING if allocated from new_inode(). But, if you do _that_,
> > > then you'll get back ESTALE from find_inode() if (eg.) someone calls
> > > iget5_locked() before you can clear I_CREATING.
> > > 
> > > Hitting that race is easy with an asynchronous create. The simplest
> > > scheme to avoid that is to just export new_inode_pseudo and keep it off
> > > of s_inodes until we're ready to do the insert. The only real issue here
> > > is that this inode won't be findable by evict_inodes during umount, but
> > > that shouldn't be happening during an active syscall anyway.
> > 
> > Is your concern the following scenario?
> > 
> > 1. ceph successfully created a new file on the server
> > 2. inode_insert5() is called for the new file's inode
> > 3. error occurs in ceph_fill_inode()
> > 4. discard_new_inode() is called
> > 5. another thread looks up the inode and gets ESTALE
> > 6. iput() is finally called
> > 
> > And the claim is that the ESTALE in (5) is unexpected?  I'm not sure that it's
> > unexpected, given that the file allegedly failed to be created...  Also, it
> > seems that maybe (3) isn't something that should actually happen, since after
> > (1) it's too late to fail the file creation.
> > 
> 
> No, more like:
> 
> Syscall					Workqueue
> ------------------------------------------------------------------------------
> 1. allocate an inode
> 2. determine we can do an async create
>    and allocate an inode number for it
> 3. hash the inode (must set I_CREATING
>    if we allocated with new_inode()) 
> 4. issue the call to the MDS
> 5. finish filling out the inode()
> 6.					MDS reply comes in, and workqueue thread
> 					looks up new inode (-ESTALE)
> 7. unlock_new_inode()
> 
> 
> Because 6 happens before 7 in this case, we get an ESTALE on that
> lookup.

How is ESTALE at (6) possible?  (3) will set I_NEW on the inode when inserting
it into the inode hash table.  Then (6) will wait for I_NEW to be cleared before
returning the inode.  (7) will clear I_NEW and I_CREATING together.

- Eric
Jeff Layton Sept. 9, 2020, 4:51 p.m. UTC | #6
On Wed, 2020-09-09 at 09:12 -0700, Eric Biggers wrote:
> On Wed, Sep 09, 2020 at 06:47:28AM -0400, Jeff Layton wrote:
> > On Tue, 2020-09-08 at 15:31 -0700, Eric Biggers wrote:
> > > On Tue, Sep 08, 2020 at 07:27:58AM -0400, Jeff Layton wrote:
> > > > On Mon, 2020-09-07 at 20:38 -0700, Eric Biggers wrote:
> > > > > On Fri, Sep 04, 2020 at 12:05:20PM -0400, Jeff Layton wrote:
> > > > > > Ceph needs to be able to allocate inodes ahead of a create that might
> > > > > > involve a fscrypt-encrypted inode. new_inode() almost fits the bill,
> > > > > > but it puts the inode on the sb->s_inodes list, and we don't want to
> > > > > > do that until we're ready to insert it into the hash.
> > > > > > 
> > > > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > > > > ---
> > > > > >  fs/inode.c | 1 +
> > > > > >  1 file changed, 1 insertion(+)
> > > > > > 
> > > > > > diff --git a/fs/inode.c b/fs/inode.c
> > > > > > index 72c4c347afb7..61554c2477ab 100644
> > > > > > --- a/fs/inode.c
> > > > > > +++ b/fs/inode.c
> > > > > > @@ -935,6 +935,7 @@ struct inode *new_inode_pseudo(struct super_block *sb)
> > > > > >  	}
> > > > > >  	return inode;
> > > > > >  }
> > > > > > +EXPORT_SYMBOL(new_inode_pseudo);
> > > > > >  
> > > > > 
> > > > > What's the problem with putting the new inode on sb->s_inodes already?
> > > > > That's what all the other filesystems do.
> > > > > 
> > > > 
> > > > The existing ones are all local filesystems that use
> > > > insert_inode_locked() and similar paths. Ceph needs to use the '5'
> > > > variants of those functions (inode_insert5(), iget5_locked(), etc.).
> > > > 
> > > > When we go to insert it into the hash in inode_insert5(), we'd need to
> > > > set I_CREATING if allocated from new_inode(). But, if you do _that_,
> > > > then you'll get back ESTALE from find_inode() if (eg.) someone calls
> > > > iget5_locked() before you can clear I_CREATING.
> > > > 
> > > > Hitting that race is easy with an asynchronous create. The simplest
> > > > scheme to avoid that is to just export new_inode_pseudo and keep it off
> > > > of s_inodes until we're ready to do the insert. The only real issue here
> > > > is that this inode won't be findable by evict_inodes during umount, but
> > > > that shouldn't be happening during an active syscall anyway.
> > > 
> > > Is your concern the following scenario?
> > > 
> > > 1. ceph successfully created a new file on the server
> > > 2. inode_insert5() is called for the new file's inode
> > > 3. error occurs in ceph_fill_inode()
> > > 4. discard_new_inode() is called
> > > 5. another thread looks up the inode and gets ESTALE
> > > 6. iput() is finally called
> > > 
> > > And the claim is that the ESTALE in (5) is unexpected?  I'm not sure that it's
> > > unexpected, given that the file allegedly failed to be created...  Also, it
> > > seems that maybe (3) isn't something that should actually happen, since after
> > > (1) it's too late to fail the file creation.
> > > 
> > 
> > No, more like:
> > 
> > Syscall					Workqueue
> > ------------------------------------------------------------------------------
> > 1. allocate an inode
> > 2. determine we can do an async create
> >    and allocate an inode number for it
> > 3. hash the inode (must set I_CREATING
> >    if we allocated with new_inode()) 
> > 4. issue the call to the MDS
> > 5. finish filling out the inode()
> > 6.					MDS reply comes in, and workqueue thread
> > 					looks up new inode (-ESTALE)
> > 7. unlock_new_inode()
> > 
> > 
> > Because 6 happens before 7 in this case, we get an ESTALE on that
> > lookup.
> 
> How is ESTALE at (6) possible?  (3) will set I_NEW on the inode when inserting
> it into the inode hash table.  Then (6) will wait for I_NEW to be cleared before
> returning the inode.  (7) will clear I_NEW and I_CREATING together.
> 

Long call chain, but:

ceph_fill_trace
   ceph_get_inode
      iget5_locked
         ilookup5(..._nowait, etc)
            find_inode_fast


...and find_inode_fast does:

                if (unlikely(inode->i_state & I_CREATING)) {                                        
                        spin_unlock(&inode->i_lock);                                                
                        return ERR_PTR(-ESTALE);                                                    
                }
Eric Biggers Sept. 9, 2020, 6:49 p.m. UTC | #7
[+Cc Al]

On Wed, Sep 09, 2020 at 12:51:02PM -0400, Jeff Layton wrote:
> > > No, more like:
> > > 
> > > Syscall					Workqueue
> > > ------------------------------------------------------------------------------
> > > 1. allocate an inode
> > > 2. determine we can do an async create
> > >    and allocate an inode number for it
> > > 3. hash the inode (must set I_CREATING
> > >    if we allocated with new_inode()) 
> > > 4. issue the call to the MDS
> > > 5. finish filling out the inode()
> > > 6.					MDS reply comes in, and workqueue thread
> > > 					looks up new inode (-ESTALE)
> > > 7. unlock_new_inode()
> > > 
> > > 
> > > Because 6 happens before 7 in this case, we get an ESTALE on that
> > > lookup.
> > 
> > How is ESTALE at (6) possible?  (3) will set I_NEW on the inode when inserting
> > it into the inode hash table.  Then (6) will wait for I_NEW to be cleared before
> > returning the inode.  (7) will clear I_NEW and I_CREATING together.
> > 
> 
> Long call chain, but:
> 
> ceph_fill_trace
>    ceph_get_inode
>       iget5_locked
>          ilookup5(..._nowait, etc)
>             find_inode_fast
> 
> 
> ...and find_inode_fast does:
> 
>                 if (unlikely(inode->i_state & I_CREATING)) {                                        
>                         spin_unlock(&inode->i_lock);                                                
>                         return ERR_PTR(-ESTALE);                                                    
>                 }                                                                                   

Why does ilookup5() not wait for I_NEW to be cleared if the inode has
I_NEW|I_CREATING, but it does wait for I_NEW to be cleared if I_NEW is set its
own?  That seems like a bug.

- Eric
Jeff Layton Sept. 9, 2020, 7:24 p.m. UTC | #8
On Wed, 2020-09-09 at 11:49 -0700, Eric Biggers wrote:
> [+Cc Al]
> 
> On Wed, Sep 09, 2020 at 12:51:02PM -0400, Jeff Layton wrote:
> > > > No, more like:
> > > > 
> > > > Syscall					Workqueue
> > > > ------------------------------------------------------------------------------
> > > > 1. allocate an inode
> > > > 2. determine we can do an async create
> > > >    and allocate an inode number for it
> > > > 3. hash the inode (must set I_CREATING
> > > >    if we allocated with new_inode()) 
> > > > 4. issue the call to the MDS
> > > > 5. finish filling out the inode()
> > > > 6.					MDS reply comes in, and workqueue thread
> > > > 					looks up new inode (-ESTALE)
> > > > 7. unlock_new_inode()
> > > > 
> > > > 
> > > > Because 6 happens before 7 in this case, we get an ESTALE on that
> > > > lookup.
> > > 
> > > How is ESTALE at (6) possible?  (3) will set I_NEW on the inode when inserting
> > > it into the inode hash table.  Then (6) will wait for I_NEW to be cleared before
> > > returning the inode.  (7) will clear I_NEW and I_CREATING together.
> > > 
> > 
> > Long call chain, but:
> > 
> > ceph_fill_trace
> >    ceph_get_inode
> >       iget5_locked
> >          ilookup5(..._nowait, etc)
> >             find_inode_fast
> > 
> > 
> > ...and find_inode_fast does:
> > 
> >                 if (unlikely(inode->i_state & I_CREATING)) {                                        
> >                         spin_unlock(&inode->i_lock);                                                
> >                         return ERR_PTR(-ESTALE);                                                    
> >                 }                                                                                   
> 
> Why does ilookup5() not wait for I_NEW to be cleared if the inode has
> I_NEW|I_CREATING, but it does wait for I_NEW to be cleared if I_NEW is set its
> own?  That seems like a bug.
> 

Funny, I asked Al the same thing on IRC the other day:

23:28 < jlayton> viro: wondering if there is a potential race with I_CREATING in find_inode. 
                 Seems like you could have 2 tasks racing in calls to iget5_locked for the 
                 same inode. One creates an inode and starts instantiating it, and the second 
                 one gets NULL back because I_CREATING is set.
23:30 < viro> jlayton: where would I_CREATING come from?
23:30 < viro> it's set on insert_inode_locked() and similar paths
23:31 < viro> where you want iget5_locked() to fuck off and eat ESTALE
23:31 < jlayton> ok, right -- I was trying to pass an inode from new_inode to inode_insert5
23:32 < viro> seeing that it's been asked for an inode number that did _not_ exist until just 
              now (we'd just allocated it)

The assumption is that we'll never go looking for an inode until after
I_NEW is cleared. In the case of an asynchronous create in ceph though,
we may do exactly that if the reply comes back very quickly.
diff mbox series

Patch

diff --git a/fs/inode.c b/fs/inode.c
index 72c4c347afb7..61554c2477ab 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -935,6 +935,7 @@  struct inode *new_inode_pseudo(struct super_block *sb)
 	}
 	return inode;
 }
+EXPORT_SYMBOL(new_inode_pseudo);
 
 /**
  *	new_inode 	- obtain an inode