[GIT,PULL] Ceph fixes for 5.1-rc7

Message ID 20190425174739.27604-1-idryomov@gmail.com

Pull-request

https://github.com/ceph/ceph-client.git tags/ceph-for-5.1-rc7

Message

Ilya Dryomov April 25, 2019, 5:47 p.m. UTC
Hi Linus,

The following changes since commit 085b7755808aa11f78ab9377257e1dad2e6fa4bb:

  Linux 5.1-rc6 (2019-04-21 10:45:57 -0700)

are available in the Git repository at:

  https://github.com/ceph/ceph-client.git tags/ceph-for-5.1-rc7

for you to fetch changes up to 37659182bff1eeaaeadcfc8f853c6d2b6dbc3f47:

  ceph: fix ci->i_head_snapc leak (2019-04-23 21:37:54 +0200)

----------------------------------------------------------------
dentry name handling fixes from Jeff and a memory leak fix from Zheng.
Both are old issues, marked for stable.

----------------------------------------------------------------
Jeff Layton (3):
      ceph: only use d_name directly when parent is locked
      ceph: ensure d_name stability in ceph_dentry_hash()
      ceph: handle the case where a dentry has been renamed on outstanding req

Yan, Zheng (1):
      ceph: fix ci->i_head_snapc leak

 fs/ceph/dir.c        |  6 ++++-
 fs/ceph/inode.c      | 16 +++++++++++-
 fs/ceph/mds_client.c | 70 +++++++++++++++++++++++++++++++++++++++++++---------
 fs/ceph/snap.c       |  7 +++++-
 4 files changed, 85 insertions(+), 14 deletions(-)

Comments

Linus Torvalds April 25, 2019, 6:02 p.m. UTC | #1
On Thu, Apr 25, 2019 at 10:48 AM Ilya Dryomov <idryomov@gmail.com> wrote:
>
> dentry name handling fixes from Jeff and a memory leak fix from Zheng.
> Both are old issues, marked for stable.

Hmm. You probably should have talked to Al about the dentry name
issue, because he'd most likely have pointed you towards our helper
function for exactly this thing:

    struct name_snapshot stable;

    take_dentry_name_snapshot(&stable, dentry);
    ... use stable.name ..
    release_dentry_name_snapshot(&stable);

which doesn't need any extra memory allocation outside of some fairly
limited stack allocation for the 'name_snapshot' itself, because it
knows about the dentry name rules, and

 - for inline names, it copies it under the d_lock into the fixed
DNAME_INLINE_LEN-sized buffer

 - for out-of-line names, it knows that the name allocation is stable
and ref-counted, and just increments the refcount and uses the
existing name pointer.

now, maybe you need to always do that name allocation anyway (looking
at the diff it looks like you often do that for other cases), so maybe
the name snapshot capability isn't all that useful for you and the
above wouldn't have helped, but I suspect you might not even have
realized that there was an option like this.

I've pulled this, but maybe Jeff wants to look at whether that
snapshotting model could have helped.

                       Linus
Al Viro April 25, 2019, 6:21 p.m. UTC | #2
On Thu, Apr 25, 2019 at 11:02:54AM -0700, Linus Torvalds wrote:

> I've pulled this, but maybe Jeff wants to look at whether that
> snapshotting model could have helped.

I really wonder if 76a495d666e5 (ceph: ensure d_name stability in
ceph_dentry_hash()) makes any sense; OK, you have ->d_lock held
over that, but what does it protect against?  Sure, you'll get
something that was valid while you held ->d_lock, but what good
does it do to the callers?  If they really have to care about
races with d_move(), which value do they want?
Jeff Layton April 25, 2019, 6:23 p.m. UTC | #3
On Thu, 2019-04-25 at 11:02 -0700, Linus Torvalds wrote:
> On Thu, Apr 25, 2019 at 10:48 AM Ilya Dryomov <idryomov@gmail.com> wrote:
> > dentry name handling fixes from Jeff and a memory leak fix from Zheng.
> > Both are old issues, marked for stable.
> 
> Hmm. You probably should have talked to Al about the dentry name
> issue, because he'd most likely have pointed you towards our helper
> function for exactly this thing:
> 
>     struct name_snapshot stable;
> 
>     take_dentry_name_snapshot(&stable, dentry);
>     ... use stable.name ..
>     release_dentry_name_snapshot(&stable);
> 
> which doesn't need any extra memory allocation outside of some fairly
> limited stack allocation for the 'name_snapshot' itself, because it
> knows about the dentry name rules, and
> 
>  - for inline names, it copies it under the d_lock into the fixed
> DNAME_INLINE_LEN-sized buffer
> 
>  - for out-of-line names, it knows that the name allocation is stable
> and ref-counted, and just increments the refcount and uses the
> existing name pointer.
> 
> now, maybe you need to always do that name allocation anyway (looking
> at the diff it looks like you often do that for other cases), so maybe
> the name snapshot capability isn't all that useful for you and the
> above wouldn't have helped, but I suspect you might not even have
> realized that there was an option like this.
> 
> I've pulled this, but maybe Jeff wants to look at whether that
> snapshotting model could have helped.
> 
>                        Linus

Thanks for the info! I think it would have.

I took a quick look at the dcache code to see if we had something like
that before I did this, but I guess I didn't look closely enough. Those
routines do look nicer than my hand-rolled version.

I'll look at spinning up a patch to switch that over in the near future.

Thanks,
Linus Torvalds April 25, 2019, 6:24 p.m. UTC | #4
On Thu, Apr 25, 2019 at 11:21 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> I really wonder if 76a495d666e5 (ceph: ensure d_name stability in
> ceph_dentry_hash()) makes any sense; OK, you have ->d_lock held
> over that, but what does it protect against?  Sure, you'll get
> something that was valid while you held ->d_lock, but what good
> does it do to the callers?  If they really have to care about
> races with d_move(), which value do they want?

I suspect they only care that the data they gather for the network
packet is coherent, and passes internal sanity tests. You get either
old or new, but at least you don't get "not NUL terminated because the
length didn't match the data".

                    Linus
Al Viro April 25, 2019, 6:31 p.m. UTC | #5
On Thu, Apr 25, 2019 at 11:24:53AM -0700, Linus Torvalds wrote:

> I suspect they only care that the data they gather for the network
> packet is coherent, and passes internal sanity tests. You get either
> old or new, but at least you don't get "not NUL terminated because the
> length didn't match the data".

I'm not familiar enough with the protocol to tell if it's actually
OK; if anything, it's a question to ceph folks...
pr-tracker-bot@kernel.org April 25, 2019, 6:35 p.m. UTC | #6
The pull request you sent on Thu, 25 Apr 2019 19:47:39 +0200:

> https://github.com/ceph/ceph-client.git tags/ceph-for-5.1-rc7

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/8113a85f872003a9f5c58f9f143054b0d8ec73a5

Thank you!
Jeff Layton April 25, 2019, 6:36 p.m. UTC | #7
On Thu, 2019-04-25 at 11:24 -0700, Linus Torvalds wrote:
> On Thu, Apr 25, 2019 at 11:21 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > I really wonder if 76a495d666e5 (ceph: ensure d_name stability in
> > ceph_dentry_hash()) makes any sense; OK, you have ->d_lock held
> > over that, but what does it protect against?  Sure, you'll get
> > something that was valid while you held ->d_lock, but what good
> > does it do to the callers?  If they really have to care about
> > races with d_move(), which value do they want?
> 
> I suspect they only care that the data they gather for the network
> packet is coherent, and passes internal sanity tests. You get either
> old or new, but at least you don't get "not NUL terminated because the
> length didn't match the data".
> 
>                     Linus

Right. My main concern was preventing oopses while iterating over
d_name.

The main use for this hash is to help determine to which MDS we should
direct the request. My understanding is that if we get it wrong, we'll
end up retrying the request and pick a different MDS, so it's not fatal
if we do race in this fashion.
Al Viro April 25, 2019, 8:09 p.m. UTC | #8
On Thu, Apr 25, 2019 at 02:23:59PM -0400, Jeff Layton wrote:

> I took a quick look at the dcache code to see if we had something like
> that before I did this, but I guess I didn't look closely enough. Those
> routines do look nicer than my hand-rolled version.
> 
> I'll look at spinning up a patch to switch that over in the near future.

Jeff asked to put the name length in there; looking at the existing users,
it does make sense.  I propose turning struct name_snapshot into

struct name_snapshot {
        struct qstr name;
        unsigned char inline_name[DNAME_INLINE_LEN];
};

with

void take_dentry_name_snapshot(struct name_snapshot *name, struct dentry *dentry)
{
        spin_lock(&dentry->d_lock);
        name->name = dentry->d_name;
        if (unlikely(dname_external(dentry))) {
                struct external_name *p = external_name(dentry);
                atomic_inc(&p->u.count);
        } else {
                memcpy(name->inline_name, dentry->d_iname,
                       dentry->d_name.len + 1);
                name->name.name = name->inline_name;
        }
        spin_unlock(&dentry->d_lock);
}

and callers adjusted, initially simply by turning snapshot.name into
snapshot.name.name.  Next step: overlayfs call site becoming
        take_dentry_name_snapshot(&name, real);
        this = lookup_one_len(name.name.name, connected, name.name.len);
Next: fsnotify stuff switched to passing struct qstr * - fsnotify_move()
first, then fsnotify().  That one would
	* leave callers passing NULL alone
	* have the callers passing snapshot.name.name pass &snapshot.name
	* fsnotify_dirent() pass the entire &dentry->d_name, not just
dentry->d_name.name (that's dependent upon parent being held exclusive;
devpts plays fast and loose here, relying upon the lack of any kind of
renames, explicit or implicit, in that fs)
	* ditto for remaining call in fsnotify_move() (both parents
are locked in all callers, thankfully)
	* ditto for fsnotify_link()
	* kernfs callers in kernfs_notify_workfn() would grow strlen().
Next: send_to_group() and ->handle_event() instances switched to passing
struct qstr *.
Next: inotify_handle_event() doesn't need to bother with strlen().
Next: audit_update_watch() and audit_compare_dname_path() switched to
struct qstr *.  Callers in __audit_inode_child() pass the entire
&dentry->d_name.  strlen() inside audit_compare_dname_path() goes away.

Objections?
Jeff Layton April 26, 2019, 4:25 p.m. UTC | #9
On Thu, 2019-04-25 at 21:09 +0100, Al Viro wrote:
> On Thu, Apr 25, 2019 at 02:23:59PM -0400, Jeff Layton wrote:
> 
> > I took a quick look at the dcache code to see if we had something like
> > that before I did this, but I guess I didn't look closely enough. Those
> > routines do look nicer than my hand-rolled version.
> > 
> > I'll look at spinning up a patch to switch that over in the near future.
> 
> Jeff asked to put the name length in there; looking at the existing users,
> it does make sense.  I propose turning struct name_snapshot into
> 
> struct name_snapshot {
>         struct qstr name;
>         unsigned char inline_name[DNAME_INLINE_LEN];
> };
> 
> with
> void take_dentry_name_snapshot(struct name_snapshot *name, struct dentry *dentry)
> {
>         spin_lock(&dentry->d_lock);
> 	name->name = dentry->d_name;
>         if (unlikely(dname_external(dentry))) {
>                 struct external_name *p = external_name(dentry);
>                 atomic_inc(&p->u.count);
>         } else {
>                 memcpy(name->inline_name, dentry->d_iname,
>                        dentry->d_name.len + 1);
>                 name->name.name = name->inline_name;
>         }
> 	spin_unlock(&dentry->d_lock);
> }
> 
> and callers adjusted, initially simply by turning snapshot.name into
> snapshot.name.name.  Next step: overlayfs call site becoming
>         take_dentry_name_snapshot(&name, real);
>         this = lookup_one_len(name.name.name, connected, name.name.len);
> Next: snotify stuff switched to passing struct qstr * - fsnotify_move()
> first, then fsnotify().  That one would
> 	* leave callers passing NULL alone
> 	* have the callers passing snapshot.name.name pass &snapshot.name
> 	* fsnotify_dirent() pass the entire &dentry->d_name, not just
> dentry->d_name.name (that's dependent upon parent being held exclusive;
> devpts plays fast and loose here, relying upon the lack of any kind of
> renames, explicit or implicit, in that fs)
> 	* ditto for remaining call in fsnotify_move() (both parents
> are locked in all callers, thankfully)
> 	* ditto for fsnotify_link()
> 	* kernfs callers in kernfs_notify_workfn() would grow strlen().
> Next: send_to_group() and ->handle_event() instances switched to passing
> struct qstr *.
> Next: inotify_handle_event() doesn't need to bother with strlen().
> Next: audit_update_watch() and audit_compare_dname_path() switched to
> struct qstr *.  Callers in __audit_inode_child() pass the entire
> &dentry->d_name.  strlen() inside audit_compare_dname_path() goes away.
> 
> Objections?

I have some patches that do what you lay out above. They build, but I
haven't run them through much testing yet.

It turns out though that using name_snapshot from ceph is a bit more
tricky. In some cases, we have to call ceph_mdsc_build_path to build up
a full path string. We can't easily populate a name_snapshot from there
because struct external_name is only defined in fs/dcache.c.

I could add some routines to do this, but it feels a lot like I'm
abusing internal dcache interfaces. I'll keep thinking about it though.

While we're on the subject though:

struct external_name {
        union {
                atomic_t count;
                struct rcu_head head;
        } u;
        unsigned char name[];
};

Is it really ok to union the count and rcu_head there?

I haven't trawled through all of the code yet, but what prevents someone
from trying to access the count inside an RCU critical section, after
call_rcu has been called on it?

Thanks,
Linus Torvalds April 26, 2019, 4:36 p.m. UTC | #10
On Fri, Apr 26, 2019 at 9:25 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> Is it really ok to union the count and rcu_head there?

It should be fine, because the rcu_count should only ever be used once
the count has gone to zero and the name cannot be found any more.

And while RCU path walking may find and use the *name* after the
dentry has been killed off (but not free'd yet), all the actual
external_name() accesses should be serialized by the dentry lock, so
there's no access to those fields once the dentry is dead.

At least that's how it's supposed to work. Al would be the final
arbiter on this.

             Linus
Linus Torvalds April 26, 2019, 4:43 p.m. UTC | #11
On Fri, Apr 26, 2019 at 9:36 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Fri, Apr 26, 2019 at 9:25 AM Jeff Layton <jlayton@kernel.org> wrote:
> >
> > Is it really ok to union the count and rcu_head there?
>
> It should be fine, because the rcu_count should only ever be used once
                                     ^^^^^

that should be 'head' of course.

              Linus "duh" Torvalds
Al Viro April 26, 2019, 4:50 p.m. UTC | #12
On Fri, Apr 26, 2019 at 12:25:03PM -0400, Jeff Layton wrote:

> It turns out though that using name_snapshot from ceph is a bit more
> tricky. In some cases, we have to call ceph_mdsc_build_path to build up
> a full path string. We can't easily populate a name_snapshot from there
> because struct external_name is only defined in fs/dcache.c.

Explain, please.  For ceph_mdsc_build_path() you don't need name
snapshots at all and existing code is, AFAICS, just fine, except
for pointless pr_err() there.

I _probably_ would take allocation out of the loop (e.g. make it
__getname(), called unconditionally) and turned it into the
d_path.c-style read_seqbegin_or_lock()/need_seqretry()/done_seqretry()
loop, so that the first pass would go under rcu_read_lock(), while
the second (if needed) would just hold rename_lock exclusive (without
bumping the refcount).  But that's a matter of (theoretical) livelock
avoidance, not the locking correctness for ->d_name accesses.

Oh, and
        *base = ceph_ino(d_inode(temp));
        *plen = len;
probably belongs in critical section - _that_ might be a correctness
issue, since temp is not held by anything once you are out of there.

> I could add some routines to do this, but it feels a lot like I'm
> abusing internal dcache interfaces. I'll keep thinking about it though.
> 
> While we're on the subject though:
> 
> struct external_name {
>         union {
>                 atomic_t count;
>                 struct rcu_head head;
>         } u;
>         unsigned char name[];
> };
> 
> Is it really ok to union the count and rcu_head there?
> 
> I haven't trawled through all of the code yet, but what prevents someone
> from trying to access the count inside an RCU critical section, after
> call_rcu has been called on it?

The fact that no lockless accesses to ->count are ever done?
Al Viro April 26, 2019, 5:01 p.m. UTC | #13
On Fri, Apr 26, 2019 at 09:36:00AM -0700, Linus Torvalds wrote:
> On Fri, Apr 26, 2019 at 9:25 AM Jeff Layton <jlayton@kernel.org> wrote:
> >
> > Is it really ok to union the count and rcu_head there?
> 
> It should be fine, because the rcu_count should only ever be used once
> the count has gone to zero and the name cannot be found any more.
> 
> And while RCU path walking may find and use the *name* after the
> dentry has been killed off (but not free'd yet), all the actual
> external_name() accesses should be serialized by the dentry lock, so
> there's no access to those fields once the dentry is dead.

It's not quite that; access to external_name contents is fine,
->d_lock or not.  __d_lookup_rcu() does read it under rcu_read_lock
alone.

However:
	* we never free it without an RCU delay after the final
drop of refcount.  RCU delay might happen on dentry->d_rcu (if
it's dentry_free()) or on name->p.rcu (if it's release_dentry_name_snapshot()
or d_move() dropping the final reference).
	* it's never observed in ->d_name after the refcount
reaches zero.
	* no lockless access ever looks at the refcount.  It
can look at ->name[], but that's it.

What I don't understand is why would anyone want to mess with
name snapshots for dentry_path() lookalikes...
Linus Torvalds April 26, 2019, 5:08 p.m. UTC | #14
On Fri, Apr 26, 2019 at 10:01 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> What I don't understand is why would anyone want to mess with
> name snapshots for dentry_path() lookalikes...

Talking about dentry_path()... Shouldn't the ceph path walking code
also check the mount_lock sequence to make sure the path is actually
stable?

Maybe it doesn't matter.

                  Linus
Al Viro April 26, 2019, 5:11 p.m. UTC | #15
On Fri, Apr 26, 2019 at 10:08:48AM -0700, Linus Torvalds wrote:
> On Fri, Apr 26, 2019 at 10:01 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > What I don't understand is why would anyone want to mess with
> > name snapshots for dentry_path() lookalikes...
> 
> Talking about dentry_path()... Shouldn't the ceph path walking code
> also check the mount_lock sequence to make sure the path is actually
> stable?
> 
> Maybe it doesn't matter.

They want it relative to fs root - after all, server doesn't know or
care what's mounted where on client...
Jeff Layton April 26, 2019, 5:30 p.m. UTC | #16
On Fri, 2019-04-26 at 17:50 +0100, Al Viro wrote:
> On Fri, Apr 26, 2019 at 12:25:03PM -0400, Jeff Layton wrote:
> 
> > It turns out though that using name_snapshot from ceph is a bit more
> > tricky. In some cases, we have to call ceph_mdsc_build_path to build up
> > a full path string. We can't easily populate a name_snapshot from there
> > because struct external_name is only defined in fs/dcache.c.
> 
> Explain, please.  For ceph_mdsc_build_path() you don't need name
> snapshots at all and existing code is, AFAICS, just fine, except
> for pointless pr_err() there.
> 

Eventually we have to pass back the result of all the
build_dentry_path() shenanigans to create_request_message(), and then
free whatever that result is after using it.

Today we can get back a string+length from ceph_mdsc_build_path or
clone_dentry_name, or we might get direct pointers into the dentry if
the situation allows for it.

Now we want to rip out clone_dentry_name() and start using
take_dentry_name_snapshot(). That returns a name_snapshot that we'll
need to pass back to create_request_message. It will need to deal with
the fact that it could get one of those instead of just a string+length.

My original thought was to always pass back a name_snapshot, but that
turns out to be difficult because its innards are not public. The other
potential solutions that I've tried make this code look even worse than
it already is.


> I _probably_ would take allocation out of the loop (e.g. make it
> __getname(), called unconditionally) and turned it into the
> d_path.c-style read_seqbegin_or_lock()/need_seqretry()/done_seqretry()
> loop, so that the first pass would go under rcu_read_lock(), while
> the second (if needed) would just hold rename_lock exclusive (without
> bumping the refcount).  But that's a matter of (theoretical) livelock
> avoidance, not the locking correctness for ->d_name accesses.
> 

Yeah, that does sound better. I want to think about this code a bit

> Oh, and
>         *base = ceph_ino(d_inode(temp));
>         *plen = len;
> probably belongs in critical section - _that_ might be a correctness
> issue, since temp is not held by anything once you are out of there.
> 

Good catch. I'll fix that up.

> > I could add some routines to do this, but it feels a lot like I'm
> > abusing internal dcache interfaces. I'll keep thinking about it though.
> > 
> > While we're on the subject though:
> > 
> > struct external_name {
> >         union {
> >                 atomic_t count;
> >                 struct rcu_head head;
> >         } u;
> >         unsigned char name[];
> > };
> > 
> > Is it really ok to union the count and rcu_head there?
> > 
> > I haven't trawled through all of the code yet, but what prevents someone
> > from trying to access the count inside an RCU critical section, after
> > call_rcu has been called on it?
> 
> The fact that no lockless accesses to ->count are ever done?

Thanks,
Jeff Layton April 26, 2019, 8:49 p.m. UTC | #17
On Fri, 2019-04-26 at 18:01 +0100, Al Viro wrote:
> On Fri, Apr 26, 2019 at 09:36:00AM -0700, Linus Torvalds wrote:
> > On Fri, Apr 26, 2019 at 9:25 AM Jeff Layton <jlayton@kernel.org> wrote:
> > > Is it really ok to union the count and rcu_head there?
> > 
> > It should be fine, because the rcu_count should only ever be used once
> > the count has gone to zero and the name cannot be found any more.
> > 
> > And while RCU path walking may find and use the *name* after the
> > dentry has been killed off (but not free'd yet), all the actual
> > external_name() accesses should be serialized by the dentry lock, so
> > there's no access to those fields once the dentry is dead.
> 
> It's not quite that; access to external_name contents is fine,
> ->d_lock or not.  __d_lookup_rcu() does read it under rcu_read_lock
> alone.
> 
> However:
> 	* we never free it without an RCU delay after the final
> drop of refcount.  RCU delay might happen on dentry->d_rcu (if
> it's dentry_free()) or on name->p.rcu (if it's release_dentry_name_snapshot()
> or d_move() dropping the final reference).
> 	* it's never observed in ->d_name after the refcount
> reaches zero.
> 	* no lockless access ever looks at the refcount.  It
> can look at ->name[], but that's it.
> 

Got it, thanks. Why use an atomic_t for the refcount if it's always
accessed under spinlock?

> What I don't understand is why would anyone want to mess with
> name snapshots for dentry_path() lookalikes...

Mostly because the place where the ceph code needs to use and free these
strings is rather far removed from where they are created. It simplifies
that part if we can access and free them all in the same way.

I was planning to just use name_snapshots universally for that purpose,
but they have a rather specific method of freeing things that is hard to
duplicate if you don't have a dentry to clone.

Probably that means I'm approaching this problem in the wrong way and
need to do it differently.
Al Viro April 26, 2019, 9:28 p.m. UTC | #18
On Fri, Apr 26, 2019 at 04:49:24PM -0400, Jeff Layton wrote:

> Got it, thanks. Why use an atomic_t for the refcount if it's always
> accessed under spinlock?

_Which_ spinlock?  The whole reason for refcounts is that many dentries
might end up with shared external name.  So ->d_lock on one of them
won't do anything to accesses via another...

> > What I don't understand is why would anyone want to mess with
> > name snapshots for dentry_path() lookalikes...
> 
> Mostly because the place where the ceph code needs to use and free these
> strings is rather far removed from where they are created. It simplifies
> that part if we can access and free them all in the same way.
> 
> I was planning to just use name_snapshots universally for that purpose,
> but they have a rather specific method of freeing things that is hard to
> duplicate if you don't have a dentry to clone.
> 
> Probably that means I'm approaching this problem in the wrong way and
> need to do it differently.

For short names snapshot will put the copy into your variable, normally
on the stack frame.  So it doesn't really help if you want to keep that
stuff around.  Again, dentries with short names have those stored inside
the dentry and for anything long-term you will need to copy that data;
otherwise rename() will happily change those under you.

If you want to deal with pathnames, do __getname() + a loop similar to
that in dentry_path() et.al.  All there is to it.  Snapshots are
for situations when you want a reasonably short-term access to
name of specific dentry you have a reference to and do not want
to do any allocations.
Al Viro April 28, 2019, 4:38 a.m. UTC | #19
On Fri, Apr 26, 2019 at 01:30:53PM -0400, Jeff Layton wrote:

> > I _probably_ would take allocation out of the loop (e.g. make it
> > __getname(), called unconditionally) and turned it into the
> > d_path.c-style read_seqbegin_or_lock()/need_seqretry()/done_seqretry()
> > loop, so that the first pass would go under rcu_read_lock(), while
> > the second (if needed) would just hold rename_lock exclusive (without
> > bumping the refcount).  But that's a matter of (theoretical) livelock
> > avoidance, not the locking correctness for ->d_name accesses.
> > 
> 
> Yeah, that does sound better. I want to think about this code a bit

FWIW, is there any reason to insist that the pathname is put into the
beginning of the buffer?  I mean, instead of path + pathlen we might
return path + offset, with the pathname going from path + offset to
path + PATH_MAX - 1 inclusive, with path being the thing eventually
freed.

It's easier to build the string backwards, seeing that we are walking
from leaf to root...
Jeff Layton April 28, 2019, 1:27 p.m. UTC | #20
On Sun, 2019-04-28 at 05:38 +0100, Al Viro wrote:
> On Fri, Apr 26, 2019 at 01:30:53PM -0400, Jeff Layton wrote:
> 
> > > I _probably_ would take allocation out of the loop (e.g. make it
> > > __getname(), called unconditionally) and turned it into the
> > > d_path.c-style read_seqbegin_or_lock()/need_seqretry()/done_seqretry()
> > > loop, so that the first pass would go under rcu_read_lock(), while
> > > the second (if needed) would just hold rename_lock exclusive (without
> > > bumping the refcount).  But that's a matter of (theoretical) livelock
> > > avoidance, not the locking correctness for ->d_name accesses.
> > > 
> > 
> > Yeah, that does sound better. I want to think about this code a bit
> 
> FWIW, is there any reason to insist that the pathname is put into the
> beginning of the buffer?  I mean, instead of path + pathlen we might
> return path + offset, with the pathname going from path + offset to
> path + PATH_MAX - 1 inclusive, with path being the thing eventually
> freed.
> 
> It's easier to build the string backwards, seeing that we are walking
> from leaf to root...

(cc'ing linux-cifs)

I don't see a problem doing what you suggest. An offset + fixed length
buffer would be fine there.

Is there a real benefit to using __getname though? It sucks when we have
to reallocate but I doubt that it happens with any frequency. Most of
these paths will end up being much shorter than PATH_MAX and that slims
down the memory footprint a bit.

Also, FWIW -- this code was originally copied from cifs'
build_path_from_dentry(). Should we aim to put something in common
infrastructure that both can call?

There are some significant logic differences in the two functions though
so we might need some sort of callback function or something to know
when to stop walking.
Al Viro April 28, 2019, 2:48 p.m. UTC | #21
On Sun, Apr 28, 2019 at 09:27:20AM -0400, Jeff Layton wrote:

> I don't see a problem doing what you suggest. An offset + fixed length
> buffer would be fine there.
> 
> Is there a real benefit to using __getname though? It sucks when we have
> to reallocate but I doubt that it happens with any frequency. Most of
> these paths will end up being much shorter than PATH_MAX and that slims
> down the memory footprint a bit.

AFAICS, they are all short-lived; don't forget that slabs have cache,
so in that situation allocations are cheap.

> Also, FWIW -- this code was originally copied from cifs'
> build_path_from_dentry(). Should we aim to put something in common
> infrastructure that both can call?
> 
> There are some significant logic differences in the two functions though
> so we might need some sort of callback function or something to know
> when to stop walking.

Not if you want it fast...  Indirect calls are not cheap; the cost of
those callbacks would be considerable.  Besides, you want more than
"where do I stop", right?  It's also "what output do I use for this
dentry", both for you and for cifs (there it's "which separator to use",
in ceph it's "these we want represented as //")...

Can it be called on detached subtree, during e.g. open_by_handle()?
There it can get really fishy; you end up with base being at the
random point on the way towards root.  How does that work, and if
it *does* work, why do we need the whole path in the first place?

BTW, for cifs there's no need to play with ->d_lock as we go.  For
ceph, the only need comes from looking at d_inode(), and I wonder if
it would be better to duplicate that information ("is that a
snapdir/nosnap") into the dentry itself - would certainly be cheaper.
OTOH, we are getting short on spare bits in ->d_flags...
Jeff Layton April 28, 2019, 3:47 p.m. UTC | #22
On Sun, 2019-04-28 at 15:48 +0100, Al Viro wrote:
> On Sun, Apr 28, 2019 at 09:27:20AM -0400, Jeff Layton wrote:
> 
> > I don't see a problem doing what you suggest. An offset + fixed length
> > buffer would be fine there.
> > 
> > Is there a real benefit to using __getname though? It sucks when we have
> > to reallocate but I doubt that it happens with any frequency. Most of
> > these paths will end up being much shorter than PATH_MAX and that slims
> > down the memory footprint a bit.
> 
> AFAICS, they are all short-lived; don't forget that slabs have cache,
> so in that situation allocations are cheap.
> 

Fair enough. Al also pointed out on IRC that the __getname/__putname
caches are likely to be hot, so using that may be less costly cpu-wise.

> > Also, FWIW -- this code was originally copied from cifs'
> > build_path_from_dentry(). Should we aim to put something in common
> > infrastructure that both can call?
> > 
> > There are some significant logic differences in the two functions though
> > so we might need some sort of callback function or something to know
> > when to stop walking.
> 
> Not if you want it fast...  Indirect calls are not cheap; the cost of
> those callbacks would be considerable.  Besides, you want more than
> "where do I stop", right?  It's also "what output do I use for this
> dentry", both for you and for cifs (there it's "which separator to use",
> in ceph it's "these we want represented as //")...
> 
> Can it be called on a detached subtree, e.g. during open_by_handle()?
> There it can get really fishy; you end up with the base at a random
> point on the way towards the root.  How does that work, and if
> it *does* work, why do we need the whole path in the first place?
> 

This I'm not sure of. Commit 79b33c8874334e (ceph: snapshot nfs
re-export) explains this a bit, but I'm not sure it really covers this
case.

Zheng/Sage, feel free to correct me here:

My understanding is that for snapshots you need the base inode number,
snapid, and the full path from there to the dentry for a ceph MDS call.

There is a filehandle type for a snapshotted inode:

struct ceph_nfs_snapfh {
        u64 ino;
        u64 snapid;
        u64 parent_ino;
        u32 hash;
} __attribute__ ((packed));

So I guess it is possible. You could do name_to_handle_at for an inode
deep down in a snapshotted tree, and then try to open_by_handle_at after
the dcache gets cleaned out for some other reason.

What I'm not clear on is why we need to build paths at all for
snapshots. Why is a parent inode number (inside the snapshot) + a snapid
+ dentry name not sufficient?

> BTW, for cifs there's no need to play with ->d_lock as we go.  For
> ceph, the only need comes from looking at d_inode(), and I wonder if
> it would be better to duplicate that information ("is that a
> snapdir/nosnap") into the dentry itself - would certainly be cheaper.
> OTOH, we are getting short on spare bits in ->d_flags...

We could stick that in ceph_dentry_info (->d_fsdata). We have a flags
field in there already.
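Something along these lines, perhaps (untested sketch; the flag name and
bit value are made up, and the existing CEPH_DENTRY_* bits in
fs/ceph/super.h would need checking for collisions):

```c
/* fs/ceph/super.h: a new bit alongside the existing CEPH_DENTRY_* flags */
#define CEPH_DENTRY_SNAPDIR_NAME	(1 << 4)	/* hypothetical value */

/*
 * Set it at the point where we currently have to look at d_inode()
 * to make the snapdir/nosnap decision, e.g. when the dentry is
 * instantiated, so the path builder can test di->flags without
 * taking ->d_lock on the way up.
 */
static void ceph_note_snapdir(struct dentry *dentry)
{
	struct ceph_dentry_info *di = ceph_dentry(dentry);

	if (di && d_inode(dentry) &&
	    ceph_snap(d_inode(dentry)) == CEPH_SNAPDIR)
		di->flags |= CEPH_DENTRY_SNAPDIR_NAME;
}
```

That still leaves the lifetime problem you point out below, of course.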
Al Viro April 28, 2019, 3:52 p.m. UTC | #23
On Sun, Apr 28, 2019 at 11:47:58AM -0400, Jeff Layton wrote:

> We could stick that in ceph_dentry_info (->d_fsdata). We have a flags
> field in there already.

Yes, but...  You have it freed in ->d_release(), AFAICS, and without
any delays.  So lockless accesses will be trouble.
Jeff Layton April 28, 2019, 4:18 p.m. UTC | #24
On Sun, 2019-04-28 at 16:52 +0100, Al Viro wrote:
> On Sun, Apr 28, 2019 at 11:47:58AM -0400, Jeff Layton wrote:
> 
> > We could stick that in ceph_dentry_info (->d_fsdata). We have a flags
> > field in there already.
> 
> Yes, but...  You have it freed in ->d_release(), AFAICS, and without
> any delays.  So lockless accesses will be trouble.

That's easy enough to fix -- we could add a rcu_head to it and call_rcu
in ceph_d_release instead of just freeing it immediately.

I guess if we find that d_fsdata is NULL, we can just treat it like we
currently do a negative dentry?
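Roughly like this (untested; field layout abbreviated, everything else
as in the current fs/ceph code):

```c
/* fs/ceph/super.h */
struct ceph_dentry_info {
	/* ... existing fields ... */
	struct rcu_head rcu;	/* new: for delayed freeing */
};

/* fs/ceph/dir.c */
static void ceph_d_free_rcu(struct rcu_head *head)
{
	struct ceph_dentry_info *di =
		container_of(head, struct ceph_dentry_info, rcu);

	kmem_cache_free(ceph_dentry_cachep, di);
}

static void ceph_d_release(struct dentry *dentry)
{
	struct ceph_dentry_info *di = ceph_dentry(dentry);

	/* existing synchronous teardown (lease, lists) stays as-is */

	dentry->d_fsdata = NULL;
	/* was: kmem_cache_free(ceph_dentry_cachep, di); */
	call_rcu(&di->rcu, ceph_d_free_rcu);
}
```

Lockless readers would then fetch d_fsdata once under rcu_read_lock()
and treat NULL as the negative-dentry case.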
Al Viro April 28, 2019, 4:40 p.m. UTC | #25
On Sun, Apr 28, 2019 at 04:52:16PM +0100, Al Viro wrote:
> On Sun, Apr 28, 2019 at 11:47:58AM -0400, Jeff Layton wrote:
> 
> > We could stick that in ceph_dentry_info (->d_fsdata). We have a flags
> > field in there already.
> 
> Yes, but...  You have it freed in ->d_release(), AFAICS, and without
> any delays.  So lockless accesses will be trouble.

You could RCU-delay the actual kmem_cache_free(ceph_dentry_cachep, di)
in there, but I've no idea whether the overhead would be painful -
on massive eviction (e.g. on memory pressure) it might be.  Another
variant is to introduce ->d_free(), to be called from __d_free()
and __d_free_external().  That, however, would need another ->d_flags
bit for presence of that method, so that we don't get extra overhead
from looking into ->d_op...

Looking through ->d_release() instances, we have

afs: empty, might as well have not been there

autofs: does some sync stuff (eviction from ->active_list/->expire_list)
plus kfree_rcu

ceph: some sync stuff + immediate kmem_cache_free()

debugfs: kfree(), might or might not be worth RCU-delaying

ecryptfs: sync stuff (path_put for ->lower) + RCU-delayed part

fuse: kfree_rcu()

nfs: kfree()

overlayfs: a bunch of dput() (obviously sync) + kfree_rcu()

9p: sync

So it actually might make sense to move the RCU-delayed bits to
separate method.  Some ->d_release() instances would be simply
gone, as for the rest...  I wonder which of the sync parts can
be moved over to ->d_prune().  Not guaranteed to be doable
(or a good idea), but...  E.g. for autofs it almost certainly
would be the right place for the sync parts - we are,
essentially, telling the filesystem to forget its private
(non-refcounted) references to the victim.
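FWIW, a rough sketch of what the ->d_free() variant might look like on
the dcache side (the method, the DCACHE_OP_FREE bit and its value are
all invented here, not existing API):

```c
/* include/linux/dcache.h: new method + flag, both hypothetical */
struct dentry_operations {
	/* ... existing methods ... */
	void (*d_free)(struct dentry *);	/* run from the RCU callback */
};
#define DCACHE_OP_FREE	0x08000000	/* some spare ->d_flags bit */

/* fs/dcache.c */
static void __d_free(struct rcu_head *head)
{
	struct dentry *dentry =
		container_of(head, struct dentry, d_u.d_rcu);

	/* the flag check avoids touching ->d_op in the common case */
	if (dentry->d_flags & DCACHE_OP_FREE)
		dentry->d_op->d_free(dentry);
	kmem_cache_free(dentry_cache, dentry);
}
```

With that, the purely-kfree instances (fuse, nfs, debugfs) could drop
->d_release() entirely and free their d_fsdata from ->d_free().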