Orangefs, v4.5 and the merge window...

Message ID	CAOg9mSSzJZiv=qBTuLWaEF+FJcVA=UhXtxFPwxiYG8mkn-Rj7Q@mail.gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> MIME-Version: 1.0 Date: Fri, 11 Mar 2016 15:18:57 -0500 Message-ID: <CAOg9mSSzJZiv=qBTuLWaEF+FJcVA=UhXtxFPwxiYG8mkn-Rj7Q@mail.gmail.com> Subject: Orangefs, v4.5 and the merge window... From: Mike Marshall <hubcap@omnibond.com> To: Al Viro <viro@zeniv.linux.org.uk>, linux-fsdevel <linux-fsdevel@vger.kernel.org>, Mike Marshall <hubcap@omnibond.com> Content-Type: text/plain; charset=UTF-8 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk

Mike Marshall March 11, 2016, 8:18 p.m. UTC

Greetings...

The Orangefs for-next tree is:

git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux.git
for-next

I did a test merge (just locally, not pushed out) of Orangefs:for-next
and v4.5-rc7 so I could test out how I think I need to patch for
the follow_link -> get_link change, the diff is below.

On Monday next, assuming that v4.5 is finalized this weekend,
I plan to do a actual merge with v4.5, apply the get_link patch
and push that to Orangefs:for-next.

Hi Al <g>... might we get an ACK this time around?

-Mike

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro March 11, 2016, 9:47 p.m. UTC | #1

On Fri, Mar 11, 2016 at 03:18:57PM -0500, Mike Marshall wrote:
> Greetings...
> 
> The Orangefs for-next tree is:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux.git
> for-next
> 
> I did a test merge (just locally, not pushed out) of Orangefs:for-next
> and v4.5-rc7 so I could test out how I think I need to patch for
> the follow_link -> get_link change, the diff is below.
> 
> On Monday next, assuming that v4.5 is finalized this weekend,
> I plan to do a actual merge with v4.5, apply the get_link patch
> and push that to Orangefs:for-next.
> 
> Hi Al <g>... might we get an ACK this time around?

You do realize that it will mean fun few weeks post-merge fixing the rest of
problems, right?  FWIW, I think that right now it *is* at the state where it
such fixing is feasible, so modulo that...

As far as I can see, waiting-related logics should be solid by now, ditto
for lifetime rules; sanitizing the input...  listxattr still does need fixing
(feed it a negative in ->downcall.resp.listxattr.lengths[0] and watch Bad
Things(tm) happen; no idea why would anyone go for
fs/orangefs/downcall.h:82:      __s32 lengths[ORANGEFS_MAX_XATTR_LISTLEN];
for representing string lengths in the first place, but that's what you've
got there and no sanity checks are done on it beyond
                if (total + new_op->downcall.resp.listxattr.lengths[i] > size)
                        goto done;
which is not enough - not with total and size being ssize_t and ...lengths[] -
signed 32bit).

The logics around maintaining the list of orangefs superblocks (add/remove/
traverse) needs fixing; right now ioctl(..., ORANGEFS_DEV_REMOUNT_ALL) will
walk through it with only request_mutex held.  Both insertion and removal
are protected only by orangefs_superblocks_lock, and removal is insane -
        struct list_head *tmp_safe = NULL;                              \
        struct orangefs_sb_info_s *orangefs_sb = NULL;                  \
                                                                        \
        spin_lock(&orangefs_superblocks_lock);                          \
        list_for_each_safe(tmp, tmp_safe, &orangefs_superblocks) {      \
                orangefs_sb = list_entry(tmp,                           \
                                      struct orangefs_sb_info_s,        \
                                      list);                            \
                if (orangefs_sb && (orangefs_sb->sb == sb)) {           \
                        gossip_debug(GOSSIP_SUPER_DEBUG,                \
                            "Removing SB %p from orangefs superblocks\n",      \
                        orangefs_sb);                                   \
                        list_del(&orangefs_sb->list);                   \
                        break;                                          \
                }                                                       \
        }                                                               \
        spin_unlock(&orangefs_superblocks_lock);                        \
list_entry is never NULL, for starters, and since there is a pointer back
from superblock to that orangefs_sb_info, there's no reason to walk the entire
list to find one.  BTW, both add_orangefs_sb() and remove_orangefs_sb() should
be taken to their sole users.

Sanity aside, there's really no lock in common for list modifiers and list
walker I'd mentioned above.  FWIW, I would make orangefs_remount()
take struct orangefs_sb_info instead of struct super_block and flipped the
order of operations in orangefs_kill_sb() - kill_anon_super() *first*, then
remove from the list, then tell the userland that it's going away (i.e.
call orangefs_unmount_sb()).  request_mutex in the last one would, at least,
prevent freeing the sucker before orangefs_remount() is done with it.

Walking the list and calling orangefs_remount() on everything would still need
care - you'd need to hold orangefs_superblocks_lock, drop it for actual calls
of orangefs_remount() and have list removal preserve the forward pointer.

That's probably the worst remaining locking issue I see in there.  Doable,
if not pleasant...

IIRC, there also had been some unpleasantness with getattr messing ->i_mode
halfway through... <checks>  Yes - copy_attributes_to_inode() will be called,
and do
        inode->i_mode = orangefs_inode_perms(attrs);
...
                inode->i_mode |= S_IFLNK;
...
                        strncpy(orangefs_inode->link_target,
                                symname,
                                ORANGEFS_NAME_MAX);
If nothing else, *another* stat(2) racing with this one could pick the
intermediate value of ->i_mode and proceed to report it to userland.
Another problem is overwriting the symlink body; that can get very
unpleasant, since it might be traversed by another syscall right at that
moment.  Any change of a symlink body means "we'd missed it going stale"; 
there is no way to change a symlink contents without removing it and
creating a new one.  Should anything other than orangefs_iget() even bother
copying it?  The same goes for inode type changes, of course (regular
vs. directory vs. symlink, etc.).

Speaking of orangefs_iget(), orangefs_set_inode() is pointlessly paranoid.
Not a bug per se, but
        struct orangefs_inode_s *orangefs_inode = NULL;

        /* Make sure that we have sane parameters */
        if (!data || !inode)  
                return 0;
        orangefs_inode = ORANGEFS_I(inode);
        if (!orangefs_inode)
                return 0;
is all wrong - 'data' is the last argument passed to iget5_locked (i.e. 'ref'
of orangefs_iget()) and that's always an address of either a local variable
or of a field in a large structure, and not even the first one; 'inode'
is never NULL - it's the address of struct inode the caller is about to
insert into the hash chain; ORANGEFS_I() is container_of(), so it's not
going to be NULL either.

I'll need to look through the archived threads to see if there's anything
else left; IIRC, debugfs-related code had seriously nasty issues in case of
allocation failures, but those were fairly isolated.  I'll read through the
archive tomorrow and see if there's anything else mentioned and not dealt
with; I don't remember anything really bad, but it had been well over
a hundred mails starting about half a year ago; I sure as hell do not
remember every tangential subthread in all of that, so I'll need to recheck.

I _think_ that all remaining issues can be quickly dealt with, and the code
has zero impact on the rest of the kernel.  I wouldn't risk putting it into
-final without fixups, but as for the merge schedule... either merge it
before -rc1 and fix it up by -rc3 or so, or fix it during the window and
merge at around -rc2 - I'm fine with either variant.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Mike Marshall March 11, 2016, 10:35 p.m. UTC | #2

> either merge it
> before -rc1 and fix it up by -rc3 or so, or fix it during the
> window and merge at around -rc2 - I'm fine with either
> variant.

We've kept a list we made from all those mail messages
so we could check off things we've tried to address, I
was looking at it yesterday and I know it is not up-to-date,
but we'll work to get it that way. The second option
might be safer unless you help us again, I don't want
to sign a rubber check to Linus.

-Mike

On Fri, Mar 11, 2016 at 4:47 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Mar 11, 2016 at 03:18:57PM -0500, Mike Marshall wrote:
>> Greetings...
>>
>> The Orangefs for-next tree is:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux.git
>> for-next
>>
>> I did a test merge (just locally, not pushed out) of Orangefs:for-next
>> and v4.5-rc7 so I could test out how I think I need to patch for
>> the follow_link -> get_link change, the diff is below.
>>
>> On Monday next, assuming that v4.5 is finalized this weekend,
>> I plan to do a actual merge with v4.5, apply the get_link patch
>> and push that to Orangefs:for-next.
>>
>> Hi Al <g>... might we get an ACK this time around?
>
> You do realize that it will mean fun few weeks post-merge fixing the rest of
> problems, right?  FWIW, I think that right now it *is* at the state where it
> such fixing is feasible, so modulo that...
>
> As far as I can see, waiting-related logics should be solid by now, ditto
> for lifetime rules; sanitizing the input...  listxattr still does need fixing
> (feed it a negative in ->downcall.resp.listxattr.lengths[0] and watch Bad
> Things(tm) happen; no idea why would anyone go for
> fs/orangefs/downcall.h:82:      __s32 lengths[ORANGEFS_MAX_XATTR_LISTLEN];
> for representing string lengths in the first place, but that's what you've
> got there and no sanity checks are done on it beyond
>                 if (total + new_op->downcall.resp.listxattr.lengths[i] > size)
>                         goto done;
> which is not enough - not with total and size being ssize_t and ...lengths[] -
> signed 32bit).
>
> The logics around maintaining the list of orangefs superblocks (add/remove/
> traverse) needs fixing; right now ioctl(..., ORANGEFS_DEV_REMOUNT_ALL) will
> walk through it with only request_mutex held.  Both insertion and removal
> are protected only by orangefs_superblocks_lock, and removal is insane -
>         struct list_head *tmp_safe = NULL;                              \
>         struct orangefs_sb_info_s *orangefs_sb = NULL;                  \
>                                                                         \
>         spin_lock(&orangefs_superblocks_lock);                          \
>         list_for_each_safe(tmp, tmp_safe, &orangefs_superblocks) {      \
>                 orangefs_sb = list_entry(tmp,                           \
>                                       struct orangefs_sb_info_s,        \
>                                       list);                            \
>                 if (orangefs_sb && (orangefs_sb->sb == sb)) {           \
>                         gossip_debug(GOSSIP_SUPER_DEBUG,                \
>                             "Removing SB %p from orangefs superblocks\n",      \
>                         orangefs_sb);                                   \
>                         list_del(&orangefs_sb->list);                   \
>                         break;                                          \
>                 }                                                       \
>         }                                                               \
>         spin_unlock(&orangefs_superblocks_lock);                        \
> list_entry is never NULL, for starters, and since there is a pointer back
> from superblock to that orangefs_sb_info, there's no reason to walk the entire
> list to find one.  BTW, both add_orangefs_sb() and remove_orangefs_sb() should
> be taken to their sole users.
>
> Sanity aside, there's really no lock in common for list modifiers and list
> walker I'd mentioned above.  FWIW, I would make orangefs_remount()
> take struct orangefs_sb_info instead of struct super_block and flipped the
> order of operations in orangefs_kill_sb() - kill_anon_super() *first*, then
> remove from the list, then tell the userland that it's going away (i.e.
> call orangefs_unmount_sb()).  request_mutex in the last one would, at least,
> prevent freeing the sucker before orangefs_remount() is done with it.
>
> Walking the list and calling orangefs_remount() on everything would still need
> care - you'd need to hold orangefs_superblocks_lock, drop it for actual calls
> of orangefs_remount() and have list removal preserve the forward pointer.
>
> That's probably the worst remaining locking issue I see in there.  Doable,
> if not pleasant...
>
> IIRC, there also had been some unpleasantness with getattr messing ->i_mode
> halfway through... <checks>  Yes - copy_attributes_to_inode() will be called,
> and do
>         inode->i_mode = orangefs_inode_perms(attrs);
> ...
>                 inode->i_mode |= S_IFLNK;
> ...
>                         strncpy(orangefs_inode->link_target,
>                                 symname,
>                                 ORANGEFS_NAME_MAX);
> If nothing else, *another* stat(2) racing with this one could pick the
> intermediate value of ->i_mode and proceed to report it to userland.
> Another problem is overwriting the symlink body; that can get very
> unpleasant, since it might be traversed by another syscall right at that
> moment.  Any change of a symlink body means "we'd missed it going stale";
> there is no way to change a symlink contents without removing it and
> creating a new one.  Should anything other than orangefs_iget() even bother
> copying it?  The same goes for inode type changes, of course (regular
> vs. directory vs. symlink, etc.).
>
> Speaking of orangefs_iget(), orangefs_set_inode() is pointlessly paranoid.
> Not a bug per se, but
>         struct orangefs_inode_s *orangefs_inode = NULL;
>
>         /* Make sure that we have sane parameters */
>         if (!data || !inode)
>                 return 0;
>         orangefs_inode = ORANGEFS_I(inode);
>         if (!orangefs_inode)
>                 return 0;
> is all wrong - 'data' is the last argument passed to iget5_locked (i.e. 'ref'
> of orangefs_iget()) and that's always an address of either a local variable
> or of a field in a large structure, and not even the first one; 'inode'
> is never NULL - it's the address of struct inode the caller is about to
> insert into the hash chain; ORANGEFS_I() is container_of(), so it's not
> going to be NULL either.
>
> I'll need to look through the archived threads to see if there's anything
> else left; IIRC, debugfs-related code had seriously nasty issues in case of
> allocation failures, but those were fairly isolated.  I'll read through the
> archive tomorrow and see if there's anything else mentioned and not dealt
> with; I don't remember anything really bad, but it had been well over
> a hundred mails starting about half a year ago; I sure as hell do not
> remember every tangential subthread in all of that, so I'll need to recheck.
>
> I _think_ that all remaining issues can be quickly dealt with, and the code
> has zero impact on the rest of the kernel.  I wouldn't risk putting it into
> -final without fixups, but as for the merge schedule... either merge it
> before -rc1 and fix it up by -rc3 or so, or fix it during the window and
> merge at around -rc2 - I'm fine with either variant.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Mike Marshall March 14, 2016, 9:03 p.m. UTC | #3

Hi Al (and everyone)...

git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux.git
for-next

... is updated to v4.5, it has the follow_link -> get_link change.

I worked today to clean up the debugfs (and sysfs) problems,
and we'll keep ticking things off the list, perhaps I should be
posting patches here for review instead automatically updating
for-next when we work on an issue...

We're working on putting "the list" in a place that can be
viewed by everyone and edited by both Martin and myself...

-Mike

On Fri, Mar 11, 2016 at 5:35 PM, Mike Marshall <hubcap@omnibond.com> wrote:
>> either merge it
>> before -rc1 and fix it up by -rc3 or so, or fix it during the
>> window and merge at around -rc2 - I'm fine with either
>> variant.
>
> We've kept a list we made from all those mail messages
> so we could check off things we've tried to address, I
> was looking at it yesterday and I know it is not up-to-date,
> but we'll work to get it that way. The second option
> might be safer unless you help us again, I don't want
> to sign a rubber check to Linus.
>
> -Mike
>
> On Fri, Mar 11, 2016 at 4:47 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> On Fri, Mar 11, 2016 at 03:18:57PM -0500, Mike Marshall wrote:
>>> Greetings...
>>>
>>> The Orangefs for-next tree is:
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux.git
>>> for-next
>>>
>>> I did a test merge (just locally, not pushed out) of Orangefs:for-next
>>> and v4.5-rc7 so I could test out how I think I need to patch for
>>> the follow_link -> get_link change, the diff is below.
>>>
>>> On Monday next, assuming that v4.5 is finalized this weekend,
>>> I plan to do a actual merge with v4.5, apply the get_link patch
>>> and push that to Orangefs:for-next.
>>>
>>> Hi Al <g>... might we get an ACK this time around?
>>
>> You do realize that it will mean fun few weeks post-merge fixing the rest of
>> problems, right?  FWIW, I think that right now it *is* at the state where it
>> such fixing is feasible, so modulo that...
>>
>> As far as I can see, waiting-related logics should be solid by now, ditto
>> for lifetime rules; sanitizing the input...  listxattr still does need fixing
>> (feed it a negative in ->downcall.resp.listxattr.lengths[0] and watch Bad
>> Things(tm) happen; no idea why would anyone go for
>> fs/orangefs/downcall.h:82:      __s32 lengths[ORANGEFS_MAX_XATTR_LISTLEN];
>> for representing string lengths in the first place, but that's what you've
>> got there and no sanity checks are done on it beyond
>>                 if (total + new_op->downcall.resp.listxattr.lengths[i] > size)
>>                         goto done;
>> which is not enough - not with total and size being ssize_t and ...lengths[] -
>> signed 32bit).
>>
>> The logics around maintaining the list of orangefs superblocks (add/remove/
>> traverse) needs fixing; right now ioctl(..., ORANGEFS_DEV_REMOUNT_ALL) will
>> walk through it with only request_mutex held.  Both insertion and removal
>> are protected only by orangefs_superblocks_lock, and removal is insane -
>>         struct list_head *tmp_safe = NULL;                              \
>>         struct orangefs_sb_info_s *orangefs_sb = NULL;                  \
>>                                                                         \
>>         spin_lock(&orangefs_superblocks_lock);                          \
>>         list_for_each_safe(tmp, tmp_safe, &orangefs_superblocks) {      \
>>                 orangefs_sb = list_entry(tmp,                           \
>>                                       struct orangefs_sb_info_s,        \
>>                                       list);                            \
>>                 if (orangefs_sb && (orangefs_sb->sb == sb)) {           \
>>                         gossip_debug(GOSSIP_SUPER_DEBUG,                \
>>                             "Removing SB %p from orangefs superblocks\n",      \
>>                         orangefs_sb);                                   \
>>                         list_del(&orangefs_sb->list);                   \
>>                         break;                                          \
>>                 }                                                       \
>>         }                                                               \
>>         spin_unlock(&orangefs_superblocks_lock);                        \
>> list_entry is never NULL, for starters, and since there is a pointer back
>> from superblock to that orangefs_sb_info, there's no reason to walk the entire
>> list to find one.  BTW, both add_orangefs_sb() and remove_orangefs_sb() should
>> be taken to their sole users.
>>
>> Sanity aside, there's really no lock in common for list modifiers and list
>> walker I'd mentioned above.  FWIW, I would make orangefs_remount()
>> take struct orangefs_sb_info instead of struct super_block and flipped the
>> order of operations in orangefs_kill_sb() - kill_anon_super() *first*, then
>> remove from the list, then tell the userland that it's going away (i.e.
>> call orangefs_unmount_sb()).  request_mutex in the last one would, at least,
>> prevent freeing the sucker before orangefs_remount() is done with it.
>>
>> Walking the list and calling orangefs_remount() on everything would still need
>> care - you'd need to hold orangefs_superblocks_lock, drop it for actual calls
>> of orangefs_remount() and have list removal preserve the forward pointer.
>>
>> That's probably the worst remaining locking issue I see in there.  Doable,
>> if not pleasant...
>>
>> IIRC, there also had been some unpleasantness with getattr messing ->i_mode
>> halfway through... <checks>  Yes - copy_attributes_to_inode() will be called,
>> and do
>>         inode->i_mode = orangefs_inode_perms(attrs);
>> ...
>>                 inode->i_mode |= S_IFLNK;
>> ...
>>                         strncpy(orangefs_inode->link_target,
>>                                 symname,
>>                                 ORANGEFS_NAME_MAX);
>> If nothing else, *another* stat(2) racing with this one could pick the
>> intermediate value of ->i_mode and proceed to report it to userland.
>> Another problem is overwriting the symlink body; that can get very
>> unpleasant, since it might be traversed by another syscall right at that
>> moment.  Any change of a symlink body means "we'd missed it going stale";
>> there is no way to change a symlink contents without removing it and
>> creating a new one.  Should anything other than orangefs_iget() even bother
>> copying it?  The same goes for inode type changes, of course (regular
>> vs. directory vs. symlink, etc.).
>>
>> Speaking of orangefs_iget(), orangefs_set_inode() is pointlessly paranoid.
>> Not a bug per se, but
>>         struct orangefs_inode_s *orangefs_inode = NULL;
>>
>>         /* Make sure that we have sane parameters */
>>         if (!data || !inode)
>>                 return 0;
>>         orangefs_inode = ORANGEFS_I(inode);
>>         if (!orangefs_inode)
>>                 return 0;
>> is all wrong - 'data' is the last argument passed to iget5_locked (i.e. 'ref'
>> of orangefs_iget()) and that's always an address of either a local variable
>> or of a field in a large structure, and not even the first one; 'inode'
>> is never NULL - it's the address of struct inode the caller is about to
>> insert into the hash chain; ORANGEFS_I() is container_of(), so it's not
>> going to be NULL either.
>>
>> I'll need to look through the archived threads to see if there's anything
>> else left; IIRC, debugfs-related code had seriously nasty issues in case of
>> allocation failures, but those were fairly isolated.  I'll read through the
>> archive tomorrow and see if there's anything else mentioned and not dealt
>> with; I don't remember anything really bad, but it had been well over
>> a hundred mails starting about half a year ago; I sure as hell do not
>> remember every tangential subthread in all of that, so I'll need to recheck.
>>
>> I _think_ that all remaining issues can be quickly dealt with, and the code
>> has zero impact on the rest of the kernel.  I wouldn't risk putting it into
>> -final without fixups, but as for the merge schedule... either merge it
>> before -rc1 and fix it up by -rc3 or so, or fix it during the window and
>> merge at around -rc2 - I'm fine with either variant.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Martin Brandenburg March 15, 2016, 4:04 a.m. UTC | #4

On Fri, 11 Mar 2016, Al Viro wrote:

> IIRC, there also had been some unpleasantness with getattr messing ->i_mode
> halfway through... <checks>  Yes - copy_attributes_to_inode() will be called,
> and do
>        inode->i_mode = orangefs_inode_perms(attrs);
> ...
>                inode->i_mode |= S_IFLNK;
> ...
>                        strncpy(orangefs_inode->link_target,
>                                symname,
>                                ORANGEFS_NAME_MAX);
> If nothing else, *another* stat(2) racing with this one could pick the
> intermediate value of ->i_mode and proceed to report it to userland.
> Another problem is overwriting the symlink body; that can get very
> unpleasant, since it might be traversed by another syscall right at that
> moment.  Any change of a symlink body means "we'd missed it going stale";
> there is no way to change a symlink contents without removing it and
> creating a new one.  Should anything other than orangefs_iget() even bother
> copying it?  The same goes for inode type changes, of course (regular
> vs. directory vs. symlink, etc.).

I thought I fixed most of that when I went over
d_revalidate. But now I see that there's the case that
the server attribute changes after revalidate and before
the operation e.g. chmod. What happens to the inode we
now know is stale? What if two processes notice it's
stale at the same time?

We can go ahead with the chmod even if the object type
has changed since the kernel last saw it. But OrangeFS
doesn't have NFS's sillyrename. So what do we do in the
middle of read when we get ENOENT? We ignore POSIX and
return it to userspace anyway and leave the bad inode
around now.

There's lots of unsanitized error codes from the server
going to userspace now.

Will writing i_mode in one shot fix it?

-- Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Martin Brandenburg March 15, 2016, 4:45 p.m. UTC | #5

On Tue, 15 Mar 2016, Martin Brandenburg wrote:

> On Fri, 11 Mar 2016, Al Viro wrote:
>
>> IIRC, there also had been some unpleasantness with getattr messing ->i_mode
>> halfway through... <checks>  Yes - copy_attributes_to_inode() will be 
>> called,
>> and do
>>        inode->i_mode = orangefs_inode_perms(attrs);
>> ...
>>                inode->i_mode |= S_IFLNK;
>> ...
>>                        strncpy(orangefs_inode->link_target,
>>                                symname,
>>                                ORANGEFS_NAME_MAX);
>> If nothing else, *another* stat(2) racing with this one could pick the
>> intermediate value of ->i_mode and proceed to report it to userland.
>> Another problem is overwriting the symlink body; that can get very
>> unpleasant, since it might be traversed by another syscall right at that
>> moment.  Any change of a symlink body means "we'd missed it going stale";
>> there is no way to change a symlink contents without removing it and
>> creating a new one.  Should anything other than orangefs_iget() even bother
>> copying it?  The same goes for inode type changes, of course (regular
>> vs. directory vs. symlink, etc.).
>
> I thought I fixed most of that when I went over
> d_revalidate. But now I see that there's the case that
> the server attribute changes after revalidate and before
> the operation e.g. chmod. What happens to the inode we
> now know is stale? What if two processes notice it's
> stale at the same time?
>
> We can go ahead with the chmod even if the object type
> has changed since the kernel last saw it. But OrangeFS
> doesn't have NFS's sillyrename. So what do we do in the
> middle of read when we get ENOENT? We ignore POSIX and
> return it to userspace anyway and leave the bad inode
> around now.
>
> There's lots of unsanitized error codes from the server
> going to userspace now.
>
> Will writing i_mode in one shot fix it?
>
> -- Martin
>

This is where I'm going with this. Not done yet, but you
can see where it's going hopefully.

https://github.com/martinbrandenburg/linux.git branch getattr

-- Martin
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro March 26, 2016, 12:21 a.m. UTC | #6

On Mon, Mar 14, 2016 at 05:03:55PM -0400, Mike Marshall wrote:
> Hi Al (and everyone)...
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux.git
> for-next
> 
> ... is updated to v4.5, it has the follow_link -> get_link change.
> 
> I worked today to clean up the debugfs (and sysfs) problems,
> and we'll keep ticking things off the list, perhaps I should be
> posting patches here for review instead automatically updating
> for-next when we work on an issue...
> 
> We're working on putting "the list" in a place that can be
> viewed by everyone and edited by both Martin and myself...

FWIW, I've ported several cleanups + short read/write on error + lseek
fixes + orangefs_superblocks locking fixes on top of that branch and pushed
into vfs.git#orangefs-untested.

I think it can be merged this cycle; if the fixes above really work, this
+ Martin's latest should leave only the handling of allocation failures in
debugfs-related code.

Linus, would you be OK with that thing going into -rc2?  It's self-contained
and I think it's in reasonably sane shape.  If not for the last locking fixes,
I would even suggest -rc1 with fixups of debugfs issues during the next week,
but that last commit got no testing whatsoever... ;-/

I would prefer if pull request went from Mike, once he's OK with the whole
set, but as far as I'm concerned the material in his #for-next + my pile above
+ Martin's branch (getattr fixes) should be OK for mainline merge.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Mike Marshall March 26, 2016, 1 a.m. UTC | #7

Wow... the superblock problem is something we haven't worked on yet...
thanks!

Martin's new code caused some xfstests regressions, but we got past
all but the generic/028 regression... I don't know if Martin worked on
that today, I wasn't in the office.

I had fixed up what I saw wrong in debugfs, so I'll have to look
again for the allocation failures...

I haven't pushed Martin's re-worked code to kernel.org yet, since
we were working through the couple of regressions, have you looked
at it on his github repo, and it doesn't conflict with your patches?

I'll look and see if I can get your patches to load on top of
Martin's patches and get that pushed to kernel.org.

I leave before dawn on Sunday morning for that Linux
Foundation summit in Lake Tahoe, so if we get this
thing merged now (I hope <g>) we'll have to continue
to polish it through the rc's, but that's what they're there
for, right?

-Mike

On Fri, Mar 25, 2016 at 8:21 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Mon, Mar 14, 2016 at 05:03:55PM -0400, Mike Marshall wrote:
>> Hi Al (and everyone)...
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux.git
>> for-next
>>
>> ... is updated to v4.5, it has the follow_link -> get_link change.
>>
>> I worked today to clean up the debugfs (and sysfs) problems,
>> and we'll keep ticking things off the list, perhaps I should be
>> posting patches here for review instead automatically updating
>> for-next when we work on an issue...
>>
>> We're working on putting "the list" in a place that can be
>> viewed by everyone and edited by both Martin and myself...
>
> FWIW, I've ported several cleanups + short read/write on error + lseek
> fixes + orangefs_superblocks locking fixes on top of that branch and pushed
> into vfs.git#orangefs-untested.
>
> I think it can be merged this cycle; if the fixes above really work, this
> + Martin's latest should leave only the handling of allocation failures in
> debugfs-related code.
>
> Linus, would you be OK with that thing going into -rc2?  It's self-contained
> and I think it's in reasonably sane shape.  If not for the last locking fixes,
> I would even suggest -rc1 with fixups of debugfs issues during the next week,
> but that last commit got no testing whatsoever... ;-/
>
> I would prefer if pull request went from Mike, once he's OK with the whole
> set, but as far as I'm concerned the material in his #for-next + my pile above
> + Martin's branch (getattr fixes) should be OK for mainline merge.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Al Viro March 26, 2016, 1:01 a.m. UTC | #8

On Fri, Mar 25, 2016 at 05:28:33PM -0700, Linus Torvalds wrote:
> Gaah. I'm going to be traveling starting tomorrow night (red-eye) and I was
> going to cut rc1 tomorrow and then really hope for a calm rc2 week.
> 
> At the same time, is live to get this finally merged, about a year after I
> originally thought were would. So I guess I'll take it, even if I hate the
> timing.

Umm... IIRC, there had been precedents when self-contained driver got merged
later than -rc1.  Might make more sense than rushing the merge in the last
day of the window, especially since you are travelling...

> There are zero changes outside orangefs itself, and obviously the
> Makefile/KConfig file to get it all enabled?

Yes.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Mike Marshall March 26, 2016, 1:02 a.m. UTC | #9

There's the orangefs.txt in Documentation.

And I'm listed as a maintainer, I forget what file that's in...

But you're right, we mostly don't touch anything out of fs/orangefs...

-Mike

On Fri, Mar 25, 2016 at 8:28 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mar 25, 2016 5:21 PM, "Al Viro" <viro@zeniv.linux.org.uk> wrote:
>>
>> Linus, would you be OK with that thing going into -rc2?  It's
>> self-contained
>> and I think it's in reasonably sane shape.  If not for the last locking
>> fixes,
>> I would even suggest -rc1 with fixups of debugfs issues during the next
>> week,
>> but that last commit got no testing whatsoever... ;-/
>
> Gaah. I'm going to be traveling starting tomorrow night (red-eye) and I was
> going to cut rc1 tomorrow and then really hope for a calm rc2 week.
>
> At the same time, is live to get this finally merged, about a year after I
> originally thought were would. So I guess I'll take it, even if I hate the
> timing.
>
> There are zero changes outside orangefs itself, and obviously the
> Makefile/KConfig file to get it all enabled?
>
>     Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Mike Marshall March 26, 2016, 1:07 a.m. UTC | #10

If Al's patches load on top of Martin's patches without conflict, I can probably
get it all merged and minimally tested before noon my time (eastern
US) tomorrow... that's probably late for where y'all are though...

-Mike

On Fri, Mar 25, 2016 at 9:01 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Mar 25, 2016 at 05:28:33PM -0700, Linus Torvalds wrote:
>> Gaah. I'm going to be traveling starting tomorrow night (red-eye) and I was
>> going to cut rc1 tomorrow and then really hope for a calm rc2 week.
>>
>> At the same time, is live to get this finally merged, about a year after I
>> originally thought were would. So I guess I'll take it, even if I hate the
>> timing.
>
> Umm... IIRC, there had been precedents when self-contained driver got merged
> later than -rc1.  Might make more sense than rushing the merge in the last
> day of the window, especially since you are travelling...
>
>> There are zero changes outside orangefs itself, and obviously the
>> Makefile/KConfig file to get it all enabled?
>
> Yes.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Mike Marshall March 26, 2016, 3:55 a.m. UTC | #11

I got Al's patches merged with Martin's patches and only
one pretty simple conflict to fix.

In testing, most things seem to work right, but at least one
important thing is busted: umount...

unmounting the filesystem just gets the kernel into a forever loop,
"ERROR: fs_mount_pending 147936488" just goes on
forever until I interrupt the umount command:

Mar 25 23:42:00 be1 kernel: [  156.990737] Releasing OP (ffff8800189a0000: 103)
Mar 25 23:42:00 be1 kernel: [  156.991366] orangefs_kill_sb: called
Mar 25 23:42:00 be1 kernel: [  156.991401] orangefs_destroy_inode:
deallocated ffff880018998000 destroying inode
00001000-0000-0000-0000-000000000000
Mar 25 23:42:00 be1 kernel: [  156.991414] orangefs_unmount_sb called
on sb ffff880018810000
Mar 25 23:42:00 be1 kernel: [  156.991416] Alloced OP
(ffff8800189a0000: 104 OP_FS_UMOUNT)
Mar 25 23:42:00 be1 kernel: [  156.991417] Attempting ORANGEFS Unmount
via host tcp://be1:3334/orangefs
Mar 25 23:42:00 be1 kernel: [  156.991418] service_operation:
orangefs_fs_umount op:ffff8800189a0000: process:umount: pid:1124:
Mar 25 23:42:00 be1 kernel: [  156.991419] service_operation:
op:OP_FS_UMOUNT: op_state:1: process:umount:
Mar 25 23:42:00 be1 kernel: [  156.991435] orangefs: skipping op tag
104 OP_FS_UMOUNT
Mar 25 23:42:00 be1 kernel: [  156.991435] orangefs: ERROR:
fs_mount_pending 147936488
Mar 25 23:42:00 be1 kernel: [  156.991440] orangefs: skipping op tag
104 OP_FS_UMOUNT

The github repo has all the patches, I didn't think I
should push it to the orangefs for-next branch like this.

https://github.com/hubcapsc/linux/tree/current

I'll look to see if I can see it, I guess it has something to
do with Al's superblock re-do...

-Mike

On Fri, Mar 25, 2016 at 9:11 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mar 25, 2016 18:01, "Al Viro" <viro@zeniv.linux.org.uk> wrote:
>>
>> Umm... IIRC, there had been precedents when self-contained driver got
>> merged
>> later than -rc1.
>
> I do it if there is some reason for it, yes.
>
> In particular, for things like consumer hardware that has not had a driver
> for it, I'll merge new drivers later, because missing one release nasty be a
> big deal if some average-Joe machine just doesn't work. But even then it's
> not just "random driver", but something that keeps that machine from working
> entirely - disk or network or GPU or USB or something core like that.
>
> So it's not just "new driver that cannot cause a regression".  It's also
> about the driver being important to people who can't reasonably be expected
> to compile their own kernel..
>
> There is little reason to do an expedited out-of-the-merge-window merge for
> something like orangefs, though.
>
>    Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Orangefs, v4.5 and the merge window...

Commit Message

Comments

Patch