
[v2] locks: Filter /proc/locks output on proc pid ns

Message ID 1470209710-30022-1-git-send-email-kernel@kyup.com (mailing list archive)
State New, archived

Commit Message

kernel@kyup.com Aug. 3, 2016, 7:35 a.m. UTC
On busy container servers reading /proc/locks shows all the locks
created by all clients. This can cause large latency spikes. In my
case I observed lsof taking up to 5-10 seconds while processing around
50k locks. Fix this by limiting the locks shown only to those created
in the same pidns as the one the proc was mounted in. When reading
/proc/locks from the init_pid_ns show everything.

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
---
 fs/locks.c | 6 ++++++
 1 file changed, 6 insertions(+)

Comments

Jeff Layton Aug. 3, 2016, 1:46 p.m. UTC | #1
On Wed, 2016-08-03 at 10:35 +0300, Nikolay Borisov wrote:
> On busy container servers reading /proc/locks shows all the locks
> created by all clients. This can cause large latency spikes. In my
> case I observed lsof taking up to 5-10 seconds while processing around
> 50k locks. Fix this by limiting the locks shown only to those created
> in the same pidns as the one the proc was mounted in. When reading
> /proc/locks from the init_pid_ns show everything.
> 
> Signed-off-by: Nikolay Borisov <kernel@kyup.com>
> ---
>  fs/locks.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/fs/locks.c b/fs/locks.c
> index ee1b15f6fc13..751673d7f7fc 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -2648,9 +2648,15 @@ static int locks_show(struct seq_file *f, void *v)
>  {
>  	struct locks_iterator *iter = f->private;
>  	struct file_lock *fl, *bfl;
> +	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
> +	struct pid_namespace *current_pidns = task_active_pid_ns(current);
>  
>  	fl = hlist_entry(v, struct file_lock, fl_link);
>  
> +	if ((current_pidns != &init_pid_ns) && fl->fl_nspid

Ok, so when you read from a process that's in the init_pid_ns
namespace, then you'll get the whole pile of locks, even when reading
this from a filesystem that was mounted in a different pid_ns?

That seems odd to me if so. Any reason not to just uniformly use the
proc_pidns here?

> +	    && (proc_pidns != ns_of_pid(fl->fl_nspid)))
> +		return 0;
> +
>  	lock_get_status(f, fl, iter->li_pos, "");
>  
>  	list_for_each_entry(bfl, &fl->fl_block, fl_block)
kernel@kyup.com Aug. 3, 2016, 2:17 p.m. UTC | #2
On 08/03/2016 04:46 PM, Jeff Layton wrote:
> On Wed, 2016-08-03 at 10:35 +0300, Nikolay Borisov wrote:
>> On busy container servers reading /proc/locks shows all the locks
>> created by all clients. This can cause large latency spikes. In my
>> case I observed lsof taking up to 5-10 seconds while processing around
>> 50k locks. Fix this by limiting the locks shown only to those created
>> in the same pidns as the one the proc was mounted in. When reading
>> /proc/locks from the init_pid_ns show everything.
>>
>> Signed-off-by: Nikolay Borisov <kernel@kyup.com>
>> ---
>>  fs/locks.c | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/fs/locks.c b/fs/locks.c
>> index ee1b15f6fc13..751673d7f7fc 100644
>> --- a/fs/locks.c
>> +++ b/fs/locks.c
>> @@ -2648,9 +2648,15 @@ static int locks_show(struct seq_file *f, void *v)
>>  {
>>  	struct locks_iterator *iter = f->private;
>>  	struct file_lock *fl, *bfl;
>> +	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
>> +	struct pid_namespace *current_pidns = task_active_pid_ns(current);
>>  
>>  	fl = hlist_entry(v, struct file_lock, fl_link);
>>  
>> +	if ((current_pidns != &init_pid_ns) && fl->fl_nspid
> 
> Ok, so when you read from a process that's in the init_pid_ns
> namespace, then you'll get the whole pile of locks, even when reading
> this from a filesystem that was mounted in a different pid_ns?
> 
> That seems odd to me if so. Any reason not to just uniformly use the
> proc_pidns here?

[CCing some people from openvz/CRIU]

My train of thought was "we should have a means which would be the one
universal truth about everything, and this would be a process in the
init_pid_ns". I don't have a strong preference as long as I'm not
breaking userspace. As I said before - I think the CRIU guys might be
using that interface.

> 
>> +	    && (proc_pidns != ns_of_pid(fl->fl_nspid)))
>> +		return 0;
>> +
>>  	lock_get_status(f, fl, iter->li_pos, "");
>>  
>>  	list_for_each_entry(bfl, &fl->fl_block, fl_block)
> 
J. Bruce Fields Aug. 3, 2016, 2:28 p.m. UTC | #3
On Wed, Aug 03, 2016 at 05:17:09PM +0300, Nikolay Borisov wrote:
> 
> 
> On 08/03/2016 04:46 PM, Jeff Layton wrote:
> > On Wed, 2016-08-03 at 10:35 +0300, Nikolay Borisov wrote:
> >> On busy container servers reading /proc/locks shows all the locks
> >> created by all clients. This can cause large latency spikes. In my
> >> case I observed lsof taking up to 5-10 seconds while processing around
> >> 50k locks. Fix this by limiting the locks shown only to those created
> >> in the same pidns as the one the proc was mounted in. When reading
> >> /proc/locks from the init_pid_ns show everything.
> >>
> >> Signed-off-by: Nikolay Borisov <kernel@kyup.com>
> >> ---
> >>  fs/locks.c | 6 ++++++
> >>  1 file changed, 6 insertions(+)
> >>
> >> diff --git a/fs/locks.c b/fs/locks.c
> >> index ee1b15f6fc13..751673d7f7fc 100644
> >> --- a/fs/locks.c
> >> +++ b/fs/locks.c
> >> @@ -2648,9 +2648,15 @@ static int locks_show(struct seq_file *f, void *v)
> >>  {
> >>  	struct locks_iterator *iter = f->private;
> >>  	struct file_lock *fl, *bfl;
> >> +	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
> >> +	struct pid_namespace *current_pidns = task_active_pid_ns(current);
> >>  
> >>  	fl = hlist_entry(v, struct file_lock, fl_link);
> >>  
> >> +	if ((current_pidns != &init_pid_ns) && fl->fl_nspid
> > 
> > Ok, so when you read from a process that's in the init_pid_ns
> > namespace, then you'll get the whole pile of locks, even when reading
> > this from a filesystem that was mounted in a different pid_ns?
> > 
> > That seems odd to me if so. Any reason not to just uniformly use the
> > proc_pidns here?
> 
> [CCing some people from openvz/CRIU]
> 
> My train of thought was "we should have a means which would be the one
> universal truth about everything, and this would be a process in the
> init_pid_ns".

OK, but why not make that means be "mount proc from the init_pid_ns and
read /proc/locks there".  So just replace current_pidns with proc_pidns
in the above.  I think that's all Jeff was suggesting.

--b.

> I don't have a strong preference as long as I'm not breaking
> userspace. As I said before - I think the CRIU guys might be using that
> interface.
> 
> > 
> >> +	    && (proc_pidns != ns_of_pid(fl->fl_nspid)))
> >> +		return 0;
> >> +
> >>  	lock_get_status(f, fl, iter->li_pos, "");
> >>  
> >>  	list_for_each_entry(bfl, &fl->fl_block, fl_block)
> > 
kernel@kyup.com Aug. 3, 2016, 2:33 p.m. UTC | #4
On 08/03/2016 05:28 PM, J. Bruce Fields wrote:
> On Wed, Aug 03, 2016 at 05:17:09PM +0300, Nikolay Borisov wrote:
>>
>>
>> On 08/03/2016 04:46 PM, Jeff Layton wrote:
>>> On Wed, 2016-08-03 at 10:35 +0300, Nikolay Borisov wrote:
>>>> On busy container servers reading /proc/locks shows all the locks
>>>> created by all clients. This can cause large latency spikes. In my
>>>> case I observed lsof taking up to 5-10 seconds while processing around
>>>> 50k locks. Fix this by limiting the locks shown only to those created
>>>> in the same pidns as the one the proc was mounted in. When reading
>>>> /proc/locks from the init_pid_ns show everything.
>>>>
>>>> Signed-off-by: Nikolay Borisov <kernel@kyup.com>
>>>> ---
>>>>  fs/locks.c | 6 ++++++
>>>>  1 file changed, 6 insertions(+)
>>>>
>>>> diff --git a/fs/locks.c b/fs/locks.c
>>>> index ee1b15f6fc13..751673d7f7fc 100644
>>>> --- a/fs/locks.c
>>>> +++ b/fs/locks.c
>>>> @@ -2648,9 +2648,15 @@ static int locks_show(struct seq_file *f, void *v)
>>>>  {
>>>>  	struct locks_iterator *iter = f->private;
>>>>  	struct file_lock *fl, *bfl;
>>>> +	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
>>>> +	struct pid_namespace *current_pidns = task_active_pid_ns(current);
>>>>  
>>>>  	fl = hlist_entry(v, struct file_lock, fl_link);
>>>>  
>>>> +	if ((current_pidns != &init_pid_ns) && fl->fl_nspid
>>>
>>> Ok, so when you read from a process that's in the init_pid_ns
>>> namespace, then you'll get the whole pile of locks, even when reading
>>> this from a filesystem that was mounted in a different pid_ns?
>>>
>>> That seems odd to me if so. Any reason not to just uniformly use the
>>> proc_pidns here?
>>
>> [CCing some people from openvz/CRIU]
>>
>> My train of thought was "we should have a means which would be the one
>> universal truth about everything, and this would be a process in the
>> init_pid_ns".
> 
> OK, but why not make that means be "mount proc from the init_pid_ns and
> read /proc/locks there".  So just replace current_pidns with proc_pidns
> in the above.  I think that's all Jeff was suggesting.

Oh, you are right. Silly me, yes, I'm happy with this and I will send a
patch.


> 
> --b.
> 
>> I don't have a strong preference as long as I'm not breaking
>> userspace. As I said before - I think the CRIU guys might be using that
>> interface.
>>
>>>
>>>> +	    && (proc_pidns != ns_of_pid(fl->fl_nspid)))
>>>> +		return 0;
>>>> +
>>>>  	lock_get_status(f, fl, iter->li_pos, "");
>>>>  
>>>>  	list_for_each_entry(bfl, &fl->fl_block, fl_block)
>>>
Pavel Emelyanov Aug. 3, 2016, 2:54 p.m. UTC | #5
On 08/03/2016 05:17 PM, Nikolay Borisov wrote:
> 
> 
> On 08/03/2016 04:46 PM, Jeff Layton wrote:
>> On Wed, 2016-08-03 at 10:35 +0300, Nikolay Borisov wrote:
>>> On busy container servers reading /proc/locks shows all the locks
>>> created by all clients. This can cause large latency spikes. In my
>>> case I observed lsof taking up to 5-10 seconds while processing around
>>> 50k locks. Fix this by limiting the locks shown only to those created
>>> in the same pidns as the one the proc was mounted in. When reading
>>> /proc/locks from the init_pid_ns show everything.
>>>
>>> Signed-off-by: Nikolay Borisov <kernel@kyup.com>
>>> ---
>>>  fs/locks.c | 6 ++++++
>>>  1 file changed, 6 insertions(+)
>>>
>>> diff --git a/fs/locks.c b/fs/locks.c
>>> index ee1b15f6fc13..751673d7f7fc 100644
>>> --- a/fs/locks.c
>>> +++ b/fs/locks.c
>>> @@ -2648,9 +2648,15 @@ static int locks_show(struct seq_file *f, void *v)
>>>  {
>>>  	struct locks_iterator *iter = f->private;
>>>  	struct file_lock *fl, *bfl;
>>> +	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
>>> +	struct pid_namespace *current_pidns = task_active_pid_ns(current);
>>>  
>>>  	fl = hlist_entry(v, struct file_lock, fl_link);
>>>  
>>> +	if ((current_pidns != &init_pid_ns) && fl->fl_nspid
>>
>> Ok, so when you read from a process that's in the init_pid_ns
>> namespace, then you'll get the whole pile of locks, even when reading
>> this from a filesystem that was mounted in a different pid_ns?
>>
>> That seems odd to me if so. Any reason not to just uniformly use the
>> proc_pidns here?
> 
> [CCing some people from openvz/CRIU]

Thanks :)

> My train of thought was "we should have a means which would be the one
> universal truth about everything, and this would be a process in the
> init_pid_ns". I don't have a strong preference as long as I'm not
> breaking userspace. As I said before - I think the CRIU guys might be
> using that interface.

This particular change won't break us mostly because we've switched to
reading the /proc/pid/fdinfo/n files for locks.

-- Pavel

>>
>>> +	    && (proc_pidns != ns_of_pid(fl->fl_nspid)))
>>> +		return 0;
>>> +
>>>  	lock_get_status(f, fl, iter->li_pos, "");
>>>  
>>>  	list_for_each_entry(bfl, &fl->fl_block, fl_block)
>>
> .
> 

kernel@kyup.com Aug. 3, 2016, 3 p.m. UTC | #6
On 08/03/2016 05:54 PM, Pavel Emelyanov wrote:
> On 08/03/2016 05:17 PM, Nikolay Borisov wrote:
>>
>>
[SNIP]
>>
>> [CCing some people from openvz/CRIU]
> 
> Thanks :)
> 
>> My train of thought was "we should have a means which would be the one
>> universal truth about everything, and this would be a process in the
>> init_pid_ns". I don't have a strong preference as long as I'm not
>> breaking userspace. As I said before - I think the CRIU guys might be
>> using that interface.
> 
> This particular change won't break us mostly because we've switched to
> reading the /proc/pid/fdinfo/n files for locks.

[thinking out loud here]

I've never actually looked into those files, but now that I have, it
seems to make sense to also switch 'lsof' to reading the locks from the
available pid directories rather than relying on the global /proc/locks
interface. Oh well :)

[/thinking out loud here]
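
For illustration only, a rough sketch of what that lsof-side change
could look like -- hypothetical userspace code, not part of this patch:
walk every numeric /proc entry and print the 'lock:' lines found in its
fdinfo files.

/* Hypothetical illustration -- not part of the kernel patch above. */
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>

static void dump_fdinfo_locks(const char *pid)
{
	char path[64];
	DIR *dir;
	struct dirent *de;

	snprintf(path, sizeof(path), "/proc/%s/fdinfo", pid);
	dir = opendir(path);
	if (!dir)
		return;	/* process exited, or no permission */

	while ((de = readdir(dir)) != NULL) {
		char file[128], line[256];
		FILE *fp;

		if (de->d_name[0] == '.')
			continue;
		snprintf(file, sizeof(file), "%s/%s", path, de->d_name);
		fp = fopen(file, "r");
		if (!fp)
			continue;
		/* Only fds holding a lock carry a 'lock:' line. */
		while (fgets(line, sizeof(line), fp))
			if (strncmp(line, "lock:", 5) == 0)
				printf("pid %s fd %s: %s", pid, de->d_name, line);
		fclose(fp);
	}
	closedir(dir);
}

int main(void)
{
	DIR *proc = opendir("/proc");
	struct dirent *de;

	if (!proc)
		return 1;
	while ((de = readdir(proc)) != NULL)
		if (isdigit((unsigned char)de->d_name[0]))	/* numeric entries are pids */
			dump_fdinfo_locks(de->d_name);
	closedir(proc);
	return 0;
}

#include <string.h> is also needed for strncmp. Unlike a single read of
/proc/locks, this naturally sees only the pids visible in the reader's
pid namespace.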

> 
> -- Pavel
> 
>>>
>>>> +	    && (proc_pidns != ns_of_pid(fl->fl_nspid)))
>>>> +		return 0;
>>>> +
>>>>  	lock_get_status(f, fl, iter->li_pos, "");
>>>>  
>>>>  	list_for_each_entry(bfl, &fl->fl_block, fl_block)
>>>
>> .
>>
> 
J. Bruce Fields Aug. 3, 2016, 3:06 p.m. UTC | #7
On Wed, Aug 03, 2016 at 06:00:18PM +0300, Nikolay Borisov wrote:
> 
> 
> On 08/03/2016 05:54 PM, Pavel Emelyanov wrote:
> > On 08/03/2016 05:17 PM, Nikolay Borisov wrote:
> >>
> >>
> [SNIP]
> >>
> >> [CCing some people from openvz/CRIU]
> > 
> > Thanks :)
> > 
> >> My train of thought was "we should have a means which would be the one
> >> universal truth about everything, and this would be a process in the
> >> init_pid_ns". I don't have a strong preference as long as I'm not
> >> breaking userspace. As I said before - I think the CRIU guys might be
> >> using that interface.
> > 
> > This particular change won't break us mostly because we've switched to
> > reading the /proc/pid/fdinfo/n files for locks.
> 
> [thinking out loud here]
> 
> I've never actually looked into those files, but now that I have, it
> seems to make sense to also switch 'lsof' to reading the locks from the
> available pid directories rather than relying on the global /proc/locks
> interface. Oh well :)

Digging around...  Oh, I see, there's an optional 'lock:..' line in the
/proc/[pid]/fdinfo/[fd] file, is that what you're looking at?  I'd
forgotten.  Yeah, maybe that would make more sense long term.

--b.

> 
> [/thinking out loud here]
> 
> > 
> > -- Pavel
> > 
> >>>
> >>>> +	    && (proc_pidns != ns_of_pid(fl->fl_nspid)))
> >>>> +		return 0;
> >>>> +
> >>>>  	lock_get_status(f, fl, iter->li_pos, "");
> >>>>  
> >>>>  	list_for_each_entry(bfl, &fl->fl_block, fl_block)
> >>>
> >> .
> >>
> > 
kernel@kyup.com Aug. 3, 2016, 3:10 p.m. UTC | #8
On 08/03/2016 06:06 PM, J. Bruce Fields wrote:
> Digging around...  Oh, I see, there's an optional 'lock:..' line in the
> /proc/[pid]/fdinfo/[fd] file, is that what you're looking at?  I'd
> forgotten.  Yeah, maybe that would make more sense long term.

Yep, that's the one, but this requires userspace to be updated to use
that interface. In the meantime we could do away with some maintenance
of the existing /proc/locks :)
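
For reference, the fdinfo entry for a descriptor holding a lock looks
roughly like the following -- the field values here are illustrative,
following the format documented in Documentation/filesystems/proc.txt:

pos:	0
flags:	02004002
mnt_id:	9
lock:	1: POSIX  ADVISORY  WRITE 359 00:13:11860 0 EOF

The 'lock:' fields are the lock's ordinal, its class (POSIX, FLOCK,
OFDLCK, LEASE, ...), ADVISORY or MANDATORY, READ or WRITE, the owner's
pid, the major:minor:inode of the locked file, and the byte range
(with EOF for an open-ended range).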
Eric W. Biederman Aug. 3, 2016, 5:35 p.m. UTC | #9
Nikolay Borisov <kernel@kyup.com> writes:

> On 08/03/2016 06:06 PM, J. Bruce Fields wrote:
>> Digging around...  Oh, I see, there's an optional 'lock:..' line in the
>> /proc/[pid]/fdinfo/[fd] file, is that what you're looking at?  I'd
>> forgotten.  Yeah, maybe that would make more sense long term.
>
> Yep, that's the one, but this requires userspace to be updated to use
> that interface. In the meantime we could do away with some maintenance
> of the existing /proc/locks :)

I am tempted to say let's not change /proc/locks at all, but if locks
really are in a pid namespace then I do think it makes sense to filter
them in /proc just so there is not excessive visibility outside of the
pid namespace.

Excessive visibility is a problem on its own.

Eric


Patch

diff --git a/fs/locks.c b/fs/locks.c
index ee1b15f6fc13..751673d7f7fc 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -2648,9 +2648,15 @@ static int locks_show(struct seq_file *f, void *v)
 {
 	struct locks_iterator *iter = f->private;
 	struct file_lock *fl, *bfl;
+	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
+	struct pid_namespace *current_pidns = task_active_pid_ns(current);
 
 	fl = hlist_entry(v, struct file_lock, fl_link);
 
+	if ((current_pidns != &init_pid_ns) && fl->fl_nspid
+	    && (proc_pidns != ns_of_pid(fl->fl_nspid)))
+		return 0;
+
 	lock_get_status(f, fl, iter->li_pos, "");
 
 	list_for_each_entry(bfl, &fl->fl_block, fl_block)
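
Following the discussion above -- dropping the current_pidns special
case and filtering purely on the pid namespace that proc was mounted
in, as Jeff and Bruce suggested -- a follow-up might look roughly like
this untested sketch (not the patch as applied):

diff --git a/fs/locks.c b/fs/locks.c
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -2648,9 +2648,14 @@ static int locks_show(struct seq_file *f, void *v)
 {
 	struct locks_iterator *iter = f->private;
 	struct file_lock *fl, *bfl;
+	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
 
 	fl = hlist_entry(v, struct file_lock, fl_link);
 
+	/* Hide locks created outside the pidns this proc was mounted in. */
+	if (fl->fl_nspid && proc_pidns != ns_of_pid(fl->fl_nspid))
+		return 0;
+
 	lock_get_status(f, fl, iter->li_pos, "");
 
 	list_for_each_entry(bfl, &fl->fl_block, fl_block)

One subtlety this literal version leaves open: a lock created in a
child pid namespace would then be hidden even from a proc mounted in
the parent. Checking visibility with pid_nr_ns(fl->fl_nspid,
proc_pidns) instead of a direct namespace comparison would keep locks
from descendant namespaces visible.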