| Message ID | 1470209710-30022-1-git-send-email-kernel@kyup.com (mailing list archive) |
|---|---|
| State | New, archived |
| Headers | show |
On Wed, 2016-08-03 at 10:35 +0300, Nikolay Borisov wrote:
> On busy container servers reading /proc/locks shows all the locks
> created by all clients. This can cause large latency spikes. In my
> case I observed lsof taking up to 5-10 seconds while processing around
> 50k locks. Fix this by limiting the locks shown only to those created
> in the same pidns as the one the proc was mounted in. When reading
> /proc/locks from the init_pid_ns show everything.
>
> Signed-off-by: Nikolay Borisov <kernel@kyup.com>
> ---
>  fs/locks.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/fs/locks.c b/fs/locks.c
> index ee1b15f6fc13..751673d7f7fc 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -2648,9 +2648,15 @@ static int locks_show(struct seq_file *f, void *v)
>  {
>  	struct locks_iterator *iter = f->private;
>  	struct file_lock *fl, *bfl;
> +	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
> +	struct pid_namespace *current_pidns = task_active_pid_ns(current);
>
>  	fl = hlist_entry(v, struct file_lock, fl_link);
>
> +	if ((current_pidns != &init_pid_ns) && fl->fl_nspid

Ok, so when you read from a process that's in the init_pid_ns
namespace, then you'll get the whole pile of locks, even when reading
this from a filesystem that was mounted in a different pid_ns?

That seems odd to me if so. Any reason not to just uniformly use the
proc_pidns here?

> +	    && (proc_pidns != ns_of_pid(fl->fl_nspid)))
> +		return 0;
> +
>  	lock_get_status(f, fl, iter->li_pos, "");
>
>  	list_for_each_entry(bfl, &fl->fl_block, fl_block)
On 08/03/2016 04:46 PM, Jeff Layton wrote:
> On Wed, 2016-08-03 at 10:35 +0300, Nikolay Borisov wrote:
>> On busy container servers reading /proc/locks shows all the locks
>> created by all clients. This can cause large latency spikes. In my
>> case I observed lsof taking up to 5-10 seconds while processing around
>> 50k locks. Fix this by limiting the locks shown only to those created
>> in the same pidns as the one the proc was mounted in. When reading
>> /proc/locks from the init_pid_ns show everything.
>>
>> Signed-off-by: Nikolay Borisov <kernel@kyup.com>
>> ---
>>  fs/locks.c | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/fs/locks.c b/fs/locks.c
>> index ee1b15f6fc13..751673d7f7fc 100644
>> --- a/fs/locks.c
>> +++ b/fs/locks.c
>> @@ -2648,9 +2648,15 @@ static int locks_show(struct seq_file *f, void *v)
>>  {
>>  	struct locks_iterator *iter = f->private;
>>  	struct file_lock *fl, *bfl;
>> +	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
>> +	struct pid_namespace *current_pidns = task_active_pid_ns(current);
>>
>>  	fl = hlist_entry(v, struct file_lock, fl_link);
>>
>> +	if ((current_pidns != &init_pid_ns) && fl->fl_nspid
>
> Ok, so when you read from a process that's in the init_pid_ns
> namespace, then you'll get the whole pile of locks, even when reading
> this from a filesystem that was mounted in a different pid_ns?
>
> That seems odd to me if so. Any reason not to just uniformly use the
> proc_pidns here?

[CCing some people from openvz/CRIU]

My train of thought was "we should have means which would be the one
universal truth about everything and this would be a process in the
init_pid_ns". I don't have strong preference as long as I'm not breaking
userspace. As I said before - I think the CRIU guys might be using that
interface.
>> +	    && (proc_pidns != ns_of_pid(fl->fl_nspid)))
>> +		return 0;
>> +
>>  	lock_get_status(f, fl, iter->li_pos, "");
>>
>>  	list_for_each_entry(bfl, &fl->fl_block, fl_block)
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Aug 03, 2016 at 05:17:09PM +0300, Nikolay Borisov wrote:
> On 08/03/2016 04:46 PM, Jeff Layton wrote:
>> On Wed, 2016-08-03 at 10:35 +0300, Nikolay Borisov wrote:
>>>
>>> [SNIP]
>>>
>>> +	if ((current_pidns != &init_pid_ns) && fl->fl_nspid
>>
>> Ok, so when you read from a process that's in the init_pid_ns
>> namespace, then you'll get the whole pile of locks, even when reading
>> this from a filesystem that was mounted in a different pid_ns?
>>
>> That seems odd to me if so. Any reason not to just uniformly use the
>> proc_pidns here?
>
> [CCing some people from openvz/CRIU]
>
> My train of thought was "we should have means which would be the one
> universal truth about everything and this would be a process in the
> init_pid_ns".

OK, but why not make that means be "mount proc from the init_pid_ns and
read /proc/locks there". So just replace current_pidns with proc_pidns
in the above.
I think that's all Jeff was suggesting.

--b.

> I don't have strong preference as long as I'm not breaking
> userspace. As I said before - I think the CRIU guys might be using that
> interface.
On 08/03/2016 05:28 PM, J. Bruce Fields wrote:
> On Wed, Aug 03, 2016 at 05:17:09PM +0300, Nikolay Borisov wrote:
>> On 08/03/2016 04:46 PM, Jeff Layton wrote:
>>>
>>> [SNIP]
>>>
>>> Ok, so when you read from a process that's in the init_pid_ns
>>> namespace, then you'll get the whole pile of locks, even when reading
>>> this from a filesystem that was mounted in a different pid_ns?
>>>
>>> That seems odd to me if so. Any reason not to just uniformly use the
>>> proc_pidns here?
>>
>> [CCing some people from openvz/CRIU]
>>
>> My train of thought was "we should have means which would be the one
>> universal truth about everything and this would be a process in the
>> init_pid_ns".
>
> OK, but why not make that means be "mount proc from the init_pid_ns and
> read /proc/locks there".
> So just replace current_pidns with proc_pidns
> in the above.
>
> I think that's all Jeff was suggesting.

Oh, you are right. Silly me, yes, I'm happy with this and I will send a
patch.

> --b.
>
>> I don't have strong preference as long as I'm not breaking
>> userspace. As I said before - I think the CRIU guys might be using that
>> interface.
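Concretely, the simplification being agreed on here would reduce the posted hunk to something like the following (an untested sketch of the discussed change, not the follow-up patch that was actually sent):

```diff
@@ static int locks_show(struct seq_file *f, void *v)
 	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
-	struct pid_namespace *current_pidns = task_active_pid_ns(current);
 
 	fl = hlist_entry(v, struct file_lock, fl_link);
 
-	if ((current_pidns != &init_pid_ns) && fl->fl_nspid
-	    && (proc_pidns != ns_of_pid(fl->fl_nspid)))
+	if (fl->fl_nspid && proc_pidns != ns_of_pid(fl->fl_nspid))
 		return 0;
```

With this, a reader who wants the global view simply reads /proc/locks from a proc instance mounted in the init_pid_ns, with no special-casing of the reader's own namespace.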
On 08/03/2016 05:17 PM, Nikolay Borisov wrote:
> On 08/03/2016 04:46 PM, Jeff Layton wrote:
>>
>> [SNIP]
>>
>> Ok, so when you read from a process that's in the init_pid_ns
>> namespace, then you'll get the whole pile of locks, even when reading
>> this from a filesystem that was mounted in a different pid_ns?
>>
>> That seems odd to me if so. Any reason not to just uniformly use the
>> proc_pidns here?
>
> [CCing some people from openvz/CRIU]

Thanks :)

> My train of thought was "we should have means which would be the one
> universal truth about everything and this would be a process in the
> init_pid_ns". I don't have strong preference as long as I'm not breaking
> userspace. As I said before - I think the CRIU guys might be using that
> interface.
This particular change won't break us, mostly because we've switched to
reading the /proc/pid/fdinfo/n files for locks.

-- Pavel
On 08/03/2016 05:54 PM, Pavel Emelyanov wrote:
> On 08/03/2016 05:17 PM, Nikolay Borisov wrote:
>>
>> [SNIP]
>>
>> [CCing some people from openvz/CRIU]
>
> Thanks :)
>
>> My train of thought was "we should have means which would be the one
>> universal truth about everything and this would be a process in the
>> init_pid_ns". I don't have strong preference as long as I'm not breaking
>> userspace. As I said before - I think the CRIU guys might be using that
>> interface.
>
> This particular change won't break us mostly because we've switched to
> reading the /proc/pid/fdinfo/n files for locks.

[thinking out loud here]

I've never actually looked into those files, but now that I have, it seems
to make sense to also switch 'lsof' to actually reading the locks from
the available pids' directories rather than relying on the global
/proc/locks interface. Oh well :)

[/thinking out loud here]

> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers
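A rough illustration of the per-pid scanning idea above (not lsof's actual code, and it assumes a Linux /proc with the fdinfo "lock:" lines available, kernel 4.1 or later): take a lock of our own so the scan is guaranteed to find something, then walk every readable fdinfo file.

```python
import fcntl
import glob
import tempfile

# Hold an exclusive flock on a scratch file so the scan below is
# guaranteed to find at least one entry.
scratch = tempfile.NamedTemporaryFile()
fcntl.flock(scratch.fileno(), fcntl.LOCK_EX)

# Walk every process's fdinfo files, collecting the "lock:" lines that
# a tool like lsof could parse instead of the global /proc/locks.
locks = []
for path in glob.glob("/proc/[0-9]*/fdinfo/*"):
    try:
        with open(path) as f:
            locks += [(path, line.strip()) for line in f
                      if line.startswith("lock:")]
    except OSError:
        pass  # process exited or fd closed mid-scan; skip it

scratch.close()
print(len(locks))
```

Unlike /proc/locks, this naturally shows only the processes visible in the reader's pid namespace, which is exactly the filtering behaviour being discussed.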
On Wed, Aug 03, 2016 at 06:00:18PM +0300, Nikolay Borisov wrote:
> On 08/03/2016 05:54 PM, Pavel Emelyanov wrote:
>>
>> [SNIP]
>>
>> This particular change won't break us mostly because we've switched to
>> reading the /proc/pid/fdinfo/n files for locks.
>
> [thinking out loud here]
>
> I've never actually looked into those files but now that I have it seems
> to make sense to also switch 'lsof' to actually reading the locks from
> the available pids directories rather than relying on the global
> /proc/locks interface. Oh well :)

Digging around... Oh, I see, there's an optional 'lock:..' line in the
/proc/[pid]/fdinfo/[fd] files, is that what you're looking at? I'd
forgotten. Yeah, maybe that would make more sense long term.

--b.
On 08/03/2016 06:06 PM, J. Bruce Fields wrote:
> Digging around... Oh, I see, there's an optional 'lock:..' line in the
> /proc/[pid]/fdinfo/[fd] files, is that what you're looking at? I'd
> forgotten. Yeah, maybe that would make more sense long term.

Yep, that's the one, but it requires userspace to be updated to use
that interface. In the meantime we could do away with some maintenance
of the existing /proc/locks :)
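For reference, the 'lock:' line Bruce mentions can be seen by taking a lock and reading the owning process's own fdinfo entry. A small illustration (assumes a Linux /proc; the exact fields in the line vary by lock type and kernel version):

```python
import fcntl
import tempfile

# Take an exclusive BSD-style flock on a scratch file, then read
# /proc/self/fdinfo/<fd> for that descriptor to show the optional
# "lock:" line.
with tempfile.NamedTemporaryFile() as scratch:
    fcntl.flock(scratch.fileno(), fcntl.LOCK_EX)
    with open("/proc/self/fdinfo/%d" % scratch.fileno()) as info:
        lock_lines = [line.rstrip() for line in info
                      if line.startswith("lock:")]

for line in lock_lines:
    print(line)  # a FLOCK ADVISORY WRITE entry for our own lock
```

Since each fdinfo file is per-process and per-descriptor, a userspace tool reading these never has to sift through every lock on the host the way /proc/locks readers do.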
Nikolay Borisov <kernel@kyup.com> writes:

> On 08/03/2016 06:06 PM, J. Bruce Fields wrote:
>> Digging around... Oh, I see, there's an optional 'lock:..' line in the
>> /proc/[pid]/fdinfo/[fd] files, is that what you're looking at? I'd
>> forgotten. Yeah, maybe that would make more sense long term.
>
> Yep, that's the one but this requires the userspace to be updated to use
> that interface. In the meantime we could do away with some maintenance
> of the existing /proc/locks :)

I am tempted to say let's not change /proc/locks at all, but if locks
really are in a pid namespace then I do think it makes sense to filter
them in /proc, just so there is not excessive visibility outside of the
pid namespace. Excessive visibility is a problem on its own.

Eric
diff --git a/fs/locks.c b/fs/locks.c
index ee1b15f6fc13..751673d7f7fc 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -2648,9 +2648,15 @@ static int locks_show(struct seq_file *f, void *v)
 {
 	struct locks_iterator *iter = f->private;
 	struct file_lock *fl, *bfl;
+	struct pid_namespace *proc_pidns = file_inode(f->file)->i_sb->s_fs_info;
+	struct pid_namespace *current_pidns = task_active_pid_ns(current);
 
 	fl = hlist_entry(v, struct file_lock, fl_link);
 
+	if ((current_pidns != &init_pid_ns) && fl->fl_nspid
+	    && (proc_pidns != ns_of_pid(fl->fl_nspid)))
+		return 0;
+
 	lock_get_status(f, fl, iter->li_pos, "");
 
 	list_for_each_entry(bfl, &fl->fl_block, fl_block)
On busy container servers reading /proc/locks shows all the locks
created by all clients. This can cause large latency spikes. In my
case I observed lsof taking up to 5-10 seconds while processing around
50k locks. Fix this by limiting the locks shown only to those created
in the same pidns as the one the proc was mounted in. When reading
/proc/locks from the init_pid_ns show everything.

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
---
 fs/locks.c | 6 ++++++
 1 file changed, 6 insertions(+)