diff mbox

[RFC,1/3] procfs: fdinfo -- Extend information about epoll target files

Message ID 20170221171254.954209904@openvz.org (mailing list archive)
State New, archived
Headers show

Commit Message

Cyrill Gorcunov Feb. 21, 2017, 4:59 p.m. UTC
Since it is possbile to have same number in tfd field (say
file added, closed, then nother file dup'ed to same number
and added back) it is imposible to distinguish such target
files solely by their numbers.

Strictly speaking regular applications don't need to recognize
these targets at all but for checkpoint/restore sake we need
to collect targets to be able to push them back on restore
in proper order.

Thus lets add file position, inode and device number where
this target lays. This three fields can be used as a primary
key for sorting, and together with kcmp help CRIU can find
out an exact file target (from the whole set of processes
being checkpointed). 

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
CC: Al Viro <viro@zeniv.linux.org.uk>
CC: Andrew Morton <akpm@linuxfoundation.org>
CC: Andrey Vagin <avagin@openvz.org>
CC: Pavel Emelyanov <xemul@virtuozzo.com>
CC: Michael Kerrisk <mtk.manpages@gmail.com>
CC: Kir Kolyshkin <kir@openvz.org>
CC: Jason Baron <jbaron@akamai.com>
CC: Andy Lutomirski <luto@amacapital.net>
---
 Documentation/filesystems/proc.txt |    6 +++++-
 fs/eventpoll.c                     |    8 ++++++--
 2 files changed, 11 insertions(+), 3 deletions(-)

Comments

Andy Lutomirski Feb. 21, 2017, 6:41 p.m. UTC | #1
On Tue, Feb 21, 2017 at 8:59 AM, Cyrill Gorcunov <gorcunov@openvz.org> wrote:
> Since it is possbile to have same number in tfd field (say
> file added, closed, then nother file dup'ed to same number
> and added back) it is imposible to distinguish such target
> files solely by their numbers.
>
> Strictly speaking regular applications don't need to recognize
> these targets at all but for checkpoint/restore sake we need
> to collect targets to be able to push them back on restore
> in proper order.
>
> Thus lets add file position, inode and device number where
> this target lays. This three fields can be used as a primary
> key for sorting, and together with kcmp help CRIU can find
> out an exact file target (from the whole set of processes
> being checkpointed).

I have no problem with this, but I'm wondering whether kcmp's ordered
comparisons could also be used for this purpose.
Cyrill Gorcunov Feb. 21, 2017, 7:16 p.m. UTC | #2
On Tue, Feb 21, 2017 at 10:41:12AM -0800, Andy Lutomirski wrote:
> > Thus lets add file position, inode and device number where
> > this target lays. This three fields can be used as a primary
> > key for sorting, and together with kcmp help CRIU can find
> > out an exact file target (from the whole set of processes
> > being checkpointed).
> 
> I have no problem with this, but I'm wondering whether kcmp's ordered
> comparisons could also be used for this purpose.

Yes it can, but it would increas number of kcmp calls signisicantly.
Look, here is how we build files tree in criu: we take inode^sdev^pos
as a primary key and remember it inside rbtree while we're dumping files
(note also that we don't keep files opened but rather dump them in
chunks). Then once we find that two files have same primary key
we use kcmp to build subtree. This really helps a lot. And I plan
to do the same with target files from epolls:

 - they gonna be handled after all opened files of all processes
   in container (or children processes if dumping single task),
   thus the complete tree with primary key already will be built

 - on every target file I calculate primary key and then using
   kcmp will find if file is exactly one matching
Pavel Emelyanov Feb. 22, 2017, 7:44 a.m. UTC | #3
On 02/21/2017 10:16 PM, Cyrill Gorcunov wrote:
> On Tue, Feb 21, 2017 at 10:41:12AM -0800, Andy Lutomirski wrote:
>>> Thus lets add file position, inode and device number where
>>> this target lays. This three fields can be used as a primary
>>> key for sorting, and together with kcmp help CRIU can find
>>> out an exact file target (from the whole set of processes
>>> being checkpointed).
>>
>> I have no problem with this, but I'm wondering whether kcmp's ordered
>> comparisons could also be used for this purpose.
> 
> Yes it can, but it would increas number of kcmp calls signisicantly.

Actually it shouldn't. If you extend the kcmp argument to accept the
epollfd:epollslot pair, this would be effectively the same as if you
had all your epoll-ed files injected into your fdtable with "strange"
fd numbers. We already have two-level rbtree for this in criu, adding
extended ("strange") fd to it should be OK.

> Look, here is how we build files tree in criu: we take inode^sdev^pos
> as a primary key and remember it inside rbtree while we're dumping files
> (note also that we don't keep files opened but rather dump them in
> chunks). Then once we find that two files have same primary key
> we use kcmp to build subtree. This really helps a lot. And I plan
> to do the same with target files from epolls:
> 
>  - they gonna be handled after all opened files of all processes
>    in container (or children processes if dumping single task),
>    thus the complete tree with primary key already will be built
> 
>  - on every target file I calculate primary key and then using
>    kcmp will find if file is exactly one matching
> .
>
Cyrill Gorcunov Feb. 22, 2017, 7:54 a.m. UTC | #4
On Wed, Feb 22, 2017 at 10:44:07AM +0300, Pavel Emelyanov wrote:
> On 02/21/2017 10:16 PM, Cyrill Gorcunov wrote:
> > On Tue, Feb 21, 2017 at 10:41:12AM -0800, Andy Lutomirski wrote:
> >>> Thus lets add file position, inode and device number where
> >>> this target lays. This three fields can be used as a primary
> >>> key for sorting, and together with kcmp help CRIU can find
> >>> out an exact file target (from the whole set of processes
> >>> being checkpointed).
> >>
> >> I have no problem with this, but I'm wondering whether kcmp's ordered
> >> comparisons could also be used for this purpose.
> > 
> > Yes it can, but it would increas number of kcmp calls signisicantly.
> 
> Actually it shouldn't. If you extend the kcmp argument to accept the
> epollfd:epollslot pair, this would be effectively the same as if you
> had all your epoll-ed files injected into your fdtable with "strange"
> fd numbers. We already have two-level rbtree for this in criu, adding
> extended ("strange") fd to it should be OK.

Nope. Pavel, I guess you forget how we handle file tree in criu currently.
We call for kcmp only if we have to -- when primary key for two entries
is the same.
Pavel Emelyanov Feb. 22, 2017, 8:09 a.m. UTC | #5
On 02/22/2017 10:54 AM, Cyrill Gorcunov wrote:
> On Wed, Feb 22, 2017 at 10:44:07AM +0300, Pavel Emelyanov wrote:
>> On 02/21/2017 10:16 PM, Cyrill Gorcunov wrote:
>>> On Tue, Feb 21, 2017 at 10:41:12AM -0800, Andy Lutomirski wrote:
>>>>> Thus lets add file position, inode and device number where
>>>>> this target lays. This three fields can be used as a primary
>>>>> key for sorting, and together with kcmp help CRIU can find
>>>>> out an exact file target (from the whole set of processes
>>>>> being checkpointed).
>>>>
>>>> I have no problem with this, but I'm wondering whether kcmp's ordered
>>>> comparisons could also be used for this purpose.
>>>
>>> Yes it can, but it would increas number of kcmp calls signisicantly.
>>
>> Actually it shouldn't. If you extend the kcmp argument to accept the
>> epollfd:epollslot pair, this would be effectively the same as if you
>> had all your epoll-ed files injected into your fdtable with "strange"
>> fd numbers. We already have two-level rbtree for this in criu, adding
>> extended ("strange") fd to it should be OK.
> 
> Nope. Pavel, I guess you forget how we handle file tree in criu currently.
> We call for kcmp only if we have to -- when primary key for two entries
> is the same.

True, but the latter is an optimization to reduce the number of syscalls.

Look, in order to have a primary key you need to do some system call for the
fd you check (read from proc or stat the descriptor). But for target files in
e-polls you don't make this per-fd syscall to get primary key, just call the
kcmp instead.

-- Pavel
Cyrill Gorcunov Feb. 22, 2017, 8:18 a.m. UTC | #6
On Wed, Feb 22, 2017 at 11:09:23AM +0300, Pavel Emelyanov wrote:
> >>
> >> Actually it shouldn't. If you extend the kcmp argument to accept the
> >> epollfd:epollslot pair, this would be effectively the same as if you
> >> had all your epoll-ed files injected into your fdtable with "strange"
> >> fd numbers. We already have two-level rbtree for this in criu, adding
> >> extended ("strange") fd to it should be OK.
> > 
> > Nope. Pavel, I guess you forget how we handle file tree in criu currently.
> > We call for kcmp only if we have to -- when primary key for two entries
> > is the same.
> 
> True, but the latter is an optimization to reduce the number of syscalls.

Exactly. While syscalls are quite effective, they are still not coming
for free, so I'm trying to reduce their number as much as possible.

> Look, in order to have a primary key you need to do some system call for the
> fd you check (read from proc or stat the descriptor). But for target files in
> e-polls you don't make this per-fd syscall to get primary key, just call the
> kcmp instead.

I have to parse fdinfo anyway, because I need to fetch queued events and mask.
So I'll _have_ to make this per-fd syscall for parsing. And this opens
a way to optimize overall picture -- we can immediately read primary
key and reduce kcmp calls.
Pavel Emelyanov Feb. 22, 2017, 8:29 a.m. UTC | #7
On 02/22/2017 11:18 AM, Cyrill Gorcunov wrote:
> On Wed, Feb 22, 2017 at 11:09:23AM +0300, Pavel Emelyanov wrote:
>>>>
>>>> Actually it shouldn't. If you extend the kcmp argument to accept the
>>>> epollfd:epollslot pair, this would be effectively the same as if you
>>>> had all your epoll-ed files injected into your fdtable with "strange"
>>>> fd numbers. We already have two-level rbtree for this in criu, adding
>>>> extended ("strange") fd to it should be OK.
>>>
>>> Nope. Pavel, I guess you forget how we handle file tree in criu currently.
>>> We call for kcmp only if we have to -- when primary key for two entries
>>> is the same.
>>
>> True, but the latter is an optimization to reduce the number of syscalls.
> 
> Exactly. While syscalls are quite effective, they are still not coming
> for free, so I'm trying to reduce their number as much as possible.
> 
>> Look, in order to have a primary key you need to do some system call for the
>> fd you check (read from proc or stat the descriptor). But for target files in
>> e-polls you don't make this per-fd syscall to get primary key, just call the
>> kcmp instead.
> 
> I have to parse fdinfo anyway, because I need to fetch queued events and mask.
> So I'll _have_ to make this per-fd syscall for parsing. And this opens
> a way to optimize overall picture -- we can immediately read primary
> key and reduce kcmp calls.

You read fdinfo per-epoll, but kcmp-s we're talking here are about per-target-files.
So having dev:ino pair would help to reduce the number of kcmps, but even w/o
this extension we can work OK.

Besides, in most of the cases fd number you'd read from epoll's fdinfo will actually
be present in task's fdtable, so you can call a single kcmp, make sure the file is
correct and that's it. The need to actually _search_ for the runaway file with the
set of kcmp will (should) be quite rare case.

-- Pavel
Cyrill Gorcunov Feb. 22, 2017, 8:35 a.m. UTC | #8
On Wed, Feb 22, 2017 at 11:29:23AM +0300, Pavel Emelyanov wrote:
> On 02/22/2017 11:18 AM, Cyrill Gorcunov wrote:
> > On Wed, Feb 22, 2017 at 11:09:23AM +0300, Pavel Emelyanov wrote:
> >>>>
> >>>> Actually it shouldn't. If you extend the kcmp argument to accept the
> >>>> epollfd:epollslot pair, this would be effectively the same as if you
> >>>> had all your epoll-ed files injected into your fdtable with "strange"
> >>>> fd numbers. We already have two-level rbtree for this in criu, adding
> >>>> extended ("strange") fd to it should be OK.
> >>>
> >>> Nope. Pavel, I guess you forget how we handle file tree in criu currently.
> >>> We call for kcmp only if we have to -- when primary key for two entries
> >>> is the same.
> >>
> >> True, but the latter is an optimization to reduce the number of syscalls.
> > 
> > Exactly. While syscalls are quite effective, they are still not coming
> > for free, so I'm trying to reduce their number as much as possible.
> > 
> >> Look, in order to have a primary key you need to do some system call for the
> >> fd you check (read from proc or stat the descriptor). But for target files in
> >> e-polls you don't make this per-fd syscall to get primary key, just call the
> >> kcmp instead.
> > 
> > I have to parse fdinfo anyway, because I need to fetch queued events and mask.
> > So I'll _have_ to make this per-fd syscall for parsing. And this opens
> > a way to optimize overall picture -- we can immediately read primary
> > key and reduce kcmp calls.
> 
> You read fdinfo per-epoll, but kcmp-s we're talking here are about per-target-files.
> So having dev:ino pair would help to reduce the number of kcmps, but even w/o
> this extension we can work OK.

I didn't say we can't. But since we're reading fdinfo anyway it will help I don't
see a single reason why should not we take this opportunity to speedup.

> Besides, in most of the cases fd number you'd read from epoll's fdinfo will actually
> be present in task's fdtable, so you can call a single kcmp, make sure the file is
> correct and that's it. The need to actually _search_ for the runaway file with the
> set of kcmp will (should) be quite rare case.

Yes. But this rare cases are the reason why I started this series :( I would
love to not add new code at all but simply had to.
diff mbox

Patch

Index: linux-ml.git/Documentation/filesystems/proc.txt
===================================================================
--- linux-ml.git.orig/Documentation/filesystems/proc.txt
+++ linux-ml.git/Documentation/filesystems/proc.txt
@@ -1779,12 +1779,16 @@  pair provide additional information part
 	pos:	0
 	flags:	02
 	mnt_id:	9
-	tfd:        5 events:       1d data: ffffffffffffffff
+	tfd:        5 events:       1d data: ffffffffffffffff pos:0 ino:61af sdev:7
 
 	where 'tfd' is a target file descriptor number in decimal form,
 	'events' is events mask being watched and the 'data' is data
 	associated with a target [see epoll(7) for more details].
 
+	The 'pos' is current offset of the target file in decimal form
+	[see lseek(2)], 'ino' and 'sdev' are inode and device numbers
+	where target file resides, all in hex format.
+
 	Fsnotify files
 	~~~~~~~~~~~~~~
 	For inotify files the format is the following
Index: linux-ml.git/fs/eventpoll.c
===================================================================
--- linux-ml.git.orig/fs/eventpoll.c
+++ linux-ml.git/fs/eventpoll.c
@@ -883,10 +883,14 @@  static void ep_show_fdinfo(struct seq_fi
 	mutex_lock(&ep->mtx);
 	for (rbp = rb_first(&ep->rbr); rbp; rbp = rb_next(rbp)) {
 		struct epitem *epi = rb_entry(rbp, struct epitem, rbn);
+		struct inode *inode = file_inode(epi->ffd.file);
 
-		seq_printf(m, "tfd: %8d events: %8x data: %16llx\n",
+		seq_printf(m, "tfd: %8d events: %8x data: %16llx "
+			   " pos:%lli ino:%lx sdev:%x\n",
 			   epi->ffd.fd, epi->event.events,
-			   (long long)epi->event.data);
+			   (long long)epi->event.data,
+			   (long long)epi->ffd.file->f_pos,
+			   inode->i_ino, inode->i_sb->s_dev);
 		if (seq_has_overflowed(m))
 			break;
 	}