diff mbox

[3/3] eventfd: add internal reference counting to fix notifier race conditions

Message ID alpine.DEB.1.10.0906211613200.8816@makko.or.mcafeemobile.com (mailing list archive)
State New, archived
Headers show

Commit Message

Davide Libenzi June 21, 2009, 11:54 p.m. UTC
On Sun, 21 Jun 2009, Gregory Haskins wrote:

> This looks great, Davide.  I am fairly certain I can now solve the races
> and even implement Michael's DEASSIGN feature with this patch in place. 
> I will actually fire it up tomorrow when I am back in the office and
> give it a spin, but I do not spy any more races via visual inspection.
> 
> Kind Regards,
> -Greg
> 
> PS: I was wrong with my previous statement about requiring an embeddable
> object (eventfd_notifier for me, eventfd_pollcb for you).  I think you
> can technically solve this issue minimally by merely locking the POLLHUP
> and exposing the kref.  However, I think that leads to an more awkward
> interface (e.g. we already have eventfd_fget() plus we add a new one
> like eventfd_refget(), which might confuse users), so I prefer what you
> did here.  Just thought I would throw that out there in case you would
> prefer to change even fewer lines.

I actually ended up exposing the eventfd context anyway, since IMO is a 
better option instead of holding references to the eventfd file (that 
makes VFS people uneasy).
So eventfd_signal() now accepts the eventfd context instead of the file 
pointer.
Inside KAIO we had it holding a reference to the underlying file*, same 
thing LGUEST (Rusty Cc-ed), that has been changed to fit the new API.
And since eventfd_ctx_put() doesn't sleep, KAIO code got simpler a little.
The other change have been to make KAIO select EVENTFD, instead of making 
the ugly empty stubs inside eventfd.h to accomodate (KAIO && !EVENTFD).
The eventfd_pollcb_register() remained to be accepting a file*, first 
because it needs it ;) , second because it can unracely detect POLLHUP due 
to the caller holding a reference to the file*.




- Davide


---
 drivers/lguest/lg.h          |    4 -
 drivers/lguest/lguest_user.c |    4 -
 fs/aio.c                     |   24 +-----
 fs/eventfd.c                 |  170 ++++++++++++++++++++++++++++++++++++++++---
 include/linux/aio.h          |    5 -
 include/linux/eventfd.h      |   28 ++++---
 init/Kconfig                 |    1 
 7 files changed, 192 insertions(+), 44 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Gregory Haskins June 22, 2009, 4:05 p.m. UTC | #1
Davide Libenzi wrote:
> On Sun, 21 Jun 2009, Gregory Haskins wrote:
>
>   
>> This looks great, Davide.  I am fairly certain I can now solve the races
>> and even implement Michael's DEASSIGN feature with this patch in place. 
>> I will actually fire it up tomorrow when I am back in the office and
>> give it a spin, but I do not spy any more races via visual inspection.
>>
>> Kind Regards,
>> -Greg
>>
>> PS: I was wrong with my previous statement about requiring an embeddable
>> object (eventfd_notifier for me, eventfd_pollcb for you).  I think you
>> can technically solve this issue minimally by merely locking the POLLHUP
>> and exposing the kref.  However, I think that leads to an more awkward
>> interface (e.g. we already have eventfd_fget() plus we add a new one
>> like eventfd_refget(), which might confuse users), so I prefer what you
>> did here.  Just thought I would throw that out there in case you would
>> prefer to change even fewer lines.
>>     
>
> I actually ended up exposing the eventfd context anyway, since IMO is a 
> better option instead of holding references to the eventfd file (that 
> makes VFS people uneasy).
>   

I liked "version - 1"  better ;)

I think ultimately we still want to hold the fget() for
eventfd_signal(), as it is the producer side.  Without it, we have no
way of knowing when the last producer goes away if they happen to be an
in-kernel user.

Also, thinking about it some more, I think I like the non-wrapped pollcb
variant better now.  This is because I can uhook the wqh in the POLLHUP
and defer the kref_put() until later.  I think this combination gives me
the most ideal environment (even though technically I should not expect
more activity on the wait-queue after the POLLHUP).

In either case, I whipped up a super-lite patch with just the bare
minimum and then developed the KVM solution on top.  Things look really
good, now, and I am happy with the result.  As an aside, it only has the
following impact on the eventfd core:

 fs/eventfd.c            |   43 ++++++++++++++++++++++++++++++++++++-------
 include/linux/eventfd.h |    7 +++++++
 2 files changed, 43 insertions(+), 7 deletions(-)

Ill post the series, including the KVM portions, for review.

Thanks Davide,
-Greg
Davide Libenzi June 22, 2009, 5:01 p.m. UTC | #2
On Mon, 22 Jun 2009, Gregory Haskins wrote:

> Davide Libenzi wrote:
> > On Sun, 21 Jun 2009, Gregory Haskins wrote:
> >
> >   
> >> This looks great, Davide.  I am fairly certain I can now solve the races
> >> and even implement Michael's DEASSIGN feature with this patch in place. 
> >> I will actually fire it up tomorrow when I am back in the office and
> >> give it a spin, but I do not spy any more races via visual inspection.
> >>
> >> Kind Regards,
> >> -Greg
> >>
> >> PS: I was wrong with my previous statement about requiring an embeddable
> >> object (eventfd_notifier for me, eventfd_pollcb for you).  I think you
> >> can technically solve this issue minimally by merely locking the POLLHUP
> >> and exposing the kref.  However, I think that leads to an more awkward
> >> interface (e.g. we already have eventfd_fget() plus we add a new one
> >> like eventfd_refget(), which might confuse users), so I prefer what you
> >> did here.  Just thought I would throw that out there in case you would
> >> prefer to change even fewer lines.
> >>     
> >
> > I actually ended up exposing the eventfd context anyway, since IMO is a 
> > better option instead of holding references to the eventfd file (that 
> > makes VFS people uneasy).
> >   
> 
> I liked "version - 1"  better ;)
> 
> I think ultimately we still want to hold the fget() for
> eventfd_signal(), as it is the producer side.  Without it, we have no
> way of knowing when the last producer goes away if they happen to be an
> in-kernel user.

No you don't. Holding a file* reference does not give you any assurance 
whatsoever that the last consumer did not go away. The f_count value will 
simply be 1 (just you).
If a producer side want to take proper action when the last consumer 
leaves the building, it needs to register a pollcb hook and handle 
POLLHUP.
Exposing the opaque eventfd context, and basing the eventfd operations on 
that one (instead of the live referenced file*) is a cleaner interface.
Can you please base your IRQfd bits on top of that, so that we can give 
this IRQfd thread a closure?


- Davide


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gregory Haskins June 22, 2009, 5:43 p.m. UTC | #3
Davide Libenzi wrote:
> On Mon, 22 Jun 2009, Gregory Haskins wrote:
>
>   
>> Davide Libenzi wrote:
>>     
>>> On Sun, 21 Jun 2009, Gregory Haskins wrote:
>>>
>>>   
>>>       
>>>> This looks great, Davide.  I am fairly certain I can now solve the races
>>>> and even implement Michael's DEASSIGN feature with this patch in place. 
>>>> I will actually fire it up tomorrow when I am back in the office and
>>>> give it a spin, but I do not spy any more races via visual inspection.
>>>>
>>>> Kind Regards,
>>>> -Greg
>>>>
>>>> PS: I was wrong with my previous statement about requiring an embeddable
>>>> object (eventfd_notifier for me, eventfd_pollcb for you).  I think you
>>>> can technically solve this issue minimally by merely locking the POLLHUP
>>>> and exposing the kref.  However, I think that leads to an more awkward
>>>> interface (e.g. we already have eventfd_fget() plus we add a new one
>>>> like eventfd_refget(), which might confuse users), so I prefer what you
>>>> did here.  Just thought I would throw that out there in case you would
>>>> prefer to change even fewer lines.
>>>>     
>>>>         
>>> I actually ended up exposing the eventfd context anyway, since IMO is a 
>>> better option instead of holding references to the eventfd file (that 
>>> makes VFS people uneasy).
>>>   
>>>       
>> I liked "version - 1"  better ;)
>>
>> I think ultimately we still want to hold the fget() for
>> eventfd_signal(), as it is the producer side.  Without it, we have no
>> way of knowing when the last producer goes away if they happen to be an
>> in-kernel user.
>>     
>
> No you don't. Holding a file* reference does not give you any assurance 
> whatsoever that the last consumer did not go away. The f_count value will 
> simply be 1 (just you).
>   

I am probably confused or perhaps have the wrong terminology, but isnt
that "ok".  I am concerned about the consumer (the guy getting the
POLLINs) to be able to detect POLLHUP when the last producer
(f_ops->write() from userspace, eventfd_signal() from kernel) goes away.

Consider the following sequence:

-------------------

userspace calls "fd = eventfd()", and gives one to KVM as an irqfd, and
the other to some PCI-passthrough device.

The kvm/irqfd side acquires a kref, the pci side acquires a file.  At
this moment, userspace has the fd, and the pci device has the file (for
eventfd_signal()).  The fget() count is 2.  Userspace closes the fd
because its done with it, and the count drops to 1.

Some time later, pci does an fput(), and KVM sees the POLLHUP and cleans up.

-------------------

In this new model, the POLLHUP would have gone out as soon as userspace
closed the fd, even though the intended producer (the PCI device) and
the consumer (the KVM guest) are still up and running.  This doesnt seem
right to me.  Or am I missing something?

Thanks Davide,
-Greg
Davide Libenzi June 22, 2009, 6:03 p.m. UTC | #4
On Mon, 22 Jun 2009, Gregory Haskins wrote:

> I am probably confused or perhaps have the wrong terminology, but isnt
> that "ok".  I am concerned about the consumer (the guy getting the
> POLLINs) to be able to detect POLLHUP when the last producer
> (f_ops->write() from userspace, eventfd_signal() from kernel) goes away.
> 
> Consider the following sequence:
> 
> -------------------
> 
> userspace calls "fd = eventfd()", and gives one to KVM as an irqfd, and
> the other to some PCI-passthrough device.
> 
> The kvm/irqfd side acquires a kref, the pci side acquires a file.  At
> this moment, userspace has the fd, and the pci device has the file (for
> eventfd_signal()).  The fget() count is 2.  Userspace closes the fd
> because its done with it, and the count drops to 1.
> 
> Some time later, pci does an fput(), and KVM sees the POLLHUP and cleans up.
> 
> -------------------
> 
> In this new model, the POLLHUP would have gone out as soon as userspace
> closed the fd, even though the intended producer (the PCI device) and
> the consumer (the KVM guest) are still up and running.  This doesnt seem
> right to me.  Or am I missing something?

What you're doing there, is setting up a kernel-to-kernel (since 
userspace only role is to create the eventfd) communication, using a file* 
as accessory. That IMO is plain wrong.
If userspace is either the producer, or the consumer, and you need to 
handle userspace leaving the building, you need to:

	file = eventfd_fget(fd);
	ctx = eventfd_ctx_get(file); /* Eventually, if producer */
	eventfd_pollcb_register(file, ...);
	fput(file);

In your case of kernel-to-kernel scenario, why would you need eventfd at 
all, if userspace role in that model is simply to create it?
There are more effective ways to have in kernel communication channels, 
than resorting to userspace link facilities like eventfd.



- Davide


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gregory Haskins June 22, 2009, 6:31 p.m. UTC | #5
Davide Libenzi wrote:
> On Mon, 22 Jun 2009, Gregory Haskins wrote:
>
>   
>> I am probably confused or perhaps have the wrong terminology, but isnt
>> that "ok".  I am concerned about the consumer (the guy getting the
>> POLLINs) to be able to detect POLLHUP when the last producer
>> (f_ops->write() from userspace, eventfd_signal() from kernel) goes away.
>>
>> Consider the following sequence:
>>
>> -------------------
>>
>> userspace calls "fd = eventfd()", and gives one to KVM as an irqfd, and
>> the other to some PCI-passthrough device.
>>
>> The kvm/irqfd side acquires a kref, the pci side acquires a file.  At
>> this moment, userspace has the fd, and the pci device has the file (for
>> eventfd_signal()).  The fget() count is 2.  Userspace closes the fd
>> because its done with it, and the count drops to 1.
>>
>> Some time later, pci does an fput(), and KVM sees the POLLHUP and cleans up.
>>
>> -------------------
>>
>> In this new model, the POLLHUP would have gone out as soon as userspace
>> closed the fd, even though the intended producer (the PCI device) and
>> the consumer (the KVM guest) are still up and running.  This doesnt seem
>> right to me.  Or am I missing something?
>>     
>
> What you're doing there, is setting up a kernel-to-kernel (since 
> userspace only role is to create the eventfd) communication, using a file* 
> as accessory. That IMO is plain wrong.
> If userspace is either the producer, or the consumer, and you need to 
> handle userspace leaving the building, you need to:
>
> 	file = eventfd_fget(fd);
> 	ctx = eventfd_ctx_get(file); /* Eventually, if producer */
> 	eventfd_pollcb_register(file, ...);
> 	fput(file);
>
> In your case of kernel-to-kernel scenario, why would you need eventfd at 
> all, if userspace role in that model is simply to create it?
>   

The general thesis is for decoupling of the two subsystems.  In order to
do this, you need some form of polymorphism and an intermediate "handle"
mechanism which is userspace friendly.  File-descriptors already fit
this role neatly, with the "int fd" being the handle, and the f_ops
being the polymorphic interface.  Eventfd is of course, a subclass of
this concept in that it has these same general properties but with
signaling semantics (non-blocking collapsible events, etc).

Say, for example, you wanted disk IO completion events to generate an
interrupt into a guest.  One way to do this would, of course, modify all
the disk-io code so it knows how to directly inject a KVM guest
interrupt.   While this would work, someone would undoubtedly get flamed
for such a suggestion ;)

Another way to do it is to treat the AIO eventfd as the hook point. 
IIUC AIO already knows how to be an eventfd producer.  KVM, by virtue of
irqfd, already knows how to be an eventfd consumer.  So now kvm can
consume AIO, or it can consume userspace events equally well, and
without modification.  Neither side needs to know about the other per
se, other than the details on how to use the eventfd interface.

Don't get me wrong:  We expect userspace to use all this stuff too.  I
just expect that we will see all permutations of producer/consumer +
userspace/kernel combinations, so I want to retain that "all producers
have left" notification feature set.  Today eventfd supports producers
or consumers in userspace, and producers in the kernel.  This new work
we are doing adds consumer support in the kernel.  Kernel to kernel is
just a natural extension of that.

-Greg

> There are more effective ways to have in kernel communication channels, 
> than resorting to userspace link facilities like eventfd.
>
>
>
> - Davide
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
Davide Libenzi June 22, 2009, 6:40 p.m. UTC | #6
On Mon, 22 Jun 2009, Gregory Haskins wrote:

> The general thesis is for decoupling of the two subsystems.  In order to
> do this, you need some form of polymorphism and an intermediate "handle"
> mechanism which is userspace friendly.  File-descriptors already fit
> this role neatly, with the "int fd" being the handle, and the f_ops
> being the polymorphic interface.  Eventfd is of course, a subclass of
> this concept in that it has these same general properties but with
> signaling semantics (non-blocking collapsible events, etc).
> 
> Say, for example, you wanted disk IO completion events to generate an
> interrupt into a guest.  One way to do this would, of course, modify all
> the disk-io code so it knows how to directly inject a KVM guest
> interrupt.   While this would work, someone would undoubtedly get flamed
> for such a suggestion ;)
> 
> Another way to do it is to treat the AIO eventfd as the hook point. 
> IIUC AIO already knows how to be an eventfd producer.  KVM, by virtue of
> irqfd, already knows how to be an eventfd consumer.  So now kvm can
> consume AIO, or it can consume userspace events equally well, and
> without modification.  Neither side needs to know about the other per
> se, other than the details on how to use the eventfd interface.
> 
> Don't get me wrong:  We expect userspace to use all this stuff too.  I
> just expect that we will see all permutations of producer/consumer +
> userspace/kernel combinations, so I want to retain that "all producers
> have left" notification feature set.  Today eventfd supports producers
> or consumers in userspace, and producers in the kernel.  This new work
> we are doing adds consumer support in the kernel.  Kernel to kernel is
> just a natural extension of that.

A file* is the VFS link between userspace and the kernel. Is not a magical 
polymorphic interface to be used for whatever kernel side reasons.
Basing a kernel internal API over it is flawed.
On top of that, a single reference count does not put you on cover about 
the possible combinations of producers and consumers. For that, you'd need 
a pipe-like reference handling logic, that is way far from the eventfd scope.
So please stop making hypothetical cases about interface usages, and 
*when* we will have a real case, we'll see what the better handling for it 
will be.



- Davide


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin June 22, 2009, 6:41 p.m. UTC | #7
On Mon, Jun 22, 2009 at 11:03:22AM -0700, Davide Libenzi wrote:
> In your case of kernel-to-kernel scenario, why would you need eventfd at 
> all, if userspace role in that model is simply to create it?

That's not 100% true. We have a mode where userspace is the producer
and/or consumer (migration mode) and we switch between that and
direct kernel-to-kernel communication.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Davide Libenzi June 22, 2009, 6:51 p.m. UTC | #8
On Mon, 22 Jun 2009, Michael S. Tsirkin wrote:

> On Mon, Jun 22, 2009 at 11:03:22AM -0700, Davide Libenzi wrote:
> > In your case of kernel-to-kernel scenario, why would you need eventfd at 
> > all, if userspace role in that model is simply to create it?
> 
> That's not 100% true. We have a mode where userspace is the producer
> and/or consumer (migration mode) and we switch between that and
> direct kernel-to-kernel communication.

Then you'd need to ask yourself how to handle your complex case inside the 
KVM code, so that other eventfd users are not affected by the extra fat 
needed to handle your scenarios. Thing that seem to be continuosly tried.
A file* based kernel-to-kernel interface is rather wrong IMO.



- Davide


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin June 22, 2009, 7:05 p.m. UTC | #9
On Mon, Jun 22, 2009 at 11:51:42AM -0700, Davide Libenzi wrote:
> A file* based kernel-to-kernel interface is rather wrong IMO.

But eventfd_ctx should work fine.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gregory Haskins June 22, 2009, 7:16 p.m. UTC | #10
Davide Libenzi wrote:
> On Mon, 22 Jun 2009, Michael S. Tsirkin wrote:
>
>   
>> On Mon, Jun 22, 2009 at 11:03:22AM -0700, Davide Libenzi wrote:
>>     
>>> In your case of kernel-to-kernel scenario, why would you need eventfd at 
>>> all, if userspace role in that model is simply to create it?
>>>       
>> That's not 100% true. We have a mode where userspace is the producer
>> and/or consumer (migration mode) and we switch between that and
>> direct kernel-to-kernel communication.
>>     
>
> Then you'd need to ask yourself how to handle your complex case inside the 
> KVM code, so that other eventfd users are not affected by the extra fat 
> needed to handle your scenarios. Thing that seem to be continuosly tried.
> A file* based kernel-to-kernel interface is rather wrong IMO.
>   

Well, I will point out that the interface in question is
eventfd_signal(struct file *), and you were the one that invented it
afaict.  Can't help it if we like it :)

BTW: The termination point of that call is incidental.  Afterall, it
works the same for kernel-kernel or kernel-userspace.  The only relevant
discussion point here is that your proposal breaks POLLHUP for
eventfd_signal() users, and we have a real use case that cares.

Let me ask the question a different way: Lets say we all agree that
eventfd_signal() cannot be file* based to be correct.  Is there a way
you can think of that satisfies both the need to get rid of file* AND
still deliver producer-release notification to consumers.  Ultimately
that is all I care about.

-Greg
Gregory Haskins June 22, 2009, 7:26 p.m. UTC | #11
Michael S. Tsirkin wrote:
> On Mon, Jun 22, 2009 at 11:51:42AM -0700, Davide Libenzi wrote:
>   
>> A file* based kernel-to-kernel interface is rather wrong IMO.
>>     
>
> But eventfd_ctx should work fine.
>   

Yeah, and I guess we can always just say that qemu can't close the fd or
something.  Seems hacky, but it might work if Davide insists we need his
change.

-Greg
Davide Libenzi June 22, 2009, 7:29 p.m. UTC | #12
On Mon, 22 Jun 2009, Gregory Haskins wrote:

> Michael S. Tsirkin wrote:
> > On Mon, Jun 22, 2009 at 11:51:42AM -0700, Davide Libenzi wrote:
> >   
> >> A file* based kernel-to-kernel interface is rather wrong IMO.
> >>     
> >
> > But eventfd_ctx should work fine.
> >   
> 
> Yeah, and I guess we can always just say that qemu can't close the fd or
> something.  Seems hacky, but it might work if Davide insists we need his
> change.

Continuing here, since there's no reason of having many subthreads talking 
about the same thing.
Can you make a detailed example of what you're trying to achieve (no Hint 
Mode, please)?
As it sounds to me, that you need a consumer/producer reference counting, 
to cover your scenario correctly.



- Davide


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Davide Libenzi June 22, 2009, 7:54 p.m. UTC | #13
On Mon, 22 Jun 2009, Gregory Haskins wrote:

> Davide Libenzi wrote:
> > On Mon, 22 Jun 2009, Michael S. Tsirkin wrote:
> >
> >   
> >> On Mon, Jun 22, 2009 at 11:03:22AM -0700, Davide Libenzi wrote:
> >>     
> >>> In your case of kernel-to-kernel scenario, why would you need eventfd at 
> >>> all, if userspace role in that model is simply to create it?
> >>>       
> >> That's not 100% true. We have a mode where userspace is the producer
> >> and/or consumer (migration mode) and we switch between that and
> >> direct kernel-to-kernel communication.
> >>     
> >
> > Then you'd need to ask yourself how to handle your complex case inside the 
> > KVM code, so that other eventfd users are not affected by the extra fat 
> > needed to handle your scenarios. Thing that seem to be continuosly tried.
> > A file* based kernel-to-kernel interface is rather wrong IMO.
> >   
> 
> Well, I will point out that the interface in question is
> eventfd_signal(struct file *), and you were the one that invented it
> afaict.  Can't help it if we like it :)

Yes, I did. The case for eventfd was dual. First, it was a communication 
link between userspace and kernel, for things like KAIO. Second, it was a 
faster and smaller userspace replacement for things people used pipes 
before.
In all those cases, at least one reference of the file* is alive in 
userspace.
Even with the above case in mind, today I'd still use the eventfd_ctx as 
internal kernel API accessory.



- Davide


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gregory Haskins June 22, 2009, 8:06 p.m. UTC | #14
Davide Libenzi wrote:
> On Mon, 22 Jun 2009, Gregory Haskins wrote:
>
>   
>> Michael S. Tsirkin wrote:
>>     
>>> On Mon, Jun 22, 2009 at 11:51:42AM -0700, Davide Libenzi wrote:
>>>   
>>>       
>>>> A file* based kernel-to-kernel interface is rather wrong IMO.
>>>>     
>>>>         
>>> But eventfd_ctx should work fine.
>>>   
>>>       
>> Yeah, and I guess we can always just say that qemu can't close the fd or
>> something.  Seems hacky, but it might work if Davide insists we need his
>> change.
>>     
>
> Continuing here, since there's no reason of having many subthreads talking 
> about the same thing.
> Can you make a detailed example of what you're trying to achieve (no Hint 
> Mode, please)?
> As it sounds to me, that you need a consumer/producer reference counting, 
> to cover your scenario correctly.
>   

Well, one of them was already briefly mentioned (the PCI-passthrough
thing).  I am not personally working on this part (yet, anyway).

Another example of something I am actually working on as we speak would
be for this thing we are building called "virtual-bus".  It is a way to
build/deploy device models directly in the kernel.

In either of these cases, we have this concept of allowing the guest to
notify the host, or vice versa, that something happened.  Typically this
would be in reference to some chunk of shared memory, and the signaling
is telling the other side "I changed something, go look".

Without going into a ton of detail (unless, of course, you want it) is
that we are generalizing the signaling infrastructure (irqfd and
iosignalfd) so that something like PCI-passthrough or vbus are not
directly coupled to KVM.  They communicate to KVM purely in terms of
(among other things) these irqfd/iosignalfd interfaces.

Using vbus as an example (though others are similar):  vbus would
primarily exists as a kernel-model.  However, there would be a small
device model in qemu-kem userspace to publish something like a PCI
device that declares its resource requirements to the guest.   Some of
those requirements would be things like how many interrupts it needs,
and what IO ranges it supports, etc.  When the guest programs the PCI
space, it maps the resources from its own world into the virtual PCI
resources emulated in qemu.

So up in userspace, the vbus pci-device would have an open reference to
the kvm guest (derived from /dev/kvm) and an open reference to a vbus
(derived from /dev/vbus).  Lets call these kvmfd, and vbusfd,
respectively.  For something like an interrupt, we would hook the point
where the PCI-MSI interrupt is assigned, and would do the following:

gsi = kvm_irq_route_gsi();
fd = eventfd(0, 0);
ioctl(kvmfd, KVM_IRQFD_ASSIGN, {fd, gsi});
ioctl(vbusfd, VBUS_SHMSIGNAL_ASSIGN, {sigid, fd});

So userspace orchestrated the assignment of this one eventfd to a KVM
consumer, and a VBUS producer.  The two subsystems do not care about the
details of the other side of the link, per se.  VBUS just knows that it
can eventfd_signal() its memory region to tell whomever is listening
that it changed.  Likewise, KVM just knows to inject "gsi" when it gets
signalled.  You could equally have given "fd" to a userspace thread for
either producer or consumer roles, or any other combination.

If we were doing PCI-passthough, substitute the last SHMSIGNAL_ASSIGN
ioctl call with some PCI_PASSTHROUGH_ASSIGN verb and you get the idea.

The important thing is that once this is established, userspace doesn't
necessarily care about the fd anymore.  So now the question is: do we
keep it around for other things?  Do we keep it around because we don't
want KVM to see the POLLHUP, or do we address the "release" code so that
it works even if userspace issued close(fd) at this point.  I am not
sure what the answer is, but this is the scenario we are concerned with
in this thread.  In the example above, vbus is free to produce events on
its eventfd until it gets a SHMSIGNAL_DEASSIGN request.

-Greg

>
>
> - Davide
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
Michael S. Tsirkin June 22, 2009, 8:28 p.m. UTC | #15
On Mon, Jun 22, 2009 at 03:26:41PM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Mon, Jun 22, 2009 at 11:51:42AM -0700, Davide Libenzi wrote:
> >   
> >> A file* based kernel-to-kernel interface is rather wrong IMO.
> >>     
> >
> > But eventfd_ctx should work fine.
> >   
> 
> Yeah, and I guess we can always just say that qemu can't close the fd or
> something.  Seems hacky, but it might work if Davide insists we need his
> change.

Yes, I think that's fine.
Davide Libenzi June 22, 2009, 10:53 p.m. UTC | #16
On Mon, 22 Jun 2009, Gregory Haskins wrote:

> So up in userspace, the vbus pci-device would have an open reference to
> the kvm guest (derived from /dev/kvm) and an open reference to a vbus
> (derived from /dev/vbus).  Lets call these kvmfd, and vbusfd,
> respectively.  For something like an interrupt, we would hook the point
> where the PCI-MSI interrupt is assigned, and would do the following:
> 
> gsi = kvm_irq_route_gsi();
> fd = eventfd(0, 0);
> ioctl(kvmfd, KVM_IRQFD_ASSIGN, {fd, gsi});
> ioctl(vbusfd, VBUS_SHMSIGNAL_ASSIGN, {sigid, fd});
> 
> So userspace orchestrated the assignment of this one eventfd to a KVM
> consumer, and a VBUS producer.  The two subsystems do not care about the
> details of the other side of the link, per se.  VBUS just knows that it
> can eventfd_signal() its memory region to tell whomever is listening
> that it changed.  Likewise, KVM just knows to inject "gsi" when it gets
> signalled.  You could equally have given "fd" to a userspace thread for
> either producer or consumer roles, or any other combination.
> 
> If we were doing PCI-passthough, substitute the last SHMSIGNAL_ASSIGN
> ioctl call with some PCI_PASSTHROUGH_ASSIGN verb and you get the idea.
> 
> The important thing is that once this is established, userspace doesn't
> necessarily care about the fd anymore.  So now the question is: do we
> keep it around for other things?  Do we keep it around because we don't
> want KVM to see the POLLHUP, or do we address the "release" code so that
> it works even if userspace issued close(fd) at this point.  I am not
> sure what the answer is, but this is the scenario we are concerned with
> in this thread.  In the example above, vbus is free to produce events on
> its eventfd until it gets a SHMSIGNAL_DEASSIGN request.

I see.
The thing remains, that in order to reliably handle generic 
producer/consumer scenarios you'd need a reference counting similar to 
pipes, where the notion of producer and consumer is very well defined.
In your case, it'd be sufficent to expose an:

	int eventfd_wakeup(struct eventfd_ctx *ctx, void *key);

So that each end can signal, in an file* f_count independent way, when 
they're about to leave. Say with a new status bit.
The problem, that I felt coming since when I introduced the key-based 
wakeups, is that right now the "key" start to become a little restrictive 
in terms of amount of information carried by it.
If there are other key-aware waiters on "wqh", your new bit cannot 
conflict with existing ones, and you (and future users of the interface) 
will be polluting the global bit namespace.
Probably turning "key" into a pointer to a structure like:

	struct wakeup_key {
		unsigned long type;
		void *key;
	};

So the current:

	wake_up_poll(wq, POLLIN);

Would become:

	key.type = KT_POLLMASK;
	key.key = POLLIN;
	wake_up_key(wq, &key);

At that point the waiters (like poll/epoll) can simply discard the wakeup 
types they are not interested into (poll/epoll would care only about 
KT_POLLMASK), and something more than a simple bitmask can be carried by 
the wakups. Like:

	key.type = KT_SOMEDATA;
	key.key = &data;
	wake_up_key(wq, &key);

Of course, we could build another layer on top, to fit this model, but 
then we'd have the dual citizenship among normal wakeups and new-interface 
wakeups/signaling.
Another thing. Now that the interface exposes the eventfd_ctx context 
directly, the pollcb register/unregister does not have any reason to be 
inside eventfd (they're totally generic bits at that point), so my thought 
was about to drop them.



- Davide


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gregory Haskins June 23, 2009, 1:03 a.m. UTC | #17
Davide Libenzi wrote:
> On Mon, 22 Jun 2009, Gregory Haskins wrote:
>
>   
>> So up in userspace, the vbus pci-device would have an open reference to
>> the kvm guest (derived from /dev/kvm) and an open reference to a vbus
>> (derived from /dev/vbus).  Lets call these kvmfd, and vbusfd,
>> respectively.  For something like an interrupt, we would hook the point
>> where the PCI-MSI interrupt is assigned, and would do the following:
>>
>> gsi = kvm_irq_route_gsi();
>> fd = eventfd(0, 0);
>> ioctl(kvmfd, KVM_IRQFD_ASSIGN, {fd, gsi});
>> ioctl(vbusfd, VBUS_SHMSIGNAL_ASSIGN, {sigid, fd});
>>
>> So userspace orchestrated the assignment of this one eventfd to a KVM
>> consumer, and a VBUS producer.  The two subsystems do not care about the
>> details of the other side of the link, per se.  VBUS just knows that it
>> can eventfd_signal() its memory region to tell whomever is listening
>> that it changed.  Likewise, KVM just knows to inject "gsi" when it gets
>> signalled.  You could equally have given "fd" to a userspace thread for
>> either producer or consumer roles, or any other combination.
>>
>> If we were doing PCI-passthough, substitute the last SHMSIGNAL_ASSIGN
>> ioctl call with some PCI_PASSTHROUGH_ASSIGN verb and you get the idea.
>>
>> The important thing is that once this is established, userspace doesn't
>> necessarily care about the fd anymore.  So now the question is: do we
>> keep it around for other things?  Do we keep it around because we don't
>> want KVM to see the POLLHUP, or do we address the "release" code so that
>> it works even if userspace issued close(fd) at this point.  I am not
>> sure what the answer is, but this is the scenario we are concerned with
>> in this thread.  In the example above, vbus is free to produce events on
>> its eventfd until it gets a SHMSIGNAL_DEASSIGN request.
>>     
>
> I see.
> The thing remains, that in order to reliably handle generic 
> producer/consumer scenarios you'd need a reference counting similar to 
> pipes, where the notion of producer and consumer is very well defined.
>   

I see your point.

Well, I think the more important thing here is that we address the
races, and add support for DEASSIGN.  We can do both of those things
with any of the patches that you and I have been kicking around.  So
what I propose is that we move forward with whatever patch you bless as
proper for now.  This producer-release issue is pretty minor in the
grand scheme of things.  We can always just have userspace hold the fd.

I can either take in the last one you sent, or it sounds like you wanted
to possibly do another round of cleanup?  Whatever the case may be, let
me know and we can coordinate with Andrew/Avi on what tree to pull it
into.  It sounds like riding in kvm.git is the perhaps the most logical.

Thanks Davide,
-Greg
Davide Libenzi June 23, 2009, 1:17 a.m. UTC | #18
On Mon, 22 Jun 2009, Gregory Haskins wrote:

> Davide Libenzi wrote:
> > On Mon, 22 Jun 2009, Gregory Haskins wrote:
> >
> >   
> >> So up in userspace, the vbus pci-device would have an open reference to
> >> the kvm guest (derived from /dev/kvm) and an open reference to a vbus
> >> (derived from /dev/vbus).  Lets call these kvmfd, and vbusfd,
> >> respectively.  For something like an interrupt, we would hook the point
> >> where the PCI-MSI interrupt is assigned, and would do the following:
> >>
> >> gsi = kvm_irq_route_gsi();
> >> fd = eventfd(0, 0);
> >> ioctl(kvmfd, KVM_IRQFD_ASSIGN, {fd, gsi});
> >> ioctl(vbusfd, VBUS_SHMSIGNAL_ASSIGN, {sigid, fd});
> >>
> >> So userspace orchestrated the assignment of this one eventfd to a KVM
> >> consumer, and a VBUS producer.  The two subsystems do not care about the
> >> details of the other side of the link, per se.  VBUS just knows that it
> >> can eventfd_signal() its memory region to tell whomever is listening
> >> that it changed.  Likewise, KVM just knows to inject "gsi" when it gets
> >> signalled.  You could equally have given "fd" to a userspace thread for
> >> either producer or consumer roles, or any other combination.
> >>
> >> If we were doing PCI-passthough, substitute the last SHMSIGNAL_ASSIGN
> >> ioctl call with some PCI_PASSTHROUGH_ASSIGN verb and you get the idea.
> >>
> >> The important thing is that once this is established, userspace doesn't
> >> necessarily care about the fd anymore.  So now the question is: do we
> >> keep it around for other things?  Do we keep it around because we don't
> >> want KVM to see the POLLHUP, or do we address the "release" code so that
> >> it works even if userspace issued close(fd) at this point.  I am not
> >> sure what the answer is, but this is the scenario we are concerned with
> >> in this thread.  In the example above, vbus is free to produce events on
> >> its eventfd until it gets a SHMSIGNAL_DEASSIGN request.
> >>     
> >
> > I see.
> > The thing remains, that in order to reliably handle generic 
> > producer/consumer scenarios you'd need a reference counting similar to 
> > pipes, where the notion of producer and consumer is very well defined.
> >   
> 
> I see your point.
> 
> Well, I think the more important thing here is that we address the
> races, and add support for DEASSIGN.  We can do both of those things
> with any of the patches that you and I have been kicking around.  So
> what I propose is that we move forward with whatever patch you bless as
> proper for now.  This producer-release issue is pretty minor in the
> grand scheme of things.  We can always just have userspace hold the fd.

Is it a real problem? Can it be decently handled on the KVM side?
Reason I'm asking, is that I wouldn't want to change the interface, only 
to find it unsuitable a few weeks later.




> I can either take in the last one you sent, or it sounds like you wanted
> to possibly do another round of cleanup?  Whatever the case may be, let
> me know and we can coordinate with Andrew/Avi on what tree to pull it
> into.  It sounds like riding in kvm.git is the perhaps the most logical.

Yes, I'd like to drop the pollcb bits (that you can implement KVM-side), 
at least.
And yes, going kvm.git is probably the best path.



- Davide


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gregory Haskins June 23, 2009, 1:26 a.m. UTC | #19
Davide Libenzi wrote:
> On Mon, 22 Jun 2009, Gregory Haskins wrote:
>
>   
>> Davide Libenzi wrote:
>>     
>>> On Mon, 22 Jun 2009, Gregory Haskins wrote:
>>>
>>>   
>>>       
>>>> So up in userspace, the vbus pci-device would have an open reference to
>>>> the kvm guest (derived from /dev/kvm) and an open reference to a vbus
>>>> (derived from /dev/vbus).  Lets call these kvmfd, and vbusfd,
>>>> respectively.  For something like an interrupt, we would hook the point
>>>> where the PCI-MSI interrupt is assigned, and would do the following:
>>>>
>>>> gsi = kvm_irq_route_gsi();
>>>> fd = eventfd(0, 0);
>>>> ioctl(kvmfd, KVM_IRQFD_ASSIGN, {fd, gsi});
>>>> ioctl(vbusfd, VBUS_SHMSIGNAL_ASSIGN, {sigid, fd});
>>>>
>>>> So userspace orchestrated the assignment of this one eventfd to a KVM
>>>> consumer, and a VBUS producer.  The two subsystems do not care about the
>>>> details of the other side of the link, per se.  VBUS just knows that it
>>>> can eventfd_signal() its memory region to tell whomever is listening
>>>> that it changed.  Likewise, KVM just knows to inject "gsi" when it gets
>>>> signalled.  You could equally have given "fd" to a userspace thread for
>>>> either producer or consumer roles, or any other combination.
>>>>
>>>> If we were doing PCI-passthough, substitute the last SHMSIGNAL_ASSIGN
>>>> ioctl call with some PCI_PASSTHROUGH_ASSIGN verb and you get the idea.
>>>>
>>>> The important thing is that once this is established, userspace doesn't
>>>> necessarily care about the fd anymore.  So now the question is: do we
>>>> keep it around for other things?  Do we keep it around because we don't
>>>> want KVM to see the POLLHUP, or do we address the "release" code so that
>>>> it works even if userspace issued close(fd) at this point.  I am not
>>>> sure what the answer is, but this is the scenario we are concerned with
>>>> in this thread.  In the example above, vbus is free to produce events on
>>>> its eventfd until it gets a SHMSIGNAL_DEASSIGN request.
>>>>     
>>>>         
>>> I see.
>>> The thing remains, that in order to reliably handle generic 
>>> producer/consumer scenarios you'd need a reference counting similar to 
>>> pipes, where the notion of producer and consumer is very well defined.
>>>   
>>>       
>> I see your point.
>>
>> Well, I think the more important thing here is that we address the
>> races, and add support for DEASSIGN.  We can do both of those things
>> with any of the patches that you and I have been kicking around.  So
>> what I propose is that we move forward with whatever patch you bless as
>> proper for now.  This producer-release issue is pretty minor in the
>> grand scheme of things.  We can always just have userspace hold the fd.
>>     
>
> Is it a real problem? Can it be decently handled on the KVM side?
> Reason I'm asking, is that I wouldn't want to change the interface, only 
> to find it unsuitable a few weeks later.
>   

To be honest, I am not sure.  I would guess its not a *huge* deal,
though it was obviously enough of a concern to at least discuss it.  I
can definitely say that I think the other issues which are being fixed
are substantially more important.

>
>
>
>   
>> I can either take in the last one you sent, or it sounds like you wanted
>> to possibly do another round of cleanup?  Whatever the case may be, let
>> me know and we can coordinate with Andrew/Avi on what tree to pull it
>> into.  It sounds like riding in kvm.git is the perhaps the most logical.
>>     
>
> Yes, I'd like to drop the pollcb bits (that you can implement KVM-side), 
> at least.
> And yes, going kvm.git is probably the best path.
>   

Sounds good.  Thanks for your patience, Davide.  I think we are almost
there! ;)

-Greg

>
>
> - Davide
>
>
>
Rusty Russell June 23, 2009, 3:25 a.m. UTC | #20
On Mon, 22 Jun 2009 09:24:20 am Davide Libenzi wrote:
> I actually ended up exposing the eventfd context anyway, since IMO is a
> better option instead of holding references to the eventfd file (that
> makes VFS people uneasy).

lguest is a special case since we don't let them close the fds, except by 
closing /dev/lguest.

Lguest change seems fine, but:

> Index: linux-2.6.mod/drivers/lguest/lg.h
> ===================================================================
> --- linux-2.6.mod.orig/drivers/lguest/lg.h	2009-06-21 15:55:17.000000000
> -0700 +++ linux-2.6.mod/drivers/lguest/lg.h	2009-06-21 15:56:00.000000000
> -0700 @@ -80,9 +80,11 @@ struct lg_cpu {
>  	struct lg_cpu_arch arch;
>  };
>
> +struct eventfd_ctx;
> +
>  struct lg_eventfd {
>  	unsigned long addr;
> -	struct file *event;
> +	struct eventfd_ctx *event;
>  };

The first 'struct eventfd_ctx;' line is not required.

Thanks,
Rusty.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Davide Libenzi June 23, 2009, 2:29 p.m. UTC | #21
On Mon, 22 Jun 2009, Gregory Haskins wrote:

> To be honest, I am not sure.  I would guess its not a *huge* deal,
> though it was obviously enough of a concern to at least discuss it.  I
> can definitely say that I think the other issues which are being fixed
> are substantially more important.

Ok then, will repost the revised patch later today.


- Davide


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Davide Libenzi June 23, 2009, 2:31 p.m. UTC | #22
On Tue, 23 Jun 2009, Rusty Russell wrote:

> The first 'struct eventfd_ctx;' line is not required.

Will repost dropping that.


- Davide


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Davide Libenzi June 23, 2009, 2:35 p.m. UTC | #23
On Tue, 23 Jun 2009, Gregory Haskins wrote:

> Davide Libenzi wrote:
> > On Mon, 22 Jun 2009, Gregory Haskins wrote:
> >
> >   
> >> To be honest, I am not sure.  I would guess its not a *huge* deal,
> >> though it was obviously enough of a concern to at least discuss it.  I
> >> can definitely say that I think the other issues which are being fixed
> >> are substantially more important.
> >>     
> >
> > Ok then, will repost the revised patch later today.
> >   
> 
> Ok sounds good.  I did have a chance to take a closer look at your
> proposal for the key data, and I think it makes a lot of sense.  Do you
> see it as a problem if we defer adding that?  Or should we try to add
> that notion now?

That would need to go eventually via mainline, after some discussion. But 
yes, I believe that using the "key" as simple bitmask is a little 
restrictive.


- Davide


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gregory Haskins June 23, 2009, 2:37 p.m. UTC | #24
Davide Libenzi wrote:
> On Mon, 22 Jun 2009, Gregory Haskins wrote:
>
>   
>> To be honest, I am not sure.  I would guess its not a *huge* deal,
>> though it was obviously enough of a concern to at least discuss it.  I
>> can definitely say that I think the other issues which are being fixed
>> are substantially more important.
>>     
>
> Ok then, will repost the revised patch later today.
>   

Ok sounds good.  I did have a chance to take a closer look at your
proposal for the key data, and I think it makes a lot of sense.  Do you
see it as a problem if we defer adding that?  Or should we try to add
that notion now?

-Greg
Gregory Haskins June 23, 2009, 2:42 p.m. UTC | #25
Davide Libenzi wrote:
> On Tue, 23 Jun 2009, Gregory Haskins wrote:
>
>   
>> Davide Libenzi wrote:
>>     
>>> On Mon, 22 Jun 2009, Gregory Haskins wrote:
>>>
>>>   
>>>       
>>>> To be honest, I am not sure.  I would guess its not a *huge* deal,
>>>> though it was obviously enough of a concern to at least discuss it.  I
>>>> can definitely say that I think the other issues which are being fixed
>>>> are substantially more important.
>>>>     
>>>>         
>>> Ok then, will repost the revised patch later today.
>>>   
>>>       
>> Ok sounds good.  I did have a chance to take a closer look at your
>> proposal for the key data, and I think it makes a lot of sense.  Do you
>> see it as a problem if we defer adding that?  Or should we try to add
>> that notion now?
>>     
>
> That would need to go eventually via mainline, after some discussion. But 
> yes, I believe that using the "key" as simple bitmask is a little 
> restrictive.
>   

Ok.   As long as you do not think we are somehow painting ourselves into
a corner, lets just go with your original plan for the revised patch
sans key changes.

Thanks Davide,
-Greg
Michael S. Tsirkin June 23, 2009, 3:04 p.m. UTC | #26
On Tue, Jun 23, 2009 at 07:35:00AM -0700, Davide Libenzi wrote:
> On Tue, 23 Jun 2009, Gregory Haskins wrote:
> 
> > Davide Libenzi wrote:
> > > On Mon, 22 Jun 2009, Gregory Haskins wrote:
> > >
> > >   
> > >> To be honest, I am not sure.  I would guess its not a *huge* deal,
> > >> though it was obviously enough of a concern to at least discuss it.  I
> > >> can definitely say that I think the other issues which are being fixed
> > >> are substantially more important.
> > >>     
> > >
> > > Ok then, will repost the revised patch later today.
> > >   
> > 
> > Ok sounds good.  I did have a chance to take a closer look at your
> > proposal for the key data, and I think it makes a lot of sense.  Do you
> > see it as a problem if we defer adding that?  Or should we try to add
> > that notion now?
> 
> That would need to go eventually via mainline, after some discussion. But 
> yes, I believe that using the "key" as simple bitmask is a little 
> restrictive.
> 
> 
> - Davide

I agree.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rusty Russell June 24, 2009, 3:25 a.m. UTC | #27
On Tue, 23 Jun 2009 03:33:22 am Davide Libenzi wrote:
> What you're doing there, is setting up a kernel-to-kernel (since
> userspace only role is to create the eventfd) communication, using a file*
> as accessory. That IMO is plain wrong.

The most sensible is that userspace can use these fds; an in-kernel variant is 
possible too, but not primary IMHO.

It's nice that userspace create the fds; it can then use the same fd for 
multiple event sources.

But I didn't see anything wrong with the way eventfd used to work: you have a 
kvm ioctl to say "attach this eventfd to this guest notification" and that does 
the eventfd_fget.  A detach ioctl does the fput (as does release of the kvm 
fd).

If they close the eventfd and don't do the detach ioctl, it's their problem.

Rusty.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Davide Libenzi June 24, 2009, 10:45 p.m. UTC | #28
On Wed, 24 Jun 2009, Rusty Russell wrote:

> On Tue, 23 Jun 2009 03:33:22 am Davide Libenzi wrote:
> > What you're doing there, is setting up a kernel-to-kernel (since
> > userspace only role is to create the eventfd) communication, using a file*
> > as accessory. That IMO is plain wrong.
> 
> The most sensible is that userspace can use these fds; an in-kernel variant is 
> possible too, but not primary IMHO.
> 
> It's nice that userspace create the fds; it can then use the same fd for 
> multiple event sources.
> 
> But I didn't see anything wrong with the way eventfd used to work: you have a 
> kvm ioctl to say "attach this eventfd to this guest notification" and that does 
> the eventfd_fget.  A detach ioctl does the fput (as does release of the kvm 
> fd).
> 
> If they close the eventfd and don't do the detach ioctl, it's their problem.

Some components would like to know if userspace dropped the fd, and take 
proper action accordingly (release resources, drop module instances, etc...).
The POLLHUP helps with that, but w/out decoupling file* memory from 
eventfd_ctx memory, it becomes pretty tricky (if feasible at all) to 
handle the event in a race-free way.
Once the file* is decoupled from the eventfd_ctx, it becomes saner to 
expose the internal kernel API via the eventfd_ctx.
Another thing that comes in my mind (that for some components might not 
matter) is considering the effect of userspace doing things like:

	for (;;) {
		fd = eventfd(...);
		ioctl(xfd, XXX_ADD, fd);
		close(fd);
	}

That might lead to unprivileged users drawing kernel memory w/out any 
userspace accountability, if not properly handled.



- Davide


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rusty Russell June 25, 2009, 11:42 a.m. UTC | #29
On Thu, 25 Jun 2009 08:15:11 am Davide Libenzi wrote:
> On Wed, 24 Jun 2009, Rusty Russell wrote:
> > On Tue, 23 Jun 2009 03:33:22 am Davide Libenzi wrote:
> > > What you're doing there, is setting up a kernel-to-kernel (since
> > > userspace only role is to create the eventfd) communication, using a
> > > file* as accessory. That IMO is plain wrong.
> >
> > The most sensible is that userspace can use these fds; an in-kernel
> > variant is possible too, but not primary IMHO.
> >
> > It's nice that userspace create the fds; it can then use the same fd for
> > multiple event sources.
> >
> > But I didn't see anything wrong with the way eventfd used to work: you
> > have a kvm ioctl to say "attach this eventfd to this guest notification"
> > and that does the eventfd_fget.  A detach ioctl does the fput (as does
> > release of the kvm fd).
> >
> > If they close the eventfd and don't do the detach ioctl, it's their
> > problem.
>
> Some components would like to know if userspace dropped the fd, and take
> proper action accordingly (release resources, drop module instances,
> etc...).

Like to know?  Possibly.  Need to know?  Not anything I've seen so far.

If userspace creates the fd, component grab a ref and if userspace wants that 
fd completely freed must close the fd *and* tell component.  Simple, race free 
and explicit.  All wins.

As this discussion shows, doing some kind of implies non-reference is hard, 
complex and racy.

> Another thing that comes in my mind (that for some components might not
> matter) is considering the effect of userspace doing things like:
>
> 	for (;;) {
> 		fd = eventfd(...);
> 		ioctl(xfd, XXX_ADD, fd);
> 		close(fd);
> 	}
>
> That might lead to unprivileged users drawing kernel memory w/out any
> userspace accountability, if not properly handled.

No, fget_eventfd covers this exactly as expected.  Don't doubt your ability to 
design sane kernel interfaces; eventfd is nice!  All lguest needed was a 
couple of EXPORT_SYMBOLS and it fitted in beautifully.

Thanks,
Rusty.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Davide Libenzi June 25, 2009, 4:34 p.m. UTC | #30
On Thu, 25 Jun 2009, Rusty Russell wrote:

> On Thu, 25 Jun 2009 08:15:11 am Davide Libenzi wrote:
> >
> > Some components would like to know if userspace dropped the fd, and take
> > proper action accordingly (release resources, drop module instances,
> > etc...).
> 
> Like to know?  Possibly.  Need to know?  Not anything I've seen so far.
> 
> If userspace creates the fd, component grab a ref and if userspace wants that 
> fd completely freed must close the fd *and* tell component.  Simple, race free 
> and explicit.  All wins.
> 
> As this discussion shows, doing some kind of implies non-reference is hard, 
> complex and racy.

Easier, we agree. Not doing anything is always easier, provided the 
userspace interface allows for it.
Cleaner, I'm not sure. Again, it depends from the userspace interface, but 
usually when you close(2) something, you expect the kernel to react 
accordingly, and not on relying on userspace issuing extra calls in order 
to proper cleanup the kernel context.
This is even more true when the eventfd is the sole handle to the visisble 
userspace interface.
In such cases, not taking proper action on close(2) and requiring extra 
calls, would lead to designing interface with the close-no-really-i-mean-it
patterns.



- Davide


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Gregory Haskins June 25, 2009, 5:32 p.m. UTC | #31
Davide Libenzi wrote:
> On Thu, 25 Jun 2009, Rusty Russell wrote:
>
>   
>> On Thu, 25 Jun 2009 08:15:11 am Davide Libenzi wrote:
>>     
>>> Some components would like to know if userspace dropped the fd, and take
>>> proper action accordingly (release resources, drop module instances,
>>> etc...).
>>>       
>> Like to know?  Possibly.  Need to know?  Not anything I've seen so far.
>>
>> If userspace creates the fd, component grab a ref and if userspace wants that 
>> fd completely freed must close the fd *and* tell component.  Simple, race free 
>> and explicit.  All wins.
>>
>> As this discussion shows, doing some kind of implies non-reference is hard, 
>> complex and racy.
>>     
>
> Easier, we agree. Not doing anything is always easier, provided the 
> userspace interface allows for it.
> Cleaner, I'm not sure. Again, it depends from the userspace interface, but 
> usually when you close(2) something, you expect the kernel to react 
> accordingly, and not on relying on userspace issuing extra calls in order 
> to proper cleanup the kernel context.
> This is even more true when the eventfd is the sole handle to the visisble 
> userspace interface.
> In such cases, not taking proper action on close(2) and requiring extra 
> calls, would lead to designing interface with the close-no-really-i-mean-it
> patterns.
>   

Avi,
  Davide's argument is compelling in favor of considering the latest
irqfd fixes I pushed.  You could run into weirdness if we tried to
revert to the old requirement on explicit DEASSIGN ioctl once we start
doing things like handing the fd to other apps/components.  Its cleaner
IMO to retain the implicit-deassign-on-close behavior.

Since v5 also restores optional explicit DEASSIGN support, the only
downside to accepting my patches (vs reverting POLLHUP completely) is
the complexity required to deal with POLLHUP.  But this has been
substantially reduced and clarified since yesterdays review, and I am
reasonably confident the protocol is sound.

Please take a close look at it and consider for merging, if you would.

Thanks,
-Greg
Michael S. Tsirkin June 25, 2009, 6:26 p.m. UTC | #32
On Thu, Jun 25, 2009 at 01:32:22PM -0400, Gregory Haskins wrote:
> Please take a close look at it and consider for merging, if you would.

With all the incremental patching, I kind of lost track of what the
complete file would look like.  Is there a git tree I could pull?
Gregory Haskins June 25, 2009, 6:41 p.m. UTC | #33
Michael S. Tsirkin wrote:
> On Thu, Jun 25, 2009 at 01:32:22PM -0400, Gregory Haskins wrote:
>   
>> Please take a close look at it and consider for merging, if you would.
>>     
>
> With all the incremental patching, I kind of lost track of what the
> complete file would look like.  Is there a git tree I could pull?
>
>   

Ask and ye shall receive :)

git pull
ssh://master.kernel.org/pub/scm/linux/kernel/git/ghaskins/linux-2.6-hacks.git
kvm/irqfd

If you dont have a kernel.org account, it should rsync to the
gitweb/git: server soon, in which case
s|ssh://master|git://git should get you the proper path.

Note that there is one additional race that I found that is not included
in that tree in the slow-work infrastructure.  You can find my latest
submission to David Howells to resolve that issue here:

http://patchwork.kernel.org/patch/32287/

Thanks Michael!
-Greg
Michael S. Tsirkin June 26, 2009, 11:23 a.m. UTC | #34
On Thu, Jun 25, 2009 at 02:41:09PM -0400, Gregory Haskins wrote:
> Michael S. Tsirkin wrote:
> > On Thu, Jun 25, 2009 at 01:32:22PM -0400, Gregory Haskins wrote:
> >   
> >> Please take a close look at it and consider for merging, if you would.
> >>     
> >
> > With all the incremental patching, I kind of lost track of what the
> > complete file would look like.  Is there a git tree I could pull?
> >
> >   
> 
> Ask and ye shall receive :)
> 
> git pull
> ssh://master.kernel.org/pub/scm/linux/kernel/git/ghaskins/linux-2.6-hacks.git
> kvm/irqfd
> 
> If you dont have a kernel.org account, it should rsync to the
> gitweb/git: server soon, in which case
> s|ssh://master|git://git should get you the proper path.
> 
> Note that there is one additional race that I found that is not included
> in that tree in the slow-work infrastructure.  You can find my latest
> submission to David Howells to resolve that issue here:
> 
> http://patchwork.kernel.org/patch/32287/
> 
> Thanks Michael!
> -Greg
> 

Yes, I saw that. I agree your patch solves that issue.
diff mbox

Patch

Index: linux-2.6.mod/fs/eventfd.c
===================================================================
--- linux-2.6.mod.orig/fs/eventfd.c	2009-06-21 09:52:10.000000000 -0700
+++ linux-2.6.mod/fs/eventfd.c	2009-06-21 16:51:27.000000000 -0700
@@ -17,32 +17,38 @@ 
 #include <linux/eventfd.h>
 #include <linux/syscalls.h>
 #include <linux/module.h>
+#include <linux/kref.h>
 
 struct eventfd_ctx {
+	struct kref kref;
 	wait_queue_head_t wqh;
 	/*
 	 * Every time that a write(2) is performed on an eventfd, the
 	 * value of the __u64 being written is added to "count" and a
 	 * wakeup is performed on "wqh". A read(2) will return the "count"
 	 * value to userspace, and will reset "count" to zero. The kernel
-	 * size eventfd_signal() also, adds to the "count" counter and
+	 * side eventfd_signal() also, adds to the "count" counter and
 	 * issue a wakeup.
 	 */
 	__u64 count;
 	unsigned int flags;
 };
 
-/*
- * Adds "n" to the eventfd counter "count". Returns "n" in case of
- * success, or a value lower then "n" in case of coutner overflow.
- * This function is supposed to be called by the kernel in paths
- * that do not allow sleeping. In this function we allow the counter
- * to reach the ULLONG_MAX value, and we signal this as overflow
- * condition by returining a POLLERR to poll(2).
+/**
+ * eventfd_signal - Adds @n to the eventfd counter. This function is
+ *                  supposed to be called by the kernel in paths that do not
+ *                  allow sleeping. In this function we allow the counter
+ *                  to reach the ULLONG_MAX value, and we signal this as
+ *                  overflow condition by returining a POLLERR to poll(2).
+ *
+ * @ctx: [in] Pointer to the eventfd context.
+ * @n: [in] Value of the counter to be added to the eventfd internal counter.
+ *
+ * Returns: In case of success, it returns @n, otherwise (in case of overflow
+ *          of the eventfd 64bit internal counter) a value lower than @n.
  */
-int eventfd_signal(struct file *file, int n)
+int eventfd_signal(struct eventfd_ctx *ctx, int n)
 {
-	struct eventfd_ctx *ctx = file->private_data;
 	unsigned long flags;
 
 	if (n < 0)
@@ -59,9 +65,30 @@  int eventfd_signal(struct file *file, in
 }
 EXPORT_SYMBOL_GPL(eventfd_signal);
 
+static void eventfd_free(struct kref *kref)
+{
+	struct eventfd_ctx *ctx = container_of(kref, struct eventfd_ctx, kref);
+
+	kfree(ctx);
+}
+
+static struct eventfd_ctx *eventfd_get(struct eventfd_ctx *ctx)
+{
+	kref_get(&ctx->kref);
+	return ctx;
+}
+
+static void eventfd_put(struct eventfd_ctx *ctx)
+{
+	kref_put(&ctx->kref, eventfd_free);
+}
+
 static int eventfd_release(struct inode *inode, struct file *file)
 {
-	kfree(file->private_data);
+	struct eventfd_ctx *ctx = file->private_data;
+
+	wake_up_poll(&ctx->wqh, POLLHUP);
+	eventfd_put(ctx);
 	return 0;
 }
 
@@ -185,6 +212,14 @@  static const struct file_operations even
 	.write		= eventfd_write,
 };
 
+/**
+ * eventfd_fget - Acquire a reference of an eventfd file descriptor.
+ *
+ * @fd: [in] Eventfd file descriptor.
+ *
+ * Returns: A pointer to the eventfd file structure in case of success, or a
+ *          proper error pointer in case of failure.
+ */
 struct file *eventfd_fget(int fd)
 {
 	struct file *file;
@@ -201,6 +236,59 @@  struct file *eventfd_fget(int fd)
 }
 EXPORT_SYMBOL_GPL(eventfd_fget);
 
+/**
+ * eventfd_ctx_get - Acquires a reference to the internal eventfd context.
+ *
+ * @file: [in] Pointer to the eventfd file.
+ *
+ * Returns: In case of success, returns a pointer to the eventfd context,
+ *          otherwise a proper error code.
+ */
+struct eventfd_ctx *eventfd_ctx_get(struct file *file)
+{
+	return file->f_op == &eventfd_fops ? eventfd_get(file->private_data) :
+		ERR_PTR(-EINVAL);
+}
+EXPORT_SYMBOL_GPL(eventfd_ctx_get);
+
+/**
+ * eventfd_ctx_put - Releases a reference to the internal eventfd context
+ *                   (previously acquired with eventfd_ctx_get()).
+ *
+ * @ctx: [in] Pointer to eventfd context.
+ *
+ * Returns: Nothing.
+ */
+void eventfd_ctx_put(struct eventfd_ctx *ctx)
+{
+	eventfd_put(ctx);
+}
+EXPORT_SYMBOL_GPL(eventfd_ctx_put);
+
+/**
+ * eventfd_ctx_fdget - Acquires a reference to the internal eventfd context
+ *                     given an eventfd file descriptor.
+ *
+ * @fd: [in] Eventfd file descriptor.
+ *
+ * Returns: In case of success, it returns a pointer to the internal eventfd
+ *          context, otherwise a proper error code.
+ */
+struct eventfd_ctx *eventfd_ctx_fdget(int fd)
+{
+	struct file *file;
+	struct eventfd_ctx *ctx;
+
+	file = eventfd_fget(fd);
+	if (IS_ERR(file))
+		return (struct eventfd_ctx *) file;
+	ctx = eventfd_ctx_get(file);
+	fput(file);
+
+	return ctx;
+}
+EXPORT_SYMBOL_GPL(eventfd_ctx_fdget);
+
 SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
 {
 	int fd;
@@ -217,6 +305,7 @@  SYSCALL_DEFINE2(eventfd2, unsigned int, 
 	if (!ctx)
 		return -ENOMEM;
 
+	kref_init(&ctx->kref);
 	init_waitqueue_head(&ctx->wqh);
 	ctx->count = count;
 	ctx->flags = flags;
@@ -237,3 +326,62 @@  SYSCALL_DEFINE1(eventfd, unsigned int, c
 	return sys_eventfd2(count, 0);
 }
 
+static void eventfd_pollcb_ptqueue(struct file *file, wait_queue_head_t *wqh,
+				   poll_table *pt)
+{
+	struct eventfd_pollcb *ecb;
+
+	ecb = container_of(pt, struct eventfd_pollcb, pt);
+	add_wait_queue(wqh, &ecb->wait);
+}
+
+/**
+ * eventfd_pollcb_register - Registers a wakeup callback with the eventfd
+ *                           file. The wakeup callback will be called from
+ *                           atomic context, and should handle the POLLHUP
+ *                           event accordingly in order to notice the last
+ *                           instance of the eventfd descriptor going away.
+ *
+ * @file: [in] Pointer to the eventfd file.
+ * @ecb: [out] Pointer to the eventfd callback structure.
+ * @cbf: [in] Pointer to the wakeup callback function.
+ * @events: [out] Pointer to the events that were ready at the time of the
+ *                callback registration.
+ *
+ * Returns: Zero in case of success, or a proper error code.
+ */
+int eventfd_pollcb_register(struct file *file, struct eventfd_pollcb *ecb,
+			    wait_queue_func_t cbf, unsigned int *events)
+{
+	struct eventfd_ctx *ctx;
+
+	ctx = file->private_data;
+	if (file->f_op != &eventfd_fops)
+		return -EINVAL;
+
+	init_waitqueue_func_entry(&ecb->wait, cbf);
+	init_poll_funcptr(&ecb->pt, eventfd_pollcb_ptqueue);
+	ecb->ctx = eventfd_get(ctx);
+
+	*events = file->f_op->poll(file, &ecb->pt);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(eventfd_pollcb_register);
+
+/**
+ * eventfd_pollcb_unregister - Unregisters a wakeup callback previously registered
+ *                             with eventfd_pollcb_register().
+ *
+ * @ecb: [in] Pointer to the eventfd callback structure previously registered with
+ *            eventfd_pollcb_register().
+ *
+ * Returns: Nothing.
+ */
+void eventfd_pollcb_unregister(struct eventfd_pollcb *ecb)
+{
+	remove_wait_queue(&ecb->ctx->wqh, &ecb->wait);
+	eventfd_put(ecb->ctx);
+}
+EXPORT_SYMBOL_GPL(eventfd_pollcb_unregister);
+
Index: linux-2.6.mod/include/linux/eventfd.h
===================================================================
--- linux-2.6.mod.orig/include/linux/eventfd.h	2009-06-21 09:52:10.000000000 -0700
+++ linux-2.6.mod/include/linux/eventfd.h	2009-06-21 16:49:19.000000000 -0700
@@ -8,10 +8,10 @@ 
 #ifndef _LINUX_EVENTFD_H
 #define _LINUX_EVENTFD_H
 
-#ifdef CONFIG_EVENTFD
-
-/* For O_CLOEXEC and O_NONBLOCK */
 #include <linux/fcntl.h>
+#include <linux/wait.h>
+#include <linux/poll.h>
+#include <linux/file.h>
 
 /*
  * CAREFUL: Check include/asm-generic/fcntl.h when defining
@@ -27,16 +27,22 @@ 
 #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
 #define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
 
-struct file *eventfd_fget(int fd);
-int eventfd_signal(struct file *file, int n);
-
-#else /* CONFIG_EVENTFD */
+struct eventfd_ctx;
 
-#define eventfd_fget(fd) ERR_PTR(-ENOSYS)
-static inline int eventfd_signal(struct file *file, int n)
-{ return 0; }
+struct eventfd_pollcb {
+	poll_table pt;
+	struct eventfd_ctx *ctx;
+	wait_queue_t wait;
+};
 
-#endif /* CONFIG_EVENTFD */
+struct file *eventfd_fget(int fd);
+struct eventfd_ctx *eventfd_ctx_fdget(int fd);
+struct eventfd_ctx *eventfd_ctx_get(struct file *file);
+void eventfd_ctx_put(struct eventfd_ctx *ctx);
+int eventfd_signal(struct eventfd_ctx *ctx, int n);
+int eventfd_pollcb_register(struct file *file, struct eventfd_pollcb *ecb,
+			    wait_queue_func_t cbf, unsigned int *events);
+void eventfd_pollcb_unregister(struct eventfd_pollcb *ecb);
 
 #endif /* _LINUX_EVENTFD_H */
 
Index: linux-2.6.mod/fs/aio.c
===================================================================
--- linux-2.6.mod.orig/fs/aio.c	2009-06-21 14:36:52.000000000 -0700
+++ linux-2.6.mod/fs/aio.c	2009-06-21 15:25:44.000000000 -0700
@@ -485,6 +485,8 @@  static inline void really_put_req(struct
 {
 	assert_spin_locked(&ctx->ctx_lock);
 
+	if (req->ki_eventfd != NULL)
+		eventfd_ctx_put(req->ki_eventfd);
 	if (req->ki_dtor)
 		req->ki_dtor(req);
 	if (req->ki_iovec != &req->ki_inline_vec)
@@ -509,8 +511,6 @@  static void aio_fput_routine(struct work
 		/* Complete the fput(s) */
 		if (req->ki_filp != NULL)
 			__fput(req->ki_filp);
-		if (req->ki_eventfd != NULL)
-			__fput(req->ki_eventfd);
 
 		/* Link the iocb into the context's free list */
 		spin_lock_irq(&ctx->ctx_lock);
@@ -528,8 +528,6 @@  static void aio_fput_routine(struct work
  */
 static int __aio_put_req(struct kioctx *ctx, struct kiocb *req)
 {
-	int schedule_putreq = 0;
-
 	dprintk(KERN_DEBUG "aio_put(%p): f_count=%ld\n",
 		req, atomic_long_read(&req->ki_filp->f_count));
 
@@ -549,24 +547,16 @@  static int __aio_put_req(struct kioctx *
 	 * we would not be holding the last reference to the file*, so
 	 * this function will be executed w/out any aio kthread wakeup.
 	 */
-	if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count)))
-		schedule_putreq++;
-	else
-		req->ki_filp = NULL;
-	if (req->ki_eventfd != NULL) {
-		if (unlikely(atomic_long_dec_and_test(&req->ki_eventfd->f_count)))
-			schedule_putreq++;
-		else
-			req->ki_eventfd = NULL;
-	}
-	if (unlikely(schedule_putreq)) {
+	if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count))) {
 		get_ioctx(ctx);
 		spin_lock(&fput_lock);
 		list_add(&req->ki_list, &fput_head);
 		spin_unlock(&fput_lock);
 		queue_work(aio_wq, &fput_work);
-	} else
+	} else {
+		req->ki_filp = NULL;
 		really_put_req(ctx, req);
+	}
 	return 1;
 }
 
@@ -1622,7 +1612,7 @@  static int io_submit_one(struct kioctx *
 		 * an eventfd() fd, and will be signaled for each completed
 		 * event using the eventfd_signal() function.
 		 */
-		req->ki_eventfd = eventfd_fget((int) iocb->aio_resfd);
+		req->ki_eventfd = eventfd_ctx_fdget((int) iocb->aio_resfd);
 		if (IS_ERR(req->ki_eventfd)) {
 			ret = PTR_ERR(req->ki_eventfd);
 			req->ki_eventfd = NULL;
Index: linux-2.6.mod/include/linux/aio.h
===================================================================
--- linux-2.6.mod.orig/include/linux/aio.h	2009-06-21 14:36:52.000000000 -0700
+++ linux-2.6.mod/include/linux/aio.h	2009-06-21 15:32:42.000000000 -0700
@@ -12,6 +12,7 @@ 
 #define AIO_MAXSEGS		4
 #define AIO_KIOGRP_NR_ATOMIC	8
 
+struct eventfd_ctx;
 struct kioctx;
 
 /* Notes on cancelling a kiocb:
@@ -121,9 +122,9 @@  struct kiocb {
 
 	/*
 	 * If the aio_resfd field of the userspace iocb is not zero,
-	 * this is the underlying file* to deliver event to.
+	 * this is the underlying eventfd context to deliver events to.
 	 */
-	struct file		*ki_eventfd;
+	struct eventfd_ctx	*ki_eventfd;
 };
 
 #define is_sync_kiocb(iocb)	((iocb)->ki_key == KIOCB_SYNC_KEY)
Index: linux-2.6.mod/init/Kconfig
===================================================================
--- linux-2.6.mod.orig/init/Kconfig	2009-06-21 15:42:52.000000000 -0700
+++ linux-2.6.mod/init/Kconfig	2009-06-21 15:43:25.000000000 -0700
@@ -925,6 +925,7 @@  config SHMEM
 
 config AIO
 	bool "Enable AIO support" if EMBEDDED
+	select EVENTFD
 	default y
 	help
 	  This option enables POSIX asynchronous I/O which may by used
Index: linux-2.6.mod/drivers/lguest/lg.h
===================================================================
--- linux-2.6.mod.orig/drivers/lguest/lg.h	2009-06-21 15:55:17.000000000 -0700
+++ linux-2.6.mod/drivers/lguest/lg.h	2009-06-21 15:56:00.000000000 -0700
@@ -80,9 +80,11 @@  struct lg_cpu {
 	struct lg_cpu_arch arch;
 };
 
+struct eventfd_ctx;
+
 struct lg_eventfd {
 	unsigned long addr;
-	struct file *event;
+	struct eventfd_ctx *event;
 };
 
 struct lg_eventfd_map {
Index: linux-2.6.mod/drivers/lguest/lguest_user.c
===================================================================
--- linux-2.6.mod.orig/drivers/lguest/lguest_user.c	2009-06-21 15:55:17.000000000 -0700
+++ linux-2.6.mod/drivers/lguest/lguest_user.c	2009-06-21 15:57:11.000000000 -0700
@@ -50,7 +50,7 @@  static int add_eventfd(struct lguest *lg
 
 	/* Now append new entry. */
 	new->map[new->num].addr = addr;
-	new->map[new->num].event = eventfd_fget(fd);
+	new->map[new->num].event = eventfd_ctx_fdget(fd);
 	if (IS_ERR(new->map[new->num].event)) {
 		kfree(new);
 		return PTR_ERR(new->map[new->num].event);
@@ -357,7 +357,7 @@  static int close(struct inode *inode, st
 
 	/* Release any eventfds they registered. */
 	for (i = 0; i < lg->eventfds->num; i++)
-		fput(lg->eventfds->map[i].event);
+		eventfd_ctx_put(lg->eventfds->map[i].event);
 	kfree(lg->eventfds);
 
 	/* If lg->dead doesn't contain an error code it will be NULL or a