Message ID | alpine.DEB.1.10.0906211613200.8816@makko.or.mcafeemobile.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Davide Libenzi wrote: > On Sun, 21 Jun 2009, Gregory Haskins wrote: > > >> This looks great, Davide. I am fairly certain I can now solve the races >> and even implement Michael's DEASSIGN feature with this patch in place. >> I will actually fire it up tomorrow when I am back in the office and >> give it a spin, but I do not spy any more races via visual inspection. >> >> Kind Regards, >> -Greg >> >> PS: I was wrong with my previous statement about requiring an embeddable >> object (eventfd_notifier for me, eventfd_pollcb for you). I think you >> can technically solve this issue minimally by merely locking the POLLHUP >> and exposing the kref. However, I think that leads to an more awkward >> interface (e.g. we already have eventfd_fget() plus we add a new one >> like eventfd_refget(), which might confuse users), so I prefer what you >> did here. Just thought I would throw that out there in case you would >> prefer to change even fewer lines. >> > > I actually ended up exposing the eventfd context anyway, since IMO is a > better option instead of holding references to the eventfd file (that > makes VFS people uneasy). > I liked "version - 1" better ;) I think ultimately we still want to hold the fget() for eventfd_signal(), as it is the producer side. Without it, we have no way of knowing when the last producer goes away if they happen to be an in-kernel user. Also, thinking about it some more, I think I like the non-wrapped pollcb variant better now. This is because I can uhook the wqh in the POLLHUP and defer the kref_put() until later. I think this combination gives me the most ideal environment (even though technically I should not expect more activity on the wait-queue after the POLLHUP). In either case, I whipped up a super-lite patch with just the bare minimum and then developed the KVM solution on top. Things look really good, now, and I am happy with the result. As an aside, it only has the following impact on the eventfd core: fs/eventfd.c | 43 ++++++++++++++++++++++++++++++++++++------- include/linux/eventfd.h | 7 +++++++ 2 files changed, 43 insertions(+), 7 deletions(-) Ill post the series, including the KVM portions, for review. Thanks Davide, -Greg
On Mon, 22 Jun 2009, Gregory Haskins wrote: > Davide Libenzi wrote: > > On Sun, 21 Jun 2009, Gregory Haskins wrote: > > > > > >> This looks great, Davide. I am fairly certain I can now solve the races > >> and even implement Michael's DEASSIGN feature with this patch in place. > >> I will actually fire it up tomorrow when I am back in the office and > >> give it a spin, but I do not spy any more races via visual inspection. > >> > >> Kind Regards, > >> -Greg > >> > >> PS: I was wrong with my previous statement about requiring an embeddable > >> object (eventfd_notifier for me, eventfd_pollcb for you). I think you > >> can technically solve this issue minimally by merely locking the POLLHUP > >> and exposing the kref. However, I think that leads to an more awkward > >> interface (e.g. we already have eventfd_fget() plus we add a new one > >> like eventfd_refget(), which might confuse users), so I prefer what you > >> did here. Just thought I would throw that out there in case you would > >> prefer to change even fewer lines. > >> > > > > I actually ended up exposing the eventfd context anyway, since IMO is a > > better option instead of holding references to the eventfd file (that > > makes VFS people uneasy). > > > > I liked "version - 1" better ;) > > I think ultimately we still want to hold the fget() for > eventfd_signal(), as it is the producer side. Without it, we have no > way of knowing when the last producer goes away if they happen to be an > in-kernel user. No you don't. Holding a file* reference does not give you any assurance whatsoever that the last consumer did not go away. The f_count value will simply be 1 (just you). If a producer side want to take proper action when the last consumer leaves the building, it needs to register a pollcb hook and handle POLLHUP. Exposing the opaque eventfd context, and basing the eventfd operations on that one (instead of the live referenced file*) is a cleaner interface. Can you please base your IRQfd bits on top of that, so that we can give this IRQfd thread a closure? - Davide -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Davide Libenzi wrote: > On Mon, 22 Jun 2009, Gregory Haskins wrote: > > >> Davide Libenzi wrote: >> >>> On Sun, 21 Jun 2009, Gregory Haskins wrote: >>> >>> >>> >>>> This looks great, Davide. I am fairly certain I can now solve the races >>>> and even implement Michael's DEASSIGN feature with this patch in place. >>>> I will actually fire it up tomorrow when I am back in the office and >>>> give it a spin, but I do not spy any more races via visual inspection. >>>> >>>> Kind Regards, >>>> -Greg >>>> >>>> PS: I was wrong with my previous statement about requiring an embeddable >>>> object (eventfd_notifier for me, eventfd_pollcb for you). I think you >>>> can technically solve this issue minimally by merely locking the POLLHUP >>>> and exposing the kref. However, I think that leads to an more awkward >>>> interface (e.g. we already have eventfd_fget() plus we add a new one >>>> like eventfd_refget(), which might confuse users), so I prefer what you >>>> did here. Just thought I would throw that out there in case you would >>>> prefer to change even fewer lines. >>>> >>>> >>> I actually ended up exposing the eventfd context anyway, since IMO is a >>> better option instead of holding references to the eventfd file (that >>> makes VFS people uneasy). >>> >>> >> I liked "version - 1" better ;) >> >> I think ultimately we still want to hold the fget() for >> eventfd_signal(), as it is the producer side. Without it, we have no >> way of knowing when the last producer goes away if they happen to be an >> in-kernel user. >> > > No you don't. Holding a file* reference does not give you any assurance > whatsoever that the last consumer did not go away. The f_count value will > simply be 1 (just you). > I am probably confused or perhaps have the wrong terminology, but isnt that "ok". I am concerned about the consumer (the guy getting the POLLINs) to be able to detect POLLHUP when the last producer (f_ops->write() from userspace, eventfd_signal() from kernel) goes away. Consider the following sequence: ------------------- userspace calls "fd = eventfd()", and gives one to KVM as an irqfd, and the other to some PCI-passthrough device. The kvm/irqfd side acquires a kref, the pci side acquires a file. At this moment, userspace has the fd, and the pci device has the file (for eventfd_signal()). The fget() count is 2. Userspace closes the fd because its done with it, and the count drops to 1. Some time later, pci does an fput(), and KVM sees the POLLHUP and cleans up. ------------------- In this new model, the POLLHUP would have gone out as soon as userspace closed the fd, even though the intended producer (the PCI device) and the consumer (the KVM guest) are still up and running. This doesnt seem right to me. Or am I missing something? Thanks Davide, -Greg
On Mon, 22 Jun 2009, Gregory Haskins wrote: > I am probably confused or perhaps have the wrong terminology, but isnt > that "ok". I am concerned about the consumer (the guy getting the > POLLINs) to be able to detect POLLHUP when the last producer > (f_ops->write() from userspace, eventfd_signal() from kernel) goes away. > > Consider the following sequence: > > ------------------- > > userspace calls "fd = eventfd()", and gives one to KVM as an irqfd, and > the other to some PCI-passthrough device. > > The kvm/irqfd side acquires a kref, the pci side acquires a file. At > this moment, userspace has the fd, and the pci device has the file (for > eventfd_signal()). The fget() count is 2. Userspace closes the fd > because its done with it, and the count drops to 1. > > Some time later, pci does an fput(), and KVM sees the POLLHUP and cleans up. > > ------------------- > > In this new model, the POLLHUP would have gone out as soon as userspace > closed the fd, even though the intended producer (the PCI device) and > the consumer (the KVM guest) are still up and running. This doesnt seem > right to me. Or am I missing something? What you're doing there, is setting up a kernel-to-kernel (since userspace only role is to create the eventfd) communication, using a file* as accessory. That IMO is plain wrong. If userspace is either the producer, or the consumer, and you need to handle userspace leaving the building, you need to: file = eventfd_fget(fd); ctx = eventfd_ctx_get(file); /* Eventually, if producer */ eventfd_pollcb_register(file, ...); fput(file); In your case of kernel-to-kernel scenario, why would you need eventfd at all, if userspace role in that model is simply to create it? There are more effective ways to have in kernel communication channels, than resorting to userspace link facilities like eventfd. - Davide -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Davide Libenzi wrote: > On Mon, 22 Jun 2009, Gregory Haskins wrote: > > >> I am probably confused or perhaps have the wrong terminology, but isnt >> that "ok". I am concerned about the consumer (the guy getting the >> POLLINs) to be able to detect POLLHUP when the last producer >> (f_ops->write() from userspace, eventfd_signal() from kernel) goes away. >> >> Consider the following sequence: >> >> ------------------- >> >> userspace calls "fd = eventfd()", and gives one to KVM as an irqfd, and >> the other to some PCI-passthrough device. >> >> The kvm/irqfd side acquires a kref, the pci side acquires a file. At >> this moment, userspace has the fd, and the pci device has the file (for >> eventfd_signal()). The fget() count is 2. Userspace closes the fd >> because its done with it, and the count drops to 1. >> >> Some time later, pci does an fput(), and KVM sees the POLLHUP and cleans up. >> >> ------------------- >> >> In this new model, the POLLHUP would have gone out as soon as userspace >> closed the fd, even though the intended producer (the PCI device) and >> the consumer (the KVM guest) are still up and running. This doesnt seem >> right to me. Or am I missing something? >> > > What you're doing there, is setting up a kernel-to-kernel (since > userspace only role is to create the eventfd) communication, using a file* > as accessory. That IMO is plain wrong. > If userspace is either the producer, or the consumer, and you need to > handle userspace leaving the building, you need to: > > file = eventfd_fget(fd); > ctx = eventfd_ctx_get(file); /* Eventually, if producer */ > eventfd_pollcb_register(file, ...); > fput(file); > > In your case of kernel-to-kernel scenario, why would you need eventfd at > all, if userspace role in that model is simply to create it? > The general thesis is for decoupling of the two subsystems. In order to do this, you need some form of polymorphism and an intermediate "handle" mechanism which is userspace friendly. File-descriptors already fit this role neatly, with the "int fd" being the handle, and the f_ops being the polymorphic interface. Eventfd is of course, a subclass of this concept in that it has these same general properties but with signaling semantics (non-blocking collapsible events, etc). Say, for example, you wanted disk IO completion events to generate an interrupt into a guest. One way to do this would, of course, modify all the disk-io code so it knows how to directly inject a KVM guest interrupt. While this would work, someone would undoubtedly get flamed for such a suggestion ;) Another way to do it is to treat the AIO eventfd as the hook point. IIUC AIO already knows how to be an eventfd producer. KVM, by virtue of irqfd, already knows how to be an eventfd consumer. So now kvm can consume AIO, or it can consume userspace events equally well, and without modification. Neither side needs to know about the other per se, other than the details on how to use the eventfd interface. Don't get me wrong: We expect userspace to use all this stuff too. I just expect that we will see all permutations of producer/consumer + userspace/kernel combinations, so I want to retain that "all producers have left" notification feature set. Today eventfd supports producers or consumers in userspace, and producers in the kernel. This new work we are doing adds consumer support in the kernel. Kernel to kernel is just a natural extension of that. -Greg > There are more effective ways to have in kernel communication channels, > than resorting to userspace link facilities like eventfd. > > > > - Davide > > > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >
On Mon, 22 Jun 2009, Gregory Haskins wrote: > The general thesis is for decoupling of the two subsystems. In order to > do this, you need some form of polymorphism and an intermediate "handle" > mechanism which is userspace friendly. File-descriptors already fit > this role neatly, with the "int fd" being the handle, and the f_ops > being the polymorphic interface. Eventfd is of course, a subclass of > this concept in that it has these same general properties but with > signaling semantics (non-blocking collapsible events, etc). > > Say, for example, you wanted disk IO completion events to generate an > interrupt into a guest. One way to do this would, of course, modify all > the disk-io code so it knows how to directly inject a KVM guest > interrupt. While this would work, someone would undoubtedly get flamed > for such a suggestion ;) > > Another way to do it is to treat the AIO eventfd as the hook point. > IIUC AIO already knows how to be an eventfd producer. KVM, by virtue of > irqfd, already knows how to be an eventfd consumer. So now kvm can > consume AIO, or it can consume userspace events equally well, and > without modification. Neither side needs to know about the other per > se, other than the details on how to use the eventfd interface. > > Don't get me wrong: We expect userspace to use all this stuff too. I > just expect that we will see all permutations of producer/consumer + > userspace/kernel combinations, so I want to retain that "all producers > have left" notification feature set. Today eventfd supports producers > or consumers in userspace, and producers in the kernel. This new work > we are doing adds consumer support in the kernel. Kernel to kernel is > just a natural extension of that. A file* is the VFS link between userspace and the kernel. Is not a magical polymorphic interface to be used for whatever kernel side reasons. Basing a kernel internal API over it is flawed. On top of that, a single reference count does not put you on cover about the possible combinations of producers and consumers. For that, you'd need a pipe-like reference handling logic, that is way far from the eventfd scope. So please stop making hypothetical cases about interface usages, and *when* we will have a real case, we'll see what the better handling for it will be. - Davide -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jun 22, 2009 at 11:03:22AM -0700, Davide Libenzi wrote: > In your case of kernel-to-kernel scenario, why would you need eventfd at > all, if userspace role in that model is simply to create it? That's not 100% true. We have a mode where userspace is the producer and/or consumer (migration mode) and we switch between that and direct kernel-to-kernel communication. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 22 Jun 2009, Michael S. Tsirkin wrote: > On Mon, Jun 22, 2009 at 11:03:22AM -0700, Davide Libenzi wrote: > > In your case of kernel-to-kernel scenario, why would you need eventfd at > > all, if userspace role in that model is simply to create it? > > That's not 100% true. We have a mode where userspace is the producer > and/or consumer (migration mode) and we switch between that and > direct kernel-to-kernel communication. Then you'd need to ask yourself how to handle your complex case inside the KVM code, so that other eventfd users are not affected by the extra fat needed to handle your scenarios. Thing that seem to be continuosly tried. A file* based kernel-to-kernel interface is rather wrong IMO. - Davide -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jun 22, 2009 at 11:51:42AM -0700, Davide Libenzi wrote:
> A file* based kernel-to-kernel interface is rather wrong IMO.
But eventfd_ctx should work fine.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Davide Libenzi wrote: > On Mon, 22 Jun 2009, Michael S. Tsirkin wrote: > > >> On Mon, Jun 22, 2009 at 11:03:22AM -0700, Davide Libenzi wrote: >> >>> In your case of kernel-to-kernel scenario, why would you need eventfd at >>> all, if userspace role in that model is simply to create it? >>> >> That's not 100% true. We have a mode where userspace is the producer >> and/or consumer (migration mode) and we switch between that and >> direct kernel-to-kernel communication. >> > > Then you'd need to ask yourself how to handle your complex case inside the > KVM code, so that other eventfd users are not affected by the extra fat > needed to handle your scenarios. Thing that seem to be continuosly tried. > A file* based kernel-to-kernel interface is rather wrong IMO. > Well, I will point out that the interface in question is eventfd_signal(struct file *), and you were the one that invented it afaict. Can't help it if we like it :) BTW: The termination point of that call is incidental. Afterall, it works the same for kernel-kernel or kernel-userspace. The only relevant discussion point here is that your proposal breaks POLLHUP for eventfd_signal() users, and we have a real use case that cares. Let me ask the question a different way: Lets say we all agree that eventfd_signal() cannot be file* based to be correct. Is there a way you can think of that satisfies both the need to get rid of file* AND still deliver producer-release notification to consumers. Ultimately that is all I care about. -Greg
Michael S. Tsirkin wrote: > On Mon, Jun 22, 2009 at 11:51:42AM -0700, Davide Libenzi wrote: > >> A file* based kernel-to-kernel interface is rather wrong IMO. >> > > But eventfd_ctx should work fine. > Yeah, and I guess we can always just say that qemu can't close the fd or something. Seems hacky, but it might work if Davide insists we need his change. -Greg
On Mon, 22 Jun 2009, Gregory Haskins wrote: > Michael S. Tsirkin wrote: > > On Mon, Jun 22, 2009 at 11:51:42AM -0700, Davide Libenzi wrote: > > > >> A file* based kernel-to-kernel interface is rather wrong IMO. > >> > > > > But eventfd_ctx should work fine. > > > > Yeah, and I guess we can always just say that qemu can't close the fd or > something. Seems hacky, but it might work if Davide insists we need his > change. Continuing here, since there's no reason of having many subthreads talking about the same thing. Can you make a detailed example of what you're trying to achieve (no Hint Mode, please)? As it sounds to me, that you need a consumer/producer reference counting, to cover your scenario correctly. - Davide -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 22 Jun 2009, Gregory Haskins wrote: > Davide Libenzi wrote: > > On Mon, 22 Jun 2009, Michael S. Tsirkin wrote: > > > > > >> On Mon, Jun 22, 2009 at 11:03:22AM -0700, Davide Libenzi wrote: > >> > >>> In your case of kernel-to-kernel scenario, why would you need eventfd at > >>> all, if userspace role in that model is simply to create it? > >>> > >> That's not 100% true. We have a mode where userspace is the producer > >> and/or consumer (migration mode) and we switch between that and > >> direct kernel-to-kernel communication. > >> > > > > Then you'd need to ask yourself how to handle your complex case inside the > > KVM code, so that other eventfd users are not affected by the extra fat > > needed to handle your scenarios. Thing that seem to be continuosly tried. > > A file* based kernel-to-kernel interface is rather wrong IMO. > > > > Well, I will point out that the interface in question is > eventfd_signal(struct file *), and you were the one that invented it > afaict. Can't help it if we like it :) Yes, I did. The case for eventfd was dual. First, it was a communication link between userspace and kernel, for things like KAIO. Second, it was a faster and smaller userspace replacement for things people used pipes before. In all those cases, at least one reference of the file* is alive in userspace. Even with the above case in mind, today I'd still use the eventfd_ctx as internal kernel API accessory. - Davide -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Davide Libenzi wrote: > On Mon, 22 Jun 2009, Gregory Haskins wrote: > > >> Michael S. Tsirkin wrote: >> >>> On Mon, Jun 22, 2009 at 11:51:42AM -0700, Davide Libenzi wrote: >>> >>> >>>> A file* based kernel-to-kernel interface is rather wrong IMO. >>>> >>>> >>> But eventfd_ctx should work fine. >>> >>> >> Yeah, and I guess we can always just say that qemu can't close the fd or >> something. Seems hacky, but it might work if Davide insists we need his >> change. >> > > Continuing here, since there's no reason of having many subthreads talking > about the same thing. > Can you make a detailed example of what you're trying to achieve (no Hint > Mode, please)? > As it sounds to me, that you need a consumer/producer reference counting, > to cover your scenario correctly. > Well, one of them was already briefly mentioned (the PCI-passthrough thing). I am not personally working on this part (yet, anyway). Another example of something I am actually working on as we speak would be for this thing we are building called "virtual-bus". It is a way to build/deploy device models directly in the kernel. In either of these cases, we have this concept of allowing the guest to notify the host, or vice versa, that something happened. Typically this would be in reference to some chunk of shared memory, and the signaling is telling the other side "I changed something, go look". Without going into a ton of detail (unless, of course, you want it) is that we are generalizing the signaling infrastructure (irqfd and iosignalfd) so that something like PCI-passthrough or vbus are not directly coupled to KVM. They communicate to KVM purely in terms of (among other things) these irqfd/iosignalfd interfaces. Using vbus as an example (though others are similar): vbus would primarily exists as a kernel-model. However, there would be a small device model in qemu-kem userspace to publish something like a PCI device that declares its resource requirements to the guest. Some of those requirements would be things like how many interrupts it needs, and what IO ranges it supports, etc. When the guest programs the PCI space, it maps the resources from its own world into the virtual PCI resources emulated in qemu. So up in userspace, the vbus pci-device would have an open reference to the kvm guest (derived from /dev/kvm) and an open reference to a vbus (derived from /dev/vbus). Lets call these kvmfd, and vbusfd, respectively. For something like an interrupt, we would hook the point where the PCI-MSI interrupt is assigned, and would do the following: gsi = kvm_irq_route_gsi(); fd = eventfd(0, 0); ioctl(kvmfd, KVM_IRQFD_ASSIGN, {fd, gsi}); ioctl(vbusfd, VBUS_SHMSIGNAL_ASSIGN, {sigid, fd}); So userspace orchestrated the assignment of this one eventfd to a KVM consumer, and a VBUS producer. The two subsystems do not care about the details of the other side of the link, per se. VBUS just knows that it can eventfd_signal() its memory region to tell whomever is listening that it changed. Likewise, KVM just knows to inject "gsi" when it gets signalled. You could equally have given "fd" to a userspace thread for either producer or consumer roles, or any other combination. If we were doing PCI-passthough, substitute the last SHMSIGNAL_ASSIGN ioctl call with some PCI_PASSTHROUGH_ASSIGN verb and you get the idea. The important thing is that once this is established, userspace doesn't necessarily care about the fd anymore. So now the question is: do we keep it around for other things? Do we keep it around because we don't want KVM to see the POLLHUP, or do we address the "release" code so that it works even if userspace issued close(fd) at this point. I am not sure what the answer is, but this is the scenario we are concerned with in this thread. In the example above, vbus is free to produce events on its eventfd until it gets a SHMSIGNAL_DEASSIGN request. -Greg > > > - Davide > > > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >
On Mon, Jun 22, 2009 at 03:26:41PM -0400, Gregory Haskins wrote: > Michael S. Tsirkin wrote: > > On Mon, Jun 22, 2009 at 11:51:42AM -0700, Davide Libenzi wrote: > > > >> A file* based kernel-to-kernel interface is rather wrong IMO. > >> > > > > But eventfd_ctx should work fine. > > > > Yeah, and I guess we can always just say that qemu can't close the fd or > something. Seems hacky, but it might work if Davide insists we need his > change. Yes, I think that's fine.
On Mon, 22 Jun 2009, Gregory Haskins wrote: > So up in userspace, the vbus pci-device would have an open reference to > the kvm guest (derived from /dev/kvm) and an open reference to a vbus > (derived from /dev/vbus). Lets call these kvmfd, and vbusfd, > respectively. For something like an interrupt, we would hook the point > where the PCI-MSI interrupt is assigned, and would do the following: > > gsi = kvm_irq_route_gsi(); > fd = eventfd(0, 0); > ioctl(kvmfd, KVM_IRQFD_ASSIGN, {fd, gsi}); > ioctl(vbusfd, VBUS_SHMSIGNAL_ASSIGN, {sigid, fd}); > > So userspace orchestrated the assignment of this one eventfd to a KVM > consumer, and a VBUS producer. The two subsystems do not care about the > details of the other side of the link, per se. VBUS just knows that it > can eventfd_signal() its memory region to tell whomever is listening > that it changed. Likewise, KVM just knows to inject "gsi" when it gets > signalled. You could equally have given "fd" to a userspace thread for > either producer or consumer roles, or any other combination. > > If we were doing PCI-passthough, substitute the last SHMSIGNAL_ASSIGN > ioctl call with some PCI_PASSTHROUGH_ASSIGN verb and you get the idea. > > The important thing is that once this is established, userspace doesn't > necessarily care about the fd anymore. So now the question is: do we > keep it around for other things? Do we keep it around because we don't > want KVM to see the POLLHUP, or do we address the "release" code so that > it works even if userspace issued close(fd) at this point. I am not > sure what the answer is, but this is the scenario we are concerned with > in this thread. In the example above, vbus is free to produce events on > its eventfd until it gets a SHMSIGNAL_DEASSIGN request. I see. The thing remains, that in order to reliably handle generic producer/consumer scenarios you'd need a reference counting similar to pipes, where the notion of producer and consumer is very well defined. In your case, it'd be sufficent to expose an: int eventfd_wakeup(struct eventfd_ctx *ctx, void *key); So that each end can signal, in an file* f_count independent way, when they're about to leave. Say with a new status bit. The problem, that I felt coming since when I introduced the key-based wakeups, is that right now the "key" start to become a little restrictive in terms of amount of information carried by it. If there are other key-aware waiters on "wqh", your new bit cannot conflict with existing ones, and you (and future users of the interface) will be polluting the global bit namespace. Probably turning "key" into a pointer to a structure like: struct wakeup_key { unsigned long type; void *key; }; So the current: wake_up_poll(wq, POLLIN); Would become: key.type = KT_POLLMASK; key.key = POLLIN; wake_up_key(wq, &key); At that point the waiters (like poll/epoll) can simply discard the wakeup types they are not interested into (poll/epoll would care only about KT_POLLMASK), and something more than a simple bitmask can be carried by the wakups. Like: key.type = KT_SOMEDATA; key.key = &data; wake_up_key(wq, &key); Of course, we could build another layer on top, to fit this model, but then we'd have the dual citizenship among normal wakeups and new-interface wakeups/signaling. Another thing. Now that the interface exposes the eventfd_ctx context directly, the pollcb register/unregister does not have any reason to be inside eventfd (they're totally generic bits at that point), so my thought was about to drop them. - Davide -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Davide Libenzi wrote: > On Mon, 22 Jun 2009, Gregory Haskins wrote: > > >> So up in userspace, the vbus pci-device would have an open reference to >> the kvm guest (derived from /dev/kvm) and an open reference to a vbus >> (derived from /dev/vbus). Lets call these kvmfd, and vbusfd, >> respectively. For something like an interrupt, we would hook the point >> where the PCI-MSI interrupt is assigned, and would do the following: >> >> gsi = kvm_irq_route_gsi(); >> fd = eventfd(0, 0); >> ioctl(kvmfd, KVM_IRQFD_ASSIGN, {fd, gsi}); >> ioctl(vbusfd, VBUS_SHMSIGNAL_ASSIGN, {sigid, fd}); >> >> So userspace orchestrated the assignment of this one eventfd to a KVM >> consumer, and a VBUS producer. The two subsystems do not care about the >> details of the other side of the link, per se. VBUS just knows that it >> can eventfd_signal() its memory region to tell whomever is listening >> that it changed. Likewise, KVM just knows to inject "gsi" when it gets >> signalled. You could equally have given "fd" to a userspace thread for >> either producer or consumer roles, or any other combination. >> >> If we were doing PCI-passthough, substitute the last SHMSIGNAL_ASSIGN >> ioctl call with some PCI_PASSTHROUGH_ASSIGN verb and you get the idea. >> >> The important thing is that once this is established, userspace doesn't >> necessarily care about the fd anymore. So now the question is: do we >> keep it around for other things? Do we keep it around because we don't >> want KVM to see the POLLHUP, or do we address the "release" code so that >> it works even if userspace issued close(fd) at this point. I am not >> sure what the answer is, but this is the scenario we are concerned with >> in this thread. In the example above, vbus is free to produce events on >> its eventfd until it gets a SHMSIGNAL_DEASSIGN request. >> > > I see. > The thing remains, that in order to reliably handle generic > producer/consumer scenarios you'd need a reference counting similar to > pipes, where the notion of producer and consumer is very well defined. > I see your point. Well, I think the more important thing here is that we address the races, and add support for DEASSIGN. We can do both of those things with any of the patches that you and I have been kicking around. So what I propose is that we move forward with whatever patch you bless as proper for now. This producer-release issue is pretty minor in the grand scheme of things. We can always just have userspace hold the fd. I can either take in the last one you sent, or it sounds like you wanted to possibly do another round of cleanup? Whatever the case may be, let me know and we can coordinate with Andrew/Avi on what tree to pull it into. It sounds like riding in kvm.git is the perhaps the most logical. Thanks Davide, -Greg
On Mon, 22 Jun 2009, Gregory Haskins wrote: > Davide Libenzi wrote: > > On Mon, 22 Jun 2009, Gregory Haskins wrote: > > > > > >> So up in userspace, the vbus pci-device would have an open reference to > >> the kvm guest (derived from /dev/kvm) and an open reference to a vbus > >> (derived from /dev/vbus). Lets call these kvmfd, and vbusfd, > >> respectively. For something like an interrupt, we would hook the point > >> where the PCI-MSI interrupt is assigned, and would do the following: > >> > >> gsi = kvm_irq_route_gsi(); > >> fd = eventfd(0, 0); > >> ioctl(kvmfd, KVM_IRQFD_ASSIGN, {fd, gsi}); > >> ioctl(vbusfd, VBUS_SHMSIGNAL_ASSIGN, {sigid, fd}); > >> > >> So userspace orchestrated the assignment of this one eventfd to a KVM > >> consumer, and a VBUS producer. The two subsystems do not care about the > >> details of the other side of the link, per se. VBUS just knows that it > >> can eventfd_signal() its memory region to tell whomever is listening > >> that it changed. Likewise, KVM just knows to inject "gsi" when it gets > >> signalled. You could equally have given "fd" to a userspace thread for > >> either producer or consumer roles, or any other combination. > >> > >> If we were doing PCI-passthough, substitute the last SHMSIGNAL_ASSIGN > >> ioctl call with some PCI_PASSTHROUGH_ASSIGN verb and you get the idea. > >> > >> The important thing is that once this is established, userspace doesn't > >> necessarily care about the fd anymore. So now the question is: do we > >> keep it around for other things? Do we keep it around because we don't > >> want KVM to see the POLLHUP, or do we address the "release" code so that > >> it works even if userspace issued close(fd) at this point. I am not > >> sure what the answer is, but this is the scenario we are concerned with > >> in this thread. In the example above, vbus is free to produce events on > >> its eventfd until it gets a SHMSIGNAL_DEASSIGN request. > >> > > > > I see. > > The thing remains, that in order to reliably handle generic > > producer/consumer scenarios you'd need a reference counting similar to > > pipes, where the notion of producer and consumer is very well defined. > > > > I see your point. > > Well, I think the more important thing here is that we address the > races, and add support for DEASSIGN. We can do both of those things > with any of the patches that you and I have been kicking around. So > what I propose is that we move forward with whatever patch you bless as > proper for now. This producer-release issue is pretty minor in the > grand scheme of things. We can always just have userspace hold the fd. Is it a real problem? Can it be decently handled on the KVM side? Reason I'm asking, is that I wouldn't want to change the interface, only to find it unsuitable a few weeks later. > I can either take in the last one you sent, or it sounds like you wanted > to possibly do another round of cleanup? Whatever the case may be, let > me know and we can coordinate with Andrew/Avi on what tree to pull it > into. It sounds like riding in kvm.git is the perhaps the most logical. Yes, I'd like to drop the pollcb bits (that you can implement KVM-side), at least. And yes, going kvm.git is probably the best path. - Davide -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Davide Libenzi wrote: > On Mon, 22 Jun 2009, Gregory Haskins wrote: > > >> Davide Libenzi wrote: >> >>> On Mon, 22 Jun 2009, Gregory Haskins wrote: >>> >>> >>> >>>> So up in userspace, the vbus pci-device would have an open reference to >>>> the kvm guest (derived from /dev/kvm) and an open reference to a vbus >>>> (derived from /dev/vbus). Lets call these kvmfd, and vbusfd, >>>> respectively. For something like an interrupt, we would hook the point >>>> where the PCI-MSI interrupt is assigned, and would do the following: >>>> >>>> gsi = kvm_irq_route_gsi(); >>>> fd = eventfd(0, 0); >>>> ioctl(kvmfd, KVM_IRQFD_ASSIGN, {fd, gsi}); >>>> ioctl(vbusfd, VBUS_SHMSIGNAL_ASSIGN, {sigid, fd}); >>>> >>>> So userspace orchestrated the assignment of this one eventfd to a KVM >>>> consumer, and a VBUS producer. The two subsystems do not care about the >>>> details of the other side of the link, per se. VBUS just knows that it >>>> can eventfd_signal() its memory region to tell whomever is listening >>>> that it changed. Likewise, KVM just knows to inject "gsi" when it gets >>>> signalled. You could equally have given "fd" to a userspace thread for >>>> either producer or consumer roles, or any other combination. >>>> >>>> If we were doing PCI-passthough, substitute the last SHMSIGNAL_ASSIGN >>>> ioctl call with some PCI_PASSTHROUGH_ASSIGN verb and you get the idea. >>>> >>>> The important thing is that once this is established, userspace doesn't >>>> necessarily care about the fd anymore. So now the question is: do we >>>> keep it around for other things? Do we keep it around because we don't >>>> want KVM to see the POLLHUP, or do we address the "release" code so that >>>> it works even if userspace issued close(fd) at this point. I am not >>>> sure what the answer is, but this is the scenario we are concerned with >>>> in this thread. In the example above, vbus is free to produce events on >>>> its eventfd until it gets a SHMSIGNAL_DEASSIGN request. >>>> >>>> >>> I see. >>> The thing remains, that in order to reliably handle generic >>> producer/consumer scenarios you'd need a reference counting similar to >>> pipes, where the notion of producer and consumer is very well defined. >>> >>> >> I see your point. >> >> Well, I think the more important thing here is that we address the >> races, and add support for DEASSIGN. We can do both of those things >> with any of the patches that you and I have been kicking around. So >> what I propose is that we move forward with whatever patch you bless as >> proper for now. This producer-release issue is pretty minor in the >> grand scheme of things. We can always just have userspace hold the fd. >> > > Is it a real problem? Can it be decently handled on the KVM side? > Reason I'm asking, is that I wouldn't want to change the interface, only > to find it unsuitable a few weeks later. > To be honest, I am not sure. I would guess its not a *huge* deal, though it was obviously enough of a concern to at least discuss it. I can definitely say that I think the other issues which are being fixed are substantially more important. > > > > >> I can either take in the last one you sent, or it sounds like you wanted >> to possibly do another round of cleanup? Whatever the case may be, let >> me know and we can coordinate with Andrew/Avi on what tree to pull it >> into. It sounds like riding in kvm.git is the perhaps the most logical. >> > > Yes, I'd like to drop the pollcb bits (that you can implement KVM-side), > at least. > And yes, going kvm.git is probably the best path. > Sounds good. Thanks for your patience, Davide. I think we are almost there! ;) -Greg > > > - Davide > > >
On Mon, 22 Jun 2009 09:24:20 am Davide Libenzi wrote: > I actually ended up exposing the eventfd context anyway, since IMO is a > better option instead of holding references to the eventfd file (that > makes VFS people uneasy). lguest is a special case since we don't let them close the fds, except by closing /dev/lguest. Lguest change seems fine, but: > Index: linux-2.6.mod/drivers/lguest/lg.h > =================================================================== > --- linux-2.6.mod.orig/drivers/lguest/lg.h 2009-06-21 15:55:17.000000000 > -0700 +++ linux-2.6.mod/drivers/lguest/lg.h 2009-06-21 15:56:00.000000000 > -0700 @@ -80,9 +80,11 @@ struct lg_cpu { > struct lg_cpu_arch arch; > }; > > +struct eventfd_ctx; > + > struct lg_eventfd { > unsigned long addr; > - struct file *event; > + struct eventfd_ctx *event; > }; The first 'struct eventfd_ctx;' line is not required. Thanks, Rusty. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 22 Jun 2009, Gregory Haskins wrote: > To be honest, I am not sure. I would guess its not a *huge* deal, > though it was obviously enough of a concern to at least discuss it. I > can definitely say that I think the other issues which are being fixed > are substantially more important. Ok then, will repost the revised patch later today. - Davide -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 23 Jun 2009, Rusty Russell wrote:
> The first 'struct eventfd_ctx;' line is not required.
Will repost dropping that.
- Davide
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 23 Jun 2009, Gregory Haskins wrote: > Davide Libenzi wrote: > > On Mon, 22 Jun 2009, Gregory Haskins wrote: > > > > > >> To be honest, I am not sure. I would guess its not a *huge* deal, > >> though it was obviously enough of a concern to at least discuss it. I > >> can definitely say that I think the other issues which are being fixed > >> are substantially more important. > >> > > > > Ok then, will repost the revised patch later today. > > > > Ok sounds good. I did have a chance to take a closer look at your > proposal for the key data, and I think it makes a lot of sense. Do you > see it as a problem if we defer adding that? Or should we try to add > that notion now? That would need to go eventually via mainline, after some discussion. But yes, I believe that using the "key" as simple bitmask is a little restrictive. - Davide -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Davide Libenzi wrote: > On Mon, 22 Jun 2009, Gregory Haskins wrote: > > >> To be honest, I am not sure. I would guess its not a *huge* deal, >> though it was obviously enough of a concern to at least discuss it. I >> can definitely say that I think the other issues which are being fixed >> are substantially more important. >> > > Ok then, will repost the revised patch later today. > Ok sounds good. I did have a chance to take a closer look at your proposal for the key data, and I think it makes a lot of sense. Do you see it as a problem if we defer adding that? Or should we try to add that notion now? -Greg
Davide Libenzi wrote: > On Tue, 23 Jun 2009, Gregory Haskins wrote: > > >> Davide Libenzi wrote: >> >>> On Mon, 22 Jun 2009, Gregory Haskins wrote: >>> >>> >>> >>>> To be honest, I am not sure. I would guess its not a *huge* deal, >>>> though it was obviously enough of a concern to at least discuss it. I >>>> can definitely say that I think the other issues which are being fixed >>>> are substantially more important. >>>> >>>> >>> Ok then, will repost the revised patch later today. >>> >>> >> Ok sounds good. I did have a chance to take a closer look at your >> proposal for the key data, and I think it makes a lot of sense. Do you >> see it as a problem if we defer adding that? Or should we try to add >> that notion now? >> > > That would need to go eventually via mainline, after some discussion. But > yes, I believe that using the "key" as simple bitmask is a little > restrictive. > Ok. As long as you do not think we are somehow painting ourselves into a corner, lets just go with your original plan for the revised patch sans key changes. Thanks Davide, -Greg
On Tue, Jun 23, 2009 at 07:35:00AM -0700, Davide Libenzi wrote: > On Tue, 23 Jun 2009, Gregory Haskins wrote: > > > Davide Libenzi wrote: > > > On Mon, 22 Jun 2009, Gregory Haskins wrote: > > > > > > > > >> To be honest, I am not sure. I would guess its not a *huge* deal, > > >> though it was obviously enough of a concern to at least discuss it. I > > >> can definitely say that I think the other issues which are being fixed > > >> are substantially more important. > > >> > > > > > > Ok then, will repost the revised patch later today. > > > > > > > Ok sounds good. I did have a chance to take a closer look at your > > proposal for the key data, and I think it makes a lot of sense. Do you > > see it as a problem if we defer adding that? Or should we try to add > > that notion now? > > That would need to go eventually via mainline, after some discussion. But > yes, I believe that using the "key" as simple bitmask is a little > restrictive. > > > - Davide I agree. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 23 Jun 2009 03:33:22 am Davide Libenzi wrote: > What you're doing there, is setting up a kernel-to-kernel (since > userspace only role is to create the eventfd) communication, using a file* > as accessory. That IMO is plain wrong. The most sensible is that userspace can use these fds; an in-kernel variant is possible too, but not primary IMHO. It's nice that userspace create the fds; it can then use the same fd for multiple event sources. But I didn't see anything wrong with the way eventfd used to work: you have a kvm ioctl to say "attach this eventfd to this guest notification" and that does the eventfd_fget. A detach ioctl does the fput (as does release of the kvm fd). If they close the eventfd and don't do the detach ioctl, it's their problem. Rusty. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 24 Jun 2009, Rusty Russell wrote: > On Tue, 23 Jun 2009 03:33:22 am Davide Libenzi wrote: > > What you're doing there, is setting up a kernel-to-kernel (since > > userspace only role is to create the eventfd) communication, using a file* > > as accessory. That IMO is plain wrong. > > The most sensible is that userspace can use these fds; an in-kernel variant is > possible too, but not primary IMHO. > > It's nice that userspace create the fds; it can then use the same fd for > multiple event sources. > > But I didn't see anything wrong with the way eventfd used to work: you have a > kvm ioctl to say "attach this eventfd to this guest notification" and that does > the eventfd_fget. A detach ioctl does the fput (as does release of the kvm > fd). > > If they close the eventfd and don't do the detach ioctl, it's their problem. Some components would like to know if userspace dropped the fd, and take proper action accordingly (release resources, drop module instances, etc...). The POLLHUP helps with that, but w/out decoupling file* memory from eventfd_ctx memory, it becomes pretty tricky (if feasible at all) to handle the event in a race-free way. Once the file* is decoupled from the eventfd_ctx, it becomes saner to expose the internal kernel API via the eventfd_ctx. Another thing that comes in my mind (that for some components might not matter) is considering the effect of userspace doing things like: for (;;) { fd = eventfd(...); ioctl(xfd, XXX_ADD, fd); close(fd); } That might lead to unprivileged users drawing kernel memory w/out any userspace accountability, if not properly handled. - Davide -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 25 Jun 2009 08:15:11 am Davide Libenzi wrote: > On Wed, 24 Jun 2009, Rusty Russell wrote: > > On Tue, 23 Jun 2009 03:33:22 am Davide Libenzi wrote: > > > What you're doing there, is setting up a kernel-to-kernel (since > > > userspace only role is to create the eventfd) communication, using a > > > file* as accessory. That IMO is plain wrong. > > > > The most sensible is that userspace can use these fds; an in-kernel > > variant is possible too, but not primary IMHO. > > > > It's nice that userspace create the fds; it can then use the same fd for > > multiple event sources. > > > > But I didn't see anything wrong with the way eventfd used to work: you > > have a kvm ioctl to say "attach this eventfd to this guest notification" > > and that does the eventfd_fget. A detach ioctl does the fput (as does > > release of the kvm fd). > > > > If they close the eventfd and don't do the detach ioctl, it's their > > problem. > > Some components would like to know if userspace dropped the fd, and take > proper action accordingly (release resources, drop module instances, > etc...). Like to know? Possibly. Need to know? Not anything I've seen so far. If userspace creates the fd, component grab a ref and if userspace wants that fd completely freed must close the fd *and* tell component. Simple, race free and explicit. All wins. As this discussion shows, doing some kind of implies non-reference is hard, complex and racy. > Another thing that comes in my mind (that for some components might not > matter) is considering the effect of userspace doing things like: > > for (;;) { > fd = eventfd(...); > ioctl(xfd, XXX_ADD, fd); > close(fd); > } > > That might lead to unprivileged users drawing kernel memory w/out any > userspace accountability, if not properly handled. No, fget_eventfd covers this exactly as expected. Don't doubt your ability to design sane kernel interfaces; eventfd is nice! All lguest needed was a couple of EXPORT_SYMBOLS and it fitted in beautifully. Thanks, Rusty. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 25 Jun 2009, Rusty Russell wrote: > On Thu, 25 Jun 2009 08:15:11 am Davide Libenzi wrote: > > > > Some components would like to know if userspace dropped the fd, and take > > proper action accordingly (release resources, drop module instances, > > etc...). > > Like to know? Possibly. Need to know? Not anything I've seen so far. > > If userspace creates the fd, component grab a ref and if userspace wants that > fd completely freed must close the fd *and* tell component. Simple, race free > and explicit. All wins. > > As this discussion shows, doing some kind of implies non-reference is hard, > complex and racy. Easier, we agree. Not doing anything is always easier, provided the userspace interface allows for it. Cleaner, I'm not sure. Again, it depends from the userspace interface, but usually when you close(2) something, you expect the kernel to react accordingly, and not on relying on userspace issuing extra calls in order to proper cleanup the kernel context. This is even more true when the eventfd is the sole handle to the visisble userspace interface. In such cases, not taking proper action on close(2) and requiring extra calls, would lead to designing interface with the close-no-really-i-mean-it patterns. - Davide -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Davide Libenzi wrote: > On Thu, 25 Jun 2009, Rusty Russell wrote: > > >> On Thu, 25 Jun 2009 08:15:11 am Davide Libenzi wrote: >> >>> Some components would like to know if userspace dropped the fd, and take >>> proper action accordingly (release resources, drop module instances, >>> etc...). >>> >> Like to know? Possibly. Need to know? Not anything I've seen so far. >> >> If userspace creates the fd, component grab a ref and if userspace wants that >> fd completely freed must close the fd *and* tell component. Simple, race free >> and explicit. All wins. >> >> As this discussion shows, doing some kind of implies non-reference is hard, >> complex and racy. >> > > Easier, we agree. Not doing anything is always easier, provided the > userspace interface allows for it. > Cleaner, I'm not sure. Again, it depends from the userspace interface, but > usually when you close(2) something, you expect the kernel to react > accordingly, and not on relying on userspace issuing extra calls in order > to proper cleanup the kernel context. > This is even more true when the eventfd is the sole handle to the visisble > userspace interface. > In such cases, not taking proper action on close(2) and requiring extra > calls, would lead to designing interface with the close-no-really-i-mean-it > patterns. > Avi, Davide's argument is compelling in favor of considering the latest irqfd fixes I pushed. You could run into weirdness if we tried to revert to the old requirement on explicit DEASSIGN ioctl once we start doing things like handing the fd to other apps/components. Its cleaner IMO to retain the implicit-deassign-on-close behavior. Since v5 also restores optional explicit DEASSIGN support, the only downside to accepting my patches (vs reverting POLLHUP completely) is the complexity required to deal with POLLHUP. But this has been substantially reduced and clarified since yesterdays review, and I am reasonably confident the protocol is sound. Please take a close look at it and consider for merging, if you would. Thanks, -Greg
On Thu, Jun 25, 2009 at 01:32:22PM -0400, Gregory Haskins wrote:
> Please take a close look at it and consider for merging, if you would.
With all the incremental patching, I kind of lost track of what the
complete file would look like. Is there a git tree I could pull?
Michael S. Tsirkin wrote: > On Thu, Jun 25, 2009 at 01:32:22PM -0400, Gregory Haskins wrote: > >> Please take a close look at it and consider for merging, if you would. >> > > With all the incremental patching, I kind of lost track of what the > complete file would look like. Is there a git tree I could pull? > > Ask and ye shall receive :) git pull ssh://master.kernel.org/pub/scm/linux/kernel/git/ghaskins/linux-2.6-hacks.git kvm/irqfd If you dont have a kernel.org account, it should rsync to the gitweb/git: server soon, in which case s|ssh://master|git://git should get you the proper path. Note that there is one additional race that I found that is not included in that tree in the slow-work infrastructure. You can find my latest submission to David Howells to resolve that issue here: http://patchwork.kernel.org/patch/32287/ Thanks Michael! -Greg
On Thu, Jun 25, 2009 at 02:41:09PM -0400, Gregory Haskins wrote: > Michael S. Tsirkin wrote: > > On Thu, Jun 25, 2009 at 01:32:22PM -0400, Gregory Haskins wrote: > > > >> Please take a close look at it and consider for merging, if you would. > >> > > > > With all the incremental patching, I kind of lost track of what the > > complete file would look like. Is there a git tree I could pull? > > > > > > Ask and ye shall receive :) > > git pull > ssh://master.kernel.org/pub/scm/linux/kernel/git/ghaskins/linux-2.6-hacks.git > kvm/irqfd > > If you dont have a kernel.org account, it should rsync to the > gitweb/git: server soon, in which case > s|ssh://master|git://git should get you the proper path. > > Note that there is one additional race that I found that is not included > in that tree in the slow-work infrastructure. You can find my latest > submission to David Howells to resolve that issue here: > > http://patchwork.kernel.org/patch/32287/ > > Thanks Michael! > -Greg > Yes, I saw that. I agree your patch solves that issue.
Index: linux-2.6.mod/fs/eventfd.c =================================================================== --- linux-2.6.mod.orig/fs/eventfd.c 2009-06-21 09:52:10.000000000 -0700 +++ linux-2.6.mod/fs/eventfd.c 2009-06-21 16:51:27.000000000 -0700 @@ -17,32 +17,38 @@ #include <linux/eventfd.h> #include <linux/syscalls.h> #include <linux/module.h> +#include <linux/kref.h> struct eventfd_ctx { + struct kref kref; wait_queue_head_t wqh; /* * Every time that a write(2) is performed on an eventfd, the * value of the __u64 being written is added to "count" and a * wakeup is performed on "wqh". A read(2) will return the "count" * value to userspace, and will reset "count" to zero. The kernel - * size eventfd_signal() also, adds to the "count" counter and + * side eventfd_signal() also, adds to the "count" counter and * issue a wakeup. */ __u64 count; unsigned int flags; }; -/* - * Adds "n" to the eventfd counter "count". Returns "n" in case of - * success, or a value lower then "n" in case of coutner overflow. - * This function is supposed to be called by the kernel in paths - * that do not allow sleeping. In this function we allow the counter - * to reach the ULLONG_MAX value, and we signal this as overflow - * condition by returining a POLLERR to poll(2). +/** + * eventfd_signal - Adds @n to the eventfd counter. This function is + * supposed to be called by the kernel in paths that do not + * allow sleeping. In this function we allow the counter + * to reach the ULLONG_MAX value, and we signal this as + * overflow condition by returining a POLLERR to poll(2). + * + * @ctx: [in] Pointer to the eventfd context. + * @n: [in] Value of the counter to be added to the eventfd internal counter. + * + * Returns: In case of success, it returns @n, otherwise (in case of overflow + * of the eventfd 64bit internal counter) a value lower than @n. */ -int eventfd_signal(struct file *file, int n) +int eventfd_signal(struct eventfd_ctx *ctx, int n) { - struct eventfd_ctx *ctx = file->private_data; unsigned long flags; if (n < 0) @@ -59,9 +65,30 @@ int eventfd_signal(struct file *file, in } EXPORT_SYMBOL_GPL(eventfd_signal); +static void eventfd_free(struct kref *kref) +{ + struct eventfd_ctx *ctx = container_of(kref, struct eventfd_ctx, kref); + + kfree(ctx); +} + +static struct eventfd_ctx *eventfd_get(struct eventfd_ctx *ctx) +{ + kref_get(&ctx->kref); + return ctx; +} + +static void eventfd_put(struct eventfd_ctx *ctx) +{ + kref_put(&ctx->kref, eventfd_free); +} + static int eventfd_release(struct inode *inode, struct file *file) { - kfree(file->private_data); + struct eventfd_ctx *ctx = file->private_data; + + wake_up_poll(&ctx->wqh, POLLHUP); + eventfd_put(ctx); return 0; } @@ -185,6 +212,14 @@ static const struct file_operations even .write = eventfd_write, }; +/** + * eventfd_fget - Acquire a reference of an eventfd file descriptor. + * + * @fd: [in] Eventfd file descriptor. + * + * Returns: A pointer to the eventfd file structure in case of success, or a + * proper error pointer in case of failure. + */ struct file *eventfd_fget(int fd) { struct file *file; @@ -201,6 +236,59 @@ struct file *eventfd_fget(int fd) } EXPORT_SYMBOL_GPL(eventfd_fget); +/** + * eventfd_ctx_get - Acquires a reference to the internal eventfd context. + * + * @file: [in] Pointer to the eventfd file. + * + * Returns: In case of success, returns a pointer to the eventfd context, + * otherwise a proper error code. + */ +struct eventfd_ctx *eventfd_ctx_get(struct file *file) +{ + return file->f_op == &eventfd_fops ? eventfd_get(file->private_data) : + ERR_PTR(-EINVAL); +} +EXPORT_SYMBOL_GPL(eventfd_ctx_get); + +/** + * eventfd_ctx_put - Releases a reference to the internal eventfd context + * (previously acquired with eventfd_ctx_get()). + * + * @ctx: [in] Pointer to eventfd context. + * + * Returns: Nothing. + */ +void eventfd_ctx_put(struct eventfd_ctx *ctx) +{ + eventfd_put(ctx); +} +EXPORT_SYMBOL_GPL(eventfd_ctx_put); + +/** + * eventfd_ctx_fdget - Acquires a reference to the internal eventfd context + * given an eventfd file descriptor. + * + * @fd: [in] Eventfd file descriptor. + * + * Returns: In case of success, it returns a pointer to the internal eventfd + * context, otherwise a proper error code. + */ +struct eventfd_ctx *eventfd_ctx_fdget(int fd) +{ + struct file *file; + struct eventfd_ctx *ctx; + + file = eventfd_fget(fd); + if (IS_ERR(file)) + return (struct eventfd_ctx *) file; + ctx = eventfd_ctx_get(file); + fput(file); + + return ctx; +} +EXPORT_SYMBOL_GPL(eventfd_ctx_fdget); + SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags) { int fd; @@ -217,6 +305,7 @@ SYSCALL_DEFINE2(eventfd2, unsigned int, if (!ctx) return -ENOMEM; + kref_init(&ctx->kref); init_waitqueue_head(&ctx->wqh); ctx->count = count; ctx->flags = flags; @@ -237,3 +326,62 @@ SYSCALL_DEFINE1(eventfd, unsigned int, c return sys_eventfd2(count, 0); } +static void eventfd_pollcb_ptqueue(struct file *file, wait_queue_head_t *wqh, + poll_table *pt) +{ + struct eventfd_pollcb *ecb; + + ecb = container_of(pt, struct eventfd_pollcb, pt); + add_wait_queue(wqh, &ecb->wait); +} + +/** + * eventfd_pollcb_register - Registers a wakeup callback with the eventfd + * file. The wakeup callback will be called from + * atomic context, and should handle the POLLHUP + * event accordingly in order to notice the last + * instance of the eventfd descriptor going away. + * + * @file: [in] Pointer to the eventfd file. + * @ecb: [out] Pointer to the eventfd callback structure. + * @cbf: [in] Pointer to the wakeup callback function. + * @events: [out] Pointer to the events that were ready at the time of the + * callback registration. + * + * Returns: Zero in case of success, or a proper error code. + */ +int eventfd_pollcb_register(struct file *file, struct eventfd_pollcb *ecb, + wait_queue_func_t cbf, unsigned int *events) +{ + struct eventfd_ctx *ctx; + + ctx = file->private_data; + if (file->f_op != &eventfd_fops) + return -EINVAL; + + init_waitqueue_func_entry(&ecb->wait, cbf); + init_poll_funcptr(&ecb->pt, eventfd_pollcb_ptqueue); + ecb->ctx = eventfd_get(ctx); + + *events = file->f_op->poll(file, &ecb->pt); + + return 0; +} +EXPORT_SYMBOL_GPL(eventfd_pollcb_register); + +/** + * eventfd_pollcb_unregister - Unregisters a wakeup callback previously registered + * with eventfd_pollcb_register(). + * + * @ecb: [in] Pointer to the eventfd callback structure previously registered with + * eventfd_pollcb_register(). + * + * Returns: Nothing. + */ +void eventfd_pollcb_unregister(struct eventfd_pollcb *ecb) +{ + remove_wait_queue(&ecb->ctx->wqh, &ecb->wait); + eventfd_put(ecb->ctx); +} +EXPORT_SYMBOL_GPL(eventfd_pollcb_unregister); + Index: linux-2.6.mod/include/linux/eventfd.h =================================================================== --- linux-2.6.mod.orig/include/linux/eventfd.h 2009-06-21 09:52:10.000000000 -0700 +++ linux-2.6.mod/include/linux/eventfd.h 2009-06-21 16:49:19.000000000 -0700 @@ -8,10 +8,10 @@ #ifndef _LINUX_EVENTFD_H #define _LINUX_EVENTFD_H -#ifdef CONFIG_EVENTFD - -/* For O_CLOEXEC and O_NONBLOCK */ #include <linux/fcntl.h> +#include <linux/wait.h> +#include <linux/poll.h> +#include <linux/file.h> /* * CAREFUL: Check include/asm-generic/fcntl.h when defining @@ -27,16 +27,22 @@ #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK) #define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE) -struct file *eventfd_fget(int fd); -int eventfd_signal(struct file *file, int n); - -#else /* CONFIG_EVENTFD */ +struct eventfd_ctx; -#define eventfd_fget(fd) ERR_PTR(-ENOSYS) -static inline int eventfd_signal(struct file *file, int n) -{ return 0; } +struct eventfd_pollcb { + poll_table pt; + struct eventfd_ctx *ctx; + wait_queue_t wait; +}; -#endif /* CONFIG_EVENTFD */ +struct file *eventfd_fget(int fd); +struct eventfd_ctx *eventfd_ctx_fdget(int fd); +struct eventfd_ctx *eventfd_ctx_get(struct file *file); +void eventfd_ctx_put(struct eventfd_ctx *ctx); +int eventfd_signal(struct eventfd_ctx *ctx, int n); +int eventfd_pollcb_register(struct file *file, struct eventfd_pollcb *ecb, + wait_queue_func_t cbf, unsigned int *events); +void eventfd_pollcb_unregister(struct eventfd_pollcb *ecb); #endif /* _LINUX_EVENTFD_H */ Index: linux-2.6.mod/fs/aio.c =================================================================== --- linux-2.6.mod.orig/fs/aio.c 2009-06-21 14:36:52.000000000 -0700 +++ linux-2.6.mod/fs/aio.c 2009-06-21 15:25:44.000000000 -0700 @@ -485,6 +485,8 @@ static inline void really_put_req(struct { assert_spin_locked(&ctx->ctx_lock); + if (req->ki_eventfd != NULL) + eventfd_ctx_put(req->ki_eventfd); if (req->ki_dtor) req->ki_dtor(req); if (req->ki_iovec != &req->ki_inline_vec) @@ -509,8 +511,6 @@ static void aio_fput_routine(struct work /* Complete the fput(s) */ if (req->ki_filp != NULL) __fput(req->ki_filp); - if (req->ki_eventfd != NULL) - __fput(req->ki_eventfd); /* Link the iocb into the context's free list */ spin_lock_irq(&ctx->ctx_lock); @@ -528,8 +528,6 @@ static void aio_fput_routine(struct work */ static int __aio_put_req(struct kioctx *ctx, struct kiocb *req) { - int schedule_putreq = 0; - dprintk(KERN_DEBUG "aio_put(%p): f_count=%ld\n", req, atomic_long_read(&req->ki_filp->f_count)); @@ -549,24 +547,16 @@ static int __aio_put_req(struct kioctx * * we would not be holding the last reference to the file*, so * this function will be executed w/out any aio kthread wakeup. */ - if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count))) - schedule_putreq++; - else - req->ki_filp = NULL; - if (req->ki_eventfd != NULL) { - if (unlikely(atomic_long_dec_and_test(&req->ki_eventfd->f_count))) - schedule_putreq++; - else - req->ki_eventfd = NULL; - } - if (unlikely(schedule_putreq)) { + if (unlikely(atomic_long_dec_and_test(&req->ki_filp->f_count))) { get_ioctx(ctx); spin_lock(&fput_lock); list_add(&req->ki_list, &fput_head); spin_unlock(&fput_lock); queue_work(aio_wq, &fput_work); - } else + } else { + req->ki_filp = NULL; really_put_req(ctx, req); + } return 1; } @@ -1622,7 +1612,7 @@ static int io_submit_one(struct kioctx * * an eventfd() fd, and will be signaled for each completed * event using the eventfd_signal() function. */ - req->ki_eventfd = eventfd_fget((int) iocb->aio_resfd); + req->ki_eventfd = eventfd_ctx_fdget((int) iocb->aio_resfd); if (IS_ERR(req->ki_eventfd)) { ret = PTR_ERR(req->ki_eventfd); req->ki_eventfd = NULL; Index: linux-2.6.mod/include/linux/aio.h =================================================================== --- linux-2.6.mod.orig/include/linux/aio.h 2009-06-21 14:36:52.000000000 -0700 +++ linux-2.6.mod/include/linux/aio.h 2009-06-21 15:32:42.000000000 -0700 @@ -12,6 +12,7 @@ #define AIO_MAXSEGS 4 #define AIO_KIOGRP_NR_ATOMIC 8 +struct eventfd_ctx; struct kioctx; /* Notes on cancelling a kiocb: @@ -121,9 +122,9 @@ struct kiocb { /* * If the aio_resfd field of the userspace iocb is not zero, - * this is the underlying file* to deliver event to. + * this is the underlying eventfd context to deliver events to. */ - struct file *ki_eventfd; + struct eventfd_ctx *ki_eventfd; }; #define is_sync_kiocb(iocb) ((iocb)->ki_key == KIOCB_SYNC_KEY) Index: linux-2.6.mod/init/Kconfig =================================================================== --- linux-2.6.mod.orig/init/Kconfig 2009-06-21 15:42:52.000000000 -0700 +++ linux-2.6.mod/init/Kconfig 2009-06-21 15:43:25.000000000 -0700 @@ -925,6 +925,7 @@ config SHMEM config AIO bool "Enable AIO support" if EMBEDDED + select EVENTFD default y help This option enables POSIX asynchronous I/O which may by used Index: linux-2.6.mod/drivers/lguest/lg.h =================================================================== --- linux-2.6.mod.orig/drivers/lguest/lg.h 2009-06-21 15:55:17.000000000 -0700 +++ linux-2.6.mod/drivers/lguest/lg.h 2009-06-21 15:56:00.000000000 -0700 @@ -80,9 +80,11 @@ struct lg_cpu { struct lg_cpu_arch arch; }; +struct eventfd_ctx; + struct lg_eventfd { unsigned long addr; - struct file *event; + struct eventfd_ctx *event; }; struct lg_eventfd_map { Index: linux-2.6.mod/drivers/lguest/lguest_user.c =================================================================== --- linux-2.6.mod.orig/drivers/lguest/lguest_user.c 2009-06-21 15:55:17.000000000 -0700 +++ linux-2.6.mod/drivers/lguest/lguest_user.c 2009-06-21 15:57:11.000000000 -0700 @@ -50,7 +50,7 @@ static int add_eventfd(struct lguest *lg /* Now append new entry. */ new->map[new->num].addr = addr; - new->map[new->num].event = eventfd_fget(fd); + new->map[new->num].event = eventfd_ctx_fdget(fd); if (IS_ERR(new->map[new->num].event)) { kfree(new); return PTR_ERR(new->map[new->num].event); @@ -357,7 +357,7 @@ static int close(struct inode *inode, st /* Release any eventfds they registered. */ for (i = 0; i < lg->eventfds->num; i++) - fput(lg->eventfds->map[i].event); + eventfd_ctx_put(lg->eventfds->map[i].event); kfree(lg->eventfds); /* If lg->dead doesn't contain an error code it will be NULL or a