Message ID | 1439185906-28180-2-git-send-email-dhobsong@igel.co.jp (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Replying to my own post, but I had the following comments/questions. Martin, if you have any response to my comments I would be very happy to hear them. On 2015-08-10 2:51 PM, Damian Hobson-Garcia wrote: > From: Martin Sustrik <sustrik@250bpm.com> > [snip] > > write(2): > > User is allowed to write only buffers containing the following structure: > > struct efd_mask { > __u32 events; > __u64 data; > }; > > The value of 'events' should be any combination of event flags as defined by > poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified events will > be signaled when polling (select, poll, epoll) on the eventfd is done later on. > 'data' is opaque data that are not interpreted by eventfd object. > I'm not fully clear on the purpose that the 'data' member serves. Does this opaque handle need to be tied together with this event synchronization construct? [snip] > @@ -55,6 +69,9 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n) > { > + /* This function should never be used with eventfd in the mask mode. */ > + BUG_ON(ctx->flags & EFD_MASK); > + ... > @@ -158,6 +180,9 @@ int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_t *wait, > { > + /* This function should never be used with eventfd in the mask mode. */ > + BUG_ON(ctx->flags & EFD_MASK); > + ... > @@ -188,6 +213,9 @@ ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt) > + /* This function should never be used with eventfd in the mask mode. */ > + BUG_ON(ctx->flags & EFD_MASK); > + If eventfd_ctx_fileget() returns EINVAL when EFD_MASK is set, I don't think that there will be a way to call these functions in the mask mode, so it should be possible to get rid of the BUG_ON checks. [snip] > @@ -230,6 +258,19 @@ static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count, > ssize_t res; > __u64 cnt; > > + if (ctx->flags & EFD_MASK) { > + struct efd_mask mask; > + > + if (count < sizeof(mask)) > + return -EINVAL; > + spin_lock_irq(&ctx->wqh.lock); > + mask = ctx->mask; > + spin_unlock_irq(&ctx->wqh.lock); > + if (copy_to_user(buf, &mask, sizeof(mask))) > + return -EFAULT; > + return sizeof(mask); > + } > + For the other eventfd modes, reading the value will update the internal state of the eventfd (either clearing or decrementing the counter). Should something similar be done here? I'm thinking of a case where a process is polling on this fd in a loop. Clearing the efd_mask data on read should provide an easy way for the polling process to know if it is seeing new poll events. > @@ -292,8 +351,13 @@ static void eventfd_show_fdinfo(struct seq_file *m, struct file *f) > struct eventfd_ctx *ctx = f->private_data; > > spin_lock_irq(&ctx->wqh.lock); > - seq_printf(m, "eventfd-count: %16llx\n", > - (unsigned long long)ctx->count); > + if (ctx->flags & EFD_MASK) { > + seq_printf(m, "eventfd-mask: %x\n", > + (unsigned)ctx->mask.events); > + } else { > + seq_printf(m, "eventfd-count: %16llx\n", > + (unsigned long long)ctx->count); > + } > spin_unlock_irq(&ctx->wqh.lock); > } I think that putting the EFD_MASK functionality into a different fops structure might be useful for reducing the number of if statements. Thank you, Damian -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2015-08-10 08:23, Damian Hobson-Garcia wrote: > Replying to my own post, but I had the following comments/questions. > Martin, if you have any response to my comments I would be very happy > to > hear them. > > On 2015-08-10 2:51 PM, Damian Hobson-Garcia wrote: >> From: Martin Sustrik <sustrik@250bpm.com> >> > [snip] >> >> write(2): >> >> User is allowed to write only buffers containing the following >> structure: >> >> struct efd_mask { >> __u32 events; >> __u64 data; >> }; >> >> The value of 'events' should be any combination of event flags as >> defined by >> poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified >> events will >> be signaled when polling (select, poll, epoll) on the eventfd is done >> later on. >> 'data' is opaque data that are not interpreted by eventfd object. >> > I'm not fully clear on the purpose that the 'data' member serves. Does > this opaque handle need to be tied together with this event > synchronization construct? It's a convenience thing. Imagine you are implementing your own file descriptor type in user space. You create an EFD_MASK socket and a structure that will hold any state that you need for the socket (tx/rx buffers and such). Now you have two things to pass around. If you want to pass the fd to a function, it must have two parameters (fd and pointer to the structure). To fix it you can put the fd into the structure. That way there's only one thing to pass around (the structure). The problem with that approach is when you have generic code that deals with file descriptors. For example, a simple poller which accepts a list of (fd, callback) pairs and invokes the callback when one of the fds signals POLLIN. You can't send a pointer to a structure to such function. All you can send is the fd, but then, when the callback is invoked, fd is all you have. You have no idea where your state is. 'data' member allows you to put the pointer to the state to the socket itself. Thus, if you have a fd, you can always find out where the associated data is by reading the mask structure from the fd. > > [snip] > >> @@ -55,6 +69,9 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 >> n) >> { >> + /* This function should never be used with eventfd in the mask mode. >> */ >> + BUG_ON(ctx->flags & EFD_MASK); >> + > ... >> @@ -158,6 +180,9 @@ int eventfd_ctx_remove_wait_queue(struct >> eventfd_ctx *ctx, wait_queue_t *wait, >> { >> + /* This function should never be used with eventfd in the mask mode. >> */ >> + BUG_ON(ctx->flags & EFD_MASK); >> + > ... >> @@ -188,6 +213,9 @@ ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, >> int no_wait, __u64 *cnt) >> + /* This function should never be used with eventfd in the mask mode. >> */ >> + BUG_ON(ctx->flags & EFD_MASK); >> + > > If eventfd_ctx_fileget() returns EINVAL when EFD_MASK is set, I don't > think that there will be a way to call these functions in the mask > mode, > so it should be possible to get rid of the BUG_ON checks. Sure. Feel free to do so. > > [snip] >> @@ -230,6 +258,19 @@ static ssize_t eventfd_read(struct file *file, >> char __user *buf, size_t count, >> ssize_t res; >> __u64 cnt; >> >> + if (ctx->flags & EFD_MASK) { >> + struct efd_mask mask; >> + >> + if (count < sizeof(mask)) >> + return -EINVAL; >> + spin_lock_irq(&ctx->wqh.lock); >> + mask = ctx->mask; >> + spin_unlock_irq(&ctx->wqh.lock); >> + if (copy_to_user(buf, &mask, sizeof(mask))) >> + return -EFAULT; >> + return sizeof(mask); >> + } >> + > > For the other eventfd modes, reading the value will update the internal > state of the eventfd (either clearing or decrementing the counter). > Should something similar be done here? I'm thinking of a case where a > process is polling on this fd in a loop. Clearing the efd_mask data on > read should provide an easy way for the polling process to know if it > is > seeing new poll events. No. In this case reading the value has no effect on the state of the fd. How it should work is rather: // fd is in POLLIN state poll(fd); // function exits with POLLIN but fd remains in POLLIN state my_recv(fd, buf, size); // my_recv function have found out that there's no more data to recv and switched off the POLLIN flag poll(fd); // we block here waiting for more data to arrive from the network > >> @@ -292,8 +351,13 @@ static void eventfd_show_fdinfo(struct seq_file >> *m, struct file *f) >> struct eventfd_ctx *ctx = f->private_data; >> >> spin_lock_irq(&ctx->wqh.lock); >> - seq_printf(m, "eventfd-count: %16llx\n", >> - (unsigned long long)ctx->count); >> + if (ctx->flags & EFD_MASK) { >> + seq_printf(m, "eventfd-mask: %x\n", >> + (unsigned)ctx->mask.events); >> + } else { >> + seq_printf(m, "eventfd-count: %16llx\n", >> + (unsigned long long)ctx->count); >> + } >> spin_unlock_irq(&ctx->wqh.lock); >> } > I think that putting the EFD_MASK functionality into a different fops > structure might be useful for reducing the number of if statements. Sure. No objections. Thanks for re-submitting the patch! Martin -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Martin, Thanks for your comments. On 2015-08-10 3:39 PM, Martin Sustrik wrote: > On 2015-08-10 08:23, Damian Hobson-Garcia wrote: >> Replying to my own post, but I had the following comments/questions. >> Martin, if you have any response to my comments I would be very happy to >> hear them. >> >> On 2015-08-10 2:51 PM, Damian Hobson-Garcia wrote: >>> From: Martin Sustrik <sustrik@250bpm.com> >>> >> [snip] >>> >>> write(2): >>> >>> User is allowed to write only buffers containing the following >>> structure: >>> >>> struct efd_mask { >>> __u32 events; >>> __u64 data; >>> }; >>> >>> The value of 'events' should be any combination of event flags as >>> defined by >>> poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified >>> events will >>> be signaled when polling (select, poll, epoll) on the eventfd is done >>> later on. >>> 'data' is opaque data that are not interpreted by eventfd object. >>> >> I'm not fully clear on the purpose that the 'data' member serves. Does >> this opaque handle need to be tied together with this event >> synchronization construct? > > It's a convenience thing. Imagine you are implementing your own file > descriptor type in user space. You create an EFD_MASK socket and a > structure that will hold any state that you need for the socket (tx/rx > buffers and such). > > Now you have two things to pass around. If you want to pass the fd to a > function, it must have two parameters (fd and pointer to the structure). > > To fix it you can put the fd into the structure. That way there's only > one thing to pass around (the structure). > > The problem with that approach is when you have generic code that deals > with file descriptors. For example, a simple poller which accepts a list > of (fd, callback) pairs and invokes the callback when one of the fds > signals POLLIN. You can't send a pointer to a structure to such > function. All you can send is the fd, but then, when the callback is > invoked, fd is all you have. You have no idea where your state is. > > 'data' member allows you to put the pointer to the state to the socket > itself. Thus, if you have a fd, you can always find out where the > associated data is by reading the mask structure from the fd. > Ok, I see what you're saying. I guess that keeping track of the mapping between the fd and the struct in user space could be non-trivial if there are a large number of active fds that are polling very frequently. Wouldn't it be sufficient to just use epoll() in this case though? It already seems to support this kind of thing. >> >> [snip] >> >>> @@ -55,6 +69,9 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n) >>> { >>> + /* This function should never be used with eventfd in the mask >>> mode. */ >>> + BUG_ON(ctx->flags & EFD_MASK); >>> + >> ... >>> @@ -158,6 +180,9 @@ int eventfd_ctx_remove_wait_queue(struct >>> eventfd_ctx *ctx, wait_queue_t *wait, >>> { >>> + /* This function should never be used with eventfd in the mask >>> mode. */ >>> + BUG_ON(ctx->flags & EFD_MASK); >>> + >> ... >>> @@ -188,6 +213,9 @@ ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, >>> int no_wait, __u64 *cnt) >>> + /* This function should never be used with eventfd in the mask >>> mode. */ >>> + BUG_ON(ctx->flags & EFD_MASK); >>> + >> >> If eventfd_ctx_fileget() returns EINVAL when EFD_MASK is set, I don't >> think that there will be a way to call these functions in the mask mode, >> so it should be possible to get rid of the BUG_ON checks. > > Sure. Feel free to do so. > >> >> [snip] >>> @@ -230,6 +258,19 @@ static ssize_t eventfd_read(struct file *file, >>> char __user *buf, size_t count, >>> ssize_t res; >>> __u64 cnt; >>> >>> + if (ctx->flags & EFD_MASK) { >>> + struct efd_mask mask; >>> + >>> + if (count < sizeof(mask)) >>> + return -EINVAL; >>> + spin_lock_irq(&ctx->wqh.lock); >>> + mask = ctx->mask; >>> + spin_unlock_irq(&ctx->wqh.lock); >>> + if (copy_to_user(buf, &mask, sizeof(mask))) >>> + return -EFAULT; >>> + return sizeof(mask); >>> + } >>> + >> >> For the other eventfd modes, reading the value will update the internal >> state of the eventfd (either clearing or decrementing the counter). >> Should something similar be done here? I'm thinking of a case where a >> process is polling on this fd in a loop. Clearing the efd_mask data on >> read should provide an easy way for the polling process to know if it is >> seeing new poll events. > > No. In this case reading the value has no effect on the state of the fd. > How it should work is rather: > > // fd is in POLLIN state > poll(fd); > // function exits with POLLIN but fd remains in POLLIN state > my_recv(fd, buf, size); > // my_recv function have found out that there's no more data to recv and > switched off the POLLIN flag > poll(fd); // we block here waiting for more data to arrive from the network > How exactly doe the receiver switch off the POLLIN flag? Does the receiver also write to the eventfd? or do you mean that it just doesn't set POLLIN in the events mask? It seems cleaner to have the sender only write the eventfd and the receiver only read it. Your example would be exactly the same, except that my_recv(fd, buf, size) would read to clear instead of write. >> >>> @@ -292,8 +351,13 @@ static void eventfd_show_fdinfo(struct seq_file >>> *m, struct file *f) >>> struct eventfd_ctx *ctx = f->private_data; >>> >>> spin_lock_irq(&ctx->wqh.lock); >>> - seq_printf(m, "eventfd-count: %16llx\n", >>> - (unsigned long long)ctx->count); >>> + if (ctx->flags & EFD_MASK) { >>> + seq_printf(m, "eventfd-mask: %x\n", >>> + (unsigned)ctx->mask.events); >>> + } else { >>> + seq_printf(m, "eventfd-count: %16llx\n", >>> + (unsigned long long)ctx->count); >>> + } >>> spin_unlock_irq(&ctx->wqh.lock); >>> } >> I think that putting the EFD_MASK functionality into a different fops >> structure might be useful for reducing the number of if statements. > > Sure. No objections. > > Thanks for re-submitting the patch! My pleasure. > Martin > Damian -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2015-08-10 10:57, Damian Hobson-Garcia wrote: > Hi Martin, > > Thanks for your comments. > > On 2015-08-10 3:39 PM, Martin Sustrik wrote: >> On 2015-08-10 08:23, Damian Hobson-Garcia wrote: >>> Replying to my own post, but I had the following comments/questions. >>> Martin, if you have any response to my comments I would be very happy >>> to >>> hear them. >>> >>> On 2015-08-10 2:51 PM, Damian Hobson-Garcia wrote: >>>> From: Martin Sustrik <sustrik@250bpm.com> >>>> >>> [snip] >>>> >>>> write(2): >>>> >>>> User is allowed to write only buffers containing the following >>>> structure: >>>> >>>> struct efd_mask { >>>> __u32 events; >>>> __u64 data; >>>> }; >>>> >>>> The value of 'events' should be any combination of event flags as >>>> defined by >>>> poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified >>>> events will >>>> be signaled when polling (select, poll, epoll) on the eventfd is >>>> done >>>> later on. >>>> 'data' is opaque data that are not interpreted by eventfd object. >>>> >>> I'm not fully clear on the purpose that the 'data' member serves. >>> Does >>> this opaque handle need to be tied together with this event >>> synchronization construct? >> >> It's a convenience thing. Imagine you are implementing your own file >> descriptor type in user space. You create an EFD_MASK socket and a >> structure that will hold any state that you need for the socket (tx/rx >> buffers and such). >> >> Now you have two things to pass around. If you want to pass the fd to >> a >> function, it must have two parameters (fd and pointer to the >> structure). >> >> To fix it you can put the fd into the structure. That way there's only >> one thing to pass around (the structure). >> >> The problem with that approach is when you have generic code that >> deals >> with file descriptors. For example, a simple poller which accepts a >> list >> of (fd, callback) pairs and invokes the callback when one of the fds >> signals POLLIN. You can't send a pointer to a structure to such >> function. All you can send is the fd, but then, when the callback is >> invoked, fd is all you have. You have no idea where your state is. >> >> 'data' member allows you to put the pointer to the state to the socket >> itself. Thus, if you have a fd, you can always find out where the >> associated data is by reading the mask structure from the fd. >> > > Ok, I see what you're saying. I guess that keeping track of the mapping > between the fd and the struct in user space could be non-trivial if > there are a large number of active fds that are polling very > frequently. > Wouldn't it be sufficient to just use epoll() in this case though? It > already seems to support this kind of thing. My use case was like this: int s = mysocket(); ... // myrecv() can get the pointer to the structure // without user having to pass it as an argument myrecv(s, buf, sizeof(buf)); However, same behaviour can be accomplished by simply keeping a static array of pointers in the user space. So let's cut this part out of the patch. > >>> >>> [snip] >>> >>>> @@ -55,6 +69,9 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, >>>> __u64 n) >>>> { >>>> + /* This function should never be used with eventfd in the mask >>>> mode. */ >>>> + BUG_ON(ctx->flags & EFD_MASK); >>>> + >>> ... >>>> @@ -158,6 +180,9 @@ int eventfd_ctx_remove_wait_queue(struct >>>> eventfd_ctx *ctx, wait_queue_t *wait, >>>> { >>>> + /* This function should never be used with eventfd in the mask >>>> mode. */ >>>> + BUG_ON(ctx->flags & EFD_MASK); >>>> + >>> ... >>>> @@ -188,6 +213,9 @@ ssize_t eventfd_ctx_read(struct eventfd_ctx >>>> *ctx, >>>> int no_wait, __u64 *cnt) >>>> + /* This function should never be used with eventfd in the mask >>>> mode. */ >>>> + BUG_ON(ctx->flags & EFD_MASK); >>>> + >>> >>> If eventfd_ctx_fileget() returns EINVAL when EFD_MASK is set, I don't >>> think that there will be a way to call these functions in the mask >>> mode, >>> so it should be possible to get rid of the BUG_ON checks. >> >> Sure. Feel free to do so. >> >>> >>> [snip] >>>> @@ -230,6 +258,19 @@ static ssize_t eventfd_read(struct file *file, >>>> char __user *buf, size_t count, >>>> ssize_t res; >>>> __u64 cnt; >>>> >>>> + if (ctx->flags & EFD_MASK) { >>>> + struct efd_mask mask; >>>> + >>>> + if (count < sizeof(mask)) >>>> + return -EINVAL; >>>> + spin_lock_irq(&ctx->wqh.lock); >>>> + mask = ctx->mask; >>>> + spin_unlock_irq(&ctx->wqh.lock); >>>> + if (copy_to_user(buf, &mask, sizeof(mask))) >>>> + return -EFAULT; >>>> + return sizeof(mask); >>>> + } >>>> + >>> >>> For the other eventfd modes, reading the value will update the >>> internal >>> state of the eventfd (either clearing or decrementing the counter). >>> Should something similar be done here? I'm thinking of a case where a >>> process is polling on this fd in a loop. Clearing the efd_mask data >>> on >>> read should provide an easy way for the polling process to know if it >>> is >>> seeing new poll events. >> >> No. In this case reading the value has no effect on the state of the >> fd. >> How it should work is rather: >> >> // fd is in POLLIN state >> poll(fd); >> // function exits with POLLIN but fd remains in POLLIN state >> my_recv(fd, buf, size); >> // my_recv function have found out that there's no more data to recv >> and >> switched off the POLLIN flag >> poll(fd); // we block here waiting for more data to arrive from the >> network >> > > How exactly doe the receiver switch off the POLLIN flag? Does the > receiver also write to the eventfd? or do you mean that it just doesn't > set POLLIN in the events mask? It seems cleaner to have the sender > only > write the eventfd and the receiver only read it. Your example would be > exactly the same, except that my_recv(fd, buf, size) would read to > clear > instead of write. Keep in mind that the user of your mysocket is not supposed to do recv() or send() on the raw underlying fd. It's the implementation, myrecv() and mysend(), that does that. That being the case and also assuming that we cut the pointer out, there seems to be little use for recv() any more. The implementation of socket knows what state it is in and so it doesn't have to retrieve it using recv(). All it has to do is perform whatever business logic is needed and then set new state of the socket via send(). And the fact there's no clear use case, the logic of recv() is not obvious. We can very well return ENOTIMPL. > >>> >>>> @@ -292,8 +351,13 @@ static void eventfd_show_fdinfo(struct seq_file >>>> *m, struct file *f) >>>> struct eventfd_ctx *ctx = f->private_data; >>>> >>>> spin_lock_irq(&ctx->wqh.lock); >>>> - seq_printf(m, "eventfd-count: %16llx\n", >>>> - (unsigned long long)ctx->count); >>>> + if (ctx->flags & EFD_MASK) { >>>> + seq_printf(m, "eventfd-mask: %x\n", >>>> + (unsigned)ctx->mask.events); >>>> + } else { >>>> + seq_printf(m, "eventfd-count: %16llx\n", >>>> + (unsigned long long)ctx->count); >>>> + } >>>> spin_unlock_irq(&ctx->wqh.lock); >>>> } >>> I think that putting the EFD_MASK functionality into a different fops >>> structure might be useful for reducing the number of if statements. >> >> Sure. No objections. >> >> Thanks for re-submitting the patch! > > My pleasure. > >> Martin >> > > Damian -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2015-08-11 6:16 AM, Martin Sustrik wrote: > On 2015-08-10 10:57, Damian Hobson-Garcia wrote: >> Hi Martin, >> >> Thanks for your comments. >> >> On 2015-08-10 3:39 PM, Martin Sustrik wrote: >>> On 2015-08-10 08:23, Damian Hobson-Garcia wrote: >>>> Replying to my own post, but I had the following comments/questions. >>>> Martin, if you have any response to my comments I would be very >>>> happy to >>>> hear them. >>>> >>>> On 2015-08-10 2:51 PM, Damian Hobson-Garcia wrote: >>>>> From: Martin Sustrik <sustrik@250bpm.com> >>>>> >>>> [snip] >>>>> >>>>> write(2): >>>>> >>>>> User is allowed to write only buffers containing the following >>>>> structure: >>>>> >>>>> struct efd_mask { >>>>> __u32 events; >>>>> __u64 data; >>>>> }; >>>>> >>>>> The value of 'events' should be any combination of event flags as >>>>> defined by >>>>> poll(2) function (POLLIN, POLLOUT, POLLERR, POLLHUP etc.) Specified >>>>> events will >>>>> be signaled when polling (select, poll, epoll) on the eventfd is done >>>>> later on. >>>>> 'data' is opaque data that are not interpreted by eventfd object. >>>>> >>>> I'm not fully clear on the purpose that the 'data' member serves. Does >>>> this opaque handle need to be tied together with this event >>>> synchronization construct? >>> >>> It's a convenience thing. Imagine you are implementing your own file >>> descriptor type in user space. You create an EFD_MASK socket and a >>> structure that will hold any state that you need for the socket (tx/rx >>> buffers and such). >>> >>> Now you have two things to pass around. If you want to pass the fd to a >>> function, it must have two parameters (fd and pointer to the structure). >>> >>> To fix it you can put the fd into the structure. That way there's only >>> one thing to pass around (the structure). >>> >>> The problem with that approach is when you have generic code that deals >>> with file descriptors. For example, a simple poller which accepts a list >>> of (fd, callback) pairs and invokes the callback when one of the fds >>> signals POLLIN. You can't send a pointer to a structure to such >>> function. All you can send is the fd, but then, when the callback is >>> invoked, fd is all you have. You have no idea where your state is. >>> >>> 'data' member allows you to put the pointer to the state to the socket >>> itself. Thus, if you have a fd, you can always find out where the >>> associated data is by reading the mask structure from the fd. >>> >> >> Ok, I see what you're saying. I guess that keeping track of the mapping >> between the fd and the struct in user space could be non-trivial if >> there are a large number of active fds that are polling very frequently. >> Wouldn't it be sufficient to just use epoll() in this case though? It >> already seems to support this kind of thing. > > My use case was like this: > > int s = mysocket(); > ... > // myrecv() can get the pointer to the structure > // without user having to pass it as an argument > myrecv(s, buf, sizeof(buf)); > > However, same behaviour can be accomplished by simply keeping > a static array of pointers in the user space. > > So let's cut this part out of the patch. > Ok, I'll drop the 'data' member. I could add some padding to the efd_mask structure so that it is the same size as the 64-bit data size used when EFD_MASK is not set. >> >>>> >>>> [snip] >>>> >>>>> @@ -55,6 +69,9 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, >>>>> __u64 n) >>>>> { >>>>> + /* This function should never be used with eventfd in the mask >>>>> mode. */ >>>>> + BUG_ON(ctx->flags & EFD_MASK); >>>>> + >>>> ... >>>>> @@ -158,6 +180,9 @@ int eventfd_ctx_remove_wait_queue(struct >>>>> eventfd_ctx *ctx, wait_queue_t *wait, >>>>> { >>>>> + /* This function should never be used with eventfd in the mask >>>>> mode. */ >>>>> + BUG_ON(ctx->flags & EFD_MASK); >>>>> + >>>> ... >>>>> @@ -188,6 +213,9 @@ ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, >>>>> int no_wait, __u64 *cnt) >>>>> + /* This function should never be used with eventfd in the mask >>>>> mode. */ >>>>> + BUG_ON(ctx->flags & EFD_MASK); >>>>> + >>>> >>>> If eventfd_ctx_fileget() returns EINVAL when EFD_MASK is set, I don't >>>> think that there will be a way to call these functions in the mask >>>> mode, >>>> so it should be possible to get rid of the BUG_ON checks. >>> >>> Sure. Feel free to do so. >>> >>>> >>>> [snip] >>>>> @@ -230,6 +258,19 @@ static ssize_t eventfd_read(struct file *file, >>>>> char __user *buf, size_t count, >>>>> ssize_t res; >>>>> __u64 cnt; >>>>> >>>>> + if (ctx->flags & EFD_MASK) { >>>>> + struct efd_mask mask; >>>>> + >>>>> + if (count < sizeof(mask)) >>>>> + return -EINVAL; >>>>> + spin_lock_irq(&ctx->wqh.lock); >>>>> + mask = ctx->mask; >>>>> + spin_unlock_irq(&ctx->wqh.lock); >>>>> + if (copy_to_user(buf, &mask, sizeof(mask))) >>>>> + return -EFAULT; >>>>> + return sizeof(mask); >>>>> + } >>>>> + >>>> >>>> For the other eventfd modes, reading the value will update the internal >>>> state of the eventfd (either clearing or decrementing the counter). >>>> Should something similar be done here? I'm thinking of a case where a >>>> process is polling on this fd in a loop. Clearing the efd_mask data on >>>> read should provide an easy way for the polling process to know if >>>> it is >>>> seeing new poll events. >>> >>> No. In this case reading the value has no effect on the state of the fd. >>> How it should work is rather: >>> >>> // fd is in POLLIN state >>> poll(fd); >>> // function exits with POLLIN but fd remains in POLLIN state >>> my_recv(fd, buf, size); >>> // my_recv function have found out that there's no more data to recv and >>> switched off the POLLIN flag >>> poll(fd); // we block here waiting for more data to arrive from the >>> network >>> >> >> How exactly doe the receiver switch off the POLLIN flag? Does the >> receiver also write to the eventfd? or do you mean that it just doesn't >> set POLLIN in the events mask? It seems cleaner to have the sender only >> write the eventfd and the receiver only read it. Your example would be >> exactly the same, except that my_recv(fd, buf, size) would read to clear >> instead of write. > > Keep in mind that the user of your mysocket is not supposed to do > recv() or send() on the raw underlying fd. It's the implementation, > myrecv() and mysend(), that does that. > > That being the case and also assuming that we cut the pointer out, there > seems > to be little use for recv() any more. The implementation of socket knows > what state it is in and so it doesn't have to retrieve it using recv(). > > All it has to do is perform whatever business logic is needed and then > set new > state of the socket via send(). > > And the fact there's no clear use case, the logic of recv() is not obvious. > We can very well return ENOTIMPL. If the user of mysocket will only directly interact with the fd through poll/select/epoll then yes, read()/recv() doesn't have any use, and I agree, dropping it seems cleanest. > > >> >>>> >>>>> @@ -292,8 +351,13 @@ static void eventfd_show_fdinfo(struct seq_file >>>>> *m, struct file *f) >>>>> struct eventfd_ctx *ctx = f->private_data; >>>>> >>>>> spin_lock_irq(&ctx->wqh.lock); >>>>> - seq_printf(m, "eventfd-count: %16llx\n", >>>>> - (unsigned long long)ctx->count); >>>>> + if (ctx->flags & EFD_MASK) { >>>>> + seq_printf(m, "eventfd-mask: %x\n", >>>>> + (unsigned)ctx->mask.events); >>>>> + } else { >>>>> + seq_printf(m, "eventfd-count: %16llx\n", >>>>> + (unsigned long long)ctx->count); >>>>> + } >>>>> spin_unlock_irq(&ctx->wqh.lock); >>>>> } >>>> I think that putting the EFD_MASK functionality into a different fops >>>> structure might be useful for reducing the number of if statements. >>> >>> Sure. No objections. >>> >>> Thanks for re-submitting the patch! >> >> My pleasure. >> >>> Martin >>> >> >> Damian > Damian -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/eventfd.c b/fs/eventfd.c index 8d0c0df..11fb92a 100644 --- a/fs/eventfd.c +++ b/fs/eventfd.c @@ -2,6 +2,7 @@ * fs/eventfd.c * * Copyright (C) 2007 Davide Libenzi <davidel@xmailserver.org> + * Copyright (C) 2013 Martin Sustrik <sustrik@250bpm.com> * */ @@ -22,18 +23,31 @@ #include <linux/proc_fs.h> #include <linux/seq_file.h> +#define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK) +#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_MASK) +#define EFD_MASK_VALID_EVENTS (POLLIN | POLLPRI | POLLOUT | POLLERR | POLLHUP) + struct eventfd_ctx { struct kref kref; wait_queue_head_t wqh; - /* - * Every time that a write(2) is performed on an eventfd, the - * value of the __u64 being written is added to "count" and a - * wakeup is performed on "wqh". A read(2) will return the "count" - * value to userspace, and will reset "count" to zero. The kernel - * side eventfd_signal() also, adds to the "count" counter and - * issue a wakeup. - */ - __u64 count; + union { + /* + * Every time that a write(2) is performed on an eventfd, the + * value of the __u64 being written is added to "count" and a + * wakeup is performed on "wqh". A read(2) will return the + * "count" value to userspace, and will reset "count" to zero. + * The kernel side eventfd_signal() also, adds to the "count" + * counter and issue a wakeup. + */ + __u64 count; + + /* + * When using eventfd in EFD_MASK mode this stracture stores the + * current events to be signaled on the eventfd (events member) + * along with opaque user-defined data (data member). + */ + struct efd_mask mask; + }; unsigned int flags; }; @@ -55,6 +69,9 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n) { unsigned long flags; + /* This function should never be used with eventfd in the mask mode. */ + BUG_ON(ctx->flags & EFD_MASK); + spin_lock_irqsave(&ctx->wqh.lock, flags); if (ULLONG_MAX - ctx->count < n) n = ULLONG_MAX - ctx->count; @@ -124,6 +141,11 @@ static unsigned int eventfd_poll(struct file *file, poll_table *wait) smp_rmb(); count = ctx->count; + if (ctx->flags & EFD_MASK) { + events = ctx->mask.events; + return events; + } + if (count > 0) events |= POLLIN; if (count == ULLONG_MAX) @@ -158,6 +180,9 @@ int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_t *wait, { unsigned long flags; + /* This function should never be used with eventfd in the mask mode. */ + BUG_ON(ctx->flags & EFD_MASK); + spin_lock_irqsave(&ctx->wqh.lock, flags); eventfd_ctx_do_read(ctx, cnt); __remove_wait_queue(&ctx->wqh, wait); @@ -188,6 +213,9 @@ ssize_t eventfd_ctx_read(struct eventfd_ctx *ctx, int no_wait, __u64 *cnt) ssize_t res; DECLARE_WAITQUEUE(wait, current); + /* This function should never be used with eventfd in the mask mode. */ + BUG_ON(ctx->flags & EFD_MASK); + spin_lock_irq(&ctx->wqh.lock); *cnt = 0; res = -EAGAIN; @@ -230,6 +258,19 @@ static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count, ssize_t res; __u64 cnt; + if (ctx->flags & EFD_MASK) { + struct efd_mask mask; + + if (count < sizeof(mask)) + return -EINVAL; + spin_lock_irq(&ctx->wqh.lock); + mask = ctx->mask; + spin_unlock_irq(&ctx->wqh.lock); + if (copy_to_user(buf, &mask, sizeof(mask))) + return -EFAULT; + return sizeof(mask); + } + if (count < sizeof(cnt)) return -EINVAL; res = eventfd_ctx_read(ctx, file->f_flags & O_NONBLOCK, &cnt); @@ -247,6 +288,24 @@ static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t c __u64 ucnt; DECLARE_WAITQUEUE(wait, current); + if (ctx->flags & EFD_MASK) { + struct efd_mask mask; + + if (count < sizeof(mask)) + return -EINVAL; + if (copy_from_user(&mask, buf, sizeof(mask))) + return -EFAULT; + if (mask.events & ~EFD_MASK_VALID_EVENTS) + return -EINVAL; + spin_lock_irq(&ctx->wqh.lock); + memcpy(&ctx->mask, &mask, sizeof(ctx->mask)); + if (waitqueue_active(&ctx->wqh)) + wake_up_locked_poll(&ctx->wqh, + (unsigned long)ctx->mask.events); + spin_unlock_irq(&ctx->wqh.lock); + return sizeof(ctx->mask); + } + if (count < sizeof(ucnt)) return -EINVAL; if (copy_from_user(&ucnt, buf, sizeof(ucnt))) @@ -292,8 +351,13 @@ static void eventfd_show_fdinfo(struct seq_file *m, struct file *f) struct eventfd_ctx *ctx = f->private_data; spin_lock_irq(&ctx->wqh.lock); - seq_printf(m, "eventfd-count: %16llx\n", - (unsigned long long)ctx->count); + if (ctx->flags & EFD_MASK) { + seq_printf(m, "eventfd-mask: %x\n", + (unsigned)ctx->mask.events); + } else { + seq_printf(m, "eventfd-count: %16llx\n", + (unsigned long long)ctx->count); + } spin_unlock_irq(&ctx->wqh.lock); } #endif @@ -406,7 +470,12 @@ struct file *eventfd_file_create(unsigned int count, int flags) kref_init(&ctx->kref); init_waitqueue_head(&ctx->wqh); - ctx->count = count; + if (flags & EFD_MASK) { + ctx->mask.events = 0; + ctx->mask.data = 0; + } else { + ctx->count = count; + } ctx->flags = flags; file = anon_inode_getfile("[eventfd]", &eventfd_fops, ctx, diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h index ff0b981..87de343 100644 --- a/include/linux/eventfd.h +++ b/include/linux/eventfd.h @@ -8,23 +8,11 @@ #ifndef _LINUX_EVENTFD_H #define _LINUX_EVENTFD_H +#include <uapi/linux/eventfd.h> + #include <linux/fcntl.h> #include <linux/wait.h> -/* - * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining - * new flags, since they might collide with O_* ones. We want - * to re-use O_* flags that couldn't possibly have a meaning - * from eventfd, in order to leave a free define-space for - * shared O_* flags. - */ -#define EFD_SEMAPHORE (1 << 0) -#define EFD_CLOEXEC O_CLOEXEC -#define EFD_NONBLOCK O_NONBLOCK - -#define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK) -#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE) - struct file; #ifdef CONFIG_EVENTFD diff --git a/include/uapi/linux/eventfd.h b/include/uapi/linux/eventfd.h new file mode 100644 index 0000000..03057a5 --- /dev/null +++ b/include/uapi/linux/eventfd.h @@ -0,0 +1,40 @@ +/* + * Copyright (C) 2013 Martin Sustrik <sustrik@250bpm.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + */ + +#ifndef _UAPI_LINUX_EVENTFD_H +#define _UAPI_LINUX_EVENTFD_H + +/* For O_CLOEXEC */ +#include <linux/fcntl.h> +#include <linux/types.h> + +/* + * CAREFUL: Check include/asm-generic/fcntl.h when defining + * new flags, since they might collide with O_* ones. We want + * to re-use O_* flags that couldn't possibly have a meaning + * from eventfd, in order to leave a free define-space for + * shared O_* flags. + */ + +/* Provide semaphore-like semantics for reads from the eventfd. */ +#define EFD_SEMAPHORE (1 << 0) +/* Provide event mask semantics for the eventfd. */ +#define EFD_MASK (1 << 1) +/* Set the close-on-exec (FD_CLOEXEC) flag on the eventfd. */ +#define EFD_CLOEXEC O_CLOEXEC +/* Create the eventfd in non-blocking mode. */ +#define EFD_NONBLOCK O_NONBLOCK + +/* Structure to read/write to eventfd in EFD_MASK mode. */ +struct efd_mask { + __u32 events; + __u64 data; +} __packed; + +#endif /* _UAPI_LINUX_EVENTFD_H */