
[v3] epoll: use refcount to reduce ep_mutex contention

Message ID 1aedd7e87097bc4352ba658ac948c585a655785a.1669657846.git.pabeni@redhat.com (mailing list archive)
State Not Applicable
Series [v3] epoll: use refcount to reduce ep_mutex contention

Checks

Context                Check    Description
netdev/tree_selection  success  Not a local patch

Commit Message

Paolo Abeni Nov. 28, 2022, 6 p.m. UTC
We are observing huge contention on the epmutex during an http
connection/rate test:

 83.17% 0.25%  nginx            [kernel.kallsyms]         [k] entry_SYSCALL_64_after_hwframe
[...]
           |--66.96%--__fput
                      |--60.04%--eventpoll_release_file
                                 |--58.41%--__mutex_lock.isra.6
                                           |--56.56%--osq_lock

The application is multi-threaded, creates a new epoll entry for
each incoming connection, and does not delete it before the
connection shutdown - that is, before the connection's fd close().

Many different threads compete frequently for the epmutex lock,
affecting the overall performance.

To reduce the contention this patch introduces explicit reference counting
for the eventpoll struct. Each registered event acquires a reference,
and references are released at ep_remove() time.

Additionally, this introduces a new 'dying' flag to prevent races between
ep_free() and eventpoll_release_file(): the latter marks, under the f_lock
spinlock, each epitem as dying before removing it, while ep_free() does not
touch dying epitems.

The eventpoll struct is released by whichever of ep_free() and
eventpoll_release_file() drops its last reference.

With all the above in place, we can drop the epmutex usage at disposal time.

Overall this produces a significant performance improvement in the
mentioned connection/rate scenario: the mutex operations disappear from
the topmost offenders in the perf report, and the measured connection
rate grows by ~60%.

Tested-by: Xiumei Mu <xmu@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v3: (addresses comments from Eric Biggers)
- introduce the 'dying' flag, use it to dispose of struct eventpoll
  immediately at ep_free() time
- update leftover comments still referring to the old epmutex usage

v2 at:
https://lore.kernel.org/linux-fsdevel/f35e58ed5af8131f0f402c3dc6c3033fa96d1843.1669312208.git.pabeni@redhat.com/

v1 at:
https://lore.kernel.org/linux-fsdevel/f35e58ed5af8131f0f402c3dc6c3033fa96d1843.1669312208.git.pabeni@redhat.com/

Previous related effort at:
https://lore.kernel.org/linux-fsdevel/20190727113542.162213-1-cj.chengjian@huawei.com/
https://lkml.org/lkml/2017/10/28/81
---
 fs/eventpoll.c | 171 +++++++++++++++++++++++++++++++------------------
 1 file changed, 109 insertions(+), 62 deletions(-)
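
For orientation, a condensed sketch of the disposal scheme the patch
implements; this is simplified from the full diff at the bottom of the
page, not the literal hunks:

/*
 * Every registered epitem holds a reference on its eventpoll; the epoll
 * file itself holds the initial reference (set to 1 in ep_alloc()).
 * Both helpers run under ep->mtx.
 */
static void ep_get(struct eventpoll *ep)
{
	ep->refcount++;
}

static bool ep_put(struct eventpoll *ep)
{
	return --ep->refcount == 0;	/* true: caller must dispose the ep */
}

/*
 * close(epfd) -> ep_free(): removes every epitem not already marked
 * 'dying' by the file side, then drops the initial reference.
 * close(fd)   -> eventpoll_release_file(): marks each of the file's
 * epitems 'dying' under file->f_lock, removes it under ep->mtx and
 * drops its reference.
 * Whichever path drops the last reference disposes the struct eventpoll,
 * so no global epmutex is needed on these paths.
 */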

Comments

Keller, Jacob E Nov. 29, 2022, 12:06 a.m. UTC | #1
On 11/28/2022 10:00 AM, Paolo Abeni wrote:
> We are observing huge contention on the epmutex during an http
> connection/rate test:
> 
>   83.17% 0.25%  nginx            [kernel.kallsyms]         [k] entry_SYSCALL_64_after_hwframe
> [...]
>             |--66.96%--__fput
>                        |--60.04%--eventpoll_release_file
>                                   |--58.41%--__mutex_lock.isra.6
>                                             |--56.56%--osq_lock
> 
> The application is multi-threaded, creates a new epoll entry for
> each incoming connection, and does not delete it before the
> connection shutdown - that is, before the connection's fd close().
> 
> Many different threads compete frequently for the epmutex lock,
> affecting the overall performance.
> 
> To reduce the contention this patch introduces explicit reference counting
> for the eventpoll struct. Each registered event acquires a reference,
> and references are released at ep_remove() time.
> 
> Additionally, this introduces a new 'dying' flag to prevent races between
> ep_free() and eventpoll_release_file(): the latter marks, under f_lock
> spinlock, each epitem as before removing it, while ep_free() does not
> touch dying epitems.
> 
> The eventpoll struct is released by whoever - among ep_free() and
> eventpoll_release_file() drops its last reference.
> 
> With all the above in place, we can drop the epmutex usage at disposal time.
> 
> Overall this produces a significant performance improvement in the
> mentioned connection/rate scenario: the mutex operations disappear from
> the topmost offenders in the perf report, and the measured connections/rate
> grows by ~60%.
> 
> Tested-by: Xiumei Mu <xmu@redhat.com>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> v3: (addresses comments from Eric Biggers)
> - introduce the 'dying' flag, use it to dispose immediately struct eventpoll
>    at ep_free() time
> - update a leftover comments still referring to old epmutex usage
> 
> v2 at:
> https://lore.kernel.org/linux-fsdevel/f35e58ed5af8131f0f402c3dc6c3033fa96d1843.1669312208.git.pabeni@redhat.com/
> 
> v1 at:
> https://lore.kernel.org/linux-fsdevel/f35e58ed5af8131f0f402c3dc6c3033fa96d1843.1669312208.git.pabeni@redhat.com/
> 
> Previous related effort at:
> https://lore.kernel.org/linux-fsdevel/20190727113542.162213-1-cj.chengjian@huawei.com/
> https://lkml.org/lkml/2017/10/28/81
> ---
>   fs/eventpoll.c | 171 +++++++++++++++++++++++++++++++------------------
>   1 file changed, 109 insertions(+), 62 deletions(-)
> 
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index 52954d4637b5..af22e5e6f683 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -57,13 +57,7 @@
>    * we need a lock that will allow us to sleep. This lock is a
>    * mutex (ep->mtx). It is acquired during the event transfer loop,
>    * during epoll_ctl(EPOLL_CTL_DEL) and during eventpoll_release_file().
> - * Then we also need a global mutex to serialize eventpoll_release_file()
> - * and ep_free().
> - * This mutex is acquired by ep_free() during the epoll file
> - * cleanup path and it is also acquired by eventpoll_release_file()
> - * if a file has been pushed inside an epoll set and it is then
> - * close()d without a previous call to epoll_ctl(EPOLL_CTL_DEL).
> - * It is also acquired when inserting an epoll fd onto another epoll
> + * The epmutex is acquired when inserting an epoll fd onto another epoll
>    * fd. We do this so that we walk the epoll tree and ensure that this
>    * insertion does not create a cycle of epoll file descriptors, which
>    * could lead to deadlock. We need a global mutex to prevent two
> @@ -153,6 +147,13 @@ struct epitem {
>   	/* The file descriptor information this item refers to */
>   	struct epoll_filefd ffd;
>   
> +	/*
> +	 * Protected by file->f_lock, true for to-be-released epitem already
> +	 * removed from the "struct file" items list; together with
> +	 * eventpoll->refcount orchestrates "struct eventpoll" disposal
> +	 */
> +	bool dying;
> +
>   	/* List containing poll wait queues */
>   	struct eppoll_entry *pwqlist;
>   
> @@ -217,6 +218,12 @@ struct eventpoll {
>   	u64 gen;
>   	struct hlist_head refs;
>   
> +	/*
> +	 * usage count, protected by mtx, used together with epitem->dying to
> +	 * orchestrate the disposal of this struct
> +	 */
> +	unsigned int refcount;
> +

Why not use a kref (or at least a struct refcount)? Those provide
guarantees such as atomic operations and saturation when the refcount
value would overflow.

>   #ifdef CONFIG_NET_RX_BUSY_POLL
>   	/* used to track busy poll napi_id */
>   	unsigned int napi_id;
> @@ -240,9 +247,7 @@ struct ep_pqueue {
>   /* Maximum number of epoll watched descriptors, per user */
>   static long max_user_watches __read_mostly;
>   
> -/*
> - * This mutex is used to serialize ep_free() and eventpoll_release_file().
> - */
> +/* Used for cycles detection */
>   static DEFINE_MUTEX(epmutex);
>   
>   static u64 loop_check_gen = 0;
> @@ -555,8 +560,7 @@ static void ep_remove_wait_queue(struct eppoll_entry *pwq)
>   
>   /*
>    * This function unregisters poll callbacks from the associated file
> - * descriptor.  Must be called with "mtx" held (or "epmutex" if called from
> - * ep_free).
> + * descriptor.  Must be called with "mtx" held.
>    */
>   static void ep_unregister_pollwait(struct eventpoll *ep, struct epitem *epi)
>   {
> @@ -679,11 +683,38 @@ static void epi_rcu_free(struct rcu_head *head)
>   	kmem_cache_free(epi_cache, epi);
>   }
>   
> +static void ep_get(struct eventpoll *ep)
> +{
> +	ep->refcount++;
> +}
This would become something like "kref_get(&ep->kref)" or maybe even 
something like "kref_get_unless_zero" or some other form depending on 
exactly how you acquire a pointer to an eventpoll structure.

> +
> +/*
> + * Returns true if the event poll can be disposed
> + */
> +static bool ep_put(struct eventpoll *ep)
> +{
> +	if (--ep->refcount)
> +		return false;
> +
> +	WARN_ON_ONCE(!RB_EMPTY_ROOT(&ep->rbr.rb_root));
> +	return true;
> +}

This could become kref_put(&ep->kref, ep_dispose).

> +
> +static void ep_dispose(struct eventpoll *ep)
> +{
> +	mutex_destroy(&ep->mtx);
> +	free_uid(ep->user);
> +	wakeup_source_unregister(ep->ws);
> +	kfree(ep);
> +}
This would take a kref pointer, use container_of to get to the eventpoll
structure, and then perform the necessary cleanup once all references drop.

The exact steps here, and whether it would still be safe to call
mutex_destroy, are a bit unclear, since you would typically only call
mutex_destroy when it's absolutely certain that no one has locked the mutex.

If you're careful about how kref_get is used, you can avoid having to
hold the mutex when ep_dispose is called. Since krefs are struct
refcount and thus atomic, you don't need the lock to protect access or
to provide ordering guarantees.

See Documentation/core-api/kref.rst for a better overview of the API and 
how to use it safely. I suspect that with just kref you could also 
safely avoid the "dying" flag as well, but I am not 100% sure.
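
For illustration only, the kref-based shape described above might look
roughly like this (a sketch under the assumption of a new 'kref' field in
struct eventpoll; not part of the posted patch):

#include <linux/kref.h>

static void ep_get(struct eventpoll *ep)
{
	/* 'kref' would replace the plain 'refcount' field */
	kref_get(&ep->kref);
}

/* Release callback run by kref_put() once the last reference is dropped */
static void ep_dispose(struct kref *kref)
{
	struct eventpoll *ep = container_of(kref, struct eventpoll, kref);

	free_uid(ep->user);
	wakeup_source_unregister(ep->ws);
	kfree(ep);
}

static void ep_put(struct eventpoll *ep)
{
	kref_put(&ep->kref, ep_dispose);
}

/* ep_alloc() would then do kref_init(&ep->kref) instead of setting
 * refcount = 1. */

Note that in this shape ep_dispose() runs wherever the last kref_put()
happens, so ep->mtx must not be held at that point; mutex_destroy() is
left out of the sketch for that reason (this concern comes up again
further down the thread).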

> +
>   /*
>    * Removes a "struct epitem" from the eventpoll RB tree and deallocates
>    * all the associated resources. Must be called with "mtx" held.
> + * If the dying flag is set, do the removal only if force is true.
> + * Returns true if the eventpoll can be disposed.
>    */
> -static int ep_remove(struct eventpoll *ep, struct epitem *epi)
> +static bool __ep_remove(struct eventpoll *ep, struct epitem *epi, bool force)
>   {
>   	struct file *file = epi->ffd.file;
>   	struct epitems_head *to_free;
> @@ -698,6 +729,11 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
>   
>   	/* Remove the current item from the list of epoll hooks */
>   	spin_lock(&file->f_lock);
> +	if (epi->dying && !force) {
> +		spin_unlock(&file->f_lock);
> +		return false;
> +	}
> +
>   	to_free = NULL;
>   	head = file->f_ep;
>   	if (head->first == &epi->fllink && !epi->fllink.next) {
> @@ -731,28 +767,28 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
>   	call_rcu(&epi->rcu, epi_rcu_free);
>   
>   	percpu_counter_dec(&ep->user->epoll_watches);
> +	return ep_put(ep);
> +}
>   
> -	return 0;
> +/*
> + * ep_remove variant for callers owing an additional reference to the ep
> + */
> +static void ep_remove_safe(struct eventpoll *ep, struct epitem *epi)
> +{
> +	WARN_ON_ONCE(__ep_remove(ep, epi, false));
>   }
>   
>   static void ep_free(struct eventpoll *ep)
>   {
>   	struct rb_node *rbp;
>   	struct epitem *epi;
> +	bool dispose;
>   
>   	/* We need to release all tasks waiting for these file */
>   	if (waitqueue_active(&ep->poll_wait))
>   		ep_poll_safewake(ep, NULL);
>   
> -	/*
> -	 * We need to lock this because we could be hit by
> -	 * eventpoll_release_file() while we're freeing the "struct eventpoll".
> -	 * We do not need to hold "ep->mtx" here because the epoll file
> -	 * is on the way to be removed and no one has references to it
> -	 * anymore. The only hit might come from eventpoll_release_file() but
> -	 * holding "epmutex" is sufficient here.
> -	 */
> -	mutex_lock(&epmutex);
> +	mutex_lock(&ep->mtx);
>   
>   	/*
>   	 * Walks through the whole tree by unregistering poll callbacks.
> @@ -766,25 +802,21 @@ static void ep_free(struct eventpoll *ep)
>   
>   	/*
>   	 * Walks through the whole tree by freeing each "struct epitem". At this
> -	 * point we are sure no poll callbacks will be lingering around, and also by
> -	 * holding "epmutex" we can be sure that no file cleanup code will hit
> -	 * us during this operation. So we can avoid the lock on "ep->lock".
> -	 * We do not need to lock ep->mtx, either, we only do it to prevent
> -	 * a lockdep warning.
> +	 * point we are sure no poll callbacks will be lingering around.
> +	 * Since we still own a reference to the eventpoll struct, the loop can't
> +	 * dispose it.
>   	 */
> -	mutex_lock(&ep->mtx);
>   	while ((rbp = rb_first_cached(&ep->rbr)) != NULL) {
>   		epi = rb_entry(rbp, struct epitem, rbn);
> -		ep_remove(ep, epi);
> +		ep_remove_safe(ep, epi);
>   		cond_resched();
>   	}
> +
> +	dispose = ep_put(ep);
>   	mutex_unlock(&ep->mtx);
>   
> -	mutex_unlock(&epmutex);
> -	mutex_destroy(&ep->mtx);
> -	free_uid(ep->user);
> -	wakeup_source_unregister(ep->ws);
> -	kfree(ep);
> +	if (dispose)
> +		ep_dispose(ep);
>   }
>   
>   static int ep_eventpoll_release(struct inode *inode, struct file *file)
> @@ -904,33 +936,35 @@ void eventpoll_release_file(struct file *file)
>   {
>   	struct eventpoll *ep;
>   	struct epitem *epi;
> -	struct hlist_node *next;
> +	bool dispose;
>   
>   	/*
> -	 * We don't want to get "file->f_lock" because it is not
> -	 * necessary. It is not necessary because we're in the "struct file"
> -	 * cleanup path, and this means that no one is using this file anymore.
> -	 * So, for example, epoll_ctl() cannot hit here since if we reach this
> -	 * point, the file counter already went to zero and fget() would fail.
> -	 * The only hit might come from ep_free() but by holding the mutex
> -	 * will correctly serialize the operation. We do need to acquire
> -	 * "ep->mtx" after "epmutex" because ep_remove() requires it when called
> -	 * from anywhere but ep_free().
> -	 *
> -	 * Besides, ep_remove() acquires the lock, so we can't hold it here.
> +	 * Use the 'dying' flag to prevent a concurrent ep_free() from touching
> +	 * the epitems list before eventpoll_release_file() can access the
> +	 * ep->mtx.
>   	 */
> -	mutex_lock(&epmutex);
> -	if (unlikely(!file->f_ep)) {
> -		mutex_unlock(&epmutex);
> -		return;
> -	}
> -	hlist_for_each_entry_safe(epi, next, file->f_ep, fllink) {
> +again:
> +	spin_lock(&file->f_lock);
> +	if (file->f_ep && file->f_ep->first) {
> +		/* detach from ep tree */
> +		epi = hlist_entry(file->f_ep->first, struct epitem, fllink);
> +		epi->dying = true;
> +		spin_unlock(&file->f_lock);
> +
> +		/*
> +		 * ep access is safe as we still own a reference to the ep
> +		 * struct
> +		 */
>   		ep = epi->ep;
> -		mutex_lock_nested(&ep->mtx, 0);
> -		ep_remove(ep, epi);
> +		mutex_lock(&ep->mtx);
> +		dispose = __ep_remove(ep, epi, true);
>   		mutex_unlock(&ep->mtx);
> +
> +		if (dispose)
> +			ep_dispose(ep);
> +		goto again;
>   	}
> -	mutex_unlock(&epmutex);
> +	spin_unlock(&file->f_lock);
>   }
>   
>   static int ep_alloc(struct eventpoll **pep)
> @@ -953,6 +987,7 @@ static int ep_alloc(struct eventpoll **pep)
>   	ep->rbr = RB_ROOT_CACHED;
>   	ep->ovflist = EP_UNACTIVE_PTR;
>   	ep->user = user;
> +	ep->refcount = 1;
>   
>   	*pep = ep;
>   
> @@ -1494,16 +1529,22 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
>   	if (tep)
>   		mutex_unlock(&tep->mtx);
>   
> +	/*
> +	 * ep_remove() calls in the later error paths can't lead to ep_dispose()
> +	 * as overall will lead to no refcount changes
> +	 */
> +	ep_get(ep);
> +
>   	/* now check if we've created too many backpaths */
>   	if (unlikely(full_check && reverse_path_check())) {
> -		ep_remove(ep, epi);
> +		ep_remove_safe(ep, epi);
>   		return -EINVAL;
>   	}
>   
>   	if (epi->event.events & EPOLLWAKEUP) {
>   		error = ep_create_wakeup_source(epi);
>   		if (error) {
> -			ep_remove(ep, epi);
> +			ep_remove_safe(ep, epi);
>   			return error;
>   		}
>   	}
> @@ -1527,7 +1568,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
>   	 * high memory pressure.
>   	 */
>   	if (unlikely(!epq.epi)) {
> -		ep_remove(ep, epi);
> +		ep_remove_safe(ep, epi);
>   		return -ENOMEM;
>   	}
>   
> @@ -2165,10 +2206,16 @@ int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds,
>   			error = -EEXIST;
>   		break;
>   	case EPOLL_CTL_DEL:
> -		if (epi)
> -			error = ep_remove(ep, epi);
> -		else
> +		if (epi) {
> +			/*
> +			 * The eventpoll itself is still alive: the refcount
> +			 * can't go to zero here.
> +			 */
> +			ep_remove_safe(ep, epi);
> +			error = 0;
> +		} else {
>   			error = -ENOENT;
> +		}
>   		break;
>   	case EPOLL_CTL_MOD:
>   		if (epi) {
Paolo Abeni Nov. 29, 2022, 9:05 a.m. UTC | #2
Hello,

On Mon, 2022-11-28 at 16:06 -0800, Jacob Keller wrote:
> > @@ -217,6 +218,12 @@ struct eventpoll {
> >   	u64 gen;
> >   	struct hlist_head refs;
> >   
> > +	/*
> > +	 * usage count, protected by mtx, used together with epitem->dying to
> > +	 * orchestrate the disposal of this struct
> > +	 */
> > +	unsigned int refcount;
> > +
> 
> Why not use a kref (or at least struct refcount?) those provide some 
> guarantees like guaranteeing atomic operations and saturation when the 
> refcount value would overflow.

Thank you for the feedback!

I thought about those options and ultimately opted otherwise because we
can avoid the additional atomic operations required by kref/refcount_t.
The reference count is always touched under the ep->mtx mutex, so
reasonably this does not introduce performance regressions even in the
stranger corner cases.

The above was explicitly noted in the previous revisions' commit
message, but I removed it by mistake while updating the message for v3.

I can switch to kref if there is agreement WRT such performance trade-
off. Another option would be adding a couple of additional checks for
wrap-arounds in ep_put() and ep_get() - so that we get similar safety
guarantees but no additional atomic operations.
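
For illustration only, those extra checks might look roughly like this - a
sketch that keeps the plain counter under ep->mtx and only adds sanity
checks (not part of the posted patch):

static void ep_get(struct eventpoll *ep)
{
	/* still protected by ep->mtx; warn instead of silently wrapping */
	WARN_ON_ONCE(++ep->refcount == 0);
}

/*
 * Returns true if the event poll can be disposed
 */
static bool ep_put(struct eventpoll *ep)
{
	/* catch an underflow rather than wrapping around */
	if (WARN_ON_ONCE(!ep->refcount))
		return false;

	if (--ep->refcount)
		return false;

	WARN_ON_ONCE(!RB_EMPTY_ROOT(&ep->rbr.rb_root));
	return true;
}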

> >   #ifdef CONFIG_NET_RX_BUSY_POLL
> >   	/* used to track busy poll napi_id */
> >   	unsigned int napi_id;
> > @@ -240,9 +247,7 @@ struct ep_pqueue {
> >   /* Maximum number of epoll watched descriptors, per user */
> >   static long max_user_watches __read_mostly;
> >   
> > -/*
> > - * This mutex is used to serialize ep_free() and eventpoll_release_file().
> > - */
> > +/* Used for cycles detection */
> >   static DEFINE_MUTEX(epmutex);
> >   
> >   static u64 loop_check_gen = 0;
> > @@ -555,8 +560,7 @@ static void ep_remove_wait_queue(struct eppoll_entry *pwq)
> >   
> >   /*
> >    * This function unregisters poll callbacks from the associated file
> > - * descriptor.  Must be called with "mtx" held (or "epmutex" if called from
> > - * ep_free).
> > + * descriptor.  Must be called with "mtx" held.
> >    */
> >   static void ep_unregister_pollwait(struct eventpoll *ep, struct epitem *epi)
> >   {
> > @@ -679,11 +683,38 @@ static void epi_rcu_free(struct rcu_head *head)
> >   	kmem_cache_free(epi_cache, epi);
> >   }
> >   
> > +static void ep_get(struct eventpoll *ep)
> > +{
> > +	ep->refcount++;
> > +}
> This would become something like "kref_get(&ep->kref)" or maybe even 
> something like "kref_get_unless_zero" or some other form depending on 
> exactly how you acquire a pointer to an eventpoll structure.

No need for kref_get_unless_zero here: at all ep_get() call-sites we
know that at least one other reference is alive and can't go away
concurrently.

> > +
> > +/*
> > + * Returns true if the event poll can be disposed
> > + */
> > +static bool ep_put(struct eventpoll *ep)
> > +{
> > +	if (--ep->refcount)
> > +		return false;
> > +
> > +	WARN_ON_ONCE(!RB_EMPTY_ROOT(&ep->rbr.rb_root));
> > +	return true;
> > +}
> 
> This could become kref_put(&ep->kref, ep_dispose).

I think it would be necessary to release the ep->mtx mutex before
invoking ep_dispose()...

> > +
> > +static void ep_dispose(struct eventpoll *ep)
> > +{
> > +	mutex_destroy(&ep->mtx);
> > +	free_uid(ep->user);
> > +	wakeup_source_unregister(ep->ws);
> > +	kfree(ep);
> > +}
> This would takea  kref pointer, use container_of to get to the eventpoll 
> structure, and then perform necessary cleanup once all references drop.
> 
> The exact specific steps here and whether it would still be safe to call 
> mutex_destroy is a bit unclear since you typically would only call 
> mutex_destroy when its absolutely sure that no one has locked the mutex.

... due to the above. The current patch code ensures that ep_dispose()
is called only after the ep->mtx mutex is released.


> See Documentation/core-api/kref.rst for a better overview of the API and 
> how to use it safely. I suspect that with just kref you could also 
> safely avoid the "dying" flag as well, but I am not 100% sure.

I *think* we will still need the 'dying' flag, otherwise ep_free()
can't tell whether the traversed epitem entries still hold a reference
to the struct eventpoll - eventpoll_release_file() and ep_free() could
potentially try to release the same reference twice, and kref could
detect that race only for the last reference.

Thanks!

Paolo
Keller, Jacob E Nov. 29, 2022, 6:25 p.m. UTC | #3
On 11/29/2022 1:05 AM, Paolo Abeni wrote:
> Hello,
> 
> On Mon, 2022-11-28 at 16:06 -0800, Jacob Keller wrote:
>>> @@ -217,6 +218,12 @@ struct eventpoll {
>>>    	u64 gen;
>>>    	struct hlist_head refs;
>>>    
>>> +	/*
>>> +	 * usage count, protected by mtx, used together with epitem->dying to
>>> +	 * orchestrate the disposal of this struct
>>> +	 */
>>> +	unsigned int refcount;
>>> +
>>
>> Why not use a kref (or at least struct refcount?) those provide some
>> guarantees like guaranteeing atomic operations and saturation when the
>> refcount value would overflow.
> 
> Thank you for the feedback!
> 
> I thought about that options and ultimately opted otherwise because we
> can avoid the additional atomic operations required by kref/refcount_t.
> The reference count is always touched under the ep->mtx mutex.
> Reasonably this does not introduce performance regressions even in the
> stranger corner case.
> 
> The above was explicitly noted in the previous revisions commit
> message, but I removed it by mistake while updating the message for v3.
> 
> I can switch to kref if there is agreement WRT such performance trade-
> off. Another option would be adding a couple of additional checks for
> wrap-arounds in ep_put() and ep_get() - so that we get similar safety
> guarantees but no additional atomic operations.
> 

If it's already been assessed and discarded for a good reason then I
don't have a problem with sticking with this version. I didn't see any
reference to this before, and krefs feel like the natural choice for
refcounting since it seems easier to get the implementation correct
that way.

>>>    #ifdef CONFIG_NET_RX_BUSY_POLL
>>>    	/* used to track busy poll napi_id */
>>>    	unsigned int napi_id;
>>> @@ -240,9 +247,7 @@ struct ep_pqueue {
>>>    /* Maximum number of epoll watched descriptors, per user */
>>>    static long max_user_watches __read_mostly;
>>>    
>>> -/*
>>> - * This mutex is used to serialize ep_free() and eventpoll_release_file().
>>> - */
>>> +/* Used for cycles detection */
>>>    static DEFINE_MUTEX(epmutex);
>>>    
>>>    static u64 loop_check_gen = 0;
>>> @@ -555,8 +560,7 @@ static void ep_remove_wait_queue(struct eppoll_entry *pwq)
>>>    
>>>    /*
>>>     * This function unregisters poll callbacks from the associated file
>>> - * descriptor.  Must be called with "mtx" held (or "epmutex" if called from
>>> - * ep_free).
>>> + * descriptor.  Must be called with "mtx" held.
>>>     */
>>>    static void ep_unregister_pollwait(struct eventpoll *ep, struct epitem *epi)
>>>    {
>>> @@ -679,11 +683,38 @@ static void epi_rcu_free(struct rcu_head *head)
>>>    	kmem_cache_free(epi_cache, epi);
>>>    }
>>>    
>>> +static void ep_get(struct eventpoll *ep)
>>> +{
>>> +	ep->refcount++;
>>> +}
>> This would become something like "kref_get(&ep->kref)" or maybe even
>> something like "kref_get_unless_zero" or some other form depending on
>> exactly how you acquire a pointer to an eventpoll structure.
> 
> No need for kref_get_unless_zero here, in all ep_get() call-sites we
> know that at least onother reference is alive and can't go away
> concurrently.
> 

Fair, yea.

>>> +
>>> +/*
>>> + * Returns true if the event poll can be disposed
>>> + */
>>> +static bool ep_put(struct eventpoll *ep)
>>> +{
>>> +	if (--ep->refcount)
>>> +		return false;
>>> +
>>> +	WARN_ON_ONCE(!RB_EMPTY_ROOT(&ep->rbr.rb_root));
>>> +	return true;
>>> +}
>>
>> This could become kref_put(&ep->kref, ep_dispose).
> 
> I think it would be necessary releasing the ep->mtx mutex before
> invoking ep_dispose()...
> 

Yes you'd have to refactor things a bit since ep_dispose wouldn't 
actually get called until all outstanding references get dropped.

>>> +
>>> +static void ep_dispose(struct eventpoll *ep)
>>> +{
>>> +	mutex_destroy(&ep->mtx);
>>> +	free_uid(ep->user);
>>> +	wakeup_source_unregister(ep->ws);
>>> +	kfree(ep);
>>> +}
>> This would takea  kref pointer, use container_of to get to the eventpoll
>> structure, and then perform necessary cleanup once all references drop.
>>
>> The exact specific steps here and whether it would still be safe to call
>> mutex_destroy is a bit unclear since you typically would only call
>> mutex_destroy when its absolutely sure that no one has locked the mutex.
> 
> ... due to the above. The current patch code ensures that ep_dispose()
> is called only after the ep->mtx mutex is released.
> 

Correct, because you use the dying flag, yep.

> 
>> See Documentation/core-api/kref.rst for a better overview of the API and
>> how to use it safely. I suspect that with just kref you could also
>> safely avoid the "dying" flag as well, but I am not 100% sure.
> 
> I *think* we will still need the 'dying' flag, otherwise ep_free()
> can't tell if the traversed epitems entries still held a reference to
> struct eventpoll - eventpoll_release_file() and ep_free() could
> potentially try to release the same reference twice and kref could
> detect that race only for the last reference.
> 
> Thanks!
> 
> Paolo
> 

Right. Switching to krefs would require some thinking to go through the 
various cases of what gets a ref and how.

I'm not sure I follow how that would work; it sounds like something weird
with the way references are being counted here. It's likely that I simply
don't understand the full context.

Either way, if there is good reason to avoid kref due to the cost of
atomics in this case, and the fact that we need the mutex anyway, I
suppose I have no objections. I'm not sure what the perf cost of an
atomic vs. a simple integer is in practice, though.
Eric Biggers Dec. 13, 2022, 6:01 a.m. UTC | #4
On Mon, Nov 28, 2022 at 07:00:10PM +0100, Paolo Abeni wrote:
> We are observing huge contention on the epmutex during an http
> connection/rate test:
> 
>  83.17% 0.25%  nginx            [kernel.kallsyms]         [k] entry_SYSCALL_64_after_hwframe
> [...]
>            |--66.96%--__fput
>                       |--60.04%--eventpoll_release_file
>                                  |--58.41%--__mutex_lock.isra.6
>                                            |--56.56%--osq_lock
> 
> The application is multi-threaded, creates a new epoll entry for
> each incoming connection, and does not delete it before the
> connection shutdown - that is, before the connection's fd close().
> 
> Many different threads compete frequently for the epmutex lock,
> affecting the overall performance.
> 
> To reduce the contention this patch introduces explicit reference counting
> for the eventpoll struct. Each registered event acquires a reference,
> and references are released at ep_remove() time.
> 
> Additionally, this introduces a new 'dying' flag to prevent races between
> ep_free() and eventpoll_release_file(): the latter marks, under f_lock
> spinlock, each epitem as before removing it, while ep_free() does not
> touch dying epitems.
> 
> The eventpoll struct is released by whoever - among ep_free() and
> eventpoll_release_file() drops its last reference.
> 
> With all the above in place, we can drop the epmutex usage at disposal time.
> 
> Overall this produces a significant performance improvement in the
> mentioned connection/rate scenario: the mutex operations disappear from
> the topmost offenders in the perf report, and the measured connections/rate
> grows by ~60%.
> 
> Tested-by: Xiumei Mu <xmu@redhat.com>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>

I am trying to understand whether this patch is correct.

One thing that would help would be to use more standard naming:

	ep_put => ep_refcount_dec_and_test (or ep_put_and_test)
	ep_dispose => ep_free
	ep_free => ep_clear_and_put

- Eric
Paolo Abeni Dec. 13, 2022, 6:21 p.m. UTC | #5
Hi,

On Mon, 2022-12-12 at 22:01 -0800, Eric Biggers wrote:
> I am trying to understand whether this patch is correct.
> 
> One thing that would help would be to use more standard naming:
> 
> 	ep_put => ep_refcount_dec_and_test (or ep_put_and_test)
> 	ep_dispose => ep_free
> 	ep_free => ep_clear_and_put

Thank you for the feedback. 

I must admit I'm not good at all at selecting good names, so I
definitely will apply the above. I additionally still have to cover
the feedback from Jacob - switching the reference count to a kref - as
I've been diverted to other tasks.

I hope to be able to share a new revision of this patch next week.

Thanks,

Paolo
Eric Biggers Dec. 13, 2022, 6:59 p.m. UTC | #6
On Tue, Dec 13, 2022 at 07:21:14PM +0100, Paolo Abeni wrote:
> Hi,
> 
> On Mon, 2022-12-12 at 22:01 -0800, Eric Biggers wrote:
> > I am trying to understand whether this patch is correct.
> > 
> > One thing that would help would be to use more standard naming:
> > 
> > 	ep_put => ep_refcount_dec_and_test (or ep_put_and_test)
> > 	ep_dispose => ep_free
> > 	ep_free => ep_clear_and_put
> 
> Thank you for the feedback. 
> 
> I must admit I'm not good at all at selecting good names, so I
> definitelly will apply the above. I additionally still have to cover
> the feedback from Jacob - switching the reference count to a kref - as
> I've been diverted to other tasks.
> 
> I hope to be able to share a new revision of this patch next week.

Using 'refcount_t' directly is another option.

I think a plain 'unsigned int' would be fine here, if all reads and writes of
the refcount really happen under a mutex.  Using refcount_t (or kref) would add
some extra sanity checks, though.

- Eric
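
For reference, the refcount_t option mentioned above might look roughly
like this - a sketch only; the field type change and the ep_alloc()
update are assumptions, not part of the posted patch:

#include <linux/refcount.h>

/* 'refcount' becomes a refcount_t and ep_alloc() initializes it with
 * refcount_set(&ep->refcount, 1); */

static void ep_get(struct eventpoll *ep)
{
	refcount_inc(&ep->refcount);
}

/*
 * Returns true if the event poll can be disposed
 */
static bool ep_put(struct eventpoll *ep)
{
	if (!refcount_dec_and_test(&ep->refcount))
		return false;

	WARN_ON_ONCE(!RB_EMPTY_ROOT(&ep->rbr.rb_root));
	return true;
}

This keeps the saturation and underflow checks of refcount_t while the
ep->mtx serialization stays as in the patch; the cost is the extra atomic
operation per get/put that the thread discusses.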

Patch

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 52954d4637b5..af22e5e6f683 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -57,13 +57,7 @@ 
  * we need a lock that will allow us to sleep. This lock is a
  * mutex (ep->mtx). It is acquired during the event transfer loop,
  * during epoll_ctl(EPOLL_CTL_DEL) and during eventpoll_release_file().
- * Then we also need a global mutex to serialize eventpoll_release_file()
- * and ep_free().
- * This mutex is acquired by ep_free() during the epoll file
- * cleanup path and it is also acquired by eventpoll_release_file()
- * if a file has been pushed inside an epoll set and it is then
- * close()d without a previous call to epoll_ctl(EPOLL_CTL_DEL).
- * It is also acquired when inserting an epoll fd onto another epoll
+ * The epmutex is acquired when inserting an epoll fd onto another epoll
  * fd. We do this so that we walk the epoll tree and ensure that this
  * insertion does not create a cycle of epoll file descriptors, which
  * could lead to deadlock. We need a global mutex to prevent two
@@ -153,6 +147,13 @@  struct epitem {
 	/* The file descriptor information this item refers to */
 	struct epoll_filefd ffd;
 
+	/*
+	 * Protected by file->f_lock, true for to-be-released epitem already
+	 * removed from the "struct file" items list; together with
+	 * eventpoll->refcount orchestrates "struct eventpoll" disposal
+	 */
+	bool dying;
+
 	/* List containing poll wait queues */
 	struct eppoll_entry *pwqlist;
 
@@ -217,6 +218,12 @@  struct eventpoll {
 	u64 gen;
 	struct hlist_head refs;
 
+	/*
+	 * usage count, protected by mtx, used together with epitem->dying to
+	 * orchestrate the disposal of this struct
+	 */
+	unsigned int refcount;
+
 #ifdef CONFIG_NET_RX_BUSY_POLL
 	/* used to track busy poll napi_id */
 	unsigned int napi_id;
@@ -240,9 +247,7 @@  struct ep_pqueue {
 /* Maximum number of epoll watched descriptors, per user */
 static long max_user_watches __read_mostly;
 
-/*
- * This mutex is used to serialize ep_free() and eventpoll_release_file().
- */
+/* Used for cycles detection */
 static DEFINE_MUTEX(epmutex);
 
 static u64 loop_check_gen = 0;
@@ -555,8 +560,7 @@  static void ep_remove_wait_queue(struct eppoll_entry *pwq)
 
 /*
  * This function unregisters poll callbacks from the associated file
- * descriptor.  Must be called with "mtx" held (or "epmutex" if called from
- * ep_free).
+ * descriptor.  Must be called with "mtx" held.
  */
 static void ep_unregister_pollwait(struct eventpoll *ep, struct epitem *epi)
 {
@@ -679,11 +683,38 @@  static void epi_rcu_free(struct rcu_head *head)
 	kmem_cache_free(epi_cache, epi);
 }
 
+static void ep_get(struct eventpoll *ep)
+{
+	ep->refcount++;
+}
+
+/*
+ * Returns true if the event poll can be disposed
+ */
+static bool ep_put(struct eventpoll *ep)
+{
+	if (--ep->refcount)
+		return false;
+
+	WARN_ON_ONCE(!RB_EMPTY_ROOT(&ep->rbr.rb_root));
+	return true;
+}
+
+static void ep_dispose(struct eventpoll *ep)
+{
+	mutex_destroy(&ep->mtx);
+	free_uid(ep->user);
+	wakeup_source_unregister(ep->ws);
+	kfree(ep);
+}
+
 /*
  * Removes a "struct epitem" from the eventpoll RB tree and deallocates
  * all the associated resources. Must be called with "mtx" held.
+ * If the dying flag is set, do the removal only if force is true.
+ * Returns true if the eventpoll can be disposed.
  */
-static int ep_remove(struct eventpoll *ep, struct epitem *epi)
+static bool __ep_remove(struct eventpoll *ep, struct epitem *epi, bool force)
 {
 	struct file *file = epi->ffd.file;
 	struct epitems_head *to_free;
@@ -698,6 +729,11 @@  static int ep_remove(struct eventpoll *ep, struct epitem *epi)
 
 	/* Remove the current item from the list of epoll hooks */
 	spin_lock(&file->f_lock);
+	if (epi->dying && !force) {
+		spin_unlock(&file->f_lock);
+		return false;
+	}
+
 	to_free = NULL;
 	head = file->f_ep;
 	if (head->first == &epi->fllink && !epi->fllink.next) {
@@ -731,28 +767,28 @@  static int ep_remove(struct eventpoll *ep, struct epitem *epi)
 	call_rcu(&epi->rcu, epi_rcu_free);
 
 	percpu_counter_dec(&ep->user->epoll_watches);
+	return ep_put(ep);
+}
 
-	return 0;
+/*
+ * ep_remove variant for callers owing an additional reference to the ep
+ */
+static void ep_remove_safe(struct eventpoll *ep, struct epitem *epi)
+{
+	WARN_ON_ONCE(__ep_remove(ep, epi, false));
 }
 
 static void ep_free(struct eventpoll *ep)
 {
 	struct rb_node *rbp;
 	struct epitem *epi;
+	bool dispose;
 
 	/* We need to release all tasks waiting for these file */
 	if (waitqueue_active(&ep->poll_wait))
 		ep_poll_safewake(ep, NULL);
 
-	/*
-	 * We need to lock this because we could be hit by
-	 * eventpoll_release_file() while we're freeing the "struct eventpoll".
-	 * We do not need to hold "ep->mtx" here because the epoll file
-	 * is on the way to be removed and no one has references to it
-	 * anymore. The only hit might come from eventpoll_release_file() but
-	 * holding "epmutex" is sufficient here.
-	 */
-	mutex_lock(&epmutex);
+	mutex_lock(&ep->mtx);
 
 	/*
 	 * Walks through the whole tree by unregistering poll callbacks.
@@ -766,25 +802,21 @@  static void ep_free(struct eventpoll *ep)
 
 	/*
 	 * Walks through the whole tree by freeing each "struct epitem". At this
-	 * point we are sure no poll callbacks will be lingering around, and also by
-	 * holding "epmutex" we can be sure that no file cleanup code will hit
-	 * us during this operation. So we can avoid the lock on "ep->lock".
-	 * We do not need to lock ep->mtx, either, we only do it to prevent
-	 * a lockdep warning.
+	 * point we are sure no poll callbacks will be lingering around.
+	 * Since we still own a reference to the eventpoll struct, the loop can't
+	 * dispose it.
 	 */
-	mutex_lock(&ep->mtx);
 	while ((rbp = rb_first_cached(&ep->rbr)) != NULL) {
 		epi = rb_entry(rbp, struct epitem, rbn);
-		ep_remove(ep, epi);
+		ep_remove_safe(ep, epi);
 		cond_resched();
 	}
+
+	dispose = ep_put(ep);
 	mutex_unlock(&ep->mtx);
 
-	mutex_unlock(&epmutex);
-	mutex_destroy(&ep->mtx);
-	free_uid(ep->user);
-	wakeup_source_unregister(ep->ws);
-	kfree(ep);
+	if (dispose)
+		ep_dispose(ep);
 }
 
 static int ep_eventpoll_release(struct inode *inode, struct file *file)
@@ -904,33 +936,35 @@  void eventpoll_release_file(struct file *file)
 {
 	struct eventpoll *ep;
 	struct epitem *epi;
-	struct hlist_node *next;
+	bool dispose;
 
 	/*
-	 * We don't want to get "file->f_lock" because it is not
-	 * necessary. It is not necessary because we're in the "struct file"
-	 * cleanup path, and this means that no one is using this file anymore.
-	 * So, for example, epoll_ctl() cannot hit here since if we reach this
-	 * point, the file counter already went to zero and fget() would fail.
-	 * The only hit might come from ep_free() but by holding the mutex
-	 * will correctly serialize the operation. We do need to acquire
-	 * "ep->mtx" after "epmutex" because ep_remove() requires it when called
-	 * from anywhere but ep_free().
-	 *
-	 * Besides, ep_remove() acquires the lock, so we can't hold it here.
+	 * Use the 'dying' flag to prevent a concurrent ep_free() from touching
+	 * the epitems list before eventpoll_release_file() can access the
+	 * ep->mtx.
 	 */
-	mutex_lock(&epmutex);
-	if (unlikely(!file->f_ep)) {
-		mutex_unlock(&epmutex);
-		return;
-	}
-	hlist_for_each_entry_safe(epi, next, file->f_ep, fllink) {
+again:
+	spin_lock(&file->f_lock);
+	if (file->f_ep && file->f_ep->first) {
+		/* detach from ep tree */
+		epi = hlist_entry(file->f_ep->first, struct epitem, fllink);
+		epi->dying = true;
+		spin_unlock(&file->f_lock);
+
+		/*
+		 * ep access is safe as we still own a reference to the ep
+		 * struct
+		 */
 		ep = epi->ep;
-		mutex_lock_nested(&ep->mtx, 0);
-		ep_remove(ep, epi);
+		mutex_lock(&ep->mtx);
+		dispose = __ep_remove(ep, epi, true);
 		mutex_unlock(&ep->mtx);
+
+		if (dispose)
+			ep_dispose(ep);
+		goto again;
 	}
-	mutex_unlock(&epmutex);
+	spin_unlock(&file->f_lock);
 }
 
 static int ep_alloc(struct eventpoll **pep)
@@ -953,6 +987,7 @@  static int ep_alloc(struct eventpoll **pep)
 	ep->rbr = RB_ROOT_CACHED;
 	ep->ovflist = EP_UNACTIVE_PTR;
 	ep->user = user;
+	ep->refcount = 1;
 
 	*pep = ep;
 
@@ -1494,16 +1529,22 @@  static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
 	if (tep)
 		mutex_unlock(&tep->mtx);
 
+	/*
+	 * ep_remove() calls in the later error paths can't lead to ep_dispose()
+	 * as overall will lead to no refcount changes
+	 */
+	ep_get(ep);
+
 	/* now check if we've created too many backpaths */
 	if (unlikely(full_check && reverse_path_check())) {
-		ep_remove(ep, epi);
+		ep_remove_safe(ep, epi);
 		return -EINVAL;
 	}
 
 	if (epi->event.events & EPOLLWAKEUP) {
 		error = ep_create_wakeup_source(epi);
 		if (error) {
-			ep_remove(ep, epi);
+			ep_remove_safe(ep, epi);
 			return error;
 		}
 	}
@@ -1527,7 +1568,7 @@  static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
 	 * high memory pressure.
 	 */
 	if (unlikely(!epq.epi)) {
-		ep_remove(ep, epi);
+		ep_remove_safe(ep, epi);
 		return -ENOMEM;
 	}
 
@@ -2165,10 +2206,16 @@  int do_epoll_ctl(int epfd, int op, int fd, struct epoll_event *epds,
 			error = -EEXIST;
 		break;
 	case EPOLL_CTL_DEL:
-		if (epi)
-			error = ep_remove(ep, epi);
-		else
+		if (epi) {
+			/*
+			 * The eventpoll itself is still alive: the refcount
+			 * can't go to zero here.
+			 */
+			ep_remove_safe(ep, epi);
+			error = 0;
+		} else {
 			error = -ENOENT;
+		}
 		break;
 	case EPOLL_CTL_MOD:
 		if (epi) {