[v6] epoll: use refcount to reduce ep_mutex contention

We are observing huge contention on the epmutex during an http
connection/rate test:

 83.17% 0.25%  nginx            [kernel.kallsyms]         [k] entry_SYSCALL_64_after_hwframe
[...]
           |--66.96%--__fput
                      |--60.04%--eventpoll_release_file
                                 |--58.41%--__mutex_lock.isra.6
                                           |--56.56%--osq_lock

The application is multi-threaded, creates a new epoll entry for
each incoming connection, and does not delete it before the
connection shutdown - that is, before the connection's fd close().

Many different threads compete frequently for the epmutex lock,
affecting the overall performance.
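
The usage pattern described above can be sketched in userspace as follows.
This is only an illustration: watch_then_close() is a hypothetical helper,
and a pipe stands in for the accepted connection socket. It shows an fd
being registered and then closed without an explicit EPOLL_CTL_DEL, so the
epoll entry is cleaned up by the kernel on the close() path.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper: register the read end of a pipe (a stand-in for a
 * connection socket) with the epoll instance 'epfd', then close it without
 * EPOLL_CTL_DEL. Returns 0 on success, -1 on failure.
 */
int watch_then_close(int epfd)
{
	struct epoll_event ev = { .events = EPOLLIN };
	int fds[2];

	if (pipe(fds) < 0)
		return -1;

	ev.data.fd = fds[0];
	if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) < 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}

	/*
	 * No EPOLL_CTL_DEL here: the entry is dropped by the kernel once the
	 * last reference to the monitored file goes away.
	 */
	close(fds[0]);
	close(fds[1]);
	return 0;
}
```

When many threads do this concurrently, each close() ends up in the
disposal path shown in the perf trace above.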

To reduce the contention this patch introduces explicit reference counting
for the eventpoll struct. Each registered event acquires a reference,
and references are released at ep_remove() time.

The eventpoll struct is released by whoever - among the EP file close()
and the monitored file close() - drops its last reference.
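
A minimal userspace model of the counting scheme, assuming C11 atomics in
place of the kernel refcount primitives. The ep_model_* names, the 'freed'
demo field, and the initial reference held on behalf of the EP file are
assumptions made for illustration, not the code in this patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Illustrative stand-in for the eventpoll struct. */
struct ep_model {
	atomic_int refcount;	/* assumed: 1 for the EP file + 1 per item */
	int *freed;		/* demo only: set to 1 when released */
};

struct ep_model *ep_model_alloc(int *freed)
{
	struct ep_model *ep = malloc(sizeof(*ep));

	if (!ep)
		return NULL;
	atomic_init(&ep->refcount, 1);
	ep->freed = freed;
	return ep;
}

/* Called when an event is registered. */
void ep_model_get(struct ep_model *ep)
{
	atomic_fetch_add(&ep->refcount, 1);
}

/*
 * Drop one reference. Returns true if this call dropped the last one and
 * therefore released the struct.
 */
bool ep_model_put(struct ep_model *ep)
{
	if (atomic_fetch_sub(&ep->refcount, 1) != 1)
		return false;

	*ep->freed = 1;
	free(ep);
	return true;
}
```

Whichever caller brings the count to zero frees the struct, regardless of
which close path it is running on.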

Additionally, this introduces a new 'dying' flag to prevent races between
the EP file close() and the monitored file close().
eventpoll_release_file() marks, under f_lock spinlock, each epitem as dying
before removing it, while EP file close() does not touch dying epitems.

The above is needed as both close operations could run concurrently and
drop the EP reference acquired via the epitem entry. Without the above
flag, the monitored file close() could reach the EP struct via the epitem
list while the epitem is still listed and then try to put it after its
disposal.
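
The hand-off can be modelled in userspace roughly as below. This is a
sketch under stated assumptions: an atomic_flag spinlock stands in for
f_lock, a single item stands in for the epitem list, and all the names
(item_model, item_remove(), monitored_file_close(), ep_file_close()) are
hypothetical. It only shows which path gets to remove an item, and thus
which path is entitled to drop the EP reference held by that item.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct item_model {
	atomic_flag lock;	/* stand-in for the file's f_lock spinlock */
	bool dying;		/* claimed by the monitored file close() */
	bool linked;		/* still registered with the EP struct */
};

void item_init(struct item_model *it)
{
	atomic_flag_clear(&it->lock);
	it->dying = false;
	it->linked = true;
}

void item_lock(struct item_model *it)
{
	while (atomic_flag_test_and_set_explicit(&it->lock, memory_order_acquire))
		;	/* spin */
}

void item_unlock(struct item_model *it)
{
	atomic_flag_clear_explicit(&it->lock, memory_order_release);
}

/*
 * Unlink the item unless another path already claimed or removed it.
 * Returns true if this caller unlinked it and so must drop the item's
 * EP reference.
 */
bool item_remove(struct item_model *it, bool force)
{
	bool removed = false;

	item_lock(it);
	if (it->linked && (force || !it->dying)) {
		it->linked = false;
		removed = true;
	}
	item_unlock(it);
	return removed;
}

/* Monitored file close(): mark the item as dying first, then remove it. */
bool monitored_file_close(struct item_model *it)
{
	item_lock(it);
	it->dying = true;
	item_unlock(it);
	return item_remove(it, true);
}

/* EP file close(): leaves dying items to the monitored file close(). */
bool ep_file_close(struct item_model *it)
{
	return item_remove(it, false);
}
```

In this model exactly one of the two paths reports a removal for any given
item, so the reference it holds is dropped once.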

An alternative could be avoiding touching the references acquired via
the epitems at EP file close() time, but that could leave the EP struct
alive for potentially unlimited time after EP file close(), with nasty
side effects.

With all the above in place, we can drop the epmutex usage at disposal time.

Overall this produces a significant performance improvement in the
mentioned connection/rate scenario: the mutex operations disappear from
the topmost offenders in the perf report, and the measured connections/rate
grows by ~60%.

To make the change more readable this additionally renames ep_free() to
ep_clear_and_put(), and moves the actual memory cleanup into a separate
ep_free() helper.

Co-developed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Xiumei Mu <xmu@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
This addresses a deadlock reported by syzkaller on v5:
https://lore.kernel.org/lkml/000000000000c6dc0305f75b4d74@google.com/T/#u
Because of that change I dropped the Acked-by/Reviewed-by tags carried by
the previous revision.

v5 at:
https://lore.kernel.org/linux-fsdevel/323de732635cc3513c1837c6cbb98f012174f994.1678312201.git.pabeni@redhat.com/

v4 at:
https://lore.kernel.org/linux-fsdevel/f0c49fb4b682b81d64184d1181bc960728907474.camel@redhat.com/T/#t

v3 at:
https://lore.kernel.org/linux-fsdevel/1aedd7e87097bc4352ba658ac948c585a655785a.1669657846.git.pabeni@redhat.com/

v2 at:
https://lore.kernel.org/linux-fsdevel/f35e58ed5af8131f0f402c3dc6c3033fa96d1843.1669312208.git.pabeni@redhat.com/

v1 at:
https://lore.kernel.org/linux-fsdevel/f35e58ed5af8131f0f402c3dc6c3033fa96d1843.1669312208.git.pabeni@redhat.com/

Previous related effort at:
https://lore.kernel.org/linux-fsdevel/20190727113542.162213-1-cj.chengjian@huawei.com/
https://lkml.org/lkml/2017/10/28/81
---
 fs/eventpoll.c | 195 +++++++++++++++++++++++++++++++------------------
 1 file changed, 123 insertions(+), 72 deletions(-)

Message ID: 4a57788dcaf28f5eb4f8dfddcc3a8b172a7357bb.1679504153.git.pabeni@redhat.com
State: Not Applicable
Delegated to: Netdev Maintainers
From: Paolo Abeni <pabeni@redhat.com>
To: netdev@vger.kernel.org
Cc: Soheil Hassas Yeganeh <soheil@google.com>,
    Al Viro <viro@zeniv.linux.org.uk>,
    Carlos Maiolino <cmaiolino@redhat.com>,
    Eric Biggers <ebiggers@kernel.org>,
    Jacob Keller <jacob.e.keller@intel.com>,
    Andrew Morton <akpm@linux-foundation.org>,
    Jens Axboe <axboe@kernel.dk>,
    Christian Brauner <brauner@kernel.org>,
    linux-fsdevel@vger.kernel.org,
    Eric Dumazet <edumazet@google.com>
Subject: [PATCH v6] epoll: use refcount to reduce ep_mutex contention
Date: Wed, 22 Mar 2023 17:57:02 +0100