epoll: avoid calling ep_call_nested() from ep_poll_safewake()

Message ID 1507920533-8812-1-git-send-email-jbaron@akamai.com (mailing list archive)
State New, archived

Commit Message

Jason Baron Oct. 13, 2017, 6:48 p.m. UTC
The ep_poll_safewake() function is used to wake up potentially nested epoll
file descriptors. The function uses ep_call_nested() to prevent entering
the same wakeup queue more than once, and to prevent excessively deep
wakeup paths (deeper than EP_MAX_NESTS). However, this is not necessary,
since we already prevent these conditions during EPOLL_CTL_ADD. Dropping
it saves extra function calls, and avoids taking a global lock during the
ep_call_nested() calls.

I have, however, left ep_call_nested() in place for the
CONFIG_DEBUG_LOCK_ALLOC case, since ep_call_nested() keeps track of the
nesting level, and this is required by the call to
spin_lock_irqsave_nested(). It would be nice to remove the
ep_call_nested() calls for the CONFIG_DEBUG_LOCK_ALLOC case as well;
however, it's not clear how to simply pass the nesting level through
multiple wake_up() levels without more surgery. In any case, I don't think
CONFIG_DEBUG_LOCK_ALLOC is generally used in production. This patch also
apparently fixes a workload at Google that Salman Qazi reported, by
completely removing the poll_safewake_ncalls->lock from wakeup paths.

Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Salman Qazi <sqazi@google.com>
---
 fs/eventpoll.c | 47 ++++++++++++++++++-----------------------------
 1 file changed, 18 insertions(+), 29 deletions(-)
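
As background for the comments below, here is a minimal userspace sketch of
what "nested epoll" means (an illustration added for context, not part of
the patch): one epoll fd is registered inside another, so a wakeup on the
inner set has to propagate to the outer one, which is the path that
ep_poll_safewake() handles.

/* Minimal illustration of nested epoll; error handling trimmed. */
#include <stdint.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
	int inner = epoll_create1(0);
	int outer = epoll_create1(0);
	int evfd = eventfd(0, 0);	/* an ordinary event source */
	struct epoll_event ev = { .events = EPOLLIN };

	/* The event source is watched by the inner epoll set... */
	ev.data.fd = evfd;
	epoll_ctl(inner, EPOLL_CTL_ADD, evfd, &ev);

	/* ...and the inner epoll fd is itself watched by the outer set.
	 * A wakeup on 'inner' now has to be propagated to 'outer'; in
	 * the kernel that is the ep_poll_safewake() path. */
	ev.data.fd = inner;
	epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev);

	/* Signalling the eventfd makes both epoll fds readable. */
	uint64_t one = 1;
	write(evfd, &one, sizeof(one));

	struct epoll_event out;
	printf("outer reports %d ready fd(s)\n",
	       epoll_wait(outer, &out, 1, 0));

	close(evfd);
	close(inner);
	close(outer);
	return 0;
}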

Comments

Davidlohr Bueso Oct. 17, 2017, 3:37 p.m. UTC | #1
On Fri, 13 Oct 2017, Jason Baron wrote:

>The ep_poll_safewake() function is used to wake up potentially nested epoll
>file descriptors. The function uses ep_call_nested() to prevent entering
>the same wakeup queue more than once, and to prevent excessively deep
>wakeup paths (deeper than EP_MAX_NESTS). However, this is not necessary,
>since we already prevent these conditions during EPOLL_CTL_ADD. Dropping
>it saves extra function calls, and avoids taking a global lock during the
>ep_call_nested() calls.

This makes sense.

>
>I have, however, left ep_call_nested() in place for the
>CONFIG_DEBUG_LOCK_ALLOC case, since ep_call_nested() keeps track of the
>nesting level, and this is required by the call to
>spin_lock_irqsave_nested(). It would be nice to remove the
>ep_call_nested() calls for the CONFIG_DEBUG_LOCK_ALLOC case as well;
>however, it's not clear how to simply pass the nesting level through
>multiple wake_up() levels without more surgery. In any case, I don't think
>CONFIG_DEBUG_LOCK_ALLOC is generally used in production. This patch also
>apparently fixes a workload at Google that Salman Qazi reported, by
>completely removing the poll_safewake_ncalls->lock from wakeup paths.

I'm a bit curious about the workload (which uses lots of EPOLL_CTL_ADDs) as
I was tackling the nested epoll scaling issue with loop and readywalk lists
in mind. 

>
>Signed-off-by: Jason Baron <jbaron@akamai.com>
>Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>Cc: Andrew Morton <akpm@linux-foundation.org>
>Cc: Salman Qazi <sqazi@google.com>

Acked-by: Davidlohr Bueso <dbueso@suse.de>

>---
> fs/eventpoll.c | 47 ++++++++++++++++++-----------------------------
> 1 file changed, 18 insertions(+), 29 deletions(-)

Yay for getting rid of some of the callback hell.

Thanks,
Davidlohr
Jason Baron Oct. 18, 2017, 2:03 p.m. UTC | #2
On 10/17/2017 11:37 AM, Davidlohr Bueso wrote:
> On Fri, 13 Oct 2017, Jason Baron wrote:
> 
>> The ep_poll_safewake() function is used to wake up potentially nested
>> epoll file descriptors. The function uses ep_call_nested() to prevent
>> entering the same wakeup queue more than once, and to prevent
>> excessively deep wakeup paths (deeper than EP_MAX_NESTS). However, this
>> is not necessary, since we already prevent these conditions during
>> EPOLL_CTL_ADD. Dropping it saves extra function calls, and avoids
>> taking a global lock during the ep_call_nested() calls.
> 
> This makes sense.
> 
>>
>> I have, however, left ep_call_nested() in place for the
>> CONFIG_DEBUG_LOCK_ALLOC case, since ep_call_nested() keeps track of the
>> nesting level, and this is required by the call to
>> spin_lock_irqsave_nested(). It would be nice to remove the
>> ep_call_nested() calls for the CONFIG_DEBUG_LOCK_ALLOC case as well;
>> however, it's not clear how to simply pass the nesting level through
>> multiple wake_up() levels without more surgery. In any case, I don't
>> think CONFIG_DEBUG_LOCK_ALLOC is generally used in production. This
>> patch also apparently fixes a workload at Google that Salman Qazi
>> reported, by completely removing the poll_safewake_ncalls->lock from
>> wakeup paths.
> 
> I'm a bit curious about the workload (which uses lots of EPOLL_CTL_ADDs) as
> I was tackling the nested epoll scaling issue with loop and readywalk lists
> in mind.
>>

I'm not sure of the details of the workload - perhaps Salman can elaborate
further on it.

It would seem that safewake is potentially the most contended lock in the
nested case, because generally you have a few epoll fds attached to lots of
sources doing wakeups, so those sources are all going to conflict on the
safewake lock. The readywalk is used when performing a 'nested' poll, and
in general it is likely going to be called on only a few epoll fds. That
said, we should remove it too. I will post a patch to remove it.
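
As a side note on what a 'nested' poll is: polling the epoll fd itself goes
through f_op->poll() on the epoll file, which is what the readywalk list
serializes in this era of the kernel. A minimal sketch (illustrative, not
from the thread):

#include <poll.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

int main(void)
{
	int epfd = epoll_create1(0);
	struct pollfd pfd = { .fd = epfd, .events = POLLIN };

	/* poll() on an epoll fd invokes the epoll file's f_op->poll(),
	 * i.e. the path the readywalk list serializes. */
	printf("poll() on the epoll fd returned %d\n", poll(&pfd, 1, 0));

	close(epfd);
	return 0;
}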

The loop lock is used during EPOLL_CTL_ADD to check for loops and deep
wakeup paths, so I would expect it to be less contended, but I wouldn't
doubt there are workloads impacted by it. I think we can potentially remove
this one too, along with the global 'epmutex'. I posted some ideas on it a
while ago:

http://lkml.iu.edu/hypermail//linux/kernel/1501.1/05905.html
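
For context on what that loop check does, a minimal userspace sketch
(illustrative, not from the thread): completing a cycle of epoll fds is
rejected at EPOLL_CTL_ADD time with ELOOP, which is what lets the wakeup
path assume loop-free, bounded-depth nesting.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

int main(void)
{
	int a = epoll_create1(0);
	int b = epoll_create1(0);
	struct epoll_event ev = { .events = EPOLLIN };

	/* a -> b is fine... */
	ev.data.fd = b;
	epoll_ctl(a, EPOLL_CTL_ADD, b, &ev);

	/* ...but b -> a would close a cycle, so the kernel refuses it
	 * here, during EPOLL_CTL_ADD, with ELOOP. */
	ev.data.fd = a;
	if (epoll_ctl(b, EPOLL_CTL_ADD, a, &ev) < 0)
		printf("b -> a rejected: %s\n", strerror(errno));

	close(a);
	close(b);
	return 0;
}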

We can work through these ideas or others...

Thanks,

-Jason


>> Signed-off-by: Jason Baron <jbaron@akamai.com>
>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Salman Qazi <sqazi@google.com>
> 
> Acked-by: Davidlohr Bueso <dbueso@suse.de>
> 
>> ---
>> fs/eventpoll.c | 47 ++++++++++++++++++-----------------------------
>> 1 file changed, 18 insertions(+), 29 deletions(-)
> 
> Yay for getting rid of some of the callback hell.
> 
> Thanks,
> Davidlohr
Davidlohr Bueso Oct. 28, 2017, 2:05 p.m. UTC | #3
On Wed, 18 Oct 2017, Jason Baron wrote:

>http://lkml.iu.edu/hypermail//linux/kernel/1501.1/05905.html
>
>We can work through these ideas or others...

So, unsurprisingly, I got some _really_ good results on the epoll_wait()
benchmark by removing the readywalk list altogether -- something like
4000% in some cases. It's also survived more testing, although I'm still
waiting for customer confirmation on his real workload, which helps
further test the changes (along with this patch to move the safewake
list to debug).

Could you please resend patch 1 in the series above?

Thanks,
Davidlohr
Jason Baron Oct. 31, 2017, 2:14 p.m. UTC | #4
On 10/28/2017 10:05 AM, Davidlohr Bueso wrote:
> On Wed, 18 Oct 2017, Jason Baron wrote:
> 
>> http://lkml.iu.edu/hypermail//linux/kernel/1501.1/05905.html
>>
>> We can work through these ideas or others...
> 
> So, unsurprisingly, I got some _really_ good results on the epoll_wait()
> benchmark by removing the readywalk list altogether -- something like
> 4000% in some cases. It's also survived more testing, although I'm still
> waiting for customer confirmation on his real workload, which helps
> further test the changes (along with this patch to move the safewake
> list to debug).
> 
> Could you please resend patch 1 in the series above?
> 

Hi,

I've resent patch 1 with some cleanups now. Thanks for the testing feedback.

Thanks,

-Jason
Hou Tao Jan. 18, 2018, 11 a.m. UTC | #5
Hi Jason,

On 2017/10/18 22:03, Jason Baron wrote:
> 
> 
> On 10/17/2017 11:37 AM, Davidlohr Bueso wrote:
>> On Fri, 13 Oct 2017, Jason Baron wrote:
>>
>>> The ep_poll_safewake() function is used to wake up potentially nested
>>> epoll file descriptors. The function uses ep_call_nested() to prevent
>>> entering the same wakeup queue more than once, and to prevent
>>> excessively deep wakeup paths (deeper than EP_MAX_NESTS). However,
>>> this is not necessary, since we already prevent these conditions
>>> during EPOLL_CTL_ADD. Dropping it saves extra function calls, and
>>> avoids taking a global lock during the ep_call_nested() calls.
>>
>> This makes sense.
>>
>>>
>>> I have, however, left ep_call_nested() in place for the
>>> CONFIG_DEBUG_LOCK_ALLOC case, since ep_call_nested() keeps track of
>>> the nesting level, and this is required by the call to
>>> spin_lock_irqsave_nested(). It would be nice to remove the
>>> ep_call_nested() calls for the CONFIG_DEBUG_LOCK_ALLOC case as well;
>>> however, it's not clear how to simply pass the nesting level through
>>> multiple wake_up() levels without more surgery. In any case, I don't
>>> think CONFIG_DEBUG_LOCK_ALLOC is generally used in production. This
>>> patch also apparently fixes a workload at Google that Salman Qazi
>>> reported, by completely removing the poll_safewake_ncalls->lock from
>>> wakeup paths.
>>
>> I'm a bit curious about the workload (which uses lots of EPOLL_CTL_ADDs) as
>> I was tackling the nested epoll scaling issue with loop and readywalk lists
>> in mind.
>>>
> 
> I'm not sure of the details of the workload - perhaps Salman can
> elaborate further on it.
> 
> It would seem that safewake is potentially the most contended lock in
> the nested case, because generally you have a few epoll fds attached to
> lots of sources doing wakeups, so those sources are all going to
> conflict on the safewake lock. The readywalk is used when performing a
> 'nested' poll, and in general it is likely going to be called on only a
> few epoll fds. That said, we should remove it too. I will post a patch
> to remove it.
> 
> The loop lock is used during EPOLL_CTL_ADD to check for loops and deep
> wakeup paths, so I would expect it to be less contended, but I wouldn't
> doubt there are workloads impacted by it. I think we can potentially
> remove this one too, along with the global 'epmutex'. I posted some
> ideas on it a while ago:
> 
> http://lkml.iu.edu/hypermail//linux/kernel/1501.1/05905.html
> 
> We can work through these ideas or others...

I have tested these patches on a simple non-nested epoll workload and found
a performance regression (-0.5%, which is better than the -1.0% from my
patch set) under the normal nginx conf, and a performance improvement (+3%,
versus +4% for my patch set) under the one-fd-per-request conf. The reason
for the performance improvement is clear; however, I haven't figured out
the reason for the regression (maybe cache related?).

It seems that your patch set works fine under both normal and nested epoll
workloads, so I don't plan to continue my patch set, which only helps the
normal epoll workload. I have one question about your patch set, regarding
the newly added fields in struct file. I tried to move these fields into
struct eventpoll, so that only epoll fds would be added to a disjoint set
and the target fd would be linked to a disjoint set dynamically, but that
method didn't work out, because there are multiple target fds and the order
in which they are added to a disjoint set can lead to deadlock. So do you
have other ideas for reducing the size of these fields?

Thanks,
Tao

The following lines are the performance results:

baseline: result for v4.14
ep_cc: result for v4.14 + ep_cc patch set
my: result for v4.14 + my patch set

The first 15 columns are the results from wrk, in requests/sec; the last
two columns are their average and standard deviation.

* normal scenario: nginx + wrk

baseline
376704.74 377223.88 378096.67 379484.45 379199.04 379526.21 378008.82 \
379959.98 377634.47 379127.27 377622.92 379442.12 378994.44 376000.08 377046.58 \
378271.4447 1210.673592

ep_cc
373376.08 376897.65 373493.65 376154.44 377374.36 379080.88 375124.44 \
376404.16 376539.28 375141.84 379075.32 377214.39 374524.52 375605.87 372167.8 \
375878.312 1983.190539

my
373067.96 372369 375317.61 375564.66 373841.88 373699.88 376802.91 369988.97 \
376080.45 371580.64 374265.34 376370.34 377033.75 372786.91 375521.61 \
374286.1273 2059.92093


* one-fd-per-request scenario: nginx + wrk
baseline
124509.77 106542.46 118074.46 111434.02 117628.9 108260.56 119037.91 \
114413.75 108706.82 116277.99 111588.4 118216.56 121267.63 110950.4 116455.43 \
114891.004 5168.30288

ep_cc
124209.45 118297.8 120258.1 122785.95 118744.64 118471.76 117621.07 \
117525.14 114797.14 115348.47 117154.02 117208.37 118618.92 118578.8 118945.52 \
118571.01 2437.050182

my
125090.99 121408.92 116444.09 120055.4 119399.24 114938.04 122889.56 \
114363.49 120934.15 119052.03 118060.47 129976.45 121108.08 114849.93 114810.64 \
119558.7653 4338.18401

> Thanks,
> 
> -Jason
> 
> 
>>> Signed-off-by: Jason Baron <jbaron@akamai.com>
>>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Salman Qazi <sqazi@google.com>
>>
>> Acked-by: Davidlohr Bueso <dbueso@suse.de>
>>
>>> ---
>>> fs/eventpoll.c | 47 ++++++++++++++++++-----------------------------
>>> 1 file changed, 18 insertions(+), 29 deletions(-)
>>
>> Yay for getting rid of some of the callback hell.
>>
>> Thanks,
>> Davidlohr
> 
> .
>
Jason Baron Jan. 18, 2018, 9:18 p.m. UTC | #6
On 01/18/2018 06:00 AM, Hou Tao wrote:
> Hi Jason,
> 
> On 2017/10/18 22:03, Jason Baron wrote:
>>
>>
>> On 10/17/2017 11:37 AM, Davidlohr Bueso wrote:
>>> On Fri, 13 Oct 2017, Jason Baron wrote:
>>>
>>>> The ep_poll_safewake() function is used to wake up potentially nested
>>>> epoll file descriptors. The function uses ep_call_nested() to prevent
>>>> entering the same wakeup queue more than once, and to prevent
>>>> excessively deep wakeup paths (deeper than EP_MAX_NESTS). However,
>>>> this is not necessary, since we already prevent these conditions
>>>> during EPOLL_CTL_ADD. Dropping it saves extra function calls, and
>>>> avoids taking a global lock during the ep_call_nested() calls.
>>>
>>> This makes sense.
>>>
>>>>
>>>> I have, however, left ep_call_nested() in place for the
>>>> CONFIG_DEBUG_LOCK_ALLOC case, since ep_call_nested() keeps track of
>>>> the nesting level, and this is required by the call to
>>>> spin_lock_irqsave_nested(). It would be nice to remove the
>>>> ep_call_nested() calls for the CONFIG_DEBUG_LOCK_ALLOC case as well;
>>>> however, it's not clear how to simply pass the nesting level through
>>>> multiple wake_up() levels without more surgery. In any case, I don't
>>>> think CONFIG_DEBUG_LOCK_ALLOC is generally used in production. This
>>>> patch also apparently fixes a workload at Google that Salman Qazi
>>>> reported, by completely removing the poll_safewake_ncalls->lock from
>>>> wakeup paths.
>>>
>>> I'm a bit curious about the workload (which uses lots of EPOLL_CTL_ADDs) as
>>> I was tackling the nested epoll scaling issue with loop and readywalk lists
>>> in mind.
>>>>
>>
>> I'm not sure of the details of the workload - perhaps Salman can
>> elaborate further on it.
>>
>> It would seem that safewake is potentially the most contended lock in
>> the nested case, because generally you have a few epoll fds attached to
>> lots of sources doing wakeups, so those sources are all going to
>> conflict on the safewake lock. The readywalk is used when performing a
>> 'nested' poll, and in general it is likely going to be called on only a
>> few epoll fds. That said, we should remove it too. I will post a patch
>> to remove it.
>>
>> The loop lock is used during EPOLL_CTL_ADD to check for loops and deep
>> wakeup paths, so I would expect it to be less contended, but I wouldn't
>> doubt there are workloads impacted by it. I think we can potentially
>> remove this one too, along with the global 'epmutex'. I posted some
>> ideas on it a while ago:
>>
>> http://lkml.iu.edu/hypermail//linux/kernel/1501.1/05905.html
>>
>> We can work through these ideas or others...
> 
> I have tested these patches on a simple non-nested epoll workload and
> found a performance regression (-0.5%, which is better than the -1.0%
> from my patch set) under the normal nginx conf, and a performance
> improvement (+3%, versus +4% for my patch set) under the
> one-fd-per-request conf. The reason for the performance improvement is
> clear; however, I haven't figured out the reason for the regression
> (maybe cache related?).
> 
> It seems that your patch set works fine under both normal and nested
> epoll workloads, so I don't plan to continue my patch set, which only
> helps the normal epoll workload. I have one question about your patch
> set, regarding the newly added fields in struct file. I tried to move
> these fields into struct eventpoll, so that only epoll fds would be
> added to a disjoint set and the target fd would be linked to a disjoint
> set dynamically, but that method didn't work out, because there are
> multiple target fds and the order in which they are added to a disjoint
> set can lead to deadlock. So do you have other ideas for reducing the
> size of these fields?
> 
> Thanks,
> Tao

Hi,

I agree that increasing the size of struct file is a no-go. One idea is to
have just a single pointer in 'struct file' for epoll usage, pointing to
state that is dynamically allocated when a file is first added to an epoll
set. That would reduce the size of things from where we currently are.
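
A rough sketch of that idea (every name in it, such as epoll_filestate and
f_ep_state, is hypothetical and for illustration only; this is not from an
actual patch):

/* Hypothetical sketch of the "single pointer in struct file" idea.
 * All names are invented for illustration; the real struct file lives
 * in include/linux/fs.h. */
struct list_head { struct list_head *next, *prev; };

/* Per-file epoll state, allocated on the first EPOLL_CTL_ADD that
 * targets this file, so files never touched by epoll pay for only
 * one (NULL) pointer. */
struct epoll_filestate {
	struct list_head ep_links;	/* epitems watching this file */
	/* ... disjoint-set / loop-check bookkeeping ... */
};

struct file {
	/* ... existing fields ... */
	struct epoll_filestate *f_ep_state;	/* NULL until the file is
						 * added to an epoll set;
						 * one pointer instead of
						 * several embedded fields */
};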

I'm also not sure that the target fds (i.e. non-epoll fds) need to be added
to the disjoint sets at all. The sets are there to serialize loop checks,
and the target fds can not be part of a loop.

If there is interest in resurrecting this patchset, I can pick it back
up. I hadn't been pushing it forward because I wasn't convinced it
mattered on real workloads...

Thanks,

-Jason


> 
> The following lines are the performance results:
> 
> baseline: result for v4.14
> ep_cc: result for v4.14 + ep_cc patch set
> my: result for v4.14 + my patch set
> 
> The first 15 columns are the results from wrk, in requests/sec; the last
> two columns are their average and standard deviation.
> 
> * normal scenario: nginx + wrk
> 
> baseline
> 376704.74 377223.88 378096.67 379484.45 379199.04 379526.21 378008.82 \
> 379959.98 377634.47 379127.27 377622.92 379442.12 378994.44 376000.08 377046.58 \
> 378271.4447 1210.673592
> 
> ep_cc
> 373376.08 376897.65 373493.65 376154.44 377374.36 379080.88 375124.44 \
> 376404.16 376539.28 375141.84 379075.32 377214.39 374524.52 375605.87 372167.8 \
> 375878.312 1983.190539
> 
> my
> 373067.96 372369 375317.61 375564.66 373841.88 373699.88 376802.91 369988.97 \
> 376080.45 371580.64 374265.34 376370.34 377033.75 372786.91 375521.61 \
> 374286.1273 2059.92093
> 
> 
> * one-fd-per-request scenario: nginx + wrk
> baseline
> 124509.77 106542.46 118074.46 111434.02 117628.9 108260.56 119037.91 \
> 114413.75 108706.82 116277.99 111588.4 118216.56 121267.63 110950.4 116455.43 \
> 114891.004 5168.30288
> 
> ep_cc
> 124209.45 118297.8 120258.1 122785.95 118744.64 118471.76 117621.07 \
> 117525.14 114797.14 115348.47 117154.02 117208.37 118618.92 118578.8 118945.52 \
> 118571.01 2437.050182
> 
> my
> 125090.99 121408.92 116444.09 120055.4 119399.24 114938.04 122889.56 \
> 114363.49 120934.15 119052.03 118060.47 129976.45 121108.08 114849.93 114810.64 \
> 119558.7653 4338.18401
> 
>> Thanks,
>>
>> -Jason
>>
>>
>>>> Signed-off-by: Jason Baron <jbaron@akamai.com>
>>>> Cc: Alexander Viro <viro@zeniv.linux.org.uk>
>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>> Cc: Salman Qazi <sqazi@google.com>
>>>
>>> Acked-by: Davidlohr Bueso <dbueso@suse.de>
>>>
>>>> ---
>>>> fs/eventpoll.c | 47 ++++++++++++++++++-----------------------------
>>>> 1 file changed, 18 insertions(+), 29 deletions(-)
>>>
>>> Yay for getting rid of some of the callback hell.
>>>
>>> Thanks,
>>> Davidlohr
>>
>> .
>>
>
Patch

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 2fabd19..69acfab 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -276,9 +276,6 @@  static DEFINE_MUTEX(epmutex);
 /* Used to check for epoll file descriptor inclusion loops */
 static struct nested_calls poll_loop_ncalls;
 
-/* Used for safe wake up implementation */
-static struct nested_calls poll_safewake_ncalls;
-
 /* Used to call file's f_op->poll() under the nested calls boundaries */
 static struct nested_calls poll_readywalk_ncalls;
 
@@ -551,40 +548,21 @@  static int ep_call_nested(struct nested_calls *ncalls, int max_nests,
  * this special case of epoll.
  */
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
-static inline void ep_wake_up_nested(wait_queue_head_t *wqueue,
-				     unsigned long events, int subclass)
+
+static struct nested_calls poll_safewake_ncalls;
+
+static int ep_poll_wakeup_proc(void *priv, void *cookie, int call_nests)
 {
 	unsigned long flags;
+	wait_queue_head_t *wqueue = (wait_queue_head_t *)cookie;
 
-	spin_lock_irqsave_nested(&wqueue->lock, flags, subclass);
-	wake_up_locked_poll(wqueue, events);
+	spin_lock_irqsave_nested(&wqueue->lock, flags, call_nests + 1);
+	wake_up_locked_poll(wqueue, POLLIN);
 	spin_unlock_irqrestore(&wqueue->lock, flags);
-}
-#else
-static inline void ep_wake_up_nested(wait_queue_head_t *wqueue,
-				     unsigned long events, int subclass)
-{
-	wake_up_poll(wqueue, events);
-}
-#endif
 
-static int ep_poll_wakeup_proc(void *priv, void *cookie, int call_nests)
-{
-	ep_wake_up_nested((wait_queue_head_t *) cookie, POLLIN,
-			  1 + call_nests);
 	return 0;
 }
 
-/*
- * Perform a safe wake up of the poll wait list. The problem is that
- * with the new callback'd wake up system, it is possible that the
- * poll callback is reentered from inside the call to wake_up() done
- * on the poll wait queue head. The rule is that we cannot reenter the
- * wake up code from the same task more than EP_MAX_NESTS times,
- * and we cannot reenter the same wait queue head at all. This will
- * enable to have a hierarchy of epoll file descriptor of no more than
- * EP_MAX_NESTS deep.
- */
 static void ep_poll_safewake(wait_queue_head_t *wq)
 {
 	int this_cpu = get_cpu();
@@ -595,6 +573,15 @@  static void ep_poll_safewake(wait_queue_head_t *wq)
 	put_cpu();
 }
 
+#else
+
+static void ep_poll_safewake(wait_queue_head_t *wq)
+{
+	wake_up_poll(wq, POLLIN);
+}
+
+#endif
+
 static void ep_remove_wait_queue(struct eppoll_entry *pwq)
 {
 	wait_queue_head_t *whead;
@@ -2315,8 +2302,10 @@  static int __init eventpoll_init(void)
 	 */
 	ep_nested_calls_init(&poll_loop_ncalls);
 
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
 	/* Initialize the structure used to perform safe poll wait head wake ups */
 	ep_nested_calls_init(&poll_safewake_ncalls);
+#endif
 
 	/* Initialize the structure used to perform file's f_op->poll() calls */
 	ep_nested_calls_init(&poll_readywalk_ncalls);