
fs-writeback: drop wb->list_lock during blk_finish_plug()

Message ID CA+55aFzXW7t+1v3tmW2sxn-BLpvZ1_Ye6epiPWBeq70FoaSmFQ@mail.gmail.com (mailing list archive)
State New, archived

Commit Message

Linus Torvalds Sept. 18, 2015, 1:50 a.m. UTC
On Thu, Sep 17, 2015 at 5:37 PM, Dave Chinner <david@fromorbit.com> wrote:
>> >
>> > I'm not seeing why that should be an issue. Sure, there's some CPU
>> > overhead to context switching, but I don't see that it should be that
>> > big of a deal.
>
> It may well change the dispatch order of enough IOs for it to be
> significant on an IO bound device.

Hmm. Maybe. We obviously try to order the IOs a bit by inode, and I
could see the use of a workqueue maybe changing that sufficiently. But
it still sounds a bit unlikely.

And in fact, I think I have a better explanation.

> In outright performance on my test machine, the difference in
> files/s is noise. However, the consistency looks to be substantially
> improved and the context switch rate is now running at under
> 3,000/sec.

Hmm. I don't recall seeing you mention how many context switches per
second you had before. What is it down from?

However, I think I may have found something more interesting here.

The fact is that a *normal* schedule will trigger that whole
blk_schedule_flush_plug(), but a cond_resched() or a cond_resched_lock()
doesn't actually do a normal schedule at all. Those trigger a
*preemption* schedule.

And a preemption schedule does not trigger that unplugging at all.
Why? A kernel "preemption" very much tries to avoid touching thread
state, because the whole point is that normally we may be preempting
threads in random places, so we don't run things like
sched_submit_work(), because the thread may be in the middle of
*creating* that work, and we don't want to introduce races. The
preemption scheduling can also be done with "task->state" set to
sleeping, and it won't actually sleep.
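
[For reference, here is roughly how the two paths look in the v4.3
sources (paraphrased from kernel/sched/core.c from memory, so treat
this as a sketch rather than verbatim code):

        asmlinkage __visible void __sched schedule(void)
        {
                struct task_struct *tsk = current;

                sched_submit_work(tsk);  /* may flush the block plug... */
                do {
                        preempt_disable();
                        __schedule();
                        sched_preempt_enable_no_resched();
                } while (need_resched());
        }

        static inline void sched_submit_work(struct task_struct *tsk)
        {
                /* ...but only if the task is really going to sleep */
                if (!tsk->state || tsk_is_pi_blocked(tsk))
                        return;
                if (blk_needs_flush_plug(tsk))
                        blk_schedule_flush_plug(tsk);
        }

        /*
         * preempt_schedule() and preempt_schedule_common() -- the
         * latter also backs cond_resched() -- call __schedule()
         * directly and never go through sched_submit_work(), so
         * plugged IO stays queued on the preempted task.
         */
]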

Now, for the explicit schedules like "cond_resched()" and
"cond_resched_lock()", those races with obviously don't exist, but
they happen to share the same preemption scheduling logic.

So it turns out that as far as I can see, the whole "cond_resched()"
will not start any IO at all, and it will just be left on the thread
plug until we schedule back to the thread.

So I don't think this has anything to do with kblockd_workqueue. I
don't think it even gets to that point.

I may be missing something, but just to humor me, can you test the
attached patch *without* Chris's patch to do explicit plugging? This
should make cond_resched() and cond_resched_lock() run the unplugging.

It may be entirely broken; I haven't thought this entirely through.

                   Linus

Comments

Dave Chinner Sept. 18, 2015, 5:40 a.m. UTC | #1
On Thu, Sep 17, 2015 at 06:50:29PM -0700, Linus Torvalds wrote:
> On Thu, Sep 17, 2015 at 5:37 PM, Dave Chinner <david@fromorbit.com> wrote:
> >> >
> >> > I'm not seeing why that should be an issue. Sure, there's some CPU
> >> > overhead to context switching, but I don't see that it should be that
> >> > big of a deal.
> >
> > It may well change the dispatch order of enough IOs for it to be
> > significant on an IO bound device.
> 
> Hmm. Maybe. We obviously try to order the IOs a bit by inode, and I
> could see the use of a workqueue maybe changing that sufficiently. But
> it still sounds a bit unlikely.
> 
> And in fact, I think I have a better explanation.
> 
> > In outright performance on my test machine, the difference in
> > files/s is noise. However, the consistency looks to be substantially
> > improved and the context switch rate is now running at under
> > 3,000/sec.
> 
> Hmm. I don't recall seeing you mention how many context switches per
> second you had before. What is it down from?

Around 4000-4500/sec, so there's not a huge amount of context
switches going on. It's not like btrfs, where this workload
generates over 250,000 context switches/sec....

> However, I think I may have found something more interesting here.
> 
> The fact is that a *normal* schedule will trigger that whole
> blk_schedule_flush_plug(), but a cond_resched() or a cond_resched_lock()
> doesn't actually do a normal schedule at all. Those trigger a
> *preemption* schedule.

Ok, makes sense - the plug is not being flushed as we switch away,
but Chris' patch makes it do that.

> So it turns out that as far as I can see, the whole "cond_resched()"
> will not start any IO at all, and it will just be left on the thread
> plug until we schedule back to the thread.
> 
> So I don't think this has anything to do with kblockd_workqueue. I
> don't think it even gets to that point.
> 
> I may be missing something, but just to humor me, can you test the
> attached patch *without* Chris's patch to do explicit plugging? This
> should make cond_resched() and cond_resched_lock() run the unplugging.

Context switches go back to the 4000-4500/sec range. Otherwise
behaviour and performance is indistinguishable from Chris' patch.

Cheers,

Dave.


PS: just hit another "did this just get broken in 4.3-rc1" issue - I
can't run blktrace while there's an IO load because:

$ sudo blktrace -d /dev/vdc
BLKTRACESETUP(2) /dev/vdc failed: 5/Input/output error
Thread 1 failed open /sys/kernel/debug/block/(null)/trace1: 2/No such file or directory
....

[  641.424618] blktrace: page allocation failure: order:5, mode:0x2040d0
[  641.425952] CPU: 13 PID: 11163 Comm: blktrace Not tainted 4.3.0-rc1-dgc+ #430
[  641.427321] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[  641.429066]  0000000000000005 ffff8800b731faa0 ffffffff81785609 00000000002040d0
[  641.430585]  ffff8800b731fb28 ffffffff81183688 0000000000000000 ffffffff824926f8
[  641.432143]  ffff8800b731faf0 0000000000000042 0000000000000010 0000000000000100
[  641.433695] Call Trace:
[  641.434196]  [<ffffffff81785609>] dump_stack+0x4b/0x72
[  641.435219]  [<ffffffff81183688>] warn_alloc_failed+0xd8/0x130
[  641.436402]  [<ffffffff811863bb>] __alloc_pages_nodemask+0x66b/0x7d0
[  641.437684]  [<ffffffff811c0e19>] cache_grow.constprop.61+0xc9/0x2d0
[  641.438933]  [<ffffffff811c1569>] kmem_cache_alloc_trace+0x129/0x400
[  641.440240]  [<ffffffff811424f8>] relay_open+0x68/0x2c0
[  641.441299]  [<ffffffff8115deb1>] do_blk_trace_setup+0x191/0x2d0
[  641.442492]  [<ffffffff8115e041>] blk_trace_setup+0x51/0x90
[  641.443586]  [<ffffffff8115e229>] blk_trace_ioctl+0xb9/0x110
[  641.444702]  [<ffffffff811e586c>] ? mntput_no_expire+0x2c/0x1a0
[  641.446419]  [<ffffffff8176fa58>] blkdev_ioctl+0x528/0x690
[  641.447489]  [<ffffffff811fd37d>] block_ioctl+0x3d/0x50
[  641.449194]  [<ffffffff811d93da>] do_vfs_ioctl+0x2ba/0x490
[  641.450491]  [<ffffffff811d9629>] SyS_ioctl+0x79/0x90
[  641.451575]  [<ffffffff81d800ae>] entry_SYSCALL_64_fastpath+0x12/0x71
[  641.453231] Mem-Info:
[  641.453714] active_anon:11651 inactive_anon:2411 isolated_anon:0
                active_file:8456 inactive_file:2070698 isolated_file:0
                unevictable:1011 dirty:374605 writeback:1176 unstable:0
                slab_reclaimable:1514989 slab_unreclaimable:233692
                mapped:9332 shmem:2250 pagetables:1891 bounce:0
                free:24305 free_pcp:1448 free_cma:0
[  641.461950] DMA free:15908kB min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[  641.470017] lowmem_reserve[]: 0 2960 16005 16005
[  641.471041] DMA32 free:57076kB min:2992kB low:3740kB high:4488kB active_anon:8692kB inactive_anon:2572kB active_file:6224kB inactive_file:1507352kB unevictable:1004kB isolated(anon):0kB isolated(file):0kB present:3129212kB managed:3032864kB mlocked:1004kB dirty:275336kB writeback:896kB mapped:6800kB shmem:1836kB slab_reclaimable:1086292kB slab_unreclaimable:183168kB kernel_stack:23600kB pagetables:1372kB unstable:0kB bounce:0kB free_pcp:3440kB local_pcp:16kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  641.479677] lowmem_reserve[]: 0 0 13044 13044
[  641.480888] Normal free:18040kB min:13188kB low:16484kB high:19780kB active_anon:38264kB inactive_anon:7072kB active_file:27600kB inactive_file:6775516kB unevictable:3040kB isolated(anon):0kB isolated(file):0kB present:13631488kB managed:13357908kB mlocked:3040kB dirty:1223624kB writeback:5144kB mapped:30528kB shmem:7164kB slab_reclaimable:4975584kB slab_unreclaimable:755076kB kernel_stack:44912kB pagetables:6192kB unstable:0kB bounce:0kB free_pcp:3432kB local_pcp:88kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[  641.492392] lowmem_reserve[]: 0 0 0 0
[  641.493170] DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
[  641.496035] DMA32: 6584*4kB (UEM) 3783*8kB (UM) 17*16kB (M) 9*32kB (M) 3*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 57352kB
[  641.498705] Normal: 3372*4kB (UE) 393*8kB (UEM) 37*16kB (UEM) 33*32kB (UEM) 6*64kB (EM) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 18664kB
[  641.501479] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  641.503019] 2082056 total pagecache pages
[  641.503759] 0 pages in swap cache
[  641.504403] Swap cache stats: add 0, delete 0, find 0/0
[  641.505374] Free swap  = 497976kB
[  641.506002] Total swap = 497976kB
[  641.506622] 4194173 pages RAM
[  641.507171] 0 pages HighMem/MovableOnly
[  641.507878] 92503 pages reserved

This is from:

(gdb) l *(relay_open+0x68)
0xffffffff811424f8 is in relay_open (kernel/relay.c:582).
577                     return NULL;
578             if (subbuf_size > UINT_MAX / n_subbufs)
579                     return NULL;
580
581             chan = kzalloc(sizeof(struct rchan), GFP_KERNEL);
582             if (!chan)
583                     return NULL;
584
585             chan->version = RELAYFS_CHANNEL_VERSION;
586             chan->n_subbufs = n_subbufs;

and struct rchan has a member struct rchan_buf *buf[NR_CPUS];
and CONFIG_NR_CPUS=8192, hence the attempt at an order 5 allocation
that fails here....
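
[For the record, the arithmetic behind that order-5 allocation,
assuming 4 KiB pages and 8-byte pointers (struct layout abbreviated):

        struct rchan {
                /* ...a handful of scalar fields... */
                struct rchan_buf *buf[NR_CPUS]; /* 8192 * 8 bytes = 64 KiB */
                /* ...list heads, base_filename[NAME_MAX], etc... */
        };

        /*
         * sizeof(struct rchan) comes out just over 64 KiB, so kzalloc()
         * rounds up to the 128 KiB kmalloc slab:
         * 128 KiB / 4 KiB pages = 32 pages = order 5.
         */
]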

Cheers,

Dave.
Linus Torvalds Sept. 18, 2015, 6:04 a.m. UTC | #2
On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Ok, makes sense - the plug is not being flushed as we switch away,
> but Chris' patch makes it do that.

Yup.

And I actually think Chris' patch is better than the one I sent out
(but maybe the scheduler people should take a look at the behavior of
cond_resched()), I just wanted you to test that to verify the
behavior.

The fact that Chris' patch ends up lowering the context switches
(because it does the unplugging directly) is also an argument for his
approach.

I just wanted to understand the oddity with kblockd_workqueue. And I
think that's solved.

> Context switches go back to the 4000-4500/sec range. Otherwise
> behaviour and performance is indistinguishable from Chris' patch.

.. this was exactly what I wanted to hear. So it sounds like we have
no odd unexplained behavior left in this area.

Which is not to say that there wouldn't be room for improvement, but
it just makes me much happier about the state of these patches to feel
like we understand what was going on.

> PS: just hit another "did this just get broken in 4.3-rc1" issue - I
> can't run blktrace while there's an IO load because:
>
> $ sudo blktrace -d /dev/vdc
> BLKTRACESETUP(2) /dev/vdc failed: 5/Input/output error
> Thread 1 failed open /sys/kernel/debug/block/(null)/trace1: 2/No such file or directory
> ....
>
> [  641.424618] blktrace: page allocation failure: order:5, mode:0x2040d0
> [  641.438933]  [<ffffffff811c1569>] kmem_cache_alloc_trace+0x129/0x400
> [  641.440240]  [<ffffffff811424f8>] relay_open+0x68/0x2c0
> [  641.441299]  [<ffffffff8115deb1>] do_blk_trace_setup+0x191/0x2d0
>
> (gdb) l *(relay_open+0x68)
> 0xffffffff811424f8 is in relay_open (kernel/relay.c:582).
> 577                     return NULL;
> 578             if (subbuf_size > UINT_MAX / n_subbufs)
> 579                     return NULL;
> 580
> 581             chan = kzalloc(sizeof(struct rchan), GFP_KERNEL);
> 582             if (!chan)
> 583                     return NULL;
> 584
> 585             chan->version = RELAYFS_CHANNEL_VERSION;
> 586             chan->n_subbufs = n_subbufs;
>
> and struct rchan has a member struct rchan_buf *buf[NR_CPUS];
> and CONFIG_NR_CPUS=8192, hence the attempt at an order 5 allocation
> that fails here....

Hm. Have you always had MAX_SMP (and the NR_CPU==8192 that it causes)?
From a quick check, none of this code seems to be new.

That said, having that

        struct rchan_buf *buf[NR_CPUS];

in "struct rchan" really is something we should fix. We really should
strive to not allocate things by CONFIG_NR_CPU's, but by the actual
real CPU count.

This looks to be mostly Jens' code, and much of it harkens back to 2006. Jens?

                    Linus
Linus Torvalds Sept. 18, 2015, 6:06 a.m. UTC | #3
Gaah, my mailer autocompleted Jens' email with an old one..

Sorry for the repeat email with the correct address.

On Thu, Sep 17, 2015 at 11:04 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner <david@fromorbit.com> wrote:
>>
>> PS: just hit another "did this just get broken in 4.3-rc1" issue - I
>> can't run blktrace while there's an IO load because:
>>
>> $ sudo blktrace -d /dev/vdc
>> BLKTRACESETUP(2) /dev/vdc failed: 5/Input/output error
>> Thread 1 failed open /sys/kernel/debug/block/(null)/trace1: 2/No such file or directory
>> ....
>>
>> [  641.424618] blktrace: page allocation failure: order:5, mode:0x2040d0
>> [  641.438933]  [<ffffffff811c1569>] kmem_cache_alloc_trace+0x129/0x400
>> [  641.440240]  [<ffffffff811424f8>] relay_open+0x68/0x2c0
>> [  641.441299]  [<ffffffff8115deb1>] do_blk_trace_setup+0x191/0x2d0
>>
>> (gdb) l *(relay_open+0x68)
>> 0xffffffff811424f8 is in relay_open (kernel/relay.c:582).
>> 577                     return NULL;
>> 578             if (subbuf_size > UINT_MAX / n_subbufs)
>> 579                     return NULL;
>> 580
>> 581             chan = kzalloc(sizeof(struct rchan), GFP_KERNEL);
>> 582             if (!chan)
>> 583                     return NULL;
>> 584
>> 585             chan->version = RELAYFS_CHANNEL_VERSION;
>> 586             chan->n_subbufs = n_subbufs;
>>
>> and struct rchan has a member struct rchan_buf *buf[NR_CPUS];
>> and CONFIG_NR_CPUS=8192, hence the attempt at an order 5 allocation
>> that fails here....
>
> Hm. Have you always had MAX_SMP (and the NR_CPU==8192 that it causes)?
> From a quick check, none of this code seems to be new.
>
> That said, having that
>
>         struct rchan_buf *buf[NR_CPUS];
>
> in "struct rchan" really is something we should fix. We really should
> strive to not allocate things by CONFIG_NR_CPU's, but by the actual
> real CPU count.
>
> This looks to be mostly Jens' code, and much of it harkens back to 2006. Jens?
>
>                     Linus
Chris Mason Sept. 18, 2015, 1:16 p.m. UTC | #4
On Thu, Sep 17, 2015 at 11:04:03PM -0700, Linus Torvalds wrote:
> On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Ok, makes sense - the plug is not being flushed as we switch away,
> > but Chris' patch makes it do that.
> 
> Yup.

Huh, that does make much more sense, thanks Linus.  I'm wondering where
else I've assumed that cond_resched() unplugged.

> 
> And I actually think Chris' patch is better than the one I sent out
> (but maybe the scheduler people should take a look at the behavior of
> cond_resched()), I just wanted you to test that to verify the
> behavior.

Ok, I'll fix up the description and comments and send out.

-chris
Jens Axboe Sept. 18, 2015, 2:21 p.m. UTC | #5
On 09/18/2015 12:06 AM, Linus Torvalds wrote:
> Gaah, my mailer autocompleted Jens' email with an old one..
>
> Sorry for the repeat email with the correct address.
>
> On Thu, Sep 17, 2015 at 11:04 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner <david@fromorbit.com> wrote:
>>>
>>> PS: just hit another "did this just get broken in 4.3-rc1" issue - I
>>> can't run blktrace while there's an IO load because:
>>>
>>> $ sudo blktrace -d /dev/vdc
>>> BLKTRACESETUP(2) /dev/vdc failed: 5/Input/output error
>>> Thread 1 failed open /sys/kernel/debug/block/(null)/trace1: 2/No such file or directory
>>> ....
>>>
>>> [  641.424618] blktrace: page allocation failure: order:5, mode:0x2040d0
>>> [  641.438933]  [<ffffffff811c1569>] kmem_cache_alloc_trace+0x129/0x400
>>> [  641.440240]  [<ffffffff811424f8>] relay_open+0x68/0x2c0
>>> [  641.441299]  [<ffffffff8115deb1>] do_blk_trace_setup+0x191/0x2d0
>>>
>>> (gdb) l *(relay_open+0x68)
>>> 0xffffffff811424f8 is in relay_open (kernel/relay.c:582).
>>> 577                     return NULL;
>>> 578             if (subbuf_size > UINT_MAX / n_subbufs)
>>> 579                     return NULL;
>>> 580
>>> 581             chan = kzalloc(sizeof(struct rchan), GFP_KERNEL);
>>> 582             if (!chan)
>>> 583                     return NULL;
>>> 584
>>> 585             chan->version = RELAYFS_CHANNEL_VERSION;
>>> 586             chan->n_subbufs = n_subbufs;
>>>
>>> and struct rchan has a member struct rchan_buf *buf[NR_CPUS];
>>> and CONFIG_NR_CPUS=8192, hence the attempt at an order 5 allocation
>>> that fails here....
>>
>> Hm. Have you always had MAX_SMP (and the NR_CPU==8192 that it causes)?
>> From a quick check, none of this code seems to be new.
>>
>> That said, having that
>>
>>          struct rchan_buf *buf[NR_CPUS];
>>
>> in "struct rchan" really is something we should fix. We really should
>> strive to not allocate things by CONFIG_NR_CPU's, but by the actual
>> real CPU count.
>>
>> This looks to be mostly Jens' code, and much of it harkens back to 2006. Jens?

The relayfs code mostly came out of IBM, but yes, that alloc doesn't
look nice. Not a regression, though; I don't think that has changed in
years. I'll take a stab at fixing this.
Jens Axboe Sept. 18, 2015, 2:23 p.m. UTC | #6
On 09/18/2015 07:16 AM, Chris Mason wrote:
> On Thu, Sep 17, 2015 at 11:04:03PM -0700, Linus Torvalds wrote:
>> On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner <david@fromorbit.com> wrote:
>>>
>>> Ok, makes sense - the plug is not being flushed as we switch away,
>>> but Chris' patch makes it do that.
>>
>> Yup.
>
> Huh, that does make much more sense, thanks Linus.  I'm wondering where
> else I've assumed that cond_resched() unplugged.
>
>>
>> And I actually think Chris' patch is better than the one I sent out
>> (but maybe the scheduler people should take a look at the behavior of
>> cond_resched()), I just wanted you to test that to verify the
>> behavior.

It makes no sense for a preemption schedule to NOT unplug; the fact
that it doesn't is news to me as well. It was never the intent of the
unplug-on-schedule work to NOT unplug for certain schedule-out events;
that seems like very odd behavior.
Linus Torvalds Sept. 18, 2015, 3:32 p.m. UTC | #7
On Fri, Sep 18, 2015 at 7:23 AM, Jens Axboe <axboe@fb.com> wrote:
>
> It makes no sense for a preemption schedule to NOT unplug; the fact that it
> doesn't is news to me as well. It was never the intent of the
> unplug-on-schedule work to NOT unplug for certain schedule-out events;
> that seems like very odd behavior.

Actually, even a *full* schedule doesn't unplug, unless the process is
going to sleep. See sched_submit_work(), which will only call the
unplugging if the process is actually going to sleep (ok, so it's a
bit subtle if you don't know the state rules, but it's the
"!tsk->state" check there).

So preemption and cond_resched() aren't _that_ odd. We've basically
treated a non-sleeping schedule as a no-op for the task work.

The thinking was probably that it might be better to delay starting
the IO in case we get scheduled back quickly, and we're obviously not
actually _sleeping_, so it's likely not too bad.

Now, that's probably bogus, and I think that we should perhaps just
make the rule be that "if we actually switch to another task, we run
blk_schedule_flush_plug()".

But it should be noted that that really *does* introduce a lot of new
potential races. Traditionally, our block layer plugging has been
entirely thread-synchronous, and would never happen asynchronously.
But with preemption, that "switch to another thread" really *does*
happen asynchronously.

So making things always happen on task switch is actually fairly
dangerous, and potentially adds the need for much more synchronization
for the IO submission.

What we possibly *could* make the scheduler rule be:

 - if it's not an actual PREEMPT_ACTIVE (ie in a random place)

 - _and_ we actually switch to another thread

 - _then_ do the whole blk_schedule_flush_plug(tsk) thing.
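
[In pseudocode, that rule might read something like this (purely
illustrative, not a real patch; as the replies below note, the
"actually switch" part is the hard bit, and cond_resched() currently
sets PREEMPT_ACTIVE too):

        /* somewhere in __schedule(), once "next" has been picked */
        if (next != prev &&
            !(preempt_count() & PREEMPT_ACTIVE) &&
            blk_needs_flush_plug(prev))
                /* would run under the runqueue lock; see Peter's
                 * objection below */
                blk_schedule_flush_plug(prev);
]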

adding some scheduler people to the explicit cc list.

That said, the "cond_resched[_lock]()" functions currently always set
PREEMPT_ACTIVE (indirectly - they use preempt_schedule_common()), so
even though those are synchronous, right now they *look* asynchronous
to the scheduler, so we'd still have to sort that out.

Ingo/Peter/Frederic? Comments?

                        Linus
Peter Zijlstra Sept. 18, 2015, 3:59 p.m. UTC | #8
On Fri, Sep 18, 2015 at 08:32:15AM -0700, Linus Torvalds wrote:

> Actually, even a *full* schedule doesn't unplug, unless the process is
> going to sleep. See sched_submit_work(), which will only call the
> unplugging if the process is actually going to sleep (ok, so it's a
> bit subtle if you don't know the state rules, but it's the
> "!tsk->state" check there).
> 
> So preemption and cond_resched() aren't _that_ odd. We've basically
> treated a non-sleeping schedule as a no-op for the task work.
> 
> The thinking was probably that it might be better to delay starting
> the IO in case we get scheduled back quickly, and we're obviously not
> actually _sleeping_, so it's likely not too bad.
> 
> Now, that's probably bogus, and I think that we should perhaps just
> make the rule be that "if we actually switch to another task, we run
> blk_schedule_flush_plug()".

That gets to be a tad tricky. We must either do it while holding
scheduler locks (and I would really want to avoid calling block layer
code with those held) or when the new task is already scheduled (and
that would suck too, because then we account stuff to the wrong task).

> But it should be noted that that really *does* introduce a lot of new
> potential races. Traditionally, our block layer plugging has been
> entirely thread-synchronous, and would never happen asynchronously.
> But with preemption, that "switch to another thread" really *does*
> happen asynchronously.
> 
> So making things always happen on task switch is actually fairly
> dangerous, and potentially adds the need for much more synchronization
> for the IO submission.

Can't say I'm a fan of it :/

> What we possibly *could* make the scheduler rule be:
> 
>  - if it's not an actual PREEMPT_ACTIVE (ie in a random place)

PREEMPT_ACTIVE is actually a recursion flag, 'random place' does not
factor into it. That is, most 'random places' will not have
PREEMPT_ACTIVE set.

>  - _and_ we actually switch to another thread

This is 'hard', as per the above.

>  - _then_ do the whole blk_schedule_flush_plug(tsk) thing.
> 
> adding some scheduler people to the explicit cc list.
> 
> That said, the "cond_resched[_lock]()" functions currently always set
> PREEMPT_ACTIVE (indirectly - they use preempt_schedule_common()), so
> even though those are synchronous, right now they *look* asynchronous
> to the scheduler, so we'd still have to sort that out.

See PREEMPT_ACTIVE being a recursion flag, we set it there so we won't
preempt while we're already scheduling.
Peter Zijlstra Sept. 18, 2015, 4:02 p.m. UTC | #9
On Fri, Sep 18, 2015 at 05:59:56PM +0200, Peter Zijlstra wrote:
> >  - if it's not an actual PREEMPT_ACTIVE (ie in a random place)
> 
> PREEMPT_ACTIVE is actually a recursion flag, 'random place' does not
> factor into it. That is, most 'random places' will not have
> PREEMPT_ACTIVE set.

Ah, _not_ PREEMPT_ACTIVE, it's late and I can't read no moar.
Linus Torvalds Sept. 18, 2015, 4:12 p.m. UTC | #10
On Fri, Sep 18, 2015 at 8:59 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> See PREEMPT_ACTIVE being a recursion flag, we set it there so we won't
> preempt while we're already scheduling.

PREEMPT_ACTIVE does more than that. It really is a sign that "this is
not synchronous". It causes the scheduler to ignore the current task
flags (because it might already be TASK_SLEEPING, but we aren't
_actually_ ready to sleep yet) etc.

So no. It's not "you can't be preempted during scheduling". That's the
*normal* preempt count, and all scheduling calls end up setting that
some way (ie "schedule()" just does preempt_disable()).

So I disagree with your notion that it's a recursion flag. It is
absolutely nothing of the sort. It gets set by preemption - and,
somewhat illogically, by cond_resched().

The fact that cond_resched() sets it is *probably* because some of the
callers end up calling it from page fault paths etc, and the same
"ignore TASK_SLEEPING etc" rules apply. But it does mean that
"cond_resched()" is a bit misleaning as a name. It's really a
"cond_preempt()".

                 Linus
Dave Chinner Sept. 18, 2015, 10:17 p.m. UTC | #11
On Thu, Sep 17, 2015 at 11:04:03PM -0700, Linus Torvalds wrote:
> On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Ok, makes sense - the plug is not being flushed as we switch away,
> > but Chris' patch makes it do that.
> 
> Yup.
> 
> And I actually think Chris' patch is better than the one I sent out
> (but maybe the scheduler people should take a look at the behavior of
> cond_resched()), I just wanted you to test that to verify the
> behavior.
> 
> The fact that Chris' patch ends up lowering the context switches
> (because it does the unplugging directly) is also an argument for his
> approach.
> 
> I just wanted to understand the oddity with kblockd_workqueue. And I
> think that's solved.
> 
> > Context switches go back to the 4000-4500/sec range. Otherwise
> > behaviour and performance is indistinguishable from Chris' patch.
> 
> .. this was exactly what I wanted to hear. So it sounds like we have
> no odd unexplained behavior left in this area.
> 
> Which is not to say that there wouldn't be room for improvement, but
> it just makes me much happier about the state of these patches to feel
> like we understand what was going on.
> 
> > PS: just hit another "did this just get broken in 4.3-rc1" issue - I
> > can't run blktrace while there's an IO load because:
> >
> > $ sudo blktrace -d /dev/vdc
> > BLKTRACESETUP(2) /dev/vdc failed: 5/Input/output error
> > Thread 1 failed open /sys/kernel/debug/block/(null)/trace1: 2/No such file or directory
> > ....
> >
> > [  641.424618] blktrace: page allocation failure: order:5, mode:0x2040d0
> > [  641.438933]  [<ffffffff811c1569>] kmem_cache_alloc_trace+0x129/0x400
> > [  641.440240]  [<ffffffff811424f8>] relay_open+0x68/0x2c0
> > [  641.441299]  [<ffffffff8115deb1>] do_blk_trace_setup+0x191/0x2d0
> >
> > (gdb) l *(relay_open+0x68)
> > 0xffffffff811424f8 is in relay_open (kernel/relay.c:582).
> > 577                     return NULL;
> > 578             if (subbuf_size > UINT_MAX / n_subbufs)
> > 579                     return NULL;
> > 580
> > 581             chan = kzalloc(sizeof(struct rchan), GFP_KERNEL);
> > 582             if (!chan)
> > 583                     return NULL;
> > 584
> > 585             chan->version = RELAYFS_CHANNEL_VERSION;
> > 586             chan->n_subbufs = n_subbufs;
> >
> > and struct rchan has a member struct rchan_buf *buf[NR_CPUS];
> > and CONFIG_NR_CPUS=8192, hence the attempt at an order 5 allocation
> > that fails here....
> 
> Hm. Have you always had MAX_SMP (and the NR_CPU==8192 that it causes)?
> From a quick check, none of this code seems to be new.

Yes, I always build MAX_SMP kernels for testing, because XFS is
often used on such machines and so I want to find issues exactly
like this in my testing rather than on customer machines... :/

> That said, having that
> 
>         struct rchan_buf *buf[NR_CPUS];
> 
> in "struct rchan" really is something we should fix. We really should
> strive to not allocate things by CONFIG_NR_CPU's, but by the actual
> real CPU count.

*nod*. But it doesn't fix the problem of the memory allocation
failing when there's still gigabytes of immediately reclaimable
memory available in the page cache. If this is failing under page
cache memory pressure, then we're going to be doing an awful lot
more falling back to vmalloc in the filesystem code where large
allocations like this are done, e.g. extended attribute buffers are
order-5 and are used a lot when doing things like backups, which tend
to also produce significant page cache memory pressure.
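
[The usual fallback pattern for those large filesystem buffers looks
something like this (a sketch in the spirit of XFS's
kmem_zalloc_large(); the helper name is illustrative):

        #include <linux/slab.h>
        #include <linux/vmalloc.h>

        static void *zalloc_large(size_t size)
        {
                /* Try the physically contiguous, high-order allocation
                 * first, without warning on failure. */
                void *p = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);

                if (p)
                        return p;
                /* High-order allocation failed: fall back to
                 * virtually contiguous order-0 pages. */
                return vzalloc(size);
        }
]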

Hence I'm tending towards there being a memory reclaim behaviour
regression, not so much worrying about whether this specific
allocation is optimal or not.

Cheers,

Dave.
Jan Kara Sept. 21, 2015, 9:24 a.m. UTC | #12
On Sat 19-09-15 08:17:14, Dave Chinner wrote:
> On Thu, Sep 17, 2015 at 11:04:03PM -0700, Linus Torvalds wrote:
> > On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > PS: just hit another "did this just get broken in 4.3-rc1" issue - I
> > > can't run blktrace while there's an IO load because:
> > >
> > > $ sudo blktrace -d /dev/vdc
> > > BLKTRACESETUP(2) /dev/vdc failed: 5/Input/output error
> > > Thread 1 failed open /sys/kernel/debug/block/(null)/trace1: 2/No such file or directory
> > > ....
> > >
> > > [  641.424618] blktrace: page allocation failure: order:5, mode:0x2040d0
> > > [  641.438933]  [<ffffffff811c1569>] kmem_cache_alloc_trace+0x129/0x400
> > > [  641.440240]  [<ffffffff811424f8>] relay_open+0x68/0x2c0
> > > [  641.441299]  [<ffffffff8115deb1>] do_blk_trace_setup+0x191/0x2d0
> > >
> > > (gdb) l *(relay_open+0x68)
> > > 0xffffffff811424f8 is in relay_open (kernel/relay.c:582).
> > > 577                     return NULL;
> > > 578             if (subbuf_size > UINT_MAX / n_subbufs)
> > > 579                     return NULL;
> > > 580
> > > 581             chan = kzalloc(sizeof(struct rchan), GFP_KERNEL);
> > > 582             if (!chan)
> > > 583                     return NULL;
> > > 584
> > > 585             chan->version = RELAYFS_CHANNEL_VERSION;
> > > 586             chan->n_subbufs = n_subbufs;
> > >
> > > and struct rchan has a member struct rchan_buf *buf[NR_CPUS];
> > > and CONFIG_NR_CPUS=8192, hence the attempt at an order 5 allocation
> > > that fails here....
> > 
> > Hm. Have you always had MAX_SMP (and the NR_CPU==8192 that it causes)?
> > From a quick check, none of this code seems to be new.
> 
> Yes, I always build MAX_SMP kernels for testing, because XFS is
> often used on such machines and so I want to find issues exactly
> like this in my testing rather than on customer machines... :/
> 
> > That said, having that
> > 
> >         struct rchan_buf *buf[NR_CPUS];
> > 
> > in "struct rchan" really is something we should fix. We really should
> > strive to not allocate things by CONFIG_NR_CPU's, but by the actual
> > real CPU count.
> 
> *nod*. But it doesn't fix the problem of the memory allocation
> failing when there's still gigabytes of immediately reclaimable
> memory available in the page cache. If this is failing under page
> cache memory pressure, then we're going to be doing an awful lot
> more falling back to vmalloc in the filesystem code where large
> allocations like this are done e.g. extended attribute buffers are
> order-5, and used a lot when doing things like backups which tend to
> also produce significant page cache memory pressure.
> 
> Hence I'm tending towards there being a memory reclaim behaviour
> regression, not so much worrying about whether this specific
> allocation is optimal or not.

Yup, looks like a regression in reclaim. Added linux-mm folks to CC.

								Honza
Andrew Morton Sept. 21, 2015, 8:21 p.m. UTC | #13
On Mon, 21 Sep 2015 11:24:29 +0200 Jan Kara <jack@suse.cz> wrote:

> On Sat 19-09-15 08:17:14, Dave Chinner wrote:
> > On Thu, Sep 17, 2015 at 11:04:03PM -0700, Linus Torvalds wrote:
> > > On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner <david@fromorbit.com> wrote:
> > > > PS: just hit another "did this just get broken in 4.3-rc1" issue - I
> > > > can't run blktrace while there's an IO load because:
> > > >
> > > > $ sudo blktrace -d /dev/vdc
> > > > BLKTRACESETUP(2) /dev/vdc failed: 5/Input/output error
> > > > Thread 1 failed open /sys/kernel/debug/block/(null)/trace1: 2/No such file or directory
> > > > ....
> > > >
> > > > [  641.424618] blktrace: page allocation failure: order:5, mode:0x2040d0
> > > > [  641.438933]  [<ffffffff811c1569>] kmem_cache_alloc_trace+0x129/0x400
> > > > [  641.440240]  [<ffffffff811424f8>] relay_open+0x68/0x2c0
> > > > [  641.441299]  [<ffffffff8115deb1>] do_blk_trace_setup+0x191/0x2d0
> > > >
> > > > (gdb) l *(relay_open+0x68)
> > > > 0xffffffff811424f8 is in relay_open (kernel/relay.c:582).
> > > > 577                     return NULL;
> > > > 578             if (subbuf_size > UINT_MAX / n_subbufs)
> > > > 579                     return NULL;
> > > > 580
> > > > 581             chan = kzalloc(sizeof(struct rchan), GFP_KERNEL);
> > > > 582             if (!chan)
> > > > 583                     return NULL;
> > > > 584
> > > > 585             chan->version = RELAYFS_CHANNEL_VERSION;
> > > > 586             chan->n_subbufs = n_subbufs;
> > > >
> > > > and struct rchan has a member struct rchan_buf *buf[NR_CPUS];
> > > > and CONFIG_NR_CPUS=8192, hence the attempt at an order 5 allocation
> > > > that fails here....
> > > 
> > > Hm. Have you always had MAX_SMP (and the NR_CPU==8192 that it causes)?
> > > From a quick check, none of this code seems to be new.
> > 
> > Yes, I always build MAX_SMP kernels for testing, because XFS is
> > often used on such machines and so I want to find issues exactly
> > like this in my testing rather than on customer machines... :/
> > 
> > > That said, having that
> > > 
> > >         struct rchan_buf *buf[NR_CPUS];
> > > 
> > > in "struct rchan" really is something we should fix. We really should
> > > strive to not allocate things by CONFIG_NR_CPU's, but by the actual
> > > real CPU count.
> > 
> > *nod*. But it doesn't fix the problem of the memory allocation
> > failing when there's still gigabytes of immediately reclaimable
> > memory available in the page cache. If this is failing under page
> > cache memory pressure, then we're going to be doing an awful lot
> > more falling back to vmalloc in the filesystem code where large
> > allocations like this are done e.g. extended attribute buffers are
> > order-5, and used a lot when doing things like backups which tend to
> > also produce significant page cache memory pressure.
> > 
> > Hence I'm tending towards there being a memory reclaim behaviour
> > regression, not so much worrying about whether this specific
> > allocation is optimal or not.
> 
> Yup, looks like a regression in reclaim. Added linux-mm folks to CC.

That's going to be hard to find.  Possibly Vlastimil's 5-patch series
"mm, compaction: more robust check for scanners meeting", possibly
Joonsoo's "mm/compaction: correct to flush migrated pages if pageblock
skip happens".  But probably something else :(

Teach relay.c about alloc_percpu()?
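
[A sketch of what that conversion could look like (illustrative only;
every chan->buf[cpu] user in kernel/relay.c would need updating to
match, and the error label is hypothetical):

        struct rchan {
                /* ... */
                struct rchan_buf * __percpu *buf; /* was: *buf[NR_CPUS] */
        };

        chan->buf = alloc_percpu(struct rchan_buf *);
        if (!chan->buf)
                goto free_chan;         /* hypothetical error label */

        /* per-CPU access then becomes: */
        struct rchan_buf *rbuf = *per_cpu_ptr(chan->buf, cpu);
]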
Peter Zijlstra Sept. 28, 2015, 2:47 p.m. UTC | #14
On Fri, Sep 18, 2015 at 09:12:38AM -0700, Linus Torvalds wrote:
> 
> So I disagree with your notion that it's a recursion flag. It is
> absolutely nothing of the sort.

OK, agreed. I had it classed under recursion in my head, clearly I
indexed it sloppily. In any case I have a patch that kills off
PREEMPT_ACTIVE entirely.

I just have to clean it up, benchmark, split and write changelogs. But
it should be forthcoming 'soon'. As is, it boots...

> It gets set by preemption - and,
> somewhat illogically, by cond_resched().

I suspect that was done to make cond_resched() (voluntary preemption)
more robust and only have a single preemption path/logic. But all that
was done well before I got involved.
Linus Torvalds Sept. 28, 2015, 4:08 p.m. UTC | #15
On Mon, Sep 28, 2015 at 10:47 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
>> It gets set by preemption - and,
>> somewhat illogically, by cond_resched().
>
> I suspect that was done to make cond_resched() (voluntary preemption)
> more robust and only have a single preemption path/logic. But all that
> was done well before I got involved.

So I think it's actually the name that is bad, not necessarily the behavior.

We tend to put "cond_resched()" (and particularly
"cond_resched_lock()") in some fairly awkward places, and it's not
always entirely clear that task->state == TASK_RUNNING there.

So the preemptive behavior of not *really* putting the task to sleep
may actually be the right one. But it is rather non-intuitive given
the name - because "cond_resched()" basically is not at all equivalent
to "if (need_resched()) schedule()", which you'd kind of expect.

An explicit schedule will actually act on the task->state, and make us
go to sleep. "cond_resched()"  really is just a "voluntary preemption
point". And I think it would be better if it got named that way.

            Linus
Ingo Molnar Sept. 29, 2015, 7:55 a.m. UTC | #16
* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, Sep 28, 2015 at 10:47 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> >> It gets set by preemption - and,
> >> somewhat illogically, by cond_resched().
> >
> > I suspect that was done to make cond_resched() (voluntary preemption)
> > more robust and only have a single preemption path/logic. But all that
> > was done well before I got involved.
> 
> So I think it's actually the name that is bad, not necessarily the behavior.
> 
> We tend to put "cond_resched()" (and particularly
> "cond_resched_lock()") in some fairly awkward places, and it's not
> always entirely clear that task->state == TASK_RUNNING there.
> 
> So the preemptive behavior of not *really* putting the task to sleep
> may actually be the right one. But it is rather non-intuitive given
> the name - because "cond_resched()" basically is not at all equivalent
> to "if (need_resched()) schedule()", which you'd kind of expect.
> 
> An explicit schedule will actually act on the task->state, and make us
> go to sleep. "cond_resched()"  really is just a "voluntary preemption
> point". And I think it would be better if it got named that way.

cond_preempt() perhaps? That would allude to preempt_schedule() and such, and 
would make it clearer that it's supposed to be an invariant on the sleep state 
(which schedule() is not).

Thanks,

	Ingo

Patch
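
(This is the test patch from Linus's first message: it makes
_cond_resched() and __cond_resched_lock() call sched_submit_work()
before preempt_schedule_common(), so plugged IO gets flushed at these
voluntary preemption points as well.)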

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 97d276ff1edb..388ea9e7ab8a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4548,6 +4548,7 @@  SYSCALL_DEFINE0(sched_yield)
 int __sched _cond_resched(void)
 {
 	if (should_resched(0)) {
+		sched_submit_work(current);
 		preempt_schedule_common();
 		return 1;
 	}
@@ -4572,9 +4573,10 @@  int __cond_resched_lock(spinlock_t *lock)
 
 	if (spin_needbreak(lock) || resched) {
 		spin_unlock(lock);
-		if (resched)
+		if (resched) {
+			sched_submit_work(current);
 			preempt_schedule_common();
-		else
+		} else
 			cpu_relax();
 		ret = 1;
 		spin_lock(lock);