
[GIT,PULL] Block fixes for 4.14-rc2

Message ID 20170924130304.GA21904@infradead.org (mailing list archive)
State New, archived

Commit Message

Christoph Hellwig Sept. 24, 2017, 1:03 p.m. UTC
On Fri, Sep 22, 2017 at 05:18:49PM -1000, Linus Torvalds wrote:
> WTF? Why is this so hard? It used to be that IDE drove people crazy.
> Now it's NVMe and generic block layer stuff.

Can you please explain what problems you have with the nvme patches?
They are tiny and fix either user-reported bugs, or fix the FC transport
to use the right error code and sgl type constants per the spec.

That's exactly what we want to get into a tree now instead of waiting
another 6 weeks.

The diffstat and diff itself are below.

And the block stuff looks similar to me, but I haven't looked at the
details.

 host/core.c          |    9 +---
 host/fabrics.c       |   18 ++++----
 host/fc.c            |   21 +++++-----
 host/pci.c           |   34 +++++++++-------
 host/rdma.c          |    9 +++-
 target/core.c        |    9 ++--
 target/fabrics-cmd.c |    9 +++-
 target/fc.c          |   24 ++++++-----
 target/fcloop.c      |  104 +++++++++++++++++++--------------------------------
 target/nvmet.h       |    1
 10 files changed, 117 insertions(+), 121 deletions(-)

Comments

Linus Torvalds Sept. 24, 2017, 5:34 p.m. UTC | #1
On Sun, Sep 24, 2017 at 6:03 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Fri, Sep 22, 2017 at 05:18:49PM -1000, Linus Torvalds wrote:
>> WTF? Why is this so hard? It used to be that IDE drove people crazy.
>> Now it's NVMe and generic block layer stuff.
>
> Can you please explain what problems you have with the nvme patches?

This time it wasn't the NVMe parts (that was the last few releases).
This time it was the random writeback patches that had been committed
only days before, and that completely change some fundamental code.
Some of the commits say "this changes no semantics", but that's after
the previous commit just changed the argument values exactly so that
the next commit wouldn't change semantics.

Don't get me wrong - all the commits look perfectly _sane_.  That's
not my issue. But these commits literally showed up on lkml just a
couple of days before getting sent to me, and even when they were sent
the cover letter for the series of six patches was literally

   More graceful flusher thread memory reclaim wakeup

which *really* does not say "this is an important bug fix or regression" to me.

What made it ok to send that outside the merge window? It looks like a
nice cleanup, no question about it, and it probably fixes some
behavioral problem at FB. But that behavioral problem is not *new*, as
far as I can tell.

So why was it sent as a bug-fix pull request?

Now, if this was some other subsystem that _hadn't_ had problems in
EVERY SINGLE recent merge window, I'd likely have been more than happy
to take it early in the rc series because that subsystem wasn't on my
list of "increased scrutiny due to historical issues".

But as it is, the block layer _is_ on my list of increased scrutiny
(and part of that reason is NVMe patches - again, "fixes" being sent
that were really just normal on-going development)

                            Linus
Jens Axboe Sept. 25, 2017, 12:03 a.m. UTC | #2
On 09/24/2017 11:34 AM, Linus Torvalds wrote:
> On Sun, Sep 24, 2017 at 6:03 AM, Christoph Hellwig <hch@infradead.org> wrote:
>> On Fri, Sep 22, 2017 at 05:18:49PM -1000, Linus Torvalds wrote:
>>> WTF? Why is this so hard? It used to be that IDE drove people crazy.
>>> Now it's NVMe and generic block layer stuff.
>>
>> Can you please explain what problems you have with the nvme patches?
> 
> This time it wasn't the NVMe parts (that was the last few releases).
> This time it was the random writeback patches that had been committed
> only days before, and that completely change some fundamental code.
> Some of the commits say "this changes no semantics", but that's after
> the previous commit just changed the argument values exactly so that
> the next commit wouldn't change semantics.
> 
> Don't get me wrong - all the commits look perfectly _sane_.  That's
> not my issue. But these commits literally showed up on lkml just a
> couple of days before getting sent to me, and even when they were sent
> the cover letter for the series of six patches was literally
> 
>    More graceful flusher thread memory reclaim wakeup
> 
> which *really* does not say "this is an important bug fix or regression" to me.
> 
> What made it ok to send that outside the merge window? It looks like a
> nice cleanup, no question about it, and it probably fixes some
> behavioral problem at FB. But that behavioral problem is not *new*, as
> far as I can tell.
> 
> So why was it sent as a bug-fix pull request?
> 
> Now, if this was some other subsystem that _hadn't_ had problems in
> EVERY SINGLE recent merge window, I'd likely have been more than happy
> to take it early in the rc series because that subsystem wasn't on my
> list of "increased scrutiny due to historical issues".
> 
> But as it is, the block layer _is_ on my list of increased scrutiny
> (and part of that reason is NVMe patches - again, "fixes" being sent
> that were really just normal on-going development)

Sorry for the late response, been moving this weekend and tending to
things more important than arguing on the Internet.

NVMe fixes have sometimes been accepted for the current series, while
they should have been pushed. I'm not going to argue with that, but I
think we've managed to make that better the last few series. It's still
not squeaky clean, but it is improving substantially.  I don't believe
that's even been the case for core changes, which have always been
scrutinized much more heavily.

I knew you might object to the writeback changes. As mentioned in the
pull request, if you diff the whole thing, it's really not big at all.
It's mostly shuffling some arguments and things around, to make the
final patch trivial. And it does mean that it's super easy to review.
In the original review series, we are hitting this in production, and
have been for a while. So while it's not a regression for this series,
it's something that causes us quite a bit of pain. When attempting to
free memory ends up gobbling up tons of extra memory AND causing the
flusher threads to waste tons of CPU, then that's a big problem. Big
enough that we get softlockups from just attempting to process these
identical start-all-writeback requests.

That's why I sent it in for this series. Yes, it's not months old code,
but it has been tested. And it's not like it's a lot of NEW code, it's
mostly removing code, or moving things around.

I can resend without this stuff, but I think it'd be a shame to release
4.14 without it. Let me know, and I'll respin and send a new pull
request.
Linus Torvalds Sept. 25, 2017, 3:16 a.m. UTC | #3
On Sun, Sep 24, 2017 at 5:03 PM, Jens Axboe <axboe@kernel.dk> wrote:
>
> NVMe fixes have sometimes been accepted for the current series, while
> they should have been pushed. I'm not going to argue with that, but I
> think we've managed to make that better the last few series. It's still
> not squeaky clean, but it is improving substantially.  I don't believe
> that's even been the case for core changes, which have always been
> scrutinized much more heavily.

The NVMe fixes actually looked like real fixes. It's not what I
reacted to, I would have happily pulled those.

The comment about "NVMe and generic block layer" was about these areas
in general having been problem spots, and not the particular NVMe
patches in this pull request.

> I knew you might object to the writeback changes. As mentioned in the
> pull request, if you diff the whole thing, it's really not big at all.

Oh, it wasn't horribly big, and I agree that it was easy to read
through, but it really was entirely new code and really should have
been a merge window issue, not in a "fixes" pull.

And I did check the history of those patches, and it was all very very
recent and had recent comments on it too - it wasn't some old patches
that had been around for a while.

And while the patches were small, their effect was decidedly *not* obvious.

For example, while the patches were easy to read, one of them changed
the "try to free some memory" behavior fairly radically, going from
write out "at most 1024 pages" to "everything".

That may be a simple one-liner patch to read, but it sure as hell
wasn't obvious what the behavioral change really is under different
loads.

And when the patch is two days old, I can pretty much guarantee that
it hasn't been tested under a huge variety of loads. Maybe it does
exactly the right thing for *some* load, but...

It honestly looked like that patch existed purely in order to make the
next few patches be trivial ("no functional changes in this patch")
simply because the functional changes were done earlier.

That's all actually very nice for bisecting etc, and honestly, I liked
how the patch series looked and was split out, but I liked how it
looked as a *development* series.

It didn't make me go "this is a bug fix".

> It's mostly shuffling some arguments and things around, to make the
> final patch trivial. And it does mean that it's super easy to review.

See above. I think it was "super easy to review" only on a very
superficial level, not on a deeper one.

For example, I doubt you actually went back and looked at where that
"1024" came from that you just changed to zero?

It comes from the bitkeeper days, commit 407ee6c87e ("[PATCH]
low-latency page reclaim") in the historical tree:

    [PATCH] low-latency page reclaim

    Convert the VM to not wait on other people's dirty data.

     - If we find a dirty page and its queue is not congested, do some
writeback.

     - If we find a dirty page and its queue _is_ congested then just
       refile the page.

     - If we find a PageWriteback page then just refile the page.

     - There is additional throttling for write(2) callers.  Within
       generic_file_write(), record their backing queue in ->current.
       Within page reclaim, if this tasks encounters a page which is dirty
       or under writeback onthis queue, block on it.  This gives some more
       writer throttling and reduces the page refiling frequency.

    It's somewhat CPU expensive - under really heavy load we only get a 50%
    reclaim rate in pages coming off the tail of the LRU.  This can be
    fixed by splitting the inactive list into reclaimable and
    non-reclaimable lists.  But the CPU load isn't too bad, and latency is
    much, much more important in these situations.

    Example: with `mem=512m', running 4 instances of `dbench 100', 2.5.34
    took 35 minutes to compile a kernel.  With this patch, it took three
    minutes, 45 seconds.

    I haven't done swapcache or MAP_SHARED pages yet.  If there's tons of
    dirty swapcache or mmap data around we still stall heavily in page
    reclaim.  That's less important.

    This patch also has a tweak for swapless machines: don't even bother
    bringing anon pages onto the inactive list if there is no swap online.

which introduced that "long nr_pages" argument to wakeup_bdflush().

Here's that patch for people who don't keep that historical tree
around like I do:

    https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=407ee6c87e477434e3cb8be96885ed27b5539b6f

Now, it's entirely possible that the issue that made us do that in the
first place is long long gone, and we don't need that page limiter in
the VM any more.

But is it "super easy to review"?

No, it sure as hell is not.

> That's why I sent it in for this series. Yes, it's not months old code,
> but it has been tested. And it's not like it's a lot of NEW code, it's
> mostly removing code, or moving things around.
>
> I can resend without this stuff, but I think it'd be a shame to release
> 4.14 without it. Let me know, and I'll respin and send a new pull
> request.

So in order to make sane progress, I would suggest:

 - send the actual obvious *fixes* in separately as a branch of
unquestionable fixes. If it fixes an Oops, or is a security fix, or
makes something not work, it's a hard fix.

 - the writeback thing may make sense, but it really is a performance
tweak for a particular load at FB, and it should be seen as such. It's
not obviously a fix, and it has some really subtle issues. See above.

So I could still take the writeback series too. But I absolutely
*refuse* to take that as part of a "bugfix" pull, as if it was somehow
obviously a fix. Because it damn well is not at all obviously a fix.

So send that series *separately* as an independent branch, so that it's
obvious that "hey, this is something else that would help some FB
loads, we'd be super happy to have it in 4.14 due to it being a LTS
release".

And then I might just decide "ok, the code looks ok, I'm willing to
risk it, and if it causes problems for somebody, it came in as a
separate merge that could trivially be reverted".

See what I'm saying?

                    Linus
Jens Axboe Sept. 25, 2017, 2:46 p.m. UTC | #4
On 09/24/2017 09:16 PM, Linus Torvalds wrote:
>> I knew you might object to the writeback changes. As mentioned in the
>> pull request, if you diff the whole thing, it's really not big at all.
> 
> Oh, it wasn't horribly big, and I agree that it was easy to read
> through, but it really was entirely new code and really should have
> been a merge window issue, not in a "fixes" pull.
> 
> And I did check the history of those patches, and it was all very very
> recent and had recent comments on it too - it wasn't some old patches
> that had been around for a while.
> 
> And while the patches were small, their effect was decidedly *not*
> obvious.
> 
> For example, while the patches were easy to read, one of them changed
> the "try to free some memory" behavior fairly radically, going from
> write out "at most 1024 pages" to "everything".
> 
> That may be a simple one-liner patch to read, but it sure as hell
> wasn't obvious what the behavioral change really is under different
> loads.

But still better than hiding it deep in some other patch... It's a
one-liner patch. It may not be trivial to review, but it puts the detail
that you care about right in your face.

> And when the patch is two days old, I can pretty much guarantee that
> it hasn't been tested under a huge variety of loads. Maybe it does
> exactly the right thing for *some* load, but...
> 
> It honestly looked like that patch existed purely in order to make the
> next few patches be trivial ("no functional changes in this patch")
> simply because the functional changes were done earlier.

See below for discussion on that bit.

>> It's mostly shuffling some arguments and things around, to make the
>> final patch trivial. And it does mean that it's super easy to review.
> 
> See above. I think it was "super easy to review" only on a very
> superficial level, not on a deeper one.
> 
> For example, I doubt you actually went back and looked at where that
> "1024" came from that you just changed to zero?
> 
> It comes from the bitkeeper days, commit 407ee6c87e ("[PATCH]
> low-latency page reclaim") in the historical tree:
> 
>     [PATCH] low-latency page reclaim
> 
>     Convert the VM to not wait on other people's dirty data.
> 
>      - If we find a dirty page and its queue is not congested, do some
> writeback.
> 
>      - If we find a dirty page and its queue _is_ congested then just
>        refile the page.
> 
>      - If we find a PageWriteback page then just refile the page.
> 
>      - There is additional throttling for write(2) callers.  Within
>        generic_file_write(), record their backing queue in ->current.
>        Within page reclaim, if this tasks encounters a page which is dirty
>        or under writeback onthis queue, block on it.  This gives some more
>        writer throttling and reduces the page refiling frequency.
> 
>     It's somewhat CPU expensive - under really heavy load we only get a 50%
>     reclaim rate in pages coming off the tail of the LRU.  This can be
>     fixed by splitting the inactive list into reclaimable and
>     non-reclaimable lists.  But the CPU load isn't too bad, and latency is
>     much, much more important in these situations.
> 
>     Example: with `mem=512m', running 4 instances of `dbench 100', 2.5.34
>     took 35 minutes to compile a kernel.  With this patch, it took three
>     minutes, 45 seconds.
> 
>     I haven't done swapcache or MAP_SHARED pages yet.  If there's tons of
>     dirty swapcache or mmap data around we still stall heavily in page
>     reclaim.  That's less important.
> 
>     This patch also has a tweak for swapless machines: don't even bother
>     bringing anon pages onto the inactive list if there is no swap online.
> 
> which introduced that "long nr_pages" argument to wakeup_bdflush().
> 
> Here's that patch for people who don't keep that historical tree
> around like I do:
> 
>     https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=407ee6c87e477434e3cb8be96885ed27b5539b6f
> 
> Now, it's entirely possible that the issue that made us do that in the
> first place is long long gone, and we don't need that page limiter in
> the VM any more.
> 
> But is it "super easy to review"?
> 
> No, it sure as hell is not.

Honestly, what I really wanted to do was kill that call completely. If
you read the free_more_memory() function, it looks like crap. You have
some random call to starting writeback on a random number of pages. Then
you call into node reclaim. What I really wanted to do was kill the
wakeup_flusher_threads() completely, we want to rely on
try_to_free_pages() starting targeted writeback instead of some magic
number of pages across devices. But I decided that we should probably
just keep that random flusher wakeup, and tackle that part in a v2 for
4.15. Further code cleanups and improvements are very possible, getting
rid of the work unit allocation completely and just folding this odd
wakeup in with the general writeback starting infrastructure.

On top of that, writeback has changed a LOT since those days: we track
device write bandwidths and try to do the right thing, and we have
per-sb flusher threads that do the actual flushing and no longer rely on
non-blocking congestion to throttle.

We also used to call free_more_memory() from getblk(), if that failed.
It's got a sched_yield() in it, for crying out loud!
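
For reference, the whole function is roughly this (from memory, so the
details may be slightly off for the current tree):

	static void free_more_memory(void)
	{
		struct zoneref *z;
		int nid;

		/* ask the flushers for some writeback, give them a chance to run */
		wakeup_flusher_threads(1024, WB_REASON_FREE_MORE_MEM);
		yield();

		/* then walk every node and run GFP_NOFS direct reclaim */
		for_each_online_node(nid) {
			z = first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
						 gfp_zone(GFP_NOFS), NULL);
			if (z->zone)
				try_to_free_pages(node_zonelist(nid, GFP_NOFS),
						  0, GFP_NOFS, NULL);
		}
	}

A blind "flush 1024 pages somewhere" plus a yield(), and then a manual
walk over all nodes to reclaim, completely outside the normal allocator
path and without ever checking whether there is anything dirty to write.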

>> That's why I sent it in for this series. Yes, it's not months old code,
>> but it has been tested. And it's not like it's a lot of NEW code, it's
>> mostly removing code, or moving things around.
>>
>> I can resend without this stuff, but I think it'd be a shame to release
>> 4.14 without it. Let me know, and I'll respin and send a new pull
>> request.
> 
> So in order to make sane progress, I would suggest:
> 
>  - send the actual obvious *fixes* in separately as a branch of
> unquestionable fixes. If it fixes an Oops, or is a security fix, or
> makes something not work, it's a hard fix.
> 
>  - the writeback thing may make sense, but it really is a performance
> tweak for a particular load at FB, and it should be seen as such. It's
> not obviously a fix, and it has some really subtle issues. See above.

The workload is nothing special. Getting to the point where it's bad
enough to trigger a softlockup probably requires some serious beating up
of the box, but getting to a suboptimal state wrt work units pending and
the flusher threads is pretty trivial.

> So I could still take the writeback series too. But I absolutely
> *refuse* to take that as part of a "bugfix" pull, as if it was somehow
> obviously a fix. Because it damn well is not at all obviously a fix.
> 
> So send that series *separately* as an independent branch, so that it's
> obvious that "hey, this is something else that would help some FB
> loads, we'd be super happy to have it in 4.14 due to it being a LTS
> release".
> 
> And then I might just decide "ok, the code looks ok, I'm willing to
> risk it, and if it causes problems for somebody, it came in as a
> separate merge that could trivially be reverted".
> 
> See what I'm saying?

Yes, that's a good point, it should have been a separate pull request
for the writeback changes. 

I'll split them up and resend as two separate pull requests.
Chris Mason Sept. 25, 2017, 3:52 p.m. UTC | #5
On 09/24/2017 11:16 PM, Linus Torvalds wrote:
> On Sun, Sep 24, 2017 at 5:03 PM, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> NVMe fixes have sometimes been accepted for the current series, while
>> they should have been pushed. I'm not going to argue with that, but I
>> think we've managed to make that better the last few series. It's still
>> not squeaky clean, but it is improving substantially.  I don't believe
>> that's even been the case for core changes, which have always been
>> scrutinized much more heavily.
> 
> The NVMe fixes actually looked like real fixes. It's not what I
> reacted to, I would have happily pulled those.
> 
> The comment about "NVMe and generic block layer" was about these areas
> in general having been problem spots, and not the particular NVMe
> patches in this pull request.
> 
>> I knew you might object to the writeback changes. As mentioned in the
>> pull request, if you diff the whole thing, it's really not big at all.
> 
> Oh, it wasn't horribly big, and I agree that it was easy to read
> through, but it really was entirely new code and really should have
> been a merge window issue, not in a "fixes" pull.
> 
> And I did check the history of those patches, and it was all very very
> recent and had recent comments on it too - it wasn't some old patches
> that had been around for a while.
> 
> And while the patches were small, their effect was decidedly *not* obvious.
> 
> For example, while the patches were easy to read, one of them changed
> the "try to free some memory" behavior fairly radically, going from
> write out "at most 1024 pages" to "everything".
> 
> That may be a simple one-liner patch to read, but it sure as hell
> wasn't obvious what the behavioral change really is under different
> loads.
> 
> And when the patch is two days old, I can pretty much guarantee that
> it hasn't been tested under a huge variety of loads. Maybe it does
> exactly the right thing for *some* load, but...

I've got a little more context for where these fixes came from.  We've 
been seeing softlockups in move_expired_inodes() for years, going back 
to at least v3.10 kernels.  It was relatively rare (20-80 times per week 
across the fleet) and we always recovered before anyone could hop on and 
figure out what was going on.  I assumed it was something crazy on boxes 
with lots of files and lots of filesystems where dirty inodes were 
popping back into the list, somehow making move_expired_inodes() take 
forever.

Two weeks ago we had two boxes just explode, 600,000,00+ kmalloc-64 slab 
entries, with new ones being added basically at CPU speed.  They were 
constantly waking up the flushers and adding new kmalloc-64 from 
wb_start_writeback().  Should have seen it forever ago, but the 
softlockups were just wb_do_writeback() churning through its list.

It takes a special case to hit the softlockup, since you need vmscan.c 
to send a bunch of work items but then you need the filesystems to not 
really have any dirty pages written while we process the work items. 
Some dumbass (me) typing sync to find out if the memory gets reclaimed 
does seem to help trigger.

On mainline, I was able to trigger 500,000+ work items without too much 
trouble, but I couldn't make the softlockup happen on purpose.  For more 
realistic tests, a workload with lots of memory pressure will frequently 
pile up bursts of 1000+ work items.  Each one comes from 
wakeup_flusher_threads() so each one is trying to flush all the dirty pages.
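
(For context, each of those work items comes from wb_start_writeback(),
which does roughly the following for every wakeup, per bdi_writeback --
sketched from memory, so field names are approximate:

	struct wb_writeback_work *work;

	/* one ~64-byte work item per wakeup -- the kmalloc-64 entries above */
	work = kzalloc(sizeof(*work),
		       GFP_NOWAIT | __GFP_NOMEMALLOC | __GFP_NOWARN);
	if (!work) {
		wb_wakeup(wb);
		return;
	}

	work->sync_mode	   = WB_SYNC_NONE;
	work->nr_pages	   = nr_pages;
	work->range_cyclic = range_cyclic;
	work->reason	   = reason;
	work->auto_free	   = 1;

	wb_queue_work(wb, work);

Nothing checks whether an identical flush-everything request is already
queued, which is why the list can grow without bound under memory
pressure.)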

> 
> It honestly looked like that patch existed purely in order to make the
> next few patches be trivial ("no functional changes in this patch")
> simply because the functional changes were done earlier.
> 
> That's all actually very nice for bisecting etc, and honestly, I liked
> how the patch series looked and was split out, but I liked how it
> looked as a *development* series.
> 
> It didn't make me go "this is a bug fix".
> 
>> It's mostly shuffling some arguments and things around, to make the
>> final patch trivial. And it does mean that it's super easy to review.
> 
> See above. I think it was "super easy to review" only on a very
> superficial level, not on a deeper one.
> 
> For example, I doubt you actually went back and looked at where that
> "1024" came from that you just changed to zero?
> 
> It comes from the bitkeeper days, commit 407ee6c87e ("[PATCH]
> low-latency page reclaim") in the historical tree:
> 
>      [PATCH] low-latency page reclaim
> 
>      Convert the VM to not wait on other people's dirty data.
> 
>       - If we find a dirty page and its queue is not congested, do some
> writeback.
> 
>       - If we find a dirty page and its queue _is_ congested then just
>         refile the page.
> 
>       - If we find a PageWriteback page then just refile the page.
> 
>       - There is additional throttling for write(2) callers.  Within
>         generic_file_write(), record their backing queue in ->current.
>         Within page reclaim, if this tasks encounters a page which is dirty
>         or under writeback onthis queue, block on it.  This gives some more
>         writer throttling and reduces the page refiling frequency.
> 
>      It's somewhat CPU expensive - under really heavy load we only get a 50%
>      reclaim rate in pages coming off the tail of the LRU.  This can be
>      fixed by splitting the inactive list into reclaimable and
>      non-reclaimable lists.  But the CPU load isn't too bad, and latency is
>      much, much more important in these situations.
> 
>      Example: with `mem=512m', running 4 instances of `dbench 100', 2.5.34
>      took 35 minutes to compile a kernel.  With this patch, it took three
>      minutes, 45 seconds.
> 
>      I haven't done swapcache or MAP_SHARED pages yet.  If there's tons of
>      dirty swapcache or mmap data around we still stall heavily in page
>      reclaim.  That's less important.
> 
>      This patch also has a tweak for swapless machines: don't even bother
>      bringing anon pages onto the inactive list if there is no swap online.
> 
> which introduced that "long nr_pages" argument to wakeup_bdflush().
> 
> Here's that patch for people who don't keep that historical tree
> around like I do:
> 
>      https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=407ee6c87e477434e3cb8be96885ed27b5539b6f
> 
> Now, it's entirely possible that the issue that made us do that in the
> first place is long long gone, and we don't need that page limiter in
> the VM any more.
> 
> But is it "super easy to review"?
> 

Looking at free_more_memory() in fs/buffer.c, it only gets called when 
grow_dev_page() returns 0 (because find_or_create_page() returned NULL) 
or in alloc_page_buffers() when alloc_buffer_head() returns NULL.

These are always done from sleeping contexts, I think from GFP_NOFS. 
So we're failing a single page or single slab allocation, and vmscan.c 
has almost certainly already done a full wakeup_flusher_threads(), and 
then fs/buffer.c does it again. For good measure, grow_dev_page() also 
does gfp_mask |= __GFP_NOFAIL.

I originally suggested we just remove the flusher wakeup completely from 
free_more_memory(), but I think Jens felt safer just turning it into a 
rate limited full flush like vmscan.c was already doing.

-chris
Linus Torvalds Sept. 25, 2017, 6:32 p.m. UTC | #6
On Mon, Sep 25, 2017 at 7:46 AM, Jens Axboe <axboe@kernel.dk> wrote:
>
> Honestly, what I really wanted to do was kill that call completely.

I can understand why you'd want to, but at the same time it is not
entirely clear that is possible.

So the problem is that we need more memory for buffer head
allocations, and we *have* to get it (since we very likely need the
buffer heads for writeout under low-mem circumstances) - but at the
same time we obviously must not recurse into the filesystem.

The end result being that any synchronous freeing is going to be very
limited (GFP_NOFS really ends up limiting a lot), and so there's a
fairly strong argument that we should then wake up other threads that
*can* do filesystem writeback.

Of course, the problem with _that_ is that it's quite likely that the
flusher threads doing filesystem writeback may in turn get back to
this very same "we need buffer heads".

And that is, I assume, the case that facebook then hits.

So it's kind of a "damned if you do, damned if you don't" situation.

But this is very much also why that free_more_memory() code then

 (a) limits the writeout

 (b) has that "yield()" in it

with the logic being to not cause unnecessarily much writeout, and
hopefully avoid huge latency spikes because of the (potential) looping
over the unsuccessful GFP_NOFS allocations.

But yes, a lot has changed over time. For example, originally
filesystems that used buffer heads used them both for reading and
writing, so the act of reading in a page tended to also populate the
buffer heads for that page.

I think most filesystems these days avoid buffer heads entirely, and
the dynamics of bh allocations have likely changed radically.

So I wouldn't remove the "wakeup_flusher_threads()" call, but maybe it
could be moved to be a unusual fallback case if the direct freeing
doesn't work.

But none of this happens unless that alloc_buffer_head() failed, so at
least one GFP_NOFS allocation has already failed miserably, which is
why that "make other threads that _can_ do filesystem IO try to free
things" exists in the first place.

> If you read the free_more_memory() function, it looks like crap.

I really don't see that.

Think of what the preconditions are:

 - a GFP_NOFS allocation has failed

 - we obviously cannot call back into filesystem code

and that function makes a fair amount of sense. It's _supposed_ to be
an unusually nasty case, after all.

Now,

> What I really wanted to do was kill the
> wakeup_flusher_threads() completely, we want to rely on
> try_to_free_pages() starting targeted writeback instead of some magic
> number of pages across devices.

Yes, hopefully the regular memory allocation should already do the
right thing, but the problem here is that we do need to loop over it,
which it isn't really designed for.

But one option might be to just say "_we_ shouldn't loop, the page
allocator should just loop until it gets a successful end result".

So remove _all_ of the fallback code entirely, and just use __GFP_NOFAIL.

That is in fact what some filesystems use for their own metadata:
you'll find that the

    GFP_NOFS | __GFP_NOFAIL

combination is not entirely unheard of, and in fact that buffer code
itself actually uses it for the backing page itself (not the buffer
heads) in grow_dev_page().

So rather than play more games in free_more_memory(), I'd personally
just rather remove that function entirely, and say "the buffer head
allocations simply cannot fail" and leave it to the VM code to do the
right thing.
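
(In sketch form -- not an actual patch -- that would mean the buffer head
allocation in alloc_page_buffers() becoming something like

	bh = alloc_buffer_head(GFP_NOFS | __GFP_NOFAIL);

and then deleting the retry loop that falls back to free_more_memory()
when the allocation comes back NULL.)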

> But I decided that we should probably
> just keep that random flusher wakeup, and tackle that part in a v2 for
> 4.15.

Oh, absolutely, I'd be more than happy to play with this code, but as
mentioned, I actually would do even more radical surgery on it.

So I'm not at all objecting to changing that function in major ways.

I just don't think that changing the flush count is in any way
obvious. In many ways it's a *much* scarier change than adding
__GFP_NOFAIL, in my opinion.

At least with __GFP_NOFAIL, we have some fairly good reason to believe
that now the memory management code won't be doing enormous flushing
work "just because".

So my only real objection was really the timing, and the "it's
obviously a fix".

> The workload is nothing special. Getting to the point where it's bad
> enough to trigger a softlockup probably requires some serious beating up
> of the box, but getting to a suboptimal state wrt work units pending and
> the flusher threads is pretty trivial.

I'm actually surprised that you have what appears to be a buffer-head
heavy workload at all.

What is it that triggers that many buffer heads in the first place?
Because I thought we'd gotten to the point where all normal file IO
can avoid the buffer heads entirely, and just directly work with
making bio's from the pages.

So I assume this is some metadata writeout (i.e. either directory
writeback or just the btrfs metadata tree)?

                     Linus
Chris Mason Sept. 25, 2017, 9:17 p.m. UTC | #7
On 09/25/2017 02:32 PM, Linus Torvalds wrote:
> On Mon, Sep 25, 2017 at 7:46 AM, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> Honestly, what I really wanted to do was kill that call completely.
> 
> I can understand why you'd want to, but at the same time it is not
> entirely clear that is possible.
> 
> So the problem is that we need more memory for buffer head
> allocations, and we *have* to get it (since we very likely need the
> buffer heads for writeout under low-mem circumstances) - but at the
> same time we obviously must not recurse into the filesystem.
> 
> The end result being that any synchronous freeing is going to be very
> limited (GFP_NOFS really ends up limiting a lot), and so there's a
> fairly strong argument that we should then wake up other threads that
> *can* do filesystem writeback.
> 
> Of course, the problem with _that_ is that it's quite likely that the
> flusher threads doing filesystem writeback may in turn get back to
> this very same "we need buffer heads".
> 
> And that is, I assume, the case that facebook then hits.
> 
> So it's kind of a "damned if you do, damned if you don't" situation.
> 
> But this is very much also why that free_more_memory() code then
> 
>   (a) limits the writeout
> 
>   (b) has that "yield()" in it
> 
> with the logic being to not cause unnecessarily much writeout, and
> hopefully avoid huge latency spikes because of the (potential) looping
> over the unsuccessful GFP_NOFS allocations.

My understanding is that for order-0 page allocations and 
kmem_cache_alloc(buffer_heads), GFP_NOFS is going to either loop forever 
or at the very least OOM kill something before returning NULL?

By the time free_more_memory() is called, vmscan.c has already triggered 
a full flush via wakeup_flusher_threads() at least once, and probably 
OOM killed something.

[ ... ]

> 
>> What I really wanted to do was kill the
>> wakeup_flusher_threads() completely, we want to rely on
>> try_to_free_pages() starting targeted writeback instead of some magic
>> number of pages across devices.
> 
> Yes, hopefully the regular memory allocation should already do the
> right thing, but the problem here is that we do need to loop over it,
> which it isn't really designed for.
> 
> But one option might be to just say "_we_ shouldn't loop, the page
> allocator should just loop until it gets a successful end result".
> 
> So remove _all_ of the fallback code entirely, and just use __GFP_NOFAIL.
> 

Agreed.  I think Jens was planning on this for the next merge window.

> That is in fact what some filesystems use for their own metadata:
> you'll find that the
> 
>      GFP_NOFS | __GFP_NOFAIL
> 
> combination is not entirely unheard of, and in fact that buffer code
> itself actually uses it for the backing page itself (not the buffer
> heads) in grow_dev_page().
> 
> So rather than play more games in free_more_memory(), I'd personally
> just rather remove that function entirely, and say "the buffer head
> allocations simply cannot fail" and leave it to the VM code to do the
> right thing.
> 
>> But I decided that we should probably
>> just keep that random flusher wakeup, and tackle that part in a v2 for
>> 4.15.
> 
> Oh, absolutely, I'd be more than happy to play with this code, but as
> mentioned, I actually would do even more radical surgery on it.
> 
> So I'm not at all objecting to changing that function in major ways.
> 
> I just don't think that changing the flush count is in any way
> obvious. In many ways it's a *much* scarier change than adding
> __GFP_NOFAIL, in my opinion.
> 
> At least with __GFP_NOFAIL, we have some fairly good reason to believe
> that now the memory management code won't be doing enormous flushing
> work "just because".
> 
> So my only real objection was really the timing, and the "it's
> obviously a fix".
> 
>> The workload is nothing special. Getting to the point where it's bad
>> enough to trigger a softlockup probably requires some serious beating up
>> of the box, but getting to a suboptimal state wrt work units pending and
>> the flusher threads is pretty trivial.
> 
> I'm actually surprised that you have what appears to be a buffer-head
> heavy workload at all.
> 
> What is it that triggers that many buffer heads in the first place?
> Because I thought we'd gotten to the point where all normal file IO
> can avoid the buffer heads entirely, and just directly work with
> making bio's from the pages.
>

We're not triggering free_more_memory().  I ran a probe on a few 
production machines and it didn't fire once over a 90 minute period of 
heavy load.  The main target of Jens' patchset was preventing 
shrink_inactive_list() -> wakeup_flusher_threads() from creating 
millions of work items without any rate limiting at all.

Once we went through and audited all of the callers of 
wakeup_flusher_threads(), free_more_memory() kind of stuck out as being 
weird.  It's trying to solve the same problems that vmscan.c is trying 
to solve (there are no free pages), but it's using a different method 
and not really communicating with the MM code.  It's not even checking 
to see if there are dirty pages.

I liked the way Jens changed it to at least use the same args for 
wakeup_flusher_threads() as vmscan.c is now using.

But with all of that said, the fs/buffer.c hunks are the least important 
part of the series for what we've seen in production.  We only really 
care about when shrink_inactive_list() is calling wakeup_flusher_threads().
> So I assume this is some metadata writeout (i.e. either directory
> writeback or just the btrfs metadata tree)?

Btrfs does sneak into the softlockup story on some machines because I 
have a check in our metadata writepages to bail if there is less than 
32MB of dirty metadata.  We added it ages ago because COW works better 
when we build up a good sized chunk of stuff to COW.
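
(Roughly, the check in btree_writepages() looks like this -- from memory,
so the exact form may differ:

	if (wbc->sync_mode == WB_SYNC_NONE) {
		/* not enough dirty metadata yet to be worth COWing */
		if (percpu_counter_compare(&fs_info->dirty_metadata_bytes,
					   BTRFS_DIRTY_METADATA_THRESH) < 0)
			return 0;
	}
	return btree_write_cache_pages(mapping, wbc);

with BTRFS_DIRTY_METADATA_THRESH being the 32MB mentioned above.)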

It goes into a weird corner case where we have dirty pages that we're 
refusing to write during background writeout, and shrink_inactive_list() 
hopes that adding one more wakeup_flusher_threads() call will clean 
them.  But it's not related to buffer_heads, which btrfs only uses to 
write the super blocks.

We see our softlockups most on boxes with 15 XFS filesystems.  Most of 
these are ext4 for root, and the ext4 pages/buffer_heads are a small 
fraction of what we have in page cache.  It's hadoop, so java monsters 
doing plain buffered IO, mostly onto the XFS drives.

-chris
Linus Torvalds Sept. 25, 2017, 10:21 p.m. UTC | #8
On Mon, Sep 25, 2017 at 2:17 PM, Chris Mason <clm@fb.com> wrote:
>
> My understanding is that for order-0 page allocations and
> kmem_cache_alloc(buffer_heads), GFP_NOFS is going to either loop forever or
> at the very least OOM kill something before returning NULL?

That should generally be true. We've occasionally screwed up in the
VM, so an explicit GFP_NOFAIL would definitely be best if we then
remove the looping in fs/buffer.c.

>> What is it that triggers that many buffer heads in the first place?
>> Because I thought we'd gotten to the point where all normal file IO
>> can avoid the buffer heads entirely, and just directly work with
>> making bio's from the pages.
>
> We're not triggering free_more_memory().  I ran a probe on a few production
> machines and it didn't fire once over a 90 minute period of heavy load.  The
> main target of Jens' patchset was preventing shrink_inactive_list() ->
> wakeup_flusher_threads() from creating millions of work items without any
> rate limiting at all.

So the two things I reacted to in that patch series were apparently
things that you guys don't even care about.

I reacted to the fs/buffer.c code, and to the change in laptop mode to
not do circular writeback.

The latter is another "it's probably ok, but it can be a subtle
change". In particular, things that re-write the same thing over and
over again can get very different behavior, even when you write out
"all" pages.

And I'm assuming you're not using laptop mode either on your servers
(that sounds insane, but I remember somebody actually ended up using
laptop mode even on servers, simply because they did *not* want the
regular timed writeback model, so it's not quite as insane as it
sounds).

                         Linus
Chris Mason Sept. 25, 2017, 10:48 p.m. UTC | #9
On 09/25/2017 06:21 PM, Linus Torvalds wrote:
> On Mon, Sep 25, 2017 at 2:17 PM, Chris Mason <clm@fb.com> wrote:
>>
>> My understanding is that for order-0 page allocations and
>> kmem_cache_alloc(buffer_heads), GFP_NOFS is going to either loop forever or
>> at the very least OOM kill something before returning NULL?
> 
> That should generally be true. We've occasionally screwed up in the
> VM, so an explicit GFP_NOFAIL would definitely be best if we then
> remove the looping in fs/buffer.c.

Right, I wouldn't remove the looping without the NOFAIL.  But in the 
normal case, it shouldn't be possible for free_more_memory() to be 
called without an OOM and without already having triggered the full flush.

That's why using the full flush in free_more_memory() felt like a small 
change to me.  But if you'd rather see GFP_NOFAIL in the next merge 
window I don't have any objections to that method either.

> 
>>> What is it that triggers that many buffer heads in the first place?
>>> Because I thought we'd gotten to the point where all normal file IO
>>> can avoid the buffer heads entirely, and just directly work with
>>> making bio's from the pages.
>>
>> We're not triggering free_more_memory().  I ran a probe on a few production
>> machines and it didn't fire once over a 90 minute period of heavy load.  The
>> main target of Jens' patchset was preventing shrink_inactive_list() ->
>> wakeup_flusher_threads() from creating millions of work items without any
>> rate limiting at all.
> 
> So the two things I reacted to in that patch series were apparently
> things that you guys don't even care about.

Yes and no.  fs/buffer.c didn't explode in prod recently, but we do 
exercise the OOM code often.  Even though I haven't seen it happen, I'd 
rather not leave fs/buffer.c able to trigger the same work explosion 
during an OOM spiral.  I know it's really unlikely, but java.

> 
> I reacted to the fs/buffer.c code, and to the change in laptop mode to
> not do circular writeback.
> 
> The latter is another "it's probably ok, but it can be a subtle
> change". In particular, things that re-write the same thing over and
> over again can get very different behavior, even when you write out
> "all" pages.
> 
> And I'm assuming you're not using laptop mode either on your servers
> (that sounds insane, but I remember somebody actually ended up using
> laptop mode even on servers, simply because they did *not* want the
> regular timed writeback model, so it's not quite as insane as it
> sounds).

I'd honestly have to do a query to make sure we aren't using it in some 
dark corner.  I sometimes think people try things just to see how long 
it takes Jens to notice.

-chris
Jens Axboe Sept. 27, 2017, 12:41 p.m. UTC | #10
Stable again after travel, catching up. Chris did a great job explaining what
our issues were, so thanks for that.

On 09/26/2017 12:21 AM, Linus Torvalds wrote:
> On Mon, Sep 25, 2017 at 2:17 PM, Chris Mason <clm@fb.com> wrote:
>>
>> My understanding is that for order-0 page allocations and
>> kmem_cache_alloc(buffer_heads), GFP_NOFS is going to either loop forever or
>> at the very least OOM kill something before returning NULL?
> 
> That should generally be true. We've occasionally screwed up in the
> VM, so an explicit GFP_NOFAIL would definitely be best if we then
> remove the looping in fs/buffer.c.

Reworked to include that. More below.

>>> What is it that triggers that many buffer heads in the first place?
>>> Because I thought we'd gotten to the point where all normal file IO
>>> can avoid the buffer heads entirely, and just directly work with
>>> making bio's from the pages.
>>
>> We're not triggering free_more_memory().  I ran a probe on a few production
>> machines and it didn't fire once over a 90 minute period of heavy load.  The
>> main target of Jens' patchset was preventing shrink_inactive_list() ->
>> wakeup_flusher_threads() from creating millions of work items without any
>> rate limiting at all.
> 
> So the two things I reacted to in that patch series were apparently
> things that you guys don't even care about.

Right. But I'd like to stress that my development practice is to engineer
things that make sense in general, AND that fix the specific issue at
hand. This is never about making the general case worse while fixing
some FB-specific issue.

I'm very sure that others hit this case as well. Maybe not to the
extent of getting softlockups, but abysmal behavior happens long before
that. It just doesn't trigger any dmesg complaints.

> I reacted to the fs/buffer.c code, and to the change in laptop mode to
> not do circular writeback.
> 
> The latter is another "it's probably ok, but it can be a subtle
> change". In particular, things that re-write the same thing over and
> over again can get very different behavior, even when you write out
> "all" pages.
> 
> And I'm assuming you're not using laptop mode either on your servers
> (that sounds insane, but I remember somebody actually ended up using
> laptop mode even on servers, simply because they did *not* want the
> regular timed writeback model, so it's not quite as insane as it
> sounds).

So I reworked the series, to include three prep patches that end up
killing off free_more_memory(). This means that we don't have to do the
1024 -> 0 change in there. On top of that, I added a separate bit to
manage range cyclic vs non range cyclic flush all work. This means that
we don't have to worry about the laptop case either.

I think that should quell any of the concerns in the patchset, you can
find the new series here:

http://git.kernel.dk/cgit/linux-block/log/?h=wb-start-all

Unless you feel comfortable taking it for 4.14, I'm going to push this
to 4.15. In any case, it won't be ready until tomorrow, I need to push
this through the test machinery just in case.
Linus Torvalds Sept. 27, 2017, 5:36 p.m. UTC | #11
On Wed, Sep 27, 2017 at 5:41 AM, Jens Axboe <axboe@kernel.dk> wrote:
>
> So I reworked the series, to include three prep patches that end up
> killing off free_more_memory(). This means that we don't have to do the
> 1024 -> 0 change in there. On top of that, I added a separate bit to
> manage range cyclic vs non range cyclic flush all work. This means that
> we don't have to worry about the laptop case either.
>
> I think that should quell any of the concerns in the patchset, you can
> find the new series here:
>
> http://git.kernel.dk/cgit/linux-block/log/?h=wb-start-all

Yeah, this I feel more confident about.

It still has some subtle changes (nr_pages is slightly different for
laptop mode, and the whole "we rely on the VM to not return NULL") but
I cannot for the life of me imagine that those changes are noticeable
or meaningful.

So now that last commit that limits the pending flushes is the only
one that seems to matter, and the one you wanted to build up to. It's
not a subtle change, but that's the one that fixes the actual bug you
see, so now the rest of the series really looks like "prep-work for
the bug fix".

Of course, with the change to support range_cycling for the laptop
mode case, there's actually possibly _two_ pending full flushes (the
range-cyclic and the "all" one), so that last changelog may not be
entirely accurate.

Long-term, I would prefer to maybe make the "range_whole" case to
always be range-cyclic, because I suspect that's the better behavior,
but I think that series is the one that changes existing semantics the
least for the existing cases, so I think your approach is better for
now.

As far as I can tell, the cases that do not set range_cyclic right now are:

 - wakeup_flusher_threads()

 - sync_inodes_sb()

and in neither case does that lack of range_cyclic really seem to make
any sense.

It might be a purely historical artifact (an _old_ one: that
range_cyclic logic goes back to 2006, as far as I can tell).

But there might also be something I'm missing.

Anyway, your series looks fine to me.

              Linus
Jens Axboe Sept. 27, 2017, 7:24 p.m. UTC | #12
On 09/27/2017 07:36 PM, Linus Torvalds wrote:
> On Wed, Sep 27, 2017 at 5:41 AM, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> So I reworked the series, to include three prep patches that end up
>> killing off free_more_memory(). This means that we don't have to do the
>> 1024 -> 0 change in there. On top of that, I added a separate bit to
>> manage range cyclic vs non range cyclic flush all work. This means that
>> we don't have to worry about the laptop case either.
>>
>> I think that should quell any of the concerns in the patchset, you can
>> find the new series here:
>>
>> http://git.kernel.dk/cgit/linux-block/log/?h=wb-start-all
> 
> Yeah, this I feel more confident about.
> 
> It still has some subtle changes (nr_pages is slightly different for
> laptop mode, and the whole "we rely on the VM to not return NULL") but
> I cannot for the life of me imagine that those changes are noticeable
> or meaningful.
> 
> So now that last commit that limits the pending flushes is the only
> one that seems to matter, and the one you wanted to build up to. It's
> not a subtle change, but that's the one that fixes the actual bug you
> see, so now the rest of the series really looks like "prep-work for
> the bug fix".
> 
> Of course, with the change to support range_cycling for the laptop
> mode case, there's actually possibly _two_ pending full flushes (the
> range-cyclic and the "all" one), so that last changelog may not be
> entirely accurate.

Yeah good point. I did modify it a little bit, but I should make that
clear throughout that changelog. Functionally it's identical as far as
the bug is concerned, 1 or 2 is a far cry from hundreds of thousands or
millions.

> Long-term, I would prefer to maybe make the "range_whole" case to
> always be range-cyclic, because I suspect that's the better behavior,
> but I think that series is the one that changes existing semantics the
> least for the existing cases, so I think your approach is better for
> now.

I agree, I was actually contemplating whether to make that change or
not. I don't see a reason to ever NOT do range cyclic writeback for the
full range request. Then we could kill the second wb->state bit and get
back to just having the one in flight.

> As far as I can tell, the cases that do not set range_cyclic right now are:
> 
>  - wakeup_flusher_threads()
> 
>  - sync_inodes_sb()
> 
> and in neither case does that lack of range_cyclic really seem to make
> any sense.
> 
> It might be a purely historical artifact (an _old_ one: that
> range_cyclic logic goes back to 2006, as far as I can tell).
> 
> But there might also be something I'm missing.
> 
> Anyway, your series looks fine to me.

Thanks for taking a look. I feel like the change should be made to do
range cyclic flushes for any request that wants to start writeback on
the full range; I can't think of a case where that makes a difference.
Then we can kill the extra bit, and more importantly, kill this part:

        if (reason == WB_REASON_LAPTOP_TIMER)                                   
                range_cyclic = true;

which is a bit of a hack. I'll respin with that, and repost the series.

Patch

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index acc816b67582..bb2aad078637 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -134,8 +134,6 @@  static inline bool nvme_req_needs_retry(struct request *req)
 		return false;
 	if (nvme_req(req)->status & NVME_SC_DNR)
 		return false;
-	if (jiffies - req->start_time >= req->timeout)
-		return false;
 	if (nvme_req(req)->retries >= nvme_max_retries)
 		return false;
 	return true;
@@ -2590,7 +2588,7 @@  static void nvme_async_event_work(struct work_struct *work)
 		container_of(work, struct nvme_ctrl, async_event_work);
 
 	spin_lock_irq(&ctrl->lock);
-	while (ctrl->event_limit > 0) {
+	while (ctrl->state == NVME_CTRL_LIVE && ctrl->event_limit > 0) {
 		int aer_idx = --ctrl->event_limit;
 
 		spin_unlock_irq(&ctrl->lock);
@@ -2677,7 +2675,8 @@  void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
 		/*FALLTHRU*/
 	case NVME_SC_ABORT_REQ:
 		++ctrl->event_limit;
-		queue_work(nvme_wq, &ctrl->async_event_work);
+		if (ctrl->state == NVME_CTRL_LIVE)
+			queue_work(nvme_wq, &ctrl->async_event_work);
 		break;
 	default:
 		break;
@@ -2692,7 +2691,7 @@  void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
 		nvme_queue_scan(ctrl);
 		break;
 	case NVME_AER_NOTICE_FW_ACT_STARTING:
-		schedule_work(&ctrl->fw_act_work);
+		queue_work(nvme_wq, &ctrl->fw_act_work);
 		break;
 	default:
 		dev_warn(ctrl->device, "async event result %08x\n", result);
diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index 47307752dc65..555c976cc2ee 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -565,6 +565,7 @@  static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
 	opts->queue_size = NVMF_DEF_QUEUE_SIZE;
 	opts->nr_io_queues = num_online_cpus();
 	opts->reconnect_delay = NVMF_DEF_RECONNECT_DELAY;
+	opts->kato = NVME_DEFAULT_KATO;
 
 	options = o = kstrdup(buf, GFP_KERNEL);
 	if (!options)
@@ -655,21 +656,22 @@  static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
 				goto out;
 			}
 
-			if (opts->discovery_nqn) {
-				pr_err("Discovery controllers cannot accept keep_alive_tmo != 0\n");
-				ret = -EINVAL;
-				goto out;
-			}
-
 			if (token < 0) {
 				pr_err("Invalid keep_alive_tmo %d\n", token);
 				ret = -EINVAL;
 				goto out;
-			} else if (token == 0) {
+			} else if (token == 0 && !opts->discovery_nqn) {
 				/* Allowed for debug */
 				pr_warn("keep_alive_tmo 0 won't execute keep alives!!!\n");
 			}
 			opts->kato = token;
+
+			if (opts->discovery_nqn && opts->kato) {
+				pr_err("Discovery controllers cannot accept KATO != 0\n");
+				ret = -EINVAL;
+				goto out;
+			}
+
 			break;
 		case NVMF_OPT_CTRL_LOSS_TMO:
 			if (match_int(args, &token)) {
@@ -762,8 +764,6 @@  static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
 	uuid_copy(&opts->host->id, &hostid);
 
 out:
-	if (!opts->discovery_nqn && !opts->kato)
-		opts->kato = NVME_DEFAULT_KATO;
 	kfree(options);
 	return ret;
 }
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index d2e882c0f496..af075e998944 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -1376,7 +1376,7 @@  nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
 	if (atomic_read(&op->state) == FCPOP_STATE_ABORTED)
 		status = cpu_to_le16((NVME_SC_ABORT_REQ | NVME_SC_DNR) << 1);
 	else if (freq->status)
-		status = cpu_to_le16(NVME_SC_FC_TRANSPORT_ERROR << 1);
+		status = cpu_to_le16(NVME_SC_INTERNAL << 1);
 
 	/*
 	 * For the linux implementation, if we have an unsuccesful
@@ -1404,7 +1404,7 @@  nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
 		 */
 		if (freq->transferred_length !=
 			be32_to_cpu(op->cmd_iu.data_len)) {
-			status = cpu_to_le16(NVME_SC_FC_TRANSPORT_ERROR << 1);
+			status = cpu_to_le16(NVME_SC_INTERNAL << 1);
 			goto done;
 		}
 		result.u64 = 0;
@@ -1421,7 +1421,7 @@  nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
 					freq->transferred_length ||
 			     op->rsp_iu.status_code ||
 			     sqe->common.command_id != cqe->command_id)) {
-			status = cpu_to_le16(NVME_SC_FC_TRANSPORT_ERROR << 1);
+			status = cpu_to_le16(NVME_SC_INTERNAL << 1);
 			goto done;
 		}
 		result = cqe->result;
@@ -1429,7 +1429,7 @@  nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
 		break;
 
 	default:
-		status = cpu_to_le16(NVME_SC_FC_TRANSPORT_ERROR << 1);
+		status = cpu_to_le16(NVME_SC_INTERNAL << 1);
 		goto done;
 	}
 
@@ -1989,16 +1989,17 @@  nvme_fc_start_fcp_op(struct nvme_fc_ctrl *ctrl, struct nvme_fc_queue *queue,
 	 * as well as those by FC-NVME spec.
 	 */
 	WARN_ON_ONCE(sqe->common.metadata);
-	WARN_ON_ONCE(sqe->common.dptr.prp1);
-	WARN_ON_ONCE(sqe->common.dptr.prp2);
 	sqe->common.flags |= NVME_CMD_SGL_METABUF;
 
 	/*
-	 * format SQE DPTR field per FC-NVME rules
-	 *    type=data block descr; subtype=offset;
-	 *    offset is currently 0.
+	 * format SQE DPTR field per FC-NVME rules:
+	 *    type=0x5     Transport SGL Data Block Descriptor
+	 *    subtype=0xA  Transport-specific value
+	 *    address=0
+	 *    length=length of the data series
 	 */
-	sqe->rw.dptr.sgl.type = NVME_SGL_FMT_OFFSET;
+	sqe->rw.dptr.sgl.type = (NVME_TRANSPORT_SGL_DATA_DESC << 4) |
+					NVME_SGL_FMT_TRANSPORT_A;
 	sqe->rw.dptr.sgl.length = cpu_to_le32(data_len);
 	sqe->rw.dptr.sgl.addr = 0;
 
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 4a2121335f48..cb73bc8cad3b 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -24,6 +24,7 @@ 
 #include <linux/mm.h>
 #include <linux/module.h>
 #include <linux/mutex.h>
+#include <linux/once.h>
 #include <linux/pci.h>
 #include <linux/poison.h>
 #include <linux/t10-pi.h>
@@ -540,6 +541,20 @@  static void nvme_dif_complete(u32 p, u32 v, struct t10_pi_tuple *pi)
 }
 #endif
 
+static void nvme_print_sgl(struct scatterlist *sgl, int nents)
+{
+	int i;
+	struct scatterlist *sg;
+
+	for_each_sg(sgl, sg, nents, i) {
+		dma_addr_t phys = sg_phys(sg);
+		pr_warn("sg[%d] phys_addr:%pad offset:%d length:%d "
+			"dma_address:%pad dma_length:%d\n",
+			i, &phys, sg->offset, sg->length, &sg_dma_address(sg),
+			sg_dma_len(sg));
+	}
+}
+
 static blk_status_t nvme_setup_prps(struct nvme_dev *dev, struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -622,19 +637,10 @@  static blk_status_t nvme_setup_prps(struct nvme_dev *dev, struct request *req)
 	return BLK_STS_OK;
 
  bad_sgl:
-	if (WARN_ONCE(1, "Invalid SGL for payload:%d nents:%d\n",
-				blk_rq_payload_bytes(req), iod->nents)) {
-		for_each_sg(iod->sg, sg, iod->nents, i) {
-			dma_addr_t phys = sg_phys(sg);
-			pr_warn("sg[%d] phys_addr:%pad offset:%d length:%d "
-			       "dma_address:%pad dma_length:%d\n", i, &phys,
-					sg->offset, sg->length,
-					&sg_dma_address(sg),
-					sg_dma_len(sg));
-		}
-	}
+	WARN(DO_ONCE(nvme_print_sgl, iod->sg, iod->nents),
+			"Invalid SGL for payload:%d nents:%d\n",
+			blk_rq_payload_bytes(req), iod->nents);
 	return BLK_STS_IOERR;
-
 }
 
 static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
@@ -1313,11 +1319,11 @@  static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 	if (result < 0)
 		goto release_cq;
 
+	nvme_init_queue(nvmeq, qid);
 	result = queue_request_irq(nvmeq);
 	if (result < 0)
 		goto release_sq;
 
-	nvme_init_queue(nvmeq, qid);
 	return result;
 
  release_sq:
@@ -1464,6 +1470,7 @@  static int nvme_pci_configure_admin_queue(struct nvme_dev *dev)
 		return result;
 
 	nvmeq->cq_vector = 0;
+	nvme_init_queue(nvmeq, 0);
 	result = queue_request_irq(nvmeq);
 	if (result) {
 		nvmeq->cq_vector = -1;
@@ -2156,7 +2163,6 @@  static void nvme_reset_work(struct work_struct *work)
 	if (result)
 		goto out;
 
-	nvme_init_queue(dev->queues[0], 0);
 	result = nvme_alloc_admin_tags(dev);
 	if (result)
 		goto out;
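
As a rough userspace analogue of the WARN(DO_ONCE(...)) construct above (not part of the patch): the detailed scatterlist dump, now factored into nvme_print_sgl(), runs only for the first invalid SGL, matching the old WARN_ONCE behaviour, while every occurrence still returns BLK_STS_IOERR to the caller. A minimal, single-threaded sketch of that "dump at most once" shape; the kernel's DO_ONCE additionally handles concurrent callers and a static key:

#include <stdio.h>
#include <stdbool.h>

/* Hypothetical stand-in for nvme_print_sgl(): dump details of the bad payload. */
static void print_details(int payload, int nents)
{
	fprintf(stderr, "invalid SGL: payload=%d nents=%d\n", payload, nents);
}

/* Run the dump at most once; later calls just report "already done". */
static bool dump_once(int payload, int nents)
{
	static bool done;

	if (done)
		return false;
	done = true;
	print_details(payload, nents);
	return true;
}

int main(void)
{
	for (int i = 0; i < 3; i++) {
		if (dump_once(4096, 2))
			fprintf(stderr, "warning emitted (first occurrence only)\n");
		/* every iteration would still fail the request with an I/O error */
	}
	return 0;
}
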
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 58983000964b..92a03ff5fb4d 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -942,7 +942,12 @@  static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
 	}
 
 	changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
-	WARN_ON_ONCE(!changed);
+	if (!changed) {
+		/* state change failure is ok if we're in DELETING state */
+		WARN_ON_ONCE(ctrl->ctrl.state != NVME_CTRL_DELETING);
+		return;
+	}
+
 	ctrl->ctrl.nr_reconnects = 0;
 
 	nvme_start_ctrl(&ctrl->ctrl);
@@ -962,7 +967,7 @@  static void nvme_rdma_error_recovery_work(struct work_struct *work)
 	struct nvme_rdma_ctrl *ctrl = container_of(work,
 			struct nvme_rdma_ctrl, err_work);
 
-	nvme_stop_ctrl(&ctrl->ctrl);
+	nvme_stop_keep_alive(&ctrl->ctrl);
 
 	if (ctrl->ctrl.queue_count > 1) {
 		nvme_stop_queues(&ctrl->ctrl);
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 7c23eaf8e563..1b208beeef50 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -390,10 +390,10 @@  static void __nvmet_req_complete(struct nvmet_req *req, u16 status)
 	if (status)
 		nvmet_set_status(req, status);
 
-	/* XXX: need to fill in something useful for sq_head */
-	req->rsp->sq_head = 0;
-	if (likely(req->sq)) /* may happen during early failure */
-		req->rsp->sq_id = cpu_to_le16(req->sq->qid);
+	if (req->sq->size)
+		req->sq->sqhd = (req->sq->sqhd + 1) % req->sq->size;
+	req->rsp->sq_head = cpu_to_le16(req->sq->sqhd);
+	req->rsp->sq_id = cpu_to_le16(req->sq->qid);
 	req->rsp->command_id = req->cmd->common.command_id;
 
 	if (req->ns)
@@ -420,6 +420,7 @@  void nvmet_cq_setup(struct nvmet_ctrl *ctrl, struct nvmet_cq *cq,
 void nvmet_sq_setup(struct nvmet_ctrl *ctrl, struct nvmet_sq *sq,
 		u16 qid, u16 size)
 {
+	sq->sqhd = 0;
 	sq->qid = qid;
 	sq->size = size;
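
A worked sketch of the new sq_head accounting (illustrative, not part of the patch): the target now advances a per-queue sqhd by one for every completion and wraps it at the queue size, instead of always reporting 0, and the size test skips the update (and avoids a divide-by-zero) for a queue set up with size 0. Assuming a 4-entry queue:

#include <stdio.h>
#include <stdint.h>

struct sq_state {
	uint16_t size;	/* number of SQ entries; 0 means "not tracked" */
	uint16_t sqhd;	/* value reported in the completion's sq_head */
};

static uint16_t complete_one(struct sq_state *sq)
{
	if (sq->size)
		sq->sqhd = (sq->sqhd + 1) % sq->size;
	return sq->sqhd;
}

int main(void)
{
	struct sq_state sq = { .size = 4, .sqhd = 0 };

	/* Six completions on a 4-entry queue report: 1 2 3 0 1 2 */
	for (int i = 0; i < 6; i++)
		printf("sq_head after completion %d: %u\n",
		       i + 1, complete_one(&sq));
	return 0;
}
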
 
diff --git a/drivers/nvme/target/fabrics-cmd.c b/drivers/nvme/target/fabrics-cmd.c
index 859a66725291..db3bf6b8bf9e 100644
--- a/drivers/nvme/target/fabrics-cmd.c
+++ b/drivers/nvme/target/fabrics-cmd.c
@@ -109,9 +109,14 @@  static u16 nvmet_install_queue(struct nvmet_ctrl *ctrl, struct nvmet_req *req)
 		pr_warn("queue already connected!\n");
 		return NVME_SC_CONNECT_CTRL_BUSY | NVME_SC_DNR;
 	}
+	if (!sqsize) {
+		pr_warn("queue size zero!\n");
+		return NVME_SC_CONNECT_INVALID_PARAM | NVME_SC_DNR;
+	}
 
-	nvmet_cq_setup(ctrl, req->cq, qid, sqsize);
-	nvmet_sq_setup(ctrl, req->sq, qid, sqsize);
+	/* note: convert queue size from 0's-based value to 1's-based value */
+	nvmet_cq_setup(ctrl, req->cq, qid, sqsize + 1);
+	nvmet_sq_setup(ctrl, req->sq, qid, sqsize + 1);
 	return 0;
 }
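
To make the 0's-based comment concrete (illustrative sketch, not part of the patch): the SQSIZE value carried in a Connect command encodes the queue depth minus one, so the target now passes sqsize + 1 entries to nvmet_cq_setup()/nvmet_sq_setup(), and a field value of 0 is rejected up front as an invalid parameter. The return constant below is a stand-in:

#include <stdio.h>
#include <stdint.h>

#define CONNECT_INVALID_PARAM	(-1)	/* stand-in for NVME_SC_CONNECT_INVALID_PARAM */

/* Map the 0's-based SQSIZE field from a Connect command to a queue depth. */
static int sqsize_to_depth(uint16_t sqsize)
{
	if (!sqsize)
		return CONNECT_INVALID_PARAM;	/* "queue size zero!" */
	return sqsize + 1;			/* 0's-based -> 1's-based */
}

int main(void)
{
	printf("sqsize field 31  -> depth %d\n", sqsize_to_depth(31));	/* 32 */
	printf("sqsize field 127 -> depth %d\n", sqsize_to_depth(127));	/* 128 */
	printf("sqsize field 0   -> %d (rejected)\n", sqsize_to_depth(0));
	return 0;
}
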
 
diff --git a/drivers/nvme/target/fc.c b/drivers/nvme/target/fc.c
index 421e43bf1dd7..58e010bdda3e 100644
--- a/drivers/nvme/target/fc.c
+++ b/drivers/nvme/target/fc.c
@@ -148,7 +148,7 @@  struct nvmet_fc_tgt_assoc {
 	u32				a_id;
 	struct nvmet_fc_tgtport		*tgtport;
 	struct list_head		a_list;
-	struct nvmet_fc_tgt_queue	*queues[NVMET_NR_QUEUES];
+	struct nvmet_fc_tgt_queue	*queues[NVMET_NR_QUEUES + 1];
 	struct kref			ref;
 };
 
@@ -608,7 +608,7 @@  nvmet_fc_alloc_target_queue(struct nvmet_fc_tgt_assoc *assoc,
 	unsigned long flags;
 	int ret;
 
-	if (qid >= NVMET_NR_QUEUES)
+	if (qid > NVMET_NR_QUEUES)
 		return NULL;
 
 	queue = kzalloc((sizeof(*queue) +
@@ -783,6 +783,9 @@  nvmet_fc_find_target_queue(struct nvmet_fc_tgtport *tgtport,
 	u16 qid = nvmet_fc_getqueueid(connection_id);
 	unsigned long flags;
 
+	if (qid > NVMET_NR_QUEUES)
+		return NULL;
+
 	spin_lock_irqsave(&tgtport->lock, flags);
 	list_for_each_entry(assoc, &tgtport->assoc_list, a_list) {
 		if (association_id == assoc->association_id) {
@@ -888,7 +891,7 @@  nvmet_fc_delete_target_assoc(struct nvmet_fc_tgt_assoc *assoc)
 	int i;
 
 	spin_lock_irqsave(&tgtport->lock, flags);
-	for (i = NVMET_NR_QUEUES - 1; i >= 0; i--) {
+	for (i = NVMET_NR_QUEUES; i >= 0; i--) {
 		queue = assoc->queues[i];
 		if (queue) {
 			if (!nvmet_fc_tgt_q_get(queue))
@@ -1910,8 +1913,7 @@  nvmet_fc_transfer_fcp_data(struct nvmet_fc_tgtport *tgtport,
 			spin_lock_irqsave(&fod->flock, flags);
 			fod->writedataactive = false;
 			spin_unlock_irqrestore(&fod->flock, flags);
-			nvmet_req_complete(&fod->req,
-					NVME_SC_FC_TRANSPORT_ERROR);
+			nvmet_req_complete(&fod->req, NVME_SC_INTERNAL);
 		} else /* NVMET_FCOP_READDATA or NVMET_FCOP_READDATA_RSP */ {
 			fcpreq->fcp_error = ret;
 			fcpreq->transferred_length = 0;
@@ -1929,8 +1931,7 @@  __nvmet_fc_fod_op_abort(struct nvmet_fc_fcp_iod *fod, bool abort)
 	/* if in the middle of an io and we need to tear down */
 	if (abort) {
 		if (fcpreq->op == NVMET_FCOP_WRITEDATA) {
-			nvmet_req_complete(&fod->req,
-					NVME_SC_FC_TRANSPORT_ERROR);
+			nvmet_req_complete(&fod->req, NVME_SC_INTERNAL);
 			return true;
 		}
 
@@ -1968,8 +1969,7 @@  nvmet_fc_fod_op_done(struct nvmet_fc_fcp_iod *fod)
 			fod->abort = true;
 			spin_unlock(&fod->flock);
 
-			nvmet_req_complete(&fod->req,
-					NVME_SC_FC_TRANSPORT_ERROR);
+			nvmet_req_complete(&fod->req, NVME_SC_INTERNAL);
 			return;
 		}
 
@@ -2533,13 +2533,17 @@  nvmet_fc_remove_port(struct nvmet_port *port)
 {
 	struct nvmet_fc_tgtport *tgtport = port->priv;
 	unsigned long flags;
+	bool matched = false;
 
 	spin_lock_irqsave(&nvmet_fc_tgtlock, flags);
 	if (tgtport->port == port) {
-		nvmet_fc_tgtport_put(tgtport);
+		matched = true;
 		tgtport->port = NULL;
 	}
 	spin_unlock_irqrestore(&nvmet_fc_tgtlock, flags);
+
+	if (matched)
+		nvmet_fc_tgtport_put(tgtport);
 }
 
 static struct nvmet_fabrics_ops nvmet_fc_tgt_fcp_ops = {
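
The array-size and bound-check changes above follow from the same counting (sketch only, not part of the patch): qid 0 is the admin queue and qids 1..NVMET_NR_QUEUES are I/O queues, so the per-association array needs NVMET_NR_QUEUES + 1 slots and a qid is valid when it is at most NVMET_NR_QUEUES. The value used here is illustrative; the real constant lives in nvmet.h:

#include <stdbool.h>
#include <stdio.h>

#define NVMET_NR_QUEUES	128	/* illustrative value only */

/* queues[] holds the admin queue (qid 0) plus NVMET_NR_QUEUES I/O queues. */
static void *queues[NVMET_NR_QUEUES + 1];

static bool qid_valid(unsigned int qid)
{
	return qid <= NVMET_NR_QUEUES;	/* old check rejected qid == NVMET_NR_QUEUES */
}

int main(void)
{
	printf("queues[] has %zu slots (admin + I/O)\n",
	       sizeof(queues) / sizeof(queues[0]));
	printf("qid 0 valid: %d\n", qid_valid(0));
	printf("qid %d valid: %d\n", NVMET_NR_QUEUES, qid_valid(NVMET_NR_QUEUES));
	printf("qid %d valid: %d\n", NVMET_NR_QUEUES + 1, qid_valid(NVMET_NR_QUEUES + 1));
	return 0;
}
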
diff --git a/drivers/nvme/target/fcloop.c b/drivers/nvme/target/fcloop.c
index 1cb9847ec261..7b75d9de55ab 100644
--- a/drivers/nvme/target/fcloop.c
+++ b/drivers/nvme/target/fcloop.c
@@ -224,8 +224,6 @@  struct fcloop_nport {
 	struct fcloop_lport *lport;
 	struct list_head nport_list;
 	struct kref ref;
-	struct completion rport_unreg_done;
-	struct completion tport_unreg_done;
 	u64 node_name;
 	u64 port_name;
 	u32 port_role;
@@ -576,7 +574,7 @@  fcloop_tgt_fcp_abort(struct nvmet_fc_target_port *tgtport,
 	tfcp_req->aborted = true;
 	spin_unlock(&tfcp_req->reqlock);
 
-	tfcp_req->status = NVME_SC_FC_TRANSPORT_ABORTED;
+	tfcp_req->status = NVME_SC_INTERNAL;
 
 	/*
 	 * nothing more to do. If io wasn't active, the transport should
@@ -630,6 +628,32 @@  fcloop_fcp_abort(struct nvme_fc_local_port *localport,
 	schedule_work(&inireq->iniwork);
 }
 
+static void
+fcloop_nport_free(struct kref *ref)
+{
+	struct fcloop_nport *nport =
+		container_of(ref, struct fcloop_nport, ref);
+	unsigned long flags;
+
+	spin_lock_irqsave(&fcloop_lock, flags);
+	list_del(&nport->nport_list);
+	spin_unlock_irqrestore(&fcloop_lock, flags);
+
+	kfree(nport);
+}
+
+static void
+fcloop_nport_put(struct fcloop_nport *nport)
+{
+	kref_put(&nport->ref, fcloop_nport_free);
+}
+
+static int
+fcloop_nport_get(struct fcloop_nport *nport)
+{
+	return kref_get_unless_zero(&nport->ref);
+}
+
 static void
 fcloop_localport_delete(struct nvme_fc_local_port *localport)
 {
@@ -644,8 +668,7 @@  fcloop_remoteport_delete(struct nvme_fc_remote_port *remoteport)
 {
 	struct fcloop_rport *rport = remoteport->private;
 
-	/* release any threads waiting for the unreg to complete */
-	complete(&rport->nport->rport_unreg_done);
+	fcloop_nport_put(rport->nport);
 }
 
 static void
@@ -653,8 +676,7 @@  fcloop_targetport_delete(struct nvmet_fc_target_port *targetport)
 {
 	struct fcloop_tport *tport = targetport->private;
 
-	/* release any threads waiting for the unreg to complete */
-	complete(&tport->nport->tport_unreg_done);
+	fcloop_nport_put(tport->nport);
 }
 
 #define	FCLOOP_HW_QUEUES		4
@@ -722,6 +744,7 @@  fcloop_create_local_port(struct device *dev, struct device_attribute *attr,
 		goto out_free_opts;
 	}
 
+	memset(&pinfo, 0, sizeof(pinfo));
 	pinfo.node_name = opts->wwnn;
 	pinfo.port_name = opts->wwpn;
 	pinfo.port_role = opts->roles;
@@ -804,32 +827,6 @@  fcloop_delete_local_port(struct device *dev, struct device_attribute *attr,
 	return ret ? ret : count;
 }
 
-static void
-fcloop_nport_free(struct kref *ref)
-{
-	struct fcloop_nport *nport =
-		container_of(ref, struct fcloop_nport, ref);
-	unsigned long flags;
-
-	spin_lock_irqsave(&fcloop_lock, flags);
-	list_del(&nport->nport_list);
-	spin_unlock_irqrestore(&fcloop_lock, flags);
-
-	kfree(nport);
-}
-
-static void
-fcloop_nport_put(struct fcloop_nport *nport)
-{
-	kref_put(&nport->ref, fcloop_nport_free);
-}
-
-static int
-fcloop_nport_get(struct fcloop_nport *nport)
-{
-	return kref_get_unless_zero(&nport->ref);
-}
-
 static struct fcloop_nport *
 fcloop_alloc_nport(const char *buf, size_t count, bool remoteport)
 {
@@ -938,6 +935,7 @@  fcloop_create_remote_port(struct device *dev, struct device_attribute *attr,
 	if (!nport)
 		return -EIO;
 
+	memset(&pinfo, 0, sizeof(pinfo));
 	pinfo.node_name = nport->node_name;
 	pinfo.port_name = nport->port_name;
 	pinfo.port_role = nport->port_role;
@@ -979,24 +977,12 @@  __unlink_remote_port(struct fcloop_nport *nport)
 }
 
 static int
-__wait_remoteport_unreg(struct fcloop_nport *nport, struct fcloop_rport *rport)
+__remoteport_unreg(struct fcloop_nport *nport, struct fcloop_rport *rport)
 {
-	int ret;
-
 	if (!rport)
 		return -EALREADY;
 
-	init_completion(&nport->rport_unreg_done);
-
-	ret = nvme_fc_unregister_remoteport(rport->remoteport);
-	if (ret)
-		return ret;
-
-	wait_for_completion(&nport->rport_unreg_done);
-
-	fcloop_nport_put(nport);
-
-	return ret;
+	return nvme_fc_unregister_remoteport(rport->remoteport);
 }
 
 static ssize_t
@@ -1029,7 +1015,7 @@  fcloop_delete_remote_port(struct device *dev, struct device_attribute *attr,
 	if (!nport)
 		return -ENOENT;
 
-	ret = __wait_remoteport_unreg(nport, rport);
+	ret = __remoteport_unreg(nport, rport);
 
 	return ret ? ret : count;
 }
@@ -1086,24 +1072,12 @@  __unlink_target_port(struct fcloop_nport *nport)
 }
 
 static int
-__wait_targetport_unreg(struct fcloop_nport *nport, struct fcloop_tport *tport)
+__targetport_unreg(struct fcloop_nport *nport, struct fcloop_tport *tport)
 {
-	int ret;
-
 	if (!tport)
 		return -EALREADY;
 
-	init_completion(&nport->tport_unreg_done);
-
-	ret = nvmet_fc_unregister_targetport(tport->targetport);
-	if (ret)
-		return ret;
-
-	wait_for_completion(&nport->tport_unreg_done);
-
-	fcloop_nport_put(nport);
-
-	return ret;
+	return nvmet_fc_unregister_targetport(tport->targetport);
 }
 
 static ssize_t
@@ -1136,7 +1110,7 @@  fcloop_delete_target_port(struct device *dev, struct device_attribute *attr,
 	if (!nport)
 		return -ENOENT;
 
-	ret = __wait_targetport_unreg(nport, tport);
+	ret = __targetport_unreg(nport, tport);
 
 	return ret ? ret : count;
 }
@@ -1223,11 +1197,11 @@  static void __exit fcloop_exit(void)
 
 		spin_unlock_irqrestore(&fcloop_lock, flags);
 
-		ret = __wait_targetport_unreg(nport, tport);
+		ret = __targetport_unreg(nport, tport);
 		if (ret)
 			pr_warn("%s: Failed deleting target port\n", __func__);
 
-		ret = __wait_remoteport_unreg(nport, rport);
+		ret = __remoteport_unreg(nport, rport);
 		if (ret)
 			pr_warn("%s: Failed deleting remote port\n", __func__);
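
The fcloop rework replaces the two unreg completions with plain reference counting: the remoteport/targetport delete callbacks now drop the nport reference held by the corresponding registration, the final put frees the nport, and the unregister paths no longer block waiting for a completion. A loose userspace analogue of that get-unless-zero / put pattern, using C11 atomics as a stand-in for the kernel's kref (sketch only, not part of the patch):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

struct nport {
	atomic_int ref;
	/* ... node/port names, list linkage ... */
};

static bool nport_get(struct nport *n)
{
	int old = atomic_load(&n->ref);

	/* like kref_get_unless_zero(): only take a reference if one still exists */
	while (old) {
		if (atomic_compare_exchange_weak(&n->ref, &old, old + 1))
			return true;
	}
	return false;
}

static void nport_put(struct nport *n)
{
	/* like kref_put(): dropping the last reference frees the object */
	if (atomic_fetch_sub(&n->ref, 1) == 1) {
		printf("last reference dropped, freeing nport\n");
		free(n);
	}
}

int main(void)
{
	struct nport *n = calloc(1, sizeof(*n));

	if (!n)
		return 1;
	atomic_init(&n->ref, 1);	/* creator's reference */
	if (nport_get(n))		/* e.g. a port registration takes a reference */
		printf("registration took a reference\n");
	nport_put(n);			/* delete callback drops it */
	nport_put(n);			/* final put frees the nport */
	return 0;
}
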
 
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 7d261ab894f4..7b8e20adf760 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -74,6 +74,7 @@  struct nvmet_sq {
 	struct percpu_ref	ref;
 	u16			qid;
 	u16			size;
+	u16			sqhd;
 	struct completion	free_done;
 	struct completion	confirm_done;
 };