Message ID: 20170924130304.GA21904@infradead.org (mailing list archive)
State: New, archived
On Sun, Sep 24, 2017 at 6:03 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Fri, Sep 22, 2017 at 05:18:49PM -1000, Linus Torvalds wrote:
>> WTF? Why is this so hard? It used to be that IDE drove people crazy.
>> Now it's NVMe and generic block layer stuff.
>
> Can you please explain what problems you have with the nvme patches?

This time it wasn't the NVMe parts (that was the last few releases). This time it was the random writeback patches that had been committed only days before, and that completely change some fundamental code. Some of the commits say "this changes no semantics", but that's after the previous commit just changed the argument values exactly so that the next commit wouldn't change semantics.

Don't get me wrong - all the commits look perfectly _sane_. That's not my issue. But these commits literally showed up on lkml just a couple of days before getting sent to me, and even when they were sent the cover letter for the series of six patches was literally

More graceful flusher thread memory reclaim wakeup

which *really* does not say "this is an important bug fix or regression" to me.

What made it ok to send that outside the merge window? It looks like a nice cleanup, no question about it, and it probably fixes some behavioral problem at FB. But that behavioral problem is not *new*, as far as I can tell. So why was it sent as a bug-fix pull request?

Now, if this was some other subsystem that _hadn't_ had problems in EVERY SINGLE recent merge window, I'd likely have been more than happy to take it early in the rc series because that subsystem wasn't on my list of "increased scrutiny due to historical issues".

But as it is, the block layer _is_ on my list of increased scrutiny (and part of that reason is NVMe patches - again, "fixes" being sent that were really just normal on-going development)

Linus
On 09/24/2017 11:34 AM, Linus Torvalds wrote:
> On Sun, Sep 24, 2017 at 6:03 AM, Christoph Hellwig <hch@infradead.org> wrote:
>> On Fri, Sep 22, 2017 at 05:18:49PM -1000, Linus Torvalds wrote:
>>> WTF? Why is this so hard? It used to be that IDE drove people crazy.
>>> Now it's NVMe and generic block layer stuff.
>>
>> Can you please explain what problems you have with the nvme patches?
>
> This time it wasn't the NVMe parts (that was the last few releases).
> This time it was the random writeback patches that had been committed
> only days before, and that completely change some fundamental code.
> Some of the commits say "this changes no semantics", but that's after
> the previous commit just changed the argument values exactly so that
> the next commit wouldn't change semantics.
>
> Don't get me wrong - all the commits look perfectly _sane_. That's
> not my issue. But these commits literally showed up on lkml just a
> couple of days before getting sent to me, and even when they were sent
> the cover letter for the series of six patches was literally
>
> More graceful flusher thread memory reclaim wakeup
>
> which *really* does not say "this is an important bug fix or regression" to me.
>
> What made it ok to send that outside the merge window? It looks like a
> nice cleanup, no question about it, and it probably fixes some
> behavioral problem at FB. But that behavioral problem is not *new*, as
> far as I can tell.
>
> So why was it sent as a bug-fix pull request?
>
> Now, if this was some other subsystem that _hadn't_ had problems in
> EVERY SINGLE recent merge window, I'd likely have been more than happy
> to take it early in the rc series because that subsystem wasn't on my
> list of "increased scrutiny due to historical issues".
>
> But as it is, the block layer _is_ on my list of increased scrutiny
> (and part of that reason is NVMe patches - again, "fixes" being sent
> that were really just normal on-going development)

Sorry for the late response, been moving this weekend and tending to things more important than arguing on the Internet.

NVMe fixes have sometimes been accepted for the current series, while they should have been pushed. I'm not going to argue with that, but I think we've managed to make that better the last few series. It's still not squeaky clean, but it is improving substantially. I don't believe that's even been the case for core changes, which have always been scrutinized much more heavily.

I knew you might object to the writeback changes. As mentioned in the pull request, if you diff the whole thing, it's really not big at all. It's mostly shuffling some arguments and things around, to make the final patch trivial. And it does mean that it's super easy to review.

As noted in the original review series, we are hitting this in production, and have been for a while. So while it's not a regression for this series, it's something that causes us quite a bit of pain. When attempting to free memory ends up gobbling up tons of extra memory AND causing the flusher threads to waste tons of CPU, then that's a big problem. Big enough that we get softlockups from just attempting to process these identical start-all-writeback requests.

That's why I sent it in for this series. Yes, it's not months old code, but it has been tested. And it's not like it's a lot of NEW code, it's mostly removing code, or moving things around.

I can resend without this stuff, but I think it'd be a shame to release 4.14 without it. Let me know, and I'll respin and send a new pull request.
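For readers following along: the "start-all-writeback requests" above are the wb_writeback_work items that wakeup_flusher_threads() queues for each flusher thread. The pre-4.15 allocation path looked roughly like the sketch below; this is paraphrased from memory and simplified, not a verbatim copy of fs/fs-writeback.c.

	/* Paraphrased sketch: every wakeup_flusher_threads() call allocates a
	 * fresh work item per bdi_writeback, and nothing coalesces identical
	 * "write back everything" requests. */
	static void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
				       bool range_cyclic, enum wb_reason reason)
	{
		struct wb_writeback_work *work;

		if (!wb_has_dirty_io(wb))
			return;

		/* one kmalloc-64-sized allocation per call: this is what piles
		 * up when memory reclaim keeps asking for a full flush */
		work = kzalloc(sizeof(*work), GFP_NOWAIT | __GFP_NOWARN);
		if (!work)
			return;

		work->sync_mode    = WB_SYNC_NONE;
		work->nr_pages     = nr_pages;
		work->range_cyclic = range_cyclic;
		work->reason       = reason;
		work->auto_free    = 1;

		wb_queue_work(wb, work);
	}

Under sustained memory pressure, reclaim can call into this far faster than the flusher thread drains its work list, which is the work-item pile-up being described.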
On Sun, Sep 24, 2017 at 5:03 PM, Jens Axboe <axboe@kernel.dk> wrote:
>
> NVMe fixes have sometimes been accepted for the current series, while
> they should have been pushed. I'm not going to argue with that, but I
> think we've managed to make that better the last few series. It's still
> not squeaky clean, but it is improving substantially. I don't believe
> that's even been the case for core changes, which have always been
> scrutinized much more heavily.

The NVMe fixes actually looked like real fixes. It's not what I reacted to, I would have happily pulled those.

The comment about "NVMe and generic block layer" was about these areas in general having been problem spots, and not the particular NVMe patches in this pull request.

> I knew you might object to the writeback changes. As mentioned in the
> pull request, if you diff the whole thing, it's really not big at all.

Oh, it wasn't horribly big, and I agree that it was easy to read through, but it really was entirely new code and really should have been a merge window issue, not in a "fixes" pull.

And I did check the history of those patches, and it was all very very recent and had recent comments on it too - it wasn't some old patches that had been around for a while.

And while the patches were small, their effect was decidedly *not* obvious.

For example, while the patches were easy to read, one of them changed the "try to free some memory" behavior fairly radically, going from write out "at most 1024 pages" to "everything".

That may be a simple one-liner patch to read, but it sure as hell wasn't obvious what the behavioral change really is under different loads.

And when the patch is two days old, I can pretty much guarantee that it hasn't been tested under a huge variety of loads. Maybe it does exactly the right thing for *some* load, but...

It honestly looked like that patch existed purely in order to make the next few patches be trivial ("no functional changes in this patch") simply because the functional changes were done earlier.

That's all actually very nice for bisecting etc, and honestly, I liked how the patch series looked and was split out, but I liked how it looked as a *development* series. It didn't make me go "this is a bug fix".

> It's mostly shuffling some arguments and things around, to make the
> final patch trivial. And it does mean that it's super easy to review.

See above. I think it was "super easy to review" only on a very superficial level, not on a deeper one.

For example, I doubt you actually went back and looked at where that "1024" came from that you just changed to zero?

It comes from the bitkeeper days, commit 407ee6c87e ("[PATCH] low-latency page reclaim") in the historical tree:

    [PATCH] low-latency page reclaim

    Convert the VM to not wait on other people's dirty data.

     - If we find a dirty page and its queue is not congested, do some
       writeback.

     - If we find a dirty page and its queue _is_ congested then just
       refile the page.

     - If we find a PageWriteback page then just refile the page.

     - There is additional throttling for write(2) callers. Within
       generic_file_write(), record their backing queue in ->current.
       Within page reclaim, if this tasks encounters a page which is dirty
       or under writeback onthis queue, block on it. This gives some more
       writer throttling and reduces the page refiling frequency.

    It's somewhat CPU expensive - under really heavy load we only get a 50%
    reclaim rate in pages coming off the tail of the LRU. This can be
    fixed by splitting the inactive list into reclaimable and
    non-reclaimable lists. But the CPU load isn't too bad, and latency is
    much, much more important in these situations.

    Example: with `mem=512m', running 4 instances of `dbench 100', 2.5.34
    took 35 minutes to compile a kernel. With this patch, it took three
    minutes, 45 seconds.

    I haven't done swapcache or MAP_SHARED pages yet. If there's tons of
    dirty swapcache or mmap data around we still stall heavily in page
    reclaim. That's less important.

    This patch also has a tweak for swapless machines: don't even bother
    bringing anon pages onto the inactive list if there is no swap online.

which introduced that "long nr_pages" argument to wakeup_bdflush().

Here's that patch for people who don't keep that historical tree around like I do:

https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=407ee6c87e477434e3cb8be96885ed27b5539b6f

Now, it's entirely possible that the issue that made us do that in the first place is long long gone, and we don't need that page limiter in the VM any more.

But is it "super easy to review"?

No, it sure as hell is not.

> That's why I sent it in for this series. Yes, it's not months old code,
> but it has been tested. And it's not like it's a lot of NEW code, it's
> mostly removing code, or moving things around.
>
> I can resend without this stuff, but I think it'd be a shame to release
> 4.14 without it. Let me know, and I'll respin and send a new pull
> request.

So in order to make sane progress, I would suggest:

 - send the actual obvious *fixes* in separately as a branch of
   unquestionable fixes. If it fixes an Oops, or is a security fix, or
   makes something not work, it's a hard fix.

 - the writeback thing may make sense, but it really is a performance
   tweak for a particular load at FB, and it should be seen as such. It's
   not obviously a fix, and it has some really subtle issues. See above.

So I could still take the writeback series too. But I absolutely *refuse* to take that as part of a "bugfix" pull, as if it was somehow obviously a fix. Because it damn well is not at all obviously a fix.

So send that series *separately* as an independent branch, so that it's obvious that "hey, this is something else that would help some FB loads, we'd be super happy to have it in 4.14 due to it being an LTS release".

And then I might just decide "ok, the code looks ok, I'm willing to risk it, and if it causes problems for somebody, it came in as a separate merge that could trivially be reverted".

See what I'm saying?

Linus
On 09/24/2017 09:16 PM, Linus Torvalds wrote: >> I knew you might object to the writeback changes. As mentioned in the >> pull request, if you diff the whole thing, it's really not big at all. > > Oh, it wasn't horribly big, and I agree that it was easy to read > through, but it really was entirely new code and really should have > been a merge window issue, not in a "fixes" pull. > > And I did check the history of those patches, and it was all very very > recent and had recent comments on it too - it wasn't some old patches > that had been around for a while. > > And while the patches were small, their effect were decidedly *not* > obvious. > > For example, while the patches were easy to read, one of them changed > the "try to free some memory" behavior fairly radically, going from > write out "at most 1024 pages" to "everything". > > That may be a simple one-liner patch to read, but it sure as hell > wasn't obvious what the behavioral change really is under different > loads. But still better than hiding it deep in some other patch... It's a one-liner patch. It may not be trivial to review, but it puts the detail that you care about right in your face. > And when the patch is two days old, I can pretty much guarantee that > it hasn't been tested under a huge variety of loads. Maybe it does > exactly the right thing for *some* load, but... > > It honestly looked like that patch existed purely in order to make the > next few patches be trivial ("no functional changes in this patch") > simply because the functional changes were done earlier. See below for discussion on that bit. >> It's mostly shuffling some arguments and things around, to make the >> final patch trivial. And it does mean that it's super easy to review. > > See above. I think it was "super easy to review" only on a very > superficial level, not on a deeper one. > > For example, I doubt you actually went back and looked at where that > "1024" came from that you just changed to zero? > > It comes from the bitkeeper days, commit 407ee6c87e ("[PATCH] > low-latency page reclaim") in the historical tree: > > [PATCH] low-latency page reclaim > > Convert the VM to not wait on other people's dirty data. > > - If we find a dirty page and its queue is not congested, do some > writeback. > > - If we find a dirty page and its queue _is_ congested then just > refile the page. > > - If we find a PageWriteback page then just refile the page. > > - There is additional throttling for write(2) callers. Within > generic_file_write(), record their backing queue in ->current. > Within page reclaim, if this tasks encounters a page which is dirty > or under writeback onthis queue, block on it. This gives some more > writer throttling and reduces the page refiling frequency. > > It's somewhat CPU expensive - under really heavy load we only get a 50% > reclaim rate in pages coming off the tail of the LRU. This can be > fixed by splitting the inactive list into reclaimable and > non-reclaimable lists. But the CPU load isn't too bad, and latency is > much, much more important in these situations. > > Example: with `mem=512m', running 4 instances of `dbench 100', 2.5.34 > took 35 minutes to compile a kernel. With this patch, it took three > minutes, 45 seconds. > > I haven't done swapcache or MAP_SHARED pages yet. If there's tons of > dirty swapcache or mmap data around we still stall heavily in page > reclaim. That's less important. 
>
> This patch also has a tweak for swapless machines: don't even bother
> bringing anon pages onto the inactive list if there is no swap online.
>
> which introduced that "long nr_pages" argument to wakeup_bdflush().
>
> Here's that patch for people who don't keep that historical tree
> around like I do:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=407ee6c87e477434e3cb8be96885ed27b5539b6f
>
> Now, it's entirely possible that the issue that made us do that in the
> first place is long long gone, and we don't need that page limiter in
> the VM any more.
>
> But is it "super easy to review"?
>
> No, it sure as hell is not.

Honestly, what I really wanted to do was kill that call completely. If you read the free_more_memory() function, it looks like crap. You have some random call to starting writeback on a random number of pages. Then you call into node reclaim.

What I really wanted to do was kill the wakeup_flusher_threads() completely, we want to rely on try_to_free_pages() starting targeted writeback instead of some magic number of pages across devices. But I decided that we should probably just keep that random flusher wakeup, and tackle that part in a v2 for 4.15. Further code cleanups and improvements are very possible, getting rid of the work unit allocation completely and just folding this odd wakeup in with the general writeback starting infrastructure.

On top of that, writeback has changed a LOT since those days: we track device write bandwidths and try to do the right thing, we have per-sb flusher threads that do the actual flushing and no longer rely on non-blocking congestion to throttle. We also used to call free_more_memory() from getblk(), if that failed. It's got a sched_yield() in it, for crying out loud!

>> That's why I sent it in for this series. Yes, it's not months old code,
>> but it has been tested. And it's not like it's a lot of NEW code, it's
>> mostly removing code, or moving things around.
>>
>> I can resend without this stuff, but I think it'd be a shame to release
>> 4.14 without it. Let me know, and I'll respin and send a new pull
>> request.
>
> So in order to make sane progress, I would suggest:
>
> - send the actual obvious *fixes* in separately as a branch of
> unquestionable fixes. If it fixes an Oops, or is a security fix, or
> makes something not work, it's a hard fix.
>
> - the writeback thing may make sense, but it really is a performance
> tweak for a particular load at FB, and it should be seen as such. It's
> not obviously a fix, and it has some really subtle issues. See above.

The workload is nothing special. Getting to the point where it's bad enough to trigger a softlockup probably requires some serious beating up of the box, but getting to a suboptimal state wrt work units pending and the flusher threads is pretty trivial.

> So I could still take the writeback series too. But I absolutely
> *refuse* to take that as part of a "bugfix" pull, as if it was somehow
> obviously a fix. Because it damn well is not at all obviously a fix.
>
> So send that series *separately* as an independent branch, so that it's
> obvious that "hey, this is something else that would help some FB
> loads, we'd be super happy to have it in 4.14 due to it being an LTS
> release".
>
> And then I might just decide "ok, the code looks ok, I'm willing to
> risk it, and if it causes problems for somebody, it came in as a
> separate merge that could trivially be reverted".
>
> See what I'm saying?
Yes, that's a good point, it should have been a separate pull request for the writeback changes. I'll split them up and resend as two separate pull requests.
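Since free_more_memory() keeps coming up below, here is roughly what it looked like in fs/buffer.c at the time. Paraphrased from memory and trimmed, so treat it as a sketch of the function's shape rather than the exact source:

	static void free_more_memory(void)
	{
		struct zoneref *z;
		int nid;

		/* the "random call to starting writeback on a random number
		 * of pages" - the magic 1024 discussed in this thread */
		wakeup_flusher_threads(1024, WB_REASON_FREE_MORE_MEM);
		yield();

		/* ... followed by node reclaim */
		for_each_online_node(nid) {
			z = first_zones_zonelist(node_zonelist(nid, GFP_NOFS),
						 gfp_zone(GFP_NOFS), NULL);
			if (z->zone)
				try_to_free_pages(node_zonelist(nid, GFP_NOFS),
						  0, GFP_NOFS, NULL);
		}
	}

The "sched_yield() in it, for crying out loud" remark refers to that yield() call.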
On 09/24/2017 11:16 PM, Linus Torvalds wrote:
> On Sun, Sep 24, 2017 at 5:03 PM, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> NVMe fixes have sometimes been accepted for the current series, while
>> they should have been pushed. I'm not going to argue with that, but I
>> think we've managed to make that better the last few series. It's still
>> not squeaky clean, but it is improving substantially. I don't believe
>> that's even been the case for core changes, which have always been
>> scrutinized much more heavily.
>
> The NVMe fixes actually looked like real fixes. It's not what I
> reacted to, I would have happily pulled those.
>
> The comment about "NVMe and generic block layer" was about these areas
> in general having been problem spots, and not the particular NVMe
> patches in this pull request.
>
>> I knew you might object to the writeback changes. As mentioned in the
>> pull request, if you diff the whole thing, it's really not big at all.
>
> Oh, it wasn't horribly big, and I agree that it was easy to read
> through, but it really was entirely new code and really should have
> been a merge window issue, not in a "fixes" pull.
>
> And I did check the history of those patches, and it was all very very
> recent and had recent comments on it too - it wasn't some old patches
> that had been around for a while.
>
> And while the patches were small, their effect were decidedly *not* obvious.
>
> For example, while the patches were easy to read, one of them changed
> the "try to free some memory" behavior fairly radically, going from
> write out "at most 1024 pages" to "everything".
>
> That may be a simple one-liner patch to read, but it sure as hell
> wasn't obvious what the behavioral change really is under different
> loads.
>
> And when the patch is two days old, I can pretty much guarantee that
> it hasn't been tested under a huge variety of loads. Maybe it does
> exactly the right thing for *some* load, but...

I've got a little more context for where these fixes came from. We've been seeing softlockups in move_expired_inodes() for years, going back to at least v3.10 kernels. It was relatively rare (20-80 times per week across the fleet) and we always recovered before anyone could hop on and figure out what was going on. I assumed it was something crazy on boxes with lots of files and lots of filesystems where dirty inodes were popping back into the list, somehow making move_expired_inodes() take forever.

Two weeks ago we had two boxes just explode: 600,000,000+ kmalloc-64 slab entries, with new ones being added basically at CPU speed. They were constantly waking up the flushers and adding new kmalloc-64 from wb_start_writeback(). Should have seen it forever ago, but the softlockups were just wb_do_writeback() churning through its list.

It takes a special case to hit the softlockup, since you need vmscan.c to send a bunch of work items but then you need the filesystems to not really have any dirty pages written while we process the work items. Some dumbass (me) typing sync to find out if the memory gets reclaimed does seem to help trigger it.

On mainline, I was able to trigger 500,000+ work items without too much trouble, but I couldn't make the softlockup happen on purpose. For more realistic tests, a workload with lots of memory pressure will frequently pile up bursts of 1000+ work items. Each one comes from wakeup_flusher_threads() so each one is trying to flush all the dirty pages.
> > It honestly looked like that patch existed purely in order to make the > next few patches be trivial ("no functional changes in this patch") > simply because the functional changes were done earlier. > > That's all actually very nice for bisecting etc, and honestly, I liked > how the patch series looked and was split out, but I liked how it > looked as a *development* series. > > It didn't make me go "this is a bug fix". > >> It's mostly shuffling some arguments and things around, to make the >> final patch trivial. And it does mean that it's super easy to review. > > See above. I think it was "super easy to review" only on a very > superficial level, not on a deeper one. > > For example, I doubt you actually went back and looked at where that > "1024" came from that you just changed to zero? > > It comes from the bitkeeper days, commit 407ee6c87e ("[PATCH] > low-latency page reclaim") in the historical tree: > > [PATCH] low-latency page reclaim > > Convert the VM to not wait on other people's dirty data. > > - If we find a dirty page and its queue is not congested, do some > writeback. > > - If we find a dirty page and its queue _is_ congested then just > refile the page. > > - If we find a PageWriteback page then just refile the page. > > - There is additional throttling for write(2) callers. Within > generic_file_write(), record their backing queue in ->current. > Within page reclaim, if this tasks encounters a page which is dirty > or under writeback onthis queue, block on it. This gives some more > writer throttling and reduces the page refiling frequency. > > It's somewhat CPU expensive - under really heavy load we only get a 50% > reclaim rate in pages coming off the tail of the LRU. This can be > fixed by splitting the inactive list into reclaimable and > non-reclaimable lists. But the CPU load isn't too bad, and latency is > much, much more important in these situations. > > Example: with `mem=512m', running 4 instances of `dbench 100', 2.5.34 > took 35 minutes to compile a kernel. With this patch, it took three > minutes, 45 seconds. > > I haven't done swapcache or MAP_SHARED pages yet. If there's tons of > dirty swapcache or mmap data around we still stall heavily in page > reclaim. That's less important. > > This patch also has a tweak for swapless machines: don't even bother > bringing anon pages onto the inactive list if there is no swap online. > > which introduced that "long nr_pages" argument to wakeup_bdflush(). > > Here's that patch for people who don't keep that historical tree > around like I do: > > https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=407ee6c87e477434e3cb8be96885ed27b5539b6f > > Now, it's entirely possible that the issue that made us do that in the > first place is long long gone, and we don't need that page limiter in > the VM any more. > > But is it "super easy to review"? > Looking at free_more_memory() in fs/buffer.c, it only gets called when grow_dev_page() returns 0 (because find_or_create_page() returned NULL) or in alloc_page_buffers() when alloc_buffer_head() returns NULL. These are always done from sleeping contexts, I think from GFP_NOFS. So we're failing a single page or single slab allocation, and vmscan.c has almost certainly already done a full wakeup_flusher_threads(), and then fs/buffer.c does it again. For good measure, grow_dev_page() also does gfp_mask |= __GFP_NOFAIL. 
I originally suggested we just remove the flusher wakeup completely from free_more_memory(), but I think Jens felt safer just turning it into a rate limited full flush like vmscan.c was already doing. -chris
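The calling pattern Chris describes amounts to a retry loop in fs/buffer.c: fail a GFP_NOFS allocation, call free_more_memory(), try again. A simplified sketch from memory (not the verbatim kernel source):

	static struct buffer_head *
	__getblk_slow(struct block_device *bdev, sector_t block,
		      unsigned size, gfp_t gfp)
	{
		for (;;) {
			struct buffer_head *bh;
			int ret;

			bh = __find_get_block(bdev, block, size);
			if (bh)
				return bh;

			ret = grow_buffers(bdev, block, size, gfp);
			if (ret < 0)
				return NULL;
			if (ret == 0)
				/* a GFP_NOFS allocation failed below us:
				 * flush, yield, reclaim, then retry */
				free_more_memory();
		}
	}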
On Mon, Sep 25, 2017 at 7:46 AM, Jens Axboe <axboe@kernel.dk> wrote:
>
> Honestly, what I really wanted to do was kill that call completely.

I can understand why you'd want to, but at the same time it is not entirely clear that is possible.

So the problem is that we need more memory for buffer head allocations, and we *have* to get it (since we very likely need the buffer heads for writeout under low-mem circumstances) - but at the same time we obviously must not recurse into the filesystem.

The end result being that any synchronous freeing is going to be very limited (GFP_NOFS really ends up limiting a lot), and so there's a fairly strong argument that we should then wake up other threads that *can* do filesystem writeback.

Of course, the problem with _that_ is that it's quite likely that the flusher threads doing filesystem writeback may in turn get back to this very same "we need buffer heads".

And that is, I assume, the case that facebook then hits.

So it's kind of a "damned if you do, damned if you don't" situation.

But this is very much also why that free_more_memory() code then

 (a) limits the writeout

 (b) has that "yield()" in it

with the logic being to not cause unnecessarily much writeout, and hopefully avoid huge latency spikes because of the (potential) looping over the unsuccessful GFP_NOFS allocations.

But yes, a lot has changed over time. For example, originally filesystems that used buffer heads used them both for reading and writing, so the act of reading in a page tended to also populate the buffer heads for that page. I think most filesystems these days avoid buffer heads entirely, and the dynamics of bh allocations have likely changed radically.

So I wouldn't remove the "wakeup_flusher_threads()" call, but maybe it could be moved to be an unusual fallback case if the direct freeing doesn't work.

But none of this happens unless that alloc_buffer_head() failed, so at least one GFP_NOFS allocation has already failed miserably, which is why that "make other threads that _can_ do filesystem IO try to free things" exists in the first place.

> If you read the free_more_memory() function, it looks like crap.

I really don't see that. Think of what the preconditions are:

 - a GFP_NOFS allocation has failed

 - we obviously cannot call back into filesystem code

and that function makes a fair amount of sense. It's _supposed_ to be an unusual nasty case, after all.

Now,

> What I really wanted to do was kill the
> wakeup_flusher_threads() completely, we want to rely on
> try_to_free_pages() starting targeted writeback instead of some magic
> number of pages across devices.

Yes, hopefully the regular memory allocation should already do the right thing, but the problem here is that we do need to loop over it, which it isn't really designed for.

But one option might be to just say "_we_ shouldn't loop, the page allocator should just loop until it gets a successful end result".

So remove _all_ of the fallback code entirely, and just use __GFP_NOFAIL.

That is in fact what some filesystems use for their own metadata: you'll find that the

    GFP_NOFS | __GFP_NOFAIL

combination is not entirely unheard of, and in fact that buffer code itself actually uses it for the backing page itself (not the buffer heads) in grow_dev_page().

So rather than play more games in free_more_memory(), I'd personally just rather remove that function entirely, and say "the buffer head allocations simply cannot fail" and leave it to the VM code to do the right thing.
> But I decided that we should probably
> just keep that random flusher wakeup, and tackle that part in a v2 for
> 4.15.

Oh, absolutely, I'd be more than happy to play with this code, but as mentioned, I actually would do even more radical surgery on it.

So I'm not at all objecting to changing that function in major ways. I just don't think that changing the flush count is in any way obvious. In many ways it's a *much* scarier change than adding __GFP_NOFAIL, in my opinion.

At least with __GFP_NOFAIL, we have some fairly good reason to believe that now the memory management code won't be doing enormous flushing work "just because".

So my only real objection was really the timing, and the "it's obviously a fix".

> The workload is nothing special. Getting to the point where it's bad
> enough to trigger a softlockup probably requires some serious beating up
> of the box, but getting to a suboptimal state wrt work units pending and
> the flusher threads is pretty trivial.

I'm actually surprised that you have what appears to be a buffer-head heavy workload at all.

What is it that triggers that many buffer heads in the first place? Because I thought we'd gotten to the point where all normal file IO can avoid the buffer heads entirely, and just directly work with making bio's from the pages.

So I assume this is some metadata writeout (ie either directory writeback or just the btrfs metadata tree)?

Linus
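A minimal sketch of the direction Linus suggests here, assuming the fallback loop and free_more_memory() are simply removed (illustrative only, not the patch that was eventually merged):

	/* In the buffer head allocation path, let the page allocator do the
	 * looping instead of fs/buffer.c: */
	bh = alloc_buffer_head(GFP_NOFS | __GFP_NOFAIL);

	/* No NULL check and no free_more_memory() retry needed any more:
	 * __GFP_NOFAIL makes the allocator itself retry (waking flushers and
	 * reclaiming as needed) until the allocation succeeds. */

This mirrors what grow_dev_page() already does for the backing page via __GFP_NOFAIL, as noted above.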
On 09/25/2017 02:32 PM, Linus Torvalds wrote: > On Mon, Sep 25, 2017 at 7:46 AM, Jens Axboe <axboe@kernel.dk> wrote: >> >> Honestly, what I really wanted to do was kill that call completely. > > I can understand why you'd want to, but at the same time it is not > entirely clear that is possible. > > So the problem is that we need more memory for buffer head > allocations, and we *have* to get it (since we very likely need the > buffer heads for writeout under low-mem cirumstances) - but at the > same time we obviously must not recurse into the filesystem. > > The end result being that any synchronous freeing is going to be very > limited (GFP_NOFS really ends up limiting a lot), and so there's a > fairly strong argument that we should then wake up other threads that > *can* do filesystem writeback. > > Of course, the problem with _that_ is that it's quite likely that the > flusher threads doing filesystem writeback may in turn get back to > this very same "we need buffer heads". > > And that is, I assume, the case that facebook then hits. > > So it's kind of "damned in you do, damned if you don't" situation. > > But this is very much also why that free_more_memory() code then > > (a) limits the writeout > > (b) has that "yield()" in it > > with the logic being to not cause unnecessarily much writeout, and > hopefully avoid huge latency spikes because of the (potential) looping > over the unsiccessful GFP_NOFS allocations. My understanding is that for order-0 page allocations and kmem_cache_alloc(buffer_heads), GFP_NOFS is going to either loop forever or at the very least OOM kill something before returning NULL? By the time free_more_memory() is called, vmscan.c has already triggered a full flush via wakeup_flusher_threads() at least once, and probably OOM killed something. [ ... ] > >> What I really wanted to do was kill the >> wakeup_flusher_threads() completely, we want to rely on >> try_to_free_pages() starting targeted writeback instead of some magic >> number of pages across devices. > > Yes, hopefully the regular memory allocation should already do the > right thing, but the problem here is that we do need to loop over it, > which it isn't really designed for. > > But one option might be to just say "_we_ shouldn't loop, the page > allocator should just loop until it gets a successful end result". > > So remove _all_ of the fallback code entirely, and just use __GFP_NOFAIL. > Agreed. I think Jens was planning on this for the next merge window. > That is in fact what some filesystems use for their own metadata: > you'll find that the > > GFP_NOFS | __GFP_NOFAIL > > combination is not entirely unheard of, and in fact that buffer code > itself actually uses it for the backing page itself (not the buffer > heads) in grow_dev_page(). > > So rather than play more games in free_more_memory(), I'd personally > just rather remove that function entirely, and say "the buffer head > allocations simply cannot fail" and leave it to the VM code to do the > right thing. > >> But I decided that we should probably >> just keep that random flusher wakeup, and tackle that part in a v2 for >> 4.15. > > Oh, absolutely, I'd be more than happy to play with this code, but as > mentioned, I actually would do even more radical surgery on it. > > So I'm not at all objecting to changing that function in major ways. > > I just don't think that changing the flush count is in any way > obvious. In many ways it's a *much* scarier change than adding > __GFP_NOFAIL, in my opinion. 
> > At least with __GFP_NOFAIL, we have some fairly good reason to believe > that now the memory management code won't be doing enormous flushing > work "just because". > > So my only real objection was really the timing, and the "it's > obviously a fix". > >> The workload is nothing special. Getting to the point where it's bad >> enough to trigger a softlockup probably requires some serious beating up >> of the box, but getting to a suboptimal state wrt work units pending and >> the flusher threads is pretty trivial. > > I'm actually surprised that you have what appears to be a buffer-head > heavy workload at all. > > What is it that triggers that many buffer heads in the first place? > Because I thought we'd gotten to the point where all normal file IO > can avoid the buffer heads entirely, and just directly work with > making bio's from the pages. > We're not triggering free_more_memory(). I ran a probe on a few production machines and it didn't fire once over a 90 minute period of heavy load. The main target of Jens' patchset was preventing shrink_inactive_list() -> wakeup_flusher_threads() from creating millions of work items without any rate limiting at all. Once we went through and audited all of the callers of wakeup_flusher_threads(), free_more_memory() kind of stuck out as being weird. It's trying to solve the same problems that vmscan.c is trying to solve (there are no free pages), but it's using a different method and not really communicating with the MM code. It's not even checking to see if there are dirty pages. I liked the way Jens changed it to use at least use the same args for wakeup_flusher_threads() as vmscan.c is now using. But with all of that said, the fs/buffer.c hunks are the least important part of the series for what we've seen in production. We only really care about when shrink_inactive_list() is calling wakeup_flusher_threads(). > So I assume this is some metadata writeout (ie either directory > writeback or just the btrfs metadata tree? Btrfs does sneak into the softlockup story on some machines because I have a check in our metadata writepages to bail if there is less than 32MB of dirty metadata. We added it ages ago because COW works better when we build up a good sized chunk of stuff to COW. It goes into a weird corner case where we have dirty pages that we're refusing to write during background writeout, and shrink_inactive_list() hopes that adding one more wakeup_flusher_threads() call will clean them. But it's not related to buffer_heads, which btrfs only uses to write the super blocks. We see our softlockups most on boxes with 15 XFS filesystems. Most of these are ext4 for root, and the ext4 pages/buffer_heads are a small fraction of what we have in page cache. It's hadoop, so java monsters doing plain buffered IO, mostly onto the XFS drives. -chris
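The btrfs check Chris mentions is the background-writeback bail-out in the btree address space operations. Roughly, with constant and field names recalled from memory rather than quoted:

	/* In btrfs' btree_writepages() (paraphrased): skip background
	 * metadata writeback until enough dirty metadata has accumulated
	 * to build good-sized COW extents (a ~32MB threshold).  The dirty
	 * pages stay dirty, so another flusher wakeup won't clean them. */
	if (wbc->sync_mode == WB_SYNC_NONE &&
	    percpu_counter_compare(&fs_info->dirty_metadata_bytes,
				   BTRFS_DIRTY_METADATA_THRESH) < 0)
		return 0;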
On Mon, Sep 25, 2017 at 2:17 PM, Chris Mason <clm@fb.com> wrote: > > My understanding is that for order-0 page allocations and > kmem_cache_alloc(buffer_heads), GFP_NOFS is going to either loop forever or > at the very least OOM kill something before returning NULL? That should generally be true. We've occasionally screwed up in the VM, so an explicit GFP_NOFAIL would definitely be best if we then remove the looping in fs/buffer.c. >> What is it that triggers that many buffer heads in the first place? >> Because I thought we'd gotten to the point where all normal file IO >> can avoid the buffer heads entirely, and just directly work with >> making bio's from the pages. > > We're not triggering free_more_memory(). I ran a probe on a few production > machines and it didn't fire once over a 90 minute period of heavy load. The > main target of Jens' patchset was preventing shrink_inactive_list() -> > wakeup_flusher_threads() from creating millions of work items without any > rate limiting at all. So the two things I reacted to in that patch series were apparently things that you guys don't even care about. I reacted to the fs/buffer.c code, and to the change in laptop mode to not do circular writeback. The latter is another "it's probably ok, but it can be a subtle change". In particular, things that re-write the same thing over and over again can get very different behavior, even when you write out "all" pages. And I'm assuming you're not using laptop mode either on your servers (that sounds insane, but I remember somebody actually ended up using laptop mode even on servers, simply because they did *not* want the regular timed writeback model, so it's not quite as insane as it sounds). Linus
On 09/25/2017 06:21 PM, Linus Torvalds wrote: > On Mon, Sep 25, 2017 at 2:17 PM, Chris Mason <clm@fb.com> wrote: >> >> My understanding is that for order-0 page allocations and >> kmem_cache_alloc(buffer_heads), GFP_NOFS is going to either loop forever or >> at the very least OOM kill something before returning NULL? > > That should generally be true. We've occasionally screwed up in the > VM, so an explicit GFP_NOFAIL would definitely be best if we then > remove the looping in fs/buffer.c. Right, I wouldn't remove the looping without the NOFAIL. But in the normal case, it shouldn't be possible for free_more_memory() to be called without an OOM and without already having triggered the full flush. That's why using the full flush in free_more_memory() felt like a small change to me. But if you'd rather see GFP_NOFAIL in the next merge window I don't have any objections to that method either. > >>> What is it that triggers that many buffer heads in the first place? >>> Because I thought we'd gotten to the point where all normal file IO >>> can avoid the buffer heads entirely, and just directly work with >>> making bio's from the pages. >> >> We're not triggering free_more_memory(). I ran a probe on a few production >> machines and it didn't fire once over a 90 minute period of heavy load. The >> main target of Jens' patchset was preventing shrink_inactive_list() -> >> wakeup_flusher_threads() from creating millions of work items without any >> rate limiting at all. > > So the two things I reacted to in that patch series were apparently > things that you guys don't even care about. Yes and no. fs/buffer.c didn't explode in prod recently, but we do exercise the OOM code often. Even though I haven't seen it happen, I'd rather not leave fs/buffer.c able to trigger the same work explosion during an OOM spiral. I know it's really unlikely, but java. > > I reacted to the fs/buffer.c code, and to the change in laptop mode to > not do circular writeback. > > The latter is another "it's probably ok, but it can be a subtle > change". In particular, things that re-write the same thing over and > over again can get very different behavior, even when you write out > "all" pages. > > And I'm assuming you're not using laptop mode either on your servers > (that sounds insane, but I remember somebody actually ended up using > laptop mode even on servers, simply because they did *not* want the > regular timed writeback model, so it's not quite as insane as it > sounds). I'd honestly have to do a query to make sure we aren't using it in some dark corner. I sometimes think people try things just to see how long it takes Jens to notice. -chris
Travel stable again, catching up. Chris did a great job explaining what our issues were, so thanks for that. On 09/26/2017 12:21 AM, Linus Torvalds wrote: > On Mon, Sep 25, 2017 at 2:17 PM, Chris Mason <clm@fb.com> wrote: >> >> My understanding is that for order-0 page allocations and >> kmem_cache_alloc(buffer_heads), GFP_NOFS is going to either loop forever or >> at the very least OOM kill something before returning NULL? > > That should generally be true. We've occasionally screwed up in the > VM, so an explicit GFP_NOFAIL would definitely be best if we then > remove the looping in fs/buffer.c. Reworked to include that. More below. >>> What is it that triggers that many buffer heads in the first place? >>> Because I thought we'd gotten to the point where all normal file IO >>> can avoid the buffer heads entirely, and just directly work with >>> making bio's from the pages. >> >> We're not triggering free_more_memory(). I ran a probe on a few production >> machines and it didn't fire once over a 90 minute period of heavy load. The >> main target of Jens' patchset was preventing shrink_inactive_list() -> >> wakeup_flusher_threads() from creating millions of work items without any >> rate limiting at all. > > So the two things I reacted to in that patch series were apparently > things that you guys don't even care about. Right. But I'd like to stress that my development practice is to engineer things that make sense in general, AND that fix the specific issue at hand. This is never about making the general case worse, while fixing some FB specific issue. I'm very sure that others hit this case as well. Maybe not to the extent of getting softlockups, but abysmal behavior happens long before that. It just doesn't trigger any dmesg complaints. > I reacted to the fs/buffer.c code, and to the change in laptop mode to > not do circular writeback. > > The latter is another "it's probably ok, but it can be a subtle > change". In particular, things that re-write the same thing over and > over again can get very different behavior, even when you write out > "all" pages. > > And I'm assuming you're not using laptop mode either on your servers > (that sounds insane, but I remember somebody actually ended up using > laptop mode even on servers, simply because they did *not* want the > regular timed writeback model, so it's not quite as insane as it > sounds). So I reworked the series, to include three prep patches that end up killing off free_more_memory(). This means that we don't have to do the 1024 -> 0 change in there. On top of that, I added a separate bit to manage range cyclic vs non range cyclic flush all work. This means that we don't have to worry about the laptop case either. I think that should quell any of the concerns in the patchset, you can find the new series here: http://git.kernel.dk/cgit/linux-block/log/?h=wb-start-all Unless you feel comfortable taking it for 4.14, I'm going to push this to 4.15. In any case, it won't be ready until tomorrow, I need to push this through the test machinery just in case.
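A sketch of the coalescing approach Jens describes: one pending "flush everything" marker per flavor, so concurrent callers no longer each allocate a work item. Bit names and the helper's shape are illustrative, based on the description above rather than on the actual wb-start-all branch:

	/* illustrative bit numbers; a real patch would add these to the
	 * existing wb->state flag enum */
	enum {
		WB_START_ALL,		/* "flush everything" request pending */
		WB_START_ALL_CYCLIC,	/* same, but range-cyclic (laptop mode) */
	};

	static void wb_start_writeback(struct bdi_writeback *wb,
				       bool range_cyclic, enum wb_reason reason)
	{
		int bit = range_cyclic ? WB_START_ALL_CYCLIC : WB_START_ALL;

		if (!wb_has_dirty_io(wb))
			return;

		/* coalesce: at most one pending request of each flavor,
		 * instead of an unbounded pile of identical work items */
		if (test_and_set_bit(bit, &wb->state))
			return;

		wb_wakeup(wb);
	}

The flusher thread then clears the bit and does a single full (optionally range-cyclic) writeback pass, no matter how many callers asked for one in the meantime.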
On Wed, Sep 27, 2017 at 5:41 AM, Jens Axboe <axboe@kernel.dk> wrote:
>
> So I reworked the series, to include three prep patches that end up
> killing off free_more_memory(). This means that we don't have to do the
> 1024 -> 0 change in there. On top of that, I added a separate bit to
> manage range cyclic vs non range cyclic flush all work. This means that
> we don't have to worry about the laptop case either.
>
> I think that should quell any of the concerns in the patchset, you can
> find the new series here:
>
> http://git.kernel.dk/cgit/linux-block/log/?h=wb-start-all

Yeah, this I feel more confident about.

It still has some subtle changes (nr_pages is slightly different for laptop mode, and the whole "we rely on the VM to not return NULL") but I cannot for the life of me imagine that those changes are noticeable or meaningful.

So now that last commit that limits the pending flushes is the only one that seems to matter, and the one you wanted to build up to. It's not a subtle change, but that's the one that fixes the actual bug you see, so now the rest of the series really looks like "prep-work for the bug fix".

Of course, with the change to support range_cyclic for the laptop mode case, there's actually possibly _two_ pending full flushes (the range-cyclic and the "all" one), so that last changelog may not be entirely accurate.

Long-term, I would prefer to maybe make the "range_whole" case always be range-cyclic, because I suspect that's the better behavior, but I think that series is the one that changes existing semantics the least for the existing cases, so I think your approach is better for now.

As far as I can tell, the cases that do not set range_cyclic right now are:

 - wakeup_flusher_threads()

 - sync_inodes_sb()

and in neither case does that lack of range_cyclic really seem to make any sense.

It might be a purely historical artifact (an _old_ one: that range_cyclic logic goes back to 2006, as far as I can tell).

But there might also be something I'm missing.

Anyway, your series looks fine to me.

Linus
On 09/27/2017 07:36 PM, Linus Torvalds wrote:
> On Wed, Sep 27, 2017 at 5:41 AM, Jens Axboe <axboe@kernel.dk> wrote:
>>
>> So I reworked the series, to include three prep patches that end up
>> killing off free_more_memory(). This means that we don't have to do the
>> 1024 -> 0 change in there. On top of that, I added a separate bit to
>> manage range cyclic vs non range cyclic flush all work. This means that
>> we don't have to worry about the laptop case either.
>>
>> I think that should quell any of the concerns in the patchset, you can
>> find the new series here:
>>
>> http://git.kernel.dk/cgit/linux-block/log/?h=wb-start-all
>
> Yeah, this I feel more confident about.
>
> It still has some subtle changes (nr_pages is slightly different for
> laptop mode, and the whole "we rely on the VM to not return NULL") but
> I cannot for the life of me imagine that those changes are noticeable
> or meaningful.
>
> So now that last commit that limits the pending flushes is the only
> one that seems to matter, and the one you wanted to build up to. It's
> not a subtle change, but that's the one that fixes the actual bug you
> see, so now the rest of the series really looks like "prep-work for
> the bug fix".
>
> Of course, with the change to support range_cyclic for the laptop
> mode case, there's actually possibly _two_ pending full flushes (the
> range-cyclic and the "all" one), so that last changelog may not be
> entirely accurate.

Yeah, good point. I did modify it a little bit, but I should make that clear throughout that changelog. Functionally it's identical as far as the bug is concerned, 1 or 2 is a far cry from hundreds of thousands or millions.

> Long-term, I would prefer to maybe make the "range_whole" case always
> be range-cyclic, because I suspect that's the better behavior, but I
> think that series is the one that changes existing semantics the least
> for the existing cases, so I think your approach is better for now.

I agree, I was actually contemplating whether to make that change or not. I don't see a reason to ever NOT do range cyclic writeback for the full range request. Then we could kill the second wb->state bit and get back to just having the one in flight.

> As far as I can tell, the cases that do not set range_cyclic right now are:
>
> - wakeup_flusher_threads()
>
> - sync_inodes_sb()
>
> and in neither case does that lack of range_cyclic really seem to make
> any sense.
>
> It might be a purely historical artifact (an _old_ one: that
> range_cyclic logic goes back to 2006, as far as I can tell).
>
> But there might also be something I'm missing.
>
> Anyway, your series looks fine to me.

Thanks for taking a look. I feel like the change should be made to do range cyclic flushes for any request that wants to start writeback on the full range, I can't think of a case where that makes a difference. Then we can kill the extra bit, and more importantly, kill this part:

	if (reason == WB_REASON_LAPTOP_TIMER)
		range_cyclic = true;

which is a bit of a hack. I'll respin with that, and repost the series.
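For reference, the range_cyclic flag being discussed controls where a writeback pass starts. In write_cache_pages() terms it behaves roughly like this (paraphrased, not verbatim):

	if (wbc->range_cyclic) {
		/* resume where the previous pass left off and wrap around,
		 * so repeated "write everything" requests don't keep
		 * restarting from offset 0 */
		index = mapping->writeback_index;
		end = -1;
	} else {
		/* honor the explicit range; a full flush passes
		 * 0..LLONG_MAX here */
		index = wbc->range_start >> PAGE_SHIFT;
		end = wbc->range_end >> PAGE_SHIFT;
	}

Making every whole-range request range-cyclic, as proposed above, would let the WB_REASON_LAPTOP_TIMER special case go away.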
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index acc816b67582..bb2aad078637 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -134,8 +134,6 @@ static inline bool nvme_req_needs_retry(struct request *req) return false; if (nvme_req(req)->status & NVME_SC_DNR) return false; - if (jiffies - req->start_time >= req->timeout) - return false; if (nvme_req(req)->retries >= nvme_max_retries) return false; return true; @@ -2590,7 +2588,7 @@ static void nvme_async_event_work(struct work_struct *work) container_of(work, struct nvme_ctrl, async_event_work); spin_lock_irq(&ctrl->lock); - while (ctrl->event_limit > 0) { + while (ctrl->state == NVME_CTRL_LIVE && ctrl->event_limit > 0) { int aer_idx = --ctrl->event_limit; spin_unlock_irq(&ctrl->lock); @@ -2677,7 +2675,8 @@ void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status, /*FALLTHRU*/ case NVME_SC_ABORT_REQ: ++ctrl->event_limit; - queue_work(nvme_wq, &ctrl->async_event_work); + if (ctrl->state == NVME_CTRL_LIVE) + queue_work(nvme_wq, &ctrl->async_event_work); break; default: break; @@ -2692,7 +2691,7 @@ void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status, nvme_queue_scan(ctrl); break; case NVME_AER_NOTICE_FW_ACT_STARTING: - schedule_work(&ctrl->fw_act_work); + queue_work(nvme_wq, &ctrl->fw_act_work); break; default: dev_warn(ctrl->device, "async event result %08x\n", result); diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c index 47307752dc65..555c976cc2ee 100644 --- a/drivers/nvme/host/fabrics.c +++ b/drivers/nvme/host/fabrics.c @@ -565,6 +565,7 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts, opts->queue_size = NVMF_DEF_QUEUE_SIZE; opts->nr_io_queues = num_online_cpus(); opts->reconnect_delay = NVMF_DEF_RECONNECT_DELAY; + opts->kato = NVME_DEFAULT_KATO; options = o = kstrdup(buf, GFP_KERNEL); if (!options) @@ -655,21 +656,22 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts, goto out; } - if (opts->discovery_nqn) { - pr_err("Discovery controllers cannot accept keep_alive_tmo != 0\n"); - ret = -EINVAL; - goto out; - } - if (token < 0) { pr_err("Invalid keep_alive_tmo %d\n", token); ret = -EINVAL; goto out; - } else if (token == 0) { + } else if (token == 0 && !opts->discovery_nqn) { /* Allowed for debug */ pr_warn("keep_alive_tmo 0 won't execute keep alives!!!\n"); } opts->kato = token; + + if (opts->discovery_nqn && opts->kato) { + pr_err("Discovery controllers cannot accept KATO != 0\n"); + ret = -EINVAL; + goto out; + } + break; case NVMF_OPT_CTRL_LOSS_TMO: if (match_int(args, &token)) { @@ -762,8 +764,6 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts, uuid_copy(&opts->host->id, &hostid); out: - if (!opts->discovery_nqn && !opts->kato) - opts->kato = NVME_DEFAULT_KATO; kfree(options); return ret; } diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c index d2e882c0f496..af075e998944 100644 --- a/drivers/nvme/host/fc.c +++ b/drivers/nvme/host/fc.c @@ -1376,7 +1376,7 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req) if (atomic_read(&op->state) == FCPOP_STATE_ABORTED) status = cpu_to_le16((NVME_SC_ABORT_REQ | NVME_SC_DNR) << 1); else if (freq->status) - status = cpu_to_le16(NVME_SC_FC_TRANSPORT_ERROR << 1); + status = cpu_to_le16(NVME_SC_INTERNAL << 1); /* * For the linux implementation, if we have an unsuccesful @@ -1404,7 +1404,7 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req) */ if (freq->transferred_length != be32_to_cpu(op->cmd_iu.data_len)) { - status = cpu_to_le16(NVME_SC_FC_TRANSPORT_ERROR 
<< 1);
+		status = cpu_to_le16(NVME_SC_INTERNAL << 1);
 			goto done;
 		}
 		result.u64 = 0;
@@ -1421,7 +1421,7 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
 					freq->transferred_length ||
 			     op->rsp_iu.status_code ||
 			     sqe->common.command_id != cqe->command_id)) {
-			status = cpu_to_le16(NVME_SC_FC_TRANSPORT_ERROR << 1);
+			status = cpu_to_le16(NVME_SC_INTERNAL << 1);
 			goto done;
 		}
 		result = cqe->result;
@@ -1429,7 +1429,7 @@ nvme_fc_fcpio_done(struct nvmefc_fcp_req *req)
 		break;

 	default:
-		status = cpu_to_le16(NVME_SC_FC_TRANSPORT_ERROR << 1);
+		status = cpu_to_le16(NVME_SC_INTERNAL << 1);
 		goto done;
 	}

@@ -1989,16 +1989,17 @@ nvme_fc_start_fcp_op(struct nvme_fc_ctrl *ctrl, struct nvme_fc_queue *queue,
 	 * as well as those by FC-NVME spec.
 	 */
 	WARN_ON_ONCE(sqe->common.metadata);
-	WARN_ON_ONCE(sqe->common.dptr.prp1);
-	WARN_ON_ONCE(sqe->common.dptr.prp2);
 	sqe->common.flags |= NVME_CMD_SGL_METABUF;

 	/*
-	 * format SQE DPTR field per FC-NVME rules
-	 *    type=data block descr; subtype=offset;
-	 *    offset is currently 0.
+	 * format SQE DPTR field per FC-NVME rules:
+	 *    type=0x5     Transport SGL Data Block Descriptor
+	 *    subtype=0xA  Transport-specific value
+	 *    address=0
+	 *    length=length of the data series
 	 */
-	sqe->rw.dptr.sgl.type = NVME_SGL_FMT_OFFSET;
+	sqe->rw.dptr.sgl.type = (NVME_TRANSPORT_SGL_DATA_DESC << 4) |
+					NVME_SGL_FMT_TRANSPORT_A;
 	sqe->rw.dptr.sgl.length = cpu_to_le32(data_len);
 	sqe->rw.dptr.sgl.addr = 0;

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 4a2121335f48..cb73bc8cad3b 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -24,6 +24,7 @@
 #include <linux/mm.h>
 #include <linux/module.h>
 #include <linux/mutex.h>
+#include <linux/once.h>
 #include <linux/pci.h>
 #include <linux/poison.h>
 #include <linux/t10-pi.h>
@@ -540,6 +541,20 @@ static void nvme_dif_complete(u32 p, u32 v, struct t10_pi_tuple *pi)
 }
 #endif

+static void nvme_print_sgl(struct scatterlist *sgl, int nents)
+{
+	int i;
+	struct scatterlist *sg;
+
+	for_each_sg(sgl, sg, nents, i) {
+		dma_addr_t phys = sg_phys(sg);
+		pr_warn("sg[%d] phys_addr:%pad offset:%d length:%d "
+			"dma_address:%pad dma_length:%d\n",
+			i, &phys, sg->offset, sg->length, &sg_dma_address(sg),
+			sg_dma_len(sg));
+	}
+}
+
 static blk_status_t nvme_setup_prps(struct nvme_dev *dev, struct request *req)
 {
 	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
@@ -622,19 +637,10 @@ static blk_status_t nvme_setup_prps(struct nvme_dev *dev, struct request *req)
 	return BLK_STS_OK;

  bad_sgl:
-	if (WARN_ONCE(1, "Invalid SGL for payload:%d nents:%d\n",
-				blk_rq_payload_bytes(req), iod->nents)) {
-		for_each_sg(iod->sg, sg, iod->nents, i) {
-			dma_addr_t phys = sg_phys(sg);
-			pr_warn("sg[%d] phys_addr:%pad offset:%d length:%d "
-			"dma_address:%pad dma_length:%d\n", i, &phys,
-					sg->offset, sg->length,
-					&sg_dma_address(sg),
-					sg_dma_len(sg));
-		}
-	}
+	WARN(DO_ONCE(nvme_print_sgl, iod->sg, iod->nents),
+			"Invalid SGL for payload:%d nents:%d\n",
+			blk_rq_payload_bytes(req), iod->nents);
 	return BLK_STS_IOERR;
-
 }

 static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
@@ -1313,11 +1319,11 @@ static int nvme_create_queue(struct nvme_queue *nvmeq, int qid)
 	if (result < 0)
 		goto release_cq;

+	nvme_init_queue(nvmeq, qid);
 	result = queue_request_irq(nvmeq);
 	if (result < 0)
 		goto release_sq;

-	nvme_init_queue(nvmeq, qid);
 	return result;

  release_sq:
@@ -1464,6 +1470,7 @@ static int nvme_pci_configure_admin_queue(struct nvme_dev *dev)
 		return result;

 	nvmeq->cq_vector = 0;
+	nvme_init_queue(nvmeq, 0);
 	result = queue_request_irq(nvmeq);
 	if (result) {
 		nvmeq->cq_vector = -1;
@@ -2156,7 +2163,6 @@ static void nvme_reset_work(struct work_struct *work)
 	if (result)
 		goto out;

-	nvme_init_queue(dev->queues[0], 0);
 	result = nvme_alloc_admin_tags(dev);
 	if (result)
 		goto out;
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 58983000964b..92a03ff5fb4d 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -942,7 +942,12 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
 	}

 	changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
-	WARN_ON_ONCE(!changed);
+	if (!changed) {
+		/* state change failure is ok if we're in DELETING state */
+		WARN_ON_ONCE(ctrl->ctrl.state != NVME_CTRL_DELETING);
+		return;
+	}
+
 	ctrl->ctrl.nr_reconnects = 0;

 	nvme_start_ctrl(&ctrl->ctrl);
@@ -962,7 +967,7 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
 	struct nvme_rdma_ctrl *ctrl = container_of(work,
 			struct nvme_rdma_ctrl, err_work);

-	nvme_stop_ctrl(&ctrl->ctrl);
+	nvme_stop_keep_alive(&ctrl->ctrl);

 	if (ctrl->ctrl.queue_count > 1) {
 		nvme_stop_queues(&ctrl->ctrl);
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 7c23eaf8e563..1b208beeef50 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -390,10 +390,10 @@ static void __nvmet_req_complete(struct nvmet_req *req, u16 status)
 	if (status)
 		nvmet_set_status(req, status);

-	/* XXX: need to fill in something useful for sq_head */
-	req->rsp->sq_head = 0;
-	if (likely(req->sq)) /* may happen during early failure */
-		req->rsp->sq_id = cpu_to_le16(req->sq->qid);
+	if (req->sq->size)
+		req->sq->sqhd = (req->sq->sqhd + 1) % req->sq->size;
+	req->rsp->sq_head = cpu_to_le16(req->sq->sqhd);
+	req->rsp->sq_id = cpu_to_le16(req->sq->qid);
 	req->rsp->command_id = req->cmd->common.command_id;

 	if (req->ns)
@@ -420,6 +420,7 @@ void nvmet_cq_setup(struct nvmet_ctrl *ctrl, struct nvmet_cq *cq,
 void nvmet_sq_setup(struct nvmet_ctrl *ctrl, struct nvmet_sq *sq,
 		u16 qid, u16 size)
 {
+	sq->sqhd = 0;
 	sq->qid = qid;
 	sq->size = size;

diff --git a/drivers/nvme/target/fabrics-cmd.c b/drivers/nvme/target/fabrics-cmd.c
index 859a66725291..db3bf6b8bf9e 100644
--- a/drivers/nvme/target/fabrics-cmd.c
+++ b/drivers/nvme/target/fabrics-cmd.c
@@ -109,9 +109,14 @@ static u16 nvmet_install_queue(struct nvmet_ctrl *ctrl, struct nvmet_req *req)
 		pr_warn("queue already connected!\n");
 		return NVME_SC_CONNECT_CTRL_BUSY | NVME_SC_DNR;
 	}
+	if (!sqsize) {
+		pr_warn("queue size zero!\n");
+		return NVME_SC_CONNECT_INVALID_PARAM | NVME_SC_DNR;
+	}

-	nvmet_cq_setup(ctrl, req->cq, qid, sqsize);
-	nvmet_sq_setup(ctrl, req->sq, qid, sqsize);
+	/* note: convert queue size from 0's-based value to 1's-based value */
+	nvmet_cq_setup(ctrl, req->cq, qid, sqsize + 1);
+	nvmet_sq_setup(ctrl, req->sq, qid, sqsize + 1);

 	return 0;
 }
diff --git a/drivers/nvme/target/fc.c b/drivers/nvme/target/fc.c
index 421e43bf1dd7..58e010bdda3e 100644
--- a/drivers/nvme/target/fc.c
+++ b/drivers/nvme/target/fc.c
@@ -148,7 +148,7 @@ struct nvmet_fc_tgt_assoc {
 	u32				a_id;
 	struct nvmet_fc_tgtport		*tgtport;
 	struct list_head		a_list;
-	struct nvmet_fc_tgt_queue	*queues[NVMET_NR_QUEUES];
+	struct nvmet_fc_tgt_queue	*queues[NVMET_NR_QUEUES + 1];
 	struct kref			ref;
 };

@@ -608,7 +608,7 @@ nvmet_fc_alloc_target_queue(struct nvmet_fc_tgt_assoc *assoc,
 	unsigned long flags;
 	int ret;

-	if (qid >= NVMET_NR_QUEUES)
+	if (qid > NVMET_NR_QUEUES)
 		return NULL;

 	queue = kzalloc((sizeof(*queue) +
@@ -783,6 +783,9 @@ nvmet_fc_find_target_queue(struct nvmet_fc_tgtport *tgtport,
 	u16 qid = nvmet_fc_getqueueid(connection_id);
 	unsigned long flags;

+	if (qid > NVMET_NR_QUEUES)
+		return NULL;
+
 	spin_lock_irqsave(&tgtport->lock, flags);
 	list_for_each_entry(assoc, &tgtport->assoc_list, a_list) {
 		if (association_id == assoc->association_id) {
@@ -888,7 +891,7 @@ nvmet_fc_delete_target_assoc(struct nvmet_fc_tgt_assoc *assoc)
 	int i;

 	spin_lock_irqsave(&tgtport->lock, flags);
-	for (i = NVMET_NR_QUEUES - 1; i >= 0; i--) {
+	for (i = NVMET_NR_QUEUES; i >= 0; i--) {
 		queue = assoc->queues[i];
 		if (queue) {
 			if (!nvmet_fc_tgt_q_get(queue))
@@ -1910,8 +1913,7 @@ nvmet_fc_transfer_fcp_data(struct nvmet_fc_tgtport *tgtport,
 			spin_lock_irqsave(&fod->flock, flags);
 			fod->writedataactive = false;
 			spin_unlock_irqrestore(&fod->flock, flags);
-			nvmet_req_complete(&fod->req,
-					NVME_SC_FC_TRANSPORT_ERROR);
+			nvmet_req_complete(&fod->req, NVME_SC_INTERNAL);
 		} else /* NVMET_FCOP_READDATA or NVMET_FCOP_READDATA_RSP */ {
 			fcpreq->fcp_error = ret;
 			fcpreq->transferred_length = 0;
@@ -1929,8 +1931,7 @@ __nvmet_fc_fod_op_abort(struct nvmet_fc_fcp_iod *fod, bool abort)
 	/* if in the middle of an io and we need to tear down */
 	if (abort) {
 		if (fcpreq->op == NVMET_FCOP_WRITEDATA) {
-			nvmet_req_complete(&fod->req,
-					NVME_SC_FC_TRANSPORT_ERROR);
+			nvmet_req_complete(&fod->req, NVME_SC_INTERNAL);
 			return true;
 		}

@@ -1968,8 +1969,7 @@ nvmet_fc_fod_op_done(struct nvmet_fc_fcp_iod *fod)
 			fod->abort = true;
 			spin_unlock(&fod->flock);

-			nvmet_req_complete(&fod->req,
-					NVME_SC_FC_TRANSPORT_ERROR);
+			nvmet_req_complete(&fod->req, NVME_SC_INTERNAL);
 			return;
 		}

@@ -2533,13 +2533,17 @@ nvmet_fc_remove_port(struct nvmet_port *port)
 {
 	struct nvmet_fc_tgtport *tgtport = port->priv;
 	unsigned long flags;
+	bool matched = false;

 	spin_lock_irqsave(&nvmet_fc_tgtlock, flags);
 	if (tgtport->port == port) {
-		nvmet_fc_tgtport_put(tgtport);
+		matched = true;
 		tgtport->port = NULL;
 	}
 	spin_unlock_irqrestore(&nvmet_fc_tgtlock, flags);
+
+	if (matched)
+		nvmet_fc_tgtport_put(tgtport);
 }

 static struct nvmet_fabrics_ops nvmet_fc_tgt_fcp_ops = {
diff --git a/drivers/nvme/target/fcloop.c b/drivers/nvme/target/fcloop.c
index 1cb9847ec261..7b75d9de55ab 100644
--- a/drivers/nvme/target/fcloop.c
+++ b/drivers/nvme/target/fcloop.c
@@ -224,8 +224,6 @@ struct fcloop_nport {
 	struct fcloop_lport *lport;
 	struct list_head nport_list;
 	struct kref ref;
-	struct completion rport_unreg_done;
-	struct completion tport_unreg_done;
 	u64 node_name;
 	u64 port_name;
 	u32 port_role;
@@ -576,7 +574,7 @@ fcloop_tgt_fcp_abort(struct nvmet_fc_target_port *tgtport,
 	tfcp_req->aborted = true;
 	spin_unlock(&tfcp_req->reqlock);

-	tfcp_req->status = NVME_SC_FC_TRANSPORT_ABORTED;
+	tfcp_req->status = NVME_SC_INTERNAL;

 	/*
 	 * nothing more to do. If io wasn't active, the transport should
@@ -630,6 +628,32 @@ fcloop_fcp_abort(struct nvme_fc_local_port *localport,
 		schedule_work(&inireq->iniwork);
 }

+static void
+fcloop_nport_free(struct kref *ref)
+{
+	struct fcloop_nport *nport =
+		container_of(ref, struct fcloop_nport, ref);
+	unsigned long flags;
+
+	spin_lock_irqsave(&fcloop_lock, flags);
+	list_del(&nport->nport_list);
+	spin_unlock_irqrestore(&fcloop_lock, flags);
+
+	kfree(nport);
+}
+
+static void
+fcloop_nport_put(struct fcloop_nport *nport)
+{
+	kref_put(&nport->ref, fcloop_nport_free);
+}
+
+static int
+fcloop_nport_get(struct fcloop_nport *nport)
+{
+	return kref_get_unless_zero(&nport->ref);
+}
+
 static void
 fcloop_localport_delete(struct nvme_fc_local_port *localport)
 {
@@ -644,8 +668,7 @@ fcloop_remoteport_delete(struct nvme_fc_remote_port *remoteport)
 {
 	struct fcloop_rport *rport = remoteport->private;

-	/* release any threads waiting for the unreg to complete */
-	complete(&rport->nport->rport_unreg_done);
+	fcloop_nport_put(rport->nport);
 }

 static void
@@ -653,8 +676,7 @@ fcloop_targetport_delete(struct nvmet_fc_target_port *targetport)
 {
 	struct fcloop_tport *tport = targetport->private;

-	/* release any threads waiting for the unreg to complete */
-	complete(&tport->nport->tport_unreg_done);
+	fcloop_nport_put(tport->nport);
 }

 #define FCLOOP_HW_QUEUES		4
@@ -722,6 +744,7 @@ fcloop_create_local_port(struct device *dev, struct device_attribute *attr,
 		goto out_free_opts;
 	}

+	memset(&pinfo, 0, sizeof(pinfo));
 	pinfo.node_name = opts->wwnn;
 	pinfo.port_name = opts->wwpn;
 	pinfo.port_role = opts->roles;
@@ -804,32 +827,6 @@ fcloop_delete_local_port(struct device *dev, struct device_attribute *attr,
 	return ret ? ret : count;
 }

-static void
-fcloop_nport_free(struct kref *ref)
-{
-	struct fcloop_nport *nport =
-		container_of(ref, struct fcloop_nport, ref);
-	unsigned long flags;
-
-	spin_lock_irqsave(&fcloop_lock, flags);
-	list_del(&nport->nport_list);
-	spin_unlock_irqrestore(&fcloop_lock, flags);
-
-	kfree(nport);
-}
-
-static void
-fcloop_nport_put(struct fcloop_nport *nport)
-{
-	kref_put(&nport->ref, fcloop_nport_free);
-}
-
-static int
-fcloop_nport_get(struct fcloop_nport *nport)
-{
-	return kref_get_unless_zero(&nport->ref);
-}
-
 static struct fcloop_nport *
 fcloop_alloc_nport(const char *buf, size_t count, bool remoteport)
 {
@@ -938,6 +935,7 @@ fcloop_create_remote_port(struct device *dev, struct device_attribute *attr,
 	if (!nport)
 		return -EIO;

+	memset(&pinfo, 0, sizeof(pinfo));
 	pinfo.node_name = nport->node_name;
 	pinfo.port_name = nport->port_name;
 	pinfo.port_role = nport->port_role;
@@ -979,24 +977,12 @@ __unlink_remote_port(struct fcloop_nport *nport)
 }

 static int
-__wait_remoteport_unreg(struct fcloop_nport *nport, struct fcloop_rport *rport)
+__remoteport_unreg(struct fcloop_nport *nport, struct fcloop_rport *rport)
 {
-	int ret;
-
 	if (!rport)
 		return -EALREADY;

-	init_completion(&nport->rport_unreg_done);
-
-	ret = nvme_fc_unregister_remoteport(rport->remoteport);
-	if (ret)
-		return ret;
-
-	wait_for_completion(&nport->rport_unreg_done);
-
-	fcloop_nport_put(nport);
-
-	return ret;
+	return nvme_fc_unregister_remoteport(rport->remoteport);
 }

 static ssize_t
@@ -1029,7 +1015,7 @@ fcloop_delete_remote_port(struct device *dev, struct device_attribute *attr,
 	if (!nport)
 		return -ENOENT;

-	ret = __wait_remoteport_unreg(nport, rport);
+	ret = __remoteport_unreg(nport, rport);

 	return ret ? ret : count;
 }
@@ -1086,24 +1072,12 @@ __unlink_target_port(struct fcloop_nport *nport)
 }

 static int
-__wait_targetport_unreg(struct fcloop_nport *nport, struct fcloop_tport *tport)
+__targetport_unreg(struct fcloop_nport *nport, struct fcloop_tport *tport)
 {
-	int ret;
-
 	if (!tport)
 		return -EALREADY;

-	init_completion(&nport->tport_unreg_done);
-
-	ret = nvmet_fc_unregister_targetport(tport->targetport);
-	if (ret)
-		return ret;
-
-	wait_for_completion(&nport->tport_unreg_done);
-
-	fcloop_nport_put(nport);
-
-	return ret;
+	return nvmet_fc_unregister_targetport(tport->targetport);
 }

 static ssize_t
@@ -1136,7 +1110,7 @@ fcloop_delete_target_port(struct device *dev, struct device_attribute *attr,
 	if (!nport)
 		return -ENOENT;

-	ret = __wait_targetport_unreg(nport, tport);
+	ret = __targetport_unreg(nport, tport);

 	return ret ? ret : count;
 }
@@ -1223,11 +1197,11 @@ static void __exit fcloop_exit(void)

 		spin_unlock_irqrestore(&fcloop_lock, flags);

-		ret = __wait_targetport_unreg(nport, tport);
+		ret = __targetport_unreg(nport, tport);
 		if (ret)
 			pr_warn("%s: Failed deleting target port\n", __func__);

-		ret = __wait_remoteport_unreg(nport, rport);
+		ret = __remoteport_unreg(nport, rport);
 		if (ret)
 			pr_warn("%s: Failed deleting remote port\n", __func__);

diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 7d261ab894f4..7b8e20adf760 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -74,6 +74,7 @@ struct nvmet_sq {
	struct percpu_ref	ref;
	u16			qid;
	u16			size;
+	u16			sqhd;
	struct completion	free_done;
	struct completion	confirm_done;
 };
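For readers following the nvmet target hunks above: the Connect command carries a 0's-based SQSIZE, the target now stores it as a 1's-based queue depth, and every completion advances a per-queue sq_head modulo that size. The stand-alone C sketch below models only that arithmetic; the struct and function names are illustrative and are not the kernel's.

#include <stdint.h>
#include <stdio.h>

/* illustrative stand-in for the relevant fields of struct nvmet_sq */
struct sketch_sq {
	uint16_t qid;
	uint16_t size;	/* 1's-based number of entries */
	uint16_t sqhd;	/* submission queue head reported in completions */
};

static void sketch_sq_setup(struct sketch_sq *sq, uint16_t qid,
			    uint16_t sqsize_zeroes_based)
{
	sq->qid = qid;
	sq->size = sqsize_zeroes_based + 1;	/* 0's-based -> 1's-based */
	sq->sqhd = 0;
}

static unsigned int sketch_complete_one(struct sketch_sq *sq)
{
	/* mirrors: sq->sqhd = (sq->sqhd + 1) % sq->size */
	if (sq->size)
		sq->sqhd = (sq->sqhd + 1) % sq->size;
	return sq->sqhd;	/* value placed in the CQE sq_head field */
}

int main(void)
{
	struct sketch_sq sq;
	int i;

	sketch_sq_setup(&sq, 1, 31);	/* SQSIZE=31 means 32 entries */
	for (i = 0; i < 40; i++)
		printf("completion %2d -> sq_head %u\n", i,
		       sketch_complete_one(&sq));
	return 0;
}

With 32 entries the printed sq_head wraps from 31 back to 0, which is the wraparound the patch needs the per-queue size for.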
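The nvme-fc host hunk above formats the SQE DPTR as a Transport SGL Data Block Descriptor: descriptor type 0x5 in the high nibble and transport-specific subtype 0xA in the low nibble of the type byte, matching (NVME_TRANSPORT_SGL_DATA_DESC << 4) | NVME_SGL_FMT_TRANSPORT_A and the values spelled out in the patch comment. A small stand-alone sketch of that packing, using illustrative constant names rather than the kernel's:

#include <stdint.h>
#include <stdio.h>

#define SKETCH_TRANSPORT_SGL_DATA_DESC	0x5	/* descriptor type nibble */
#define SKETCH_SGL_FMT_TRANSPORT_A	0xA	/* transport-specific subtype nibble */

int main(void)
{
	/* same shift-and-or the patch uses to build sgl.type */
	uint8_t type = (SKETCH_TRANSPORT_SGL_DATA_DESC << 4) |
		       SKETCH_SGL_FMT_TRANSPORT_A;

	printf("sgl.type byte = 0x%02x (type 0x%x, subtype 0x%x)\n",
	       type, type >> 4, type & 0xf);
	return 0;
}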