Message ID: 20170925144949.GP29668@kernel.org (mailing list archive)
State: Not Applicable
Em Mon, Sep 25, 2017 at 11:49:49AM -0300, Arnaldo Carvalho de Melo escreveu:
> I'm trying to get an Infiniband test case working with the RT
> kernel, and ended up tripping over this case:
>
> In drivers/infiniband/hw/hfi1/pio.c sc_buffer_alloc() disables
> preemption that will be reenabled by either pio_copy() or
> seg_pio_copy_end().
>
> But before disabling preemption it grabs a spin lock that will
> be dropped after it disables preemption, which ends up triggering a
> warning in migrate_disable() later on.
>
>   spin_lock_irqsave(&sc->alloc_lock)
>     migrate_disable()   ++p->migrate_disable -> 2
>     preempt_disable()
>   spin_unlock_irqrestore(&sc->alloc_lock)
>     migrate_enable()    in_atomic(), so just returns, migrate_disable stays at 2
>   spin_lock_irqsave(some other lock) -> b00m

Sebastian, perhaps use local locks like you did for the random.c case? I'll
study to see if that is possible... I mean this patch:

https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/commit/drivers/char/random.c?h=linux-4.11.y-rt&id=4bed11300e24d5178829758e535cc4996490b7c8

-------------
random: avoid preempt_disable()ed section

extract_crng() will use sleeping locks while in a preempt_disable()
section due to get_cpu_var(). Work around it with local_locks.

Cc: stable-rt@vger.kernel.org # where it applies to
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
-------------

- Arnaldo
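For readers following along: the random.c change referenced above is built on
the local-lock helpers from the -rt patch's <linux/locallock.h>. A minimal
sketch of that pattern, with placeholder names rather than the actual contents
of the commit:

#include <linux/locallock.h>	/* provided by the -rt patch */
#include <linux/percpu.h>

struct my_percpu_state { int count; };			/* placeholder */
static DEFINE_PER_CPU(struct my_percpu_state, my_state);	/* placeholder */
static DEFINE_LOCAL_IRQ_LOCK(my_state_lock);			/* placeholder */

static void touch_percpu_state(void)
{
	struct my_percpu_state *s;

	/* was: s = &get_cpu_var(my_state);  -- disables preemption */
	s = &get_locked_var(my_state_lock, my_state);
	s->count++;
	/*
	 * On PREEMPT_RT this section may sleep but stays on this CPU;
	 * on !RT it degrades to a plain preempt_disable() section.
	 */
	/* was: put_cpu_var(my_state); */
	put_locked_var(my_state_lock, my_state);
}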
On Mon, 25 Sep 2017 11:49:49 -0300
Arnaldo Carvalho de Melo <acme@kernel.org> wrote:

> Hi,
>
> I'm trying to get an Infiniband test case working with the RT
> kernel, and ended up tripping over this case:
>
> In drivers/infiniband/hw/hfi1/pio.c sc_buffer_alloc() disables
> preemption that will be reenabled by either pio_copy() or
> seg_pio_copy_end().
>
> But before disabling preemption it grabs a spin lock that will
> be dropped after it disables preemption, which ends up triggering a
> warning in migrate_disable() later on.
>
>   spin_lock_irqsave(&sc->alloc_lock)
>     migrate_disable()   ++p->migrate_disable -> 2
>     preempt_disable()
>   spin_unlock_irqrestore(&sc->alloc_lock)
>     migrate_enable()    in_atomic(), so just returns, migrate_disable stays at 2
>   spin_lock_irqsave(some other lock) -> b00m
>
> And the WARN_ON code ends up tripping over this over and over in
> log_store().
>
> Sequence captured via ftrace_dump_on_oops + crash utility 'dmesg'
> command.
>
> [512258.613862] sm-3297 16 .....11 359465349134644: sc_buffer_alloc <-hfi1_verbs_send_pio
> [512258.613876] sm-3297 16 .....11 359465349134719: migrate_disable <-sc_buffer_alloc
> [512258.613890] sm-3297 16 .....12 359465349134798: rt_spin_lock <-sc_buffer_alloc
> [512258.613903] sm-3297 16 ....112 359465349135481: rt_spin_unlock <-sc_buffer_alloc
> [512258.613916] sm-3297 16 ....112 359465349135556: migrate_enable <-sc_buffer_alloc
> [512258.613935] sm-3297 16 ....112 359465349135788: seg_pio_copy_start <-hfi1_verbs_send_pio
> [512258.613954] sm-3297 16 ....112 359465349136273: update_sge <-hfi1_verbs_send_pio
> [512258.613981] sm-3297 16 ....112 359465349136373: seg_pio_copy_mid <-hfi1_verbs_send_pio
> [512258.613999] sm-3297 16 ....112 359465349136873: update_sge <-hfi1_verbs_send_pio
> [512258.614017] sm-3297 16 ....112 359465349136956: seg_pio_copy_mid <-hfi1_verbs_send_pio
> [512258.614035] sm-3297 16 ....112 359465349137221: seg_pio_copy_end <-hfi1_verbs_send_pio
> [512258.614048] sm-3297 16 .....12 359465349137360: migrate_disable <-hfi1_verbs_send_pio
> [512258.614065] sm-3297 16 .....12 359465349137476: warn_slowpath_null <-migrate_disable
> [512258.614081] sm-3297 16 .....12 359465349137564: __warn <-warn_slowpath_null
> [512258.614088] sm-3297 16 .....12 359465349137958: printk <-__warn
> [512258.614096] sm-3297 16 .....12 359465349138055: vprintk_default <-printk
> [512258.614104] sm-3297 16 .....12 359465349138144: vprintk_emit <-vprintk_default
> [512258.614111] sm-3297 16 d....12 359465349138312: _raw_spin_lock <-vprintk_emit
> [512258.614119] sm-3297 16 d...112 359465349138789: log_store <-vprintk_emit
> [512258.614127] sm-3297 16 .....12 359465349139068: migrate_disable <-vprintk_emit
>
> I'm wondering if turning this sc->alloc_lock to a raw_spin_lock is the
> right solution, which I'm afraid it's not, as there are places where it
> is held and then the code goes on to grab other non-raw spinlocks...

No, the correct solution is to convert the preempt_disable() into a
local_lock(), which will be a preempt_disable() when PREEMPT_RT is not
set. Look for other patches that convert preempt_disable() into
local_lock()s for examples.

-- Steve

>
> I got this patch in my test branch and it makes the test case go further
> before splatting on other problems with infiniband + PREEMPT_RT_FULL,
> but as I said, I fear it's not the right solution, ideas?
>
> The kernel I'm seeing this on is RHEL's + the PREEMPT_RT_FULL patch:
>
> Linux version 3.10.0-709.rt56.636.test.el7.x86_64 (acme@seventh) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-16) (GCC) ) #1 SMP PREEMPT RT Wed Sep 20 18:04:55 -03 2017
>
> I will try and build with the latest PREEMPT_RT_FULL patch, but the
> infiniband codebase in RHEL seems to be up to what is upstream and
> I just looked at patches-4.11.12-rt14/add_migrate_disable.patch and that
> WARN_ON_ONCE(p->migrate_disable_atomic) is still there :-\
>
> - Arnaldo
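A rough sketch of what Steve's suggestion could look like for the
sc_buffer_alloc() path, where the critical section is opened in one function
and closed in another; every name below except local_lock()/local_unlock() is
a hypothetical stand-in, not the real hfi1 code:

#include <linux/locallock.h>	/* -rt patch only */
#include <linux/percpu.h>

struct example_sc { int __percpu *buffers_allocated; };	/* stand-in */

static DEFINE_LOCAL_IRQ_LOCK(example_buffer_lock);	/* hypothetical */

/* opened here (stands in for sc_buffer_alloc())... */
static void example_buffer_alloc(struct example_sc *sc)
{
	/* was: preempt_disable(); */
	local_lock(example_buffer_lock);
	this_cpu_inc(*sc->buffers_allocated);
	/* buffer handed back to the caller for the MMIO copy */
}

/* ...and closed after the copy (stands in for pio_copy()/seg_pio_copy_end()) */
static void example_copy_end(struct example_sc *sc)
{
	this_cpu_dec(*sc->buffers_allocated);
	/* was: preempt_enable(); */
	local_unlock(example_buffer_lock);
}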
On Mon, Sep 25, 2017 at 11:49:49AM -0300, Arnaldo Carvalho de Melo wrote:
> Hi,
>
> I'm trying to get an Infiniband test case working with the RT
> kernel, and ended up tripping over this case:
>
> In drivers/infiniband/hw/hfi1/pio.c sc_buffer_alloc() disables
> preemption that will be reenabled by either pio_copy() or
> seg_pio_copy_end().

Perhaps it's worth revisiting why this design decision was chosen in the
first place?  The introducing commit, a054374f15428cbc1
("staging/rdma/hfi1: convert buffers allocated atomic to per cpu")
mentions the previous atomic as being expensive, but the usage of
preempt_disable()/preempt_enable() is also expensive, albeit in a
different way.

My first question would be: is disabling preemption here really
necessary?

AFAICT, preemption is only disabled to protect the access of the
'buffers_allocated' percpu counters, nothing else.

However, because these counters are only observed in aggregate, across
CPUs (see get_buffers_allocated()), the sum will be the same regardless
of whether a thread is migrated between a this_cpu_inc(buffers_allocated)
and a this_cpu_dec(buffers_allocated).

If that's the case, perhaps the fix is as easy as removing the
preempt_disable() and preempt_enable()?

   Julia

PS. Perhaps a longer term approach would be to investigate whether the
usage of percpu_ref or percpu_counter is a better fit for this driver.
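The aggregate read Julia is referring to looks roughly like the sketch below
(paraphrased from pio.c; details may differ):

/* sum the per-CPU counters; only this total is ever inspected */
static u32 get_buffers_allocated(struct send_context *sc)
{
	int cpu;
	u32 ret = 0;

	for_each_possible_cpu(cpu)
		ret += *per_cpu_ptr(sc->buffers_allocated, cpu);
	return ret;
}

Because an increment on one CPU and the matching decrement on another still
cancel in this sum, dropping the preempt_disable()/preempt_enable() pair would
not change what the reset path eventually observes.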
Em Mon, Sep 25, 2017 at 04:15:28PM -0400, Steven Rostedt escreveu:
> On Mon, 25 Sep 2017 11:49:49 -0300
> Arnaldo Carvalho de Melo <acme@kernel.org> wrote:
> > I'm wondering if turning this sc->alloc_lock to a raw_spin_lock is the
> > right solution, which I'm afraid it's not, as there are places where it
> > is held and then the code goes on to grab other non-raw spinlocks...
>
> No, the correct solution is to convert the preempt_disable() into a
> local_lock(), which will be a preempt_disable() when PREEMPT_RT is not
> set. Look for other patches that convert preempt_disable() into
> local_lock()s for examples.

Thanks, I had seen a patch by Sebastian for random.c doing that,
will continue in that direction.

- Arnaldo
> > Hi,
> >
> > I'm trying to get an Infiniband test case working with the RT
> > kernel, and ended up tripping over this case:
> >
> > In drivers/infiniband/hw/hfi1/pio.c sc_buffer_alloc() disables
> > preemption that will be reenabled by either pio_copy() or
> > seg_pio_copy_end().
>
> Perhaps it's worth revisiting why this design decision was chosen in the
> first place?  The introducing commit, a054374f15428cbc1
> ("staging/rdma/hfi1: convert buffers allocated atomic to per cpu")
> mentions the previous atomic as being expensive, but the usage of
> preempt_disable()/preempt_enable() is also expensive, albeit in a
> different way.
>
> My first question would be: is disabling preemption here really
> necessary?
>

The motivation for the preempt manipulation has nothing to do with
the counter.

The buffer allocation returns a set of ringed write-combining 64B
MMIO buffers.  The allocate lock is then dropped right after
allocation to allow parallel allocates.

The hardware keeps track of the buffer fill across multiple CPUs, so
that after the oldest buffer is copied the ring is advanced.

The idea was to minimize the time from the drop of the allocate lock
to the end of the copy to ensure the highest rate of copy to the ring
from multiple cores.

> AFAICT, preemption is only disabled to protect the access of the
> 'buffers_allocated' percpu counters, nothing else.
>
> However, because these counters are only observed in aggregate, across
> CPUs (see get_buffers_allocated()), the sum will be the same regardless
> of whether a thread is migrated between a this_cpu_inc(buffers_allocated)
> and a this_cpu_dec(buffers_allocated).
>

The counter exists purely as a hot path optimization to avoid an
atomic.

The only time an accurate counter is required is during reset
sequences to wait for allocate/copy pairs to finish, and we don't care
if the closing decrement is on the same core.

The percpu refcount might be a better choice, but I don't think it
existed at the time the code was written.

> If that's the case, perhaps the fix is as easy as removing the
> preempt_disable() and preempt_enable()?
>

That would certainly be the easiest solution, but would compromise
performance if a migration occurs after the allocate lock is dropped.

We have been looking at the patch that was submitted for using the
raw spin lock and we share Arnaldo's concern, particularly mixing
lock types.

Other locks can be procured in the timeout case in
sc_wait_for_packet_egress() inside queue_work() and
dd_dev_err()/printk().

Mike
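For reference, the percpu refcount Mike mentions is the upstream percpu_ref
API; a minimal sketch of how it is typically used (illustrative only, not a
proposed patch for this driver):

#include <linux/percpu-refcount.h>
#include <linux/completion.h>

static struct percpu_ref buffers_in_flight;		/* hypothetical */
static DECLARE_COMPLETION(buffers_drained);

static void buffers_released(struct percpu_ref *ref)
{
	complete(&buffers_drained);
}

static int buffers_setup(void)
{
	return percpu_ref_init(&buffers_in_flight, buffers_released,
			       0, GFP_KERNEL);
}

/* hot path: per-CPU get/put, no shared atomic in the common case */
static void buffer_alloc_and_copy(void)
{
	percpu_ref_get(&buffers_in_flight);
	/* ... allocate buffer, do the MMIO copy ... */
	percpu_ref_put(&buffers_in_flight);
}

/* reset path: switch to atomic mode and wait for outstanding pairs */
static void wait_for_buffers(void)
{
	percpu_ref_kill(&buffers_in_flight);	/* drops the initial ref */
	wait_for_completion(&buffers_drained);
}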
On Tue, 26 Sep 2017 14:10:52 +0000
"Marciniszyn, Mike" <mike.marciniszyn@intel.com> wrote:

> > My first question would be: is disabling preemption here really
> > necessary?
> >
>
> The motivation for the preempt manipulation has nothing to do with
> the counter.
>
> The buffer allocation returns a set of ringed write-combining 64B
> MMIO buffers.  The allocate lock is then dropped right after
> allocation to allow parallel allocates.
>
> The hardware keeps track of the buffer fill across multiple CPUs, so
> that after the oldest buffer is copied the ring is advanced.
>
> The idea was to minimize the time from the drop of the allocate lock
> to the end of the copy to ensure the highest rate of copy to the ring
> from multiple cores.
>
> > AFAICT, preemption is only disabled to protect the access of the
> > 'buffers_allocated' percpu counters, nothing else.
> >
> > However, because these counters are only observed in aggregate,
> > across CPUs (see get_buffers_allocated()), the sum will be the
> > same regardless of whether a thread is migrated between a
> > this_cpu_inc(buffers_allocated) and a
> > this_cpu_dec(buffers_allocated).
>
> The counter exists purely as a hot path optimization to avoid an
> atomic.
>
> The only time an accurate counter is required is during reset
> sequences to wait for allocate/copy pairs to finish, and we don't care
> if the closing decrement is on the same core.
>
> The percpu refcount might be a better choice, but I don't think it
> existed at the time the code was written.
>
> > If that's the case, perhaps the fix is as easy as removing the
> > preempt_disable() and preempt_enable()?
> >
>
> That would certainly be the easiest solution, but would compromise
> performance if a migration occurs after the allocate lock is dropped.

local_lock() disables migration. It's currently only in the -rt patch.
It's made to allow preemption to occur as much as possible, as -rt
cares about reaction time more than performance (although, we would
love to keep performance as well).

>
> We have been looking at the patch that was submitted for using the
> raw spin lock and we share Arnaldo's concern, particularly mixing
> lock types.
>
> Other locks can be procured in the timeout case in
> sc_wait_for_packet_egress() inside queue_work() and
> dd_dev_err()/printk().
>

A raw_spin_lock sacrifices reaction time, which goes against the goal
of -rt. When PREEMPT_RT is disabled, the local_lock becomes a
preempt_disable(), so it has no effect on the vanilla kernel.

-- Steve
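The behaviour Steve describes maps approximately onto the following
(paraphrased from the -rt patch's <linux/locallock.h>; the exact config symbol
and definitions should be checked against the patch itself):

#ifdef CONFIG_PREEMPT_RT_BASE
/*
 * A local lock is a per-CPU sleeping lock: it pins the task to the CPU
 * (migrate_disable()) but leaves preemption enabled.
 */
#else
/* without PREEMPT_RT the API collapses back to the old primitives */
#define local_lock(lvar)			preempt_disable()
#define local_unlock(lvar)			preempt_enable()
#define local_lock_irqsave(lvar, flags)		local_irq_save(flags)
#define local_unlock_irqrestore(lvar, flags)	local_irq_restore(flags)
#endif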
> > No, the correct solution is to convert the preempt_disable() into a
> > local_lock(), which will be a preempt_disable() when PREEMPT_RT is not
> > set. Look for other patches that convert preempt_disable() into
> > local_lock()s for examples.
>
> Thanks, I had seen a patch by Sebastian for random.c doing that,
> will continue in that direction.
>

I'm assuming you want to end up with a common driver code base?

Will there eventually be a wrapper upstream to allow that?

Or will the rt patch handle those issues?

Mike
Em Tue, Sep 26, 2017 at 05:55:45PM +0000, Marciniszyn, Mike escreveu:
> > > No, the correct solution is to convert the preempt_disable() into a
> > > local_lock(), which will be a preempt_disable() when PREEMPT_RT is not
> > > set. Look for other patches that convert preempt_disable() into
> > > local_lock()s for examples.
> >
> > Thanks, I had seen a patch by Sebastian for random.c doing that,
> > will continue in that direction.
>
> I'm assuming you want to end up with a common driver code base?
>
> Will there eventually be a wrapper upstream to allow that?
>
> Or will the rt patch handle those issues?

So what happens is that infrastructure in the rt patch eventually lands
upstream, then this difference ceases to exist.

Steven, are there plans for local locks to go upstream?

Well, back to trying to learn about them and use them in this case to see
how it ends up...

- Arnaldo
On Tue, Sep 26, 2017 at 02:10:52PM +0000, Marciniszyn, Mike wrote:
> > > Hi,
> > >
> > > I'm trying to get an Infiniband test case working with the RT
> > > kernel, and ended up tripping over this case:
> > >
> > > In drivers/infiniband/hw/hfi1/pio.c sc_buffer_alloc() disables
> > > preemption that will be reenabled by either pio_copy() or
> > > seg_pio_copy_end().
> >
> > Perhaps it's worth revisiting why this design decision was chosen in the
> > first place?  The introducing commit, a054374f15428cbc1
> > ("staging/rdma/hfi1: convert buffers allocated atomic to per cpu")
> > mentions the previous atomic as being expensive, but the usage of
> > preempt_disable()/preempt_enable() is also expensive, albeit in a
> > different way.
> >
> > My first question would be: is disabling preemption here really
> > necessary?
> >
>
> The motivation for the preempt manipulation has nothing to do with the counter.
>
> The buffer allocation returns a set of ringed write-combining 64B MMIO
> buffers.  The allocate lock is then dropped right after allocation to
> allow parallel allocates.
>
> The hardware keeps track of the buffer fill across multiple CPUs, so
> that after the oldest buffer is copied the ring is advanced.
>
> The idea was to minimize the time from the drop of the allocate lock
> to the end of the copy to ensure the highest rate of copy to the ring
> from multiple cores.

Okay, so to be clear, disabling preemption here is strictly a
performance consideration, not a correctness one?

> > If that's the case, perhaps the fix is as easy as removing the
> > preempt_disable() and preempt_enable()?
> >
>
> That would certainly be the easiest solution, but would compromise
> performance if a migration occurs after the allocate lock is dropped.

Given this response, I'm assuming that to be the case.

Disabling preemption to avoid "compromising performance" presumes a very
specific definition of "performance" that does not apply _in general_,
and certainly does not apply for RT users, who care more about
"reaction time" (as Steven put it).

Instead, the kernel already directly provides tools for _users_ to make
decisions about what the relative priorities of their workloads are,
namely scheduling parameters, CPU affinities, isolcpus, etc.  Why are
these not sufficient in this case?

In addition, the use of local locks is a waste here; they provide more
guarantees than are actually necessary: this driver will function fine
if more than two threads on the same CPU are within this "critical
section".

If you want to maintain preempt_disable()/_enable() in non-RT, then
whatever, use the preempt_disable_nort()/_enable_nort() varieties.

   Julia
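A sketch of Julia's _nort() suggestion at the two counter sites (names are
hypothetical stand-ins; the _nort() variants exist only in the -rt patch,
where they compile away on PREEMPT_RT and remain plain
preempt_disable()/preempt_enable() otherwise):

struct example_sc { int __percpu *buffers_allocated; };	/* stand-in */

/* allocation side (stands in for sc_buffer_alloc()) */
static void example_alloc_side(struct example_sc *sc)
{
	preempt_disable_nort();	/* preempt_disable() on !RT, no-op on RT */
	this_cpu_inc(*sc->buffers_allocated);
}

/* copy-completion side (stands in for pio_copy()/seg_pio_copy_end()) */
static void example_copy_done_side(struct example_sc *sc)
{
	this_cpu_dec(*sc->buffers_allocated);
	preempt_enable_nort();	/* preempt_enable() on !RT, no-op on RT */
}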
On Tue, 26 Sep 2017 15:28:00 -0300
Arnaldo Carvalho de Melo <acme@kernel.org> wrote:

> So what happens is that infrastructure in the rt patch eventually lands
> upstream, then this difference ceases to exist.
>
> Steven, are there plans for local locks to go upstream?
>

Yes, but they don't make sense unless we get PREEMPT_RT (sleeping
spinlocks) upstream. Otherwise, they will just be another name for
preempt_disable().

That said, there is some rationale for getting them upstream before
sleeping spinlocks, and that is to annotate what a preempt_disable()
is trying to protect.

-- Steve
On 2017-09-26 14:10:52 [+0000], Marciniszyn, Mike wrote:
> > My first question would be: is disabling preemption here really
> > necessary?
> >
>
> The motivation for the preempt manipulation has nothing to do with the counter.
>
> The buffer allocation returns a set of ringed write-combining 64B MMIO
> buffers.  The allocate lock is then dropped right after allocation to
> allow parallel allocates.
>
> The hardware keeps track of the buffer fill across multiple CPUs, so
> that after the oldest buffer is copied the ring is advanced.
>
> The idea was to minimize the time from the drop of the allocate lock
> to the end of the copy to ensure the highest rate of copy to the ring
> from multiple cores.

So on -RT we would like to get rid of that preempt_disable() section.
Do you have any numbers that say that this does any good? If so, by how
much, and how much does it differ compared to a !PREEMPT kernel, if I
may ask.

Sebastian
diff --git a/drivers/infiniband/hw/hfi1/pio.c b/drivers/infiniband/hw/hfi1/pio.c
index 615be68e40b3..8f28f8fe842d 100644
--- a/drivers/infiniband/hw/hfi1/pio.c
+++ b/drivers/infiniband/hw/hfi1/pio.c
@@ -744,7 +744,7 @@ struct send_context *sc_alloc(struct hfi1_devdata *dd, int type,
 	sc->dd = dd;
 	sc->node = numa;
 	sc->type = type;
-	spin_lock_init(&sc->alloc_lock);
+	raw_spin_lock_init(&sc->alloc_lock);
 	spin_lock_init(&sc->release_lock);
 	spin_lock_init(&sc->credit_ctrl_lock);
 	INIT_LIST_HEAD(&sc->piowait);
@@ -929,13 +929,13 @@ void sc_disable(struct send_context *sc)
 		return;
 
 	/* do all steps, even if already disabled */
-	spin_lock_irqsave(&sc->alloc_lock, flags);
+	raw_spin_lock_irqsave(&sc->alloc_lock, flags);
 	reg = read_kctxt_csr(sc->dd, sc->hw_context, SC(CTRL));
 	reg &= ~SC(CTRL_CTXT_ENABLE_SMASK);
 	sc->flags &= ~SCF_ENABLED;
 	sc_wait_for_packet_egress(sc, 1);
 	write_kctxt_csr(sc->dd, sc->hw_context, SC(CTRL), reg);
-	spin_unlock_irqrestore(&sc->alloc_lock, flags);
+	raw_spin_unlock_irqrestore(&sc->alloc_lock, flags);
 
 	/*
 	 * Flush any waiters. Once the context is disabled,
@@ -1232,7 +1232,7 @@ int sc_enable(struct send_context *sc)
 	 * worry about locking since the releaser will not do anything
 	 * if the context accounting values have not changed.
 	 */
-	spin_lock_irqsave(&sc->alloc_lock, flags);
+	raw_spin_lock_irqsave(&sc->alloc_lock, flags);
 	sc_ctrl = read_kctxt_csr(dd, sc->hw_context, SC(CTRL));
 	if ((sc_ctrl & SC(CTRL_CTXT_ENABLE_SMASK)))
 		goto unlock; /* already enabled */
@@ -1303,7 +1303,7 @@ int sc_enable(struct send_context *sc)
 	sc->flags |= SCF_ENABLED;
 
 unlock:
-	spin_unlock_irqrestore(&sc->alloc_lock, flags);
+	raw_spin_unlock_irqrestore(&sc->alloc_lock, flags);
 
 	return ret;
 }
@@ -1361,9 +1361,9 @@ void sc_stop(struct send_context *sc, int flag)
 	sc->flags |= flag;
 
 	/* stop buffer allocations */
-	spin_lock_irqsave(&sc->alloc_lock, flags);
+	raw_spin_lock_irqsave(&sc->alloc_lock, flags);
 	sc->flags &= ~SCF_ENABLED;
-	spin_unlock_irqrestore(&sc->alloc_lock, flags);
+	raw_spin_unlock_irqrestore(&sc->alloc_lock, flags);
 	wake_up(&sc->halt_wait);
 }
 
@@ -1391,9 +1391,9 @@ struct pio_buf *sc_buffer_alloc(struct send_context *sc, u32 dw_len,
 	int trycount = 0;
 	u32 head, next;
 
-	spin_lock_irqsave(&sc->alloc_lock, flags);
+	raw_spin_lock_irqsave(&sc->alloc_lock, flags);
 	if (!(sc->flags & SCF_ENABLED)) {
-		spin_unlock_irqrestore(&sc->alloc_lock, flags);
+		raw_spin_unlock_irqrestore(&sc->alloc_lock, flags);
 		goto done;
 	}
 
@@ -1402,7 +1402,7 @@ struct pio_buf *sc_buffer_alloc(struct send_context *sc, u32 dw_len,
 	if (blocks > avail) {
 		/* not enough room */
 		if (unlikely(trycount)) { /* already tried to get more room */
-			spin_unlock_irqrestore(&sc->alloc_lock, flags);
+			raw_spin_unlock_irqrestore(&sc->alloc_lock, flags);
 			goto done;
 		}
 		/* copy from receiver cache line and recalculate */
@@ -1458,7 +1458,7 @@ struct pio_buf *sc_buffer_alloc(struct send_context *sc, u32 dw_len,
 	 */
 	smp_wmb();
 	sc->sr_head = next;
-	spin_unlock_irqrestore(&sc->alloc_lock, flags);
+	raw_spin_unlock_irqrestore(&sc->alloc_lock, flags);
 
 	/* finish filling in the buffer outside the lock */
 	pbuf->start = sc->base_addr + fill_wrap * PIO_BLOCK_SIZE;
diff --git a/drivers/infiniband/hw/hfi1/pio.h b/drivers/infiniband/hw/hfi1/pio.h
index 867e5ffc3595..06dfc6f81fd5 100644
--- a/drivers/infiniband/hw/hfi1/pio.h
+++ b/drivers/infiniband/hw/hfi1/pio.h
@@ -112,7 +112,7 @@ struct send_context {
 	u8  group;		/* credit return group */
 
 	/* allocator fields */
-	spinlock_t alloc_lock ____cacheline_aligned_in_smp;
+	raw_spinlock_t alloc_lock ____cacheline_aligned_in_smp;
 	u32 sr_head;		/* shadow ring head */
 	unsigned long fill;	/* official alloc count */
 	unsigned long alloc_free;	/* copy of free (less cache thrash) */