Message ID | 20250107140004.2732830-1-memxor@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | Resilient Queued Spin Lock | expand |
On Tue, 7 Jan 2025 at 06:00, Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote: > > This patch set introduces Resilient Queued Spin Lock (or rqspinlock with > res_spin_lock() and res_spin_unlock() APIs). So when I see people doing new locking mechanisms, I invariably go "Oh no!". But this series seems reasonable to me. I see that PeterZ had a couple of minor comments (well, the arm64 one is more fundamental), which hopefully means that it seems reasonable to him too. Peter? That said, it would be lovely if Waiman and Will would also take a look. Perhaps Will in particular, considering Peter's point about smp_cond_load_acquire() on arm64. And it looks like Will wasn't cc'd on the series. Added. Will? See https://lore.kernel.org/all/20250107140004.2732830-1-memxor@gmail.com/ for the series. Linus
On Tue, Jan 07, 2025 at 03:54:36PM -0800, Linus Torvalds wrote: > On Tue, 7 Jan 2025 at 06:00, Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote: > > > > This patch set introduces Resilient Queued Spin Lock (or rqspinlock with > > res_spin_lock() and res_spin_unlock() APIs). > > So when I see people doing new locking mechanisms, I invariably go "Oh no!". > > But this series seems reasonable to me. I see that PeterZ had a couple > of minor comments (well, the arm64 one is more fundamental), which > hopefully means that it seems reasonable to him too. Peter? I've not had time to fully read the whole thing yet, I only did a quick once over. I'll try and get around to doing a proper reading eventually, but I'm chasing a regression atm, and then I need to go review a ton of code Andrew merged over the xmas/newyears holiday :/ One potential issue is that qspinlock isn't suitable for all architectures -- and I've yet to figure out widely BPF is planning on using this. Notably qspinlock is ineffective (as in way over engineered) for architectures that do not provide hardware level progress guarantees on competing atomics and qspinlock uses mixed sized atomics, which are typically under specified, architecturally. Another issue is the code duplication. Anyway, I'll get to it eventually...
On Wed, 8 Jan 2025 at 14:48, Peter Zijlstra <peterz@infradead.org> wrote: > > On Tue, Jan 07, 2025 at 03:54:36PM -0800, Linus Torvalds wrote: > > On Tue, 7 Jan 2025 at 06:00, Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote: > > > > > > This patch set introduces Resilient Queued Spin Lock (or rqspinlock with > > > res_spin_lock() and res_spin_unlock() APIs). > > > > So when I see people doing new locking mechanisms, I invariably go "Oh no!". > > > > But this series seems reasonable to me. I see that PeterZ had a couple > > of minor comments (well, the arm64 one is more fundamental), which > > hopefully means that it seems reasonable to him too. Peter? > > I've not had time to fully read the whole thing yet, I only did a quick > once over. I'll try and get around to doing a proper reading eventually, > but I'm chasing a regression atm, and then I need to go review a ton of > code Andrew merged over the xmas/newyears holiday :/ > > One potential issue is that qspinlock isn't suitable for all > architectures -- and I've yet to figure out widely BPF is planning on > using this. For architectures where qspinlock is not available, I think we can have a fallback to a test and set lock with timeout and deadlock checks, like patch 12. We plan on using this in BPF core and BPF maps, so the usage will be pervasive, and we have atleast one architecture in CI (s390) which doesn't have ARCH_USER_QUEUED_SPINLOCK selected, so we should have coverage for both cases. For now the fallback is missing, but I will add one in v2. > Notably qspinlock is ineffective (as in way over engineered) > for architectures that do not provide hardware level progress guarantees > on competing atomics and qspinlock uses mixed sized atomics, which are > typically under specified, architecturally. Yes, we also noticed during development that try_cmpxchg_tail (in patch 9) couldn't rely on 16-bit cmpxchg being available everywhere (I think the build broke on arm64), unlike 16-bit xchg which is used in xchg_tail, but otherwise we should be using 32-bit atomics or relying on mixed sized atomics similar to qspinlock. > > Another issue is the code duplication. I agree that this isn't ideal, but IMO it would be too ugly to ifdef parts of qspinlock slow path to accommodate rqspinlock logic, and it will get harder to reason about. Plus there's distinct return types for both slow paths, which means if we combine them we end up with the normal qspinlock returning a value, which isn't very meaningful. We can probably discuss more code sharing possibilities through common inline functions to minimize duplication though. > > Anyway, I'll get to it eventually...
On Wed, 8 Jan 2025 at 12:13, Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote: > > Yes, we also noticed during development that try_cmpxchg_tail (in > patch 9) couldn't rely on 16-bit cmpxchg being available everywhere I think that's purely a "we have had no use for it" issue. A 16-bit cmpxchg can always be written using a larger size, and we did that for 8-bit ones for RCU. See commit d4e287d7caff ("rcu-tasks: Remove open-coded one-byte cmpxchg() emulation") which switched RCU over to use a "native" 8-bit cmpxchg, because Paul had added the capability to all architectures, sometimes using a bigger size and "emulating" it: a88d970c8bb5 ("lib: Add one-byte emulation function"). In fact, I think that series added a couple of 16-bit cases too, but I actually went "if we have no users, don't bother". Linus
On Thu, 9 Jan 2025 at 02:00, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Wed, 8 Jan 2025 at 12:13, Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote: > > > > Yes, we also noticed during development that try_cmpxchg_tail (in > > patch 9) couldn't rely on 16-bit cmpxchg being available everywhere > > I think that's purely a "we have had no use for it" issue. > > A 16-bit cmpxchg can always be written using a larger size, and we did > that for 8-bit ones for RCU. > > See commit d4e287d7caff ("rcu-tasks: Remove open-coded one-byte > cmpxchg() emulation") which switched RCU over to use a "native" 8-bit > cmpxchg, because Paul had added the capability to all architectures, > sometimes using a bigger size and "emulating" it: a88d970c8bb5 ("lib: > Add one-byte emulation function"). > > In fact, I think that series added a couple of 16-bit cases too, but I > actually went "if we have no users, don't bother". I see, that makes sense. I don't think we have a pressing need for it, so it should be fine as is. I initially used it because comparing other bits wasn't necessary when we only needed to reset the tail back to 0, but we would fall back to 32-bit cmpxchg in case of NR_CPUS > 16k anyway, since the tail is > 16-bits in that config. > > Linus
On Wed, Jan 08, 2025 at 12:30:27PM -0800, Linus Torvalds wrote: > On Wed, 8 Jan 2025 at 12:13, Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote: > > > > Yes, we also noticed during development that try_cmpxchg_tail (in > > patch 9) couldn't rely on 16-bit cmpxchg being available everywhere > > I think that's purely a "we have had no use for it" issue. > > A 16-bit cmpxchg can always be written using a larger size, and we did > that for 8-bit ones for RCU. > > See commit d4e287d7caff ("rcu-tasks: Remove open-coded one-byte > cmpxchg() emulation") which switched RCU over to use a "native" 8-bit > cmpxchg, because Paul had added the capability to all architectures, > sometimes using a bigger size and "emulating" it: a88d970c8bb5 ("lib: > Add one-byte emulation function"). Glad you liked it. ;-) > In fact, I think that series added a couple of 16-bit cases too, but I > actually went "if we have no users, don't bother". Not only that, there were still architectures supported by the Linux kernel that lacked 16-bit store instructions. Although this does not make 16-bit emulation useless, it does give it some nasty sharp edges in the form of compilers turning those 16-bit stores into non-atomic RMW instructions. Or tearing them into 8-bit stores. So yes, I dropped 16-bit emulated cmpxchg() from later versions of that patch series. When support for those architectures are dropped, I would be happy to do the honors for 16-bit cmpxchg() emulation. Or to review someone else's doing the honors, for that matter. ;-) Thanx, Paul
On 1/8/25 3:12 PM, Kumar Kartikeya Dwivedi wrote: > On Wed, 8 Jan 2025 at 14:48, Peter Zijlstra <peterz@infradead.org> wrote: >> On Tue, Jan 07, 2025 at 03:54:36PM -0800, Linus Torvalds wrote: >>> On Tue, 7 Jan 2025 at 06:00, Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote: >>>> This patch set introduces Resilient Queued Spin Lock (or rqspinlock with >>>> res_spin_lock() and res_spin_unlock() APIs). >>> So when I see people doing new locking mechanisms, I invariably go "Oh no!". >>> >>> But this series seems reasonable to me. I see that PeterZ had a couple >>> of minor comments (well, the arm64 one is more fundamental), which >>> hopefully means that it seems reasonable to him too. Peter? >> I've not had time to fully read the whole thing yet, I only did a quick >> once over. I'll try and get around to doing a proper reading eventually, >> but I'm chasing a regression atm, and then I need to go review a ton of >> code Andrew merged over the xmas/newyears holiday :/ >> >> One potential issue is that qspinlock isn't suitable for all >> architectures -- and I've yet to figure out widely BPF is planning on >> using this. > For architectures where qspinlock is not available, I think we can > have a fallback to a test and set lock with timeout and deadlock > checks, like patch 12. > We plan on using this in BPF core and BPF maps, so the usage will be > pervasive, and we have atleast one architecture in CI (s390) which > doesn't have ARCH_USER_QUEUED_SPINLOCK selected, so we should have > coverage for both cases. For now the fallback is missing, but I will > add one in v2. Event though ARCH_USE_QUEUED_SPINLOCK isn't set for s390, it is actually using its own variant of qspinlock which encodes in the lock word additional information needed by the architecture. Similary for PPC. Cheers, Longman
On Thu, 9 Jan 2025 at 19:29, Waiman Long <llong@redhat.com> wrote: > > On 1/8/25 3:12 PM, Kumar Kartikeya Dwivedi wrote: > > On Wed, 8 Jan 2025 at 14:48, Peter Zijlstra <peterz@infradead.org> wrote: > >> On Tue, Jan 07, 2025 at 03:54:36PM -0800, Linus Torvalds wrote: > >>> On Tue, 7 Jan 2025 at 06:00, Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote: > >>>> This patch set introduces Resilient Queued Spin Lock (or rqspinlock with > >>>> res_spin_lock() and res_spin_unlock() APIs). > >>> So when I see people doing new locking mechanisms, I invariably go "Oh no!". > >>> > >>> But this series seems reasonable to me. I see that PeterZ had a couple > >>> of minor comments (well, the arm64 one is more fundamental), which > >>> hopefully means that it seems reasonable to him too. Peter? > >> I've not had time to fully read the whole thing yet, I only did a quick > >> once over. I'll try and get around to doing a proper reading eventually, > >> but I'm chasing a regression atm, and then I need to go review a ton of > >> code Andrew merged over the xmas/newyears holiday :/ > >> > >> One potential issue is that qspinlock isn't suitable for all > >> architectures -- and I've yet to figure out widely BPF is planning on > >> using this. > > For architectures where qspinlock is not available, I think we can > > have a fallback to a test and set lock with timeout and deadlock > > checks, like patch 12. > > We plan on using this in BPF core and BPF maps, so the usage will be > > pervasive, and we have atleast one architecture in CI (s390) which > > doesn't have ARCH_USER_QUEUED_SPINLOCK selected, so we should have > > coverage for both cases. For now the fallback is missing, but I will > > add one in v2. > > Event though ARCH_USE_QUEUED_SPINLOCK isn't set for s390, it is actually > using its own variant of qspinlock which encodes in the lock word > additional information needed by the architecture. Similary for PPC. Thanks, I see that now. It seems it is pretty similar to the paravirt scenario, where the algorithm would require changes to accommodate rqspinlock bits. For this series, I am planning to stick to a default TAS fallback, but we can tackle these cases together in a follow up. This series is already quite big and it would be better to focus on the base rqspinlock bits to keep things reviewable. Given we're only using this in BPF right now (in specific places where we're mindful we may fall back to TAS on some arches), we won't be regressing any other users. > > Cheers, > Longman >
On 1/9/25 4:13 PM, Kumar Kartikeya Dwivedi wrote: > On Thu, 9 Jan 2025 at 19:29, Waiman Long <llong@redhat.com> wrote: >> On 1/8/25 3:12 PM, Kumar Kartikeya Dwivedi wrote: >>> On Wed, 8 Jan 2025 at 14:48, Peter Zijlstra <peterz@infradead.org> wrote: >>>> On Tue, Jan 07, 2025 at 03:54:36PM -0800, Linus Torvalds wrote: >>>>> On Tue, 7 Jan 2025 at 06:00, Kumar Kartikeya Dwivedi <memxor@gmail.com> wrote: >>>>>> This patch set introduces Resilient Queued Spin Lock (or rqspinlock with >>>>>> res_spin_lock() and res_spin_unlock() APIs). >>>>> So when I see people doing new locking mechanisms, I invariably go "Oh no!". >>>>> >>>>> But this series seems reasonable to me. I see that PeterZ had a couple >>>>> of minor comments (well, the arm64 one is more fundamental), which >>>>> hopefully means that it seems reasonable to him too. Peter? >>>> I've not had time to fully read the whole thing yet, I only did a quick >>>> once over. I'll try and get around to doing a proper reading eventually, >>>> but I'm chasing a regression atm, and then I need to go review a ton of >>>> code Andrew merged over the xmas/newyears holiday :/ >>>> >>>> One potential issue is that qspinlock isn't suitable for all >>>> architectures -- and I've yet to figure out widely BPF is planning on >>>> using this. >>> For architectures where qspinlock is not available, I think we can >>> have a fallback to a test and set lock with timeout and deadlock >>> checks, like patch 12. >>> We plan on using this in BPF core and BPF maps, so the usage will be >>> pervasive, and we have atleast one architecture in CI (s390) which >>> doesn't have ARCH_USER_QUEUED_SPINLOCK selected, so we should have >>> coverage for both cases. For now the fallback is missing, but I will >>> add one in v2. >> Event though ARCH_USE_QUEUED_SPINLOCK isn't set for s390, it is actually >> using its own variant of qspinlock which encodes in the lock word >> additional information needed by the architecture. Similary for PPC. > Thanks, I see that now. It seems it is pretty similar to the paravirt > scenario, where the algorithm would require changes to accommodate > rqspinlock bits. > For this series, I am planning to stick to a default TAS fallback, but > we can tackle these cases together in a follow up. > This series is already quite big and it would be better to focus on > the base rqspinlock bits to keep things reviewable. > Given we're only using this in BPF right now (in specific places where > we're mindful we may fall back to TAS on some arches), we won't be > regressing any other users. I am not saying that you have to deal with that for the current patch series. However, it is something we need to tackle in the long run. Cheers, Longman