diff mbox series

[09/17] m68k: Implement xor_unlock_is_negative_byte

Message ID 20230915183707.2707298-10-willy@infradead.org (mailing list archive)
State New, archived
Headers show
Series Add folio_end_read | expand

Commit Message

Matthew Wilcox Sept. 15, 2023, 6:36 p.m. UTC
Using EOR to clear the guaranteed-to-be-set lock bit will test the
negative flag just like the x86 implementation.  This should be
more efficient than the generic implementation in filemap.c.  It
would be better if m68k had __GCC_ASM_FLAG_OUTPUTS__.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 arch/m68k/include/asm/bitops.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

Comments

Greg Ungerer Sept. 16, 2023, 1:11 p.m. UTC | #1
On 16/9/23 04:36, Matthew Wilcox (Oracle) wrote:
> Using EOR to clear the guaranteed-to-be-set lock bit will test the
> negative flag just like the x86 implementation.  This should be
> more efficient than the generic implementation in filemap.c.  It
> would be better if m68k had __GCC_ASM_FLAG_OUTPUTS__.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>   arch/m68k/include/asm/bitops.h | 14 ++++++++++++++
>   1 file changed, 14 insertions(+)
> 
> diff --git a/arch/m68k/include/asm/bitops.h b/arch/m68k/include/asm/bitops.h
> index e984af71df6b..909ebe7cab5d 100644
> --- a/arch/m68k/include/asm/bitops.h
> +++ b/arch/m68k/include/asm/bitops.h
> @@ -319,6 +319,20 @@ arch___test_and_change_bit(unsigned long nr, volatile unsigned long *addr)
>   	return test_and_change_bit(nr, addr);
>   }
>   
> +static inline bool xor_unlock_is_negative_byte(unsigned long mask,
> +		volatile unsigned long *p)
> +{
> +	char result;
> +	char *cp = (char *)p + 3;	/* m68k is big-endian */
> +
> +	__asm__ __volatile__ ("eor.b %1, %2; smi %0"

The ColdFire members of the 68k family do not support byte size eor:

   CC      mm/filemap.o
{standard input}: Assembler messages:
{standard input}:824: Error: invalid instruction for this architecture; needs 68000 or higher (68000 [68ec000, 68hc000, 68hc001, 68008, 68302, 68306, 68307, 68322, 68356], 68010, 68020 [68k, 68ec020], 68030 [68ec030], 68040 [68ec040], 68060 [68ec060], cpu32 [68330, 68331, 68332, 68333, 68334, 68336, 68340, 68341, 68349, 68360], fidoa [fido]) -- statement `eor.b #1,3(%a0)' ignored

Regards
Greg




> +		: "=d" (result)
> +		: "di" (mask), "o" (*cp)
> +		: "memory");
> +	return result;
> +}
> +#define xor_unlock_is_negative_byte xor_unlock_is_negative_byte
> +
>   /*
>    *	The true 68020 and more advanced processors support the "bfffo"
>    *	instruction for finding bits. ColdFire and simple 68000 parts
Matthew Wilcox Sept. 16, 2023, 2:34 p.m. UTC | #2
On Sat, Sep 16, 2023 at 11:11:32PM +1000, Greg Ungerer wrote:
> On 16/9/23 04:36, Matthew Wilcox (Oracle) wrote:
> > Using EOR to clear the guaranteed-to-be-set lock bit will test the
> > negative flag just like the x86 implementation.  This should be
> > more efficient than the generic implementation in filemap.c.  It
> > would be better if m68k had __GCC_ASM_FLAG_OUTPUTS__.
> > 
> > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > ---
> >   arch/m68k/include/asm/bitops.h | 14 ++++++++++++++
> >   1 file changed, 14 insertions(+)
> > 
> > diff --git a/arch/m68k/include/asm/bitops.h b/arch/m68k/include/asm/bitops.h
> > index e984af71df6b..909ebe7cab5d 100644
> > --- a/arch/m68k/include/asm/bitops.h
> > +++ b/arch/m68k/include/asm/bitops.h
> > @@ -319,6 +319,20 @@ arch___test_and_change_bit(unsigned long nr, volatile unsigned long *addr)
> >   	return test_and_change_bit(nr, addr);
> >   }
> > +static inline bool xor_unlock_is_negative_byte(unsigned long mask,
> > +		volatile unsigned long *p)
> > +{
> > +	char result;
> > +	char *cp = (char *)p + 3;	/* m68k is big-endian */
> > +
> > +	__asm__ __volatile__ ("eor.b %1, %2; smi %0"
> 
> The ColdFire members of the 68k family do not support byte size eor:
> 
>   CC      mm/filemap.o
> {standard input}: Assembler messages:
> {standard input}:824: Error: invalid instruction for this architecture; needs 68000 or higher (68000 [68ec000, 68hc000, 68hc001, 68008, 68302, 68306, 68307, 68322, 68356], 68010, 68020 [68k, 68ec020], 68030 [68ec030], 68040 [68ec040], 68060 [68ec060], cpu32 [68330, 68331, 68332, 68333, 68334, 68336, 68340, 68341, 68349, 68360], fidoa [fido]) -- statement `eor.b #1,3(%a0)' ignored

Well, that sucks.  What do you suggest for Coldfire?

(Shame you didn't join in on the original discussion:
https://lore.kernel.org/linux-m68k/ZLmKq2VLjYGBVhMI@casper.infradead.org/ )
Greg Ungerer Sept. 18, 2023, 2:37 p.m. UTC | #3
On 17/9/23 00:34, Matthew Wilcox wrote:
> On Sat, Sep 16, 2023 at 11:11:32PM +1000, Greg Ungerer wrote:
>> On 16/9/23 04:36, Matthew Wilcox (Oracle) wrote:
>>> Using EOR to clear the guaranteed-to-be-set lock bit will test the
>>> negative flag just like the x86 implementation.  This should be
>>> more efficient than the generic implementation in filemap.c.  It
>>> would be better if m68k had __GCC_ASM_FLAG_OUTPUTS__.
>>>
>>> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>>> ---
>>>    arch/m68k/include/asm/bitops.h | 14 ++++++++++++++
>>>    1 file changed, 14 insertions(+)
>>>
>>> diff --git a/arch/m68k/include/asm/bitops.h b/arch/m68k/include/asm/bitops.h
>>> index e984af71df6b..909ebe7cab5d 100644
>>> --- a/arch/m68k/include/asm/bitops.h
>>> +++ b/arch/m68k/include/asm/bitops.h
>>> @@ -319,6 +319,20 @@ arch___test_and_change_bit(unsigned long nr, volatile unsigned long *addr)
>>>    	return test_and_change_bit(nr, addr);
>>>    }
>>> +static inline bool xor_unlock_is_negative_byte(unsigned long mask,
>>> +		volatile unsigned long *p)
>>> +{
>>> +	char result;
>>> +	char *cp = (char *)p + 3;	/* m68k is big-endian */
>>> +
>>> +	__asm__ __volatile__ ("eor.b %1, %2; smi %0"
>>
>> The ColdFire members of the 68k family do not support byte size eor:
>>
>>    CC      mm/filemap.o
>> {standard input}: Assembler messages:
>> {standard input}:824: Error: invalid instruction for this architecture; needs 68000 or higher (68000 [68ec000, 68hc000, 68hc001, 68008, 68302, 68306, 68307, 68322, 68356], 68010, 68020 [68k, 68ec020], 68030 [68ec030], 68040 [68ec040], 68060 [68ec060], cpu32 [68330, 68331, 68332, 68333, 68334, 68336, 68340, 68341, 68349, 68360], fidoa [fido]) -- statement `eor.b #1,3(%a0)' ignored
> 
> Well, that sucks.  What do you suggest for Coldfire?

I am not seeing an easy way to avoid falling back to something like the MIPS
implementation for ColdFire. We could obviously hand-code this in assembler to do
better than gcc, but if it has to be atomic I think we are stuck with the irq locking.

static inline bool cf_xor_is_negative_byte(unsigned long mask,
                 volatile unsigned long *addr)
{
         unsigned long flags;
         unsigned long data;

         local_irq_save(flags);
         data = *addr;
         *addr = data ^ mask;
         local_irq_restore(flags);

         return (data & BIT(7)) != 0;
}

Regards
Greg


> (Shame you didn't join in on the original discussion:
> https://lore.kernel.org/linux-m68k/ZLmKq2VLjYGBVhMI@casper.infradead.org/ )
David Laight Sept. 19, 2023, 1:23 p.m. UTC | #4
> Well, that sucks.  What do you suggest for Coldfire?

Can you just do a 32bit xor ?
Unless you've got smp m68k I'd presume it is ok?
(And assuming you aren't falling off a page.)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Matthew Wilcox Sept. 19, 2023, 2:26 p.m. UTC | #5
On Tue, Sep 19, 2023 at 01:23:08PM +0000, David Laight wrote:
> > Well, that sucks.  What do you suggest for Coldfire?
> 
> Can you just do a 32bit xor ?
> Unless you've got smp m68k I'd presume it is ok?
> (And assuming you aren't falling off a page.)

Patch welcome.
David Laight Sept. 19, 2023, 2:35 p.m. UTC | #6
From: Matthew Wilcox <willy@infradead.org>
> Sent: 19 September 2023 15:26
> 
> On Tue, Sep 19, 2023 at 01:23:08PM +0000, David Laight wrote:
> > > Well, that sucks.  What do you suggest for Coldfire?
> >
> > Can you just do a 32bit xor ?
> > Unless you've got smp m68k I'd presume it is ok?
> > (And assuming you aren't falling off a page.)
> 
> Patch welcome.

My 68020 book seems to be at work and I'm at home.
(The 286, 386 and cy7c600 (sparc 32) books don't help).

But if the code is trying to do *ptr ^= 0x80 and check the
sign flag then you just need to use eor.l with 0x80000000
on the same address.
All the 68k I used would do misaligned accesses.
Although they can fault mid-instruction on the microcode stack.
Any smp 68020 had to be certain to resume on the same cpu.

	David

Matthew Wilcox Sept. 19, 2023, 3:14 p.m. UTC | #7
On Tue, Sep 19, 2023 at 02:35:25PM +0000, David Laight wrote:
> From: Matthew Wilcox <willy@infradead.org>
> > Sent: 19 September 2023 15:26
> > 
> > On Tue, Sep 19, 2023 at 01:23:08PM +0000, David Laight wrote:
> > > > Well, that sucks.  What do you suggest for Coldfire?
> > >
> > > Can you just do a 32bit xor ?
> > > Unless you've got smp m68k I'd presume it is ok?
> > > (And assuming you aren't falling off a page.)
> > 
> > Patch welcome.
> 
> My 68020 book seems to be at work and I'm at home.
> (The 286, 386 and cy7c600 (sparc 32) books don't help).
> 
> But if the code is trying to do *ptr ^= 0x80 and check the
> sign flag then you just need to use eor.l with 0x80000000
> on the same address.

I have a 68020 book; what I don't have is a Coldfire manual.

Anyway, that's not the brief.  We're looking to (eg) clear bit 0
and test whether bit 7 was set.  So it's the sign bit of the byte,
not the sign bit of the int.
David Laight Sept. 19, 2023, 3:22 p.m. UTC | #8
From: Matthew Wilcox <willy@infradead.org>
> Sent: 19 September 2023 16:15
> 
> On Tue, Sep 19, 2023 at 02:35:25PM +0000, David Laight wrote:
> > From: Matthew Wilcox <willy@infradead.org>
> > > Sent: 19 September 2023 15:26
> > >
> > > On Tue, Sep 19, 2023 at 01:23:08PM +0000, David Laight wrote:
> > > > > Well, that sucks.  What do you suggest for Coldfire?
> > > >
> > > > Can you just do a 32bit xor ?
> > > > Unless you've got smp m68k I'd presume it is ok?
> > > > (And assuming you aren't falling off a page.)
> > >
> > > Patch welcome.
...
> I have a 68020 book; what I don't have is a Coldfire manual.
> 
> Anyway, that's not the brief.  We're looking to (eg) clear bit 0
> and test whether bit 7 was set.  So it's the sign bit of the byte,
> not the sign bit of the int.

Use the address of the byte as an int and xor with 1u<<24.
The xor will do a rmw on the three bytes following, but I
doubt that matters.

	David

Matthew Wilcox Sept. 19, 2023, 3:47 p.m. UTC | #9
On Tue, Sep 19, 2023 at 03:22:25PM +0000, David Laight wrote:
> > Anyway, that's not the brief.  We're looking to (eg) clear bit 0
> > and test whether bit 7 was set.  So it's the sign bit of the byte,
> > not the sign bit of the int.
> 
> Use the address of the byte as an int and xor with 1u<<24.
> The xor will do a rmw on the three bytes following, but I
> doubt that matters.

Bet you a shiny penny that Coldfire takes an unaligned access trap ...
and besides, this is done on _every_ call to unlock_page().  That might
cross not only a cacheline boundary but also a page boundary.  I cannot
believe that would be a high-performing solution.  It might be just fine
on m68000 but I bet even by the 030 it's lower performing.
David Laight Sept. 19, 2023, 3:57 p.m. UTC | #10
From: Matthew Wilcox
> Sent: 19 September 2023 16:47
> 
> On Tue, Sep 19, 2023 at 03:22:25PM +0000, David Laight wrote:
> > > Anyway, that's not the brief.  We're looking to (eg) clear bit 0
> > > and test whether bit 7 was set.  So it's the sign bit of the byte,
> > > not the sign bit of the int.
> >
> > Use the address of the byte as an int and xor with 1u<<24.
> > The xor will do a rmw on the three bytes following, but I
> > doubt that matters.
> 
> Bet you a shiny penny that Coldfire takes an unaligned access trap ...

and then the 'firmware' silently fixed it up for you a few 1000
clocks later...

> and besides, this is done on _every_ call to unlock_page().  That might
> cross not only a cacheline boundary but also a page boundary.  I cannot
> believe that would be a high-performing solution.  It might be just fine
> on m68000 but I bet even by the 030 it's lower performing.

I do remember managing to use 'cas2' to add an item to a linked list.
But it is so painful to set up that it was better just to disable interrupts.
For non-smp that is almost certainly ok.
(Unless the instructions are slow because of synchronisation.)
Otherwise you need to use 'cas' on the aligned word.
Assuming ColdFire even has cas.

	David

Greg Ungerer Sept. 20, 2023, 7:15 a.m. UTC | #11
On 20/9/23 01:57, David Laight wrote:
> From: Matthew Wilcox
>> Sent: 19 September 2023 16:47
>>
>> On Tue, Sep 19, 2023 at 03:22:25PM +0000, David Laight wrote:
>>>> Anyway, that's not the brief.  We're looking to (eg) clear bit 0
>>>> and test whether bit 7 was set.  So it's the sign bit of the byte,
>>>> not the sign bit of the int.
>>>
>>> Use the address of the byte as an int and xor with 1u<<24.
>>> The xor will do a rmw on the three bytes following, but I
>>> doubt that matters.
>>
>> Bet you a shiny penny that Coldfire takes an unaligned access trap ...
> 
> and then the 'firmware' silently fixed it up for you a few 1000
> clocks later...
> 
>> and besides, this is done on _every_ call to unlock_page().  That might
>> cross not only a cacheline boundary but also a page boundary.  I cannot
>> believe that would be a high-performing solution.  It might be just fine
>> on m68000 but I bet even by the 030 it's lower performing.
> 
> I do remember managing to use 'cas2' to add an item to a linked list.
> But it is so painful to set up that it was better just to disable interrupts.
> For non-smp that is almost certainly ok.
> (Unless the instructions are slow because of synchronisation.)
> Otherwise you need to use 'cas' on the aligned word.
> Assuming coldfire even has cas.

It doesn't. See CONFIG_CPU_HAS_NO_CAS in arch/m68k/Kconfig.cpu for how
m68k deals with ColdFire and early 68000 parts not having it.

Regards
Greg


> 
> 	David
> 
Greg Ungerer Sept. 20, 2023, 7:22 a.m. UTC | #12
On 20/9/23 01:14, Matthew Wilcox wrote:
> On Tue, Sep 19, 2023 at 02:35:25PM +0000, David Laight wrote:
>> From: Matthew Wilcox <willy@infradead.org>
>>> Sent: 19 September 2023 15:26
>>>
>>> On Tue, Sep 19, 2023 at 01:23:08PM +0000, David Laight wrote:
>>>>> Well, that sucks.  What do you suggest for Coldfire?
>>>>
>>>> Can you just do a 32bit xor ?
>>>> Unless you've got smp m68k I'd presume it is ok?
>>>> (And assuming you aren't falling off a page.)
>>>
>>> Patch welcome.
>>
>> My 68020 book seems to be at work and I'm at home.
>> (The 286, 386 and cy7c600 (sparc 32) books don't help).
>>
>> But if the code is trying to do *ptr ^= 0x80 and check the
>> sign flag then you just need to use eor.l with 0x80000000
>> on the same address.
> 
> I have a 68020 book; what I don't have is a Coldfire manual.

You can find it here: https://www.nxp.com/docs/en/reference-manual/CFPRM.pdf

Regards
Greg


> Anyway, that's not the brief.  We're looking to (eg) clear bit 0
> and test whether bit 7 was set.  So it's the sign bit of the byte,
> not the sign bit of the int.
Greg Ungerer Sept. 20, 2023, 7:45 a.m. UTC | #13
On 19/9/23 00:37, Greg Ungerer wrote:
> On 17/9/23 00:34, Matthew Wilcox wrote:
>> On Sat, Sep 16, 2023 at 11:11:32PM +1000, Greg Ungerer wrote:
>>> On 16/9/23 04:36, Matthew Wilcox (Oracle) wrote:
>>>> Using EOR to clear the guaranteed-to-be-set lock bit will test the
>>>> negative flag just like the x86 implementation.  This should be
>>>> more efficient than the generic implementation in filemap.c.  It
>>>> would be better if m68k had __GCC_ASM_FLAG_OUTPUTS__.
>>>>
>>>> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>>>> ---
>>>>    arch/m68k/include/asm/bitops.h | 14 ++++++++++++++
>>>>    1 file changed, 14 insertions(+)
>>>>
>>>> diff --git a/arch/m68k/include/asm/bitops.h b/arch/m68k/include/asm/bitops.h
>>>> index e984af71df6b..909ebe7cab5d 100644
>>>> --- a/arch/m68k/include/asm/bitops.h
>>>> +++ b/arch/m68k/include/asm/bitops.h
>>>> @@ -319,6 +319,20 @@ arch___test_and_change_bit(unsigned long nr, volatile unsigned long *addr)
>>>>        return test_and_change_bit(nr, addr);
>>>>    }
>>>> +static inline bool xor_unlock_is_negative_byte(unsigned long mask,
>>>> +        volatile unsigned long *p)
>>>> +{
>>>> +    char result;
>>>> +    char *cp = (char *)p + 3;    /* m68k is big-endian */
>>>> +
>>>> +    __asm__ __volatile__ ("eor.b %1, %2; smi %0"
>>>
>>> The ColdFire members of the 68k family do not support byte size eor:
>>>
>>>    CC      mm/filemap.o
>>> {standard input}: Assembler messages:
>>> {standard input}:824: Error: invalid instruction for this architecture; needs 68000 or higher (68000 [68ec000, 68hc000, 68hc001, 68008, 68302, 68306, 68307, 68322, 68356], 68010, 68020 [68k, 68ec020], 68030 [68ec030], 68040 [68ec040], 68060 [68ec060], cpu32 [68330, 68331, 68332, 68333, 68334, 68336, 68340, 68341, 68349, 68360], fidoa [fido]) -- statement `eor.b #1,3(%a0)' ignored
>>
>> Well, that sucks.  What do you suggest for Coldfire?
> 
> I am not seeing an easy way to avoid falling back to something like the MIPS
> implementation for ColdFire. We could obviously hand-code this in assembler to do
> better than gcc, but if it has to be atomic I think we are stuck with the irq locking.
> 
> static inline bool cf_xor_is_negative_byte(unsigned long mask,
>                  volatile unsigned long *addr)
> {
>          unsigned long flags;
>          unsigned long data;
> 
>          local_irq_save(flags);
>          data = *addr;
>          *addr = data ^ mask;
>          local_irq_restore(flags);
> 
>          return (data & BIT(7)) != 0;
> }

The problem with this C implementation is that it needs to use local_irq_save(),
which results in some ugly header dependencies when trying to include irqflags.h.

This version at least compiles and runs, though we can probably do better still.


diff --git a/arch/m68k/include/asm/bitops.h b/arch/m68k/include/asm/bitops.h
index e984af71df6b..99392c26e784 100644
--- a/arch/m68k/include/asm/bitops.h
+++ b/arch/m68k/include/asm/bitops.h
@@ -319,6 +319,48 @@ arch___test_and_change_bit(unsigned long nr, volatile unsigned long *addr)
         return test_and_change_bit(nr, addr);
  }
  
+static inline bool cf_xor_unlock_is_negative_byte(unsigned long mask,
+               volatile unsigned long *addr)
+{
+       unsigned long data;
+
+        asm volatile (
+               "move.w %%sr,%%d1       \n\t"
+               "move.w %%d1,%%d0       \n\t"
+               "ori.l  #0x0700,%%d0    \n\t"
+               "move.w %%d0,%%sr       \n\t"
+
+               "move.l %2@,%0          \n\t"
+               "eor.l  %1,%0           \n\t"
+               "move.l %0,%2@          \n\t"
+
+               "move.w %%d1,%%sr       \n"
+               : "=d" (data)
+               : "di" (mask), "a" (addr)
+               : "cc", "%d0", "%d1", "memory");
+
+       return (data & BIT(7)) != 0;
+}
+
+static inline bool m68k_xor_unlock_is_negative_byte(unsigned long mask,
+               volatile unsigned long *p)
+{
+       char result;
+       char *cp = (char *)p + 3;       /* m68k is big-endian */
+
+       __asm__ __volatile__ ("eor.b %1, %2; smi %0"
+               : "=d" (result)
+               : "di" (mask), "o" (*cp)
+               : "memory");
+       return result;
+}
+
+#if defined(CONFIG_COLDFIRE)
+#define xor_unlock_is_negative_byte(mask, p) cf_xor_unlock_is_negative_byte(mask, p)
+#else
+#define xor_unlock_is_negative_byte(mask, p) m68k_xor_unlock_is_negative_byte(mask, p)
+#endif
+
  /*
   *     The true 68020 and more advanced processors support the "bfffo"
   *     instruction for finding bits. ColdFire and simple 68000 parts


Regards
Greg
Linus Torvalds Sept. 20, 2023, 4:58 p.m. UTC | #14
On Wed, 20 Sept 2023 at 00:45, Greg Ungerer <gregungerer@westnet.com.au> wrote:
>
> The problem with this C implementation is that it needs to use local_irq_save(),
> which results in some ugly header dependencies when trying to include irqflags.h.
>
> This version at least compiles and runs, though we can probably do better still.

I was going to say "can't you use CAS?" but apparently coldfire
doesn't have that either. What a horrible thing.

I do wonder if we should just say "let's use the top bit instead"?

The reason we used bit #7 is that

 - only x86 used to do the clear_bit_unlock_is_negative_byte

 - it was easy with a simple "andb".

 - it was just a trivial "move two bits around".

but honestly, on x86, doing it with "andl/andq" shouldn't be any
worse, and we can still use a (sign-extended) 8-bit immediate value to
xor the low seven bits and test the high bit at the same time - so it
should be basically exactly the same code sequence.

There's a question about mixing access widths - which can be deadly to
performance on x86 - but generally this op should be the only op on
the page flags in that sequence, and doing a byte access isn't
necessarily better.

Of course, using the top bit then screws with all the
zone/node/section/lru_gen bits that we currently put in the high bits.
So it's absolutely *not* just a trivial "move this bit" operation.

It would change all the <linux/page-flags-layout.h> games we do.

That might be enough for any sane person to go "this is not worth it",
but *if* Willy goes "I like the bit twiddling games", maybe he'd be
willing to look at that.

I mean, he wrote alpha assembler for this, that certainly says
*something* about Willy ;)

Willy?

                  Linus
Matthew Wilcox Oct. 2, 2023, 8:07 p.m. UTC | #15
On Wed, Sep 20, 2023 at 05:22:33PM +1000, Greg Ungerer wrote:
> On 20/9/23 01:14, Matthew Wilcox wrote:
> > I have a 68020 book; what I don't have is a Coldfire manual.
> 
> You can find it here: https://www.nxp.com/docs/en/reference-manual/CFPRM.pdf

Thanks, Greg.  This is almost good:

static inline bool xor_unlock_is_negative_byte(unsigned long mask,
                volatile unsigned long *p)
{
#ifdef CONFIG_COLDFIRE
        __asm__ __volatile__ ("eorl %1, %0"
                : "+m" (*p)
                : "d" (mask)
                : "memory");
        return *p & (1 << 7);
#else
        char result;
        char *cp = (char *)p + 3;       /* m68k is big-endian */

        __asm__ __volatile__ ("eor.b %1, %2; smi %0"
                : "=d" (result)
                : "di" (mask), "o" (*cp)
                : "memory");
        return result;
#endif
}

folio_end_read() does about as well as can be expected:

00000708 <folio_end_read>:
     708:       206f 0004       moveal %sp@(4),%a0
     70c:       7009            moveq #9,%d0
     70e:       4a2f 000b       tstb %sp@(11)
     712:       6602            bnes 716 <folio_end_read+0xe>
     714:       7001            moveq #1,%d0
     716:       b190            eorl %d0,%a0@
     718:       2010            movel %a0@,%d0
     71a:       4a00            tstb %d0
     71c:       6a0c            bpls 72a <folio_end_read+0x22>
     71e:       42af 0008       clrl %sp@(8)
     722:       2f48 0004       movel %a0,%sp@(4)
     726:       6000 fcfe       braw 426 <folio_wake_bit>
     72a:       4e75            rts

However, it seems that folio_unlock() could shave off an instruction:

00000918 <folio_unlock>:
     918:       206f 0004       moveal %sp@(4),%a0
     91c:       7001            moveq #1,%d0
     91e:       b190            eorl %d0,%a0@
     920:       2010            movel %a0@,%d0
     922:       4a00            tstb %d0
     924:       6a0a            bpls 930 <folio_unlock+0x18>
     926:       42a7            clrl %sp@-
     928:       2f08            movel %a0,%sp@-
     92a:       4eba fafa       jsr %pc@(426 <folio_wake_bit>)
     92e:       508f            addql #8,%sp
     930:       4e75            rts

We could use eori instead of eorl, at least according to table 3-9 on
page 3-8:

EOR Dy,<ea>x L Source ^ Destination → Destination ISA_A
EORI #<data>,Dx L Immediate Data ^ Destination → Destination ISA_A

but gas is unhappy with every way I've tried to use eori.  I'm building
with stmark2_defconfig, which I assume should work.
Greg Ungerer Oct. 3, 2023, 2:14 p.m. UTC | #16
On 3/10/23 06:07, Matthew Wilcox wrote:
> On Wed, Sep 20, 2023 at 05:22:33PM +1000, Greg Ungerer wrote:
>> On 20/9/23 01:14, Matthew Wilcox wrote:
>>> I have a 68020 book; what I don't have is a Coldfire manual.
>>
>> You can find it here: https://www.nxp.com/docs/en/reference-manual/CFPRM.pdf
> 
> Thanks, Greg.  This is almost good:
> 
> static inline bool xor_unlock_is_negative_byte(unsigned long mask,
>                  volatile unsigned long *p)
> {
> #ifdef CONFIG_COLDFIRE
>          __asm__ __volatile__ ("eorl %1, %0"
>                  : "+m" (*p)
>                  : "d" (mask)
>                  : "memory");
>          return *p & (1 << 7);
> #else
>          char result;
>          char *cp = (char *)p + 3;       /* m68k is big-endian */
> 
>          __asm__ __volatile__ ("eor.b %1, %2; smi %0"
>                  : "=d" (result)
>                  : "di" (mask), "o" (*cp)
>                  : "memory");
>          return result;
> #endif
> }
> 
> folio_end_read() does about as well as can be expected:
> 
> 00000708 <folio_end_read>:
>       708:       206f 0004       moveal %sp@(4),%a0
>       70c:       7009            moveq #9,%d0
>       70e:       4a2f 000b       tstb %sp@(11)
>       712:       6602            bnes 716 <folio_end_read+0xe>
>       714:       7001            moveq #1,%d0
>       716:       b190            eorl %d0,%a0@
>       718:       2010            movel %a0@,%d0
>       71a:       4a00            tstb %d0
>       71c:       6a0c            bpls 72a <folio_end_read+0x22>
>       71e:       42af 0008       clrl %sp@(8)
>       722:       2f48 0004       movel %a0,%sp@(4)
>       726:       6000 fcfe       braw 426 <folio_wake_bit>
>       72a:       4e75            rts
> 
> However, it seems that folio_unlock() could shave off an instruction:
> 
> 00000918 <folio_unlock>:
>       918:       206f 0004       moveal %sp@(4),%a0
>       91c:       7001            moveq #1,%d0
>       91e:       b190            eorl %d0,%a0@
>       920:       2010            movel %a0@,%d0
>       922:       4a00            tstb %d0
>       924:       6a0a            bpls 930 <folio_unlock+0x18>
>       926:       42a7            clrl %sp@-
>       928:       2f08            movel %a0,%sp@-
>       92a:       4eba fafa       jsr %pc@(426 <folio_wake_bit>)
>       92e:       508f            addql #8,%sp
>       930:       4e75            rts
> 
> We could use eori instead of eorl, at least according to table 3-9 on
> page 3-8:
> 
> EOR Dy,<ea>x L Source ^ Destination → Destination ISA_A
> EORI #<data>,Dx L Immediate Data ^ Destination → Destination ISA_A
> 
> but gas is unhappy with every way I've tried to use eori.  I'm building

I can't seem to get it to always use it either. This comes close:

         __asm__ __volatile__ ("eorl %1, %0"
                 : "+d" (*p)
                 : "di" (mask)
                 : "memory");
         return *p & (1 << 7);

Using eoril for folio_unlock, but not for folio_end_read:

400413e6 <folio_unlock>:
400413e6:       206f 0004       moveal %sp@(4),%a0
400413ea:       2010            movel %a0@,%d0
400413ec:       0a80 0000 0001  eoril #1,%d0
400413f2:       2080            movel %d0,%a0@
400413f4:       2010            movel %a0@,%d0
400413f6:       4a00            tstb %d0
400413f8:       6c0a            bges 40041404 <folio_unlock+0x1e>
400413fa:       42a7            clrl %sp@-
400413fc:       2f08            movel %a0,%sp@-
400413fe:       4eba ff30       jsr %pc@(40041330 <folio_wake_bit>)
40041402:       508f            addql #8,%sp
40041404:       4e75            rts

But that is still worse anyway.

> with stmark2_defconfig, which I assume should work.

Yes, or any of amcore, m5208evb, m5249evb, m5272c3, m5275evb, m5307c3, m5407c3.

Regards
Greg
Matthew Wilcox Oct. 3, 2023, 8:07 p.m. UTC | #17
On Wed, Oct 04, 2023 at 12:14:10AM +1000, Greg Ungerer wrote:
> On 3/10/23 06:07, Matthew Wilcox wrote:
> > 00000918 <folio_unlock>:
> >       918:       206f 0004       moveal %sp@(4),%a0
> >       91c:       7001            moveq #1,%d0
> >       91e:       b190            eorl %d0,%a0@
> >       920:       2010            movel %a0@,%d0
> >       922:       4a00            tstb %d0
> >       924:       6a0a            bpls 930 <folio_unlock+0x18>
> >       926:       42a7            clrl %sp@-
> >       928:       2f08            movel %a0,%sp@-
> >       92a:       4eba fafa       jsr %pc@(426 <folio_wake_bit>)
> >       92e:       508f            addql #8,%sp
> >       930:       4e75            rts

fwiw, here's what folio_unlock looks like today without any of my
patches:

00000746 <folio_unlock>:
     746:       206f 0004       moveal %sp@(4),%a0
     74a:       43e8 0003       lea %a0@(3),%a1
     74e:       0891 0000       bclr #0,%a1@
     752:       2010            movel %a0@,%d0
     754:       4a00            tstb %d0
     756:       6a0a            bpls 762 <folio_unlock+0x1c>
     758:       42a7            clrl %sp@-
     75a:       2f08            movel %a0,%sp@-
     75c:       4eba fcc8       jsr %pc@(426 <folio_wake_bit>)
     760:       508f            addql #8,%sp
     762:       4e75            rts

Same number of instructions, but today's code has slightly longer insns,
so I'm tempted to take the win?

> > We could use eori instead of eorl, at least according to table 3-9 on
> > page 3-8:
> > 
> > EOR Dy,<ea>x L Source ^ Destination → Destination ISA_A
> > EORI #<data>,Dx L Immediate Data ^ Destination → Destination ISA_A

Oh.  I misread.  It only does EORI to a data register; it can't do EORI
to an address.

> 400413e6 <folio_unlock>:
> 400413e6:       206f 0004       moveal %sp@(4),%a0
> 400413ea:       2010            movel %a0@,%d0
> 400413ec:       0a80 0000 0001  eoril #1,%d0
> 400413f2:       2080            movel %d0,%a0@
> 400413f4:       2010            movel %a0@,%d0
> 400413f6:       4a00            tstb %d0
> 400413f8:       6c0a            bges 40041404 <folio_unlock+0x1e>
> 400413fa:       42a7            clrl %sp@-
> 400413fc:       2f08            movel %a0,%sp@-
> 400413fe:       4eba ff30       jsr %pc@(40041330 <folio_wake_bit>)
> 40041402:       508f            addql #8,%sp
> 40041404:       4e75            rts
> 
> But that is still worse anyway.

Yup.  Looks like the version I posted actually does the best!  I'll
munge that into the patch series and repost.  Thanks for your help!
Greg Ungerer Oct. 4, 2023, 12:06 p.m. UTC | #18
On 4/10/23 06:07, Matthew Wilcox wrote:
> On Wed, Oct 04, 2023 at 12:14:10AM +1000, Greg Ungerer wrote:
>> On 3/10/23 06:07, Matthew Wilcox wrote:
>>> 00000918 <folio_unlock>:
>>>        918:       206f 0004       moveal %sp@(4),%a0
>>>        91c:       7001            moveq #1,%d0
>>>        91e:       b190            eorl %d0,%a0@
>>>        920:       2010            movel %a0@,%d0
>>>        922:       4a00            tstb %d0
>>>        924:       6a0a            bpls 930 <folio_unlock+0x18>
>>>        926:       42a7            clrl %sp@-
>>>        928:       2f08            movel %a0,%sp@-
>>>        92a:       4eba fafa       jsr %pc@(426 <folio_wake_bit>)
>>>        92e:       508f            addql #8,%sp
>>>        930:       4e75            rts
> 
> fwiw, here's what folio_unlock looks like today without any of my
> patches:
> 
> 00000746 <folio_unlock>:
>       746:       206f 0004       moveal %sp@(4),%a0
>       74a:       43e8 0003       lea %a0@(3),%a1
>       74e:       0891 0000       bclr #0,%a1@
>       752:       2010            movel %a0@,%d0
>       754:       4a00            tstb %d0
>       756:       6a0a            bpls 762 <folio_unlock+0x1c>
>       758:       42a7            clrl %sp@-
>       75a:       2f08            movel %a0,%sp@-
>       75c:       4eba fcc8       jsr %pc@(426 <folio_wake_bit>)
>       760:       508f            addql #8,%sp
>       762:       4e75            rts
> 
> Same number of instructions, but today's code has slightly longer insns,
> so I'm tempted to take the win?

Yes, I reckon so.


>>> We could use eori instead of eorl, at least according to table 3-9 on
>>> page 3-8:
>>>
>>> EOR Dy,<ea>x L Source ^ Destination → Destination ISA_A
>>> EORI #<data>,Dx L Immediate Data ^ Destination → Destination ISA_A
> 
> Oh.  I misread.  It only does EORI to a data register; it can't do EORI
> to an address.
> 
>> 400413e6 <folio_unlock>:
>> 400413e6:       206f 0004       moveal %sp@(4),%a0
>> 400413ea:       2010            movel %a0@,%d0
>> 400413ec:       0a80 0000 0001  eoril #1,%d0
>> 400413f2:       2080            movel %d0,%a0@
>> 400413f4:       2010            movel %a0@,%d0
>> 400413f6:       4a00            tstb %d0
>> 400413f8:       6c0a            bges 40041404 <folio_unlock+0x1e>
>> 400413fa:       42a7            clrl %sp@-
>> 400413fc:       2f08            movel %a0,%sp@-
>> 400413fe:       4eba ff30       jsr %pc@(40041330 <folio_wake_bit>)
>> 40041402:       508f            addql #8,%sp
>> 40041404:       4e75            rts
>>
>> But that is still worse anyway.
> 
> Yup.  Looks like the version I posted actually does the best!  I'll
> munge that into the patch series and repost.  Thanks for your help!

No worries. Sorry I didn't notice it earlier, but glad it is sorted now.

Regards
Greg
diff mbox series

Patch

diff --git a/arch/m68k/include/asm/bitops.h b/arch/m68k/include/asm/bitops.h
index e984af71df6b..909ebe7cab5d 100644
--- a/arch/m68k/include/asm/bitops.h
+++ b/arch/m68k/include/asm/bitops.h
@@ -319,6 +319,20 @@  arch___test_and_change_bit(unsigned long nr, volatile unsigned long *addr)
 	return test_and_change_bit(nr, addr);
 }
 
+static inline bool xor_unlock_is_negative_byte(unsigned long mask,
+		volatile unsigned long *p)
+{
+	char result;
+	char *cp = (char *)p + 3;	/* m68k is big-endian */
+
+	__asm__ __volatile__ ("eor.b %1, %2; smi %0"
+		: "=d" (result)
+		: "di" (mask), "o" (*cp)
+		: "memory");
+	return result;
+}
+#define xor_unlock_is_negative_byte xor_unlock_is_negative_byte
+
 /*
  *	The true 68020 and more advanced processors support the "bfffo"
  *	instruction for finding bits. ColdFire and simple 68000 parts