[REPOST,5/5] ARM: kvm MMIO support BE host running LE code

Message ID	1387558125-3460-6-git-send-email-victor.kamensky@linaro.org (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org> From: Victor Kamensky <victor.kamensky@linaro.org> To: linux-arm-kernel@lists.infradead.org, kvmarm@lists.cs.columbia.edu, marc.zyngier@arm.com, christoffer.dall@linaro.org, ben.dooks@codethink.co.uk Subject: [PATCH REPOST 5/5] ARM: kvm MMIO support BE host running LE code Date: Fri, 20 Dec 2013 08:48:45 -0800 Message-Id: <1387558125-3460-6-git-send-email-victor.kamensky@linaro.org> In-Reply-To: <1387558125-3460-1-git-send-email-victor.kamensky@linaro.org> References: <1387558125-3460-1-git-send-email-victor.kamensky@linaro.org> Cc: linaro-kernel@lists.linaro.org, patches@linaro.org, taras.kondratiuk@linaro.org, Victor Kamensky <victor.kamensky@linaro.org>, will.deacon@arm.com, andre.przywara@linaro.org, Dave.Martin@arm.com, rmk@arm.linux.org.uk Precedence: list MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org

Victor Kamensky Dec. 20, 2013, 4:48 p.m. UTC

In case of status register E bit is not set (LE mode) and host runs in
BE mode we need byteswap data, so read/write is emulated correctly.

Signed-off-by: Victor Kamensky <victor.kamensky@linaro.org>
---
 arch/arm/include/asm/kvm_emulate.h | 22 ++++++++++++++++++----
 1 file changed, 18 insertions(+), 4 deletions(-)

Marc Zyngier Jan. 6, 2014, 12:37 p.m. UTC | #1

On Fri, Dec 20 2013 at 04:48:45 PM, Victor Kamensky <victor.kamensky@linaro.org> wrote:
> In case of status register E bit is not set (LE mode) and host runs in
> BE mode we need byteswap data, so read/write is emulated correctly.

I don't think this is correct.

The only reason we byteswap the value in the BE guest case is because it
has byteswapped the data the first place.

With a LE guest, the value we get in the register is the right one, no
need for further processing. I think your additional byteswap only
hides bugs somewhere else in the stack.

	M.

> Signed-off-by: Victor Kamensky <victor.kamensky@linaro.org>
> ---
>  arch/arm/include/asm/kvm_emulate.h | 22 ++++++++++++++++++----
>  1 file changed, 18 insertions(+), 4 deletions(-)
>
> diff --git a/arch/arm/include/asm/kvm_emulate.h b/arch/arm/include/asm/kvm_emulate.h
> index 0fa90c9..69b7469 100644
> --- a/arch/arm/include/asm/kvm_emulate.h
> +++ b/arch/arm/include/asm/kvm_emulate.h
> @@ -185,9 +185,16 @@ static inline unsigned long vcpu_data_guest_to_host(struct kvm_vcpu *vcpu,
>  		default:
>  			return be32_to_cpu(data);
>  		}
> +	} else {
> +		switch (len) {
> +		case 1:
> +			return data & 0xff;
> +		case 2:
> +			return le16_to_cpu(data & 0xffff);
> +		default:
> +			return le32_to_cpu(data);
> +		}
>  	}
> -
> -	return data;		/* Leave LE untouched */
>  }
>  
>  static inline unsigned long vcpu_data_host_to_guest(struct kvm_vcpu *vcpu,
> @@ -203,9 +210,16 @@ static inline unsigned long vcpu_data_host_to_guest(struct kvm_vcpu *vcpu,
>  		default:
>  			return cpu_to_be32(data);
>  		}
> +	} else {
> +		switch (len) {
> +		case 1:
> +			return data & 0xff;
> +		case 2:
> +			return cpu_to_le16(data & 0xffff);
> +		default:
> +			return cpu_to_le32(data);
> +		}
>  	}
> -
> -	return data;		/* Leave LE untouched */
>  }
>  
>  #endif /* __ARM_KVM_EMULATE_H__ */

Victor Kamensky Jan. 6, 2014, 5:44 p.m. UTC | #2

Hi Marc,

Thank you for looking into this.

On 6 January 2014 04:37, Marc Zyngier <marc.zyngier@arm.com> wrote:
> On Fri, Dec 20 2013 at 04:48:45 PM, Victor Kamensky <victor.kamensky@linaro.org> wrote:
>> In case of status register E bit is not set (LE mode) and host runs in
>> BE mode we need byteswap data, so read/write is emulated correctly.
>
> I don't think this is correct.
>
> The only reason we byteswap the value in the BE guest case is because it
> has byteswapped the data the first place.
>
> With a LE guest, the value we get in the register is the right one, no
> need for further processing. I think your additional byteswap only
> hides bugs somewhere else in the stack.

First, do we agree that this patch has effect only in BE host case
(CONFIG_CPU_BIG_ENDIAN=y), because in LE host case cpu_to_leXX
function does nothing only simple copy, just the same we  had before?

In BE host case, we have emulator (qemu, kvm-tool), host kernel, and
hypervisor part of code, all, operating in BE mode; and guest could be either
LE or BE (i.e E bit not set or set). That is opposite to LE host case,
where we have emulator (qemu, kvm-tool), host kernel, and hypervisor part
of code, all, operating in LE mode. Your changes introduced byteswap when
host is LE and access is happening with E bit set. I don't see why symmetry
should break for case when host is BE and access is happening with E bit
cleared.

In another words, regardless of E bit setting of guest access operation rest
of the stack should bring/see the same value before/after
vcpu_data_host_to_guest/vcpu_data_guest_to_host functions are applied. I.e
the rest of stack should be agnostic to E bit setting of access operation.
Do we agree on that? Now, depending on E bit setting of guest access operation
result should differ in its endianity - so in one of two cases byteswap must
happen. But it will not happen in case of BE host and LE access, unless my diff
is applied. Previously added byteswap code for E bit set case will not
have effect
because in BE host case cpu_to_beXX functions don't do anything just copy, and
in another branch of if statement again it just copies the data. So regardless
of E bit setting guest access resulting value is the same in case of
BE host - it
cannot be that way. Note, just only with your changes, in LE host case
byteswap will happen if E bit is set and no byteswap if E bit is clear - so
guest access resulting value does depend on E setting.

Also please note that vcpu_data_host_to_guest/vcpu_data_guest_to_host functions
effectively transfer data between host kernel and memory of saved guest CPU
registers. Those saved registers will be will be put back to CPU registers,
or saved from CPU registers to memory by hypervisor part of code. In BE host
case this hypervisor part of code operates in BE mode as well, so register set
shared between host and hypervisor part of code holds guest registers values in
memory in BE order. vcpu_data_host_to_guest/vcpu_data_guest_to_host function are
not interacting with CPU registers directly. I am not sure, but may this
point was missed.

Thanks,
Victor

>         M.
>
>> Signed-off-by: Victor Kamensky <victor.kamensky@linaro.org>
>> ---
>>  arch/arm/include/asm/kvm_emulate.h | 22 ++++++++++++++++++----
>>  1 file changed, 18 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm/include/asm/kvm_emulate.h b/arch/arm/include/asm/kvm_emulate.h
>> index 0fa90c9..69b7469 100644
>> --- a/arch/arm/include/asm/kvm_emulate.h
>> +++ b/arch/arm/include/asm/kvm_emulate.h
>> @@ -185,9 +185,16 @@ static inline unsigned long vcpu_data_guest_to_host(struct kvm_vcpu *vcpu,
>>               default:
>>                       return be32_to_cpu(data);
>>               }
>> +     } else {
>> +             switch (len) {
>> +             case 1:
>> +                     return data & 0xff;
>> +             case 2:
>> +                     return le16_to_cpu(data & 0xffff);
>> +             default:
>> +                     return le32_to_cpu(data);
>> +             }
>>       }
>> -
>> -     return data;            /* Leave LE untouched */
>>  }
>>
>>  static inline unsigned long vcpu_data_host_to_guest(struct kvm_vcpu *vcpu,
>> @@ -203,9 +210,16 @@ static inline unsigned long vcpu_data_host_to_guest(struct kvm_vcpu *vcpu,
>>               default:
>>                       return cpu_to_be32(data);
>>               }
>> +     } else {
>> +             switch (len) {
>> +             case 1:
>> +                     return data & 0xff;
>> +             case 2:
>> +                     return cpu_to_le16(data & 0xffff);
>> +             default:
>> +                     return cpu_to_le32(data);
>> +             }
>>       }
>> -
>> -     return data;            /* Leave LE untouched */
>>  }
>>
>>  #endif /* __ARM_KVM_EMULATE_H__ */
>
> --
> Jazz is not dead. It just smells funny.

Marc Zyngier Jan. 6, 2014, 6:20 p.m. UTC | #3

Hi Victor,

On Mon, Jan 06 2014 at 05:44:48 PM, Victor Kamensky <victor.kamensky@linaro.org> wrote:
> Hi Marc,
>
> Thank you for looking into this.
>
> On 6 January 2014 04:37, Marc Zyngier <marc.zyngier@arm.com> wrote:
>> On Fri, Dec 20 2013 at 04:48:45 PM, Victor Kamensky <victor.kamensky@linaro.org> wrote:
>>> In case of status register E bit is not set (LE mode) and host runs in
>>> BE mode we need byteswap data, so read/write is emulated correctly.
>>
>> I don't think this is correct.
>>
>> The only reason we byteswap the value in the BE guest case is because it
>> has byteswapped the data the first place.
>>
>> With a LE guest, the value we get in the register is the right one, no
>> need for further processing. I think your additional byteswap only
>> hides bugs somewhere else in the stack.
>
> First, do we agree that this patch has effect only in BE host case
> (CONFIG_CPU_BIG_ENDIAN=y), because in LE host case cpu_to_leXX
> function does nothing only simple copy, just the same we  had before?

Sure, but that is not the point.

> In BE host case, we have emulator (qemu, kvm-tool), host kernel, and
> hypervisor part of code, all, operating in BE mode; and guest could be either
> LE or BE (i.e E bit not set or set). That is opposite to LE host case,
> where we have emulator (qemu, kvm-tool), host kernel, and hypervisor part
> of code, all, operating in LE mode. Your changes introduced byteswap when
> host is LE and access is happening with E bit set. I don't see why symmetry
> should break for case when host is BE and access is happening with E bit
> cleared.

It is certainly not about symmetry. An IO access is LE, always. Again,
the only reason we byteswap a BE guest is because it tries to write to a
LE device, and thus byteswapping the data before it hits the bus.

When we trap this access, we need to correct that byteswap. And that is
the only case we should handle. A LE guest writes a LE value to a LE
device, and nothing is to be byteswapped.

As for the value you read on the host, it will be exactly the value the
guest has written (registers don't have any endianness).

> In another words, regardless of E bit setting of guest access operation rest
> of the stack should bring/see the same value before/after
> vcpu_data_host_to_guest/vcpu_data_guest_to_host functions are applied. I.e
> the rest of stack should be agnostic to E bit setting of access operation.
> Do we agree on that? Now, depending on E bit setting of guest access operation
> result should differ in its endianity - so in one of two cases byteswap must
> happen. But it will not happen in case of BE host and LE access, unless my diff
> is applied. Previously added byteswap code for E bit set case will not
> have effect
> because in BE host case cpu_to_beXX functions don't do anything just copy, and
> in another branch of if statement again it just copies the data. So regardless
> of E bit setting guest access resulting value is the same in case of
> BE host - it
> cannot be that way. Note, just only with your changes, in LE host case
> byteswap will happen if E bit is set and no byteswap if E bit is clear - so
> guest access resulting value does depend on E setting.
>
> Also please note that vcpu_data_host_to_guest/vcpu_data_guest_to_host functions
> effectively transfer data between host kernel and memory of saved guest CPU
> registers. Those saved registers will be will be put back to CPU registers,
> or saved from CPU registers to memory by hypervisor part of code. In BE host
> case this hypervisor part of code operates in BE mode as well, so register set
> shared between host and hypervisor part of code holds guest registers values in
> memory in BE order. vcpu_data_host_to_guest/vcpu_data_guest_to_host function are
> not interacting with CPU registers directly. I am not sure, but may this
> point was missed.

It wasn't missed. No matter how data is stored in memory (BE, LE, or
even PDP endianness), CPU registers always have a consistent
representation. They are immune to CPU endianness change, and storing
to/reading from memory won't change the value, as long as you use the
same endianness for writing/reading.

What you seems to be missing is that the emulated devices must be
LE. There is no such thing as a BE GIC. So for this to work properly,
you will need to fix the VGIC code (distributor emulation only) to be
host-endianness agnostic, and behave like a LE device, even on a BE
system. And all your other device emulations.

	M.

Victor Kamensky Jan. 6, 2014, 7:55 p.m. UTC | #4

On 6 January 2014 10:20, Marc Zyngier <marc.zyngier@arm.com> wrote:
> Hi Victor,
>
> On Mon, Jan 06 2014 at 05:44:48 PM, Victor Kamensky <victor.kamensky@linaro.org> wrote:
>> Hi Marc,
>>
>> Thank you for looking into this.
>>
>> On 6 January 2014 04:37, Marc Zyngier <marc.zyngier@arm.com> wrote:
>>> On Fri, Dec 20 2013 at 04:48:45 PM, Victor Kamensky <victor.kamensky@linaro.org> wrote:
>>>> In case of status register E bit is not set (LE mode) and host runs in
>>>> BE mode we need byteswap data, so read/write is emulated correctly.
>>>
>>> I don't think this is correct.
>>>
>>> The only reason we byteswap the value in the BE guest case is because it
>>> has byteswapped the data the first place.
>>>
>>> With a LE guest, the value we get in the register is the right one, no
>>> need for further processing. I think your additional byteswap only
>>> hides bugs somewhere else in the stack.
>>
>> First, do we agree that this patch has effect only in BE host case
>> (CONFIG_CPU_BIG_ENDIAN=y), because in LE host case cpu_to_leXX
>> function does nothing only simple copy, just the same we  had before?
>
> Sure, but that is not the point.
>
>> In BE host case, we have emulator (qemu, kvm-tool), host kernel, and
>> hypervisor part of code, all, operating in BE mode; and guest could be either
>> LE or BE (i.e E bit not set or set). That is opposite to LE host case,
>> where we have emulator (qemu, kvm-tool), host kernel, and hypervisor part
>> of code, all, operating in LE mode. Your changes introduced byteswap when
>> host is LE and access is happening with E bit set. I don't see why symmetry
>> should break for case when host is BE and access is happening with E bit
>> cleared.
>
> It is certainly not about symmetry. An IO access is LE, always. Again,
> the only reason we byteswap a BE guest is because it tries to write to a
> LE device, and thus byteswapping the data before it hits the bus.
>
> When we trap this access, we need to correct that byteswap. And that is
> the only case we should handle. A LE guest writes a LE value to a LE
> device, and nothing is to be byteswapped.
>
> As for the value you read on the host, it will be exactly the value the
> guest has written (registers don't have any endianness).
>
>> In another words, regardless of E bit setting of guest access operation rest
>> of the stack should bring/see the same value before/after
>> vcpu_data_host_to_guest/vcpu_data_guest_to_host functions are applied. I.e
>> the rest of stack should be agnostic to E bit setting of access operation.
>> Do we agree on that? Now, depending on E bit setting of guest access operation
>> result should differ in its endianity - so in one of two cases byteswap must
>> happen. But it will not happen in case of BE host and LE access, unless my diff
>> is applied. Previously added byteswap code for E bit set case will not
>> have effect
>> because in BE host case cpu_to_beXX functions don't do anything just copy, and
>> in another branch of if statement again it just copies the data. So regardless
>> of E bit setting guest access resulting value is the same in case of
>> BE host - it
>> cannot be that way. Note, just only with your changes, in LE host case
>> byteswap will happen if E bit is set and no byteswap if E bit is clear - so
>> guest access resulting value does depend on E setting.
>>
>> Also please note that vcpu_data_host_to_guest/vcpu_data_guest_to_host functions
>> effectively transfer data between host kernel and memory of saved guest CPU
>> registers. Those saved registers will be will be put back to CPU registers,
>> or saved from CPU registers to memory by hypervisor part of code. In BE host
>> case this hypervisor part of code operates in BE mode as well, so register set
>> shared between host and hypervisor part of code holds guest registers values in
>> memory in BE order. vcpu_data_host_to_guest/vcpu_data_guest_to_host function are
>> not interacting with CPU registers directly. I am not sure, but may this
>> point was missed.
>
> It wasn't missed. No matter how data is stored in memory (BE, LE, or
> even PDP endianness), CPU registers always have a consistent
> representation. They are immune to CPU endianness change, and storing
> to/reading from memory won't change the value, as long as you use the
> same endianness for writing/reading.
>
> What you seems to be missing is that the emulated devices must be
> LE.

It does not matter whether emulated devices are LE or BE.
It is about how E bit should be *emulated* during access.

For example consider situation

1) BE host case
2a) In some place BE guest (E bit set) accesses LE device like this
   ldr r3, [r0, #0]
   rev r3, r3
2b) In the same place BE guest (E bit initially set) accesses LE
device like this (which is really equivalent to 2a):
   setend le
   ldr r3, [r0, #0]
   setend be
3) everything else is completely the same

Regardless of how memory device at r0 address is emulated in the
rest of the stack, if my patch is not applied, in BE host case after
'ldr r3, [r0, #0' is trapped and emulated for both 2a) and 2b) cases
r3 would contain the same value! It is clearly wrong because in one
case memory was read with E bit set and another with E bit
cleared. Such reads should give the byteswapped values for the same
memory location (device or not). In 2a) case after 'rev r3, r3' executes
r3 value will be byteswapped compared to 2b) case - which is very
different if the same 2a) and 2b) code pieces would be executed in
non emulated case with real LE device.

If you suggest that current guest access E value should be
propagated down to the rest of the stack I disagree - it is too invasive.
I believe the rest of stack should emulate access to r0 memory
in the same way regardless what is current guest access E bit value.

Note if I construct similar example for LE host and in some place in
LE guest (E bit initially cleared)  will have (where r0 address is emulated).
4a)
  ldr r3, [r0, #0]
4b)
  setend be
  ldr r3, [r0, #0]
  setend le
  rev r3, r3

The rest of stack would emulate access to r0 address memory in
the same way (regardless of current E bit value) and in 4b) case
value would be byteswapped by code that you added (E bit is set
and host is in LE) and it would be byteswapped again by rev instruction.
As result r3 value will be the same for both 4a) and 4b) cases, the
same result as one would have with real non emulated device.

Thanks,
Victor

> There is no such thing as a BE GIC. So for this to work properly,
> you will need to fix the VGIC code (distributor emulation only) to be
> host-endianness agnostic, and behave like a LE device, even on a BE
> system. And all your other device emulations.
>
>         M.
> --
> Jazz is not dead. It just smells funny.

Peter Maydell Jan. 6, 2014, 10:31 p.m. UTC | #5

On 6 January 2014 18:20, Marc Zyngier <marc.zyngier@arm.com> wrote:
>  No matter how data is stored in memory (BE, LE, or
> even PDP endianness), CPU registers always have a consistent
> representation. They are immune to CPU endianness change, and storing
> to/reading from memory won't change the value, as long as you use the
> same endianness for writing/reading.

Ah, endianness. This always confuses me, but I hope the following
is correct... (in all the following when I say BE I mean BE8, not BE32,
since BE32 and virtualization never occur in the same CPU).

Certainly registers don't have endianness, but the entire point
of the CPSR.E bit is exactly that it changes the value as it is
stored to / read from memory, isn't it? -- that's where and when the
byte-lane flipping happens.

Where this impacts the hypervisor is that instead of actually sending
the data out to the bus via the byte-swapping h/w, we've trapped instead.
The hypervisor reads the original data directly from the guest CPU
registers, and so it's the hypervisor and userspace support code that
between them have to emulate the equivalent of the byte lane
swapping h/w. You could argue that it shouldn't be the kernel's
job, but since the kernel has to do it for the devices it emulates
internally, I'm not sure that makes much sense.

> What you seems to be missing is that the emulated devices must be
> LE. There is no such thing as a BE GIC.

Right, so a BE guest would be internally flipping the 32 bit value
it wants to write so that when it goes through the CPU's byte-lane
swap (because CPSR.E is set) it appears to the GIC with the correct
bit at the bottom, yes?

(At least I think that's what the GIC being LE means; I don't think
it's like the devices on the Private Peripheral Bus on the M-profile
cores which are genuinely on the CPU's side of the byte-lane
swapping h/w and thus always LE regardless of the state of the
endianness bit. Am I wrong there?)

It's not necessary that *all* emulated devices must be LE, of
course -- you could have a QEMU which supported a board
with a bunch of BE devices on it.

thanks
-- PMM

Christoffer Dall Jan. 6, 2014, 10:56 p.m. UTC | #6

On Mon, Jan 06, 2014 at 10:31:42PM +0000, Peter Maydell wrote:
> On 6 January 2014 18:20, Marc Zyngier <marc.zyngier@arm.com> wrote:
> >  No matter how data is stored in memory (BE, LE, or
> > even PDP endianness), CPU registers always have a consistent
> > representation. They are immune to CPU endianness change, and storing
> > to/reading from memory won't change the value, as long as you use the
> > same endianness for writing/reading.
> 
> Ah, endianness. This always confuses me, but I hope the following
> is correct... (in all the following when I say BE I mean BE8, not BE32,
> since BE32 and virtualization never occur in the same CPU).
> 
> Certainly registers don't have endianness, but the entire point
> of the CPSR.E bit is exactly that it changes the value as it is
> stored to / read from memory, isn't it? -- that's where and when the
> byte-lane flipping happens.
> 
> Where this impacts the hypervisor is that instead of actually sending
> the data out to the bus via the byte-swapping h/w, we've trapped instead.
> The hypervisor reads the original data directly from the guest CPU
> registers, and so it's the hypervisor and userspace support code that
> between them have to emulate the equivalent of the byte lane
> swapping h/w. You could argue that it shouldn't be the kernel's
> job, but since the kernel has to do it for the devices it emulates
> internally, I'm not sure that makes much sense.

As far as I understand, this is exactly what vcpu_data_guest_to_host and
vcpu_data_host_to_guest do; emulate the byte lane swapping.

The problem is that it only works on a little-endian host with the
current code, because be16_to_cpu (for example), actually perform a
byteswap, which is what needs to be emulated.  On a big-endian host, we
do nothing, so we end up giving a byteswapped value to the emulated
device.

I think a cleaner fix than this patch is to just change the
be16_to_cpu() to a __swab16() instead, which clearly indicates that
'here is the byte lane swapping'.

But admittedly this hurts my brain, so I'm not 100% sure I got this last
part right.

-Christoffer

> 
> > What you seems to be missing is that the emulated devices must be
> > LE. There is no such thing as a BE GIC.
> 
> Right, so a BE guest would be internally flipping the 32 bit value
> it wants to write so that when it goes through the CPU's byte-lane
> swap (because CPSR.E is set) it appears to the GIC with the correct
> bit at the bottom, yes?
> 
> (At least I think that's what the GIC being LE means; I don't think
> it's like the devices on the Private Peripheral Bus on the M-profile
> cores which are genuinely on the CPU's side of the byte-lane
> swapping h/w and thus always LE regardless of the state of the
> endianness bit. Am I wrong there?)
> 
> It's not necessary that *all* emulated devices must be LE, of
> course -- you could have a QEMU which supported a board
> with a bunch of BE devices on it.
> 
> thanks
> -- PMM

Victor Kamensky Jan. 7, 2014, 1:59 a.m. UTC | #7

On 6 January 2014 14:56, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> On Mon, Jan 06, 2014 at 10:31:42PM +0000, Peter Maydell wrote:
>> On 6 January 2014 18:20, Marc Zyngier <marc.zyngier@arm.com> wrote:
>> >  No matter how data is stored in memory (BE, LE, or
>> > even PDP endianness), CPU registers always have a consistent
>> > representation. They are immune to CPU endianness change, and storing
>> > to/reading from memory won't change the value, as long as you use the
>> > same endianness for writing/reading.
>>
>> Ah, endianness. This always confuses me, but I hope the following
>> is correct... (in all the following when I say BE I mean BE8, not BE32,
>> since BE32 and virtualization never occur in the same CPU).
>>
>> Certainly registers don't have endianness, but the entire point
>> of the CPSR.E bit is exactly that it changes the value as it is
>> stored to / read from memory, isn't it? -- that's where and when the
>> byte-lane flipping happens.
>>
>> Where this impacts the hypervisor is that instead of actually sending
>> the data out to the bus via the byte-swapping h/w, we've trapped instead.
>> The hypervisor reads the original data directly from the guest CPU
>> registers, and so it's the hypervisor and userspace support code that
>> between them have to emulate the equivalent of the byte lane
>> swapping h/w. You could argue that it shouldn't be the kernel's
>> job, but since the kernel has to do it for the devices it emulates
>> internally, I'm not sure that makes much sense.
>
> As far as I understand, this is exactly what vcpu_data_guest_to_host and
> vcpu_data_host_to_guest do; emulate the byte lane swapping.
>
> The problem is that it only works on a little-endian host with the
> current code, because be16_to_cpu (for example), actually perform a
> byteswap, which is what needs to be emulated.  On a big-endian host, we
> do nothing, so we end up giving a byteswapped value to the emulated
> device.

Yes, that was my point on the thread: vcpu_data_guest_to_host and
vcpu_data_host_to_guest functions for any given host endianity should
give opposite endian results depending on CPSR E bit value. And
currently it is not happening in BE host case. It seems that Peter and
you agree with that and I gave example in another email with
dynamically switching E bit illustrating this problem for BE host.

> I think a cleaner fix than this patch is to just change the
> be16_to_cpu() to a __swab16() instead, which clearly indicates that
> 'here is the byte lane swapping'.

Yes, that may work, but it is a bit orthogonal issue. And I don't think
it is better. For this to work one need to change canonical endianity on
one of the sides around vcpu_data_guest_to_host and
vcpu_data_host_to_guest functions.

Changing  it on side that faces hypervisor (code that handles guest spilled
CPU register set) does not make sense at all - if we will keep guest CPU
register set in memory in LE form and hypervisor runs in BE (BE host),
code that spills registers would need to do constant byteswaps. Also any
access by host kernel and hypervisor (all running in BE) would need to do
byteswaps while working with guest saved registers.

Changing canonical form of data on side that faces emulator and mmio
part of kvm_run does not make sense either. kvm_run mmio.data field is
bytes array, when it comes to host kernel from emulator, it already contains
device memory in correct endian order that corresponds to endianity of
emulated device. For example for LE device word read access, after call is
emulated, mmio.data will contain mmio.data[0], mmio.data[1], mmio.data[2]
mmio.data[3] values in LE order (mmio.data[3] is MSB). Now look at
mmio_read_buf function introduced by Marc's 6d89d2d9 commit, this function
will byte copy this mmio.data buffer into integer according to ongoing mmio
access size. Note in BE host case such integer, in 'data' variable of
kvm_handle_mmio_return function, will have byteswapped value. Now when it will
be passed into vcpu_data_host_to_guest function, and it emulates read access
of guest with E bit set, and if we follow your suggestion, it will be
byteswapped.
I.e 'data' integer will contain non byteswapped value of LE device. It will be
further stored into some vcpu_reg register, still in native format (BE
store), and
further restored into guest CPU register, still non byteswapped (BE hypervisor).
And that is not what BE client reading word of LE device expects - BE client
knowing that it reads LE device with E bit set, it will issue additional rev
instruction to get device memory as integer. If we really want to follow your
suggestion, one may introduce compensatory byteswaps in mmio_read_buf
and mmio_write_buf functions in case of BE host, rather then just do
memcpy ... but I am not sure what it will buy us - in BE case it will swap data
twice.

Note in above description by "canonical" I mean some form of data regardless
of current access CPSR E value. But it may differ depending on host endianess.

Also as far as working with VGIC concerned: PATCH 2/5 [1] of this
series reads real
h/w vgic values from #GICH_HCR, #VGIC_CPU_VMCR, etc and byteswapps them in
case of BE host. So now VGIC "integer" values are present in kernel in cpu
native format. When mmio_data_read, and mmio_data_write functions of vgic.c
are called to fill mmio.data array because VGIC values are now in native format
but mmio.data array should contain memory in device endianity (LE for VGIC) my
PATCH 4/5 [2] of this series cpu_to_le32 and le32_to_cpu function to byteswap. I
admit that PATCH 4/5 comment is a bit obscure.

Thanks,
Victor

[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2013-December/221168.html
[2] http://lists.infradead.org/pipermail/linux-arm-kernel/2013-December/221167.html

> But admittedly this hurts my brain, so I'm not 100% sure I got this last
> part right.
>
> -Christoffer
>
>>
>> > What you seems to be missing is that the emulated devices must be
>> > LE. There is no such thing as a BE GIC.
>>
>> Right, so a BE guest would be internally flipping the 32 bit value
>> it wants to write so that when it goes through the CPU's byte-lane
>> swap (because CPSR.E is set) it appears to the GIC with the correct
>> bit at the bottom, yes?
>>
>> (At least I think that's what the GIC being LE means; I don't think
>> it's like the devices on the Private Peripheral Bus on the M-profile
>> cores which are genuinely on the CPU's side of the byte-lane
>> swapping h/w and thus always LE regardless of the state of the
>> endianness bit. Am I wrong there?)
>>
>> It's not necessary that *all* emulated devices must be LE, of
>> course -- you could have a QEMU which supported a board
>> with a bunch of BE devices on it.
>>
>> thanks
>> -- PMM
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm

Christoffer Dall Jan. 21, 2014, 1:19 a.m. UTC | #8

On Mon, Jan 06, 2014 at 05:59:03PM -0800, Victor Kamensky wrote:
> On 6 January 2014 14:56, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> > On Mon, Jan 06, 2014 at 10:31:42PM +0000, Peter Maydell wrote:
> >> On 6 January 2014 18:20, Marc Zyngier <marc.zyngier@arm.com> wrote:
> >> >  No matter how data is stored in memory (BE, LE, or
> >> > even PDP endianness), CPU registers always have a consistent
> >> > representation. They are immune to CPU endianness change, and storing
> >> > to/reading from memory won't change the value, as long as you use the
> >> > same endianness for writing/reading.
> >>
> >> Ah, endianness. This always confuses me, but I hope the following
> >> is correct... (in all the following when I say BE I mean BE8, not BE32,
> >> since BE32 and virtualization never occur in the same CPU).
> >>
> >> Certainly registers don't have endianness, but the entire point
> >> of the CPSR.E bit is exactly that it changes the value as it is
> >> stored to / read from memory, isn't it? -- that's where and when the
> >> byte-lane flipping happens.
> >>
> >> Where this impacts the hypervisor is that instead of actually sending
> >> the data out to the bus via the byte-swapping h/w, we've trapped instead.
> >> The hypervisor reads the original data directly from the guest CPU
> >> registers, and so it's the hypervisor and userspace support code that
> >> between them have to emulate the equivalent of the byte lane
> >> swapping h/w. You could argue that it shouldn't be the kernel's
> >> job, but since the kernel has to do it for the devices it emulates
> >> internally, I'm not sure that makes much sense.
> >
> > As far as I understand, this is exactly what vcpu_data_guest_to_host and
> > vcpu_data_host_to_guest do; emulate the byte lane swapping.
> >
> > The problem is that it only works on a little-endian host with the
> > current code, because be16_to_cpu (for example), actually perform a
> > byteswap, which is what needs to be emulated.  On a big-endian host, we
> > do nothing, so we end up giving a byteswapped value to the emulated
> > device.
> 
> Yes, that was my point on the thread: vcpu_data_guest_to_host and
> vcpu_data_host_to_guest functions for any given host endianity should
> give opposite endian results depending on CPSR E bit value. And
> currently it is not happening in BE host case. It seems that Peter and
> you agree with that and I gave example in another email with
> dynamically switching E bit illustrating this problem for BE host.
> 
> > I think a cleaner fix than this patch is to just change the
> > be16_to_cpu() to a __swab16() instead, which clearly indicates that
> > 'here is the byte lane swapping'.
> 
> Yes, that may work, but it is a bit orthogonal issue.

Why?  I don't think it is, I think it's addressing exactly the point at
hand.

> And I don't think
> it is better. For this to work one need to change canonical endianity on
> one of the sides around vcpu_data_guest_to_host and
> vcpu_data_host_to_guest functions.

You have to simply clearly define which format you want mmio.data to be
in.  This is a user space interface across multiple architectures and
therefore something you have to consider carefully and you're limited in
choices to something that works with existing user space code.

> 
> Changing  it on side that faces hypervisor (code that handles guest spilled
> CPU register set) does not make sense at all - if we will keep guest CPU
> register set in memory in LE form and hypervisor runs in BE (BE host),
> code that spills registers would need to do constant byteswaps. Also any
> access by host kernel and hypervisor (all running in BE) would need to do
> byteswaps while working with guest saved registers.
> 
> Changing canonical form of data on side that faces emulator and mmio
> part of kvm_run does not make sense either. kvm_run mmio.data field is
> bytes array, when it comes to host kernel from emulator, it already contains
> device memory in correct endian order that corresponds to endianity of
> emulated device. For example for LE device word read access, after call is
> emulated, mmio.data will contain mmio.data[0], mmio.data[1], mmio.data[2]
> mmio.data[3] values in LE order (mmio.data[3] is MSB). Now look at
> mmio_read_buf function introduced by Marc's 6d89d2d9 commit, this function
> will byte copy this mmio.data buffer into integer according to ongoing mmio
> access size. Note in BE host case such integer, in 'data' variable of
> kvm_handle_mmio_return function, will have byteswapped value. Now when it will
> be passed into vcpu_data_host_to_guest function, and it emulates read access
> of guest with E bit set, and if we follow your suggestion, it will be
> byteswapped.
> I.e 'data' integer will contain non byteswapped value of LE device. It will be
> further stored into some vcpu_reg register, still in native format (BE
> store), and
> further restored into guest CPU register, still non byteswapped (BE hypervisor).
> And that is not what BE client reading word of LE device expects - BE client
> knowing that it reads LE device with E bit set, it will issue additional rev
> instruction to get device memory as integer. If we really want to follow your
> suggestion, one may introduce compensatory byteswaps in mmio_read_buf
> and mmio_write_buf functions in case of BE host, rather then just do
> memcpy ... but I am not sure what it will buy us - in BE case it will swap data
> twice.
> 
> Note in above description by "canonical" I mean some form of data regardless
> of current access CPSR E value. But it may differ depending on host endianess.
> 

There's a lot of text to digest here, talking about a canonical form
here doesn't help; just define the layout of the destination byte array.
I also got completely lost in what you're referring to when you talk
about 'sides' here.

The thing we must decide is how the data is stored in
kvm_exit_mmio.data.  See Peter's recent thread "KVM and
variable-endianness guest CPUs".  Once we agree on this, the rest should
be easy (assuming we use the same structure for the data in the kernel's
internal kvm_exit_mmio declared on the stack in io_mem_abort()).

The format you suggest requires any consumer of this data to consider
the host endianness, which I don't think makes anything more clear (see
my comment on the vgic patch).

The in-kernel interface between the io_mem_abort() code and any
in-kernel emulated device must do exactly the same as the interface
between KVM and QEMU must do for KVM_EXIT_MMIO.

Victor Kamensky Jan. 21, 2014, 5:24 a.m. UTC | #9

On 20 January 2014 17:19, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> On Mon, Jan 06, 2014 at 05:59:03PM -0800, Victor Kamensky wrote:
>> On 6 January 2014 14:56, Christoffer Dall <christoffer.dall@linaro.org> wrote:
>> > On Mon, Jan 06, 2014 at 10:31:42PM +0000, Peter Maydell wrote:
>> >> On 6 January 2014 18:20, Marc Zyngier <marc.zyngier@arm.com> wrote:
>> >> >  No matter how data is stored in memory (BE, LE, or
>> >> > even PDP endianness), CPU registers always have a consistent
>> >> > representation. They are immune to CPU endianness change, and storing
>> >> > to/reading from memory won't change the value, as long as you use the
>> >> > same endianness for writing/reading.
>> >>
>> >> Ah, endianness. This always confuses me, but I hope the following
>> >> is correct... (in all the following when I say BE I mean BE8, not BE32,
>> >> since BE32 and virtualization never occur in the same CPU).
>> >>
>> >> Certainly registers don't have endianness, but the entire point
>> >> of the CPSR.E bit is exactly that it changes the value as it is
>> >> stored to / read from memory, isn't it? -- that's where and when the
>> >> byte-lane flipping happens.
>> >>
>> >> Where this impacts the hypervisor is that instead of actually sending
>> >> the data out to the bus via the byte-swapping h/w, we've trapped instead.
>> >> The hypervisor reads the original data directly from the guest CPU
>> >> registers, and so it's the hypervisor and userspace support code that
>> >> between them have to emulate the equivalent of the byte lane
>> >> swapping h/w. You could argue that it shouldn't be the kernel's
>> >> job, but since the kernel has to do it for the devices it emulates
>> >> internally, I'm not sure that makes much sense.
>> >
>> > As far as I understand, this is exactly what vcpu_data_guest_to_host and
>> > vcpu_data_host_to_guest do; emulate the byte lane swapping.
>> >
>> > The problem is that it only works on a little-endian host with the
>> > current code, because be16_to_cpu (for example), actually perform a
>> > byteswap, which is what needs to be emulated.  On a big-endian host, we
>> > do nothing, so we end up giving a byteswapped value to the emulated
>> > device.
>>
>> Yes, that was my point on the thread: vcpu_data_guest_to_host and
>> vcpu_data_host_to_guest functions for any given host endianity should
>> give opposite endian results depending on CPSR E bit value. And
>> currently it is not happening in BE host case. It seems that Peter and
>> you agree with that and I gave example in another email with
>> dynamically switching E bit illustrating this problem for BE host.
>>
>> > I think a cleaner fix than this patch is to just change the
>> > be16_to_cpu() to a __swab16() instead, which clearly indicates that
>> > 'here is the byte lane swapping'.
>>
>> Yes, that may work, but it is a bit orthogonal issue.
>
> Why?  I don't think it is, I think it's addressing exactly the point at
> hand.
>
>> And I don't think
>> it is better. For this to work one need to change canonical endianity on
>> one of the sides around vcpu_data_guest_to_host and
>> vcpu_data_host_to_guest functions.
>
> You have to simply clearly define which format you want mmio.data to be
> in.

I believe it is already decided. 'mmio.data' in 'struct kvm_run' is not
an integer type - it is bytes array. Bytes array does not have endianity.
It is endian agnostic. Here is snippet from linux/kvm.h

                /* KVM_EXIT_MMIO */
                struct {
                        __u64 phys_addr;
                        __u8  data[8];
                        __u32 len;
                        __u8  is_write;
                } mmio;

it is very natural to treat it as just a piece of memory. I.e when code reads
emulated LE device address as integer, this array will contain integer
placed in memory in LE order, data[3] is MSB, as it would be located in
regular memory. When code reads emulated BE device address as
integer this array will contain integer placed in memory in BE order,
data[0] is MSB.

You can think about it in that way: ARM system emulator runs on x86
(LE) and on PPC (BE). How mmio.data array for the same emulated
device should look like in across these two cases? I believe it should
be identical - just a stream of bytes.

Emulator code handles this situation quite nicely. For example check
in qemu endianness field of MemoryRegionOps structure. Depending
of the field value and current emulator endianity code will place
results into 'mmio.data' array in right order. See [1] as an example
in qemu where endianity of certain ARM devices were not declared
correctly - it was marked as DEVICE_NATIVE_ENDIAN whereas
it should be DEVICE_LITTLE_ENDIAN. After I changed that BE qemu
pretty much started working. I strongly suspect if one would run
ARM system emulation on PPC (BE) he/she would need the same
changes.

Note issue with virtio endianity is very different problem - there it
is not clear for given arrangement of host/emulator how to treat
virtio devices as LE or BE, and in what format data in rings
descriptors are.

Thanks,
Victor

[1]  https://git.linaro.org/people/victor.kamensky/qemu-be.git/commitdiff/8599358f9711b7a546a2bba63b6277fbfb5b8e0c?hp=c4880f08ff9451e3d8020153e1a710ab4acee151

> This is a user space interface across multiple architectures and
> therefore something you have to consider carefully and you're limited in
> choices to something that works with existing user space code.
>
>>
>> Changing  it on side that faces hypervisor (code that handles guest spilled
>> CPU register set) does not make sense at all - if we will keep guest CPU
>> register set in memory in LE form and hypervisor runs in BE (BE host),
>> code that spills registers would need to do constant byteswaps. Also any
>> access by host kernel and hypervisor (all running in BE) would need to do
>> byteswaps while working with guest saved registers.
>>
>> Changing canonical form of data on side that faces emulator and mmio
>> part of kvm_run does not make sense either. kvm_run mmio.data field is
>> bytes array, when it comes to host kernel from emulator, it already contains
>> device memory in correct endian order that corresponds to endianity of
>> emulated device. For example for LE device word read access, after call is
>> emulated, mmio.data will contain mmio.data[0], mmio.data[1], mmio.data[2]
>> mmio.data[3] values in LE order (mmio.data[3] is MSB). Now look at
>> mmio_read_buf function introduced by Marc's 6d89d2d9 commit, this function
>> will byte copy this mmio.data buffer into integer according to ongoing mmio
>> access size. Note in BE host case such integer, in 'data' variable of
>> kvm_handle_mmio_return function, will have byteswapped value. Now when it will
>> be passed into vcpu_data_host_to_guest function, and it emulates read access
>> of guest with E bit set, and if we follow your suggestion, it will be
>> byteswapped.
>> I.e 'data' integer will contain non byteswapped value of LE device. It will be
>> further stored into some vcpu_reg register, still in native format (BE
>> store), and
>> further restored into guest CPU register, still non byteswapped (BE hypervisor).
>> And that is not what BE client reading word of LE device expects - BE client
>> knowing that it reads LE device with E bit set, it will issue additional rev
>> instruction to get device memory as integer. If we really want to follow your
>> suggestion, one may introduce compensatory byteswaps in mmio_read_buf
>> and mmio_write_buf functions in case of BE host, rather then just do
>> memcpy ... but I am not sure what it will buy us - in BE case it will swap data
>> twice.
>>
>> Note in above description by "canonical" I mean some form of data regardless
>> of current access CPSR E value. But it may differ depending on host endianess.
>>
>
> There's a lot of text to digest here, talking about a canonical form
> here doesn't help; just define the layout of the destination byte array.
> I also got completely lost in what you're referring to when you talk
> about 'sides' here.
>
> The thing we must decide is how the data is stored in
> kvm_exit_mmio.data.  See Peter's recent thread "KVM and
> variable-endianness guest CPUs".  Once we agree on this, the rest should
> be easy (assuming we use the same structure for the data in the kernel's
> internal kvm_exit_mmio declared on the stack in io_mem_abort()).
>
> The format you suggest requires any consumer of this data to consider
> the host endianness, which I don't think makes anything more clear (see
> my comment on the vgic patch).
>
> The in-kernel interface between the io_mem_abort() code and any
> in-kernel emulated device must do exactly the same as the interface
> between KVM and QEMU must do for KVM_EXIT_MMIO.
>
> --
> Christoffer

Anup Patel Jan. 21, 2014, 5:46 a.m. UTC | #10

On Tue, Jan 21, 2014 at 10:54 AM, Victor Kamensky
<victor.kamensky@linaro.org> wrote:
> On 20 January 2014 17:19, Christoffer Dall <christoffer.dall@linaro.org> wrote:
>> On Mon, Jan 06, 2014 at 05:59:03PM -0800, Victor Kamensky wrote:
>>> On 6 January 2014 14:56, Christoffer Dall <christoffer.dall@linaro.org> wrote:
>>> > On Mon, Jan 06, 2014 at 10:31:42PM +0000, Peter Maydell wrote:
>>> >> On 6 January 2014 18:20, Marc Zyngier <marc.zyngier@arm.com> wrote:
>>> >> >  No matter how data is stored in memory (BE, LE, or
>>> >> > even PDP endianness), CPU registers always have a consistent
>>> >> > representation. They are immune to CPU endianness change, and storing
>>> >> > to/reading from memory won't change the value, as long as you use the
>>> >> > same endianness for writing/reading.
>>> >>
>>> >> Ah, endianness. This always confuses me, but I hope the following
>>> >> is correct... (in all the following when I say BE I mean BE8, not BE32,
>>> >> since BE32 and virtualization never occur in the same CPU).
>>> >>
>>> >> Certainly registers don't have endianness, but the entire point
>>> >> of the CPSR.E bit is exactly that it changes the value as it is
>>> >> stored to / read from memory, isn't it? -- that's where and when the
>>> >> byte-lane flipping happens.
>>> >>
>>> >> Where this impacts the hypervisor is that instead of actually sending
>>> >> the data out to the bus via the byte-swapping h/w, we've trapped instead.
>>> >> The hypervisor reads the original data directly from the guest CPU
>>> >> registers, and so it's the hypervisor and userspace support code that
>>> >> between them have to emulate the equivalent of the byte lane
>>> >> swapping h/w. You could argue that it shouldn't be the kernel's
>>> >> job, but since the kernel has to do it for the devices it emulates
>>> >> internally, I'm not sure that makes much sense.
>>> >
>>> > As far as I understand, this is exactly what vcpu_data_guest_to_host and
>>> > vcpu_data_host_to_guest do; emulate the byte lane swapping.
>>> >
>>> > The problem is that it only works on a little-endian host with the
>>> > current code, because be16_to_cpu (for example), actually perform a
>>> > byteswap, which is what needs to be emulated.  On a big-endian host, we
>>> > do nothing, so we end up giving a byteswapped value to the emulated
>>> > device.
>>>
>>> Yes, that was my point on the thread: vcpu_data_guest_to_host and
>>> vcpu_data_host_to_guest functions for any given host endianity should
>>> give opposite endian results depending on CPSR E bit value. And
>>> currently it is not happening in BE host case. It seems that Peter and
>>> you agree with that and I gave example in another email with
>>> dynamically switching E bit illustrating this problem for BE host.
>>>
>>> > I think a cleaner fix than this patch is to just change the
>>> > be16_to_cpu() to a __swab16() instead, which clearly indicates that
>>> > 'here is the byte lane swapping'.
>>>
>>> Yes, that may work, but it is a bit orthogonal issue.
>>
>> Why?  I don't think it is, I think it's addressing exactly the point at
>> hand.
>>
>>> And I don't think
>>> it is better. For this to work one need to change canonical endianity on
>>> one of the sides around vcpu_data_guest_to_host and
>>> vcpu_data_host_to_guest functions.
>>
>> You have to simply clearly define which format you want mmio.data to be
>> in.
>
> I believe it is already decided. 'mmio.data' in 'struct kvm_run' is not
> an integer type - it is bytes array. Bytes array does not have endianity.
> It is endian agnostic. Here is snippet from linux/kvm.h
>
>                 /* KVM_EXIT_MMIO */
>                 struct {
>                         __u64 phys_addr;
>                         __u8  data[8];
>                         __u32 len;
>                         __u8  is_write;
>                 } mmio;
>
> it is very natural to treat it as just a piece of memory. I.e when code reads
> emulated LE device address as integer, this array will contain integer
> placed in memory in LE order, data[3] is MSB, as it would be located in
> regular memory. When code reads emulated BE device address as
> integer this array will contain integer placed in memory in BE order,
> data[0] is MSB.
>
> You can think about it in that way: ARM system emulator runs on x86
> (LE) and on PPC (BE). How mmio.data array for the same emulated
> device should look like in across these two cases? I believe it should
> be identical - just a stream of bytes.
>
> Emulator code handles this situation quite nicely. For example check
> in qemu endianness field of MemoryRegionOps structure. Depending
> of the field value and current emulator endianity code will place
> results into 'mmio.data' array in right order. See [1] as an example
> in qemu where endianity of certain ARM devices were not declared
> correctly - it was marked as DEVICE_NATIVE_ENDIAN whereas
> it should be DEVICE_LITTLE_ENDIAN. After I changed that BE qemu
> pretty much started working. I strongly suspect if one would run
> ARM system emulation on PPC (BE) he/she would need the same
> changes.
>
> Note issue with virtio endianity is very different problem - there it
> is not clear for given arrangement of host/emulator how to treat
> virtio devices as LE or BE, and in what format data in rings
> descriptors are.

IMHO, device endianess should be taken care by device emulators only
because we can have Machine Model containing both LE devices and
BE devices. KVM ARM/ARM64 should only worry about endianess of
in-kernel emulated devices (e.g. VGIC). In general, QEMU or KVMTOOL
should be responsible of device endianess and for this QEMU or KVMTOOL
should also know whether Guest (or VM) is little-endian or big-endian.

Regards,
Anup

>
> Thanks,
> Victor
>
> [1]  https://git.linaro.org/people/victor.kamensky/qemu-be.git/commitdiff/8599358f9711b7a546a2bba63b6277fbfb5b8e0c?hp=c4880f08ff9451e3d8020153e1a710ab4acee151
>
>> This is a user space interface across multiple architectures and
>> therefore something you have to consider carefully and you're limited in
>> choices to something that works with existing user space code.
>>
>>>
>>> Changing  it on side that faces hypervisor (code that handles guest spilled
>>> CPU register set) does not make sense at all - if we will keep guest CPU
>>> register set in memory in LE form and hypervisor runs in BE (BE host),
>>> code that spills registers would need to do constant byteswaps. Also any
>>> access by host kernel and hypervisor (all running in BE) would need to do
>>> byteswaps while working with guest saved registers.
>>>
>>> Changing canonical form of data on side that faces emulator and mmio
>>> part of kvm_run does not make sense either. kvm_run mmio.data field is
>>> bytes array, when it comes to host kernel from emulator, it already contains
>>> device memory in correct endian order that corresponds to endianity of
>>> emulated device. For example for LE device word read access, after call is
>>> emulated, mmio.data will contain mmio.data[0], mmio.data[1], mmio.data[2]
>>> mmio.data[3] values in LE order (mmio.data[3] is MSB). Now look at
>>> mmio_read_buf function introduced by Marc's 6d89d2d9 commit, this function
>>> will byte copy this mmio.data buffer into integer according to ongoing mmio
>>> access size. Note in BE host case such integer, in 'data' variable of
>>> kvm_handle_mmio_return function, will have byteswapped value. Now when it will
>>> be passed into vcpu_data_host_to_guest function, and it emulates read access
>>> of guest with E bit set, and if we follow your suggestion, it will be
>>> byteswapped.
>>> I.e 'data' integer will contain non byteswapped value of LE device. It will be
>>> further stored into some vcpu_reg register, still in native format (BE
>>> store), and
>>> further restored into guest CPU register, still non byteswapped (BE hypervisor).
>>> And that is not what BE client reading word of LE device expects - BE client
>>> knowing that it reads LE device with E bit set, it will issue additional rev
>>> instruction to get device memory as integer. If we really want to follow your
>>> suggestion, one may introduce compensatory byteswaps in mmio_read_buf
>>> and mmio_write_buf functions in case of BE host, rather then just do
>>> memcpy ... but I am not sure what it will buy us - in BE case it will swap data
>>> twice.
>>>
>>> Note in above description by "canonical" I mean some form of data regardless
>>> of current access CPSR E value. But it may differ depending on host endianess.
>>>
>>
>> There's a lot of text to digest here, talking about a canonical form
>> here doesn't help; just define the layout of the destination byte array.
>> I also got completely lost in what you're referring to when you talk
>> about 'sides' here.
>>
>> The thing we must decide is how the data is stored in
>> kvm_exit_mmio.data.  See Peter's recent thread "KVM and
>> variable-endianness guest CPUs".  Once we agree on this, the rest should
>> be easy (assuming we use the same structure for the data in the kernel's
>> internal kvm_exit_mmio declared on the stack in io_mem_abort()).
>>
>> The format you suggest requires any consumer of this data to consider
>> the host endianness, which I don't think makes anything more clear (see
>> my comment on the vgic patch).
>>
>> The in-kernel interface between the io_mem_abort() code and any
>> in-kernel emulated device must do exactly the same as the interface
>> between KVM and QEMU must do for KVM_EXIT_MMIO.
>>
>> --
>> Christoffer
> _______________________________________________
> kvmarm mailing list
> kvmarm@lists.cs.columbia.edu
> https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm

Christoffer Dall Jan. 21, 2014, 6:03 a.m. UTC | #11

On Mon, Jan 20, 2014 at 09:24:10PM -0800, Victor Kamensky wrote:
> On 20 January 2014 17:19, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> > On Mon, Jan 06, 2014 at 05:59:03PM -0800, Victor Kamensky wrote:
> >> On 6 January 2014 14:56, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> >> > On Mon, Jan 06, 2014 at 10:31:42PM +0000, Peter Maydell wrote:
> >> >> On 6 January 2014 18:20, Marc Zyngier <marc.zyngier@arm.com> wrote:
> >> >> >  No matter how data is stored in memory (BE, LE, or
> >> >> > even PDP endianness), CPU registers always have a consistent
> >> >> > representation. They are immune to CPU endianness change, and storing
> >> >> > to/reading from memory won't change the value, as long as you use the
> >> >> > same endianness for writing/reading.
> >> >>
> >> >> Ah, endianness. This always confuses me, but I hope the following
> >> >> is correct... (in all the following when I say BE I mean BE8, not BE32,
> >> >> since BE32 and virtualization never occur in the same CPU).
> >> >>
> >> >> Certainly registers don't have endianness, but the entire point
> >> >> of the CPSR.E bit is exactly that it changes the value as it is
> >> >> stored to / read from memory, isn't it? -- that's where and when the
> >> >> byte-lane flipping happens.
> >> >>
> >> >> Where this impacts the hypervisor is that instead of actually sending
> >> >> the data out to the bus via the byte-swapping h/w, we've trapped instead.
> >> >> The hypervisor reads the original data directly from the guest CPU
> >> >> registers, and so it's the hypervisor and userspace support code that
> >> >> between them have to emulate the equivalent of the byte lane
> >> >> swapping h/w. You could argue that it shouldn't be the kernel's
> >> >> job, but since the kernel has to do it for the devices it emulates
> >> >> internally, I'm not sure that makes much sense.
> >> >
> >> > As far as I understand, this is exactly what vcpu_data_guest_to_host and
> >> > vcpu_data_host_to_guest do; emulate the byte lane swapping.
> >> >
> >> > The problem is that it only works on a little-endian host with the
> >> > current code, because be16_to_cpu (for example), actually perform a
> >> > byteswap, which is what needs to be emulated.  On a big-endian host, we
> >> > do nothing, so we end up giving a byteswapped value to the emulated
> >> > device.
> >>
> >> Yes, that was my point on the thread: vcpu_data_guest_to_host and
> >> vcpu_data_host_to_guest functions for any given host endianity should
> >> give opposite endian results depending on CPSR E bit value. And
> >> currently it is not happening in BE host case. It seems that Peter and
> >> you agree with that and I gave example in another email with
> >> dynamically switching E bit illustrating this problem for BE host.
> >>
> >> > I think a cleaner fix than this patch is to just change the
> >> > be16_to_cpu() to a __swab16() instead, which clearly indicates that
> >> > 'here is the byte lane swapping'.
> >>
> >> Yes, that may work, but it is a bit orthogonal issue.
> >
> > Why?  I don't think it is, I think it's addressing exactly the point at
> > hand.
> >
> >> And I don't think
> >> it is better. For this to work one need to change canonical endianity on
> >> one of the sides around vcpu_data_guest_to_host and
> >> vcpu_data_host_to_guest functions.
> >
> > You have to simply clearly define which format you want mmio.data to be
> > in.
> 
> I believe it is already decided. 'mmio.data' in 'struct kvm_run' is not
> an integer type - it is bytes array. Bytes array does not have endianity.

Please read through this thread:
https://lists.cs.columbia.edu/pipermail/kvmarm/2014-January/008784.html

> It is endian agnostic. Here is snippet from linux/kvm.h
> 
>                 /* KVM_EXIT_MMIO */
>                 struct {
>                         __u64 phys_addr;
>                         __u8  data[8];
>                         __u32 len;
>                         __u8  is_write;
>                 } mmio;

Thanks, I already knew where to find this though ;)

I realize that it is a byte array.  But that doesn't change the fact
that a store of a word would have to put either the most or least
significant byte in data[0].

> 
> it is very natural to treat it as just a piece of memory. I.e when code reads
> emulated LE device address as integer, this array will contain integer
> placed in memory in LE order, data[3] is MSB, as it would be located in
> regular memory. When code reads emulated BE device address as
> integer this array will contain integer placed in memory in BE order,
> data[0] is MSB.

I don't understand this.  "code reads emulated device address as
integer".  The format of the byte array cannot be device-specific,
because the kernel doesn't know about device.  It can only depend on the
endianness of the VM and of the host.

Can you try in a single sentence to to specify what the format of the
byte array is?

> 
> You can think about it in that way: ARM system emulator runs on x86
> (LE) and on PPC (BE). How mmio.data array for the same emulated
> device should look like in across these two cases? I believe it should
> be identical - just a stream of bytes.

Well, KVM/ARM cannot run on PPC for obvious reasons, and this is a KVM
kernel to user space interface.

> 
> Emulator code handles this situation quite nicely. For example check
> in qemu endianness field of MemoryRegionOps structure. Depending
> of the field value and current emulator endianity code will place
> results into 'mmio.data' array in right order. See [1] as an example
> in qemu where endianity of certain ARM devices were not declared
> correctly - it was marked as DEVICE_NATIVE_ENDIAN whereas
> it should be DEVICE_LITTLE_ENDIAN. After I changed that BE qemu
> pretty much started working. I strongly suspect if one would run
> ARM system emulation on PPC (BE) he/she would need the same
> changes.

It doesn't really matter what the emulator does if there's no clear
specification of the interface it relies on.  It may happen to work in
the cases that are already supported (by chance), but we don't know how
to deal with a new (cross-endianness situation) because it is not
specified.

> 
> Note issue with virtio endianity is very different problem - there it
> is not clear for given arrangement of host/emulator how to treat
> virtio devices as LE or BE, and in what format data in rings
> descriptors are.
> 
> Thanks,
> Victor
> 
> [1]  https://git.linaro.org/people/victor.kamensky/qemu-be.git/commitdiff/8599358f9711b7a546a2bba63b6277fbfb5b8e0c?hp=c4880f08ff9451e3d8020153e1a710ab4acee151
> 
> > This is a user space interface across multiple architectures and
> > therefore something you have to consider carefully and you're limited in
> > choices to something that works with existing user space code.
> >
> >>
> >> Changing  it on side that faces hypervisor (code that handles guest spilled
> >> CPU register set) does not make sense at all - if we will keep guest CPU
> >> register set in memory in LE form and hypervisor runs in BE (BE host),
> >> code that spills registers would need to do constant byteswaps. Also any
> >> access by host kernel and hypervisor (all running in BE) would need to do
> >> byteswaps while working with guest saved registers.
> >>
> >> Changing canonical form of data on side that faces emulator and mmio
> >> part of kvm_run does not make sense either. kvm_run mmio.data field is
> >> bytes array, when it comes to host kernel from emulator, it already contains
> >> device memory in correct endian order that corresponds to endianity of
> >> emulated device. For example for LE device word read access, after call is
> >> emulated, mmio.data will contain mmio.data[0], mmio.data[1], mmio.data[2]
> >> mmio.data[3] values in LE order (mmio.data[3] is MSB). Now look at
> >> mmio_read_buf function introduced by Marc's 6d89d2d9 commit, this function
> >> will byte copy this mmio.data buffer into integer according to ongoing mmio
> >> access size. Note in BE host case such integer, in 'data' variable of
> >> kvm_handle_mmio_return function, will have byteswapped value. Now when it will
> >> be passed into vcpu_data_host_to_guest function, and it emulates read access
> >> of guest with E bit set, and if we follow your suggestion, it will be
> >> byteswapped.
> >> I.e 'data' integer will contain non byteswapped value of LE device. It will be
> >> further stored into some vcpu_reg register, still in native format (BE
> >> store), and
> >> further restored into guest CPU register, still non byteswapped (BE hypervisor).
> >> And that is not what BE client reading word of LE device expects - BE client
> >> knowing that it reads LE device with E bit set, it will issue additional rev
> >> instruction to get device memory as integer. If we really want to follow your
> >> suggestion, one may introduce compensatory byteswaps in mmio_read_buf
> >> and mmio_write_buf functions in case of BE host, rather then just do
> >> memcpy ... but I am not sure what it will buy us - in BE case it will swap data
> >> twice.
> >>
> >> Note in above description by "canonical" I mean some form of data regardless
> >> of current access CPSR E value. But it may differ depending on host endianess.
> >>
> >
> > There's a lot of text to digest here, talking about a canonical form
> > here doesn't help; just define the layout of the destination byte array.
> > I also got completely lost in what you're referring to when you talk
> > about 'sides' here.
> >
> > The thing we must decide is how the data is stored in
> > kvm_exit_mmio.data.  See Peter's recent thread "KVM and
> > variable-endianness guest CPUs".  Once we agree on this, the rest should
> > be easy (assuming we use the same structure for the data in the kernel's
> > internal kvm_exit_mmio declared on the stack in io_mem_abort()).
> >
> > The format you suggest requires any consumer of this data to consider
> > the host endianness, which I don't think makes anything more clear (see
> > my comment on the vgic patch).
> >
> > The in-kernel interface between the io_mem_abort() code and any
> > in-kernel emulated device must do exactly the same as the interface
> > between KVM and QEMU must do for KVM_EXIT_MMIO.
> >
> > --
> > Christoffer

Christoffer Dall Jan. 21, 2014, 6:31 a.m. UTC | #12

On Tue, Jan 21, 2014 at 11:16:46AM +0530, Anup Patel wrote:
> On Tue, Jan 21, 2014 at 10:54 AM, Victor Kamensky
> <victor.kamensky@linaro.org> wrote:
> > On 20 January 2014 17:19, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> >> On Mon, Jan 06, 2014 at 05:59:03PM -0800, Victor Kamensky wrote:
> >>> On 6 January 2014 14:56, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> >>> > On Mon, Jan 06, 2014 at 10:31:42PM +0000, Peter Maydell wrote:
> >>> >> On 6 January 2014 18:20, Marc Zyngier <marc.zyngier@arm.com> wrote:
> >>> >> >  No matter how data is stored in memory (BE, LE, or
> >>> >> > even PDP endianness), CPU registers always have a consistent
> >>> >> > representation. They are immune to CPU endianness change, and storing
> >>> >> > to/reading from memory won't change the value, as long as you use the
> >>> >> > same endianness for writing/reading.
> >>> >>
> >>> >> Ah, endianness. This always confuses me, but I hope the following
> >>> >> is correct... (in all the following when I say BE I mean BE8, not BE32,
> >>> >> since BE32 and virtualization never occur in the same CPU).
> >>> >>
> >>> >> Certainly registers don't have endianness, but the entire point
> >>> >> of the CPSR.E bit is exactly that it changes the value as it is
> >>> >> stored to / read from memory, isn't it? -- that's where and when the
> >>> >> byte-lane flipping happens.
> >>> >>
> >>> >> Where this impacts the hypervisor is that instead of actually sending
> >>> >> the data out to the bus via the byte-swapping h/w, we've trapped instead.
> >>> >> The hypervisor reads the original data directly from the guest CPU
> >>> >> registers, and so it's the hypervisor and userspace support code that
> >>> >> between them have to emulate the equivalent of the byte lane
> >>> >> swapping h/w. You could argue that it shouldn't be the kernel's
> >>> >> job, but since the kernel has to do it for the devices it emulates
> >>> >> internally, I'm not sure that makes much sense.
> >>> >
> >>> > As far as I understand, this is exactly what vcpu_data_guest_to_host and
> >>> > vcpu_data_host_to_guest do; emulate the byte lane swapping.
> >>> >
> >>> > The problem is that it only works on a little-endian host with the
> >>> > current code, because be16_to_cpu (for example), actually perform a
> >>> > byteswap, which is what needs to be emulated.  On a big-endian host, we
> >>> > do nothing, so we end up giving a byteswapped value to the emulated
> >>> > device.
> >>>
> >>> Yes, that was my point on the thread: vcpu_data_guest_to_host and
> >>> vcpu_data_host_to_guest functions for any given host endianity should
> >>> give opposite endian results depending on CPSR E bit value. And
> >>> currently it is not happening in BE host case. It seems that Peter and
> >>> you agree with that and I gave example in another email with
> >>> dynamically switching E bit illustrating this problem for BE host.
> >>>
> >>> > I think a cleaner fix than this patch is to just change the
> >>> > be16_to_cpu() to a __swab16() instead, which clearly indicates that
> >>> > 'here is the byte lane swapping'.
> >>>
> >>> Yes, that may work, but it is a bit orthogonal issue.
> >>
> >> Why?  I don't think it is, I think it's addressing exactly the point at
> >> hand.
> >>
> >>> And I don't think
> >>> it is better. For this to work one need to change canonical endianity on
> >>> one of the sides around vcpu_data_guest_to_host and
> >>> vcpu_data_host_to_guest functions.
> >>
> >> You have to simply clearly define which format you want mmio.data to be
> >> in.
> >
> > I believe it is already decided. 'mmio.data' in 'struct kvm_run' is not
> > an integer type - it is bytes array. Bytes array does not have endianity.
> > It is endian agnostic. Here is snippet from linux/kvm.h
> >
> >                 /* KVM_EXIT_MMIO */
> >                 struct {
> >                         __u64 phys_addr;
> >                         __u8  data[8];
> >                         __u32 len;
> >                         __u8  is_write;
> >                 } mmio;
> >
> > it is very natural to treat it as just a piece of memory. I.e when code reads
> > emulated LE device address as integer, this array will contain integer
> > placed in memory in LE order, data[3] is MSB, as it would be located in
> > regular memory. When code reads emulated BE device address as
> > integer this array will contain integer placed in memory in BE order,
> > data[0] is MSB.
> >
> > You can think about it in that way: ARM system emulator runs on x86
> > (LE) and on PPC (BE). How mmio.data array for the same emulated
> > device should look like in across these two cases? I believe it should
> > be identical - just a stream of bytes.
> >
> > Emulator code handles this situation quite nicely. For example check
> > in qemu endianness field of MemoryRegionOps structure. Depending
> > of the field value and current emulator endianity code will place
> > results into 'mmio.data' array in right order. See [1] as an example
> > in qemu where endianity of certain ARM devices were not declared
> > correctly - it was marked as DEVICE_NATIVE_ENDIAN whereas
> > it should be DEVICE_LITTLE_ENDIAN. After I changed that BE qemu
> > pretty much started working. I strongly suspect if one would run
> > ARM system emulation on PPC (BE) he/she would need the same
> > changes.
> >
> > Note issue with virtio endianity is very different problem - there it
> > is not clear for given arrangement of host/emulator how to treat
> > virtio devices as LE or BE, and in what format data in rings
> > descriptors are.
> 
> IMHO, device endianess should be taken care by device emulators only
> because we can have Machine Model containing both LE devices and
> BE devices. KVM ARM/ARM64 should only worry about endianess of
> in-kernel emulated devices (e.g. VGIC). In general, QEMU or KVMTOOL
> should be responsible of device endianess and for this QEMU or KVMTOOL
> should also know whether Guest (or VM) is little-endian or big-endian.
> 

Specifying the interface to say that this is a store of the register
value directly using the endianness of the host kernel is an option.
However, user space must fetch the CPSR on each MMIO from the kernel and
look at the E-bit to understand how it should interpret the data, which
may add overhead, and it doesn't change the fact that this needs to be
specified in the API.

The E bit on ARM specifies that the CPU will swap the bytes before
putting the register value on the memory bus.  That's all it does.

Something has to emulate this, and given that KVM emulates the CPU, I
think KVM should emulate the E-bit.

From my point of view, the mmio.data API as the signal you would receive
if you're any consumer of the memory operation externally to the CPU,
which would be in the form of a bunch of wires and a length, with no
endianness.

But, the thread I pointed Victor to is focused purely on this
discussion, so you should probably respond there.

-Christoffer

> 
> >
> > Thanks,
> > Victor
> >
> > [1]  https://git.linaro.org/people/victor.kamensky/qemu-be.git/commitdiff/8599358f9711b7a546a2bba63b6277fbfb5b8e0c?hp=c4880f08ff9451e3d8020153e1a710ab4acee151
> >
> >> This is a user space interface across multiple architectures and
> >> therefore something you have to consider carefully and you're limited in
> >> choices to something that works with existing user space code.
> >>
> >>>
> >>> Changing  it on side that faces hypervisor (code that handles guest spilled
> >>> CPU register set) does not make sense at all - if we will keep guest CPU
> >>> register set in memory in LE form and hypervisor runs in BE (BE host),
> >>> code that spills registers would need to do constant byteswaps. Also any
> >>> access by host kernel and hypervisor (all running in BE) would need to do
> >>> byteswaps while working with guest saved registers.
> >>>
> >>> Changing canonical form of data on side that faces emulator and mmio
> >>> part of kvm_run does not make sense either. kvm_run mmio.data field is
> >>> bytes array, when it comes to host kernel from emulator, it already contains
> >>> device memory in correct endian order that corresponds to endianity of
> >>> emulated device. For example for LE device word read access, after call is
> >>> emulated, mmio.data will contain mmio.data[0], mmio.data[1], mmio.data[2]
> >>> mmio.data[3] values in LE order (mmio.data[3] is MSB). Now look at
> >>> mmio_read_buf function introduced by Marc's 6d89d2d9 commit, this function
> >>> will byte copy this mmio.data buffer into integer according to ongoing mmio
> >>> access size. Note in BE host case such integer, in 'data' variable of
> >>> kvm_handle_mmio_return function, will have byteswapped value. Now when it will
> >>> be passed into vcpu_data_host_to_guest function, and it emulates read access
> >>> of guest with E bit set, and if we follow your suggestion, it will be
> >>> byteswapped.
> >>> I.e 'data' integer will contain non byteswapped value of LE device. It will be
> >>> further stored into some vcpu_reg register, still in native format (BE
> >>> store), and
> >>> further restored into guest CPU register, still non byteswapped (BE hypervisor).
> >>> And that is not what BE client reading word of LE device expects - BE client
> >>> knowing that it reads LE device with E bit set, it will issue additional rev
> >>> instruction to get device memory as integer. If we really want to follow your
> >>> suggestion, one may introduce compensatory byteswaps in mmio_read_buf
> >>> and mmio_write_buf functions in case of BE host, rather then just do
> >>> memcpy ... but I am not sure what it will buy us - in BE case it will swap data
> >>> twice.
> >>>
> >>> Note in above description by "canonical" I mean some form of data regardless
> >>> of current access CPSR E value. But it may differ depending on host endianess.
> >>>
> >>
> >> There's a lot of text to digest here, talking about a canonical form
> >> here doesn't help; just define the layout of the destination byte array.
> >> I also got completely lost in what you're referring to when you talk
> >> about 'sides' here.
> >>
> >> The thing we must decide is how the data is stored in
> >> kvm_exit_mmio.data.  See Peter's recent thread "KVM and
> >> variable-endianness guest CPUs".  Once we agree on this, the rest should
> >> be easy (assuming we use the same structure for the data in the kernel's
> >> internal kvm_exit_mmio declared on the stack in io_mem_abort()).
> >>
> >> The format you suggest requires any consumer of this data to consider
> >> the host endianness, which I don't think makes anything more clear (see
> >> my comment on the vgic patch).
> >>
> >> The in-kernel interface between the io_mem_abort() code and any
> >> in-kernel emulated device must do exactly the same as the interface
> >> between KVM and QEMU must do for KVM_EXIT_MMIO.
> >>
> >> --
> >> Christoffer
> > _______________________________________________
> > kvmarm mailing list
> > kvmarm@lists.cs.columbia.edu
> > https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm

Victor Kamensky Jan. 21, 2014, 6:39 a.m. UTC | #13

Hi Anup,

On 20 January 2014 21:46, Anup Patel <anup@brainfault.org> wrote:
> On Tue, Jan 21, 2014 at 10:54 AM, Victor Kamensky
> <victor.kamensky@linaro.org> wrote:
>> On 20 January 2014 17:19, Christoffer Dall <christoffer.dall@linaro.org> wrote:
>>> On Mon, Jan 06, 2014 at 05:59:03PM -0800, Victor Kamensky wrote:
>>>> On 6 January 2014 14:56, Christoffer Dall <christoffer.dall@linaro.org> wrote:
>>>> > On Mon, Jan 06, 2014 at 10:31:42PM +0000, Peter Maydell wrote:
>>>> >> On 6 January 2014 18:20, Marc Zyngier <marc.zyngier@arm.com> wrote:
>>>> >> >  No matter how data is stored in memory (BE, LE, or
>>>> >> > even PDP endianness), CPU registers always have a consistent
>>>> >> > representation. They are immune to CPU endianness change, and storing
>>>> >> > to/reading from memory won't change the value, as long as you use the
>>>> >> > same endianness for writing/reading.
>>>> >>
>>>> >> Ah, endianness. This always confuses me, but I hope the following
>>>> >> is correct... (in all the following when I say BE I mean BE8, not BE32,
>>>> >> since BE32 and virtualization never occur in the same CPU).
>>>> >>
>>>> >> Certainly registers don't have endianness, but the entire point
>>>> >> of the CPSR.E bit is exactly that it changes the value as it is
>>>> >> stored to / read from memory, isn't it? -- that's where and when the
>>>> >> byte-lane flipping happens.
>>>> >>
>>>> >> Where this impacts the hypervisor is that instead of actually sending
>>>> >> the data out to the bus via the byte-swapping h/w, we've trapped instead.
>>>> >> The hypervisor reads the original data directly from the guest CPU
>>>> >> registers, and so it's the hypervisor and userspace support code that
>>>> >> between them have to emulate the equivalent of the byte lane
>>>> >> swapping h/w. You could argue that it shouldn't be the kernel's
>>>> >> job, but since the kernel has to do it for the devices it emulates
>>>> >> internally, I'm not sure that makes much sense.
>>>> >
>>>> > As far as I understand, this is exactly what vcpu_data_guest_to_host and
>>>> > vcpu_data_host_to_guest do; emulate the byte lane swapping.
>>>> >
>>>> > The problem is that it only works on a little-endian host with the
>>>> > current code, because be16_to_cpu (for example), actually perform a
>>>> > byteswap, which is what needs to be emulated.  On a big-endian host, we
>>>> > do nothing, so we end up giving a byteswapped value to the emulated
>>>> > device.
>>>>
>>>> Yes, that was my point on the thread: vcpu_data_guest_to_host and
>>>> vcpu_data_host_to_guest functions for any given host endianity should
>>>> give opposite endian results depending on CPSR E bit value. And
>>>> currently it is not happening in BE host case. It seems that Peter and
>>>> you agree with that and I gave example in another email with
>>>> dynamically switching E bit illustrating this problem for BE host.
>>>>
>>>> > I think a cleaner fix than this patch is to just change the
>>>> > be16_to_cpu() to a __swab16() instead, which clearly indicates that
>>>> > 'here is the byte lane swapping'.
>>>>
>>>> Yes, that may work, but it is a bit orthogonal issue.
>>>
>>> Why?  I don't think it is, I think it's addressing exactly the point at
>>> hand.
>>>
>>>> And I don't think
>>>> it is better. For this to work one need to change canonical endianity on
>>>> one of the sides around vcpu_data_guest_to_host and
>>>> vcpu_data_host_to_guest functions.
>>>
>>> You have to simply clearly define which format you want mmio.data to be
>>> in.
>>
>> I believe it is already decided. 'mmio.data' in 'struct kvm_run' is not
>> an integer type - it is bytes array. Bytes array does not have endianity.
>> It is endian agnostic. Here is snippet from linux/kvm.h
>>
>>                 /* KVM_EXIT_MMIO */
>>                 struct {
>>                         __u64 phys_addr;
>>                         __u8  data[8];
>>                         __u32 len;
>>                         __u8  is_write;
>>                 } mmio;
>>
>> it is very natural to treat it as just a piece of memory. I.e when code reads
>> emulated LE device address as integer, this array will contain integer
>> placed in memory in LE order, data[3] is MSB, as it would be located in
>> regular memory. When code reads emulated BE device address as
>> integer this array will contain integer placed in memory in BE order,
>> data[0] is MSB.
>>
>> You can think about it in that way: ARM system emulator runs on x86
>> (LE) and on PPC (BE). How mmio.data array for the same emulated
>> device should look like in across these two cases? I believe it should
>> be identical - just a stream of bytes.
>>
>> Emulator code handles this situation quite nicely. For example check
>> in qemu endianness field of MemoryRegionOps structure. Depending
>> of the field value and current emulator endianity code will place
>> results into 'mmio.data' array in right order. See [1] as an example
>> in qemu where endianity of certain ARM devices were not declared
>> correctly - it was marked as DEVICE_NATIVE_ENDIAN whereas
>> it should be DEVICE_LITTLE_ENDIAN. After I changed that BE qemu
>> pretty much started working. I strongly suspect if one would run
>> ARM system emulation on PPC (BE) he/she would need the same
>> changes.
>>
>> Note issue with virtio endianity is very different problem - there it
>> is not clear for given arrangement of host/emulator how to treat
>> virtio devices as LE or BE, and in what format data in rings
>> descriptors are.
>
> IMHO, device endianess should be taken care by device emulators only
> because we can have Machine Model containing both LE devices and
> BE devices. KVM ARM/ARM64 should only worry about endianess of
> in-kernel emulated devices (e.g. VGIC). In general, QEMU or KVMTOOL
> should be responsible of device endianess and for this QEMU or KVMTOOL
> should also know whether Guest (or VM) is little-endian or big-endian.

I agree with most of above statement except last part. I think
emulator and host KVM should not really care about guest endianity.
They should work in the same way in either case. MarcZ illustrated this
earlier with setup where LE KVM hosted either LE guest or BE guest.
Also note endianity as far as emulation concerned strictly speaking is
not property of the guest, it is rather property of current CPU execution
context (i.e E bit in CPSR reg of V7) In fact access endianity can
change on the fly - i.e when BE V7 image starts initially it assumes
that it runs in LE mode, then once kernel entered it switches CPU
into BE mode, the same happens with secondary CPU callback. And
with the last one I run into situation where such callback before switching
into BE mode read some emulated device with E bit off, latter the same
kernel reads the same device register with E bit on

Thanks,
Victor

> Regards,
> Anup
>
>>
>> Thanks,
>> Victor
>>
>> [1]  https://git.linaro.org/people/victor.kamensky/qemu-be.git/commitdiff/8599358f9711b7a546a2bba63b6277fbfb5b8e0c?hp=c4880f08ff9451e3d8020153e1a710ab4acee151
>>
>>> This is a user space interface across multiple architectures and
>>> therefore something you have to consider carefully and you're limited in
>>> choices to something that works with existing user space code.
>>>
>>>>
>>>> Changing  it on side that faces hypervisor (code that handles guest spilled
>>>> CPU register set) does not make sense at all - if we will keep guest CPU
>>>> register set in memory in LE form and hypervisor runs in BE (BE host),
>>>> code that spills registers would need to do constant byteswaps. Also any
>>>> access by host kernel and hypervisor (all running in BE) would need to do
>>>> byteswaps while working with guest saved registers.
>>>>
>>>> Changing canonical form of data on side that faces emulator and mmio
>>>> part of kvm_run does not make sense either. kvm_run mmio.data field is
>>>> bytes array, when it comes to host kernel from emulator, it already contains
>>>> device memory in correct endian order that corresponds to endianity of
>>>> emulated device. For example for LE device word read access, after call is
>>>> emulated, mmio.data will contain mmio.data[0], mmio.data[1], mmio.data[2]
>>>> mmio.data[3] values in LE order (mmio.data[3] is MSB). Now look at
>>>> mmio_read_buf function introduced by Marc's 6d89d2d9 commit, this function
>>>> will byte copy this mmio.data buffer into integer according to ongoing mmio
>>>> access size. Note in BE host case such integer, in 'data' variable of
>>>> kvm_handle_mmio_return function, will have byteswapped value. Now when it will
>>>> be passed into vcpu_data_host_to_guest function, and it emulates read access
>>>> of guest with E bit set, and if we follow your suggestion, it will be
>>>> byteswapped.
>>>> I.e 'data' integer will contain non byteswapped value of LE device. It will be
>>>> further stored into some vcpu_reg register, still in native format (BE
>>>> store), and
>>>> further restored into guest CPU register, still non byteswapped (BE hypervisor).
>>>> And that is not what BE client reading word of LE device expects - BE client
>>>> knowing that it reads LE device with E bit set, it will issue additional rev
>>>> instruction to get device memory as integer. If we really want to follow your
>>>> suggestion, one may introduce compensatory byteswaps in mmio_read_buf
>>>> and mmio_write_buf functions in case of BE host, rather then just do
>>>> memcpy ... but I am not sure what it will buy us - in BE case it will swap data
>>>> twice.
>>>>
>>>> Note in above description by "canonical" I mean some form of data regardless
>>>> of current access CPSR E value. But it may differ depending on host endianess.
>>>>
>>>
>>> There's a lot of text to digest here, talking about a canonical form
>>> here doesn't help; just define the layout of the destination byte array.
>>> I also got completely lost in what you're referring to when you talk
>>> about 'sides' here.
>>>
>>> The thing we must decide is how the data is stored in
>>> kvm_exit_mmio.data.  See Peter's recent thread "KVM and
>>> variable-endianness guest CPUs".  Once we agree on this, the rest should
>>> be easy (assuming we use the same structure for the data in the kernel's
>>> internal kvm_exit_mmio declared on the stack in io_mem_abort()).
>>>
>>> The format you suggest requires any consumer of this data to consider
>>> the host endianness, which I don't think makes anything more clear (see
>>> my comment on the vgic patch).
>>>
>>> The in-kernel interface between the io_mem_abort() code and any
>>> in-kernel emulated device must do exactly the same as the interface
>>> between KVM and QEMU must do for KVM_EXIT_MMIO.
>>>
>>> --
>>> Christoffer
>> _______________________________________________
>> kvmarm mailing list
>> kvmarm@lists.cs.columbia.edu
>> https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm

[REPOST,5/5] ARM: kvm MMIO support BE host running LE code

Commit Message

Comments

Patch