diff mbox series

xen 4.15.5: msr_relaxed required for MSR 0x1a2

Message ID ZVZAO/W0m/h+IPbi@dingwall.me.uk (mailing list archive)
State New, archived
Headers show
Series xen 4.15.5: msr_relaxed required for MSR 0x1a2 | expand

Commit Message

James Dingwall Nov. 16, 2023, 4:15 p.m. UTC
Hi,

Per the msr_relaxed documentation:

   "If using this option is necessary to fix an issue, please report a bug."

After recently upgrading an environment from Xen 4.14.5 to Xen 4.15.5 we
started experiencing a BSOD at boot with one of our Windows guests.  We found
that enabling `msr_relaxed = 1` in the guest configuration has resolved the
problem.  With a debug build of Xen and `hvm_debug=2048` on the command line
the following messages were caught as the BSOD happened:

(XEN) [HVM:11.0] <vmx_msr_read_intercept> ecx=0x1a2
(XEN) vmx.c:3298:d11v0 RDMSR 0x000001a2 unimplemented
(XEN) d11v0 VIRIDIAN CRASH: 1e ffffffffc0000096 fffff80b8de81eb5 0 0

I found that MSR 0x1a2 is MSR_TEMPERATURE_TARGET and from that this patch
series from last month:

https://patchwork.kernel.org/project/xen-devel/list/?series=796550

Picking out just a small part of that fixes the problem for us. Although the
the patch is against 4.15.5 I think it would be relevant to more recent
releases too.

Thanks,
James

Comments

Andrew Cooper Nov. 16, 2023, 4:32 p.m. UTC | #1
On 16/11/2023 4:15 pm, James Dingwall wrote:
> Hi,
>
> Per the msr_relaxed documentation:
>
>    "If using this option is necessary to fix an issue, please report a bug."
>
> After recently upgrading an environment from Xen 4.14.5 to Xen 4.15.5 we
> started experiencing a BSOD at boot with one of our Windows guests.  We found
> that enabling `msr_relaxed = 1` in the guest configuration has resolved the
> problem.  With a debug build of Xen and `hvm_debug=2048` on the command line
> the following messages were caught as the BSOD happened:
>
> (XEN) [HVM:11.0] <vmx_msr_read_intercept> ecx=0x1a2
> (XEN) vmx.c:3298:d11v0 RDMSR 0x000001a2 unimplemented
> (XEN) d11v0 VIRIDIAN CRASH: 1e ffffffffc0000096 fffff80b8de81eb5 0 0
>
> I found that MSR 0x1a2 is MSR_TEMPERATURE_TARGET and from that this patch
> series from last month:
>
> https://patchwork.kernel.org/project/xen-devel/list/?series=796550
>
> Picking out just a small part of that fixes the problem for us. Although the
> the patch is against 4.15.5 I think it would be relevant to more recent
> releases too.

Which version of Windows, and what hardware?

The Viridian Crash isn't about the RDMSR itself - it's presumably
collateral damage shortly thereafter.

Does filling in 0 for that MSR also resolve the issue?  It's model
specific and we absolutely cannot pass it through from real hardware
like that.

Thanks,

~Andrew
James Dingwall Nov. 17, 2023, 9:18 a.m. UTC | #2
On Thu, Nov 16, 2023 at 04:32:47PM +0000, Andrew Cooper wrote:
> On 16/11/2023 4:15 pm, James Dingwall wrote:
> > Hi,
> >
> > Per the msr_relaxed documentation:
> >
> >    "If using this option is necessary to fix an issue, please report a bug."
> >
> > After recently upgrading an environment from Xen 4.14.5 to Xen 4.15.5 we
> > started experiencing a BSOD at boot with one of our Windows guests.  We found
> > that enabling `msr_relaxed = 1` in the guest configuration has resolved the
> > problem.  With a debug build of Xen and `hvm_debug=2048` on the command line
> > the following messages were caught as the BSOD happened:
> >
> > (XEN) [HVM:11.0] <vmx_msr_read_intercept> ecx=0x1a2
> > (XEN) vmx.c:3298:d11v0 RDMSR 0x000001a2 unimplemented
> > (XEN) d11v0 VIRIDIAN CRASH: 1e ffffffffc0000096 fffff80b8de81eb5 0 0
> >
> > I found that MSR 0x1a2 is MSR_TEMPERATURE_TARGET and from that this patch
> > series from last month:
> >
> > https://patchwork.kernel.org/project/xen-devel/list/?series=796550
> >
> > Picking out just a small part of that fixes the problem for us. Although the
> > the patch is against 4.15.5 I think it would be relevant to more recent
> > releases too.
> 
> Which version of Windows, and what hardware?
> 
> The Viridian Crash isn't about the RDMSR itself - it's presumably
> collateral damage shortly thereafter.
> 
> Does filling in 0 for that MSR also resolve the issue?  It's model
> specific and we absolutely cannot pass it through from real hardware
> like that.
> 

Hi Andrew,

Thanks for your response.  The guest is running Windows 10 and the crash
happens in a proprietary hardware driver.  A little bit of knowledge as
they say was enough to stop the crash but I don't understand the impact
of what I've actually done...

To rework the patch I'd need a bit of guidance, if I understand your
suggestion I set the MSR to 0 with this change in emul-priv-op.c:

diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
index ed97b1d6fcc..66f5e417df6 100644
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -976,6 +976,10 @@ static int read_msr(unsigned int reg, uint64_t *val,
         *val = 0;
         return X86EMUL_OKAY;
 
+    case MSR_TEMPERATURE_TARGET:
+        *val = 0;
+        return X86EMUL_OKAY;
+
     case MSR_P6_PERFCTR(0) ... MSR_P6_PERFCTR(7):
     case MSR_P6_EVNTSEL(0) ... MSR_P6_EVNTSEL(3):
     case MSR_CORE_PERF_FIXED_CTR0 ... MSR_CORE_PERF_FIXED_CTR2:

and this in vmx.c:

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 54023a92587..bbf37b7f272 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -3259,6 +3259,11 @@ static int vmx_msr_read_intercept(unsigned int msr, uint64_t *msr_content)
         if ( !nvmx_msr_read_intercept(msr, msr_content) )
             goto gp_fault;
         break;
+
+    case MSR_TEMPERATURE_TARGET:
+        *msr_content = 0;
+        break;
+
     case MSR_IA32_MISC_ENABLE:
         rdmsrl(MSR_IA32_MISC_ENABLE, *msr_content);
         /* Debug Trace Store is not supported. */


Thanks,
James
Jan Beulich Nov. 17, 2023, 9:56 a.m. UTC | #3
On 17.11.2023 10:18, James Dingwall wrote:
> On Thu, Nov 16, 2023 at 04:32:47PM +0000, Andrew Cooper wrote:
>> On 16/11/2023 4:15 pm, James Dingwall wrote:
>>> Hi,
>>>
>>> Per the msr_relaxed documentation:
>>>
>>>    "If using this option is necessary to fix an issue, please report a bug."
>>>
>>> After recently upgrading an environment from Xen 4.14.5 to Xen 4.15.5 we
>>> started experiencing a BSOD at boot with one of our Windows guests.  We found
>>> that enabling `msr_relaxed = 1` in the guest configuration has resolved the
>>> problem.  With a debug build of Xen and `hvm_debug=2048` on the command line
>>> the following messages were caught as the BSOD happened:
>>>
>>> (XEN) [HVM:11.0] <vmx_msr_read_intercept> ecx=0x1a2
>>> (XEN) vmx.c:3298:d11v0 RDMSR 0x000001a2 unimplemented
>>> (XEN) d11v0 VIRIDIAN CRASH: 1e ffffffffc0000096 fffff80b8de81eb5 0 0
>>>
>>> I found that MSR 0x1a2 is MSR_TEMPERATURE_TARGET and from that this patch
>>> series from last month:
>>>
>>> https://patchwork.kernel.org/project/xen-devel/list/?series=796550
>>>
>>> Picking out just a small part of that fixes the problem for us. Although the
>>> the patch is against 4.15.5 I think it would be relevant to more recent
>>> releases too.
>>
>> Which version of Windows, and what hardware?
>>
>> The Viridian Crash isn't about the RDMSR itself - it's presumably
>> collateral damage shortly thereafter.
>>
>> Does filling in 0 for that MSR also resolve the issue?  It's model
>> specific and we absolutely cannot pass it through from real hardware
>> like that.
>>
> 
> Hi Andrew,
> 
> Thanks for your response.  The guest is running Windows 10 and the crash
> happens in a proprietary hardware driver.  A little bit of knowledge as
> they say was enough to stop the crash but I don't understand the impact
> of what I've actually done...
> 
> To rework the patch I'd need a bit of guidance, if I understand your
> suggestion I set the MSR to 0 with this change in emul-priv-op.c:

For the purpose of the experiment suggested by Andrew ...

> diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
> index ed97b1d6fcc..66f5e417df6 100644
> --- a/xen/arch/x86/pv/emul-priv-op.c
> +++ b/xen/arch/x86/pv/emul-priv-op.c
> @@ -976,6 +976,10 @@ static int read_msr(unsigned int reg, uint64_t *val,
>          *val = 0;
>          return X86EMUL_OKAY;
>  
> +    case MSR_TEMPERATURE_TARGET:
> +        *val = 0;
> +        return X86EMUL_OKAY;
> +
>      case MSR_P6_PERFCTR(0) ... MSR_P6_PERFCTR(7):
>      case MSR_P6_EVNTSEL(0) ... MSR_P6_EVNTSEL(3):
>      case MSR_CORE_PERF_FIXED_CTR0 ... MSR_CORE_PERF_FIXED_CTR2:

... you wouldn't need this (affects PV domains only), and ...

> and this in vmx.c:
> 
> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> index 54023a92587..bbf37b7f272 100644
> --- a/xen/arch/x86/hvm/vmx/vmx.c
> +++ b/xen/arch/x86/hvm/vmx/vmx.c
> @@ -3259,6 +3259,11 @@ static int vmx_msr_read_intercept(unsigned int msr, uint64_t *msr_content)
>          if ( !nvmx_msr_read_intercept(msr, msr_content) )
>              goto gp_fault;
>          break;
> +
> +    case MSR_TEMPERATURE_TARGET:
> +        *msr_content = 0;
> +        break;
> +
>      case MSR_IA32_MISC_ENABLE:
>          rdmsrl(MSR_IA32_MISC_ENABLE, *msr_content);
>          /* Debug Trace Store is not supported. */

... indeed this ought to do. An eventual real patch may want to look
different, though.

Jan
Roger Pau Monne Nov. 17, 2023, 10:17 a.m. UTC | #4
On Fri, Nov 17, 2023 at 09:18:39AM +0000, James Dingwall wrote:
> On Thu, Nov 16, 2023 at 04:32:47PM +0000, Andrew Cooper wrote:
> > On 16/11/2023 4:15 pm, James Dingwall wrote:
> > > Hi,
> > >
> > > Per the msr_relaxed documentation:
> > >
> > >    "If using this option is necessary to fix an issue, please report a bug."
> > >
> > > After recently upgrading an environment from Xen 4.14.5 to Xen 4.15.5 we
> > > started experiencing a BSOD at boot with one of our Windows guests.  We found
> > > that enabling `msr_relaxed = 1` in the guest configuration has resolved the
> > > problem.  With a debug build of Xen and `hvm_debug=2048` on the command line
> > > the following messages were caught as the BSOD happened:
> > >
> > > (XEN) [HVM:11.0] <vmx_msr_read_intercept> ecx=0x1a2
> > > (XEN) vmx.c:3298:d11v0 RDMSR 0x000001a2 unimplemented
> > > (XEN) d11v0 VIRIDIAN CRASH: 1e ffffffffc0000096 fffff80b8de81eb5 0 0
> > >
> > > I found that MSR 0x1a2 is MSR_TEMPERATURE_TARGET and from that this patch
> > > series from last month:
> > >
> > > https://patchwork.kernel.org/project/xen-devel/list/?series=796550
> > >
> > > Picking out just a small part of that fixes the problem for us. Although the
> > > the patch is against 4.15.5 I think it would be relevant to more recent
> > > releases too.
> > 
> > Which version of Windows, and what hardware?
> > 
> > The Viridian Crash isn't about the RDMSR itself - it's presumably
> > collateral damage shortly thereafter.
> > 
> > Does filling in 0 for that MSR also resolve the issue?  It's model
> > specific and we absolutely cannot pass it through from real hardware
> > like that.
> > 
> 
> Hi Andrew,
> 
> Thanks for your response.  The guest is running Windows 10 and the crash
> happens in a proprietary hardware driver.

When you say proprietary you mean a custom driver made for your
use-case, or is this some vendor driver widely available?

Thanks, Roger.
James Dingwall Nov. 20, 2023, 8:27 a.m. UTC | #5
On Fri, Nov 17, 2023 at 10:56:30AM +0100, Jan Beulich wrote:
> On 17.11.2023 10:18, James Dingwall wrote:
> > On Thu, Nov 16, 2023 at 04:32:47PM +0000, Andrew Cooper wrote:
> >> On 16/11/2023 4:15 pm, James Dingwall wrote:
> >>> Hi,
> >>>
> >>> Per the msr_relaxed documentation:
> >>>
> >>>    "If using this option is necessary to fix an issue, please report a bug."
> >>>
> >>> After recently upgrading an environment from Xen 4.14.5 to Xen 4.15.5 we
> >>> started experiencing a BSOD at boot with one of our Windows guests.  We found
> >>> that enabling `msr_relaxed = 1` in the guest configuration has resolved the
> >>> problem.  With a debug build of Xen and `hvm_debug=2048` on the command line
> >>> the following messages were caught as the BSOD happened:
> >>>
> >>> (XEN) [HVM:11.0] <vmx_msr_read_intercept> ecx=0x1a2
> >>> (XEN) vmx.c:3298:d11v0 RDMSR 0x000001a2 unimplemented
> >>> (XEN) d11v0 VIRIDIAN CRASH: 1e ffffffffc0000096 fffff80b8de81eb5 0 0
> >>>
> >>> I found that MSR 0x1a2 is MSR_TEMPERATURE_TARGET and from that this patch
> >>> series from last month:
> >>>
> >>> https://patchwork.kernel.org/project/xen-devel/list/?series=796550
> >>>
> >>> Picking out just a small part of that fixes the problem for us. Although the
> >>> the patch is against 4.15.5 I think it would be relevant to more recent
> >>> releases too.
> >>
> >> Which version of Windows, and what hardware?
> >>
> >> The Viridian Crash isn't about the RDMSR itself - it's presumably
> >> collateral damage shortly thereafter.
> >>
> >> Does filling in 0 for that MSR also resolve the issue?  It's model
> >> specific and we absolutely cannot pass it through from real hardware
> >> like that.
> >>
> > 
> > Hi Andrew,
> > 
> > Thanks for your response.  The guest is running Windows 10 and the crash
> > happens in a proprietary hardware driver.  A little bit of knowledge as
> > they say was enough to stop the crash but I don't understand the impact
> > of what I've actually done...
> > 
> > To rework the patch I'd need a bit of guidance, if I understand your
> > suggestion I set the MSR to 0 with this change in emul-priv-op.c:
> 
> For the purpose of the experiment suggested by Andrew ...
> 
> > diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
> > index ed97b1d6fcc..66f5e417df6 100644
> > --- a/xen/arch/x86/pv/emul-priv-op.c
> > +++ b/xen/arch/x86/pv/emul-priv-op.c
> > @@ -976,6 +976,10 @@ static int read_msr(unsigned int reg, uint64_t *val,
> >          *val = 0;
> >          return X86EMUL_OKAY;
> >  
> > +    case MSR_TEMPERATURE_TARGET:
> > +        *val = 0;
> > +        return X86EMUL_OKAY;
> > +
> >      case MSR_P6_PERFCTR(0) ... MSR_P6_PERFCTR(7):
> >      case MSR_P6_EVNTSEL(0) ... MSR_P6_EVNTSEL(3):
> >      case MSR_CORE_PERF_FIXED_CTR0 ... MSR_CORE_PERF_FIXED_CTR2:
> 
> ... you wouldn't need this (affects PV domains only), and ...
> 
> > and this in vmx.c:
> > 
> > diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> > index 54023a92587..bbf37b7f272 100644
> > --- a/xen/arch/x86/hvm/vmx/vmx.c
> > +++ b/xen/arch/x86/hvm/vmx/vmx.c
> > @@ -3259,6 +3259,11 @@ static int vmx_msr_read_intercept(unsigned int msr, uint64_t *msr_content)
> >          if ( !nvmx_msr_read_intercept(msr, msr_content) )
> >              goto gp_fault;
> >          break;
> > +
> > +    case MSR_TEMPERATURE_TARGET:
> > +        *msr_content = 0;
> > +        break;
> > +
> >      case MSR_IA32_MISC_ENABLE:
> >          rdmsrl(MSR_IA32_MISC_ENABLE, *msr_content);
> >          /* Debug Trace Store is not supported. */
> 
> ... indeed this ought to do. An eventual real patch may want to look
> different, though.
> 

Thanks Jan, based on the information I've reduced the patch to what seems the
minimal necessary to workaround the BSOD.  I assume simply not ending up at
X86EMUL_EXCEPTION is the resolution regardless of what value is set.

Regards,
James
James Dingwall Nov. 20, 2023, 8:44 a.m. UTC | #6
On Fri, Nov 17, 2023 at 11:17:46AM +0100, Roger Pau Monné wrote:
> On Fri, Nov 17, 2023 at 09:18:39AM +0000, James Dingwall wrote:
> > On Thu, Nov 16, 2023 at 04:32:47PM +0000, Andrew Cooper wrote:
> > > On 16/11/2023 4:15 pm, James Dingwall wrote:
> > > > Hi,
> > > >
> > > > Per the msr_relaxed documentation:
> > > >
> > > >    "If using this option is necessary to fix an issue, please report a bug."
> > > >
> > > > After recently upgrading an environment from Xen 4.14.5 to Xen 4.15.5 we
> > > > started experiencing a BSOD at boot with one of our Windows guests.  We found
> > > > that enabling `msr_relaxed = 1` in the guest configuration has resolved the
> > > > problem.  With a debug build of Xen and `hvm_debug=2048` on the command line
> > > > the following messages were caught as the BSOD happened:
> > > >
> > > > (XEN) [HVM:11.0] <vmx_msr_read_intercept> ecx=0x1a2
> > > > (XEN) vmx.c:3298:d11v0 RDMSR 0x000001a2 unimplemented
> > > > (XEN) d11v0 VIRIDIAN CRASH: 1e ffffffffc0000096 fffff80b8de81eb5 0 0
> > > >
> > > > I found that MSR 0x1a2 is MSR_TEMPERATURE_TARGET and from that this patch
> > > > series from last month:
> > > >
> > > > https://patchwork.kernel.org/project/xen-devel/list/?series=796550
> > > >
> > > > Picking out just a small part of that fixes the problem for us. Although the
> > > > the patch is against 4.15.5 I think it would be relevant to more recent
> > > > releases too.
> > > 
> > > Which version of Windows, and what hardware?
> > > 
> > > The Viridian Crash isn't about the RDMSR itself - it's presumably
> > > collateral damage shortly thereafter.
> > > 
> > > Does filling in 0 for that MSR also resolve the issue?  It's model
> > > specific and we absolutely cannot pass it through from real hardware
> > > like that.
> > > 
> > 
> > Hi Andrew,
> > 
> > Thanks for your response.  The guest is running Windows 10 and the crash
> > happens in a proprietary hardware driver.
> 
> When you say proprietary you mean a custom driver made for your
> use-case, or is this some vendor driver widely available?
> 

Hi Roger,

We have emulated some point of sale hardware with a custom qemu device.  It
is reasonably common but limited to its particular sector.  As the physical
hardware is all built to the same specification I assume the driver has made
assumptions about the availability of MSR_TEMPERATURE_TARGET and doesn't
handle the case it is absent which leads to the BSOD in the Windows guest.

Regards,
James
Jan Beulich Nov. 20, 2023, 8:49 a.m. UTC | #7
On 20.11.2023 09:27, James Dingwall wrote:
> On Fri, Nov 17, 2023 at 10:56:30AM +0100, Jan Beulich wrote:
>> On 17.11.2023 10:18, James Dingwall wrote:
>>> On Thu, Nov 16, 2023 at 04:32:47PM +0000, Andrew Cooper wrote:
>>>> On 16/11/2023 4:15 pm, James Dingwall wrote:
>>>>> Hi,
>>>>>
>>>>> Per the msr_relaxed documentation:
>>>>>
>>>>>    "If using this option is necessary to fix an issue, please report a bug."
>>>>>
>>>>> After recently upgrading an environment from Xen 4.14.5 to Xen 4.15.5 we
>>>>> started experiencing a BSOD at boot with one of our Windows guests.  We found
>>>>> that enabling `msr_relaxed = 1` in the guest configuration has resolved the
>>>>> problem.  With a debug build of Xen and `hvm_debug=2048` on the command line
>>>>> the following messages were caught as the BSOD happened:
>>>>>
>>>>> (XEN) [HVM:11.0] <vmx_msr_read_intercept> ecx=0x1a2
>>>>> (XEN) vmx.c:3298:d11v0 RDMSR 0x000001a2 unimplemented
>>>>> (XEN) d11v0 VIRIDIAN CRASH: 1e ffffffffc0000096 fffff80b8de81eb5 0 0
>>>>>
>>>>> I found that MSR 0x1a2 is MSR_TEMPERATURE_TARGET and from that this patch
>>>>> series from last month:
>>>>>
>>>>> https://patchwork.kernel.org/project/xen-devel/list/?series=796550
>>>>>
>>>>> Picking out just a small part of that fixes the problem for us. Although the
>>>>> the patch is against 4.15.5 I think it would be relevant to more recent
>>>>> releases too.
>>>>
>>>> Which version of Windows, and what hardware?
>>>>
>>>> The Viridian Crash isn't about the RDMSR itself - it's presumably
>>>> collateral damage shortly thereafter.
>>>>
>>>> Does filling in 0 for that MSR also resolve the issue?  It's model
>>>> specific and we absolutely cannot pass it through from real hardware
>>>> like that.
>>>>
>>>
>>> Hi Andrew,
>>>
>>> Thanks for your response.  The guest is running Windows 10 and the crash
>>> happens in a proprietary hardware driver.  A little bit of knowledge as
>>> they say was enough to stop the crash but I don't understand the impact
>>> of what I've actually done...
>>>
>>> To rework the patch I'd need a bit of guidance, if I understand your
>>> suggestion I set the MSR to 0 with this change in emul-priv-op.c:
>>
>> For the purpose of the experiment suggested by Andrew ...
>>
>>> diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
>>> index ed97b1d6fcc..66f5e417df6 100644
>>> --- a/xen/arch/x86/pv/emul-priv-op.c
>>> +++ b/xen/arch/x86/pv/emul-priv-op.c
>>> @@ -976,6 +976,10 @@ static int read_msr(unsigned int reg, uint64_t *val,
>>>          *val = 0;
>>>          return X86EMUL_OKAY;
>>>  
>>> +    case MSR_TEMPERATURE_TARGET:
>>> +        *val = 0;
>>> +        return X86EMUL_OKAY;
>>> +
>>>      case MSR_P6_PERFCTR(0) ... MSR_P6_PERFCTR(7):
>>>      case MSR_P6_EVNTSEL(0) ... MSR_P6_EVNTSEL(3):
>>>      case MSR_CORE_PERF_FIXED_CTR0 ... MSR_CORE_PERF_FIXED_CTR2:
>>
>> ... you wouldn't need this (affects PV domains only), and ...
>>
>>> and this in vmx.c:
>>>
>>> diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
>>> index 54023a92587..bbf37b7f272 100644
>>> --- a/xen/arch/x86/hvm/vmx/vmx.c
>>> +++ b/xen/arch/x86/hvm/vmx/vmx.c
>>> @@ -3259,6 +3259,11 @@ static int vmx_msr_read_intercept(unsigned int msr, uint64_t *msr_content)
>>>          if ( !nvmx_msr_read_intercept(msr, msr_content) )
>>>              goto gp_fault;
>>>          break;
>>> +
>>> +    case MSR_TEMPERATURE_TARGET:
>>> +        *msr_content = 0;
>>> +        break;
>>> +
>>>      case MSR_IA32_MISC_ENABLE:
>>>          rdmsrl(MSR_IA32_MISC_ENABLE, *msr_content);
>>>          /* Debug Trace Store is not supported. */
>>
>> ... indeed this ought to do. An eventual real patch may want to look
>> different, though.
>>
> 
> Thanks Jan, based on the information I've reduced the patch to what seems the
> minimal necessary to workaround the BSOD.  I assume simply not ending up at
> X86EMUL_EXCEPTION is the resolution regardless of what value is set.

Good. This then confirms that Andrew's expectation of this being enough was
correct. What's not really clear to me though is whether he'd also be okay
to put a (cleaned up) patch along these lines into the tree. Andrew?

Jan
Roger Pau Monne Nov. 20, 2023, 9:21 a.m. UTC | #8
On Mon, Nov 20, 2023 at 08:44:41AM +0000, James Dingwall wrote:
> On Fri, Nov 17, 2023 at 11:17:46AM +0100, Roger Pau Monné wrote:
> > On Fri, Nov 17, 2023 at 09:18:39AM +0000, James Dingwall wrote:
> > > On Thu, Nov 16, 2023 at 04:32:47PM +0000, Andrew Cooper wrote:
> > > > On 16/11/2023 4:15 pm, James Dingwall wrote:
> > > > > Hi,
> > > > >
> > > > > Per the msr_relaxed documentation:
> > > > >
> > > > >    "If using this option is necessary to fix an issue, please report a bug."
> > > > >
> > > > > After recently upgrading an environment from Xen 4.14.5 to Xen 4.15.5 we
> > > > > started experiencing a BSOD at boot with one of our Windows guests.  We found
> > > > > that enabling `msr_relaxed = 1` in the guest configuration has resolved the
> > > > > problem.  With a debug build of Xen and `hvm_debug=2048` on the command line
> > > > > the following messages were caught as the BSOD happened:
> > > > >
> > > > > (XEN) [HVM:11.0] <vmx_msr_read_intercept> ecx=0x1a2
> > > > > (XEN) vmx.c:3298:d11v0 RDMSR 0x000001a2 unimplemented
> > > > > (XEN) d11v0 VIRIDIAN CRASH: 1e ffffffffc0000096 fffff80b8de81eb5 0 0
> > > > >
> > > > > I found that MSR 0x1a2 is MSR_TEMPERATURE_TARGET and from that this patch
> > > > > series from last month:
> > > > >
> > > > > https://patchwork.kernel.org/project/xen-devel/list/?series=796550
> > > > >
> > > > > Picking out just a small part of that fixes the problem for us. Although the
> > > > > the patch is against 4.15.5 I think it would be relevant to more recent
> > > > > releases too.
> > > > 
> > > > Which version of Windows, and what hardware?
> > > > 
> > > > The Viridian Crash isn't about the RDMSR itself - it's presumably
> > > > collateral damage shortly thereafter.
> > > > 
> > > > Does filling in 0 for that MSR also resolve the issue?  It's model
> > > > specific and we absolutely cannot pass it through from real hardware
> > > > like that.
> > > > 
> > > 
> > > Hi Andrew,
> > > 
> > > Thanks for your response.  The guest is running Windows 10 and the crash
> > > happens in a proprietary hardware driver.
> > 
> > When you say proprietary you mean a custom driver made for your
> > use-case, or is this some vendor driver widely available?
> > 
> 
> Hi Roger,
> 
> We have emulated some point of sale hardware with a custom qemu device.  It
> is reasonably common but limited to its particular sector.  As the physical
> hardware is all built to the same specification I assume the driver has made
> assumptions about the availability of MSR_TEMPERATURE_TARGET and doesn't
> handle the case it is absent which leads to the BSOD in the Windows guest.

Hello James,

We have in the past exposed MSRs in order to workaround OSes
assumptions about such MSRs being unconditionally present, so it's not
completely unacceptable that we might end up exposing this if strictly
required.

My question would be, is it possible for such driver to get fixed in
order to avoid unconditionally poking at MSR_TEMPERATURE_TARGET, or
that's impossible?

From the Intel manual it seems like MSR_TEMPERATURE_TARGET is
unconditionally present on certain models, and hence we might have no
other option but to end up adding a dummy handler for reads.  I do
wonder whether returning all 0 is correct, as then the "thermal
throttling" would be enable when the CPU temp > 0C, which is
unrealistic.  I assume that wouldn't matter much as long as drivers
don't choke on such weird value.

Thanks, Roger.
Roger Pau Monne Nov. 20, 2023, 9:24 a.m. UTC | #9
On Mon, Nov 20, 2023 at 08:27:36AM +0000, James Dingwall wrote:
> On Fri, Nov 17, 2023 at 10:56:30AM +0100, Jan Beulich wrote:
> > On 17.11.2023 10:18, James Dingwall wrote:
> > > On Thu, Nov 16, 2023 at 04:32:47PM +0000, Andrew Cooper wrote:
> > >> On 16/11/2023 4:15 pm, James Dingwall wrote:
> > >>> Hi,
> > >>>
> > >>> Per the msr_relaxed documentation:
> > >>>
> > >>>    "If using this option is necessary to fix an issue, please report a bug."
> > >>>
> > >>> After recently upgrading an environment from Xen 4.14.5 to Xen 4.15.5 we
> > >>> started experiencing a BSOD at boot with one of our Windows guests.  We found
> > >>> that enabling `msr_relaxed = 1` in the guest configuration has resolved the
> > >>> problem.  With a debug build of Xen and `hvm_debug=2048` on the command line
> > >>> the following messages were caught as the BSOD happened:
> > >>>
> > >>> (XEN) [HVM:11.0] <vmx_msr_read_intercept> ecx=0x1a2
> > >>> (XEN) vmx.c:3298:d11v0 RDMSR 0x000001a2 unimplemented
> > >>> (XEN) d11v0 VIRIDIAN CRASH: 1e ffffffffc0000096 fffff80b8de81eb5 0 0
> > >>>
> > >>> I found that MSR 0x1a2 is MSR_TEMPERATURE_TARGET and from that this patch
> > >>> series from last month:
> > >>>
> > >>> https://patchwork.kernel.org/project/xen-devel/list/?series=796550
> > >>>
> > >>> Picking out just a small part of that fixes the problem for us. Although the
> > >>> the patch is against 4.15.5 I think it would be relevant to more recent
> > >>> releases too.
> > >>
> > >> Which version of Windows, and what hardware?
> > >>
> > >> The Viridian Crash isn't about the RDMSR itself - it's presumably
> > >> collateral damage shortly thereafter.
> > >>
> > >> Does filling in 0 for that MSR also resolve the issue?  It's model
> > >> specific and we absolutely cannot pass it through from real hardware
> > >> like that.
> > >>
> > > 
> > > Hi Andrew,
> > > 
> > > Thanks for your response.  The guest is running Windows 10 and the crash
> > > happens in a proprietary hardware driver.  A little bit of knowledge as
> > > they say was enough to stop the crash but I don't understand the impact
> > > of what I've actually done...
> > > 
> > > To rework the patch I'd need a bit of guidance, if I understand your
> > > suggestion I set the MSR to 0 with this change in emul-priv-op.c:
> > 
> > For the purpose of the experiment suggested by Andrew ...
> > 
> > > diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
> > > index ed97b1d6fcc..66f5e417df6 100644
> > > --- a/xen/arch/x86/pv/emul-priv-op.c
> > > +++ b/xen/arch/x86/pv/emul-priv-op.c
> > > @@ -976,6 +976,10 @@ static int read_msr(unsigned int reg, uint64_t *val,
> > >          *val = 0;
> > >          return X86EMUL_OKAY;
> > >  
> > > +    case MSR_TEMPERATURE_TARGET:
> > > +        *val = 0;
> > > +        return X86EMUL_OKAY;
> > > +
> > >      case MSR_P6_PERFCTR(0) ... MSR_P6_PERFCTR(7):
> > >      case MSR_P6_EVNTSEL(0) ... MSR_P6_EVNTSEL(3):
> > >      case MSR_CORE_PERF_FIXED_CTR0 ... MSR_CORE_PERF_FIXED_CTR2:
> > 
> > ... you wouldn't need this (affects PV domains only), and ...
> > 
> > > and this in vmx.c:
> > > 
> > > diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> > > index 54023a92587..bbf37b7f272 100644
> > > --- a/xen/arch/x86/hvm/vmx/vmx.c
> > > +++ b/xen/arch/x86/hvm/vmx/vmx.c
> > > @@ -3259,6 +3259,11 @@ static int vmx_msr_read_intercept(unsigned int msr, uint64_t *msr_content)
> > >          if ( !nvmx_msr_read_intercept(msr, msr_content) )
> > >              goto gp_fault;
> > >          break;
> > > +
> > > +    case MSR_TEMPERATURE_TARGET:
> > > +        *msr_content = 0;
> > > +        break;

I think the preference now is to add such handling directly in
guest_rdmsr()?  Protected with a:

if ( !(cp->x86_vendor & (X86_VENDOR_INTEL)) )
    goto gp_fault;

Thanks, Roger.
James Dingwall Nov. 29, 2023, 1:18 p.m. UTC | #10
On Mon, Nov 20, 2023 at 10:24:05AM +0100, Roger Pau Monné wrote:
> On Mon, Nov 20, 2023 at 08:27:36AM +0000, James Dingwall wrote:
> > On Fri, Nov 17, 2023 at 10:56:30AM +0100, Jan Beulich wrote:
> > > On 17.11.2023 10:18, James Dingwall wrote:
> > > > On Thu, Nov 16, 2023 at 04:32:47PM +0000, Andrew Cooper wrote:
> > > >> On 16/11/2023 4:15 pm, James Dingwall wrote:
> > > >>> Hi,
> > > >>>
> > > >>> Per the msr_relaxed documentation:
> > > >>>
> > > >>>    "If using this option is necessary to fix an issue, please report a bug."
> > > >>>
> > > >>> After recently upgrading an environment from Xen 4.14.5 to Xen 4.15.5 we
> > > >>> started experiencing a BSOD at boot with one of our Windows guests.  We found
> > > >>> that enabling `msr_relaxed = 1` in the guest configuration has resolved the
> > > >>> problem.  With a debug build of Xen and `hvm_debug=2048` on the command line
> > > >>> the following messages were caught as the BSOD happened:
> > > >>>
> > > >>> (XEN) [HVM:11.0] <vmx_msr_read_intercept> ecx=0x1a2
> > > >>> (XEN) vmx.c:3298:d11v0 RDMSR 0x000001a2 unimplemented
> > > >>> (XEN) d11v0 VIRIDIAN CRASH: 1e ffffffffc0000096 fffff80b8de81eb5 0 0
> > > >>>
> > > >>> I found that MSR 0x1a2 is MSR_TEMPERATURE_TARGET and from that this patch
> > > >>> series from last month:
> > > >>>
> > > >>> https://patchwork.kernel.org/project/xen-devel/list/?series=796550
> > > >>>
> > > >>> Picking out just a small part of that fixes the problem for us. Although the
> > > >>> the patch is against 4.15.5 I think it would be relevant to more recent
> > > >>> releases too.
> > > >>
> > > >> Which version of Windows, and what hardware?
> > > >>
> > > >> The Viridian Crash isn't about the RDMSR itself - it's presumably
> > > >> collateral damage shortly thereafter.
> > > >>
> > > >> Does filling in 0 for that MSR also resolve the issue?  It's model
> > > >> specific and we absolutely cannot pass it through from real hardware
> > > >> like that.
> > > >>
> > > > 
> > > > Hi Andrew,
> > > > 
> > > > Thanks for your response.  The guest is running Windows 10 and the crash
> > > > happens in a proprietary hardware driver.  A little bit of knowledge as
> > > > they say was enough to stop the crash but I don't understand the impact
> > > > of what I've actually done...
> > > > 
> > > > To rework the patch I'd need a bit of guidance, if I understand your
> > > > suggestion I set the MSR to 0 with this change in emul-priv-op.c:
> > > 
> > > For the purpose of the experiment suggested by Andrew ...
> > > 
> > > > diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
> > > > index ed97b1d6fcc..66f5e417df6 100644
> > > > --- a/xen/arch/x86/pv/emul-priv-op.c
> > > > +++ b/xen/arch/x86/pv/emul-priv-op.c
> > > > @@ -976,6 +976,10 @@ static int read_msr(unsigned int reg, uint64_t *val,
> > > >          *val = 0;
> > > >          return X86EMUL_OKAY;
> > > >  
> > > > +    case MSR_TEMPERATURE_TARGET:
> > > > +        *val = 0;
> > > > +        return X86EMUL_OKAY;
> > > > +
> > > >      case MSR_P6_PERFCTR(0) ... MSR_P6_PERFCTR(7):
> > > >      case MSR_P6_EVNTSEL(0) ... MSR_P6_EVNTSEL(3):
> > > >      case MSR_CORE_PERF_FIXED_CTR0 ... MSR_CORE_PERF_FIXED_CTR2:
> > > 
> > > ... you wouldn't need this (affects PV domains only), and ...
> > > 
> > > > and this in vmx.c:
> > > > 
> > > > diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
> > > > index 54023a92587..bbf37b7f272 100644
> > > > --- a/xen/arch/x86/hvm/vmx/vmx.c
> > > > +++ b/xen/arch/x86/hvm/vmx/vmx.c
> > > > @@ -3259,6 +3259,11 @@ static int vmx_msr_read_intercept(unsigned int msr, uint64_t *msr_content)
> > > >          if ( !nvmx_msr_read_intercept(msr, msr_content) )
> > > >              goto gp_fault;
> > > >          break;
> > > > +
> > > > +    case MSR_TEMPERATURE_TARGET:
> > > > +        *msr_content = 0;
> > > > +        break;
> 
> I think the preference now is to add such handling directly in
> guest_rdmsr()?  Protected with a:
> 
> if ( !(cp->x86_vendor & (X86_VENDOR_INTEL)) )
>     goto gp_fault;
> 

It is possible we can patch the the driver which is triggering the BSOD but it
seems unlileky we'd be able to roll that out in advance of doing the Xen
upgrade for dom0.  If the problem we are encountering is specific to our
situation rather than a general case issue then we can easily carry a patch
for that.

Thanks for the help,
James
diff mbox series

Patch

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 54023a92587..3f64471c8a8 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -3259,6 +3259,14 @@  static int vmx_msr_read_intercept(unsigned int msr, uint64_t *msr_content)
         if ( !nvmx_msr_read_intercept(msr, msr_content) )
             goto gp_fault;
         break;
+
+    case MSR_TEMPERATURE_TARGET:
+        if ( !rdmsr_safe(msr, *msr_content) )
+            break;
+        /* RO for guests, MSR_PLATFORM_INFO bits set accordingly in msr.c to indicate lack of write
+         * support. */
+        goto gp_fault;
+
     case MSR_IA32_MISC_ENABLE:
         rdmsrl(MSR_IA32_MISC_ENABLE, *msr_content);
         /* Debug Trace Store is not supported. */
diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
index ed97b1d6fcc..eb9eb45e820 100644
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -976,6 +976,9 @@  static int read_msr(unsigned int reg, uint64_t *val,
         *val = 0;
         return X86EMUL_OKAY;
 
+    case MSR_TEMPERATURE_TARGET:
+        goto normal;
+
     case MSR_P6_PERFCTR(0) ... MSR_P6_PERFCTR(7):
     case MSR_P6_EVNTSEL(0) ... MSR_P6_EVNTSEL(3):
     case MSR_CORE_PERF_FIXED_CTR0 ... MSR_CORE_PERF_FIXED_CTR2:
diff --git a/xen/include/asm-x86/msr-index.h b/xen/include/asm-x86/msr-index.h
index 8b3ad575dbc..34e800fdc01 100644
--- a/xen/include/asm-x86/msr-index.h
+++ b/xen/include/asm-x86/msr-index.h
@@ -498,6 +498,9 @@ 
 #define MSR_IA32_MISC_ENABLE_XD_DISABLE	(1ULL << 34)
 
 #define MSR_IA32_TSC_DEADLINE		0x000006E0
+
+#define MSR_TEMPERATURE_TARGET		0x000001a2
+
 #define MSR_IA32_ENERGY_PERF_BIAS	0x000001b0
 
 /* Platform Shared Resource MSRs */