[2/2] ide: fix crash in IDE cdrom read

Message ID	20171128121055.6954-3-den@openvz.org (mailing list archive)
State	New, archived
Headers	show Return-Path: <qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org> From: "Denis V. Lunev" <den@openvz.org> To: Date: Tue, 28 Nov 2017 15:10:55 +0300 Message-Id: <20171128121055.6954-3-den@openvz.org> In-Reply-To: <20171128121055.6954-1-den@openvz.org> References: <20171128121055.6954-1-den@openvz.org> Subject: [Qemu-devel] [PATCH 2/2] ide: fix crash in IDE cdrom read Precedence: list Cc: Kevin Wolf <kwolf@redhat.com>, qemu-devel@nongnu.org, qemu-stable@nongnu.org, Stefan Hajnoczi <stefanha@redhat.com>, "Denis V. Lunev" <den@openvz.org>, John Snow <jsnow@redhat.com> Errors-To: qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org Sender: "Qemu-devel" <qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org>

Denis V. Lunev Nov. 28, 2017, 12:10 p.m. UTC

There is the following crash reported from the field in QEMU 2.9:
    bdrv_inc_in_flight (bs=bs@entry=0x0)
    blk_aio_prwv
    blk_aio_preadv
    ide_buffered_readv
    cd_read_sector
    ide_data_readw
    portio_read
    memory_region_read_accessor
    access_with_adjusted_size
    memory_region_dispatch_read1
    memory_region_dispatch_read
    address_space_read_continue
    address_space_read_full
    address_space_read
    address_space_rw
    kvm_handle_io
    kvm_cpu_exec
    qemu_kvm_cpu_thread_fn
    start_thread
    clone
Indeed, the CDROM device without media has blk->bs == NULL. We should
check that the media is really available for the device like has been done
in SCSI code.

May be the patch adds a bit more check than necessary, but this is not be
the problem. We should always stay on the safe side.

Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: John Snow <jsnow@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Stefan Hajnoczi <stefanha@redhat.com>
---
 hw/ide/atapi.c | 32 ++++++++++++++++++++++++++++----
 hw/ide/core.c  |  4 ++--
 2 files changed, 30 insertions(+), 6 deletions(-)

Kevin Wolf Nov. 28, 2017, 4:56 p.m. UTC | #1

Am 28.11.2017 um 13:10 hat Denis V. Lunev geschrieben:
> There is the following crash reported from the field in QEMU 2.9:
>     bdrv_inc_in_flight (bs=bs@entry=0x0)
>     blk_aio_prwv
>     blk_aio_preadv
>     ide_buffered_readv
>     cd_read_sector
>     ide_data_readw
>     portio_read
>     memory_region_read_accessor
>     access_with_adjusted_size
>     memory_region_dispatch_read1
>     memory_region_dispatch_read
>     address_space_read_continue
>     address_space_read_full
>     address_space_read
>     address_space_rw
>     kvm_handle_io
>     kvm_cpu_exec
>     qemu_kvm_cpu_thread_fn
>     start_thread
>     clone
> Indeed, the CDROM device without media has blk->bs == NULL. We should
> check that the media is really available for the device like has been done
> in SCSI code.
> 
> May be the patch adds a bit more check than necessary, but this is not be
> the problem. We should always stay on the safe side.
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: John Snow <jsnow@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  hw/ide/atapi.c | 32 ++++++++++++++++++++++++++++----
>  hw/ide/core.c  |  4 ++--
>  2 files changed, 30 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/ide/atapi.c b/hw/ide/atapi.c
> index c0509c8bf5..fa50c0ccf6 100644
> --- a/hw/ide/atapi.c
> +++ b/hw/ide/atapi.c
> @@ -119,6 +119,11 @@ cd_read_sector_sync(IDEState *s)
>  
>      trace_cd_read_sector_sync(s->lba);
>  
> +    if (!blk_is_available(s->blk)) {
> +        ret = -ENOMEDIUM;
> +        goto fail;
> +    }
> +
>      switch (s->cd_sector_size) {
>      case 2048:
>          ret = blk_pread(s->blk, (int64_t)s->lba << ATAPI_SECTOR_BITS,
> @@ -132,8 +137,8 @@ cd_read_sector_sync(IDEState *s)
>          }
>          break;
>      default:
> -        block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_READ);
> -        return -EIO;
> +        ret = -EIO;
> +        goto fail;
>      }
>  
>      if (ret < 0) {
> @@ -145,6 +150,10 @@ cd_read_sector_sync(IDEState *s)
>      }
>  
>      return ret;
> +
> +fail:
> +    block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_READ);
> +    return ret;
>  }

I'm not sure if we can do anything about blk_aio_* in the short time
that we have until 2.11, so maybe we need to fix some callers like IDE
(we won't catch all of them anyway), but at least the synchronous one
should be easily handled in blk_prwv() by returning -ENOMEDIUM before we
access blk_bs(blk).

AIO is trickier because we need to schedule a BH, and blk_drain() can't
execute that BH yet unless we increase bs->in_flight - which we
obviously can't do for a NULL bs->in_flight. The proper solution
involves a separate blk->in_flight counter for the BB and a blk_drain()
implementation that considers it.

Kevin

Denis V. Lunev Nov. 28, 2017, 5:26 p.m. UTC | #2

On 11/28/2017 07:56 PM, Kevin Wolf wrote:
> Am 28.11.2017 um 13:10 hat Denis V. Lunev geschrieben:
>> There is the following crash reported from the field in QEMU 2.9:
>>     bdrv_inc_in_flight (bs=bs@entry=0x0)
>>     blk_aio_prwv
>>     blk_aio_preadv
>>     ide_buffered_readv
>>     cd_read_sector
>>     ide_data_readw
>>     portio_read
>>     memory_region_read_accessor
>>     access_with_adjusted_size
>>     memory_region_dispatch_read1
>>     memory_region_dispatch_read
>>     address_space_read_continue
>>     address_space_read_full
>>     address_space_read
>>     address_space_rw
>>     kvm_handle_io
>>     kvm_cpu_exec
>>     qemu_kvm_cpu_thread_fn
>>     start_thread
>>     clone
>> Indeed, the CDROM device without media has blk->bs == NULL. We should
>> check that the media is really available for the device like has been done
>> in SCSI code.
>>
>> May be the patch adds a bit more check than necessary, but this is not be
>> the problem. We should always stay on the safe side.
>>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> CC: John Snow <jsnow@redhat.com>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>> ---
>>  hw/ide/atapi.c | 32 ++++++++++++++++++++++++++++----
>>  hw/ide/core.c  |  4 ++--
>>  2 files changed, 30 insertions(+), 6 deletions(-)
>>
>> diff --git a/hw/ide/atapi.c b/hw/ide/atapi.c
>> index c0509c8bf5..fa50c0ccf6 100644
>> --- a/hw/ide/atapi.c
>> +++ b/hw/ide/atapi.c
>> @@ -119,6 +119,11 @@ cd_read_sector_sync(IDEState *s)
>>  
>>      trace_cd_read_sector_sync(s->lba);
>>  
>> +    if (!blk_is_available(s->blk)) {
>> +        ret = -ENOMEDIUM;
>> +        goto fail;
>> +    }
>> +
>>      switch (s->cd_sector_size) {
>>      case 2048:
>>          ret = blk_pread(s->blk, (int64_t)s->lba << ATAPI_SECTOR_BITS,
>> @@ -132,8 +137,8 @@ cd_read_sector_sync(IDEState *s)
>>          }
>>          break;
>>      default:
>> -        block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_READ);
>> -        return -EIO;
>> +        ret = -EIO;
>> +        goto fail;
>>      }
>>  
>>      if (ret < 0) {
>> @@ -145,6 +150,10 @@ cd_read_sector_sync(IDEState *s)
>>      }
>>  
>>      return ret;
>> +
>> +fail:
>> +    block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_READ);
>> +    return ret;
>>  }
> I'm not sure if we can do anything about blk_aio_* in the short time
> that we have until 2.11, so maybe we need to fix some callers like IDE
> (we won't catch all of them anyway), but at least the synchronous one
> should be easily handled in blk_prwv() by returning -ENOMEDIUM before we
> access blk_bs(blk).
>
> AIO is trickier because we need to schedule a BH, and blk_drain() can't
> execute that BH yet unless we increase bs->in_flight - which we
> obviously can't do for a NULL bs->in_flight. The proper solution
> involves a separate blk->in_flight counter for the BB and a blk_drain()
> implementation that considers it.
>
> Kevin
I have double checked that SCSI code is correct.
AHCI comes through IDE thus with this patch applied
we will be safe at emulation layer.

Den

John Snow Nov. 28, 2017, 11:50 p.m. UTC | #3

On 11/28/2017 07:10 AM, Denis V. Lunev wrote:
> There is the following crash reported from the field in QEMU 2.9:
>     bdrv_inc_in_flight (bs=bs@entry=0x0)
>     blk_aio_prwv
>     blk_aio_preadv
>     ide_buffered_readv
>     cd_read_sector

Is ide_atapi_cmd_reply_end missing from the call stack here for some
reason? ide_data_readw /should/ have end_transfer_func set to that if it
was processing an ATAPI command; otherwise it shouldn't be able to get
here under normal circumstances...

>     ide_data_readw

How do we get here? ide_is_pio_out ought to be false here; do you know
what command was being processed? Do you have a reproducer?

Knowing both the ATA and ATAPI commands under consideration here would
be helpful.

>     portio_read
>     memory_region_read_accessor
>     access_with_adjusted_size
>     memory_region_dispatch_read1
>     memory_region_dispatch_read
>     address_space_read_continue
>     address_space_read_full
>     address_space_read
>     address_space_rw
>     kvm_handle_io
>     kvm_cpu_exec
>     qemu_kvm_cpu_thread_fn
>     start_thread
>     clone

The thing that looks to be incongruent here is that we appear to be
servicing a ATAPI reply; that reply is either the kind that uses
preformatted data in a buffer, or the kind that buffers a read.

If it buffers a read, it should require CHECK_READY which requires a medium.

If it does not buffer a read, it should not be able to invoke
cd_read_sector or any bdrv function from ide_data_readw.

If it neither serves preformatted data nor buffers a read, it should not
allow ide_data_readw to perform any action at all.

> Indeed, the CDROM device without media has blk->bs == NULL. We should
> check that the media is really available for the device like has been done
> in SCSI code.
> 
> May be the patch adds a bit more check than necessary, but this is not be
> the problem. We should always stay on the safe side.
> 
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: John Snow <jsnow@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  hw/ide/atapi.c | 32 ++++++++++++++++++++++++++++----
>  hw/ide/core.c  |  4 ++--
>  2 files changed, 30 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/ide/atapi.c b/hw/ide/atapi.c
> index c0509c8bf5..fa50c0ccf6 100644
> --- a/hw/ide/atapi.c
> +++ b/hw/ide/atapi.c
> @@ -119,6 +119,11 @@ cd_read_sector_sync(IDEState *s)
>  
>      trace_cd_read_sector_sync(s->lba);
>  
> +    if (!blk_is_available(s->blk)) {
> +        ret = -ENOMEDIUM;
> +        goto fail;
> +    }
> +
>      switch (s->cd_sector_size) {
>      case 2048:
>          ret = blk_pread(s->blk, (int64_t)s->lba << ATAPI_SECTOR_BITS,
> @@ -132,8 +137,8 @@ cd_read_sector_sync(IDEState *s)
>          }
>          break;
>      default:
> -        block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_READ);
> -        return -EIO;
> +        ret = -EIO;
> +        goto fail;
>      }
>  
>      if (ret < 0) {
> @@ -145,6 +150,10 @@ cd_read_sector_sync(IDEState *s)
>      }
>  
>      return ret;
> +
> +fail:
> +    block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_READ);
> +    return ret;
>  }
>  
>  static void cd_read_sector_cb(void *opaque, int ret)
> @@ -174,9 +183,15 @@ static void cd_read_sector_cb(void *opaque, int ret)
>  
>  static int cd_read_sector(IDEState *s)
>  {
> +    int err;
> +
>      if (s->cd_sector_size != 2048 && s->cd_sector_size != 2352) {
> -        block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_READ);
> -        return -EINVAL;
> +        err = -EINVAL;
> +        goto fail;
> +    }
> +    if (!blk_is_available(s->blk)) {
> +        err = -ENOMEDIUM;
> +        goto fail;
>      }
>  
>      s->iov.iov_base = (s->cd_sector_size == 2352) ?
> @@ -195,6 +210,10 @@ static int cd_read_sector(IDEState *s)
>  
>      s->status |= BUSY_STAT;
>      return 0;
> +
> +fail:
> +    block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_READ);
> +    return err;
>  }
>  
>  void ide_atapi_cmd_ok(IDEState *s)
> @@ -404,6 +423,11 @@ static void ide_atapi_cmd_read_dma_cb(void *opaque, int ret)
>          goto eot;
>      }
>  
> +    if (!blk_is_available(s->blk)) {
> +        ide_atapi_cmd_read_dma_cb(s, -ENOMEDIUM);
> +        return;
> +    }
> +

I'm not sure this is OK, because it's an error but not setting the sense
code or raising an IRQ; it's only calling ide_set_inactive, so this
might be a problem. Normally the error code from the data read is
handled earlier in the call by ide_handle_rw_error which can set the
proper codes.

>      s->io_buffer_index = 0;
>      if (s->cd_sector_size == 2352) {
>          n = 1;
> diff --git a/hw/ide/core.c b/hw/ide/core.c
> index 471d0c928b..71780fc9d1 100644
> --- a/hw/ide/core.c
> +++ b/hw/ide/core.c
> @@ -758,7 +758,7 @@ static void ide_sector_read(IDEState *s)
>  
>      trace_ide_sector_read(sector_num, n);
>  
> -    if (!ide_sect_range_ok(s, sector_num, n)) {
> +    if (!ide_sect_range_ok(s, sector_num, n) || !blk_is_available(s->blk)) {
>          ide_rw_error(s);
>          block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_READ);
>          return;
> @@ -1023,7 +1023,7 @@ static void ide_sector_write(IDEState *s)
>  
>      trace_ide_sector_write(sector_num, n);
>  
> -    if (!ide_sect_range_ok(s, sector_num, n)) {
> +    if (!ide_sect_range_ok(s, sector_num, n) || !blk_is_available(s->blk)) {
>          ide_rw_error(s);
>          block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_WRITE);
>          return;
> 

Since it's not a new regression (this is being reported against 2.9) I
am somewhat hesitant to rush it into 2.11-rc3 without a little more info.

That said, here's a list of the ATAPI commands we service that either
return or CAN return data, but do not enforce CHECK_READY:

    [ 0x03 ] = { cmd_request_sense,                 ALLOW_UA },
    [ 0x12 ] = { cmd_inquiry,                       ALLOW_UA },
    [ 0x46 ] = { cmd_get_configuration,             ALLOW_UA },
    [ 0x4a ] = { cmd_get_event_status_notification, ALLOW_UA },
    [ 0x5a ] = { cmd_mode_sense, /* (10) */         0 },
    [ 0xbd ] = { cmd_mechanism_status,              0 },

These six all invoke ide_atapi_cmd_reply in some way (which allows the
guest to begin reading the reply via PIO mechanisms if DMA is not set):

cmd_request_sense:     ide_atapi_cmd_reply(s, 18, max_len);
cmd_inquiry:           ide_atapi_cmd_reply(s, idx, max_len);
cmd_get_configuration: ide_atapi_cmd_reply(s, len, max_len);
cmd_get_event_status_notification:
                       ide_atapi_cmd_reply(s, used_len, max_len);
cmd_mode_sense:        ide_atapi_cmd_reply(s, 16, max_len);
                       ide_atapi_cmd_reply(s, 24, max_len);
                       ide_atapi_cmd_reply(s, 30, max_len);
cmd_mechanism_status:  ide_atapi_cmd_reply(s, 8, max_len);

ide_atapi_cmd_reply itself sets lba to -1, which should inform
ide_atapi_cmd_reply_end not to attempt to buffer any new data. These
*should* be safe. The reply sizes are also all small enough that they
are almost certainly getting buffered in one single transfer without
attempting to buffer more data.

In the normative PIO case then, reads will consume from this buffer
until empty, then we'll call ide_atapi_cmd_reply_end through
end_transfer_func and hit the end of transfer logic.

I'm not sure I see how this crash is happening; it doesn't look like
this path allows for invoking the block read functions and everything
else is guarded by NONDATA or CHECK_READY.

Unless this is an interesting interaction with ide_atapi_cmd_reply
setting the DRQ bit which may trick the pio read function to allow PIO
reads to come in during this time?

Hypothetically:

cmd_packet sets end_transfer_func to ide_atapi_cmd so it can process
further once that PIO is completed. (The PIO may be emulated
synchronously, in the case of AHCI.) In the true PIO case, it may be
asynchronous.

Now, in the AHCI case, the guest has control back and the CDROM is now
executing a command to, perhaps, read data via DMA. Then,
simultaneously, the guest issues a PIO read and because the DRQ bit is
set for DMA, the PIO read also goes through.

This still requires a media, though... and a very broken guest doing
something naughty on purpose.

I can't quite connect the dots to see how this crash is possible in
general... it'd help to have:

- The ATAPI command being processed at the time
- The IDE command being processed at the time
- s->status

as a minimum, and then perhaps optionally some of the ATAPI loop
variables, like packet_transfer_size, elementary_transfer_size, and
io_buffer_index. Or a reproducer!

Sorry I'm not more help and I wrote too much again :(

Denis V. Lunev Nov. 30, 2017, 12:01 p.m. UTC | #4

On 11/29/2017 02:50 AM, John Snow wrote:
>
> On 11/28/2017 07:10 AM, Denis V. Lunev wrote:
>> There is the following crash reported from the field in QEMU 2.9:
>>     bdrv_inc_in_flight (bs=bs@entry=0x0)
>>     blk_aio_prwv
>>     blk_aio_preadv
>>     ide_buffered_readv
>>     cd_read_sector
> Is ide_atapi_cmd_reply_end missing from the call stack here for some
> reason? ide_data_readw /should/ have end_transfer_func set to that if it
> was processing an ATAPI command; otherwise it shouldn't be able to get
> here under normal circumstances...
this is my fault. I have lost this line while removing unnecessary into
from callstack. Here if the full one.

(gdb) bt
#0  bdrv_inc_in_flight (bs=bs@entry=0x0) at block/io.c:561
#1  0x000055b224160d37 in blk_aio_prwv (blk=0x55b2265405a0,
    offset=offset@entry=285122560, bytes=2048,
qiov=qiov@entry=0x55b22a31de20,
    co_entry=co_entry@entry=0x55b224160400 <blk_aio_read_entry>,
    flags=flags@entry=0, cb=cb@entry=0x55b224048e00
<ide_buffered_readv_cb>,
    opaque=opaque@entry=0x55b22a31de00) at block/block-backend.c:1151
#2  0x000055b224160db5 in blk_aio_preadv (blk=<optimized out>,
    offset=offset@entry=285122560, qiov=qiov@entry=0x55b22a31de20,
    flags=flags@entry=0, cb=cb@entry=0x55b224048e00
<ide_buffered_readv_cb>,
    opaque=opaque@entry=0x55b22a31de00) at block/block-backend.c:1256
#3  0x000055b22404bd8d in ide_buffered_readv (s=s@entry=0x55b22a712a68,
    sector_num=556880, iov=iov@entry=0x55b22a712d60,
    nb_sectors=nb_sectors@entry=4,
    cb=cb@entry=0x55b22404f2a0 <cd_read_sector_cb>,
    opaque=opaque@entry=0x55b22a712a68) at hw/ide/core.c:637
#4  0x000055b22404e2c1 in cd_read_sector (s=0x55b22a712a68)
    at hw/ide/atapi.c:198
#5  ide_atapi_cmd_reply_end (s=0x55b22a712a68) at hw/ide/atapi.c:272
#6  0x000055b224049704 in ide_data_readw (opaque=<optimized out>,
    addr=<optimized out>) at hw/ide/core.c:2263
#7  0x000055b223edd7f0 in portio_read (opaque=0x55b22a836000, addr=0,
size=2)
    at /usr/src/debug/qemu-2.9.0/ioport.c:180
#8  0x000055b223ee8e3b in memory_region_read_accessor (mr=0x55b22a836000,
    addr=0, value=0x7f843b5be7c0, size=2, shift=0, mask=65535, attrs=...)
    at /usr/src/debug/qemu-2.9.0/memory.c:435
#9  0x000055b223ee6369 in access_with_adjusted_size (addr=addr@entry=0,
    value=value@entry=0x7f843b5be7c0, size=size@entry=2,
    access_size_min=<optimized out>, access_size_max=<optimized out>,
    access=access@entry=0x55b223ee8e10 <memory_region_read_accessor>,
    mr=mr@entry=0x55b22a836000, attrs=attrs@entry=...)
    at /usr/src/debug/qemu-2.9.0/memory.c:592
#10 0x000055b223ee9d36 in memory_region_dispatch_read1 (attrs=..., size=2,
    pval=0x7f843b5be7c0, addr=0, mr=0x55b22a836000)
    at /usr/src/debug/qemu-2.9.0/memory.c:1238
#11 memory_region_dispatch_read (mr=mr@entry=0x55b22a836000,
    addr=addr@entry=0, pval=pval@entry=0x7f843b5be7c0, size=size@entry=2,
    attrs=attrs@entry=...) at /usr/src/debug/qemu-2.9.0/memory.c:1269
#12 0x000055b223e9bdb2 in address_space_read_continue (
    as=as@entry=0x55b2247db8c0 <address_space_io>, addr=addr@entry=496,
---Type <return> to continue, or q <return> to quit---
    attrs=..., attrs@entry=...,
    buf=buf@entry=0x7f844dc103fe <Address 0x7f844dc103fe out of bounds>,
    len=len@entry=2, addr1=0, l=2, mr=0x55b22a836000)
    at /usr/src/debug/qemu-2.9.0/exec.c:2844
#13 0x000055b223e9be67 in address_space_read_full (
    as=0x55b2247db8c0 <address_space_io>, addr=addr@entry=496, attrs=...,
    buf=buf@entry=0x7f844dc103fe <Address 0x7f844dc103fe out of bounds>,
    len=2, len@entry=602521191) at /usr/src/debug/qemu-2.9.0/exec.c:2895
#14 0x000055b223e9bfce in address_space_read (len=602521191,
    buf=0x7f844dc103fe <Address 0x7f844dc103fe out of bounds>, attrs=...,
    addr=496, as=<optimized out>)
    at /usr/src/debug/qemu-2.9.0/include/exec/memory.h:1718
#15 address_space_rw (as=<optimized out>, addr=addr@entry=496, attrs=...,
    attrs@entry=...,
    buf=buf@entry=0x7f844dc103fe <Address 0x7f844dc103fe out of bounds>,
    len=len@entry=2, is_write=is_write@entry=false)
    at /usr/src/debug/qemu-2.9.0/exec.c:2909
#16 0x000055b223ee5271 in kvm_handle_io (count=512, size=2,
    direction=<optimized out>, data=<optimized out>, attrs=..., port=496)
    at /usr/src/debug/qemu-2.9.0/kvm-all.c:1828
#17 kvm_cpu_exec (cpu=cpu@entry=0x55b229032000)
    at /usr/src/debug/qemu-2.9.0/kvm-all.c:2058
#18 0x000055b223ed1c22 in qemu_kvm_cpu_thread_fn (arg=0x55b229032000)
    at /usr/src/debug/qemu-2.9.0/cpus.c:1119
#19 0x00007f8443993e25 in start_thread (arg=0x7f843b5bf700)
    at pthread_create.c:308
#20 0x00007f84436c134d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113



>>     ide_data_readw
> How do we get here? ide_is_pio_out ought to be false here; do you know
> what command was being processed? Do you have a reproducer?
>
> Knowing both the ATA and ATAPI commands under consideration here would
> be helpful.
>

if you prefer, I can upload core/binary to some storage.
Here is the state.

(gdb) p *s
$3 = {bus = 0x55b22a7129f0, unit = 0 '\000', drive_kind = IDE_CD,
  cylinders = 0, heads = 0, sectors = 0, chs_trans = 0, nb_sectors =
11588488,
  mult_sectors = 16, identify_set = 1,
  identify_data = "\300\205", '\000' <repeats 18 times>, "MQ0000 1", ' '
<repeats 12 times>, "\003\000\000\002\004\000.2+5    EQUMD DVR-MO", ' '
<repeats 28 times>,
"\000\000\001\000\000\003\000\000\000\000\000\000\a", '\000' <repeats 17
times>,
"\a\000\a\000\003\000\264\000\264\000,\001\264\000\000\000\000\000\036\000\036",
'\000' <repeats 15 times>, "\036", '\000' <repeats 15 times>, "? ",
'\000' <repeats 333 times>, drive_serial = 1,
  drive_serial_str = "QM00001", '\000' <repeats 13 times>,
  drive_model_str = "QEMU DVD-ROM", '\000' <repeats 28 times>, wwn = 0,
  feature = 0 '\000', error = 0 '\000', nsector = 2, sector = 0 '\000',
  lcyl = 0 '\000', hcyl = 8 '\b', hob_feature = 0 '\000',
  hob_nsector = 3 '\003', hob_sector = 0 '\000', hob_lcyl = 0 '\000',
  hob_hcyl = 8 '\b', select = 32 ' ', status = 80 'P', lba48 = 0 '\000',
  blk = 0x55b2265405a0, version = "2.5+\000\000\000\000", events = {
    eject_request = false, new_media = false}, sense_key = 0 '\000',
  asc = 0 '\000', tray_open = false, tray_locked = false,
  cdrom_changed = 0 '\000', packet_transfer_size = 55296,
  elementary_transfer_size = 0, io_buffer_index = 2048, lba = 139220,
  cd_sector_size = 2048, atapi_dma = 0, acct = {bytes = 2048,
    start_time_ns = 433417215764666, type = BLOCK_ACCT_READ}, pio_aiocb
= 0x0,
  iov = {iov_base = 0x55b22a8ae000, iov_len = 2048}, qiov = {
    iov = 0x55b22a712d50, niov = 1, nalloc = -1, size = 2048},
  buffered_requests = {lh_first = 0x0}, io_buffer_offset = 0,
  io_buffer_size = 0, sg = {sg = 0x0, nsg = 0, nalloc = 0, size = 0,
    dev = 0x0, as = 0x0}, req_nb_sectors = 0,
  end_transfer_func = 0x55b22404e190 <ide_atapi_cmd_reply_end>,
  data_ptr = 0x55b22a8ae800 "", data_end = 0x55b22a8ae800 "",
  io_buffer = 0x55b22a8ae000
")8\"\t\031\"\232\vx.\311l\301\270o$\335\025\257\064F\271\rI,R\342\032\315\nԆV\341od\251\023>\024\370\323\060A\256\337\300텋\024\233\201U\221T^\202\303\036\"E%\262\230\367ξ\fW\302\360\277\347\334\022\253\a\025\216\rj\334z\355>)\230/\021\255%a^\306\001\032",

  io_buffer_total_len = 131076, cur_io_buffer_offset = 0,
  cur_io_buffer_len = 0, end_transfer_fn_idx = 0 '\000',
  sector_write_timer = 0x55b22a45d5c0, irq_count = 0, ext_error = 0 '\000',
  mdata_size = 0, mdata_storage = 0x0, media_changed = 0,
  dma_cmd = IDE_DMA_READ, smart_enabled = 1 '\001', smart_autosave = 1
'\001',
  smart_errors = 0, smart_selftest_count = 0 '\000',
---Type <return> to continue, or q <return> to quit---
  smart_selftest_data = 0x55b22a827000 "", ncq_queues = 0}
(gdb)

In general, I can upload the binary and core to the location
you want. Send it to me in person. By the way. I am staring to
receive these crashes in number. 2-3 in a day BUT from one
specific host. This could be important.

>>     portio_read
>>     memory_region_read_accessor
>>     access_with_adjusted_size
>>     memory_region_dispatch_read1
>>     memory_region_dispatch_read
>>     address_space_read_continue
>>     address_space_read_full
>>     address_space_read
>>     address_space_rw
>>     kvm_handle_io
>>     kvm_cpu_exec
>>     qemu_kvm_cpu_thread_fn
>>     start_thread
>>     clone
> The thing that looks to be incongruent here is that we appear to be
> servicing a ATAPI reply; that reply is either the kind that uses
> preformatted data in a buffer, or the kind that buffers a read.
>
> If it buffers a read, it should require CHECK_READY which requires a medium.
>
> If it does not buffer a read, it should not be able to invoke
> cd_read_sector or any bdrv function from ide_data_readw.
>
> If it neither serves preformatted data nor buffers a read, it should not
> allow ide_data_readw to perform any action at all.
the guest could misbehave! This is not a reason to crash.
For example there is a race or something like this. What
if the guest will execute read from the device without
ANY sanity checks? QEMU should not crash. This is the point
and this is the approach taken in the original patch.


>> Indeed, the CDROM device without media has blk->bs == NULL. We should
>> check that the media is really available for the device like has been done
>> in SCSI code.
>>
>> May be the patch adds a bit more check than necessary, but this is not be
>> the problem. We should always stay on the safe side.
>>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> CC: John Snow <jsnow@redhat.com>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>> ---
>>  hw/ide/atapi.c | 32 ++++++++++++++++++++++++++++----
>>  hw/ide/core.c  |  4 ++--
>>  2 files changed, 30 insertions(+), 6 deletions(-)
>>
>> diff --git a/hw/ide/atapi.c b/hw/ide/atapi.c
>> index c0509c8bf5..fa50c0ccf6 100644
>> --- a/hw/ide/atapi.c
>> +++ b/hw/ide/atapi.c
>> @@ -119,6 +119,11 @@ cd_read_sector_sync(IDEState *s)
>>  
>>      trace_cd_read_sector_sync(s->lba);
>>  
>> +    if (!blk_is_available(s->blk)) {
>> +        ret = -ENOMEDIUM;
>> +        goto fail;
>> +    }
>> +
>>      switch (s->cd_sector_size) {
>>      case 2048:
>>          ret = blk_pread(s->blk, (int64_t)s->lba << ATAPI_SECTOR_BITS,
>> @@ -132,8 +137,8 @@ cd_read_sector_sync(IDEState *s)
>>          }
>>          break;
>>      default:
>> -        block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_READ);
>> -        return -EIO;
>> +        ret = -EIO;
>> +        goto fail;
>>      }
>>  
>>      if (ret < 0) {
>> @@ -145,6 +150,10 @@ cd_read_sector_sync(IDEState *s)
>>      }
>>  
>>      return ret;
>> +
>> +fail:
>> +    block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_READ);
>> +    return ret;
>>  }
>>  
>>  static void cd_read_sector_cb(void *opaque, int ret)
>> @@ -174,9 +183,15 @@ static void cd_read_sector_cb(void *opaque, int ret)
>>  
>>  static int cd_read_sector(IDEState *s)
>>  {
>> +    int err;
>> +
>>      if (s->cd_sector_size != 2048 && s->cd_sector_size != 2352) {
>> -        block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_READ);
>> -        return -EINVAL;
>> +        err = -EINVAL;
>> +        goto fail;
>> +    }
>> +    if (!blk_is_available(s->blk)) {
>> +        err = -ENOMEDIUM;
>> +        goto fail;
>>      }
>>  
>>      s->iov.iov_base = (s->cd_sector_size == 2352) ?
>> @@ -195,6 +210,10 @@ static int cd_read_sector(IDEState *s)
>>  
>>      s->status |= BUSY_STAT;
>>      return 0;
>> +
>> +fail:
>> +    block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_READ);
>> +    return err;
>>  }
>>  
>>  void ide_atapi_cmd_ok(IDEState *s)
>> @@ -404,6 +423,11 @@ static void ide_atapi_cmd_read_dma_cb(void *opaque, int ret)
>>          goto eot;
>>      }
>>  
>> +    if (!blk_is_available(s->blk)) {
>> +        ide_atapi_cmd_read_dma_cb(s, -ENOMEDIUM);
>> +        return;
>> +    }
>> +
> I'm not sure this is OK, because it's an error but not setting the sense
> code or raising an IRQ; it's only calling ide_set_inactive, so this
> might be a problem. Normally the error code from the data read is
> handled earlier in the call by ide_handle_rw_error which can set the
> proper codes.
>
this follows the approach used by the error handling in this call.
I this code is not made worse. We can hit to the same processing
with the different error upper.

>>      s->io_buffer_index = 0;
>>      if (s->cd_sector_size == 2352) {
>>          n = 1;
>> diff --git a/hw/ide/core.c b/hw/ide/core.c
>> index 471d0c928b..71780fc9d1 100644
>> --- a/hw/ide/core.c
>> +++ b/hw/ide/core.c
>> @@ -758,7 +758,7 @@ static void ide_sector_read(IDEState *s)
>>  
>>      trace_ide_sector_read(sector_num, n);
>>  
>> -    if (!ide_sect_range_ok(s, sector_num, n)) {
>> +    if (!ide_sect_range_ok(s, sector_num, n) || !blk_is_available(s->blk)) {
>>          ide_rw_error(s);
>>          block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_READ);
>>          return;
>> @@ -1023,7 +1023,7 @@ static void ide_sector_write(IDEState *s)
>>  
>>      trace_ide_sector_write(sector_num, n);
>>  
>> -    if (!ide_sect_range_ok(s, sector_num, n)) {
>> +    if (!ide_sect_range_ok(s, sector_num, n) || !blk_is_available(s->blk)) {
>>          ide_rw_error(s);
>>          block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_WRITE);
>>          return;
>>
> Since it's not a new regression (this is being reported against 2.9) I
> am somewhat hesitant to rush it into 2.11-rc3 without a little more info.
>
> That said, here's a list of the ATAPI commands we service that either
> return or CAN return data, but do not enforce CHECK_READY:
>
>     [ 0x03 ] = { cmd_request_sense,                 ALLOW_UA },
>     [ 0x12 ] = { cmd_inquiry,                       ALLOW_UA },
>     [ 0x46 ] = { cmd_get_configuration,             ALLOW_UA },
>     [ 0x4a ] = { cmd_get_event_status_notification, ALLOW_UA },
>     [ 0x5a ] = { cmd_mode_sense, /* (10) */         0 },
>     [ 0xbd ] = { cmd_mechanism_status,              0 },
>
> These six all invoke ide_atapi_cmd_reply in some way (which allows the
> guest to begin reading the reply via PIO mechanisms if DMA is not set):
>
> cmd_request_sense:     ide_atapi_cmd_reply(s, 18, max_len);
> cmd_inquiry:           ide_atapi_cmd_reply(s, idx, max_len);
> cmd_get_configuration: ide_atapi_cmd_reply(s, len, max_len);
> cmd_get_event_status_notification:
>                        ide_atapi_cmd_reply(s, used_len, max_len);
> cmd_mode_sense:        ide_atapi_cmd_reply(s, 16, max_len);
>                        ide_atapi_cmd_reply(s, 24, max_len);
>                        ide_atapi_cmd_reply(s, 30, max_len);
> cmd_mechanism_status:  ide_atapi_cmd_reply(s, 8, max_len);
>
> ide_atapi_cmd_reply itself sets lba to -1, which should inform
> ide_atapi_cmd_reply_end not to attempt to buffer any new data. These
> *should* be safe. The reply sizes are also all small enough that they
> are almost certainly getting buffered in one single transfer without
> attempting to buffer more data.
>
> In the normative PIO case then, reads will consume from this buffer
> until empty, then we'll call ide_atapi_cmd_reply_end through
> end_transfer_func and hit the end of transfer logic.
>
> I'm not sure I see how this crash is happening; it doesn't look like
> this path allows for invoking the block read functions and everything
> else is guarded by NONDATA or CHECK_READY.
>
> Unless this is an interesting interaction with ide_atapi_cmd_reply
> setting the DRQ bit which may trick the pio read function to allow PIO
> reads to come in during this time?
>
> Hypothetically:
>
> cmd_packet sets end_transfer_func to ide_atapi_cmd so it can process
> further once that PIO is completed. (The PIO may be emulated
> synchronously, in the case of AHCI.) In the true PIO case, it may be
> asynchronous.
>
> Now, in the AHCI case, the guest has control back and the CDROM is now
> executing a command to, perhaps, read data via DMA. Then,
> simultaneously, the guest issues a PIO read and because the DRQ bit is
> set for DMA, the PIO read also goes through.
>
> This still requires a media, though... and a very broken guest doing
> something naughty on purpose.
>
> I can't quite connect the dots to see how this crash is possible in
> general... it'd help to have:
>
> - The ATAPI command being processed at the time
> - The IDE command being processed at the time
> - s->status
>
> as a minimum, and then perhaps optionally some of the ATAPI loop
> variables, like packet_transfer_size, elementary_transfer_size, and
> io_buffer_index. Or a reproducer!
>
> Sorry I'm not more help and I wrote too much again :(
imho this does not matter. Once again - the guest can misbehave.
This is not the reason to crash. You can consider this as CVE.

Den

John Snow Dec. 12, 2017, 11:40 p.m. UTC | #5

Hi Den, the long email about IDE stuff:

On 11/30/2017 07:01 AM, Denis V. Lunev wrote:
> On 11/29/2017 02:50 AM, John Snow wrote:
>>
>> On 11/28/2017 07:10 AM, Denis V. Lunev wrote:
>>> There is the following crash reported from the field in QEMU 2.9:
>>>     bdrv_inc_in_flight (bs=bs@entry=0x0)
>>>     blk_aio_prwv
>>>     blk_aio_preadv
>>>     ide_buffered_readv
>>>     cd_read_sector
>> Is ide_atapi_cmd_reply_end missing from the call stack here for some
>> reason? ide_data_readw /should/ have end_transfer_func set to that if it
>> was processing an ATAPI command; otherwise it shouldn't be able to get
>> here under normal circumstances...
> this is my fault. I have lost this line while removing unnecessary into
> from callstack. Here if the full one.
> 

Well, at least we're back on terra firma here.

I'm going to talk myself through what the problem looks like here, and I
have some questions for you at the bottom of the email, so you can just
skim if you're not interested in the details.

> (gdb) bt
> #0  bdrv_inc_in_flight (bs=bs@entry=0x0) at block/io.c:561
> #1  0x000055b224160d37 in blk_aio_prwv (blk=0x55b2265405a0,
>     offset=offset@entry=285122560, bytes=2048,
> qiov=qiov@entry=0x55b22a31de20,
>     co_entry=co_entry@entry=0x55b224160400 <blk_aio_read_entry>,
>     flags=flags@entry=0, cb=cb@entry=0x55b224048e00
> <ide_buffered_readv_cb>,
>     opaque=opaque@entry=0x55b22a31de00) at block/block-backend.c:1151
> #2  0x000055b224160db5 in blk_aio_preadv (blk=<optimized out>,
>     offset=offset@entry=285122560, qiov=qiov@entry=0x55b22a31de20,
>     flags=flags@entry=0, cb=cb@entry=0x55b224048e00
> <ide_buffered_readv_cb>,
>     opaque=opaque@entry=0x55b22a31de00) at block/block-backend.c:1256
> #3  0x000055b22404bd8d in ide_buffered_readv (s=s@entry=0x55b22a712a68,
>     sector_num=556880, iov=iov@entry=0x55b22a712d60,
>     nb_sectors=nb_sectors@entry=4,
>     cb=cb@entry=0x55b22404f2a0 <cd_read_sector_cb>,
>     opaque=opaque@entry=0x55b22a712a68) at hw/ide/core.c:637
> #4  0x000055b22404e2c1 in cd_read_sector (s=0x55b22a712a68)
>     at hw/ide/atapi.c:198
> #5  ide_atapi_cmd_reply_end (s=0x55b22a712a68) at hw/ide/atapi.c:272
> #6  0x000055b224049704 in ide_data_readw (opaque=<optimized out>,
>     addr=<optimized out>) at hw/ide/core.c:2263
> #7  0x000055b223edd7f0 in portio_read (opaque=0x55b22a836000, addr=0,
> size=2)
>     at /usr/src/debug/qemu-2.9.0/ioport.c:180
> #8  0x000055b223ee8e3b in memory_region_read_accessor (mr=0x55b22a836000,
>     addr=0, value=0x7f843b5be7c0, size=2, shift=0, mask=65535, attrs=...)
>     at /usr/src/debug/qemu-2.9.0/memory.c:435
> #9  0x000055b223ee6369 in access_with_adjusted_size (addr=addr@entry=0,
>     value=value@entry=0x7f843b5be7c0, size=size@entry=2,
>     access_size_min=<optimized out>, access_size_max=<optimized out>,
>     access=access@entry=0x55b223ee8e10 <memory_region_read_accessor>,
>     mr=mr@entry=0x55b22a836000, attrs=attrs@entry=...)
>     at /usr/src/debug/qemu-2.9.0/memory.c:592
> #10 0x000055b223ee9d36 in memory_region_dispatch_read1 (attrs=..., size=2,
>     pval=0x7f843b5be7c0, addr=0, mr=0x55b22a836000)
>     at /usr/src/debug/qemu-2.9.0/memory.c:1238
> #11 memory_region_dispatch_read (mr=mr@entry=0x55b22a836000,
>     addr=addr@entry=0, pval=pval@entry=0x7f843b5be7c0, size=size@entry=2,
>     attrs=attrs@entry=...) at /usr/src/debug/qemu-2.9.0/memory.c:1269
> #12 0x000055b223e9bdb2 in address_space_read_continue (
>     as=as@entry=0x55b2247db8c0 <address_space_io>, addr=addr@entry=496,
> ---Type <return> to continue, or q <return> to quit---
>     attrs=..., attrs@entry=...,
>     buf=buf@entry=0x7f844dc103fe <Address 0x7f844dc103fe out of bounds>,
>     len=len@entry=2, addr1=0, l=2, mr=0x55b22a836000)
>     at /usr/src/debug/qemu-2.9.0/exec.c:2844
> #13 0x000055b223e9be67 in address_space_read_full (
>     as=0x55b2247db8c0 <address_space_io>, addr=addr@entry=496, attrs=...,
>     buf=buf@entry=0x7f844dc103fe <Address 0x7f844dc103fe out of bounds>,
>     len=2, len@entry=602521191) at /usr/src/debug/qemu-2.9.0/exec.c:2895
> #14 0x000055b223e9bfce in address_space_read (len=602521191,
>     buf=0x7f844dc103fe <Address 0x7f844dc103fe out of bounds>, attrs=...,
>     addr=496, as=<optimized out>)
>     at /usr/src/debug/qemu-2.9.0/include/exec/memory.h:1718
> #15 address_space_rw (as=<optimized out>, addr=addr@entry=496, attrs=...,
>     attrs@entry=...,
>     buf=buf@entry=0x7f844dc103fe <Address 0x7f844dc103fe out of bounds>,
>     len=len@entry=2, is_write=is_write@entry=false)
>     at /usr/src/debug/qemu-2.9.0/exec.c:2909
> #16 0x000055b223ee5271 in kvm_handle_io (count=512, size=2,
>     direction=<optimized out>, data=<optimized out>, attrs=..., port=496)
>     at /usr/src/debug/qemu-2.9.0/kvm-all.c:1828
> #17 kvm_cpu_exec (cpu=cpu@entry=0x55b229032000)
>     at /usr/src/debug/qemu-2.9.0/kvm-all.c:2058
> #18 0x000055b223ed1c22 in qemu_kvm_cpu_thread_fn (arg=0x55b229032000)
>     at /usr/src/debug/qemu-2.9.0/cpus.c:1119
> #19 0x00007f8443993e25 in start_thread (arg=0x7f843b5bf700)
>     at pthread_create.c:308
> #20 0x00007f84436c134d in clone ()
>     at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
> 
> 
> 
>>>     ide_data_readw
>> How do we get here? ide_is_pio_out ought to be false here; do you know
>> what command was being processed? Do you have a reproducer?
>>
>> Knowing both the ATA and ATAPI commands under consideration here would
>> be helpful.
>>
> 
> if you prefer, I can upload core/binary to some storage.
> Here is the state.
> 

This might be helpful for me to poke around at, but unfortunately it
looks like I can't see the ATAPI command that was being processed from
the trace given. I don't think there's a way for me to retrieve it from
a core, either.

> (gdb) p *s
> $3 = {bus = 0x55b22a7129f0, unit = 0 '\000', drive_kind = IDE_CD,
>   cylinders = 0, heads = 0, sectors = 0, chs_trans = 0, nb_sectors =
> 11588488,
>   mult_sectors = 16, identify_set = 1,
>   identify_data = "\300\205", '\000' <repeats 18 times>, "MQ0000 1", ' '
> <repeats 12 times>, "\003\000\000\002\004\000.2+5    EQUMD DVR-MO", ' '
> <repeats 28 times>,
> "\000\000\001\000\000\003\000\000\000\000\000\000\a", '\000' <repeats 17
> times>,
> "\a\000\a\000\003\000\264\000\264\000,\001\264\000\000\000\000\000\036\000\036",
> '\000' <repeats 15 times>, "\036", '\000' <repeats 15 times>, "? ",
> '\000' <repeats 333 times>, drive_serial = 1,
>   drive_serial_str = "QM00001", '\000' <repeats 13 times>,
>   drive_model_str = "QEMU DVD-ROM", '\000' <repeats 28 times>, wwn = 0,
>   feature = 0 '\000', error = 0 '\000', nsector = 2, sector = 0 '\000',
>   lcyl = 0 '\000', hcyl = 8 '\b', hob_feature = 0 '\000',
>   hob_nsector = 3 '\003', hob_sector = 0 '\000', hob_lcyl = 0 '\000',
>   hob_hcyl = 8 '\b', select = 32 ' ', status = 80 'P', lba48 = 0 '\000',

status = BUSY_STAT
nsector = ATAPI_INT_REASON_IO (set in ide_atapi_cmd_reply_end)
byte_count_limit = 2048 (hcyl << 8)

>   blk = 0x55b2265405a0, version = "2.5+\000\000\000\000", events = {
>     eject_request = false, new_media = false}, sense_key = 0 '\000',
>   asc = 0 '\000', tray_open = false, tray_locked = false,
>   cdrom_changed = 0 '\000', packet_transfer_size = 55296,
>   elementary_transfer_size = 0, io_buffer_index = 2048, lba = 139220,

packet_transfer_size = 55296, (27 cd sectors)
elementary_transfer_size = 0,
io_buffer_index = 2048,
lba = 139220, or 0x021fd4


This is either an initial read request, or a rebuffering.
In the normative case, initial reads look like this:

IDE:
ide_ioport_write
  ide_exec_cmd (0xA0)
    cmd_packet
      ide_transfer_start
        s->end_transfer_func = ide_atapi_cmd
        [data_ptr and data_end set to meaningful values]
[Return to guest, await PIO write of scsi cdb]
ide_data_writew
  ide_atapi_cmd
    [soft error if !blk->bs]
    cmd_read_cd
      ide_atapi_cmd_read
        ide_atapi_cmd_read_pio
          ide_atapi_cmd_reply_end
            cd_read_sector
              [segv if !blk->bs]

AHCI:
handle_reg_h2d_fis
  [scsi CDB is copied into s->io_buffer]
  ide_exec_cmd
    cmd_packet
      ide_transfer_start
        [data_ptr and data_end set to transfer scsi cdb]
          ahci_start_transfer
            [shortcut, no transfer]
            s->data_ptr set to s->data_end [*]
              ide_atapi_cmd

* (but data_end likely is still != s->io_buffer)

from here the callback is the same as above: we'll wind up at
cd_read_sector. Neither of these cases match your traceback.

If it was the initial buffering, it would have occurred synchronously
with receipt of the SCSI CDB, but your traces place it as a consequence
of an attempt to read data, so that makes me suspect it's a rebuffering
attempt, and the data in s->io_buffer is... cd data? I guess? did you
have a CD in here at one point?

>   cd_sector_size = 2048, atapi_dma = 0, acct = {bytes = 2048,

So we're using PIO paths when applicable. The s->feature register being
0 seems to agree.

>     start_time_ns = 433417215764666, type = BLOCK_ACCT_READ}, pio_aiocb
> = 0x0,
>   iov = {iov_base = 0x55b22a8ae000, iov_len = 2048}, qiov = {
>     iov = 0x55b22a712d50, niov = 1, nalloc = -1, size = 2048},
>   buffered_requests = {lh_first = 0x0}, io_buffer_offset = 0,
>   io_buffer_size = 0, sg = {sg = 0x0, nsg = 0, nalloc = 0, size = 0,
>     dev = 0x0, as = 0x0}, req_nb_sectors = 0,
>   end_transfer_func = 0x55b22404e190 <ide_atapi_cmd_reply_end>,
>   data_ptr = 0x55b22a8ae800 "", data_end = 0x55b22a8ae800 "",
>   io_buffer = 0x55b22a8ae000

s->data_ptr  = 0x55b22a8ae800
s->data_end  = 0x55b22a8ae800
s->io_buffer = 0x55b22a8ae000
s->io_buffer_offset = 0

Uh, hm. This does not look like an in-progress PIO transfer of any kind...

This says to me that ide_dummy_transfer_stop was called at some point
and nothing has started up in its place.

This seems to conflict with the trace, which looks like a rebuffering.
This looks like... undefined behavior entirely. I am not sure I have a
good guess as to what's happening.

> ")8\"\t\031\"\232\vx.\311l\301\270o$\335\025\257\064F\271\rI,R\342\032\315\nԆV\341od\251\023>\024\370\323\060A\256\337\300텋\024\233\201U\221T^\202\303\036\"E%\262\230\367ξ\fW\302\360\277\347\334\022\253\a\025\216\rj\334z\355>)\230/\021\255%a^\306\001\032",

... hmm

0x29 56 34 09 19

0x29 is... almost a read command (0x28) but it isn't. 0x29 shouldn't map
to anything, actually. So this must be data we've buffered, or something
left over from god knows what.

> 
>   io_buffer_total_len = 131076, cur_io_buffer_offset = 0,
>   cur_io_buffer_len = 0, end_transfer_fn_idx = 0 '\000',
>   sector_write_timer = 0x55b22a45d5c0, irq_count = 0, ext_error = 0 '\000',
>   mdata_size = 0, mdata_storage = 0x0, media_changed = 0,
>   dma_cmd = IDE_DMA_READ, smart_enabled = 1 '\001', smart_autosave = 1
> '\001',
>   smart_errors = 0, smart_selftest_count = 0 '\000',
> ---Type <return> to continue, or q <return> to quit---
>   smart_selftest_data = 0x55b22a827000 "", ncq_queues = 0}
> (gdb)
> 
> In general, I can upload the binary and core to the location
> you want. Send it to me in person. By the way. I am staring to
> receive these crashes in number. 2-3 in a day BUT from one
> specific host. This could be important.
> 
>>>     portio_read
>>>     memory_region_read_accessor
>>>     access_with_adjusted_size
>>>     memory_region_dispatch_read1
>>>     memory_region_dispatch_read
>>>     address_space_read_continue
>>>     address_space_read_full
>>>     address_space_read
>>>     address_space_rw
>>>     kvm_handle_io
>>>     kvm_cpu_exec
>>>     qemu_kvm_cpu_thread_fn
>>>     start_thread
>>>     clone
>> The thing that looks to be incongruent here is that we appear to be
>> servicing a ATAPI reply; that reply is either the kind that uses
>> preformatted data in a buffer, or the kind that buffers a read.
>>
>> If it buffers a read, it should require CHECK_READY which requires a medium.
>>
>> If it does not buffer a read, it should not be able to invoke
>> cd_read_sector or any bdrv function from ide_data_readw.
>>
>> If it neither serves preformatted data nor buffers a read, it should not
>> allow ide_data_readw to perform any action at all.
> the guest could misbehave! This is not a reason to crash.
> For example there is a race or something like this. What
> if the guest will execute read from the device without
> ANY sanity checks? QEMU should not crash. This is the point
> and this is the approach taken in the original patch.
> 

The stuff quoted above isn't participatory, it's mandatory. I don't know
how we're bypassing CHECK_READY. I want to understand what's happening
before I pepper in sanity checks.

in 2.9.0, you have this code:

    if ((cmd->flags & CHECK_READY) &&
        (!media_present(s) || !blk_is_inserted(s->blk)))
    {
        ide_atapi_cmd_error(s, NOT_READY, ASC_MEDIUM_NOT_PRESENT);
        return;
    }

with media_present defined as:

static inline int media_present(IDEState *s)
{
    return !s->tray_open && s->nb_sectors > 0;
}

... however, from your gdb output, we do have:

tray_open = false
nb_sectors = 11588488

but blk_is_inserted is still going to check for the presence of 'bs'
attached to the Block Backend, so this check should still prevent us
from executing a read command out of the gate.

Is the guest executing a read and then are we racing with the removal of
the bds? Do you expect a CD to be inserted on this guest?

nb_sectors I would expect to be 0; based on ide_cd_change_cb:

    blk_get_geometry(s->blk, &nb_sectors);

this should set nb_sectors to 0 if there's no blk->bs.

I'm sort of lost, I can't really easily construct what exactly happened
here.

> 
>>> Indeed, the CDROM device without media has blk->bs == NULL. We should
>>> check that the media is really available for the device like has been done
>>> in SCSI code.
>>>
>>> May be the patch adds a bit more check than necessary, but this is not be
>>> the problem. We should always stay on the safe side.
>>>
>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>> CC: John Snow <jsnow@redhat.com>
>>> CC: Kevin Wolf <kwolf@redhat.com>
>>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>>> ---
>>>  hw/ide/atapi.c | 32 ++++++++++++++++++++++++++++----
>>>  hw/ide/core.c  |  4 ++--
>>>  2 files changed, 30 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/hw/ide/atapi.c b/hw/ide/atapi.c
>>> index c0509c8bf5..fa50c0ccf6 100644
>>> --- a/hw/ide/atapi.c
>>> +++ b/hw/ide/atapi.c
>>> @@ -119,6 +119,11 @@ cd_read_sector_sync(IDEState *s)
>>>  
>>>      trace_cd_read_sector_sync(s->lba);
>>>  
>>> +    if (!blk_is_available(s->blk)) {
>>> +        ret = -ENOMEDIUM;
>>> +        goto fail;
>>> +    }
>>> +
>>>      switch (s->cd_sector_size) {
>>>      case 2048:
>>>          ret = blk_pread(s->blk, (int64_t)s->lba << ATAPI_SECTOR_BITS,
>>> @@ -132,8 +137,8 @@ cd_read_sector_sync(IDEState *s)
>>>          }
>>>          break;
>>>      default:
>>> -        block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_READ);
>>> -        return -EIO;
>>> +        ret = -EIO;
>>> +        goto fail;
>>>      }
>>>  
>>>      if (ret < 0) {
>>> @@ -145,6 +150,10 @@ cd_read_sector_sync(IDEState *s)
>>>      }
>>>  
>>>      return ret;
>>> +
>>> +fail:
>>> +    block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_READ);
>>> +    return ret;
>>>  }
>>>  
>>>  static void cd_read_sector_cb(void *opaque, int ret)
>>> @@ -174,9 +183,15 @@ static void cd_read_sector_cb(void *opaque, int ret)
>>>  
>>>  static int cd_read_sector(IDEState *s)
>>>  {
>>> +    int err;
>>> +
>>>      if (s->cd_sector_size != 2048 && s->cd_sector_size != 2352) {
>>> -        block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_READ);
>>> -        return -EINVAL;
>>> +        err = -EINVAL;
>>> +        goto fail;
>>> +    }
>>> +    if (!blk_is_available(s->blk)) {
>>> +        err = -ENOMEDIUM;
>>> +        goto fail;
>>>      }
>>>  
>>>      s->iov.iov_base = (s->cd_sector_size == 2352) ?
>>> @@ -195,6 +210,10 @@ static int cd_read_sector(IDEState *s)
>>>  
>>>      s->status |= BUSY_STAT;
>>>      return 0;
>>> +
>>> +fail:
>>> +    block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_READ);
>>> +    return err;
>>>  }
>>>  
>>>  void ide_atapi_cmd_ok(IDEState *s)
>>> @@ -404,6 +423,11 @@ static void ide_atapi_cmd_read_dma_cb(void *opaque, int ret)
>>>          goto eot;
>>>      }
>>>  
>>> +    if (!blk_is_available(s->blk)) {
>>> +        ide_atapi_cmd_read_dma_cb(s, -ENOMEDIUM);
>>> +        return;
>>> +    }
>>> +
>> I'm not sure this is OK, because it's an error but not setting the sense
>> code or raising an IRQ; it's only calling ide_set_inactive, so this
>> might be a problem. Normally the error code from the data read is
>> handled earlier in the call by ide_handle_rw_error which can set the
>> proper codes.
>>
> this follows the approach used by the error handling in this call.
> I this code is not made worse. We can hit to the same processing
> with the different error upper.
> 
>>>      s->io_buffer_index = 0;
>>>      if (s->cd_sector_size == 2352) {
>>>          n = 1;
>>> diff --git a/hw/ide/core.c b/hw/ide/core.c
>>> index 471d0c928b..71780fc9d1 100644
>>> --- a/hw/ide/core.c
>>> +++ b/hw/ide/core.c
>>> @@ -758,7 +758,7 @@ static void ide_sector_read(IDEState *s)
>>>  
>>>      trace_ide_sector_read(sector_num, n);
>>>  
>>> -    if (!ide_sect_range_ok(s, sector_num, n)) {
>>> +    if (!ide_sect_range_ok(s, sector_num, n) || !blk_is_available(s->blk)) {
>>>          ide_rw_error(s);
>>>          block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_READ);
>>>          return;
>>> @@ -1023,7 +1023,7 @@ static void ide_sector_write(IDEState *s)
>>>  
>>>      trace_ide_sector_write(sector_num, n);
>>>  
>>> -    if (!ide_sect_range_ok(s, sector_num, n)) {
>>> +    if (!ide_sect_range_ok(s, sector_num, n) || !blk_is_available(s->blk)) {
>>>          ide_rw_error(s);
>>>          block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_WRITE);
>>>          return;
>>>
>> Since it's not a new regression (this is being reported against 2.9) I
>> am somewhat hesitant to rush it into 2.11-rc3 without a little more info.
>>
>> That said, here's a list of the ATAPI commands we service that either
>> return or CAN return data, but do not enforce CHECK_READY:
>>
>>     [ 0x03 ] = { cmd_request_sense,                 ALLOW_UA },
>>     [ 0x12 ] = { cmd_inquiry,                       ALLOW_UA },
>>     [ 0x46 ] = { cmd_get_configuration,             ALLOW_UA },
>>     [ 0x4a ] = { cmd_get_event_status_notification, ALLOW_UA },
>>     [ 0x5a ] = { cmd_mode_sense, /* (10) */         0 },
>>     [ 0xbd ] = { cmd_mechanism_status,              0 },
>>
>> These six all invoke ide_atapi_cmd_reply in some way (which allows the
>> guest to begin reading the reply via PIO mechanisms if DMA is not set):
>>
>> cmd_request_sense:     ide_atapi_cmd_reply(s, 18, max_len);
>> cmd_inquiry:           ide_atapi_cmd_reply(s, idx, max_len);
>> cmd_get_configuration: ide_atapi_cmd_reply(s, len, max_len);
>> cmd_get_event_status_notification:
>>                        ide_atapi_cmd_reply(s, used_len, max_len);
>> cmd_mode_sense:        ide_atapi_cmd_reply(s, 16, max_len);
>>                        ide_atapi_cmd_reply(s, 24, max_len);
>>                        ide_atapi_cmd_reply(s, 30, max_len);
>> cmd_mechanism_status:  ide_atapi_cmd_reply(s, 8, max_len);
>>
>> ide_atapi_cmd_reply itself sets lba to -1, which should inform
>> ide_atapi_cmd_reply_end not to attempt to buffer any new data. These
>> *should* be safe. The reply sizes are also all small enough that they
>> are almost certainly getting buffered in one single transfer without
>> attempting to buffer more data.
>>
>> In the normative PIO case then, reads will consume from this buffer
>> until empty, then we'll call ide_atapi_cmd_reply_end through
>> end_transfer_func and hit the end of transfer logic.
>>
>> I'm not sure I see how this crash is happening; it doesn't look like
>> this path allows for invoking the block read functions and everything
>> else is guarded by NONDATA or CHECK_READY.
>>
>> Unless this is an interesting interaction with ide_atapi_cmd_reply
>> setting the DRQ bit which may trick the pio read function to allow PIO
>> reads to come in during this time?
>>
>> Hypothetically:
>>
>> cmd_packet sets end_transfer_func to ide_atapi_cmd so it can process
>> further once that PIO is completed. (The PIO may be emulated
>> synchronously, in the case of AHCI.) In the true PIO case, it may be
>> asynchronous.
>>
>> Now, in the AHCI case, the guest has control back and the CDROM is now
>> executing a command to, perhaps, read data via DMA. Then,
>> simultaneously, the guest issues a PIO read and because the DRQ bit is
>> set for DMA, the PIO read also goes through.
>>
>> This still requires a media, though... and a very broken guest doing
>> something naughty on purpose.
>>
>> I can't quite connect the dots to see how this crash is possible in
>> general... it'd help to have:
>>
>> - The ATAPI command being processed at the time
>> - The IDE command being processed at the time
>> - s->status
>>
>> as a minimum, and then perhaps optionally some of the ATAPI loop
>> variables, like packet_transfer_size, elementary_transfer_size, and
>> io_buffer_index. Or a reproducer!
>>
>> Sorry I'm not more help and I wrote too much again :(
> imho this does not matter. Once again - the guest can misbehave.

I never said that if a guest acting out of spec was the problem I
wouldn't fix it. I just want to find the root cause.

If this has been broken since 2.9, 2.11-rc3 is too late for a bandaid
applied to something I can't diagnose. Let's discuss this for 2.12 and I
will keep trying to figure out what the root cause is.


Some questions for you:

(1) Is the guest Linux? Do we know why this one machine might be
tripping up QEMU? (Is it running a fuzzer, a weird OS, etc...?)
(2) Does the VM actually have a CDROM inserted at some point? Is it
possible we're racing on some kind of eject or graph manipulation failure?
(3) Is this using AHCI or IDE?


If I can't figure it out within a week or so from here I'll just check
in the band-aid with some /* FIXME */ comments attached.

--js

Denis V. Lunev Dec. 14, 2017, 11:29 a.m. UTC | #6

On 12/13/2017 02:40 AM, John Snow wrote:
> Hi Den, the long email about IDE stuff:
>
> On 11/30/2017 07:01 AM, Denis V. Lunev wrote:
>> On 11/29/2017 02:50 AM, John Snow wrote:
>>> On 11/28/2017 07:10 AM, Denis V. Lunev wrote:
>>>> There is the following crash reported from the field in QEMU 2.9:
>>>>     bdrv_inc_in_flight (bs=bs@entry=0x0)
>>>>     blk_aio_prwv
>>>>     blk_aio_preadv
>>>>     ide_buffered_readv
>>>>     cd_read_sector
>>> Is ide_atapi_cmd_reply_end missing from the call stack here for some
>>> reason? ide_data_readw /should/ have end_transfer_func set to that if it
>>> was processing an ATAPI command; otherwise it shouldn't be able to get
>>> here under normal circumstances...
>> this is my fault. I have lost this line while removing unnecessary into
>> from callstack. Here if the full one.
>>
> Well, at least we're back on terra firma here.
>
> I'm going to talk myself through what the problem looks like here, and I
> have some questions for you at the bottom of the email, so you can just
> skim if you're not interested in the details.
>
>> (gdb) bt
>> #0  bdrv_inc_in_flight (bs=bs@entry=0x0) at block/io.c:561
>> #1  0x000055b224160d37 in blk_aio_prwv (blk=0x55b2265405a0,
>>     offset=offset@entry=285122560, bytes=2048,
>> qiov=qiov@entry=0x55b22a31de20,
>>     co_entry=co_entry@entry=0x55b224160400 <blk_aio_read_entry>,
>>     flags=flags@entry=0, cb=cb@entry=0x55b224048e00
>> <ide_buffered_readv_cb>,
>>     opaque=opaque@entry=0x55b22a31de00) at block/block-backend.c:1151
>> #2  0x000055b224160db5 in blk_aio_preadv (blk=<optimized out>,
>>     offset=offset@entry=285122560, qiov=qiov@entry=0x55b22a31de20,
>>     flags=flags@entry=0, cb=cb@entry=0x55b224048e00
>> <ide_buffered_readv_cb>,
>>     opaque=opaque@entry=0x55b22a31de00) at block/block-backend.c:1256
>> #3  0x000055b22404bd8d in ide_buffered_readv (s=s@entry=0x55b22a712a68,
>>     sector_num=556880, iov=iov@entry=0x55b22a712d60,
>>     nb_sectors=nb_sectors@entry=4,
>>     cb=cb@entry=0x55b22404f2a0 <cd_read_sector_cb>,
>>     opaque=opaque@entry=0x55b22a712a68) at hw/ide/core.c:637
>> #4  0x000055b22404e2c1 in cd_read_sector (s=0x55b22a712a68)
>>     at hw/ide/atapi.c:198
>> #5  ide_atapi_cmd_reply_end (s=0x55b22a712a68) at hw/ide/atapi.c:272
>> #6  0x000055b224049704 in ide_data_readw (opaque=<optimized out>,
>>     addr=<optimized out>) at hw/ide/core.c:2263
>> #7  0x000055b223edd7f0 in portio_read (opaque=0x55b22a836000, addr=0,
>> size=2)
>>     at /usr/src/debug/qemu-2.9.0/ioport.c:180
>> #8  0x000055b223ee8e3b in memory_region_read_accessor (mr=0x55b22a836000,
>>     addr=0, value=0x7f843b5be7c0, size=2, shift=0, mask=65535, attrs=...)
>>     at /usr/src/debug/qemu-2.9.0/memory.c:435
>> #9  0x000055b223ee6369 in access_with_adjusted_size (addr=addr@entry=0,
>>     value=value@entry=0x7f843b5be7c0, size=size@entry=2,
>>     access_size_min=<optimized out>, access_size_max=<optimized out>,
>>     access=access@entry=0x55b223ee8e10 <memory_region_read_accessor>,
>>     mr=mr@entry=0x55b22a836000, attrs=attrs@entry=...)
>>     at /usr/src/debug/qemu-2.9.0/memory.c:592
>> #10 0x000055b223ee9d36 in memory_region_dispatch_read1 (attrs=..., size=2,
>>     pval=0x7f843b5be7c0, addr=0, mr=0x55b22a836000)
>>     at /usr/src/debug/qemu-2.9.0/memory.c:1238
>> #11 memory_region_dispatch_read (mr=mr@entry=0x55b22a836000,
>>     addr=addr@entry=0, pval=pval@entry=0x7f843b5be7c0, size=size@entry=2,
>>     attrs=attrs@entry=...) at /usr/src/debug/qemu-2.9.0/memory.c:1269
>> #12 0x000055b223e9bdb2 in address_space_read_continue (
>>     as=as@entry=0x55b2247db8c0 <address_space_io>, addr=addr@entry=496,
>> ---Type <return> to continue, or q <return> to quit---
>>     attrs=..., attrs@entry=...,
>>     buf=buf@entry=0x7f844dc103fe <Address 0x7f844dc103fe out of bounds>,
>>     len=len@entry=2, addr1=0, l=2, mr=0x55b22a836000)
>>     at /usr/src/debug/qemu-2.9.0/exec.c:2844
>> #13 0x000055b223e9be67 in address_space_read_full (
>>     as=0x55b2247db8c0 <address_space_io>, addr=addr@entry=496, attrs=...,
>>     buf=buf@entry=0x7f844dc103fe <Address 0x7f844dc103fe out of bounds>,
>>     len=2, len@entry=602521191) at /usr/src/debug/qemu-2.9.0/exec.c:2895
>> #14 0x000055b223e9bfce in address_space_read (len=602521191,
>>     buf=0x7f844dc103fe <Address 0x7f844dc103fe out of bounds>, attrs=...,
>>     addr=496, as=<optimized out>)
>>     at /usr/src/debug/qemu-2.9.0/include/exec/memory.h:1718
>> #15 address_space_rw (as=<optimized out>, addr=addr@entry=496, attrs=...,
>>     attrs@entry=...,
>>     buf=buf@entry=0x7f844dc103fe <Address 0x7f844dc103fe out of bounds>,
>>     len=len@entry=2, is_write=is_write@entry=false)
>>     at /usr/src/debug/qemu-2.9.0/exec.c:2909
>> #16 0x000055b223ee5271 in kvm_handle_io (count=512, size=2,
>>     direction=<optimized out>, data=<optimized out>, attrs=..., port=496)
>>     at /usr/src/debug/qemu-2.9.0/kvm-all.c:1828
>> #17 kvm_cpu_exec (cpu=cpu@entry=0x55b229032000)
>>     at /usr/src/debug/qemu-2.9.0/kvm-all.c:2058
>> #18 0x000055b223ed1c22 in qemu_kvm_cpu_thread_fn (arg=0x55b229032000)
>>     at /usr/src/debug/qemu-2.9.0/cpus.c:1119
>> #19 0x00007f8443993e25 in start_thread (arg=0x7f843b5bf700)
>>     at pthread_create.c:308
>> #20 0x00007f84436c134d in clone ()
>>     at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
>>
>>
>>
>>>>     ide_data_readw
>>> How do we get here? ide_is_pio_out ought to be false here; do you know
>>> what command was being processed? Do you have a reproducer?
>>>
>>> Knowing both the ATA and ATAPI commands under consideration here would
>>> be helpful.
>>>
>> if you prefer, I can upload core/binary to some storage.
>> Here is the state.
>>
> This might be helpful for me to poke around at, but unfortunately it
> looks like I can't see the ATAPI command that was being processed from
> the trace given. I don't think there's a way for me to retrieve it from
> a core, either.
>
>> (gdb) p *s
>> $3 = {bus = 0x55b22a7129f0, unit = 0 '\000', drive_kind = IDE_CD,
>>   cylinders = 0, heads = 0, sectors = 0, chs_trans = 0, nb_sectors =
>> 11588488,
>>   mult_sectors = 16, identify_set = 1,
>>   identify_data = "\300\205", '\000' <repeats 18 times>, "MQ0000 1", ' '
>> <repeats 12 times>, "\003\000\000\002\004\000.2+5    EQUMD DVR-MO", ' '
>> <repeats 28 times>,
>> "\000\000\001\000\000\003\000\000\000\000\000\000\a", '\000' <repeats 17
>> times>,
>> "\a\000\a\000\003\000\264\000\264\000,\001\264\000\000\000\000\000\036\000\036",
>> '\000' <repeats 15 times>, "\036", '\000' <repeats 15 times>, "? ",
>> '\000' <repeats 333 times>, drive_serial = 1,
>>   drive_serial_str = "QM00001", '\000' <repeats 13 times>,
>>   drive_model_str = "QEMU DVD-ROM", '\000' <repeats 28 times>, wwn = 0,
>>   feature = 0 '\000', error = 0 '\000', nsector = 2, sector = 0 '\000',
>>   lcyl = 0 '\000', hcyl = 8 '\b', hob_feature = 0 '\000',
>>   hob_nsector = 3 '\003', hob_sector = 0 '\000', hob_lcyl = 0 '\000',
>>   hob_hcyl = 8 '\b', select = 32 ' ', status = 80 'P', lba48 = 0 '\000',
> status = BUSY_STAT
> nsector = ATAPI_INT_REASON_IO (set in ide_atapi_cmd_reply_end)
> byte_count_limit = 2048 (hcyl << 8)
>
>>   blk = 0x55b2265405a0, version = "2.5+\000\000\000\000", events = {
>>     eject_request = false, new_media = false}, sense_key = 0 '\000',
>>   asc = 0 '\000', tray_open = false, tray_locked = false,
>>   cdrom_changed = 0 '\000', packet_transfer_size = 55296,
>>   elementary_transfer_size = 0, io_buffer_index = 2048, lba = 139220,
> packet_transfer_size = 55296, (27 cd sectors)
> elementary_transfer_size = 0,
> io_buffer_index = 2048,
> lba = 139220, or 0x021fd4
>
>
> This is either an initial read request, or a rebuffering.
> In the normative case, initial reads look like this:
>
> IDE:
> ide_ioport_write
>   ide_exec_cmd (0xA0)
>     cmd_packet
>       ide_transfer_start
>         s->end_transfer_func = ide_atapi_cmd
>         [data_ptr and data_end set to meaningful values]
> [Return to guest, await PIO write of scsi cdb]
> ide_data_writew
>   ide_atapi_cmd
>     [soft error if !blk->bs]
>     cmd_read_cd
>       ide_atapi_cmd_read
>         ide_atapi_cmd_read_pio
>           ide_atapi_cmd_reply_end
>             cd_read_sector
>               [segv if !blk->bs]
>
> AHCI:
> handle_reg_h2d_fis
>   [scsi CDB is copied into s->io_buffer]
>   ide_exec_cmd
>     cmd_packet
>       ide_transfer_start
>         [data_ptr and data_end set to transfer scsi cdb]
>           ahci_start_transfer
>             [shortcut, no transfer]
>             s->data_ptr set to s->data_end [*]
>               ide_atapi_cmd
>
> * (but data_end likely is still != s->io_buffer)
>
> from here the callback is the same as above: we'll wind up at
> cd_read_sector. Neither of these cases match your traceback.
>
> If it was the initial buffering, it would have occurred synchronously
> with receipt of the SCSI CDB, but your traces place it as a consequence
> of an attempt to read data, so that makes me suspect it's a rebuffering
> attempt, and the data in s->io_buffer is... cd data? I guess? did you
> have a CD in here at one point?
>
>>   cd_sector_size = 2048, atapi_dma = 0, acct = {bytes = 2048,
> So we're using PIO paths when applicable. The s->feature register being
> 0 seems to agree.
>
>>     start_time_ns = 433417215764666, type = BLOCK_ACCT_READ}, pio_aiocb
>> = 0x0,
>>   iov = {iov_base = 0x55b22a8ae000, iov_len = 2048}, qiov = {
>>     iov = 0x55b22a712d50, niov = 1, nalloc = -1, size = 2048},
>>   buffered_requests = {lh_first = 0x0}, io_buffer_offset = 0,
>>   io_buffer_size = 0, sg = {sg = 0x0, nsg = 0, nalloc = 0, size = 0,
>>     dev = 0x0, as = 0x0}, req_nb_sectors = 0,
>>   end_transfer_func = 0x55b22404e190 <ide_atapi_cmd_reply_end>,
>>   data_ptr = 0x55b22a8ae800 "", data_end = 0x55b22a8ae800 "",
>>   io_buffer = 0x55b22a8ae000
> s->data_ptr  = 0x55b22a8ae800
> s->data_end  = 0x55b22a8ae800
> s->io_buffer = 0x55b22a8ae000
> s->io_buffer_offset = 0
>
> Uh, hm. This does not look like an in-progress PIO transfer of any kind...
>
> This says to me that ide_dummy_transfer_stop was called at some point
> and nothing has started up in its place.
>
> This seems to conflict with the trace, which looks like a rebuffering.
> This looks like... undefined behavior entirely. I am not sure I have a
> good guess as to what's happening.
>
>> ")8\"\t\031\"\232\vx.\311l\301\270o$\335\025\257\064F\271\rI,R\342\032\315\nԆV\341od\251\023>\024\370\323\060A\256\337\300텋\024\233\201U\221T^\202\303\036\"E%\262\230\367ξ\fW\302\360\277\347\334\022\253\a\025\216\rj\334z\355>)\230/\021\255%a^\306\001\032",
> ... hmm
>
> 0x29 56 34 09 19
>
> 0x29 is... almost a read command (0x28) but it isn't. 0x29 shouldn't map
> to anything, actually. So this must be data we've buffered, or something
> left over from god knows what.
>
>>   io_buffer_total_len = 131076, cur_io_buffer_offset = 0,
>>   cur_io_buffer_len = 0, end_transfer_fn_idx = 0 '\000',
>>   sector_write_timer = 0x55b22a45d5c0, irq_count = 0, ext_error = 0 '\000',
>>   mdata_size = 0, mdata_storage = 0x0, media_changed = 0,
>>   dma_cmd = IDE_DMA_READ, smart_enabled = 1 '\001', smart_autosave = 1
>> '\001',
>>   smart_errors = 0, smart_selftest_count = 0 '\000',
>> ---Type <return> to continue, or q <return> to quit---
>>   smart_selftest_data = 0x55b22a827000 "", ncq_queues = 0}
>> (gdb)
>>
>> In general, I can upload the binary and core to the location
>> you want. Send it to me in person. By the way. I am staring to
>> receive these crashes in number. 2-3 in a day BUT from one
>> specific host. This could be important.
>>
>>>>     portio_read
>>>>     memory_region_read_accessor
>>>>     access_with_adjusted_size
>>>>     memory_region_dispatch_read1
>>>>     memory_region_dispatch_read
>>>>     address_space_read_continue
>>>>     address_space_read_full
>>>>     address_space_read
>>>>     address_space_rw
>>>>     kvm_handle_io
>>>>     kvm_cpu_exec
>>>>     qemu_kvm_cpu_thread_fn
>>>>     start_thread
>>>>     clone
>>> The thing that looks to be incongruent here is that we appear to be
>>> servicing a ATAPI reply; that reply is either the kind that uses
>>> preformatted data in a buffer, or the kind that buffers a read.
>>>
>>> If it buffers a read, it should require CHECK_READY which requires a medium.
>>>
>>> If it does not buffer a read, it should not be able to invoke
>>> cd_read_sector or any bdrv function from ide_data_readw.
>>>
>>> If it neither serves preformatted data nor buffers a read, it should not
>>> allow ide_data_readw to perform any action at all.
>> the guest could misbehave! This is not a reason to crash.
>> For example there is a race or something like this. What
>> if the guest will execute read from the device without
>> ANY sanity checks? QEMU should not crash. This is the point
>> and this is the approach taken in the original patch.
>>
> The stuff quoted above isn't participatory, it's mandatory. I don't know
> how we're bypassing CHECK_READY. I want to understand what's happening
> before I pepper in sanity checks.
>
> in 2.9.0, you have this code:
>
>     if ((cmd->flags & CHECK_READY) &&
>         (!media_present(s) || !blk_is_inserted(s->blk)))
>     {
>         ide_atapi_cmd_error(s, NOT_READY, ASC_MEDIUM_NOT_PRESENT);
>         return;
>     }
>
> with media_present defined as:
>
> static inline int media_present(IDEState *s)
> {
>     return !s->tray_open && s->nb_sectors > 0;
> }
>
> ... however, from your gdb output, we do have:
>
> tray_open = false
> nb_sectors = 11588488
>
> but blk_is_inserted is still going to check for the presence of 'bs'
> attached to the Block Backend, so this check should still prevent us
> from executing a read command out of the gate.
>
> Is the guest executing a read and then are we racing with the removal of
> the bds? Do you expect a CD to be inserted on this guest?
>
> nb_sectors I would expect to be 0; based on ide_cd_change_cb:
>
>     blk_get_geometry(s->blk, &nb_sectors);
>
> this should set nb_sectors to 0 if there's no blk->bs.
>
> I'm sort of lost, I can't really easily construct what exactly happened
> here.
>
>>>> Indeed, the CDROM device without media has blk->bs == NULL. We should
>>>> check that the media is really available for the device like has been done
>>>> in SCSI code.
>>>>
>>>> May be the patch adds a bit more check than necessary, but this is not be
>>>> the problem. We should always stay on the safe side.
>>>>
>>>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>>>> CC: John Snow <jsnow@redhat.com>
>>>> CC: Kevin Wolf <kwolf@redhat.com>
>>>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>>>> ---
>>>>  hw/ide/atapi.c | 32 ++++++++++++++++++++++++++++----
>>>>  hw/ide/core.c  |  4 ++--
>>>>  2 files changed, 30 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/hw/ide/atapi.c b/hw/ide/atapi.c
>>>> index c0509c8bf5..fa50c0ccf6 100644
>>>> --- a/hw/ide/atapi.c
>>>> +++ b/hw/ide/atapi.c
>>>> @@ -119,6 +119,11 @@ cd_read_sector_sync(IDEState *s)
>>>>  
>>>>      trace_cd_read_sector_sync(s->lba);
>>>>  
>>>> +    if (!blk_is_available(s->blk)) {
>>>> +        ret = -ENOMEDIUM;
>>>> +        goto fail;
>>>> +    }
>>>> +
>>>>      switch (s->cd_sector_size) {
>>>>      case 2048:
>>>>          ret = blk_pread(s->blk, (int64_t)s->lba << ATAPI_SECTOR_BITS,
>>>> @@ -132,8 +137,8 @@ cd_read_sector_sync(IDEState *s)
>>>>          }
>>>>          break;
>>>>      default:
>>>> -        block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_READ);
>>>> -        return -EIO;
>>>> +        ret = -EIO;
>>>> +        goto fail;
>>>>      }
>>>>  
>>>>      if (ret < 0) {
>>>> @@ -145,6 +150,10 @@ cd_read_sector_sync(IDEState *s)
>>>>      }
>>>>  
>>>>      return ret;
>>>> +
>>>> +fail:
>>>> +    block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_READ);
>>>> +    return ret;
>>>>  }
>>>>  
>>>>  static void cd_read_sector_cb(void *opaque, int ret)
>>>> @@ -174,9 +183,15 @@ static void cd_read_sector_cb(void *opaque, int ret)
>>>>  
>>>>  static int cd_read_sector(IDEState *s)
>>>>  {
>>>> +    int err;
>>>> +
>>>>      if (s->cd_sector_size != 2048 && s->cd_sector_size != 2352) {
>>>> -        block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_READ);
>>>> -        return -EINVAL;
>>>> +        err = -EINVAL;
>>>> +        goto fail;
>>>> +    }
>>>> +    if (!blk_is_available(s->blk)) {
>>>> +        err = -ENOMEDIUM;
>>>> +        goto fail;
>>>>      }
>>>>  
>>>>      s->iov.iov_base = (s->cd_sector_size == 2352) ?
>>>> @@ -195,6 +210,10 @@ static int cd_read_sector(IDEState *s)
>>>>  
>>>>      s->status |= BUSY_STAT;
>>>>      return 0;
>>>> +
>>>> +fail:
>>>> +    block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_READ);
>>>> +    return err;
>>>>  }
>>>>  
>>>>  void ide_atapi_cmd_ok(IDEState *s)
>>>> @@ -404,6 +423,11 @@ static void ide_atapi_cmd_read_dma_cb(void *opaque, int ret)
>>>>          goto eot;
>>>>      }
>>>>  
>>>> +    if (!blk_is_available(s->blk)) {
>>>> +        ide_atapi_cmd_read_dma_cb(s, -ENOMEDIUM);
>>>> +        return;
>>>> +    }
>>>> +
>>> I'm not sure this is OK, because it's an error but not setting the sense
>>> code or raising an IRQ; it's only calling ide_set_inactive, so this
>>> might be a problem. Normally the error code from the data read is
>>> handled earlier in the call by ide_handle_rw_error which can set the
>>> proper codes.
>>>
>> this follows the approach used by the error handling in this call.
>> I this code is not made worse. We can hit to the same processing
>> with the different error upper.
>>
>>>>      s->io_buffer_index = 0;
>>>>      if (s->cd_sector_size == 2352) {
>>>>          n = 1;
>>>> diff --git a/hw/ide/core.c b/hw/ide/core.c
>>>> index 471d0c928b..71780fc9d1 100644
>>>> --- a/hw/ide/core.c
>>>> +++ b/hw/ide/core.c
>>>> @@ -758,7 +758,7 @@ static void ide_sector_read(IDEState *s)
>>>>  
>>>>      trace_ide_sector_read(sector_num, n);
>>>>  
>>>> -    if (!ide_sect_range_ok(s, sector_num, n)) {
>>>> +    if (!ide_sect_range_ok(s, sector_num, n) || !blk_is_available(s->blk)) {
>>>>          ide_rw_error(s);
>>>>          block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_READ);
>>>>          return;
>>>> @@ -1023,7 +1023,7 @@ static void ide_sector_write(IDEState *s)
>>>>  
>>>>      trace_ide_sector_write(sector_num, n);
>>>>  
>>>> -    if (!ide_sect_range_ok(s, sector_num, n)) {
>>>> +    if (!ide_sect_range_ok(s, sector_num, n) || !blk_is_available(s->blk)) {
>>>>          ide_rw_error(s);
>>>>          block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_WRITE);
>>>>          return;
>>>>
>>> Since it's not a new regression (this is being reported against 2.9) I
>>> am somewhat hesitant to rush it into 2.11-rc3 without a little more info.
>>>
>>> That said, here's a list of the ATAPI commands we service that either
>>> return or CAN return data, but do not enforce CHECK_READY:
>>>
>>>     [ 0x03 ] = { cmd_request_sense,                 ALLOW_UA },
>>>     [ 0x12 ] = { cmd_inquiry,                       ALLOW_UA },
>>>     [ 0x46 ] = { cmd_get_configuration,             ALLOW_UA },
>>>     [ 0x4a ] = { cmd_get_event_status_notification, ALLOW_UA },
>>>     [ 0x5a ] = { cmd_mode_sense, /* (10) */         0 },
>>>     [ 0xbd ] = { cmd_mechanism_status,              0 },
>>>
>>> These six all invoke ide_atapi_cmd_reply in some way (which allows the
>>> guest to begin reading the reply via PIO mechanisms if DMA is not set):
>>>
>>> cmd_request_sense:     ide_atapi_cmd_reply(s, 18, max_len);
>>> cmd_inquiry:           ide_atapi_cmd_reply(s, idx, max_len);
>>> cmd_get_configuration: ide_atapi_cmd_reply(s, len, max_len);
>>> cmd_get_event_status_notification:
>>>                        ide_atapi_cmd_reply(s, used_len, max_len);
>>> cmd_mode_sense:        ide_atapi_cmd_reply(s, 16, max_len);
>>>                        ide_atapi_cmd_reply(s, 24, max_len);
>>>                        ide_atapi_cmd_reply(s, 30, max_len);
>>> cmd_mechanism_status:  ide_atapi_cmd_reply(s, 8, max_len);
>>>
>>> ide_atapi_cmd_reply itself sets lba to -1, which should inform
>>> ide_atapi_cmd_reply_end not to attempt to buffer any new data. These
>>> *should* be safe. The reply sizes are also all small enough that they
>>> are almost certainly getting buffered in one single transfer without
>>> attempting to buffer more data.
>>>
>>> In the normative PIO case then, reads will consume from this buffer
>>> until empty, then we'll call ide_atapi_cmd_reply_end through
>>> end_transfer_func and hit the end of transfer logic.
>>>
>>> I'm not sure I see how this crash is happening; it doesn't look like
>>> this path allows for invoking the block read functions and everything
>>> else is guarded by NONDATA or CHECK_READY.
>>>
>>> Unless this is an interesting interaction with ide_atapi_cmd_reply
>>> setting the DRQ bit which may trick the pio read function to allow PIO
>>> reads to come in during this time?
>>>
>>> Hypothetically:
>>>
>>> cmd_packet sets end_transfer_func to ide_atapi_cmd so it can process
>>> further once that PIO is completed. (The PIO may be emulated
>>> synchronously, in the case of AHCI.) In the true PIO case, it may be
>>> asynchronous.
>>>
>>> Now, in the AHCI case, the guest has control back and the CDROM is now
>>> executing a command to, perhaps, read data via DMA. Then,
>>> simultaneously, the guest issues a PIO read and because the DRQ bit is
>>> set for DMA, the PIO read also goes through.
>>>
>>> This still requires a media, though... and a very broken guest doing
>>> something naughty on purpose.
>>>
>>> I can't quite connect the dots to see how this crash is possible in
>>> general... it'd help to have:
>>>
>>> - The ATAPI command being processed at the time
>>> - The IDE command being processed at the time
>>> - s->status
>>>
>>> as a minimum, and then perhaps optionally some of the ATAPI loop
>>> variables, like packet_transfer_size, elementary_transfer_size, and
>>> io_buffer_index. Or a reproducer!
>>>
>>> Sorry I'm not more help and I wrote too much again :(
>> imho this does not matter. Once again - the guest can misbehave.
> I never said that if a guest acting out of spec was the problem I
> wouldn't fix it. I just want to find the root cause.
>
> If this has been broken since 2.9, 2.11-rc3 is too late for a bandaid
> applied to something I can't diagnose. Let's discuss this for 2.12 and I
> will keep trying to figure out what the root cause is.
I have read the entire letter in 2 subsequent attempts, but
unfortunately I can not say much more additionally :(

>
> Some questions for you:
>
> (1) Is the guest Linux? Do we know why this one machine might be
> tripping up QEMU? (Is it running a fuzzer, a weird OS, etc...?)
This is running by the end-user by our customer and we do not have
access to that machine and customer. This is anonymized crash report
from the node. This is not a single crash. We observe 1-2 reports with
this crash in a day.

> (2) Does the VM actually have a CDROM inserted at some point? Is it
> possible we're racing on some kind of eject or graph manipulation failure?
unclear but IMHO probable.

> (3) Is this using AHCI or IDE?
IDE. This is known 120%. We do not provide ability to enable AHCI
without manual tweaking.

>
> If I can't figure it out within a week or so from here I'll just check
> in the band-aid with some /* FIXME */ comments attached.
No prob. We are going to ship my band-aid and see to report statistics.

Thank you in advance,
    Den

John Snow Dec. 18, 2017, 5:49 p.m. UTC | #7

On 12/14/2017 06:29 AM, Denis V. Lunev wrote:
>> If this has been broken since 2.9, 2.11-rc3 is too late for a bandaid
>> applied to something I can't diagnose. Let's discuss this for 2.12 and I
>> will keep trying to figure out what the root cause is.
> I have read the entire letter in 2 subsequent attempts, but
> unfortunately I can not say much more additionally :(
> 

No problem, sometimes I don't understand myself. And the IDE code isn't
exactly the nicest stuff to read. If I was smart enough I'd refactor the
whole thing, but without breaking migration it's a little hard :(

>> Some questions for you:
>>
>> (1) Is the guest Linux? Do we know why this one machine might be
>> tripping up QEMU? (Is it running a fuzzer, a weird OS, etc...?)
> This is running by the end-user by our customer and we do not have
> access to that machine and customer. This is anonymized crash report
> from the node. This is not a single crash. We observe 1-2 reports with
> this crash in a day.
> 

Yikes. Is this still on a 2.9-based VM, or have you upgraded to 2.10 or
2.11 at this point?

(From memory this was a problem with a 2.9 based machine)

>> (2) Does the VM actually have a CDROM inserted at some point? Is it
>> possible we're racing on some kind of eject or graph manipulation failure?
> unclear but IMHO probable.
> 

If they're using a 2.10+ based VM, could you look at some trace points?

either:
trace_ide_atapi_cmd (just scsi byte 0), or
trace_ide_atapi_cmd_packet (the entire scsi cdb)

and

trace_ide_exec_cmd

the actual command bytes never get saved in the state struct, so it's
hard to tell from traces what commands were being processed, but these
traces help.

>> (3) Is this using AHCI or IDE?
> IDE. This is known 120%. We do not provide ability to enable AHCI
> without manual tweaking.
> 

At least that helps narrow down the path...

>> If I can't figure it out within a week or so from here I'll just check
>> in the band-aid with some /* FIXME */ comments attached.
> No prob. We are going to ship my band-aid and see to report statistics.
> 
> Thank you in advance,
>     Den

I'll stage the band-aid with some FIXME comments, and maybe some scary
error_report prints with some information in them. I'll send it to the list.

Denis V. Lunev Jan. 24, 2018, 10:25 a.m. UTC | #8

sorry, I have caught this latter tooo late.

On 12/18/2017 08:49 PM, John Snow wrote:
>
> On 12/14/2017 06:29 AM, Denis V. Lunev wrote:
>>> If this has been broken since 2.9, 2.11-rc3 is too late for a bandaid
>>> applied to something I can't diagnose. Let's discuss this for 2.12 and I
>>> will keep trying to figure out what the root cause is.
>> I have read the entire letter in 2 subsequent attempts, but
>> unfortunately I can not say much more additionally :(
>>
> No problem, sometimes I don't understand myself. And the IDE code isn't
> exactly the nicest stuff to read. If I was smart enough I'd refactor the
> whole thing, but without breaking migration it's a little hard :(
>
>>> Some questions for you:
>>>
>>> (1) Is the guest Linux? Do we know why this one machine might be
>>> tripping up QEMU? (Is it running a fuzzer, a weird OS, etc...?)
>> This is running by the end-user by our customer and we do not have
>> access to that machine and customer. This is anonymized crash report
>> from the node. This is not a single crash. We observe 1-2 reports with
>> this crash in a day.
>>
> Yikes. Is this still on a 2.9-based VM, or have you upgraded to 2.10 or
> 2.11 at this point?
>
> (From memory this was a problem with a 2.9 based machine)
the problem is with 2.9

>>> (2) Does the VM actually have a CDROM inserted at some point? Is it
>>> possible we're racing on some kind of eject or graph manipulation failure?
>> unclear but IMHO probable.
>>
> If they're using a 2.10+ based VM, could you look at some trace points?
>
> either:
> trace_ide_atapi_cmd (just scsi byte 0), or
> trace_ide_atapi_cmd_packet (the entire scsi cdb)
>
> and
>
> trace_ide_exec_cmd
>
> the actual command bytes never get saved in the state struct, so it's
> hard to tell from traces what commands were being processed, but these
> traces help.
unfortunately I do not have access to the crashing node :(
that is the problem.


>>> (3) Is this using AHCI or IDE?
>> IDE. This is known 120%. We do not provide ability to enable AHCI
>> without manual tweaking.
>>
> At least that helps narrow down the path...
>
>>> If I can't figure it out within a week or so from here I'll just check
>>> in the band-aid with some /* FIXME */ comments attached.
>> No prob. We are going to ship my band-aid and see to report statistics.
>>
>> Thank you in advance,
>>     Den
> I'll stage the band-aid with some FIXME comments, and maybe some scary
> error_report prints with some information in them. I'll send it to the list.
I do not see them merged. Have you?

For now I have merged my patch downstream. I do not see that it could be
wrong. The release is scheduled late this spring and if crashes
will stop to happen - I'll let you know.

Den

[2/2] ide: fix crash in IDE cdrom read

Commit Message

Comments

Patch