
block: Clear kernel memory before copying to user

Message ID: 20181107143745.11845-1-keith.busch@intel.com
State: New, archived
Series: block: Clear kernel memory before copying to user

Commit Message

Keith Busch Nov. 7, 2018, 2:37 p.m. UTC
If the kernel allocates a bounce buffer for user read data, this memory
needs to be cleared before copying it to the user, otherwise it may leak
kernel memory to user space.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 block/bio.c | 1 +
 1 file changed, 1 insertion(+)
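
For context, a rough model of the bounce-buffer read path the patch touches
(illustrative only, not the actual block-layer code; do_device_read() is a
made-up placeholder): the bounce pages come straight from the allocator, the
device may fill fewer bytes than requested, and completion copies the whole
buffer back to user space, so any untouched tail is stale kernel memory.

/* Sketch only; it models why bio_copy_user_iov() needs the clear. */
static int bounce_read_sketch(void __user *ubuf, size_t len)
{
	void *kbuf = kmalloc(len, GFP_KERNEL);	/* not zeroed */
	size_t transferred;

	if (!kbuf)
		return -ENOMEM;

	/* what zero_fill_bio(bio) achieves, done before the device runs: */
	/* memset(kbuf, 0, len); */

	transferred = do_device_read(kbuf, len);	/* may be < len */
	if (transferred < len)
		pr_debug("short read, tail of the buffer was never written\n");

	/* the whole buffer is copied back, including any stale tail */
	if (copy_to_user(ubuf, kbuf, len)) {
		kfree(kbuf);
		return -EFAULT;
	}

	kfree(kbuf);
	return 0;
}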

Comments

Laurence Oberman Nov. 7, 2018, 2:46 p.m. UTC | #1
On Wed, 2018-11-07 at 07:37 -0700, Keith Busch wrote:
> If the kernel allocates a bounce buffer for user read data, this memory
> needs to be cleared before copying it to the user, otherwise it may leak
> kernel memory to user space.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  block/bio.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/block/bio.c b/block/bio.c
> index d5368a445561..a50d59236b19 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1260,6 +1260,7 @@ struct bio *bio_copy_user_iov(struct request_queue *q,
>  		if (ret)
>  			goto cleanup;
>  	} else {
> +		zero_fill_bio(bio);
>  		iov_iter_advance(iter, bio->bi_iter.bi_size);
>  	}
>  

Straightforward, looks good from my view.

Reviewed-by: Laurence Oberman <loberman@redhat.com>
Ming Lei Nov. 7, 2018, 3:09 p.m. UTC | #2
On Wed, Nov 7, 2018 at 10:42 PM Keith Busch <keith.busch@intel.com> wrote:
>
> If the kernel allocates a bounce buffer for user read data, this memory
> needs to be cleared before copying it to the user, otherwise it may leak
> kernel memory to user space.
>
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  block/bio.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/block/bio.c b/block/bio.c
> index d5368a445561..a50d59236b19 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1260,6 +1260,7 @@ struct bio *bio_copy_user_iov(struct request_queue *q,
>                 if (ret)
>                         goto cleanup;
>         } else {
> +               zero_fill_bio(bio);
>                 iov_iter_advance(iter, bio->bi_iter.bi_size);
>         }

This way looks inefficient because the zero fill should only be required
for a short READ.

Thanks,
Ming Lei
Keith Busch Nov. 7, 2018, 3:15 p.m. UTC | #3
On Wed, Nov 07, 2018 at 11:09:27PM +0800, Ming Lei wrote:
> On Wed, Nov 7, 2018 at 10:42 PM Keith Busch <keith.busch@intel.com> wrote:
> >
> > If the kernel allocates a bounce buffer for user read data, this memory
> > needs to be cleared before copying it to the user, otherwise it may leak
> > kernel memory to user space.
> >
> > Signed-off-by: Keith Busch <keith.busch@intel.com>
> > ---
> >  block/bio.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/block/bio.c b/block/bio.c
> > index d5368a445561..a50d59236b19 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -1260,6 +1260,7 @@ struct bio *bio_copy_user_iov(struct request_queue *q,
> >                 if (ret)
> >                         goto cleanup;
> >         } else {
> > +               zero_fill_bio(bio);
> >                 iov_iter_advance(iter, bio->bi_iter.bi_size);
> >         }
> 
> This way looks inefficient because the zero fill should only be required
> for a short READ.

Sure, but how do you know that happened before copying the bounce buffer
to user space?

We could zero the pages on allocation if that's better (and doesn't zero
twice if __GFP_ZERO was already provided):

---
diff --git a/block/bio.c b/block/bio.c
index d5368a445561..a1b6383294f4 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1212,6 +1212,9 @@ struct bio *bio_copy_user_iov(struct request_queue *q,
 		nr_pages = 1 << map_data->page_order;
 		i = map_data->offset / PAGE_SIZE;
 	}
+
+	if (iov_iter_rw(iter) == READ)
+		gfp_mask |= __GFP_ZERO;
 	while (len) {
 		unsigned int bytes = PAGE_SIZE;
 
--
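
For reference, the gfp_mask set up above feeds the bounce-page allocation
inside the while (len) loop that follows; roughly (paraphrased, not an exact
copy of block/bio.c):

	while (len) {
		unsigned int bytes = PAGE_SIZE;
		...
		/* bounce pages (the no-map_data case) come from here */
		page = alloc_page(q->bounce_gfp | gfp_mask);
		...
	}

Either way the whole READ buffer gets cleared once per request; the only
difference is whether the page allocator or zero_fill_bio() does the memset.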
Keith Busch Nov. 7, 2018, 3:44 p.m. UTC | #4
On Wed, Nov 07, 2018 at 11:44:59PM +0800, Ming Lei wrote:
> blk_update_request() may tell us how much progress was made, :-)

Except when it doesn't, which is 100% of the time for many block
drivers, including nvme.
Ming Lei Nov. 7, 2018, 3:44 p.m. UTC | #5
On Wed, Nov 7, 2018 at 11:19 PM Keith Busch <keith.busch@intel.com> wrote:
>
> On Wed, Nov 07, 2018 at 11:09:27PM +0800, Ming Lei wrote:
> > On Wed, Nov 7, 2018 at 10:42 PM Keith Busch <keith.busch@intel.com> wrote:
> > >
> > > If the kernel allocates a bounce buffer for user read data, this memory
> > > needs to be cleared before copying it to the user, otherwise it may leak
> > > kernel memory to user space.
> > >
> > > Signed-off-by: Keith Busch <keith.busch@intel.com>
> > > ---
> > >  block/bio.c | 1 +
> > >  1 file changed, 1 insertion(+)
> > >
> > > diff --git a/block/bio.c b/block/bio.c
> > > index d5368a445561..a50d59236b19 100644
> > > --- a/block/bio.c
> > > +++ b/block/bio.c
> > > @@ -1260,6 +1260,7 @@ struct bio *bio_copy_user_iov(struct request_queue *q,
> > >                 if (ret)
> > >                         goto cleanup;
> > >         } else {
> > > +               zero_fill_bio(bio);
> > >                 iov_iter_advance(iter, bio->bi_iter.bi_size);
> > >         }
> >
> > This way looks inefficient because the zero fill should only be required
> > for a short READ.
>
> Sure, but how do you know that happened before copying the bounce buffer
> to user space?

blk_update_request() may tell us how much progress was made, :-)

So it should work by calling the following code after the request is completed
and before copying the data to user space:

                        __rq_for_each_bio(bio, rq)
                                zero_fill_bio(bio);
>
> We could zero the pages on allocation if that's better (and doesn't zero
> twice if __GFP_ZERO was already provided):
>
> ---
> diff --git a/block/bio.c b/block/bio.c
> index d5368a445561..a1b6383294f4 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1212,6 +1212,9 @@ struct bio *bio_copy_user_iov(struct request_queue *q,
>                 nr_pages = 1 << map_data->page_order;
>                 i = map_data->offset / PAGE_SIZE;
>         }
> +
> +       if (iov_iter_rw(iter) == READ)
> +               gfp_mask |= __GFP_ZERO;

No big difference compared with before, because most of the time the zero fill
shouldn't be needed. However, this patch changes it to be done every time.

Thanks,
Ming Lei
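
A rough sketch of the residual-only zeroing idea above (a hypothetical helper,
not code from the thread). It assumes the completion path reliably reports how
many bytes were actually transferred, and that it runs before the bio iterators
are advanced; these are exactly the assumptions questioned in the messages
below:

/* Hypothetical sketch: clear only what the device did not fill. */
static void zero_untransferred_tail(struct request *rq,
				    unsigned int transferred)
{
	struct bio *bio;

	__rq_for_each_bio(bio, rq) {
		unsigned int len = bio->bi_iter.bi_size;

		if (transferred >= len) {
			/* this bio was fully filled by the device */
			transferred -= len;
			continue;
		}
		/* coarse: wipe the whole partially- or un-filled bio */
		zero_fill_bio(bio);
		transferred = 0;
	}
}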
Ming Lei Nov. 7, 2018, 4:03 p.m. UTC | #6
On Wed, Nov 7, 2018 at 11:47 PM Keith Busch <keith.busch@intel.com> wrote:
>
> On Wed, Nov 07, 2018 at 11:44:59PM +0800, Ming Lei wrote:
> > blk_update_request() may tell us how much progress was made, :-)
>
> Except when it doesn't, which is 100% of the time for many block
> drivers, including nvme.

Please look at blk_mq_end_request()(<- nvme_complete_rq()), which
calls blk_update_request().

Thanks,
Ming Lei
Keith Busch Nov. 7, 2018, 4:09 p.m. UTC | #7
On Thu, Nov 08, 2018 at 12:03:41AM +0800, Ming Lei wrote:
> On Wed, Nov 7, 2018 at 11:47 PM Keith Busch <keith.busch@intel.com> wrote:
> >
> > On Wed, Nov 07, 2018 at 11:44:59PM +0800, Ming Lei wrote:
> > > blk_update_request() may tell us how much progress was made, :-)
> >
> > Except when it doesn't, which is 100% of the time for many block
> > drivers, including nvme.
> 
> Please look at blk_mq_end_request()(<- nvme_complete_rq()), which
> calls blk_update_request().

Huh? That just says 100% of the request size was transferred no matter
what was actually transferred.

The protocol doesn't have a way to express what transfer size occurred
with a command's completion, and even if it did, there's no way I'd trust
every firmware not to screw it up.
Jens Axboe Nov. 7, 2018, 10:41 p.m. UTC | #8
On 11/7/18 7:37 AM, Keith Busch wrote:
> If the kernel allocates a bounce buffer for user read data, this memory
> needs to be cleared before copying it to the user, otherwise it may leak
> kernel memory to user space.

Applied, thanks.
Ming Lei Nov. 8, 2018, 1:12 a.m. UTC | #9
On Thu, Nov 8, 2018 at 12:12 AM Keith Busch <keith.busch@intel.com> wrote:
>
> On Thu, Nov 08, 2018 at 12:03:41AM +0800, Ming Lei wrote:
> > On Wed, Nov 7, 2018 at 11:47 PM Keith Busch <keith.busch@intel.com> wrote:
> > >
> > > On Wed, Nov 07, 2018 at 11:44:59PM +0800, Ming Lei wrote:
> > > > blk_update_request() may tell us how much progress was made, :-)
> > >
> > > Except when it doesn't, which is 100% of the time for many block
> > > drivers, including nvme.
> >
> > Please look at blk_mq_end_request()(<- nvme_complete_rq()), which
> > calls blk_update_request().
>
> Huh? That just says 100% of the request size was transferred no matter
> what was actually transferred.
>
> The protocol doesn't have a way to express what transfer size occurred
> with a command's completion, and even if it did, there's no way I'd trust
> every firmware not to screw it up.

That sounds like the following:

1) userspace asks to read 8 sectors starting from sector 0
2) nvme tells the hardware to do that, and the hardware transfers only
4 sectors (0~3) of data to memory, but still tells the driver it has
done 8 sectors (0~7).

What if there is real data in sectors 4~7?

Is it an NVMe-specific issue or a common problem in other storage hardware?
SCSI does call blk_update_request() and handles partial completion.

Thanks,
Ming Lei
Keith Busch Nov. 8, 2018, 1:22 a.m. UTC | #10
On Thu, Nov 08, 2018 at 09:12:59AM +0800, Ming Lei wrote:
> Is it NVMe specific issue or common problem in other storage hardware?  SCSI
> does call blk_update_request() and handles partial completion.

Not specific to NVMe.

An example using SG_IO dumping 2MB of unsanitized kernel memory:

sg-test.c:
---
#include <fcntl.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>
#include <scsi/scsi.h>

#define SIZE (2 * 1024 * 1024 + 8)
int main(int argc, char **argv)
{
	struct sg_io_hdr io_hdr;
	unsigned char *buffer, cmd[6] = { TEST_UNIT_READY };
	int sg, i;

	if (argc < 2)
		fprintf(stderr, "usage: %s <sgdev>\n", argv[0]), exit(0);

	sg = open(argv[1], O_RDONLY);
	if (sg < 0)
		perror("open"), exit(0);

	buffer = malloc(SIZE);
	if (!buffer)
		fprintf(stderr, "no memory\n"), exit(0);

	memset(&io_hdr, 0, sizeof(struct sg_io_hdr));
	io_hdr.interface_id = 'S';
	io_hdr.cmd_len = 6;
	io_hdr.cmdp = cmd;
	io_hdr.dxfer_direction = SG_DXFER_FROM_DEV;
	io_hdr.dxfer_len = SIZE;
	io_hdr.dxferp = buffer;

	memset(buffer, 0, SIZE);
	ioctl(sg, SG_IO, &io_hdr);
	for (i = 0; i < SIZE; i++) {
		printf("%02x", buffer[i]);
		if ((i + 1) % 32 == 0)
			printf("\n");
	}
}
--
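
(A note on the reproducer: the odd SIZE of 2 MB plus 8 bytes is presumably what
pushes the request onto the copy/bounce path in bio_copy_user_iov() rather than
mapping the user pages directly, and TEST_UNIT_READY transfers no data at all,
so the entire bounce buffer comes back untouched. Build it with something like
gcc -o sg-test sg-test.c and run it against a SCSI node as shown below; on a
kernel with the fix applied, the output should be all zeroes and the grep
should print nothing.)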

Test on qemu:
---
$ ./sg-test /dev/sda | grep -v 000000000000000000000000000000000
40733f4019dbffff8001244019dbffff4065244019dbffff0094244019dbffff
c025244019dbffffc0e43a4019dbffff40973a4019dbffffc0623a4019dbffff
800c244019dbffffc0d61d4019dbffffc05f244019dbffff80091e4019dbffff
40913a4019dbffff806f3f4019dbffff40a83f4019dbffffc083244019dbffff
80eb1e4019dbffff00a93f4019dbffffc09a3a4019dbffff40503f4019dbffff
007f1b4019dbffffc0d91e4019dbffff40551e4019dbffff804a1b4019dbffff
....
--
Jens Axboe Nov. 8, 2018, 1:31 a.m. UTC | #11
On 11/7/18 6:12 PM, Ming Lei wrote:
> On Thu, Nov 8, 2018 at 12:12 AM Keith Busch <keith.busch@intel.com> wrote:
>>
>> On Thu, Nov 08, 2018 at 12:03:41AM +0800, Ming Lei wrote:
>>> On Wed, Nov 7, 2018 at 11:47 PM Keith Busch <keith.busch@intel.com> wrote:
>>>>
>>>> On Wed, Nov 07, 2018 at 11:44:59PM +0800, Ming Lei wrote:
>>>>> blk_update_request() may tell us how much progress was made, :-)
>>>>
>>>> Except when it doesn't, which is 100% of the time for many block
>>>> drivers, including nvme.
>>>
>>> Please look at blk_mq_end_request()(<- nvme_complete_rq()), which
>>> calls blk_update_request().
>>
>> Huh? That just says 100% of the request size was transferred no matter
>> what was actually transferred.
>>
>> The protocol doesn't have a way to express what transfer size occurred
>> with a command's completion, and even if it did, there's no way I'd trust
>> every firmware not to screw it up.
> 
> That sounds like the following:
>
> 1) userspace asks to read 8 sectors starting from sector 0
> 2) nvme tells the hardware to do that, and the hardware transfers only
> 4 sectors (0~3) of data to memory, but still tells the driver it has
> done 8 sectors (0~7).
>
> What if there is real data in sectors 4~7?
>
> Is it an NVMe-specific issue or a common problem in other storage hardware?
> SCSI does call blk_update_request() and handles partial completion.

I would never believe any of that; it's much better to err on the side
of caution for things like this. blk-mq doesn't even support partial
completions, exactly because most hw doesn't even support it. But even
for the ones that do, I would not trust it a lot.
Johannes Thumshirn Nov. 8, 2018, 10:07 a.m. UTC | #12
On 08/11/2018 02:22, Keith Busch wrote:
> $ ./sg-test /dev/sda | grep -v 000000000000000000000000000000000
> 40733f4019dbffff8001244019dbffff4065244019dbffff0094244019dbffff
> c025244019dbffffc0e43a4019dbffff40973a4019dbffffc0623a4019dbffff
> 800c244019dbffffc0d61d4019dbffffc05f244019dbffff80091e4019dbffff
> 40913a4019dbffff806f3f4019dbffff40a83f4019dbffffc083244019dbffff
> 80eb1e4019dbffff00a93f4019dbffffc09a3a4019dbffff40503f4019dbffff
> 007f1b4019dbffffc0d91e4019dbffff40551e4019dbffff804a1b4019dbffff
> ....
> --
> 

Hi Keith,
Can you please add the above to blktests?

This would be very useful for downstream distributions.

Thanks a lot,
	Johannes
Ming Lei Nov. 8, 2018, 11:10 a.m. UTC | #13
On Thu, Nov 8, 2018 at 6:07 PM Johannes Thumshirn <jthumshirn@suse.de> wrote:
>
> On 08/11/2018 02:22, Keith Busch wrote:
> > $ ./sg-test /dev/sda | grep -v 000000000000000000000000000000000
> > 40733f4019dbffff8001244019dbffff4065244019dbffff0094244019dbffff
> > c025244019dbffffc0e43a4019dbffff40973a4019dbffffc0623a4019dbffff
> > 800c244019dbffffc0d61d4019dbffffc05f244019dbffff80091e4019dbffff
> > 40913a4019dbffff806f3f4019dbffff40a83f4019dbffffc083244019dbffff
> > 80eb1e4019dbffff00a93f4019dbffffc09a3a4019dbffff40503f4019dbffff
> > 007f1b4019dbffffc0d91e4019dbffff40551e4019dbffff804a1b4019dbffff
> > ....
> > --
> >
>
> Hi Keith,
> Can you please add the above to blktests?
>
> This would be very useful for downstream distributions.

I guess the issue may depend on the specific QEMU version. I just tried the
test over virtio-scsi/sata/usb-storage emulated via qemu-2.10.2-1.fc27 and
did not observe this problem.

Thanks,
Ming Lei
Keith Busch Nov. 8, 2018, 3:37 p.m. UTC | #14
On Thu, Nov 08, 2018 at 07:10:58PM +0800, Ming Lei wrote:
> I guess the issue may depend on the specific QEMU version. I just tried the
> test over virtio-scsi/sata/usb-storage emulated via qemu-2.10.2-1.fc27 and
> did not observe this problem.

I actually didn't use virtio-scsi, but it really doesn't matter.

FWIW, this is what I ran:

  # qemu-system-x86_64 --version
  QEMU emulator version 2.10.2(qemu-2.10.2-1.fc27)

  # qemu-system-x86_64 -m 192 -smp 2 -enable-kvm -display none -snapshot \
    -hda /mnt/images/fedora-27.img  -nographic \
    -append "console=tty0 console=ttyS0 root=/dav/sda rw" \
    -kernel /boot/vmlinuz-4.18.10-100.fc27.x86_64 \
    -initrd /boot/initramfs-4.18.10-100.fc27.x86_64.img

The file "fedora-27.img" is just a filesystem image of a minimal mock
setup from /var/lib/mock/fedora-27-x86_64/root/.

A small memory size makes it easier to observe; otherwise the probability
of allocating unclean pages is lower. That's really the only reason I used
qemu: all my real machines have far more memory than I ever come close to
using, which makes observations more random.

Patch

diff --git a/block/bio.c b/block/bio.c
index d5368a445561..a50d59236b19 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1260,6 +1260,7 @@  struct bio *bio_copy_user_iov(struct request_queue *q,
 		if (ret)
 			goto cleanup;
 	} else {
+		zero_fill_bio(bio);
 		iov_iter_advance(iter, bio->bi_iter.bi_size);
 	}