fs/io_uring.c: fix null ptr dereference in io_send_recvmsg()

Message ID 20200804125637.GA22088@ubuntu
State New
Series
  • fs/io_uring.c: fix null ptr dereference in io_send_recvmsg()

Commit Message

Liu Yong Aug. 4, 2020, 12:56 p.m. UTC
In io_send_recvmsg(), there is no check for req->file.
A user can change the opcode from IORING_OP_NOP to IORING_OP_SENDMSG
through a race after io_req_set_file().

This vulnerability can be used to leak sensitive kernel information.

[352693.910110] BUG: kernel NULL pointer dereference, address: 0000000000000028
[352693.910112] #PF: supervisor read access in kernel mode
[352693.910112] #PF: error_code(0x0000) - not-present page
[352693.910113] PGD 8000000251396067 P4D 8000000251396067 PUD 1d64ba067 PMD 0
[352693.910115] Oops: 0000 [#3] SMP PTI
[352693.910116] CPU: 11 PID: 303132 Comm: test Tainted: G      D
[352693.910117] Hardware name: Dell Inc. OptiPlex 3060/0T0MHW, BIOS 1.4.2 06/11/2019
[352693.910120] RIP: 0010:sock_from_file+0x9/0x30
[352693.910122] RSP: 0018:ffffc0a5084cfc50 EFLAGS: 00010246
[352693.910122] RAX: ffff9f6ee284d000 RBX: ffff9f6bd3675000 RCX: ffffffff8b111340
[352693.910123] RDX: 0000000000000001 RSI: ffffc0a5084cfc64 RDI: 0000000000000000
[352693.910124] RBP: ffffc0a5084cfc50 R08: 0000000000000000 R09: ffff9f6ee51a9200
[352693.910124] R10: ffff9f6ee284d200 R11: 0000000000000000 R12: ffff9f6ee51a9200
[352693.910125] R13: 0000000000000001 R14: ffffffff8b111340 R15: ffff9f6ee284d000
[352693.910126] FS:  00000000016d7880(0000) GS:ffff9f6eee2c0000(0000) knlGS:0000000000000000
[352693.910126] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[352693.910127] CR2: 0000000000000028 CR3: 000000041fb4a005 CR4: 00000000003626e0
[352693.910127] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[352693.910128] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[352693.910128] Call Trace:
[352693.910132]  io_send_recvmsg+0x49/0x170
[352693.910134]  ? __switch_to_asm+0x34/0x70
[352693.910135]  __io_submit_sqe+0x45e/0x8e0
[352693.910136]  ? __switch_to_asm+0x34/0x70
[352693.910137]  ? __switch_to_asm+0x40/0x70
[352693.910138]  ? __switch_to_asm+0x34/0x70
[352693.910138]  ? __switch_to_asm+0x40/0x70
[352693.910139]  ? __switch_to_asm+0x34/0x70
[352693.910140]  ? __switch_to_asm+0x40/0x70
[352693.910141]  ? __switch_to_asm+0x34/0x70
[352693.910142]  ? __switch_to_asm+0x40/0x70
[352693.910143]  ? __switch_to_asm+0x34/0x70
[352693.910144]  ? __switch_to_asm+0x34/0x70
[352693.910145]  __io_queue_sqe+0x23/0x230
[352693.910146]  io_queue_sqe+0x7a/0x90
[352693.910148]  io_submit_sqe+0x23d/0x330
[352693.910149]  io_ring_submit+0xca/0x200
[352693.910150]  ? do_nanosleep+0xad/0x160
[352693.910151]  ? hrtimer_init_sleeper+0x2c/0x90
[352693.910152]  ? hrtimer_nanosleep+0xc2/0x1a0
[352693.910154]  __x64_sys_io_uring_enter+0x1e4/0x2c0
[352693.910156]  do_syscall_64+0x57/0x190
[352693.910157]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: Liu Yong <pkfxxxing@gmail.com>
---
 fs/io_uring.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Pavel Begunkov Aug. 4, 2020, 1:18 p.m. UTC | #1
On 04/08/2020 15:56, Liu Yong wrote:
> In io_send_recvmsg(), there is no check for the req->file.
> User can change the opcode from IORING_OP_NOP to IORING_OP_SENDMSG
> through competition after the io_req_set_file().

After sqe->opcode is read and copied in io_init_req(), only the
in-kernel req->opcode is used. Also, io_init_req() should be checking
req->file for NULL, so this shouldn't happen afterwards.

Do you have a reproducer? What kernel version did you use?

> 
> This vulnerability will leak sensitive kernel information.
> 
> [...]
> 
> Signed-off-by: Liu Yong <pkfxxxing@gmail.com>
> ---
>  fs/io_uring.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index e0200406765c..0a26100b8260 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -1675,6 +1675,9 @@ static int io_send_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>  	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
>  		return -EINVAL;
>  
> +	if (!req->file)
> +		return -EBADF;
> +
>  	sock = sock_from_file(req->file, &ret);
>  	if (sock) {
>  		struct user_msghdr __user *msg;
>
Jens Axboe Aug. 4, 2020, 1:27 p.m. UTC | #2
On 8/4/20 7:18 AM, Pavel Begunkov wrote:
> On 04/08/2020 15:56, Liu Yong wrote:
>> In io_send_recvmsg(), there is no check for the req->file.
>> User can change the opcode from IORING_OP_NOP to IORING_OP_SENDMSG
>> through competition after the io_req_set_file().
> 
> After sqe->opcode is read and copied in io_init_req(), it only uses
> in-kernel req->opcode. Also, io_init_req() should check for req->file
> NULL, so shouldn't happen after.
> 
> Do you have a reproducer? What kernel version did you use?

Was looking at this too, and I'm guessing this is some 5.4 based kernel.
Unfortunately the oops doesn't include that information.
Jens Axboe Aug. 4, 2020, 5:15 p.m. UTC | #3
On 8/4/20 11:02 AM, xiao lin wrote:
> On Tuesday, August 4, 2020, Jens Axboe <axboe@kernel.dk> wrote:
> 
>     On 8/4/20 7:18 AM, Pavel Begunkov wrote:
>     > On 04/08/2020 15:56, Liu Yong wrote:
>     >> In io_send_recvmsg(), there is no check for the req->file.
>     >> User can change the opcode from IORING_OP_NOP to IORING_OP_SENDMSG
>     >> through competition after the io_req_set_file().
>     >
>     > After sqe->opcode is read and copied in io_init_req(), it only uses
>     > in-kernel req->opcode. Also, io_init_req() should check for req->file
>     > NULL, so shouldn't happen after.
>     >
>     > Do you have a reproducer? What kernel version did you use?
> 
>     Was looking at this too, and I'm guessing this is some 5.4 based kernel.
>     Unfortunately the oops doesn't include that information.

> Sorry, I forgot to mention that the kernel version I am using is 5.4.55.

I think there are two options here:

1) Backport the series that ensured we only read those important bits once
2) Make s->sqe a full sqe, and memcpy it in
Jens Axboe Aug. 4, 2020, 9:55 p.m. UTC | #4
On 8/4/20 11:15 AM, Jens Axboe wrote:
> On 8/4/20 11:02 AM, xiao lin wrote:
>> On Tuesday, August 4, 2020, Jens Axboe <axboe@kernel.dk> wrote:
>>
>>     On 8/4/20 7:18 AM, Pavel Begunkov wrote:
>>     > On 04/08/2020 15:56, Liu Yong wrote:
>>     >> In io_send_recvmsg(), there is no check for the req->file.
>>     >> User can change the opcode from IORING_OP_NOP to IORING_OP_SENDMSG
>>     >> through competition after the io_req_set_file().
>>     >
>>     > After sqe->opcode is read and copied in io_init_req(), it only uses
>>     > in-kernel req->opcode. Also, io_init_req() should check for req->file
>>     > NULL, so shouldn't happen after.
>>     >
>>     > Do you have a reproducer? What kernel version did you use?
>>
>>     Was looking at this too, and I'm guessing this is some 5.4 based kernel.
>>     Unfortunately the oops doesn't include that information.
> 
>> Sorry, I forgot to mention that the kernel version I am using is 5.4.55.
> 
> I think there are two options here:
> 
> 1) Backport the series that ensured we only read those important bits once
> 2) Make s->sqe a full sqe, and memcpy it in

Something like this should close the ->opcode re-read for 5.4-stable.

diff --git a/fs/io_uring.c b/fs/io_uring.c
index e0200406765c..8bb5e19b7c3c 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -279,6 +279,7 @@ struct sqe_submit {
 	bool				has_user;
 	bool				needs_lock;
 	bool				needs_fixed_file;
+	u8				opcode;
 };
 
 /*
@@ -505,7 +506,7 @@ static inline void io_queue_async_work(struct io_ring_ctx *ctx,
 	int rw = 0;
 
 	if (req->submit.sqe) {
-		switch (req->submit.sqe->opcode) {
+		switch (req->submit.opcode) {
 		case IORING_OP_WRITEV:
 		case IORING_OP_WRITE_FIXED:
 			rw = !(req->rw.ki_flags & IOCB_DIRECT);
@@ -1254,23 +1255,15 @@ static int io_import_fixed(struct io_ring_ctx *ctx, int rw,
 }
 
 static ssize_t io_import_iovec(struct io_ring_ctx *ctx, int rw,
-			       const struct sqe_submit *s, struct iovec **iovec,
+			       struct io_kiocb *req, struct iovec **iovec,
 			       struct iov_iter *iter)
 {
-	const struct io_uring_sqe *sqe = s->sqe;
+	const struct io_uring_sqe *sqe = req->submit.sqe;
 	void __user *buf = u64_to_user_ptr(READ_ONCE(sqe->addr));
 	size_t sqe_len = READ_ONCE(sqe->len);
 	u8 opcode;
 
-	/*
-	 * We're reading ->opcode for the second time, but the first read
-	 * doesn't care whether it's _FIXED or not, so it doesn't matter
-	 * whether ->opcode changes concurrently. The first read does care
-	 * about whether it is a READ or a WRITE, so we don't trust this read
-	 * for that purpose and instead let the caller pass in the read/write
-	 * flag.
-	 */
-	opcode = READ_ONCE(sqe->opcode);
+	opcode = req->submit.opcode;
 	if (opcode == IORING_OP_READ_FIXED ||
 	    opcode == IORING_OP_WRITE_FIXED) {
 		ssize_t ret = io_import_fixed(ctx, rw, sqe, iter);
@@ -1278,7 +1271,7 @@ static ssize_t io_import_iovec(struct io_ring_ctx *ctx, int rw,
 		return ret;
 	}
 
-	if (!s->has_user)
+	if (!req->submit.has_user)
 		return -EFAULT;
 
 #ifdef CONFIG_COMPAT
@@ -1425,7 +1418,7 @@ static int io_read(struct io_kiocb *req, const struct sqe_submit *s,
 	if (unlikely(!(file->f_mode & FMODE_READ)))
 		return -EBADF;
 
-	ret = io_import_iovec(req->ctx, READ, s, &iovec, &iter);
+	ret = io_import_iovec(req->ctx, READ, req, &iovec, &iter);
 	if (ret < 0)
 		return ret;
 
@@ -1490,7 +1483,7 @@ static int io_write(struct io_kiocb *req, const struct sqe_submit *s,
 	if (unlikely(!(file->f_mode & FMODE_WRITE)))
 		return -EBADF;
 
-	ret = io_import_iovec(req->ctx, WRITE, s, &iovec, &iter);
+	ret = io_import_iovec(req->ctx, WRITE, req, &iovec, &iter);
 	if (ret < 0)
 		return ret;
 
@@ -2109,15 +2102,14 @@ static int io_req_defer(struct io_ring_ctx *ctx, struct io_kiocb *req,
 static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 			   const struct sqe_submit *s, bool force_nonblock)
 {
-	int ret, opcode;
+	int ret;
 
 	req->user_data = READ_ONCE(s->sqe->user_data);
 
 	if (unlikely(s->index >= ctx->sq_entries))
 		return -EINVAL;
 
-	opcode = READ_ONCE(s->sqe->opcode);
-	switch (opcode) {
+	switch (req->submit.opcode) {
 	case IORING_OP_NOP:
 		ret = io_nop(req, req->user_data);
 		break;
@@ -2181,10 +2173,10 @@ static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 	return 0;
 }
 
-static struct async_list *io_async_list_from_sqe(struct io_ring_ctx *ctx,
-						 const struct io_uring_sqe *sqe)
+static struct async_list *io_async_list_from_req(struct io_ring_ctx *ctx,
+						 struct io_kiocb *req)
 {
-	switch (sqe->opcode) {
+	switch (req->submit.opcode) {
 	case IORING_OP_READV:
 	case IORING_OP_READ_FIXED:
 		return &ctx->pending_async[READ];
@@ -2196,12 +2188,10 @@ static struct async_list *io_async_list_from_sqe(struct io_ring_ctx *ctx,
 	}
 }
 
-static inline bool io_sqe_needs_user(const struct io_uring_sqe *sqe)
+static inline bool io_req_needs_user(struct io_kiocb *req)
 {
-	u8 opcode = READ_ONCE(sqe->opcode);
-
-	return !(opcode == IORING_OP_READ_FIXED ||
-		 opcode == IORING_OP_WRITE_FIXED);
+	return !(req->submit.opcode == IORING_OP_READ_FIXED ||
+		req->submit.opcode == IORING_OP_WRITE_FIXED);
 }
 
 static void io_sq_wq_submit_work(struct work_struct *work)
@@ -2217,7 +2207,7 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 	int ret;
 
 	old_cred = override_creds(ctx->creds);
-	async_list = io_async_list_from_sqe(ctx, req->submit.sqe);
+	async_list = io_async_list_from_req(ctx, req);
 
 	allow_kernel_signal(SIGINT);
 restart:
@@ -2239,7 +2229,7 @@ static void io_sq_wq_submit_work(struct work_struct *work)
 		}
 
 		ret = 0;
-		if (io_sqe_needs_user(sqe) && !cur_mm) {
+		if (io_req_needs_user(req) && !cur_mm) {
 			if (!mmget_not_zero(ctx->sqo_mm)) {
 				ret = -EFAULT;
 			} else {
@@ -2387,11 +2377,9 @@ static bool io_add_to_prev_work(struct async_list *list, struct io_kiocb *req)
 	return ret;
 }
 
-static bool io_op_needs_file(const struct io_uring_sqe *sqe)
+static bool io_op_needs_file(struct io_kiocb *req)
 {
-	int op = READ_ONCE(sqe->opcode);
-
-	switch (op) {
+	switch (req->submit.opcode) {
 	case IORING_OP_NOP:
 	case IORING_OP_POLL_REMOVE:
 	case IORING_OP_TIMEOUT:
@@ -2419,7 +2407,7 @@ static int io_req_set_file(struct io_ring_ctx *ctx, const struct sqe_submit *s,
 	 */
 	req->sequence = s->sequence;
 
-	if (!io_op_needs_file(s->sqe))
+	if (!io_op_needs_file(req))
 		return 0;
 
 	if (flags & IOSQE_FIXED_FILE) {
@@ -2460,7 +2448,7 @@ static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 
 			s->sqe = sqe_copy;
 			memcpy(&req->submit, s, sizeof(*s));
-			list = io_async_list_from_sqe(ctx, s->sqe);
+			list = io_async_list_from_req(ctx, req);
 			if (!io_add_to_prev_work(list, req)) {
 				if (list)
 					atomic_inc(&list->cnt);
@@ -2582,7 +2570,7 @@ static void io_submit_sqe(struct io_ring_ctx *ctx, struct sqe_submit *s,
 	req->user_data = s->sqe->user_data;
 
 #if defined(CONFIG_NET)
-	switch (READ_ONCE(s->sqe->opcode)) {
+	switch (req->submit.opcode) {
 	case IORING_OP_SENDMSG:
 	case IORING_OP_RECVMSG:
 		spin_lock(&current->fs->lock);
@@ -2697,6 +2685,7 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
 	if (head < ctx->sq_entries) {
 		s->index = head;
 		s->sqe = &ctx->sq_sqes[head];
+		s->opcode = READ_ONCE(s->sqe->opcode);
 		s->sequence = ctx->cached_sq_head;
 		ctx->cached_sq_head++;
 		return true;
Liu Yong Aug. 5, 2020, 3:40 a.m. UTC | #5
On Tue, Aug 04, 2020 at 03:55:16PM -0600, Jens Axboe wrote:
> On 8/4/20 11:15 AM, Jens Axboe wrote:
> [...]
> 
> Something like this should close the ->opcode re-read for 5.4-stable.
> 
> [...]

I think this patch solves these problems at the root cause.
So, should I submit this commit, or will you?
Jens Axboe Aug. 5, 2020, 4:10 a.m. UTC | #6
On 8/4/20 9:40 PM, Liu Yong wrote:
> On Tue, Aug 04, 2020 at 03:55:16PM -0600, Jens Axboe wrote:
>> [...]
> 
> I think this patch solves similar problems from the root cause.
> So, Should I submit this commit, or you?

Thanks for testing, I'll add your tested-by. Probably best if I do it,
since it's going to 5.4-stable only and it's not from upstream.

Patch

diff --git a/fs/io_uring.c b/fs/io_uring.c
index e0200406765c..0a26100b8260 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1675,6 +1675,9 @@  static int io_send_recvmsg(struct io_kiocb *req, const struct io_uring_sqe *sqe,
 	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
 		return -EINVAL;
 
+	if (!req->file)
+		return -EBADF;
+
 	sock = sock_from_file(req->file, &ret);
 	if (sock) {
 		struct user_msghdr __user *msg;