
[v2,bpf-next,1/8] tcp: seq_file: Avoid skipping sk during tcp_seek_last_pos

Message ID 20210701200541.1033917-1-kafai@fb.com (mailing list archive)
State Accepted
Delegated to: BPF
Series bpf: Allow bpf tcp iter to do bpf_(get|set)sockopt

Checks

Context Check Description
netdev/cover_letter success
netdev/fixes_present success
netdev/patch_count success
netdev/tree_selection success Clearly marked for bpf-next
netdev/subject_prefix success
netdev/cc_maintainers fail 2 blamed authors not CCed: davem@davemloft.net therbert@google.com; 9 maintainers not CCed: dsahern@kernel.org kpsingh@kernel.org yoshfuji@linux-ipv6.org andrii@kernel.org john.fastabend@gmail.com therbert@google.com songliubraving@fb.com davem@davemloft.net kuba@kernel.org
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 15 this patch: 15
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 23 lines checked
netdev/build_allmodconfig_warn success Errors and warnings before: 15 this patch: 15
netdev/header_inline success

Commit Message

Martin KaFai Lau July 1, 2021, 8:05 p.m. UTC
st->bucket stores the current bucket number.
st->offset stores the offset within this bucket of the sk to be
seq_show().  Thus, st->offset only makes sense within the same
st->bucket.

These two variables are an optimization for the common no-lseek case.
When resuming the seq_file iteration (i.e. seq_start()),
tcp_seek_last_pos() tries to continue from the st->offset
at bucket st->bucket.

However, it is possible that the bucket pointed to by st->bucket
has changed and st->offset may end up skipping the whole st->bucket
without finding a sk.  In this case, tcp_seek_last_pos() currently
continues to satisfy the offset condition in the next (and incorrect)
bucket.  Instead, regardless of the offset value, the first sk of the
next bucket should be returned.  Thus, a "bucket == st->bucket" check
is added to tcp_seek_last_pos().
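
For reference, a simplified sketch of the listening-hash half of
tcp_seek_last_pos() with this check applied (it mirrors the diff below;
the established-hash half and the surrounding bounds checks are elided):

	static void *tcp_seek_last_pos(struct seq_file *seq)
	{
		struct tcp_iter_state *st = seq->private;
		int bucket = st->bucket;	/* bucket saved at the last stop */
		int offset = st->offset;	/* offset saved at the last stop */
		void *rc;

		st->state = TCP_SEQ_STATE_LISTENING;
		rc = listening_get_next(seq, NULL);
		/* Replay the saved offset only while still walking the
		 * saved bucket; once that bucket is exhausted, return
		 * the first sk of the next bucket instead of skipping it.
		 */
		while (offset-- && rc && bucket == st->bucket)
			rc = listening_get_next(seq, rc);
		return rc;
	}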

The chance of hitting this is small and the issue is a decade old,
so targeting for the next tree.

Fixes: a8b690f98baf ("tcp: Fix slowness in read /proc/net/tcp")
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
---
 net/ipv4/tcp_ipv4.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Comments

Iwashima, Kuniyuki July 22, 2021, 2:16 p.m. UTC | #1
From:   Martin KaFai Lau <kafai@fb.com>
Date:   Thu, 1 Jul 2021 13:05:41 -0700
> st->bucket stores the current bucket number.
> st->offset stores the offset within this bucket of the sk to be
> seq_show().  Thus, st->offset only makes sense within the same
> st->bucket.
> 
> These two variables are an optimization for the common no-lseek case.
> When resuming the seq_file iteration (i.e. seq_start()),
> tcp_seek_last_pos() tries to continue from the st->offset
> at bucket st->bucket.
> 
> However, it is possible that the bucket pointed to by st->bucket
> has changed and st->offset may end up skipping the whole st->bucket
> without finding a sk.  In this case, tcp_seek_last_pos() currently
> continues to satisfy the offset condition in the next (and incorrect)
> bucket.  Instead, regardless of the offset value, the first sk of the
> next bucket should be returned.  Thus, a "bucket == st->bucket" check
> is added to tcp_seek_last_pos().
> 
> The chance of hitting this is small and the issue is a decade old,
> so targeting for the next tree.

Multiple read()s or lseek()+read() can call tcp_seek_last_pos().

IIUC, the problem happens when the sockets placed before the last shown
socket in the list are closed between read()s, or between lseek() and read().

I think there is still a case where bucket is valid but offset is invalid:

  listening_hash[1] -> sk1 -> sk2 -> sk3 -> nulls
  listening_hash[2] -> sk4 -> sk5 -> nulls

  read(/proc/net/tcp)
    end up with sk2

  close(sk1)

  listening_hash[1] -> sk2 -> sk3 -> nulls
  listening_hash[2] -> sk4 -> sk5 -> nulls

  read(/proc/net/tcp) (resume)
    offset = 2

    listening_get_next() returns sk2

    while (offset--)
      1st loop listening_get_next() returns sk3 (bucket == st->bucket)
      2nd loop listening_get_next() returns sk4 (bucket != st->bucket)

    show() starts from sk4

    only sk3 is skipped, but it should be shown.

In listening_get_next(), we can check if we passed through sk2, but this
does not work well if sk2 itself is closed... then there is no way to
check whether the offset is valid or not.

Handling this may be too much though; what do you think?
Iwashima, Kuniyuki July 22, 2021, 3:08 p.m. UTC | #2
From:   Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Date:   Thu, 22 Jul 2021 23:16:37 +0900
> From:   Martin KaFai Lau <kafai@fb.com>
> Date:   Thu, 1 Jul 2021 13:05:41 -0700
> > st->bucket stores the current bucket number.
> > st->offset stores the offset within this bucket of the sk to be
> > seq_show().  Thus, st->offset only makes sense within the same
> > st->bucket.
> > 
> > These two variables are an optimization for the common no-lseek case.
> > When resuming the seq_file iteration (i.e. seq_start()),
> > tcp_seek_last_pos() tries to continue from the st->offset
> > at bucket st->bucket.
> > 
> > However, it is possible that the bucket pointed to by st->bucket
> > has changed and st->offset may end up skipping the whole st->bucket
> > without finding a sk.  In this case, tcp_seek_last_pos() currently
> > continues to satisfy the offset condition in the next (and incorrect)
> > bucket.  Instead, regardless of the offset value, the first sk of the
> > next bucket should be returned.  Thus, a "bucket == st->bucket" check
> > is added to tcp_seek_last_pos().
> > 
> > The chance of hitting this is small and the issue is a decade old,
> > so targeting for the next tree.
> 
> Multiple read()s or lseek()+read() can call tcp_seek_last_pos().
> 
> IIUC, the problem happens when the sockets placed before the last shown
> socket in the list are closed between read()s, or between lseek() and read().
> 
> I think there is still a case where bucket is valid but offset is invalid:
> 
>   listening_hash[1] -> sk1 -> sk2 -> sk3 -> nulls
>   listening_hash[2] -> sk4 -> sk5 -> nulls
> 
>   read(/proc/net/tcp)
>     end up with sk2
> 
>   close(sk1)
> 
>   listening_hash[1] -> sk2 -> sk3 -> nulls
>   listening_hash[2] -> sk4 -> sk5 -> nulls
> 
>   read(/proc/net/tcp) (resume)
>     offset = 2
> 
>     listening_get_next() returns sk2
> 
>     while (offset--)
>       1st loop listening_get_next() returns sk3 (bucket == st->bucket)
>       2nd loop listening_get_next() returns sk4 (bucket != st->bucket)
> 
>     show() starts from sk4
> 
>     only sk3 is skipped, but it should be shown.

Sorry, this example is wrong.
We can handle this properly by testing bucket != st->bucket.

In the case below, we cannot check if the offset is valid or not by testing
the bucket.

  listening_hash[1] -> sk1 -> sk2 -> sk3 -> sk4 -> nulls

  read(/proc/net/tcp)
    end up with sk2

  close(sk1)

  listening_hash[1] -> sk2 -> sk3 -> sk4 -> nulls

  read(/proc/net/tcp) (resume)
    offset = 2

    listening_get_first() returns sk2

    while (offset--)
      1st loop listening_get_next() returns sk3 (bucket == st->bucket)
      2nd loop listening_get_next() returns sk4 (bucket == st->bucket)

    show() starts from sk4

    only sk3 is skipped, but it should be shown.
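
A toy userspace model of this single-bucket case (hypothetical code,
just to make the walk above concrete; the array stands in for the
bucket's sk list):

	#include <stdio.h>

	int main(void)
	{
		/* listening_hash[1] after close(sk1): sk2 -> sk3 -> sk4 */
		const char *bucket[] = { "sk2", "sk3", "sk4" };
		int offset = 2;	/* saved when sk2 was the last shown sk */
		int i = 0;

		/* listening_get_first(): first sk of the saved bucket */
		const char *rc = bucket[i];

		/* while (offset-- && rc && bucket == st->bucket): the
		 * bucket number never changes here, so the new check
		 * cannot detect that the replayed offset overshoots.
		 */
		while (offset-- && i + 1 < 3)
			rc = bucket[++i];

		printf("show() resumes at %s; sk3 is skipped\n", rc);
		return 0;
	}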


> 
> In listening_get_next(), we can check if we passed through sk2, but this
> does not work well if sk2 itself is closed... then there is no way to
> check whether the offset is valid or not.
> 
> Handling this may be too much though; what do you think?
>
Martin KaFai Lau July 22, 2021, 9:42 p.m. UTC | #3
On Fri, Jul 23, 2021 at 12:08:10AM +0900, Kuniyuki Iwashima wrote:
> From:   Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> Date:   Thu, 22 Jul 2021 23:16:37 +0900
> > From:   Martin KaFai Lau <kafai@fb.com>
> > Date:   Thu, 1 Jul 2021 13:05:41 -0700
> > > st->bucket stores the current bucket number.
> > > st->offset stores the offset within this bucket of the sk to be
> > > seq_show().  Thus, st->offset only makes sense within the same
> > > st->bucket.
> > > 
> > > These two variables are an optimization for the common no-lseek case.
> > > When resuming the seq_file iteration (i.e. seq_start()),
> > > tcp_seek_last_pos() tries to continue from the st->offset
> > > at bucket st->bucket.
> > > 
> > > However, it is possible that the bucket pointed to by st->bucket
> > > has changed and st->offset may end up skipping the whole st->bucket
> > > without finding a sk.  In this case, tcp_seek_last_pos() currently
> > > continues to satisfy the offset condition in the next (and incorrect)
> > > bucket.  Instead, regardless of the offset value, the first sk of the
> > > next bucket should be returned.  Thus, a "bucket == st->bucket" check
> > > is added to tcp_seek_last_pos().
> > > 
> > > The chance of hitting this is small and the issue is a decade old,
> > > so targeting for the next tree.
> > 
> > Multiple read()s or lseek()+read() can call tcp_seek_last_pos().
> > 
> > IIUC, the problem happens when the sockets placed before the last shown
> > socket in the list are closed between read()s, or between lseek() and read().
> > 
> > I think there is still a case where bucket is valid but offset is invalid:
> > 
> >   listening_hash[1] -> sk1 -> sk2 -> sk3 -> nulls
> >   listening_hash[2] -> sk4 -> sk5 -> nulls
> > 
> >   read(/proc/net/tcp)
> >     end up with sk2
> > 
> >   close(sk1)
> > 
> >   listening_hash[1] -> sk2 -> sk3 -> nulls
> >   listening_hash[2] -> sk4 -> sk5 -> nulls
> > 
> >   read(/proc/net/tcp) (resume)
> >     offset = 2
> > 
> >     listening_get_next() returns sk2
> > 
> >     while (offset--)
> >       1st loop listening_get_next() returns sk3 (bucket == st->bucket)
> >       2nd loop listening_get_next() returns sk4 (bucket != st->bucket)
> > 
> >     show() starts from sk4
> > 
> >     only sk3 is skipped, but it should be shown.
> 
> Sorry, this example is wrong.
> We can handle this properly by testing bucket != st->bucket.
> 
> In the case below, we cannot check if the offset is valid or not by testing
> the bucket.
> 
>   listening_hash[1] -> sk1 -> sk2 -> sk3 -> sk4 -> nulls
> 
>   read(/proc/net/tcp)
>     end up with sk2
> 
>   close(sk1)
> 
>   listening_hash[1] -> sk2 -> sk3 -> sk4 -> nulls
> 
>   read(/proc/net/tcp) (resume)
>     offset = 2
> 
>     listening_get_first() returns sk2
> 
>     while (offset--)
>       1st loop listening_get_next() returns sk3 (bucket == st->bucket)
>       2nd loop listening_get_next() returns sk4 (bucket == st->bucket)
> 
>     show() starts from sk4
> 
>     only sk3 is skipped, but it should be shown.
> 
> 
> > 
> > In listening_get_next(), we can check if we passed through sk2, but this
> > does not work well if sk2 itself is closed... then there is no way to
> > check whether the offset is valid or not.
> > 
> > Handling this may be too much though; what do you think?
There will be cases that miss a sk after releasing
the bucket lock (and then things change).  For example,
another case could be that sk_new is added to the head of the bucket,
although that could arguably be treated as a legit miss since
"cat /proc/net/tcp" is already in progress.

The chance of hitting the m->buf limit while the bucket gets changed should
be slim.  If there is a use case such that lhash2 (already hashed by
port+addr) still has a large bucket (e.g. many SO_REUSEPORT sockets), that
will be a better problem to solve first.  IMO, remembering sk2 just to fix
"cat /proc/net/tcp" alone is not worth it.
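
For context, tcp_seek_last_pos() is only exercised on such a resume,
e.g. when a reader drains the seq_file buffer and then read()s again.
A hypothetical repro sketch (nothing here is from the patch; it only
assumes standard /proc and seq_file behavior):

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[64];	/* tiny reads drain m->buf, forcing resumes */
		ssize_t n;
		int fd = open("/proc/net/tcp", O_RDONLY);

		if (fd < 0)
			return 1;
		/* Once the buffered output is exhausted, the next read()
		 * restarts the iterator via seq_start(), i.e.
		 * tcp_seek_last_pos().  Closing listening sockets from
		 * another task between these read()s is what can
		 * invalidate the saved st->offset.
		 */
		while ((n = read(fd, buf, sizeof(buf))) > 0)
			fwrite(buf, 1, (size_t)n, stdout);
		close(fd);
		return 0;
	}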

Thanks for the review!
Iwashima, Kuniyuki July 22, 2021, 10:06 p.m. UTC | #4
From:   Martin KaFai Lau <kafai@fb.com>
Date:   Thu, 22 Jul 2021 14:42:56 -0700
> On Fri, Jul 23, 2021 at 12:08:10AM +0900, Kuniyuki Iwashima wrote:
> > From:   Kuniyuki Iwashima <kuniyu@amazon.co.jp>
> > Date:   Thu, 22 Jul 2021 23:16:37 +0900
> > > From:   Martin KaFai Lau <kafai@fb.com>
> > > Date:   Thu, 1 Jul 2021 13:05:41 -0700
> > > > st->bucket stores the current bucket number.
> > > > st->offset stores the offset within this bucket of the sk to be
> > > > seq_show().  Thus, st->offset only makes sense within the same
> > > > st->bucket.
> > > > 
> > > > These two variables are an optimization for the common no-lseek case.
> > > > When resuming the seq_file iteration (i.e. seq_start()),
> > > > tcp_seek_last_pos() tries to continue from the st->offset
> > > > at bucket st->bucket.
> > > > 
> > > > However, it is possible that the bucket pointed to by st->bucket
> > > > has changed and st->offset may end up skipping the whole st->bucket
> > > > without finding a sk.  In this case, tcp_seek_last_pos() currently
> > > > continues to satisfy the offset condition in the next (and incorrect)
> > > > bucket.  Instead, regardless of the offset value, the first sk of the
> > > > next bucket should be returned.  Thus, a "bucket == st->bucket" check
> > > > is added to tcp_seek_last_pos().
> > > > 
> > > > The chance of hitting this is small and the issue is a decade old,
> > > > so targeting for the next tree.
> > > 
> > > Multiple read()s or lseek()+read() can call tcp_seek_last_pos().
> > > 
> > > IIUC, the problem happens when the sockets placed before the last shown
> > > socket in the list are closed between read()s, or between lseek() and read().
> > > 
> > > I think there is still a case where bucket is valid but offset is invalid:
> > > 
> > >   listening_hash[1] -> sk1 -> sk2 -> sk3 -> nulls
> > >   listening_hash[2] -> sk4 -> sk5 -> nulls
> > > 
> > >   read(/proc/net/tcp)
> > >     end up with sk2
> > > 
> > >   close(sk1)
> > > 
> > >   listening_hash[1] -> sk2 -> sk3 -> nulls
> > >   listening_hash[2] -> sk4 -> sk5 -> nulls
> > > 
> > >   read(/proc/net/tcp) (resume)
> > >     offset = 2
> > > 
> > >     listening_get_next() returns sk2
> > > 
> > >     while (offset--)
> > >       1st loop listening_get_next() returns sk3 (bucket == st->bucket)
> > >       2nd loop listening_get_next() returns sk4 (bucket != st->bucket)
> > > 
> > >     show() starts from sk4
> > > 
> > >     only sk3 is skipped, but it should be shown.
> > 
> > Sorry, this example is wrong.
> > We can handle this properly by testing bucket != st->bucket.
> > 
> > In the case below, we cannot check if the offset is valid or not by testing
> > the bucket.
> > 
> >   listening_hash[1] -> sk1 -> sk2 -> sk3 -> sk4 -> nulls
> > 
> >   read(/proc/net/tcp)
> >     end up with sk2
> > 
> >   close(sk1)
> > 
> >   listening_hash[1] -> sk2 -> sk3 -> sk4 -> nulls
> > 
> >   read(/proc/net/tcp) (resume)
> >     offset = 2
> > 
> >     listening_get_first() returns sk2
> > 
> >     while (offset--)
> >       1st loop listening_get_next() returns sk3 (bucket == st->bucket)
> >       2nd loop listening_get_next() returns sk4 (bucket == st->bucket)
> > 
> >     show() starts from sk4
> > 
> >     only sk3 is skipped, but it should be shown.
> > 
> > 
> > > 
> > > In listening_get_next(), we can check if we passed through sk2, but this
> > > does not work well if sk2 itself is closed... then there is no way to
> > > check whether the offset is valid or not.
> > > 
> > > Handling this may be too much though; what do you think?
> There will be cases that miss a sk after releasing
> the bucket lock (and then things change).  For example,
> another case could be that sk_new is added to the head of the bucket,
> although that could arguably be treated as a legit miss since
> "cat /proc/net/tcp" is already in progress.
> 
> The chance of hitting the m->buf limit while the bucket gets changed should
> be slim.  If there is a use case such that lhash2 (already hashed by
> port+addr) still has a large bucket (e.g. many SO_REUSEPORT sockets), that
> will be a better problem to solve first.  IMO, remembering sk2 just to fix
> "cat /proc/net/tcp" alone is not worth it.

That makes sense.
Thank you for explaining!


> 
> Thanks for the review!

Patch

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index e66ad6bfe808..26b7b2056585 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2451,6 +2451,7 @@ static void *tcp_get_idx(struct seq_file *seq, loff_t pos)
 static void *tcp_seek_last_pos(struct seq_file *seq)
 {
 	struct tcp_iter_state *st = seq->private;
+	int bucket = st->bucket;
 	int offset = st->offset;
 	int orig_num = st->num;
 	void *rc = NULL;
@@ -2461,7 +2462,7 @@ static void *tcp_seek_last_pos(struct seq_file *seq)
 			break;
 		st->state = TCP_SEQ_STATE_LISTENING;
 		rc = listening_get_next(seq, NULL);
-		while (offset-- && rc)
+		while (offset-- && rc && bucket == st->bucket)
 			rc = listening_get_next(seq, rc);
 		if (rc)
 			break;
@@ -2472,7 +2473,7 @@ static void *tcp_seek_last_pos(struct seq_file *seq)
 		if (st->bucket > tcp_hashinfo.ehash_mask)
 			break;
 		rc = established_get_first(seq);
-		while (offset-- && rc)
+		while (offset-- && rc && bucket == st->bucket)
 			rc = established_get_next(seq, rc);
 	}