From patchwork Mon Jan 15 18:38:34 2024
X-Patchwork-Submitter: Suren Baghdasaryan
X-Patchwork-Id: 13520075
X-Mailing-List: linux-fsdevel@vger.kernel.org
Date: Mon, 15 Jan 2024 10:38:34 -0800
In-Reply-To: <20240115183837.205694-1-surenb@google.com>
References: <20240115183837.205694-1-surenb@google.com>
Message-ID: <20240115183837.205694-2-surenb@google.com>
Subject: [RFC 1/3] mm: make vm_area_struct anon_name field RCU-safe
From: Suren Baghdasaryan <surenb@google.com>
To: akpm@linux-foundation.org
Cc: viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
    dchinner@redhat.com, casey@schaufler-ca.com, ben.wolsieffer@hefring.com,
    paulmck@kernel.org, david@redhat.com, avagin@google.com,
    usama.anjum@collabora.com, peterx@redhat.com, hughd@google.com,
    ryan.roberts@arm.com, wangkefeng.wang@huawei.com, Liam.Howlett@Oracle.com,
    yuzhao@google.com, axelrasmussen@google.com, lstoakes@gmail.com,
    talumbau@google.com, willy@infradead.org, vbabka@suse.cz,
    mgorman@techsingularity.net, jhubbard@nvidia.com, vishal.moola@gmail.com,
    mathieu.desnoyers@efficios.com, dhowells@redhat.com, jgg@ziepe.ca,
    sidhartha.kumar@oracle.com, andriy.shevchenko@linux.intel.com,
    yangxingui@huawei.com, keescook@chromium.org,
    linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    linux-mm@kvack.org, kernel-team@android.com, surenb@google.com

For lockless /proc/pid/maps reading we have to ensure that all fields used
when generating the output are RCU-safe. The only pointer fields in
vm_area_struct used to generate that file's output are vm_file and
anon_name. vm_file is already RCU-safe but anon_name is not. Make anon_name
RCU-safe as well.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm_inline.h | 10 +++++++++-
 include/linux/mm_types.h  |  3 ++-
 mm/madvise.c              | 30 ++++++++++++++++++++++++++----
 3 files changed, 37 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index f4fe593c1400..bbdb0ca857f1 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -389,7 +389,7 @@ static inline void dup_anon_vma_name(struct vm_area_struct *orig_vma,
 	struct anon_vma_name *anon_name = anon_vma_name(orig_vma);
 
 	if (anon_name)
-		new_vma->anon_name = anon_vma_name_reuse(anon_name);
+		rcu_assign_pointer(new_vma->anon_name, anon_vma_name_reuse(anon_name));
 }
 
 static inline void free_anon_vma_name(struct vm_area_struct *vma)
@@ -411,6 +411,8 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
 		!strcmp(anon_name1->name, anon_name2->name);
 }
 
+struct anon_vma_name *anon_vma_name_get_rcu(struct vm_area_struct *vma);
+
 #else /* CONFIG_ANON_VMA_NAME */
 static inline void anon_vma_name_get(struct anon_vma_name *anon_name) {}
 static inline void anon_vma_name_put(struct anon_vma_name *anon_name) {}
@@ -424,6 +426,12 @@ static inline bool anon_vma_name_eq(struct anon_vma_name *anon_name1,
 	return true;
 }
 
+static inline
+struct anon_vma_name *anon_vma_name_get_rcu(struct vm_area_struct *vma)
+{
+	return NULL;
+}
+
 #endif /* CONFIG_ANON_VMA_NAME */
 
 static inline void init_tlb_flush_pending(struct mm_struct *mm)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b2d3a88a34d1..1f0a30c00795 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -545,6 +545,7 @@ struct vm_userfaultfd_ctx {};
 
 struct anon_vma_name {
 	struct kref kref;
+	struct rcu_head rcu;
 	/* The name needs to be at the end because it is dynamically sized. */
 	char name[];
 };
@@ -699,7 +700,7 @@ struct vm_area_struct {
 	 * terminated string containing the name given to the vma, or NULL if
 	 * unnamed. Serialized by mmap_lock. Use anon_vma_name to access.
 	 */
-	struct anon_vma_name *anon_name;
+	struct anon_vma_name __rcu *anon_name;
 #endif
 #ifdef CONFIG_SWAP
 	atomic_long_t swap_readahead_info;
diff --git a/mm/madvise.c b/mm/madvise.c
index 912155a94ed5..0f222d464254 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -88,14 +88,15 @@ void anon_vma_name_free(struct kref *kref)
 {
 	struct anon_vma_name *anon_name =
 			container_of(kref, struct anon_vma_name, kref);
-	kfree(anon_name);
+	kfree_rcu(anon_name, rcu);
 }
 
 struct anon_vma_name *anon_vma_name(struct vm_area_struct *vma)
 {
 	mmap_assert_locked(vma->vm_mm);
 
-	return vma->anon_name;
+	return rcu_dereference_protected(vma->anon_name,
+			rwsem_is_locked(&vma->vm_mm->mmap_lock));
 }
 
 /* mmap_lock should be write-locked */
@@ -105,7 +106,7 @@ static int replace_anon_vma_name(struct vm_area_struct *vma,
 	struct anon_vma_name *orig_name = anon_vma_name(vma);
 
 	if (!anon_name) {
-		vma->anon_name = NULL;
+		rcu_assign_pointer(vma->anon_name, NULL);
 		anon_vma_name_put(orig_name);
 		return 0;
 	}
@@ -113,11 +114,32 @@ static int replace_anon_vma_name(struct vm_area_struct *vma,
 	if (anon_vma_name_eq(orig_name, anon_name))
 		return 0;
 
-	vma->anon_name = anon_vma_name_reuse(anon_name);
+	rcu_assign_pointer(vma->anon_name, anon_vma_name_reuse(anon_name));
 	anon_vma_name_put(orig_name);
 
 	return 0;
 }
+
+/*
+ * Returned anon_vma_name is stable due to elevated refcount but not guaranteed
+ * to be assigned to the original VMA after the call.
+ */
+struct anon_vma_name *anon_vma_name_get_rcu(struct vm_area_struct *vma)
+{
+	struct anon_vma_name *anon_name;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	anon_name = rcu_dereference(vma->anon_name);
+	if (!anon_name)
+		return NULL;
+
+	if (unlikely(!kref_get_unless_zero(&anon_name->kref)))
+		return NULL;
+
+	return anon_name;
+}
+
 #else /* CONFIG_ANON_VMA_NAME */
 static int replace_anon_vma_name(struct vm_area_struct *vma,
 				 struct anon_vma_name *anon_name)
From patchwork Mon Jan 15 18:38:35 2024
X-Patchwork-Submitter: Suren Baghdasaryan
X-Patchwork-Id: 13520076
Date: Mon, 15 Jan 2024 10:38:35 -0800
In-Reply-To: <20240115183837.205694-1-surenb@google.com>
References: <20240115183837.205694-1-surenb@google.com>
Message-ID: <20240115183837.205694-3-surenb@google.com>
Subject: [RFC 2/3] seq_file: add validate() operation to seq_operations
From: Suren Baghdasaryan <surenb@google.com>
To: akpm@linux-foundation.org
Cc: viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
    dchinner@redhat.com, casey@schaufler-ca.com, ben.wolsieffer@hefring.com,
    paulmck@kernel.org, david@redhat.com,
    avagin@google.com, usama.anjum@collabora.com, peterx@redhat.com,
    hughd@google.com, ryan.roberts@arm.com, wangkefeng.wang@huawei.com,
    Liam.Howlett@Oracle.com, yuzhao@google.com, axelrasmussen@google.com,
    lstoakes@gmail.com, talumbau@google.com, willy@infradead.org,
    vbabka@suse.cz, mgorman@techsingularity.net, jhubbard@nvidia.com,
    vishal.moola@gmail.com, mathieu.desnoyers@efficios.com,
    dhowells@redhat.com, jgg@ziepe.ca, sidhartha.kumar@oracle.com,
    andriy.shevchenko@linux.intel.com, yangxingui@huawei.com,
    keescook@chromium.org, linux-kernel@vger.kernel.org,
    linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
    kernel-team@android.com, surenb@google.com

seq_file outputs data in chunks, using seq_file.buf as intermediate storage
before the generated data for the current chunk is copied out. Already
buffered data can become stale before it gets reported, and in certain
situations it is desirable to regenerate that data instead of reporting the
stale copy. Provide a validate() operation, called before the buffered data
is output, which lets users check the buffered data. The validate callback
returns 0 to indicate valid data and -EAGAIN to request regeneration of
stale data; any other error is considered fatal and aborts the read
operation.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 fs/seq_file.c            | 24 +++++++++++++++++++++++-
 include/linux/seq_file.h |  1 +
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/fs/seq_file.c b/fs/seq_file.c
index f5fdaf3b1572..77833bbe5909 100644
--- a/fs/seq_file.c
+++ b/fs/seq_file.c
@@ -172,6 +172,8 @@ ssize_t seq_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 {
 	struct seq_file *m = iocb->ki_filp->private_data;
 	size_t copied = 0;
+	loff_t orig_index;
+	size_t orig_count;
 	size_t n;
 	void *p;
 	int err = 0;
@@ -220,6 +222,10 @@ ssize_t seq_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 		if (m->count)	// hadn't managed to copy everything
 			goto Done;
 	}
+
+	orig_index = m->index;
+	orig_count = m->count;
+Again:
 	// get a non-empty record in the buffer
 	m->from = 0;
 	p = m->op->start(m, &m->index);
@@ -278,6 +284,22 @@ ssize_t seq_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 		}
 	}
 	m->op->stop(m, p);
+	/* Note: we validate even if err<0 to prevent publishing copied data */
+	if (m->op->validate) {
+		int val_err = m->op->validate(m, p);
+
+		if (val_err) {
+			if (val_err == -EAGAIN) {
+				m->index = orig_index;
+				m->count = orig_count;
+				// data is stale, retry
+				goto Again;
+			}
+			// data is invalid, return the last error
+			err = val_err;
+			goto Done;
+		}
+	}
 	n = copy_to_iter(m->buf, m->count, iter);
 	copied += n;
 	m->count -= n;
@@ -572,7 +594,7 @@ static void single_stop(struct seq_file *p, void *v)
 int single_open(struct file *file, int (*show)(struct seq_file *, void *),
 		void *data)
 {
-	struct seq_operations *op = kmalloc(sizeof(*op), GFP_KERNEL_ACCOUNT);
+	struct seq_operations *op = kzalloc(sizeof(*op), GFP_KERNEL_ACCOUNT);
 	int res = -ENOMEM;
 
 	if (op) {
diff --git a/include/linux/seq_file.h b/include/linux/seq_file.h
index 234bcdb1fba4..d0fefac2990f 100644
--- a/include/linux/seq_file.h
+++ b/include/linux/seq_file.h
@@ -34,6 +34,7 @@ struct seq_operations {
 	void (*stop) (struct seq_file *m, void *v);
 	void * (*next) (struct seq_file *m, void *v, loff_t *pos);
 	int (*show) (struct seq_file *m, void *v);
+	int (*validate)(struct seq_file *m, void *v);
 };
 
 #define SEQ_SKIP 1

From patchwork Mon Jan 15 18:38:36 2024
X-Patchwork-Submitter: Suren Baghdasaryan
X-Patchwork-Id: 13520077
Date: Mon, 15 Jan 2024 10:38:36 -0800
In-Reply-To: <20240115183837.205694-1-surenb@google.com>
References: <20240115183837.205694-1-surenb@google.com>
Message-ID: <20240115183837.205694-4-surenb@google.com>
Subject: [RFC 3/3] mm/maps: read proc/pid/maps under RCU
From: Suren Baghdasaryan <surenb@google.com>
To: akpm@linux-foundation.org
Cc: viro@zeniv.linux.org.uk, brauner@kernel.org, jack@suse.cz,
    dchinner@redhat.com, casey@schaufler-ca.com, ben.wolsieffer@hefring.com,
    paulmck@kernel.org, david@redhat.com, avagin@google.com,
    usama.anjum@collabora.com, peterx@redhat.com, hughd@google.com,
    ryan.roberts@arm.com, wangkefeng.wang@huawei.com, Liam.Howlett@Oracle.com,
    yuzhao@google.com, axelrasmussen@google.com, lstoakes@gmail.com,
    talumbau@google.com, willy@infradead.org, vbabka@suse.cz,
    mgorman@techsingularity.net, jhubbard@nvidia.com, vishal.moola@gmail.com,
    mathieu.desnoyers@efficios.com, dhowells@redhat.com, jgg@ziepe.ca,
    sidhartha.kumar@oracle.com, andriy.shevchenko@linux.intel.com,
    yangxingui@huawei.com, keescook@chromium.org,
    linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    linux-mm@kvack.org, kernel-team@android.com, surenb@google.com

With maple_tree supporting VMA tree traversal under RCU and per-VMA locks
making VMA access RCU-safe, /proc/pid/maps can be read under RCU, without
read-locking mmap_lock. However, VMA contents can change from under us, so
we must pin the pointer fields used when generating the output (currently
only vm_file and anon_name). In addition, we validate data before
publishing it to the user, using the new seq_file validate interface. This
keeps the mechanism consistent with the previous behavior, where data
tearing is possible only at page boundaries. The change is designed to
reduce mmap_lock contention and to prevent a process reading /proc/pid/maps
(often a low-priority task, such as a monitoring/data-collection service)
from blocking address space updates.
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 fs/proc/internal.h |   3 ++
 fs/proc/task_mmu.c | 130 ++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 120 insertions(+), 13 deletions(-)

diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index a71ac5379584..47233408550b 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -290,6 +290,9 @@ struct proc_maps_private {
 	struct task_struct *task;
 	struct mm_struct *mm;
 	struct vma_iterator iter;
+	int mm_lock_seq;
+	struct anon_vma_name *anon_name;
+	struct file *vm_file;
 #ifdef CONFIG_NUMA
 	struct mempolicy *task_mempolicy;
 #endif
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 62b16f42d5d2..d4305cfdca58 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -141,6 +141,22 @@ static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
 	return vma;
 }
 
+static const struct seq_operations proc_pid_maps_op;
+
+static inline bool needs_mmap_lock(struct seq_file *m)
+{
+#ifdef CONFIG_PER_VMA_LOCK
+	/*
+	 * smaps and numa_maps perform page table walk, therefore require
+	 * mmap_lock but maps can be read under RCU.
+	 */
+	return m->op != &proc_pid_maps_op;
+#else
+	/* Without per-vma locks VMA access is not RCU-safe */
+	return true;
+#endif
+}
+
 static void *m_start(struct seq_file *m, loff_t *ppos)
 {
 	struct proc_maps_private *priv = m->private;
@@ -162,11 +178,17 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
 		return NULL;
 	}
 
-	if (mmap_read_lock_killable(mm)) {
-		mmput(mm);
-		put_task_struct(priv->task);
-		priv->task = NULL;
-		return ERR_PTR(-EINTR);
+	if (needs_mmap_lock(m)) {
+		if (mmap_read_lock_killable(mm)) {
+			mmput(mm);
+			put_task_struct(priv->task);
+			priv->task = NULL;
+			return ERR_PTR(-EINTR);
+		}
+	} else {
+		/* For memory barrier see the comment for mm_lock_seq in mm_struct */
+		priv->mm_lock_seq = smp_load_acquire(&priv->mm->mm_lock_seq);
+		rcu_read_lock();
 	}
 
 	vma_iter_init(&priv->iter, mm, last_addr);
@@ -195,7 +217,10 @@ static void m_stop(struct seq_file *m, void *v)
 		return;
 
 	release_task_mempolicy(priv);
-	mmap_read_unlock(mm);
+	if (needs_mmap_lock(m))
+		mmap_read_unlock(mm);
+	else
+		rcu_read_unlock();
 	mmput(mm);
 	put_task_struct(priv->task);
 	priv->task = NULL;
@@ -283,8 +308,10 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
 	start = vma->vm_start;
 	end = vma->vm_end;
 	show_vma_header_prefix(m, start, end, flags, pgoff, dev, ino);
-	if (mm)
-		anon_name = anon_vma_name(vma);
+	if (mm) {
+		anon_name = needs_mmap_lock(m) ? anon_vma_name(vma) :
+				anon_vma_name_get_rcu(vma);
+	}
 
 	/*
 	 * Print the dentry name for named mappings, and a
@@ -338,19 +365,96 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
 		seq_puts(m, name);
 	}
 	seq_putc(m, '\n');
+	if (anon_name && !needs_mmap_lock(m))
+		anon_vma_name_put(anon_name);
+}
+
+/*
+ * Pin vm_area_struct fields used by show_map_vma. We also copy pinned fields
+ * into proc_maps_private because by the time put_vma_fields() is called, VMA
+ * might have changed and these fields might be pointing to different objects.
+ */
+static bool get_vma_fields(struct vm_area_struct *vma, struct proc_maps_private *priv)
+{
+	if (vma->vm_file) {
+		priv->vm_file = get_file_rcu(&vma->vm_file);
+		if (!priv->vm_file)
+			return false;
+	} else
+		priv->vm_file = NULL;
+
+	if (vma->anon_name) {
+		priv->anon_name = anon_vma_name_get_rcu(vma);
+		if (!priv->anon_name) {
+			if (priv->vm_file)
+				fput(priv->vm_file);
+			return false;
+		}
+	} else
+		priv->anon_name = NULL;
+
+	return true;
+}
+
+static void put_vma_fields(struct proc_maps_private *priv)
+{
+	if (priv->anon_name)
+		anon_vma_name_put(priv->anon_name);
+	if (priv->vm_file)
+		fput(priv->vm_file);
 }
 
 static int show_map(struct seq_file *m, void *v)
 {
-	show_map_vma(m, v);
+	struct proc_maps_private *priv = m->private;
+
+	if (needs_mmap_lock(m))
+		show_map_vma(m, v);
+	else {
+		/*
+		 * Stop immediately if the VMA changed from under us.
+		 * Validation step will prevent publishing already cached data.
+		 */
+		if (!get_vma_fields(v, priv))
+			return -EAGAIN;
+
+		show_map_vma(m, v);
+		put_vma_fields(priv);
+	}
 
 	return 0;
 }
 
+static int validate_map(struct seq_file *m, void *v)
+{
+	if (!needs_mmap_lock(m)) {
+		struct proc_maps_private *priv = m->private;
+		int mm_lock_seq;
+
+		/* For memory barrier see the comment for mm_lock_seq in mm_struct */
+		mm_lock_seq = smp_load_acquire(&priv->mm->mm_lock_seq);
+		if (mm_lock_seq != priv->mm_lock_seq) {
+			/*
+			 * mmap_lock contention is detected. Wait for mmap_lock
+			 * write to be released, discard stale data and retry.
+			 */
+			mmap_read_lock(priv->mm);
+			mmap_read_unlock(priv->mm);
+			return -EAGAIN;
+		}
+	}
+
+	return 0;
+}
+
 static const struct seq_operations proc_pid_maps_op = {
-	.start	= m_start,
-	.next	= m_next,
-	.stop	= m_stop,
-	.show	= show_map
+	.start		= m_start,
+	.next		= m_next,
+	.stop		= m_stop,
+	.show		= show_map,
+	.validate	= validate_map,
 };
 
 static int pid_maps_open(struct inode *inode, struct file *file)