From patchwork Fri Sep 2 21:10:58 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Alexei Starovoitov
X-Patchwork-Id: 12964696
X-Patchwork-Delegate: bpf@iogearbox.net
From: Alexei Starovoitov
To: davem@davemloft.net
Cc: daniel@iogearbox.net, andrii@kernel.org, tj@kernel.org, memxor@gmail.com,
 delyank@fb.com, linux-mm@kvack.org, bpf@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH v6 bpf-next 16/16] bpf: Optimize rcu_barrier usage between
 hash map and bpf_mem_alloc.
Date: Fri, 2 Sep 2022 14:10:58 -0700
Message-Id: <20220902211058.60789-17-alexei.starovoitov@gmail.com>
X-Mailer: git-send-email 2.36.1
In-Reply-To: <20220902211058.60789-1-alexei.starovoitov@gmail.com>
References: <20220902211058.60789-1-alexei.starovoitov@gmail.com>
X-Mailing-List: bpf@vger.kernel.org

From: Alexei Starovoitov

User space might be creating and destroying a lot of hash maps.
Synchronous rcu_barrier-s in the hash map destruction path delay freeing
of hash buckets and other map memory and may cause an artificial OOM
situation under stress.

Optimize rcu_barrier usage between bpf hash map and bpf_mem_alloc:
- Remove rcu_barrier from hash map, since htab doesn't use call_rcu
  directly and there are no callbacks to wait for.
- bpf_mem_alloc has a call_rcu_in_progress flag that indicates pending
  callbacks. Use it to avoid barriers in the fast path.
- When barriers are needed, copy bpf_mem_alloc into a temporary structure
  and wait for the rcu barrier-s in a worker to let the rest of hash map
  freeing proceed.

Signed-off-by: Alexei Starovoitov
---
 include/linux/bpf_mem_alloc.h |  2 +
 kernel/bpf/hashtab.c          |  6 +--
 kernel/bpf/memalloc.c         | 80 ++++++++++++++++++++++++++++-------
 3 files changed, 69 insertions(+), 19 deletions(-)
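[Editor's note, placed after the scissors line so git-am ignores it:
bpf_mem_alloc_destroy() below reads c->call_rcu_in_progress, but the
places that set and clear that flag were added by earlier patches in
this series and are not visible in this diff. The sketch below shows
the idiom in condensed, simplified form; the exact callback chain and
list handling in the series differ, so treat the details here as
illustrative rather than as the series' code:

	/* Simplified sketch (not part of this diff): the flag is raised
	 * when an RCU callback is queued for the cache and lowered by
	 * the callback itself, so destroy can cheaply test whether any
	 * callback is still in flight.
	 */
	static void __free_rcu(struct rcu_head *head)
	{
		struct bpf_mem_cache *c = container_of(head, struct bpf_mem_cache, rcu);
		struct llist_node *llnode, *t;

		llist_for_each_safe(llnode, t, llist_del_all(&c->waiting_for_gp))
			free_one(c, llnode);
		atomic_set(&c->call_rcu_in_progress, 0);
	}

	static void do_call_rcu(struct bpf_mem_cache *c)
	{
		struct llist_node *llnode, *t;

		if (atomic_xchg(&c->call_rcu_in_progress, 1))
			return; /* previous grace period still in flight */

		llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu))
			llist_add(llnode, &c->waiting_for_gp);
		call_rcu_tasks_trace(&c->rcu, __free_rcu);
	}

With that in mind, rcu_in_progress summed over all CPUs in the destroy
path below is zero exactly when no callback can still touch the caches,
which is what makes the no-barrier fast path safe.]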
diff --git a/include/linux/bpf_mem_alloc.h b/include/linux/bpf_mem_alloc.h
index 653ed1584a03..3e164b8efaa9 100644
--- a/include/linux/bpf_mem_alloc.h
+++ b/include/linux/bpf_mem_alloc.h
@@ -3,6 +3,7 @@
 #ifndef _BPF_MEM_ALLOC_H
 #define _BPF_MEM_ALLOC_H
 #include <linux/compiler_types.h>
+#include <linux/workqueue.h>
 
 struct bpf_mem_cache;
 struct bpf_mem_caches;
@@ -10,6 +11,7 @@ struct bpf_mem_caches;
 struct bpf_mem_alloc {
 	struct bpf_mem_caches __percpu *caches;
 	struct bpf_mem_cache __percpu *cache;
+	struct work_struct work;
 };
 
 int bpf_mem_alloc_init(struct bpf_mem_alloc *ma, int size, bool percpu);
diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index a77b9c4a4e48..0fe3f136cbbe 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -1546,10 +1546,10 @@ static void htab_map_free(struct bpf_map *map)
 	 * There is no need to synchronize_rcu() here to protect map elements.
 	 */
 
-	/* some of free_htab_elem() callbacks for elements of this map may
-	 * not have executed. Wait for them.
+	/* htab no longer uses call_rcu() directly. bpf_mem_alloc does it
+	 * underneath and is responsible for waiting for callbacks to finish
+	 * during bpf_mem_alloc_destroy().
 	 */
-	rcu_barrier();
 	if (!htab_is_prealloc(htab)) {
 		delete_all_elements(htab);
 	} else {
diff --git a/kernel/bpf/memalloc.c b/kernel/bpf/memalloc.c
index 38fbd15c130a..5cc952da7d41 100644
--- a/kernel/bpf/memalloc.c
+++ b/kernel/bpf/memalloc.c
@@ -414,10 +414,9 @@ static void drain_mem_cache(struct bpf_mem_cache *c)
 {
 	struct llist_node *llnode, *t;
 
-	/* The caller has done rcu_barrier() and no progs are using this
-	 * bpf_mem_cache, but htab_map_free() called bpf_mem_cache_free() for
-	 * all remaining elements and they can be in free_by_rcu or in
-	 * waiting_for_gp lists, so drain those lists now.
+	/* No progs are using this bpf_mem_cache, but htab_map_free() called
+	 * bpf_mem_cache_free() for all remaining elements and they can be in
+	 * free_by_rcu or in waiting_for_gp lists, so drain those lists now.
 	 */
 	llist_for_each_safe(llnode, t, __llist_del_all(&c->free_by_rcu))
 		free_one(c, llnode);
@@ -429,42 +428,91 @@ static void drain_mem_cache(struct bpf_mem_cache *c)
 		free_one(c, llnode);
 }
 
+static void free_mem_alloc_no_barrier(struct bpf_mem_alloc *ma)
+{
+	free_percpu(ma->cache);
+	free_percpu(ma->caches);
+	ma->cache = NULL;
+	ma->caches = NULL;
+}
+
+static void free_mem_alloc(struct bpf_mem_alloc *ma)
+{
+	/* waiting_for_gp lists were drained, but __free_rcu might
+	 * still execute. Wait for it now before we free percpu caches.
+	 */
+	rcu_barrier_tasks_trace();
+	rcu_barrier();
+	free_mem_alloc_no_barrier(ma);
+}
+
+static void free_mem_alloc_deferred(struct work_struct *work)
+{
+	struct bpf_mem_alloc *ma = container_of(work, struct bpf_mem_alloc, work);
+
+	free_mem_alloc(ma);
+	kfree(ma);
+}
+
+static void destroy_mem_alloc(struct bpf_mem_alloc *ma, int rcu_in_progress)
+{
+	struct bpf_mem_alloc *copy;
+
+	if (!rcu_in_progress) {
+		/* Fast path. No callbacks are pending, hence no need to do
+		 * rcu_barrier-s.
+		 */
+		free_mem_alloc_no_barrier(ma);
+		return;
+	}
+
+	copy = kmalloc(sizeof(*ma), GFP_KERNEL);
+	if (!copy) {
+		/* Slow path with inline barrier-s */
+		free_mem_alloc(ma);
+		return;
+	}
+
+	/* Defer barriers into worker to let the rest of map memory be freed */
+	copy->cache = ma->cache;
+	ma->cache = NULL;
+	copy->caches = ma->caches;
+	ma->caches = NULL;
+	INIT_WORK(&copy->work, free_mem_alloc_deferred);
+	queue_work(system_unbound_wq, &copy->work);
+}
+
 void bpf_mem_alloc_destroy(struct bpf_mem_alloc *ma)
 {
 	struct bpf_mem_caches *cc;
 	struct bpf_mem_cache *c;
-	int cpu, i;
+	int cpu, i, rcu_in_progress;
 
 	if (ma->cache) {
+		rcu_in_progress = 0;
 		for_each_possible_cpu(cpu) {
 			c = per_cpu_ptr(ma->cache, cpu);
 			drain_mem_cache(c);
+			rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
 		}
 		/* objcg is the same across cpus */
 		if (c->objcg)
 			obj_cgroup_put(c->objcg);
-		/* c->waiting_for_gp list was drained, but __free_rcu might
-		 * still execute. Wait for it now before we free 'c'.
-		 */
-		rcu_barrier_tasks_trace();
-		rcu_barrier();
-		free_percpu(ma->cache);
-		ma->cache = NULL;
+		destroy_mem_alloc(ma, rcu_in_progress);
 	}
 	if (ma->caches) {
+		rcu_in_progress = 0;
 		for_each_possible_cpu(cpu) {
 			cc = per_cpu_ptr(ma->caches, cpu);
 			for (i = 0; i < NUM_CACHES; i++) {
 				c = &cc->cache[i];
 				drain_mem_cache(c);
+				rcu_in_progress += atomic_read(&c->call_rcu_in_progress);
 			}
 		}
 		if (c->objcg)
 			obj_cgroup_put(c->objcg);
-		rcu_barrier_tasks_trace();
-		rcu_barrier();
-		free_percpu(ma->caches);
-		ma->caches = NULL;
+		destroy_mem_alloc(ma, rcu_in_progress);
 	}
 }
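
[Editor's note: as a usage-level illustration of the "artificial OOM
under stress" scenario in the commit message, a churn loop of the
following shape is the kind of workload that benefits. This is not part
of the patch or the series' selftests; it assumes libbpf's
bpf_map_create() with arbitrary map parameters and needs CAP_BPF:

	/* Hypothetical stress loop. Each close() queues the map teardown;
	 * before this patch every teardown waited in two barriers
	 * (rcu_barrier_tasks_trace() + rcu_barrier()), so freeing lagged
	 * far behind creation and memory piled up. With this patch the
	 * no-callback case frees immediately and the barrier case runs
	 * in a worker off the teardown path.
	 */
	#include <unistd.h>
	#include <bpf/bpf.h>

	int main(void)
	{
		for (int i = 0; i < 10000; i++) {
			int fd = bpf_map_create(BPF_MAP_TYPE_HASH, NULL,
						sizeof(int), sizeof(long),
						1024, NULL);

			if (fd < 0)
				return 1;
			close(fd);
		}
		return 0;
	}
]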