[v4,bpf-next,6/7] bpf: introduce bpf_prog_pack allocator

Message ID	20220119230620.3137425-7-song@kernel.org (mailing list archive)
State	Superseded
Delegated to:	BPF
Headers	show Return-Path: <netdev-owner@kernel.org> From: Song Liu <song@kernel.org> To: <bpf@vger.kernel.org>, <netdev@vger.kernel.org>, <linux-kernel@vger.kernel.org> CC: <ast@kernel.org>, <daniel@iogearbox.net>, <andrii@kernel.org>, <kernel-team@fb.com>, <peterz@infradead.org>, <x86@kernel.org>, Song Liu <songliubraving@fb.com> Subject: [PATCH v4 bpf-next 6/7] bpf: introduce bpf_prog_pack allocator Date: Wed, 19 Jan 2022 15:06:19 -0800 Message-ID: <20220119230620.3137425-7-song@kernel.org> In-Reply-To: <20220119230620.3137425-1-song@kernel.org> References: <20220119230620.3137425-1-song@kernel.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8BIT Content-Type: text/plain Precedence: bulk
Series	bpf_prog_pack allocator \| expand [v4,bpf-next,0/7] bpf_prog_pack allocator [v4,bpf-next,1/7] x86/Kconfig: select HAVE_ARCH_HUGE_VMALLOC with HAVE_ARCH_HUGE_VMAP [v4,bpf-next,2/7] bpf: use bytes instead of pages for bpf_jit_[charge\|uncharge]_modmem [v4,bpf-next,3/7] bpf: use size instead of pages in bpf_binary_header [v4,bpf-next,4/7] bpf: add a pointer of bpf_binary_header to bpf_prog [v4,bpf-next,5/7] x86/alternative: introduce text_poke_copy [v4,bpf-next,6/7] bpf: introduce bpf_prog_pack allocator [v4,bpf-next,7/7] bpf, x86_64: use bpf_prog_pack allocator

Message ID

20220119230620.3137425-7-song@kernel.org (mailing list archive)

State

Superseded

Delegated to:

BPF

Headers

From: Song Liu <song@kernel.org>
To: <bpf@vger.kernel.org>, <netdev@vger.kernel.org>,
        <linux-kernel@vger.kernel.org>
CC: <ast@kernel.org>, <daniel@iogearbox.net>, <andrii@kernel.org>,
        <kernel-team@fb.com>, <peterz@infradead.org>, <x86@kernel.org>,
        Song Liu <songliubraving@fb.com>
Subject: [PATCH v4 bpf-next 6/7] bpf: introduce bpf_prog_pack allocator
Date: Wed, 19 Jan 2022 15:06:19 -0800
Message-ID: <20220119230620.3137425-7-song@kernel.org>
In-Reply-To: <20220119230620.3137425-1-song@kernel.org>
References: <20220119230620.3137425-1-song@kernel.org>
MIME-Version: 1.0
Content-Transfer-Encoding: 8BIT
Content-Type: text/plain
Precedence: bulk

Series

bpf_prog_pack allocator | expand

Context	Check	Description
bpf/vmtest-bpf-next	fail	VM_Test
bpf/vmtest-bpf-next-PR	fail	PR summary
netdev/tree_selection	success	Clearly marked for bpf-next
netdev/fixes_present	success	Fixes tag not required for -next series
netdev/subject_prefix	success	Link
netdev/cover_letter	success	Series has a cover letter
netdev/patch_count	success	Link
netdev/header_inline	success	No static functions without inline keyword in header files
netdev/build_32bit	success	Errors and warnings before: 1398 this patch: 1398
netdev/cc_maintainers	warning	4 maintainers not CCed: kpsingh@kernel.org john.fastabend@gmail.com kafai@fb.com yhs@fb.com
netdev/build_clang	success	Errors and warnings before: 187 this patch: 187
netdev/module_param	success	Was 0 now: 0
netdev/verify_signedoff	success	Signed-off-by tag matches author and committer
netdev/verify_fixes	success	No Fixes tag
netdev/build_allmodconfig_warn	success	Errors and warnings before: 1415 this patch: 1415
netdev/checkpatch	warning	CHECK: Prefer using the BIT macro WARNING: line length of 82 exceeds 80 columns WARNING: line length of 85 exceeds 80 columns WARNING: line length of 87 exceeds 80 columns WARNING: line length of 99 exceeds 80 columns
netdev/kdoc	success	Errors and warnings before: 0 this patch: 0
netdev/source_inline	success	Was 0 now: 0

Context

Check

Description

bpf/vmtest-bpf-next

fail

VM_Test

bpf/vmtest-bpf-next-PR

fail

PR summary

netdev/tree_selection

success

Clearly marked for bpf-next

netdev/fixes_present

success

Fixes tag not required for -next series

netdev/subject_prefix

success

Link

netdev/cover_letter

success

Series has a cover letter

netdev/patch_count

success

Link

netdev/header_inline

success

No static functions without inline keyword in header files

netdev/build_32bit

success

Errors and warnings before: 1398 this patch: 1398

netdev/cc_maintainers

warning

4 maintainers not CCed: kpsingh@kernel.org john.fastabend@gmail.com kafai@fb.com yhs@fb.com

netdev/build_clang

success

Errors and warnings before: 187 this patch: 187

netdev/module_param

success

Was 0 now: 0

netdev/verify_signedoff

success

Signed-off-by tag matches author and committer

netdev/verify_fixes

success

No Fixes tag

netdev/build_allmodconfig_warn

success

Errors and warnings before: 1415 this patch: 1415

netdev/checkpatch

warning

CHECK: Prefer using the BIT macro WARNING: line length of 82 exceeds 80 columns WARNING: line length of 85 exceeds 80 columns WARNING: line length of 87 exceeds 80 columns WARNING: line length of 99 exceeds 80 columns

netdev/kdoc

success

Errors and warnings before: 0 this patch: 0

netdev/source_inline

success

Was 0 now: 0

Commit Message

Song Liu Jan. 19, 2022, 11:06 p.m. UTC

From: Song Liu <songliubraving@fb.com>

Most BPF programs are small, but they consume a page each. For systems
with busy traffic and many BPF programs, this could add significant
pressure to instruction TLB.

Introduce bpf_prog_pack allocator to pack multiple BPF programs in a huge
page. The memory is then allocated in 64 byte chunks.

Memory allocated by bpf_prog_pack allocator is RO protected after initial
allocation. To write to it, the user (jit engine) need to use text poke
API.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 include/linux/filter.h |   7 ++
 kernel/bpf/core.c      | 187 ++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 190 insertions(+), 4 deletions(-)

Comments

Alexei Starovoitov Jan. 20, 2022, 4:14 a.m. UTC | #1

On Wed, Jan 19, 2022 at 03:06:19PM -0800, Song Liu wrote:
>  
> +/*
> + * BPF program pack allocator.
> + *
> + * Most BPF programs are pretty small. Allocating a hole page for each
> + * program is sometime a waste. Many small bpf program also adds pressure
> + * to instruction TLB. To solve this issue, we introduce a BPF program pack
> + * allocator. The prog_pack allocator uses HPAGE_PMD_SIZE page (2MB on x86)
> + * to host BPF programs.
> + */
> +#define BPF_PROG_PACK_SIZE	HPAGE_PMD_SIZE
> +#define BPF_PROG_MAX_PACK_PROG_SIZE	HPAGE_PMD_SIZE

We have a synthetic test with 1M bpf instructions. How did it JIT?
Are you saying we were lucky that every BPF insn was JITed to <2 bytes x86?
Did I misread the 2MB limit?

Song Liu Jan. 20, 2022, 4:48 a.m. UTC | #2

> On Jan 19, 2022, at 8:14 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> On Wed, Jan 19, 2022 at 03:06:19PM -0800, Song Liu wrote:
>> 
>> +/*
>> + * BPF program pack allocator.
>> + *
>> + * Most BPF programs are pretty small. Allocating a hole page for each
>> + * program is sometime a waste. Many small bpf program also adds pressure
>> + * to instruction TLB. To solve this issue, we introduce a BPF program pack
>> + * allocator. The prog_pack allocator uses HPAGE_PMD_SIZE page (2MB on x86)
>> + * to host BPF programs.
>> + */
>> +#define BPF_PROG_PACK_SIZE	HPAGE_PMD_SIZE
>> +#define BPF_PROG_MAX_PACK_PROG_SIZE	HPAGE_PMD_SIZE
> 
> We have a synthetic test with 1M bpf instructions. How did it JIT?
> Are you saying we were lucky that every BPF insn was JITed to <2 bytes x86?
> Did I misread the 2MB limit?

The logic is, if the program is bigger than 2MB, we fall back to use 
module_alloc(). This limitation simplifies the bpf_prog_pack allocator. 

Thanks,
Song

Alexei Starovoitov Jan. 20, 2022, 4:50 a.m. UTC | #3

On Wed, Jan 19, 2022 at 8:48 PM Song Liu <songliubraving@fb.com> wrote:
>
>
>
> > On Jan 19, 2022, at 8:14 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >
> > On Wed, Jan 19, 2022 at 03:06:19PM -0800, Song Liu wrote:
> >>
> >> +/*
> >> + * BPF program pack allocator.
> >> + *
> >> + * Most BPF programs are pretty small. Allocating a hole page for each
> >> + * program is sometime a waste. Many small bpf program also adds pressure
> >> + * to instruction TLB. To solve this issue, we introduce a BPF program pack
> >> + * allocator. The prog_pack allocator uses HPAGE_PMD_SIZE page (2MB on x86)
> >> + * to host BPF programs.
> >> + */
> >> +#define BPF_PROG_PACK_SIZE  HPAGE_PMD_SIZE
> >> +#define BPF_PROG_MAX_PACK_PROG_SIZE HPAGE_PMD_SIZE
> >
> > We have a synthetic test with 1M bpf instructions. How did it JIT?
> > Are you saying we were lucky that every BPF insn was JITed to <2 bytes x86?
> > Did I misread the 2MB limit?
>
> The logic is, if the program is bigger than 2MB, we fall back to use
> module_alloc(). This limitation simplifies the bpf_prog_pack allocator.

Ahh. Missed this part of the diff. Makes sense.

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 6bb00f343cc5..175c97f7535b 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1074,6 +1074,13 @@  void *bpf_jit_alloc_exec(unsigned long size);
 void bpf_jit_free_exec(void *addr);
 void bpf_jit_free(struct bpf_prog *fp);
 
+struct bpf_binary_header *
+bpf_jit_binary_alloc_pack(unsigned int proglen, u8 **image_r_ptr,
+			  unsigned int alignment,
+			  bpf_jit_fill_hole_t bpf_fill_ill_insns);
+void bpf_jit_binary_free_pack(struct bpf_binary_header *hdr);
+int bpf_prog_pack_max_size(void);
+
 int bpf_jit_add_poke_descriptor(struct bpf_prog *prog,
 				struct bpf_jit_poke_descriptor *poke);
 
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 684a8a972adf..94c7d9ff9d6c 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -808,6 +808,116 @@  int bpf_jit_add_poke_descriptor(struct bpf_prog *prog,
 	return slot;
 }
 
+/*
+ * BPF program pack allocator.
+ *
+ * Most BPF programs are pretty small. Allocating a hole page for each
+ * program is sometime a waste. Many small bpf program also adds pressure
+ * to instruction TLB. To solve this issue, we introduce a BPF program pack
+ * allocator. The prog_pack allocator uses HPAGE_PMD_SIZE page (2MB on x86)
+ * to host BPF programs.
+ */
+#define BPF_PROG_PACK_SIZE	HPAGE_PMD_SIZE
+#define BPF_PROG_CHUNK_SHIFT	6
+#define BPF_PROG_CHUNK_SIZE	(1 << BPF_PROG_CHUNK_SHIFT)
+#define BPF_PROG_CHUNK_COUNT	(BPF_PROG_PACK_SIZE / BPF_PROG_CHUNK_SIZE)
+
+struct bpf_prog_pack {
+	struct list_head list;
+	void *ptr;
+	unsigned long bitmap[BITS_TO_LONGS(BPF_PROG_CHUNK_COUNT)];
+};
+
+#define BPF_PROG_MAX_PACK_PROG_SIZE	HPAGE_PMD_SIZE
+#define BPF_PROG_SIZE_TO_NBITS(size)	(round_up(size, BPF_PROG_CHUNK_SIZE) / BPF_PROG_CHUNK_SIZE)
+
+static DEFINE_MUTEX(pack_mutex);
+static LIST_HEAD(pack_list);
+
+static struct bpf_prog_pack *alloc_new_pack(void)
+{
+	struct bpf_prog_pack *pack;
+
+	pack = kzalloc(sizeof(*pack), GFP_KERNEL);
+	if (!pack)
+		return NULL;
+	pack->ptr = module_alloc(BPF_PROG_PACK_SIZE);
+	if (!pack->ptr) {
+		kfree(pack);
+		return NULL;
+	}
+	bitmap_zero(pack->bitmap, BPF_PROG_PACK_SIZE / BPF_PROG_CHUNK_SIZE);
+	list_add_tail(&pack->list, &pack_list);
+
+	set_vm_flush_reset_perms(pack);
+	set_memory_ro((unsigned long)pack->ptr, BPF_PROG_PACK_SIZE / PAGE_SIZE);
+	set_memory_x((unsigned long)pack->ptr, BPF_PROG_PACK_SIZE / PAGE_SIZE);
+	return pack;
+}
+
+static void *bpf_prog_pack_alloc(u32 size)
+{
+	unsigned int nbits = BPF_PROG_SIZE_TO_NBITS(size);
+	struct bpf_prog_pack *pack;
+	unsigned long pos;
+	void *ptr = NULL;
+
+	mutex_lock(&pack_mutex);
+	list_for_each_entry(pack, &pack_list, list) {
+		pos = bitmap_find_next_zero_area(pack->bitmap, BPF_PROG_CHUNK_COUNT, 0,
+						 nbits, 0);
+		if (pos < BPF_PROG_CHUNK_COUNT)
+			goto found_free_area;
+	}
+
+	pack = alloc_new_pack();
+	if (!pack)
+		goto out;
+
+	pos = 0;
+
+found_free_area:
+	bitmap_set(pack->bitmap, pos, nbits);
+	ptr = (void *)(pack->ptr) + (pos << BPF_PROG_CHUNK_SHIFT);
+
+out:
+	mutex_unlock(&pack_mutex);
+	return ptr;
+}
+
+static void bpf_prog_pack_free(struct bpf_binary_header *hdr)
+{
+	void *pack_ptr = (void *)((unsigned long)hdr & ~(BPF_PROG_PACK_SIZE - 1));
+	struct bpf_prog_pack *pack = NULL, *tmp;
+	unsigned int nbits;
+	unsigned long pos;
+
+	mutex_lock(&pack_mutex);
+
+	list_for_each_entry(tmp, &pack_list, list) {
+		if (tmp->ptr == pack_ptr) {
+			pack = tmp;
+			break;
+		}
+	}
+
+	if (WARN_ONCE(!pack, "bpf_prog_pack bug\n"))
+		goto out;
+
+	nbits = BPF_PROG_SIZE_TO_NBITS(hdr->size);
+	pos = ((unsigned long)hdr - (unsigned long)pack_ptr) >> BPF_PROG_CHUNK_SHIFT;
+
+	bitmap_clear(pack->bitmap, pos, nbits);
+	if (bitmap_find_next_zero_area(pack->bitmap, BPF_PROG_CHUNK_COUNT, 0,
+				       BPF_PROG_CHUNK_COUNT, 0) == 0) {
+		list_del(&pack->list);
+		module_memfree(pack->ptr);
+		kfree(pack);
+	}
+out:
+	mutex_unlock(&pack_mutex);
+}
+
 static atomic64_t bpf_jit_current;
 
 /* Can be overridden by an arch's JIT compiler if it has a custom,
@@ -860,10 +970,59 @@  void __weak bpf_jit_free_exec(void *addr)
 	module_memfree(addr);
 }
 
+static struct bpf_binary_header *
+__bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
+		       unsigned int alignment,
+		       bpf_jit_fill_hole_t bpf_fill_ill_insns,
+		       u32 round_up_to)
+{
+	struct bpf_binary_header *hdr;
+	u32 size, hole, start;
+
+	WARN_ON_ONCE(!is_power_of_2(alignment) ||
+		     alignment > BPF_IMAGE_ALIGNMENT);
+
+	/* Most of BPF filters are really small, but if some of them
+	 * fill a page, allow at least 128 extra bytes to insert a
+	 * random section of illegal instructions.
+	 */
+	size = round_up(proglen + sizeof(*hdr) + 128, round_up_to);
+
+	if (bpf_jit_charge_modmem(size))
+		return NULL;
+	hdr = bpf_jit_alloc_exec(size);
+	if (!hdr) {
+		bpf_jit_uncharge_modmem(size);
+		return NULL;
+	}
+
+	/* Fill space with illegal/arch-dep instructions. */
+	bpf_fill_ill_insns(hdr, size);
+
+	hdr->size = size;
+	hole = min_t(unsigned int, size - (proglen + sizeof(*hdr)),
+		     PAGE_SIZE - sizeof(*hdr));
+	start = (get_random_int() % hole) & ~(alignment - 1);
+
+	/* Leave a random number of instructions before BPF code. */
+	*image_ptr = &hdr->image[start];
+
+	return hdr;
+}
+
 struct bpf_binary_header *
 bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
 		     unsigned int alignment,
 		     bpf_jit_fill_hole_t bpf_fill_ill_insns)
+{
+	return __bpf_jit_binary_alloc(proglen, image_ptr, alignment,
+				      bpf_fill_ill_insns, PAGE_SIZE);
+}
+
+struct bpf_binary_header *
+bpf_jit_binary_alloc_pack(unsigned int proglen, u8 **image_ptr,
+			  unsigned int alignment,
+			  bpf_jit_fill_hole_t bpf_fill_ill_insns)
 {
 	struct bpf_binary_header *hdr;
 	u32 size, hole, start;
@@ -875,11 +1034,19 @@  bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
 	 * fill a page, allow at least 128 extra bytes to insert a
 	 * random section of illegal instructions.
 	 */
-	size = round_up(proglen + sizeof(*hdr) + 128, PAGE_SIZE);
+	size = round_up(proglen + sizeof(*hdr) + 128, BPF_PROG_CHUNK_SIZE);
+
+	/* for too big program, use __bpf_jit_binary_alloc with round_up_to
+	 * of BPF_PROG_MAX_PACK_PROG_SIZE.
+	 */
+	if (size > BPF_PROG_MAX_PACK_PROG_SIZE)
+		return __bpf_jit_binary_alloc(proglen, image_ptr,
+					      alignment, bpf_fill_ill_insns,
+					      BPF_PROG_MAX_PACK_PROG_SIZE);
 
 	if (bpf_jit_charge_modmem(size))
 		return NULL;
-	hdr = bpf_jit_alloc_exec(size);
+	hdr = bpf_prog_pack_alloc(size);
 	if (!hdr) {
 		bpf_jit_uncharge_modmem(size);
 		return NULL;
@@ -888,9 +1055,8 @@  bpf_jit_binary_alloc(unsigned int proglen, u8 **image_ptr,
 	/* Fill space with illegal/arch-dep instructions. */
 	bpf_fill_ill_insns(hdr, size);
 
-	hdr->size = size;
 	hole = min_t(unsigned int, size - (proglen + sizeof(*hdr)),
-		     PAGE_SIZE - sizeof(*hdr));
+		     BPF_PROG_CHUNK_SIZE - sizeof(*hdr));
 	start = (get_random_int() % hole) & ~(alignment - 1);
 
 	/* Leave a random number of instructions before BPF code. */
@@ -907,6 +1073,19 @@  void bpf_jit_binary_free(struct bpf_binary_header *hdr)
 	bpf_jit_uncharge_modmem(size);
 }
 
+void bpf_jit_binary_free_pack(struct bpf_binary_header *hdr)
+{
+	u32 size = hdr->size;
+
+	bpf_prog_pack_free(hdr);
+	bpf_jit_uncharge_modmem(size);
+}
+
+int bpf_prog_pack_max_size(void)
+{
+	return BPF_PROG_MAX_PACK_PROG_SIZE;
+}
+
 /* This symbol is only overridden by archs that have different
  * requirements than the usual eBPF JITs, f.e. when they only
  * implement cBPF JIT, do not set images read-only, etc.

[v4,bpf-next,6/7] bpf: introduce bpf_prog_pack allocator

Checks

Commit Message

Comments

Patch