uprobes: Improve the usage of xol slots for better scalability

Message ID 20240918012752.2045713-1-liaochang1@huawei.com (mailing list archive)
State Superseded
Headers show
Series uprobes: Improve the usage of xol slots for better scalability | expand

Checks

Context Check Description
netdev/tree_selection success Not a local patch

Commit Message

Liao, Chang Sept. 18, 2024, 1:27 a.m. UTC
The uprobe handler allocates an xol slot from the xol_area and quickly
releases it again in the single-step handler. The atomic operations on
the xol bitmap and slot_count lead to expensive cache line bouncing
between multiple CPUs. Since the xol slot is on the hot path of uprobe
and uretprobe handling, profiling results on an Arm64 machine show that
nearly 80% of cycles are spent on it. Optimizing xol slot usage
therefore becomes important for scalability once Andrii's series lands.

This patch addresses the scalability issue from two perspectives:

- The allocated xol slot is now saved in the utask data associated with
  the thread, so it can be reused throughout the thread's lifetime. This
  avoids the frequent atomic operations on the slot bitmap and
  slot_count of the xol_area data, which are the main drag on
  scalability.

- A garbage collection routine, xol_recycle_insn_slot(), is introduced
  to reclaim unused xol slots. utask instances that own an xol slot but
  have not had it reclaimed yet are kept on a linked list. When the
  xol_area runs out of slots, the garbage collection routine walks this
  list to free one slot. An allocated xol slot is marked as unused in
  the single-step handler. The marking relies on the refcount of the
  utask instance; since a thread cannot run on multiple CPUs at the same
  time, it is unlikely that several CPUs take the refcount of the same
  utask, which minimizes cache line bouncing.

  Upon thread exit, the utask is deleted from the linked list and the
  associated xol slot is freed. (A condensed call-flow sketch of the
  scheme follows below.)
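
For orientation, a condensed call-flow sketch (illustrative only; the
helper names are those introduced by the diff below, which remains
authoritative):

	/*
	 * Per-hit flow:
	 *
	 *   pre_ssout()
	 *     xol_get_insn_slot(utask, uprobe)
	 *       test_and_get_task_slot(utask)      // slot_ref 1 -> 2
	 *       if (utask->insn_slot == UINSNS_PER_PAGE)  // first hit only
	 *           xol_take_insn_slot(area)        // bitmap slow path
	 *           list_add_rcu(&utask->gc, &area->gc_list)
	 *       // otherwise reuse utask->insn_slot, no bitmap atomics
	 *
	 *   handle_singlestep()
	 *     xol_delay_free_insn_slot()
	 *       put_task_slot(utask)                // slot_ref 2 -> 1, idle
	 *
	 *   xol_take_insn_slot(), bitmap already full
	 *     xol_recycle_insn_slot(area)
	 *       test_and_put_task_slot(victim)      // slot_ref 1 -> 0
	 *       // clear the bitmap bit, reset victim->insn_slot
	 *       get_task_slot(victim)               // slot_ref 0 -> 1
	 */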

Signed-off-by: Liao Chang <liaochang1@huawei.com>
---
 include/linux/uprobes.h |   4 +
 kernel/events/uprobes.c | 173 ++++++++++++++++++++++++++++++----------
 2 files changed, 135 insertions(+), 42 deletions(-)

Comments

Andi Kleen Sept. 18, 2024, 12:25 p.m. UTC | #1
Liao Chang <liaochang1@huawei.com> writes:
> +
> +/*
> + * xol_recycle_insn_slot - recycle a slot from the garbage collection list.
> + */
> +static int xol_recycle_insn_slot(struct xol_area *area)
> +{
> +	struct uprobe_task *utask;
> +	int slot = UINSNS_PER_PAGE;
> +
> +	rcu_read_lock();
> +	list_for_each_entry_rcu(utask, &area->gc_list, gc) {
> +		/*
> +		 * The utask associated slot is in-use or recycling when
> +		 * utask associated slot_ref is not one.
> +		 */
> +		if (test_and_put_task_slot(utask)) {
> +			slot = utask->insn_slot;
> +			utask->insn_slot = UINSNS_PER_PAGE;
> +			clear_bit(slot, area->bitmap);
> +			atomic_dec(&area->slot_count);
> +			get_task_slot(utask);

Doesn't this need some annotation to make ThreadSanitizer happy?
Would be good to have some commentary why doing so
many write operations with merely a rcu_read_lock as protection is safe.
It might be safer to put some write type operations under a real lock. 
Also it is unclear how the RCU grace period for utasks is enforced.


-Andi
kernel test robot Sept. 18, 2024, 2:41 p.m. UTC | #2
Hi Liao,

kernel test robot noticed the following build errors:

[auto build test ERROR on tip/perf/core]
[cannot apply to perf-tools-next/perf-tools-next perf-tools/perf-tools linus/master acme/perf/core v6.11 next-20240918]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Liao-Chang/uprobes-Improve-the-usage-of-xol-slots-for-better-scalability/20240918-093915
base:   tip/perf/core
patch link:    https://lore.kernel.org/r/20240918012752.2045713-1-liaochang1%40huawei.com
patch subject: [PATCH] uprobes: Improve the usage of xol slots for better scalability
config: arm-defconfig (https://download.01.org/0day-ci/archive/20240918/202409182246.UMkGsMXl-lkp@intel.com/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240918/202409182246.UMkGsMXl-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202409182246.UMkGsMXl-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from arch/arm/probes/uprobes/actions-arm.c:10:
>> include/linux/uprobes.h:81:2: error: unknown type name 'refcount_t'
           refcount_t                      slot_ref;
           ^
   1 error generated.


vim +/refcount_t +81 include/linux/uprobes.h

    58	
    59	/*
    60	 * uprobe_task: Metadata of a task while it singlesteps.
    61	 */
    62	struct uprobe_task {
    63		enum uprobe_task_state		state;
    64	
    65		union {
    66			struct {
    67				struct arch_uprobe_task	autask;
    68				unsigned long		vaddr;
    69			};
    70	
    71			struct {
    72				struct callback_head	dup_xol_work;
    73				unsigned long		dup_xol_addr;
    74			};
    75		};
    76	
    77		struct uprobe			*active_uprobe;
    78		unsigned long			xol_vaddr;
    79	
    80		struct list_head		gc;
  > 81		refcount_t			slot_ref;
    82		int				insn_slot;
    83	
    84		struct arch_uprobe              *auprobe;
    85	
    86		struct return_instance		*return_instances;
    87		unsigned int			depth;
    88	};
    89
Liao, Chang Sept. 19, 2024, 12:20 p.m. UTC | #3
On 2024/9/18 20:25, Andi Kleen wrote:
> Liao Chang <liaochang1@huawei.com> writes:
>> +
>> +/*
>> + * xol_recycle_insn_slot - recycle a slot from the garbage collection list.
>> + */
>> +static int xol_recycle_insn_slot(struct xol_area *area)
>> +{
>> +	struct uprobe_task *utask;
>> +	int slot = UINSNS_PER_PAGE;
>> +
>> +	rcu_read_lock();
>> +	list_for_each_entry_rcu(utask, &area->gc_list, gc) {
>> +		/*
>> +		 * The utask associated slot is in-use or recycling when
>> +		 * utask associated slot_ref is not one.
>> +		 */
>> +		if (test_and_put_task_slot(utask)) {
>> +			slot = utask->insn_slot;
>> +			utask->insn_slot = UINSNS_PER_PAGE;
>> +			clear_bit(slot, area->bitmap);
>> +			atomic_dec(&area->slot_count);
>> +			get_task_slot(utask);
> 
> Doesn't this need some annotation to make ThreadSanitizer happy?

Hi, Andi

Sorry, I know nothing about ThreadSanitizer and the related annotations;
could you provide some information about it? Thanks.

> Would be good to have some commentary why doing so
> many write operations with merely a rcu_read_lock as protection is safe.
> It might be safer to put some write type operations under a real lock. 
> Also it is unclear how the RCU grace period for utasks is enforced.

You are right, but I think using the atomic refcount routines might be a
more suitable approach for this scenario. The slot_ref field of the utask
instance is used to track the status of insn_slot and takes three values:
a value of 2 means the insn_slot associated with the utask is currently
in use by a uprobe; a value of 1 means the slot is not being used by a
uprobe; a value of 0 means the slot has been reclaimed. So, in a sense,
the atomic refcount routine test_and_put_task_slot() also avoids races
when writing to the utask instance, while providing additional status
information about insn_slot.
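
To spell out the transitions (a rough sketch only, using the helpers from
this patch):

	/*
	 * slot_ref transitions:
	 *
	 *            test_and_get_task_slot()             put_task_slot()
	 *   1 (idle) -----------------------> 2 (in use) ---------------> 1
	 *
	 *            test_and_put_task_slot()             get_task_slot()
	 *   1 (idle) -----------------------> 0 (reclaim) --------------> 1
	 *
	 * Only the CPU that wins the 1 -> 0 transition (refcount_dec_if_one)
	 * clears the bitmap bit, so reclaiming and single-stepping never
	 * touch the same slot concurrently.
	 */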

BTW, you reminded me that another CPU might recycle the slot after the
utask has been deleted from the garbage collection list, so it's
necessary to use test_and_put_task_slot() to avoid racing on the stale
utask. The corrected code might look something like this:

@@ -1771,16 +1783,16 @@ static void xol_free_insn_slot(struct task_struct *tsk)

        spin_lock_irqsave(&area->list_lock, flags);
        list_del_rcu(&tsk->utask->gc);
+       /* Ensure the slot is not in use or reclaimed on other CPU */
+       if (test_and_put_task_slot(tsk->utask)) {
+               clear_bit(tsk->utask->insn_slot, area->bitmap);
+               atomic_dec(&area->slot_count);
+               tsk->utask->insn_slot = UINSNS_PER_PAGE;
+               get_task_slot(tsk->utask);
+       }
        spin_unlock_irqrestore(&area->list_lock, flags);
        synchronize_rcu();

If I've made any mistakes about RCU usage, please don't hesitate to correct me.
Thank you in advance.

> 
> 
> -Andi
>
Andi Kleen Sept. 19, 2024, 1:34 p.m. UTC | #4
> Sorry, I know nothing about the ThreadSanitizer and related annotation,
> could you provide some information about it, thanks.

Documentation/dev-tools/kcsan.rst

> > Would be good to have some commentary why doing so
> > many write operations with merely a rcu_read_lock as protection is safe.
> > It might be safer to put some write type operations under a real lock. 
> > Also it is unclear how the RCU grace period for utasks is enforced.
> 
> You are right, but I think using atomic refcount routine might be a more
> suitable apprach for this scenario. The slot_ret field of utask instance

Does it really all need to be lockless? Perhaps you can only make the 
common case lockless, but then only when the list overflows take a lock 
and avoid a lot of races. That might be good enough for performance.
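
One way such a hybrid could look, reusing the patch's existing
area->list_lock for the (already slow) reclaim walk; this is purely
illustrative and not part of the posted patch:

	/* Hypothetical locked variant of xol_recycle_insn_slot(); it is
	 * only reached once the xol bitmap is full, so the lock stays off
	 * the common fast path that reuses utask->insn_slot.
	 */
	static int xol_recycle_insn_slot_locked(struct xol_area *area)
	{
		struct uprobe_task *utask;
		int slot = UINSNS_PER_PAGE;
		unsigned long flags;

		spin_lock_irqsave(&area->list_lock, flags);
		list_for_each_entry(utask, &area->gc_list, gc) {
			/* Still skip slots that are in use (slot_ref == 2) */
			if (test_and_put_task_slot(utask)) {
				slot = utask->insn_slot;
				utask->insn_slot = UINSNS_PER_PAGE;
				clear_bit(slot, area->bitmap);
				atomic_dec(&area->slot_count);
				get_task_slot(utask);
				break;
			}
		}
		spin_unlock_irqrestore(&area->list_lock, flags);

		return slot;
	}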

If you really want a complex lockless scheme you need very clear documentation
in comments and commit logs at least.

Also there should be a test case that stresses the various cases.

I would just use a lock.

> is used to track the status of insn_slot. slot_ret supports three values.
> A value of 2 means the utask associated insn_slot is currently in use by
> uprobe. A value of 1 means the slot is no being used by uprobe. A value
> of 0 means the slot has been reclaimed. So in some term, the atomic refcount
> routine test_and_pout_task_slot() also avoid the racing when writing to
> the utask instance, providing additional status information about insn_slot.
> 
> BTW, You reminded me that since it might recycle the slot after deleting the
> utask from the garbage collection list, so it's necessary to use
> test_and_put_task_slot() to avoid the racing on the stale utask. the correct
> code might be something like this:
> 
> @@ -1771,16 +1783,16 @@ static void xol_free_insn_slot(struct task_struct *tsk)
> 
>         spin_lock_irqsave(&area->list_lock, flags);
>         list_del_rcu(&tsk->utask->gc);
> +       /* Ensure the slot is not in use or reclaimed on other CPU */
> +       if (test_and_put_task_slot(tsk->utask)) {
> +               clear_bit(tsk->utask->insn_slot, area->bitmap);
> +               atomic_dec(&area->slot_count);
> +               tsk->utask->insn_slot = UINSNS_PER_PAGE;
> +               get_task_slot(tsk->utask);
> +       }

I would have expected you to add an "if (racing) bail out and assume the
other CPU will do the work" type of check, but that doesn't seem to be
what the code is doing.

-Andi
Liao, Chang Sept. 23, 2024, 10:29 a.m. UTC | #5
On 2024/9/19 21:34, Andi Kleen wrote:
>> Sorry, I know nothing about the ThreadSanitizer and related annotation,
>> could you provide some information about it, thanks.
> 
> Documentation/dev-tools/kcsan.rst

Thanks.

> 
>>> Would be good to have some commentary why doing so
>>> many write operations with merely a rcu_read_lock as protection is safe.
>>> It might be safer to put some write type operations under a real lock. 
>>> Also it is unclear how the RCU grace period for utasks is enforced.
>>
>> You are right, but I think using atomic refcount routine might be a more
>> suitable apprach for this scenario. The slot_ret field of utask instance
> 
> Does it really all need to be lockless? Perhaps you can only make the 
> common case lockless, but then only when the list overflows take a lock 
> and avoid a lot of races. That might be good enough for performance.

Agreed. List overflow would happen if new threads were constantly spawned
and hit the breakpoint. I'm not sure how often this occurs in real
applications. Even if some applications follow this pattern, I suspect the
bottleneck would shift to another point, like dynamically allocating new
utask instances. At least for the scalability selftest benchmark, list
overflow shouldn't be the common case.

> 
> If you really want a complex lockless scheme you need very clear documentation
> in comments and commit logs at least.
> 
> Also there should be a test case that stresses the various cases.
> 
> I would just use a lock 

Thanks for the suggestions. I will experiment with a read-write lock and,
in the meantime, add documentation and tests for the lockless scheme.

>> is used to track the status of insn_slot. slot_ret supports three values.
>> A value of 2 means the utask associated insn_slot is currently in use by
>> uprobe. A value of 1 means the slot is no being used by uprobe. A value
>> of 0 means the slot has been reclaimed. So in some term, the atomic refcount
>> routine test_and_pout_task_slot() also avoid the racing when writing to
>> the utask instance, providing additional status information about insn_slot.
>>
>> BTW, You reminded me that since it might recycle the slot after deleting the
>> utask from the garbage collection list, so it's necessary to use
>> test_and_put_task_slot() to avoid the racing on the stale utask. the correct
>> code might be something like this:
>>
>> @@ -1771,16 +1783,16 @@ static void xol_free_insn_slot(struct task_struct *tsk)
>>
>>         spin_lock_irqsave(&area->list_lock, flags);
>>         list_del_rcu(&tsk->utask->gc);
>> +       /* Ensure the slot is not in use or reclaimed on other CPU */
>> +       if (test_and_put_task_slot(tsk->utask)) {
>> +               clear_bit(tsk->utask->insn_slot, area->bitmap);
>> +               atomic_dec(&area->slot_count);
>> +               tsk->utask->insn_slot = UINSNS_PER_PAGE;
>> +               get_task_slot(tsk->utask);
>> +       }
> 
> I would have expected you would add a if (racing) bail out, assume the
> other CPU will do the work type check but that doesn't seem to be what
> the code is doing.

Sorry, I may not have understood the point clearly here; it would be very
helpful if you could give more details about the concern. Do you mean it's
necessary to make execution of the if-body exclusive among CPUs? If that's
the case, I think test_and_put_task_slot() is equivalent to that race
check: it uses a compare-and-exchange operation on the slot_ref of the
utask instance. Regardless of the type of work being performed by another
CPU, it will always bail out unless slot_ref has a value of one, which
indicates that the utask is free to be accessed from the local CPU.
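
Spelled out with the bail-out made explicit, the xol_free_insn_slot()
snippet above would read (sketch only):

	spin_lock_irqsave(&area->list_lock, flags);
	list_del_rcu(&tsk->utask->gc);
	if (test_and_put_task_slot(tsk->utask)) {
		/* Won the 1 -> 0 transition: exclusive owner of the slot */
		clear_bit(tsk->utask->insn_slot, area->bitmap);
		atomic_dec(&area->slot_count);
		tsk->utask->insn_slot = UINSNS_PER_PAGE;
		get_task_slot(tsk->utask);
	}
	/*
	 * else: racing; slot_ref is 2 (slot still in use) or 0 (another
	 * CPU is already reclaiming it). That CPU is responsible for the
	 * clean-up, so simply leave the slot alone.
	 */
	spin_unlock_irqrestore(&area->list_lock, flags);
	synchronize_rcu();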

> 
> -Andi
> 
>
Andi Kleen Sept. 23, 2024, 11:52 p.m. UTC | #6
> Thanks for the suggestions, I will experiment with a read-write lock, meanwhile,
> adding the documentation and testing for the lockless scheme.

Read-write locks are usually not worth it for short critical sections,
in fact they can be slower due to cache line effects.

> Sorry, I may not probably get the point clear here, and it would be very
> nice if more details are provided for the concern. Do you mean it's necessary
> to make the if-body excution exclusive among the CPUs? If that's the case,
> I guess the test_and_put_task_slot() is the equvialent to the race condition
> check. test_and_put_task_slot() uses a compare and exchange operation on the
> slot_ref of utask instance. Regardless of the work type being performed by
> other CPU, it will always bail out unless the slot_ref has a value of one,
> indicating the utask is free to access from local CPU.

What I meant is that the typical pattern for handling races in destruction
is to detect someone else is racing and then let it do the destruction
work or reacquire the resource (so just bail out).

But that's not what you're doing here, in fact you're adding a
completely new code path that has different semantics? I haven't checked
all the code, but it looks dubious.

-Andi
Liao, Chang Sept. 27, 2024, 10:01 a.m. UTC | #7
On 2024/9/24 7:52, Andi Kleen wrote:
>> Thanks for the suggestions, I will experiment with a read-write lock, meanwhile,
>> adding the documentation and testing for the lockless scheme.
> 
> Read-write locks are usually not worth it for short critical sections,
> in fact they can be slower due to cache line effects.

OK, I will start with a simple spinlock.

> 
>> Sorry, I may not probably get the point clear here, and it would be very
>> nice if more details are provided for the concern. Do you mean it's necessary
>> to make the if-body excution exclusive among the CPUs? If that's the case,
>> I guess the test_and_put_task_slot() is the equvialent to the race condition
>> check. test_and_put_task_slot() uses a compare and exchange operation on the
>> slot_ref of utask instance. Regardless of the work type being performed by
>> other CPU, it will always bail out unless the slot_ref has a value of one,
>> indicating the utask is free to access from local CPU.
> 
> What I meant is that the typical pattern for handling races in destruction
> is to detect someone else is racing and then let it do the destruction
> work or reacquire the resource (so just bail out).

Agreed.

> 
> But that's not what you're doing here, in fact you're adding a
> completely new code path that has different semantics? I haven't checked
> all the code, but it looks dubious.

Andi, I've just sent v2. Looking forward to your feedback. Thanks.

  https://lore.kernel.org/all/20240927094549.3382916-1-liaochang1@huawei.com/

> 
> -Andi
>

Patch

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index e6f4e73125ff..87fa24c74eb8 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -77,6 +77,10 @@  struct uprobe_task {
 	struct uprobe			*active_uprobe;
 	unsigned long			xol_vaddr;
 
+	struct list_head		gc;
+	refcount_t			slot_ref;
+	int				insn_slot;
+
 	struct arch_uprobe              *auprobe;
 
 	struct return_instance		*return_instances;
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 86fcb2386ea2..5ba1bd3ad27f 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -111,6 +111,12 @@  struct xol_area {
 	 * the vma go away, and we must handle that reasonably gracefully.
 	 */
 	unsigned long 			vaddr;		/* Page(s) of instruction slots */
+	struct list_head		gc_list;	/* The list for the
+							   garbage slots */
+	spinlock_t			list_lock;	/* Hold for
+							   list_add_rcu() and
+							   list_del_rcu() on
+							   gc_list */
 };
 
 static void uprobe_warn(struct task_struct *t, const char *msg)
@@ -1557,6 +1563,8 @@  static struct xol_area *__create_xol_area(unsigned long vaddr)
 
 	area->vaddr = vaddr;
 	init_waitqueue_head(&area->wq);
+	INIT_LIST_HEAD(&area->gc_list);
+	spin_lock_init(&area->list_lock);
 	/* Reserve the 1st slot for get_trampoline_vaddr() */
 	set_bit(0, area->bitmap);
 	atomic_set(&area->slot_count, 1);
@@ -1613,37 +1621,91 @@  void uprobe_dup_mmap(struct mm_struct *oldmm, struct mm_struct *newmm)
 	}
 }
 
+static __always_inline
+struct uprobe_task *test_and_get_task_slot(struct uprobe_task *utask)
+{
+	return refcount_inc_not_zero(&utask->slot_ref) ? utask : NULL;
+}
+
+static __always_inline
+struct uprobe_task *test_and_put_task_slot(struct uprobe_task *utask)
+{
+	return refcount_dec_if_one(&utask->slot_ref) ? utask : NULL;
+}
+
+static __always_inline
+void put_task_slot(struct uprobe_task *utask)
+{
+	refcount_dec(&utask->slot_ref);
+}
+
+static __always_inline
+void get_task_slot(struct uprobe_task *utask)
+{
+	refcount_inc(&utask->slot_ref);
+}
+
+/*
+ * xol_recycle_insn_slot - recycle a slot from the garbage collection list.
+ */
+static int xol_recycle_insn_slot(struct xol_area *area)
+{
+	struct uprobe_task *utask;
+	int slot = UINSNS_PER_PAGE;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(utask, &area->gc_list, gc) {
+		/*
+		 * The utask associated slot is in-use or recycling when
+		 * utask associated slot_ref is not one.
+		 */
+		if (test_and_put_task_slot(utask)) {
+			slot = utask->insn_slot;
+			utask->insn_slot = UINSNS_PER_PAGE;
+			clear_bit(slot, area->bitmap);
+			atomic_dec(&area->slot_count);
+			get_task_slot(utask);
+			break;
+		}
+	}
+	rcu_read_unlock();
+
+	return slot;
+}
+
 /*
  *  - search for a free slot.
  */
-static unsigned long xol_take_insn_slot(struct xol_area *area)
+static int xol_take_insn_slot(struct xol_area *area)
 {
-	unsigned long slot_addr;
 	int slot_nr;
 
 	do {
 		slot_nr = find_first_zero_bit(area->bitmap, UINSNS_PER_PAGE);
-		if (slot_nr < UINSNS_PER_PAGE) {
-			if (!test_and_set_bit(slot_nr, area->bitmap))
-				break;
+		if (slot_nr == UINSNS_PER_PAGE)
+			slot_nr = xol_recycle_insn_slot(area);
+
+		if (slot_nr == UINSNS_PER_PAGE)
+			wait_event(area->wq,
+				   (atomic_read(&area->slot_count) <
+				    UINSNS_PER_PAGE));
+		else if (!test_and_set_bit(slot_nr, area->bitmap))
+			break;
 
-			slot_nr = UINSNS_PER_PAGE;
-			continue;
-		}
-		wait_event(area->wq, (atomic_read(&area->slot_count) < UINSNS_PER_PAGE));
+		slot_nr = UINSNS_PER_PAGE;
 	} while (slot_nr >= UINSNS_PER_PAGE);
 
-	slot_addr = area->vaddr + (slot_nr * UPROBE_XOL_SLOT_BYTES);
 	atomic_inc(&area->slot_count);
 
-	return slot_addr;
+	return slot_nr;
 }
 
 /*
  * xol_get_insn_slot - allocate a slot for xol.
  * Returns the allocated slot address or 0.
  */
-static unsigned long xol_get_insn_slot(struct uprobe *uprobe)
+static unsigned long xol_get_insn_slot(struct uprobe_task *utask,
+				       struct uprobe *uprobe)
 {
 	struct xol_area *area;
 	unsigned long xol_vaddr;
@@ -1652,16 +1714,45 @@  static unsigned long xol_get_insn_slot(struct uprobe *uprobe)
 	if (!area)
 		return 0;
 
-	xol_vaddr = xol_take_insn_slot(area);
-	if (unlikely(!xol_vaddr))
+	/*
+	 * The utask associated slot is recycling when utask associated
+	 * slot_ref is zero.
+	 */
+	if (!test_and_get_task_slot(utask))
 		return 0;
 
+	if (utask->insn_slot == UINSNS_PER_PAGE) {
+		utask->insn_slot = xol_take_insn_slot(area);
+		spin_lock(&area->list_lock);
+		list_add_rcu(&utask->gc, &area->gc_list);
+		spin_unlock(&area->list_lock);
+	}
+
+	xol_vaddr = area->vaddr + (utask->insn_slot * UPROBE_XOL_SLOT_BYTES);
+
 	arch_uprobe_copy_ixol(area->pages[0], xol_vaddr,
 			      &uprobe->arch.ixol, sizeof(uprobe->arch.ixol));
 
 	return xol_vaddr;
 }
 
+/*
+ * xol_delay_free_insn_slot - Make the slot available for garbage collection
+ */
+static void xol_delay_free_insn_slot(void)
+{
+	struct xol_area *area = current->mm->uprobes_state.xol_area;
+
+	/*
+	 * This refcount would't cause expensive cache line bouncing
+	 * between CPUs, as it is unlikely that multiple CPUs take the
+	 * slot_ref of same utask.
+	 */
+	put_task_slot(current->utask);
+	if (waitqueue_active(&area->wq))
+		wake_up(&area->wq);
+}
+
 /*
  * xol_free_insn_slot - If slot was earlier allocated by
  * @xol_get_insn_slot(), make the slot available for
@@ -1670,35 +1761,31 @@  static unsigned long xol_get_insn_slot(struct uprobe *uprobe)
 static void xol_free_insn_slot(struct task_struct *tsk)
 {
 	struct xol_area *area;
-	unsigned long vma_end;
-	unsigned long slot_addr;
+	unsigned long flags;
+	int slot_nr;
 
 	if (!tsk->mm || !tsk->mm->uprobes_state.xol_area || !tsk->utask)
 		return;
 
-	slot_addr = tsk->utask->xol_vaddr;
-	if (unlikely(!slot_addr))
-		return;
-
 	area = tsk->mm->uprobes_state.xol_area;
-	vma_end = area->vaddr + PAGE_SIZE;
-	if (area->vaddr <= slot_addr && slot_addr < vma_end) {
-		unsigned long offset;
-		int slot_nr;
-
-		offset = slot_addr - area->vaddr;
-		slot_nr = offset / UPROBE_XOL_SLOT_BYTES;
-		if (slot_nr >= UINSNS_PER_PAGE)
-			return;
 
-		clear_bit(slot_nr, area->bitmap);
-		atomic_dec(&area->slot_count);
-		smp_mb__after_atomic(); /* pairs with prepare_to_wait() */
-		if (waitqueue_active(&area->wq))
-			wake_up(&area->wq);
+	spin_lock_irqsave(&area->list_lock, flags);
+	list_del_rcu(&tsk->utask->gc);
+	spin_unlock_irqrestore(&area->list_lock, flags);
+	synchronize_rcu();
 
-		tsk->utask->xol_vaddr = 0;
-	}
+	slot_nr = tsk->utask->insn_slot;
+	if (unlikely(slot_nr == UINSNS_PER_PAGE))
+		return;
+
+	tsk->utask->insn_slot = UINSNS_PER_PAGE;
+	clear_bit(slot_nr, area->bitmap);
+	atomic_dec(&area->slot_count);
+	smp_mb__after_atomic(); /* pairs with prepare_to_wait() */
+	if (waitqueue_active(&area->wq))
+		wake_up(&area->wq);
+
+	tsk->utask->xol_vaddr = 0;
 }
 
 void __weak arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
@@ -1779,8 +1866,12 @@  void uprobe_free_utask(struct task_struct *t)
  */
 static struct uprobe_task *get_utask(void)
 {
-	if (!current->utask)
+	if (!current->utask) {
 		current->utask = kzalloc(sizeof(struct uprobe_task), GFP_KERNEL);
+		current->utask->insn_slot = UINSNS_PER_PAGE;
+		INIT_LIST_HEAD(&current->utask->gc);
+		refcount_set(&current->utask->slot_ref, 1);
+	}
 	return current->utask;
 }
 
@@ -1978,7 +2069,7 @@  pre_ssout(struct uprobe *uprobe, struct pt_regs *regs, unsigned long bp_vaddr)
 	if (!try_get_uprobe(uprobe))
 		return -EINVAL;
 
-	xol_vaddr = xol_get_insn_slot(uprobe);
+	xol_vaddr = xol_get_insn_slot(utask, uprobe);
 	if (!xol_vaddr) {
 		err = -ENOMEM;
 		goto err_out;
@@ -1988,10 +2079,8 @@  pre_ssout(struct uprobe *uprobe, struct pt_regs *regs, unsigned long bp_vaddr)
 	utask->vaddr = bp_vaddr;
 
 	err = arch_uprobe_pre_xol(&uprobe->arch, regs);
-	if (unlikely(err)) {
-		xol_free_insn_slot(current);
+	if (unlikely(err))
 		goto err_out;
-	}
 
 	utask->active_uprobe = uprobe;
 	utask->state = UTASK_SSTEP;
@@ -2340,7 +2429,7 @@  static void handle_singlestep(struct uprobe_task *utask, struct pt_regs *regs)
 	put_uprobe(uprobe);
 	utask->active_uprobe = NULL;
 	utask->state = UTASK_RUNNING;
-	xol_free_insn_slot(current);
+	xol_delay_free_insn_slot();
 
 	spin_lock_irq(&current->sighand->siglock);
 	recalc_sigpending(); /* see uprobe_deny_signal() */