[RFC,2/7] ath10k: Add support to process rx packet in thread

Message ID 1595351666-28193-3-git-send-email-pillair@codeaurora.org (mailing list archive)
State RFC
Delegated to: Kalle Valo
Series: Add support to process rx packets in thread

Commit Message

Rakesh Pillai July 21, 2020, 5:14 p.m. UTC
The NAPI instance gets scheduled on the CPU core on which
the IRQ was triggered. The processing of rx packets
can be CPU intensive, and since NAPI cannot be moved
to a different CPU core, it is better for performance
to move the bulk of the rx packet processing
to a high priority thread.

Add the init/deinit part for a thread to process the
receive packets.

Tested-on: WCN3990 hw1.0 SNOC WLAN.HL.3.1-01040-QCAHLSWMTPLZ-1

Signed-off-by: Rakesh Pillai <pillair@codeaurora.org>
---
 drivers/net/wireless/ath/ath10k/core.c | 33 +++++++++++++++++++++++++++
 drivers/net/wireless/ath/ath10k/core.h | 23 +++++++++++++++++++
 drivers/net/wireless/ath/ath10k/snoc.c | 41 ++++++++++++++++++++++++++++++++++
 3 files changed, 97 insertions(+)

Comments

Rajkumar Manoharan July 21, 2020, 9:53 p.m. UTC | #1
On 2020-07-21 10:14, Rakesh Pillai wrote:
> NAPI instance gets scheduled on a CPU core on which
> the IRQ was triggered. The processing of rx packets
> can be CPU intensive and since NAPI cannot be moved
> to a different CPU core, to get better performance,
> its better to move the gist of rx packet processing
> in a high priority thread.
> 
> Add the init/deinit part for a thread to process the
> receive packets.
> 
IMHO this defeats the whole purpose of NAPI. Originally in ath10k,
irq processing happened in tasklet (high priority) context, which in
turn pushed more data to the net core even though the net core was
unable to process the driver's data, since both run in different
contexts (a fast producer - slow consumer issue). Why can't the CPU
governor schedule the interrupts on a less loaded CPU core?
Otherwise you can play with different RPS and affinity settings to
meet your requirement.

IMO introducing high priority tasklets/threads is not a viable solution.

-Rajkumar
Felix Fietkau July 22, 2020, 12:27 p.m. UTC | #2
On 2020-07-21 23:53, Rajkumar Manoharan wrote:
> On 2020-07-21 10:14, Rakesh Pillai wrote:
>> NAPI instance gets scheduled on a CPU core on which
>> the IRQ was triggered. The processing of rx packets
>> can be CPU intensive and since NAPI cannot be moved
>> to a different CPU core, to get better performance,
>> its better to move the gist of rx packet processing
>> in a high priority thread.
>> 
>> Add the init/deinit part for a thread to process the
>> receive packets.
>> 
> IMHO this defeat the whole purpose of NAPI. Originally in ath10k
> irq processing happened in tasklet (high priority) context which in
> turn push more data to net core even though net is unable to process
> driver data as both happen in different context (fast producer - slow 
> consumer)
> issue. Why can't CPU governor schedule the interrupts in less loaded CPU 
> core?
> Otherwise you can play with different RPS and affinity settings to meet 
> your
> requirement.
> 
> IMO introducing high priority tasklets/threads is not viable solution.
I'm beginning to think that the main problem with NAPI here is that the
work done by poll functions on 802.11 drivers is significantly more CPU
intensive compared to ethernet drivers, possibly more than what NAPI was
designed for.

I'm considering testing a different approach (with mt76 initially):
- Add a mac80211 rx function that puts processed skbs into a list
instead of handing them to the network stack directly.
- Move all rx processing to a high priority thread, keep a driver
internal queue for fully processed packets.
- Schedule NAPI poll on completion.
- NAPI poll function pulls from the internal queue and passes to the
network stack.

With this approach, the network stack retains some control over the
processing rate of rx packets, while the scheduler can move the CPU
intensive processing around to where it fits best.
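Roughly, I have something like this in mind (an illustrative sketch with
made-up names, not actual mt76 code):

/*
 * A high priority thread does the heavy rx + mac80211 processing and
 * parks fully processed frames on a list; the NAPI poll function only
 * delivers them to the stack, so the NAPI budget still throttles the
 * delivery rate.
 */
#include <linux/kthread.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

struct my_dev {
	struct napi_struct napi;
	struct sk_buff_head done_q;	/* fully processed frames */
	wait_queue_head_t rx_wait;	/* woken by the IRQ handler */
	bool rx_pending;
};

/* pseudocode helper: reap one frame from the HW ring and run it through
 * the normal mac80211 rx handlers, returning the resulting skb
 */
static struct sk_buff *my_process_one_frame(struct my_dev *dev);

static int my_rx_thread(void *data)
{
	struct my_dev *dev = data;
	struct sk_buff *skb;

	while (!kthread_should_stop()) {
		wait_event_interruptible(dev->rx_wait,
					 dev->rx_pending ||
					 kthread_should_stop());
		dev->rx_pending = false;

		while ((skb = my_process_one_frame(dev)))
			skb_queue_tail(&dev->done_q, skb);

		if (!skb_queue_empty(&dev->done_q))
			napi_schedule(&dev->napi);
	}
	return 0;
}

static int my_poll(struct napi_struct *napi, int budget)
{
	struct my_dev *dev = container_of(napi, struct my_dev, napi);
	struct sk_buff *skb;
	int done = 0;

	while (done < budget && (skb = skb_dequeue(&dev->done_q))) {
		netif_receive_skb(skb);
		done++;
	}

	if (done < budget)
		napi_complete_done(napi, done);

	return done;
}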

What do you think?

- Felix
Johannes Berg July 22, 2020, 12:55 p.m. UTC | #3
On Wed, 2020-07-22 at 14:27 +0200, Felix Fietkau wrote:

> I'm considering testing a different approach (with mt76 initially):
> - Add a mac80211 rx function that puts processed skbs into a list
> instead of handing them to the network stack directly.

Would this be *after* all the mac80211 processing, i.e. in place of the
rx-up-to-stack?

johannes
Felix Fietkau July 22, 2020, 1 p.m. UTC | #4
On 2020-07-22 14:55, Johannes Berg wrote:
> On Wed, 2020-07-22 at 14:27 +0200, Felix Fietkau wrote:
> 
>> I'm considering testing a different approach (with mt76 initially):
>> - Add a mac80211 rx function that puts processed skbs into a list
>> instead of handing them to the network stack directly.
> 
> Would this be *after* all the mac80211 processing, i.e. in place of the
> rx-up-to-stack?
Yes, it would run all the rx handlers normally and then put the
resulting skbs into a list instead of calling netif_receive_skb or
napi_gro_frags.

- Felix
Rajkumar Manoharan July 23, 2020, 6:09 a.m. UTC | #5
On 2020-07-22 06:00, Felix Fietkau wrote:
> On 2020-07-22 14:55, Johannes Berg wrote:
>> On Wed, 2020-07-22 at 14:27 +0200, Felix Fietkau wrote:
>> 
>>> I'm considering testing a different approach (with mt76 initially):
>>> - Add a mac80211 rx function that puts processed skbs into a list
>>> instead of handing them to the network stack directly.
>> 
>> Would this be *after* all the mac80211 processing, i.e. in place of 
>> the
>> rx-up-to-stack?
> Yes, it would run all the rx handlers normally and then put the
> resulting skbs into a list instead of calling netif_receive_skb or
> napi_gro_frags.
> 
Felix,

This seems like split & batch processing. In the past (ath9k), we
observed some behavioral differences between netif_rx and
netif_receive_skb. The intermediate queue in netif_rx changed not just
performance but also time-sensitive application data. I agree that
wireless stack processing might be heavier than ethernet packet
processing. If the hardware supports rx decap offload, NAPI processing
shouldn't be an issue, right?

-Rajkumar
Rakesh Pillai July 23, 2020, 6:25 p.m. UTC | #6
> -----Original Message-----
> From: Rajkumar Manoharan <rmanohar@codeaurora.org>
> Sent: Wednesday, July 22, 2020 3:23 AM
> To: Rakesh Pillai <pillair@codeaurora.org>
> Cc: ath10k@lists.infradead.org; linux-wireless@vger.kernel.org; linux-
> kernel@vger.kernel.org; kvalo@codeaurora.org; johannes@sipsolutions.net;
> davem@davemloft.net; kuba@kernel.org; netdev@vger.kernel.org;
> dianders@chromium.org; evgreen@chromium.org; linux-wireless-
> owner@vger.kernel.org
> Subject: Re: [RFC 2/7] ath10k: Add support to process rx packet in thread
> 
> On 2020-07-21 10:14, Rakesh Pillai wrote:
> > NAPI instance gets scheduled on a CPU core on which
> > the IRQ was triggered. The processing of rx packets
> > can be CPU intensive and since NAPI cannot be moved
> > to a different CPU core, to get better performance,
> > its better to move the gist of rx packet processing
> > in a high priority thread.
> >
> > Add the init/deinit part for a thread to process the
> > receive packets.
> >
> IMHO this defeat the whole purpose of NAPI. Originally in ath10k
> irq processing happened in tasklet (high priority) context which in
> turn push more data to net core even though net is unable to process
> driver data as both happen in different context (fast producer - slow
> consumer)
> issue. Why can't CPU governor schedule the interrupts in less loaded CPU
> core?
> Otherwise you can play with different RPS and affinity settings to meet
> your
> requirement.
> 
> IMO introducing high priority tasklets/threads is not viable solution.

Hi Rajkumar,
In Linux, the IRQs are directed to the first core that is booted.
I see that all the IRQs are getting routed to CORE0 even if it is
heavily loaded.

IRQ and NAPI are not under the scheduler's control, since they cannot
be moved to another core. NAPI is scheduled only on the same core as
the IRQ, but a thread can be moved around by the scheduler based on
the CPU load. This is the reason I went ahead with using a thread.

Affinity settings are static and cannot be changed at runtime without
a downstream userspace entity.


> 
> -Rajkumar
Jacob Keller July 24, 2020, 11:11 p.m. UTC | #7
On 7/23/2020 11:25 AM, Rakesh Pillai wrote:
> Hi Rajkumar,
> In linux, the IRQs are directed to the first core which is booted.
> I see that all the IRQs are getting routed to CORE0 even if its heavily
> loaded.
> 

You should be able to configure the initial IRQ setup so that the IRQs
don't all go to CPU 0 when they are created. That obviously doesn't
help the case of wanting the scheduler to dynamically move the
processing around to other CPUs, though.
Ben Greear March 22, 2021, 11:57 p.m. UTC | #8
On 7/22/20 6:00 AM, Felix Fietkau wrote:
> On 2020-07-22 14:55, Johannes Berg wrote:
>> On Wed, 2020-07-22 at 14:27 +0200, Felix Fietkau wrote:
>>
>>> I'm considering testing a different approach (with mt76 initially):
>>> - Add a mac80211 rx function that puts processed skbs into a list
>>> instead of handing them to the network stack directly.
>>
>> Would this be *after* all the mac80211 processing, i.e. in place of the
>> rx-up-to-stack?
> Yes, it would run all the rx handlers normally and then put the
> resulting skbs into a list instead of calling netif_receive_skb or
> napi_gro_frags.

Whatever came of this?  I realized I'm running Felix's patch since his mt76
driver needs it.  Any chance it will go upstream?

Thanks,
Ben

> 
> - Felix
>
Brian Norris March 23, 2021, 1:20 a.m. UTC | #9
On Mon, Mar 22, 2021 at 4:58 PM Ben Greear <greearb@candelatech.com> wrote:
> On 7/22/20 6:00 AM, Felix Fietkau wrote:
> > On 2020-07-22 14:55, Johannes Berg wrote:
> >> On Wed, 2020-07-22 at 14:27 +0200, Felix Fietkau wrote:
> >>
> >>> I'm considering testing a different approach (with mt76 initially):
> >>> - Add a mac80211 rx function that puts processed skbs into a list
> >>> instead of handing them to the network stack directly.
> >>
> >> Would this be *after* all the mac80211 processing, i.e. in place of the
> >> rx-up-to-stack?
> > Yes, it would run all the rx handlers normally and then put the
> > resulting skbs into a list instead of calling netif_receive_skb or
> > napi_gro_frags.
>
> Whatever came of this?  I realized I'm running Felix's patch since his mt76
> driver needs it.  Any chance it will go upstream?

If you're asking about $subject (moving NAPI/RX to a thread), this
landed upstream recently:
http://git.kernel.org/linus/adbb4fb028452b1b0488a1a7b66ab856cdf20715

It needs a bit of coaxing to work on a WiFi driver (including: WiFi
drivers tend to have a different netdev for NAPI than they expose to
/sys/class/net/), but it's there.
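For reference, flipping it from the driver side is essentially a
one-liner on the internal netdev (untested sketch; the function and
parameter names here are made up):

#include <linux/netdevice.h>

/* Because the netdev that NAPI is registered on is driver-internal and
 * never shows up in /sys/class/net/, the usual per-netdev "threaded"
 * sysfs knob isn't reachable, so the driver has to flip it itself.
 * dev_set_threaded() moves this device's NAPI polling out of softirq
 * context and into a per-NAPI kernel thread.
 */
static int my_wifi_enable_threaded_napi(struct net_device *napi_dev)
{
	return dev_set_threaded(napi_dev, true);
}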

I'm not sure if people had something else in mind in the stuff you're
quoting though.

Brian
Ben Greear March 23, 2021, 3:01 a.m. UTC | #10
On 3/22/21 6:20 PM, Brian Norris wrote:
> On Mon, Mar 22, 2021 at 4:58 PM Ben Greear <greearb@candelatech.com> wrote:
>> On 7/22/20 6:00 AM, Felix Fietkau wrote:
>>> On 2020-07-22 14:55, Johannes Berg wrote:
>>>> On Wed, 2020-07-22 at 14:27 +0200, Felix Fietkau wrote:
>>>>
>>>>> I'm considering testing a different approach (with mt76 initially):
>>>>> - Add a mac80211 rx function that puts processed skbs into a list
>>>>> instead of handing them to the network stack directly.
>>>>
>>>> Would this be *after* all the mac80211 processing, i.e. in place of the
>>>> rx-up-to-stack?
>>> Yes, it would run all the rx handlers normally and then put the
>>> resulting skbs into a list instead of calling netif_receive_skb or
>>> napi_gro_frags.
>>
>> Whatever came of this?  I realized I'm running Felix's patch since his mt76
>> driver needs it.  Any chance it will go upstream?
> 
> If you're asking about $subject (moving NAPI/RX to a thread), this
> landed upstream recently:
> http://git.kernel.org/linus/adbb4fb028452b1b0488a1a7b66ab856cdf20715
> 
> It needs a bit of coaxing to work on a WiFi driver (including: WiFi
> drivers tend to have a different netdev for NAPI than they expose to
> /sys/class/net/), but it's there.
> 
> I'm not sure if people had something else in mind in the stuff you're
> quoting though.

No, I got it confused with something Felix did:

https://github.com/greearb/mt76/blob/master/patches/0001-net-add-support-for-threaded-NAPI-polling.patch

Maybe the NAPI/RX to a thread thing superseded Felix's patch?

Thanks,
Ben

> 
> Brian
>
Felix Fietkau March 23, 2021, 7:45 a.m. UTC | #11
On 2021-03-23 04:01, Ben Greear wrote:
> On 3/22/21 6:20 PM, Brian Norris wrote:
>> On Mon, Mar 22, 2021 at 4:58 PM Ben Greear <greearb@candelatech.com> wrote:
>>> On 7/22/20 6:00 AM, Felix Fietkau wrote:
>>>> On 2020-07-22 14:55, Johannes Berg wrote:
>>>>> On Wed, 2020-07-22 at 14:27 +0200, Felix Fietkau wrote:
>>>>>
>>>>>> I'm considering testing a different approach (with mt76 initially):
>>>>>> - Add a mac80211 rx function that puts processed skbs into a list
>>>>>> instead of handing them to the network stack directly.
>>>>>
>>>>> Would this be *after* all the mac80211 processing, i.e. in place of the
>>>>> rx-up-to-stack?
>>>> Yes, it would run all the rx handlers normally and then put the
>>>> resulting skbs into a list instead of calling netif_receive_skb or
>>>> napi_gro_frags.
>>>
>>> Whatever came of this?  I realized I'm running Felix's patch since his mt76
>>> driver needs it.  Any chance it will go upstream?
>> 
>> If you're asking about $subject (moving NAPI/RX to a thread), this
>> landed upstream recently:
>> http://git.kernel.org/linus/adbb4fb028452b1b0488a1a7b66ab856cdf20715
>> 
>> It needs a bit of coaxing to work on a WiFi driver (including: WiFi
>> drivers tend to have a different netdev for NAPI than they expose to
>> /sys/class/net/), but it's there.
>> 
>> I'm not sure if people had something else in mind in the stuff you're
>> quoting though.
> 
> No, I got it confused with something Felix did:
> 
> https://github.com/greearb/mt76/blob/master/patches/0001-net-add-support-for-threaded-NAPI-polling.patch
> 
> Maybe the NAPI/RX to a thread thing superceded Felix's patch?
Yes, it did and it's in linux-next already.
I sent the following change to make mt76 use it:
https://github.com/nbd168/wireless/commit/1d4ff31437e5aaa999bd7a

- Felix
Rakesh Pillai March 25, 2021, 9:45 a.m. UTC | #12
> -----Original Message-----
> From: Felix Fietkau <nbd@nbd.name>
> Sent: Tuesday, March 23, 2021 1:16 PM
> To: Ben Greear <greearb@candelatech.com>; Brian Norris
> <briannorris@chromium.org>
> Cc: Johannes Berg <johannes@sipsolutions.net>; Rajkumar Manoharan
> <rmanohar@codeaurora.org>; Rakesh Pillai <pillair@codeaurora.org>; ath10k
> <ath10k@lists.infradead.org>; linux-wireless <linux-
> wireless@vger.kernel.org>; Linux Kernel <linux-kernel@vger.kernel.org>;
> Kalle Valo <kvalo@codeaurora.org>; David S. Miller
> <davem@davemloft.net>; Jakub Kicinski <kuba@kernel.org>;
> netdev@vger.kernel.org; Doug Anderson <dianders@chromium.org>; Evan
> Green <evgreen@chromium.org>
> Subject: Re: [RFC 2/7] ath10k: Add support to process rx packet in thread
> 
> 
> On 2021-03-23 04:01, Ben Greear wrote:
> > On 3/22/21 6:20 PM, Brian Norris wrote:
> >> On Mon, Mar 22, 2021 at 4:58 PM Ben Greear
> <greearb@candelatech.com> wrote:
> >>> On 7/22/20 6:00 AM, Felix Fietkau wrote:
> >>>> On 2020-07-22 14:55, Johannes Berg wrote:
> >>>>> On Wed, 2020-07-22 at 14:27 +0200, Felix Fietkau wrote:
> >>>>>
> >>>>>> I'm considering testing a different approach (with mt76 initially):
> >>>>>> - Add a mac80211 rx function that puts processed skbs into a list
> >>>>>> instead of handing them to the network stack directly.
> >>>>>
> >>>>> Would this be *after* all the mac80211 processing, i.e. in place of the
> >>>>> rx-up-to-stack?
> >>>> Yes, it would run all the rx handlers normally and then put the
> >>>> resulting skbs into a list instead of calling netif_receive_skb or
> >>>> napi_gro_frags.
> >>>
> >>> Whatever came of this?  I realized I'm running Felix's patch since his mt76
> >>> driver needs it.  Any chance it will go upstream?
> >>
> >> If you're asking about $subject (moving NAPI/RX to a thread), this
> >> landed upstream recently:
> >> http://git.kernel.org/linus/adbb4fb028452b1b0488a1a7b66ab856cdf20715
> >>
> >> It needs a bit of coaxing to work on a WiFi driver (including: WiFi
> >> drivers tend to have a different netdev for NAPI than they expose to
> >> /sys/class/net/), but it's there.
> >>
> >> I'm not sure if people had something else in mind in the stuff you're
> >> quoting though.
> >
> > No, I got it confused with something Felix did:
> >
> > https://github.com/greearb/mt76/blob/master/patches/0001-net-add-
> support-for-threaded-NAPI-polling.patch
> >
> > Maybe the NAPI/RX to a thread thing superceded Felix's patch?
> Yes, it did and it's in linux-next already.
> I sent the following change to make mt76 use it:
> https://github.com/nbd168/wireless/commit/1d4ff31437e5aaa999bd7a

Hi Felix / Ben,

In the case of ath10k (SNOC-based targets), we have a lot of processing in the NAPI context.
Even moving this to threaded NAPI is not helping much due to the load.

Breaking the work into multiple contexts (with the patch series I posted) helps improve the throughput.
With the current rx_thread based approach, the rx processing is broken into two parallel contexts:
1) reaping the packets from the HW
2) processing this packet list and handing it over to mac80211 (and later to the network stack)

This is the primary reason for choosing the rx thread approach.
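Roughly, the split looks like this (an illustrative sketch with made-up
names, not the exact code from this series): the NAPI/IRQ context only
reaps descriptors, while the rx thread does the heavy processing and the
hand-off to mac80211.

#include <linux/kthread.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

struct my_dev {
	struct napi_struct napi;
	struct sk_buff_head raw_q;	/* reaped, not yet processed */
	wait_queue_head_t rx_thread_wq;
};

/* pseudocode helpers: refill the HW ring and move up to 'budget' raw
 * frames onto 'list'; fully process one frame and pass it to mac80211
 */
static int my_reap_hw_ring(struct my_dev *dev, struct sk_buff_head *list,
			   int budget);
static void my_process_and_deliver(struct my_dev *dev, struct sk_buff *skb);

/* context 1: cheap -- runs in NAPI (scheduled from the IRQ handler),
 * just pulls frames off the HW ring and wakes the rx thread
 */
static int my_napi_poll(struct napi_struct *napi, int budget)
{
	struct my_dev *dev = container_of(napi, struct my_dev, napi);
	int done = my_reap_hw_ring(dev, &dev->raw_q, budget);

	if (!skb_queue_empty(&dev->raw_q))
		wake_up(&dev->rx_thread_wq);

	if (done < budget)
		napi_complete_done(napi, done);

	return done;
}

/* context 2: expensive -- decrypt/decap/reorder and deliver to mac80211,
 * free to migrate to whichever core the scheduler picks
 */
static int my_rx_thread_fn(void *data)
{
	struct my_dev *dev = data;
	struct sk_buff *skb;

	while (!kthread_should_stop()) {
		wait_event_interruptible(dev->rx_thread_wq,
					 !skb_queue_empty(&dev->raw_q) ||
					 kthread_should_stop());

		while ((skb = skb_dequeue(&dev->raw_q)))
			my_process_and_deliver(dev, skb);
	}
	return 0;
}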

Thanks,
Rakesh.

> 
> - Felix
Felix Fietkau March 25, 2021, 10:33 a.m. UTC | #13
On 2021-03-25 10:45, Rakesh Pillai wrote:
> Hi Felix / Ben,
> 
> In case of ath10k (snoc based targets), we have a lot of processing in the NAPI context.
> Even moving this to threaded NAPI is not helping much due to the load.
> 
> Breaking the tasks into multiple context (with the patch series I posted) is helping in improving the throughput.
> With the current rx_thread based approach, the rx processing is broken into two parallel contexts
> 1) reaping the packets from the HW
> 2) processing these packets list and handing it over to mac80211 (and later to the network stack)
> 
> This is the primary reason for choosing the rx thread approach.
Have you considered the possibility that the problem may simply be that
the driver is doing too much work?
One example is that you could take advantage of the new 802.3 decap
offload to simplify rx processing. Worked for me on mt76 where a
dual-core 1.3 GHz A64 can easily handle >1.8 Gbps local TCP rx on a
single card, without the rx NAPI thread being the biggest consumer of
CPU cycles.
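The driver-side bits for that are roughly the following (a sketch from
memory, not a complete patch; the helper names are made up):

#include <net/mac80211.h>

/* advertise at registration time that rx frames may arrive already
 * decapsulated (i.e. as 802.3 frames)
 */
static void my_setup_hw(struct ieee80211_hw *hw)
{
	ieee80211_hw_set(hw, SUPPORTS_RX_DECAP_OFFLOAD);
}

/* per-frame rx path: if the hardware already converted the frame to
 * 802.3, flag it so mac80211 can skip most of its 802.11 rx handlers
 */
static void my_rx_one(struct ieee80211_hw *hw, struct ieee80211_sta *sta,
		      struct napi_struct *napi, struct sk_buff *skb,
		      bool hw_decapped)
{
	struct ieee80211_rx_status *status = IEEE80211_SKB_RXCB(skb);

	if (hw_decapped)
		status->flag |= RX_FLAG_8023;

	ieee80211_rx_napi(hw, sta, skb, napi);
}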

And if you can't do that and still consider all of the metric tons of
processing work necessary, you could still do this:
On interrupts, spawn a processing thread that traverses the ring and
does the preparation work (instead of NAPI).
From that thread you schedule the threaded NAPI handler that processes
these packets further and hands them to mac80211.
To keep the load somewhat balanced, you can limit the number of
pre-processed packets in the ring.

- Felix

Patch

diff --git a/drivers/net/wireless/ath/ath10k/core.c b/drivers/net/wireless/ath/ath10k/core.c
index 9104496..2b520a0 100644
--- a/drivers/net/wireless/ath/ath10k/core.c
+++ b/drivers/net/wireless/ath/ath10k/core.c
@@ -668,6 +668,39 @@  static unsigned int ath10k_core_get_fw_feature_str(char *buf,
 	return scnprintf(buf, buf_len, "%s", ath10k_core_fw_feature_str[feat]);
 }
 
+int ath10k_core_thread_shutdown(struct ath10k *ar,
+				struct ath10k_thread *thread)
+{
+	ath10k_dbg(ar, ATH10K_DBG_BOOT, "shutting down %s\n", thread->name);
+	set_bit(ATH10K_THREAD_EVENT_SHUTDOWN, thread->event_flags);
+	wake_up_process(thread->task);
+	wait_for_completion(&thread->shutdown);
+	ath10k_info(ar, "thread %s exited\n", thread->name);
+
+	return 0;
+}
+EXPORT_SYMBOL(ath10k_core_thread_shutdown);
+
+int ath10k_core_thread_init(struct ath10k *ar,
+			    struct ath10k_thread *thread,
+			    int (*handler)(void *data),
+			    char *thread_name)
+{
+	thread->task = kthread_create(handler, thread, thread_name);
+	if (IS_ERR(thread->task))
+		return -EINVAL;
+
+	init_waitqueue_head(&thread->wait_q);
+	init_completion(&thread->shutdown);
+	memcpy(thread->name, thread_name, ATH10K_THREAD_NAME_SIZE_MAX);
+	clear_bit(ATH10K_THREAD_EVENT_SHUTDOWN, thread->event_flags);
+	ath10k_info(ar, "Starting thread %s\n", thread_name);
+	wake_up_process(thread->task);
+
+	return 0;
+}
+EXPORT_SYMBOL(ath10k_core_thread_init);
+
 void ath10k_core_get_fw_features_str(struct ath10k *ar,
 				     char *buf,
 				     size_t buf_len)
diff --git a/drivers/net/wireless/ath/ath10k/core.h b/drivers/net/wireless/ath/ath10k/core.h
index 5c18f6c..96919e8 100644
--- a/drivers/net/wireless/ath/ath10k/core.h
+++ b/drivers/net/wireless/ath/ath10k/core.h
@@ -970,6 +970,22 @@  struct ath10k_bus_params {
 	bool hl_msdu_ids;
 };
 
+#define ATH10K_THREAD_NAME_SIZE_MAX	32
+
+enum ath10k_thread_events {
+	ATH10K_THREAD_EVENT_SHUTDOWN,
+	ATH10K_THREAD_EVENT_MAX,
+};
+
+struct ath10k_thread {
+	struct ath10k *ar;
+	struct task_struct *task;
+	struct completion shutdown;
+	wait_queue_head_t wait_q;
+	DECLARE_BITMAP(event_flags, ATH10K_THREAD_EVENT_MAX);
+	char name[ATH10K_THREAD_NAME_SIZE_MAX];
+};
+
 struct ath10k {
 	struct ath_common ath_common;
 	struct ieee80211_hw *hw;
@@ -982,6 +998,7 @@  struct ath10k {
 	} msa;
 	u8 mac_addr[ETH_ALEN];
 
+	struct ath10k_thread rx_thread;
 	enum ath10k_hw_rev hw_rev;
 	u16 dev_id;
 	u32 chip_id;
@@ -1276,6 +1293,12 @@  static inline bool ath10k_peer_stats_enabled(struct ath10k *ar)
 
 extern unsigned long ath10k_coredump_mask;
 
+int ath10k_core_thread_shutdown(struct ath10k *ar,
+				struct ath10k_thread *thread);
+int ath10k_core_thread_init(struct ath10k *ar,
+			    struct ath10k_thread *thread,
+			    int (*handler)(void *data),
+			    char *thread_name);
 struct ath10k *ath10k_core_create(size_t priv_size, struct device *dev,
 				  enum ath10k_bus bus,
 				  enum ath10k_hw_rev hw_rev,
diff --git a/drivers/net/wireless/ath/ath10k/snoc.c b/drivers/net/wireless/ath/ath10k/snoc.c
index 1ef5fdb..463c34e 100644
--- a/drivers/net/wireless/ath/ath10k/snoc.c
+++ b/drivers/net/wireless/ath/ath10k/snoc.c
@@ -909,6 +909,31 @@  static void ath10k_snoc_buffer_cleanup(struct ath10k *ar)
 	}
 }
 
+int ath10k_snoc_rx_thread_loop(void *data)
+{
+	struct ath10k_thread *rx_thread = data;
+	struct ath10k *ar = rx_thread->ar;
+	bool shutdown = false;
+
+	ath10k_dbg(ar, ATH10K_DBG_SNOC, "rx thread started\n");
+	set_user_nice(current, -1);
+
+	while (!shutdown) {
+		wait_event_interruptible(
+			rx_thread->wait_q,
+			(test_bit(ATH10K_THREAD_EVENT_SHUTDOWN,
+				  rx_thread->event_flags)));
+		if (test_and_clear_bit(ATH10K_THREAD_EVENT_SHUTDOWN,
+				       rx_thread->event_flags))
+			shutdown = true;
+	}
+
+	ath10k_dbg(ar, ATH10K_DBG_SNOC, "rx thread exiting\n");
+	complete(&rx_thread->shutdown);
+
+	do_exit(0);
+}
+
 static void ath10k_snoc_hif_stop(struct ath10k *ar)
 {
 	if (!test_bit(ATH10K_FLAG_CRASH_FLUSH, &ar->dev_flags))
@@ -916,6 +941,7 @@  static void ath10k_snoc_hif_stop(struct ath10k *ar)
 
 	napi_synchronize(&ar->napi);
 	napi_disable(&ar->napi);
+	ath10k_core_thread_shutdown(ar, &ar->rx_thread);
 	ath10k_snoc_buffer_cleanup(ar);
 	ath10k_dbg(ar, ATH10K_DBG_BOOT, "boot hif stop\n");
 }
@@ -923,9 +949,19 @@  static void ath10k_snoc_hif_stop(struct ath10k *ar)
 static int ath10k_snoc_hif_start(struct ath10k *ar)
 {
 	struct ath10k_snoc *ar_snoc = ath10k_snoc_priv(ar);
+	int ret;
 
 	bitmap_clear(ar_snoc->pending_ce_irqs, 0, CE_COUNT_MAX);
 	napi_enable(&ar->napi);
+
+	ret = ath10k_core_thread_init(ar, &ar->rx_thread,
+				      ath10k_snoc_rx_thread_loop,
+				      "ath10k_rx_thread");
+	if (ret) {
+		ath10k_err(ar, "failed to start rx thread\n");
+		goto rx_thread_fail;
+	}
+
 	ath10k_snoc_irq_enable(ar);
 	ath10k_snoc_rx_post(ar);
 
@@ -934,6 +970,10 @@  static int ath10k_snoc_hif_start(struct ath10k *ar)
 	ath10k_dbg(ar, ATH10K_DBG_BOOT, "boot hif start\n");
 
 	return 0;
+
+rx_thread_fail:
+	napi_disable(&ar->napi);
+	return ret;
 }
 
 static int ath10k_snoc_init_pipes(struct ath10k *ar)
@@ -1652,6 +1692,7 @@  static int ath10k_snoc_probe(struct platform_device *pdev)
 		return -ENOMEM;
 	}
 
+	ar->rx_thread.ar = ar;
 	ar_snoc = ath10k_snoc_priv(ar);
 	ar_snoc->dev = pdev;
 	platform_set_drvdata(pdev, ar);