Message ID | 1481651984-7687-5-git-send-email-kieran.bingham+renesas@ideasonboard.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Hi Kieran, On Tue, Dec 13, 2016 at 05:59:44PM +0000, Kieran Bingham wrote: > media_entity_pipeline_stop() can be called through error paths with a > NULL entity pipe object. In this instance, stopping is a no-op, so > simply return without any action The approach of returning silently is wrong; the streaming count is indeed a count: you have to decrement it the exactly same number of times it's been increased. Code that attempts to call __media_entity_pipeline_stop() on an entity that's not streaming is simply buggy. I've got a patch here that adds a warning to graph traversal on streaming count being zero. I sent a pull request including that some days ago: <URL:http://www.spinics.net/lists/linux-media/msg108980.html> <URL:http://www.spinics.net/lists/linux-media/msg108995.html> I think the check here could simply be removed as the check is done for every entity in the pipeline with the above patch. If there's still a wish to check for the pipe field which should not be written by drivers, it should be done during pipeline traversal so that it applies to all entities in the pipeline, not just where traversal starts. > > Signed-off-by: Kieran Bingham <kieran.bingham+renesas@ideasonboard.com> > --- > > I've marked this patch as RFC, although if deemed suitable, by all means > integrate it as is. > > When testing suspend/resume operations on VSP1, I encountered a segfault on the > WARN_ON(!pipe->streaming_count) line, where 'pipe == NULL'. The simple > protection fix is to return early in this instance, as this patch does however: > > A) Does this early return path warrant a WARN() statement itself, to identify > drivers which are incorrectly calling media_entity_pipeline_stop() with an > invalid entity, or would this just be noise ... > > and therefore.. > > B) I also partly assume this patch could simply get NAK'd with a request to go > and dig out the root cause of calling media_entity_pipeline_stop() with an > invalid entity. > > My brief investigation so far here so far shows that it's almost a second order > fault - where the first suspend resume cycle completes but leaves the entity in > an invalid state having followed an error path - and then on a second > suspend/resume - the stop fails with the affected segfault. > > If statement A) or B) apply here, please drop this patch from the series, and > don't consider it a blocking issue for the other 3 patches. > > Kieran > > > drivers/media/media-entity.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/drivers/media/media-entity.c b/drivers/media/media-entity.c > index c68239e60487..93c9cbf4bf46 100644 > --- a/drivers/media/media-entity.c > +++ b/drivers/media/media-entity.c > @@ -508,6 +508,8 @@ void __media_entity_pipeline_stop(struct media_entity *entity) > struct media_entity_graph *graph = &entity->pipe->graph; > struct media_pipeline *pipe = entity->pipe; > > + if (!pipe) > + return; > > WARN_ON(!pipe->streaming_count); > media_entity_graph_walk_start(graph, entity);
Hi Sakari, On 14/12/16 07:28, Sakari Ailus wrote: > Hi Kieran, > > On Tue, Dec 13, 2016 at 05:59:44PM +0000, Kieran Bingham wrote: >> media_entity_pipeline_stop() can be called through error paths with a >> NULL entity pipe object. In this instance, stopping is a no-op, so >> simply return without any action > > The approach of returning silently is wrong; the streaming count is indeed a > count: you have to decrement it the exactly same number of times it's been > increased. Code that attempts to call __media_entity_pipeline_stop() on an > entity that's not streaming is simply buggy. Ok, Thanks for the heads up on where to look, as I suspected we are in section B) of my comments below :D > I've got a patch here that adds a warning to graph traversal on streaming > count being zero. I sent a pull request including that some days ago: > > <URL:http://www.spinics.net/lists/linux-media/msg108980.html> > <URL:http://www.spinics.net/lists/linux-media/msg108995.html> Excellent, thanks, I've merged your branch into mine, and I'll investigate where the error is actually coming from. -- Ok - so I've merged your patches, (and dropped this patch) but they don't fire any warning before I hit my null-pointer dereference in __media_pipeline_stop(), on the WARN_ON(!pipe->streaming_count); So I think this is a case of calling stop on an invalid entity rather than an unbalanced start/stop somehow: [ 613.830334] vsp1 fea38000.vsp: failed to reset wpf.0 [ 613.838137] PM: resume of devices complete after 124.410 msecs [ 613.847390] PM: Finishing wakeup. [ 613.852247] Restarting tasks ... done. [ 615.864801] ravb e6800000.ethernet eth0: Link is Up - 100Mbps/Full - flow control rx/tx [ 617.011299] vsp1 fea38000.vsp: failed to reset wpf.0 [ 617.017777] vsp1 fea38000.vsp: DRM pipeline stop timeout [ 617.024649] Unable to handle kernel NULL pointer dereference at virtual address 00000000 [ 617.034273] pgd = ffffff80098c5000 [ 617.039171] [00000000] *pgd=000000073fffe003[ 617.043336] , *pud=000000073fffe003 , *pmd=0000000000000000[ 617.050353] [ 617.053282] Internal error: Oops: 96000005 [#1] PREEMPT SMP [ 617.060309] Modules linked in: [ 617.064793] CPU: 1 PID: 683 Comm: kworker/1:2 Not tainted 4.9.0-00506-g4d2a939ea654 #350 [ 617.074320] Hardware name: Renesas Salvator-X board based on r8a7795 (DT) [ 617.082578] Workqueue: events drm_mode_rmfb_work_fn [ 617.088884] task: ffffffc6fabd2b00 task.stack: ffffffc6f9d90000 [ 617.096246] PC is at __media_pipeline_stop+0x24/0xe8 [ 617.102644] LR is at media_pipeline_stop+0x34/0x48 <regs and stack snipped> [ 617.863386] [<ffffff800853e724>] __media_pipeline_stop+0x24/0xe8 [ 617.870417] [<ffffff800853e81c>] media_pipeline_stop+0x34/0x48 [ 617.877272] [<ffffff8008568000>] vsp1_du_setup_lif+0x68/0x5a8 [ 617.884047] [<ffffff80084de9d4>] rcar_du_vsp_disable+0x2c/0x38 [ 617.890898] [<ffffff80084da710>] rcar_du_crtc_stop+0x198/0x1e8 [ 617.897749] [<ffffff80084da780>] rcar_du_crtc_disable+0x20/0x38 [ 617.904683] [<ffffff80084ac9b0>] drm_atomic_helper_commit_modeset_disables+0x1b0/0x3d8 [ 617.913634] [<ffffff80084db9cc>] rcar_du_atomic_complete+0x34/0xc8 [ 617.920828] [<ffffff80084dbd20>] rcar_du_atomic_commit+0x2c0/0x2d0 [ 617.928012] [<ffffff80084ceffc>] drm_atomic_commit+0x5c/0x68 [ 617.934663] [<ffffff80084ad2e0>] drm_atomic_helper_set_config+0x90/0xd0 [ 617.942288] [<ffffff80084c06a0>] drm_mode_set_config_internal+0x70/0x100 [ 617.949996] [<ffffff80084c0760>] drm_crtc_force_disable+0x30/0x40 [ 617.957084] [<ffffff80084d0b10>] drm_framebuffer_remove+0x100/0x128 [ 617.964347] [<ffffff80084d0b80>] drm_mode_rmfb_work_fn+0x48/0x60 [ 617.971352] [<ffffff80080e9770>] process_one_work+0x1e0/0x738 [ 617.978084] [<ffffff80080e9f24>] worker_thread+0x25c/0x4a0 [ 617.984546] [<ffffff80080f11ac>] kthread+0xd4/0xe8 [ 617.990305] [<ffffff8008083680>] ret_from_fork+0x10/0x50 So, I'll have to schedule some time to investigate this deeper and find out where we are calling stop on an invalid entity object. I agree that this patch should simply be dropped though, it was more to promote a bit of discussion on the area to help me understand what was going on, which it has!. Thankyou Sakari! -- Regards Kieran > I think the check here could simply be removed as the check is done for > every entity in the pipeline with the above patch. > > If there's still a wish to check for the pipe field which should not be > written by drivers, it should be done during pipeline traversal so that it > applies to all entities in the pipeline, not just where traversal starts. > >> >> Signed-off-by: Kieran Bingham <kieran.bingham+renesas@ideasonboard.com> >> --- >> >> I've marked this patch as RFC, although if deemed suitable, by all means >> integrate it as is. >> >> When testing suspend/resume operations on VSP1, I encountered a segfault on the >> WARN_ON(!pipe->streaming_count) line, where 'pipe == NULL'. The simple >> protection fix is to return early in this instance, as this patch does however: >> >> A) Does this early return path warrant a WARN() statement itself, to identify >> drivers which are incorrectly calling media_entity_pipeline_stop() with an >> invalid entity, or would this just be noise ... >> >> and therefore.. >> >> B) I also partly assume this patch could simply get NAK'd with a request to go >> and dig out the root cause of calling media_entity_pipeline_stop() with an >> invalid entity. >> >> My brief investigation so far here so far shows that it's almost a second order >> fault - where the first suspend resume cycle completes but leaves the entity in >> an invalid state having followed an error path - and then on a second >> suspend/resume - the stop fails with the affected segfault. >> >> If statement A) or B) apply here, please drop this patch from the series, and >> don't consider it a blocking issue for the other 3 patches. >> >> Kieran >> >> >> drivers/media/media-entity.c | 2 ++ >> 1 file changed, 2 insertions(+) >> >> diff --git a/drivers/media/media-entity.c b/drivers/media/media-entity.c >> index c68239e60487..93c9cbf4bf46 100644 >> --- a/drivers/media/media-entity.c >> +++ b/drivers/media/media-entity.c >> @@ -508,6 +508,8 @@ void __media_entity_pipeline_stop(struct media_entity *entity) >> struct media_entity_graph *graph = &entity->pipe->graph; >> struct media_pipeline *pipe = entity->pipe; >> >> + if (!pipe) >> + return; >> >> WARN_ON(!pipe->streaming_count); >> media_entity_graph_walk_start(graph, entity); >
Hi Kieran, On Wed, Dec 14, 2016 at 12:27:37PM +0000, Kieran Bingham wrote: > Hi Sakari, > > On 14/12/16 07:28, Sakari Ailus wrote: > > Hi Kieran, > > > > On Tue, Dec 13, 2016 at 05:59:44PM +0000, Kieran Bingham wrote: > >> media_entity_pipeline_stop() can be called through error paths with a > >> NULL entity pipe object. In this instance, stopping is a no-op, so > >> simply return without any action > > > > The approach of returning silently is wrong; the streaming count is indeed a > > count: you have to decrement it the exactly same number of times it's been > > increased. Code that attempts to call __media_entity_pipeline_stop() on an > > entity that's not streaming is simply buggy. > > Ok, Thanks for the heads up on where to look, as I suspected we are in > section B) of my comments below :D > > > I've got a patch here that adds a warning to graph traversal on streaming > > count being zero. I sent a pull request including that some days ago: > > > > <URL:http://www.spinics.net/lists/linux-media/msg108980.html> > > <URL:http://www.spinics.net/lists/linux-media/msg108995.html> > > Excellent, thanks, I've merged your branch into mine, and I'll > investigate where the error is actually coming from. > > -- > Ok - so I've merged your patches, (and dropped this patch) but they > don't fire any warning before I hit my null-pointer dereference in > __media_pipeline_stop(), on the WARN_ON(!pipe->streaming_count); > > So I think this is a case of calling stop on an invalid entity rather > than an unbalanced start/stop somehow: Not necessarily. The pipe is set to NULL if the count goes to zero. This check should be dropped, it's ill-placed anyway: if streaming count is zero the pipe is expected to be NULL anyway in which case you get a NULL pointer exception (that you saw there). With the check removed, you should get the warning instead. > > [ 613.830334] vsp1 fea38000.vsp: failed to reset wpf.0 > [ 613.838137] PM: resume of devices complete after 124.410 msecs > [ 613.847390] PM: Finishing wakeup. > [ 613.852247] Restarting tasks ... done. > [ 615.864801] ravb e6800000.ethernet eth0: Link is Up - 100Mbps/Full - > flow control rx/tx > [ 617.011299] vsp1 fea38000.vsp: failed to reset wpf.0 > [ 617.017777] vsp1 fea38000.vsp: DRM pipeline stop timeout > [ 617.024649] Unable to handle kernel NULL pointer dereference at > virtual address 00000000 > [ 617.034273] pgd = ffffff80098c5000 > [ 617.039171] [00000000] *pgd=000000073fffe003[ 617.043336] , > *pud=000000073fffe003 > , *pmd=0000000000000000[ 617.050353] > [ 617.053282] Internal error: Oops: 96000005 [#1] PREEMPT SMP > [ 617.060309] Modules linked in: > [ 617.064793] CPU: 1 PID: 683 Comm: kworker/1:2 Not tainted > 4.9.0-00506-g4d2a939ea654 #350 > [ 617.074320] Hardware name: Renesas Salvator-X board based on r8a7795 (DT) > [ 617.082578] Workqueue: events drm_mode_rmfb_work_fn > [ 617.088884] task: ffffffc6fabd2b00 task.stack: ffffffc6f9d90000 > [ 617.096246] PC is at __media_pipeline_stop+0x24/0xe8 > [ 617.102644] LR is at media_pipeline_stop+0x34/0x48 > > <regs and stack snipped> > > [ 617.863386] [<ffffff800853e724>] __media_pipeline_stop+0x24/0xe8 > [ 617.870417] [<ffffff800853e81c>] media_pipeline_stop+0x34/0x48 > [ 617.877272] [<ffffff8008568000>] vsp1_du_setup_lif+0x68/0x5a8 > [ 617.884047] [<ffffff80084de9d4>] rcar_du_vsp_disable+0x2c/0x38 > [ 617.890898] [<ffffff80084da710>] rcar_du_crtc_stop+0x198/0x1e8 > [ 617.897749] [<ffffff80084da780>] rcar_du_crtc_disable+0x20/0x38 > [ 617.904683] [<ffffff80084ac9b0>] > drm_atomic_helper_commit_modeset_disables+0x1b0/0x3d8 > [ 617.913634] [<ffffff80084db9cc>] rcar_du_atomic_complete+0x34/0xc8 > [ 617.920828] [<ffffff80084dbd20>] rcar_du_atomic_commit+0x2c0/0x2d0 > [ 617.928012] [<ffffff80084ceffc>] drm_atomic_commit+0x5c/0x68 > [ 617.934663] [<ffffff80084ad2e0>] drm_atomic_helper_set_config+0x90/0xd0 > [ 617.942288] [<ffffff80084c06a0>] drm_mode_set_config_internal+0x70/0x100 > [ 617.949996] [<ffffff80084c0760>] drm_crtc_force_disable+0x30/0x40 > [ 617.957084] [<ffffff80084d0b10>] drm_framebuffer_remove+0x100/0x128 > [ 617.964347] [<ffffff80084d0b80>] drm_mode_rmfb_work_fn+0x48/0x60 > [ 617.971352] [<ffffff80080e9770>] process_one_work+0x1e0/0x738 > [ 617.978084] [<ffffff80080e9f24>] worker_thread+0x25c/0x4a0 > [ 617.984546] [<ffffff80080f11ac>] kthread+0xd4/0xe8 > [ 617.990305] [<ffffff8008083680>] ret_from_fork+0x10/0x50 > > > So, I'll have to schedule some time to investigate this deeper and find > out where we are calling stop on an invalid entity object. > > I agree that this patch should simply be dropped though, it was more to > promote a bit of discussion on the area to help me understand what was > going on, which it has!. > > Thankyou Sakari! > > -- > Regards > > Kieran > > > > I think the check here could simply be removed as the check is done for > > every entity in the pipeline with the above patch. > > > > If there's still a wish to check for the pipe field which should not be > > written by drivers, it should be done during pipeline traversal so that it > > applies to all entities in the pipeline, not just where traversal starts. > > > >> > >> Signed-off-by: Kieran Bingham <kieran.bingham+renesas@ideasonboard.com> > >> --- > >> > >> I've marked this patch as RFC, although if deemed suitable, by all means > >> integrate it as is. > >> > >> When testing suspend/resume operations on VSP1, I encountered a segfault on the > >> WARN_ON(!pipe->streaming_count) line, where 'pipe == NULL'. The simple > >> protection fix is to return early in this instance, as this patch does however: > >> > >> A) Does this early return path warrant a WARN() statement itself, to identify > >> drivers which are incorrectly calling media_entity_pipeline_stop() with an > >> invalid entity, or would this just be noise ... > >> > >> and therefore.. > >> > >> B) I also partly assume this patch could simply get NAK'd with a request to go > >> and dig out the root cause of calling media_entity_pipeline_stop() with an > >> invalid entity. > >> > >> My brief investigation so far here so far shows that it's almost a second order > >> fault - where the first suspend resume cycle completes but leaves the entity in > >> an invalid state having followed an error path - and then on a second > >> suspend/resume - the stop fails with the affected segfault. > >> > >> If statement A) or B) apply here, please drop this patch from the series, and > >> don't consider it a blocking issue for the other 3 patches. > >> > >> Kieran > >> > >> > >> drivers/media/media-entity.c | 2 ++ > >> 1 file changed, 2 insertions(+) > >> > >> diff --git a/drivers/media/media-entity.c b/drivers/media/media-entity.c > >> index c68239e60487..93c9cbf4bf46 100644 > >> --- a/drivers/media/media-entity.c > >> +++ b/drivers/media/media-entity.c > >> @@ -508,6 +508,8 @@ void __media_entity_pipeline_stop(struct media_entity *entity) > >> struct media_entity_graph *graph = &entity->pipe->graph; > >> struct media_pipeline *pipe = entity->pipe; > >> > >> + if (!pipe) > >> + return; > >> > >> WARN_ON(!pipe->streaming_count); > >> media_entity_graph_walk_start(graph, entity); > >
Hi Sakari, On 14/12/16 12:43, Sakari Ailus wrote: > Hi Kieran, > > On Wed, Dec 14, 2016 at 12:27:37PM +0000, Kieran Bingham wrote: >> Hi Sakari, >> >> On 14/12/16 07:28, Sakari Ailus wrote: >>> Hi Kieran, >>> >>> On Tue, Dec 13, 2016 at 05:59:44PM +0000, Kieran Bingham wrote: >>>> media_entity_pipeline_stop() can be called through error paths with a >>>> NULL entity pipe object. In this instance, stopping is a no-op, so >>>> simply return without any action >>> >>> The approach of returning silently is wrong; the streaming count is indeed a >>> count: you have to decrement it the exactly same number of times it's been >>> increased. Code that attempts to call __media_entity_pipeline_stop() on an >>> entity that's not streaming is simply buggy. >> >> Ok, Thanks for the heads up on where to look, as I suspected we are in >> section B) of my comments below :D >> >>> I've got a patch here that adds a warning to graph traversal on streaming >>> count being zero. I sent a pull request including that some days ago: >>> >>> <URL:http://www.spinics.net/lists/linux-media/msg108980.html> >>> <URL:http://www.spinics.net/lists/linux-media/msg108995.html> >> >> Excellent, thanks, I've merged your branch into mine, and I'll >> investigate where the error is actually coming from. >> >> -- >> Ok - so I've merged your patches, (and dropped this patch) but they >> don't fire any warning before I hit my null-pointer dereference in >> __media_pipeline_stop(), on the WARN_ON(!pipe->streaming_count); >> >> So I think this is a case of calling stop on an invalid entity rather >> than an unbalanced start/stop somehow: > > Not necessarily. The pipe is set to NULL if the count goes to zero. > > This check should be dropped, it's ill-placed anyway: if streaming count is > zero the pipe is expected to be NULL anyway in which case you get a NULL > pointer exception (that you saw there). With the check removed, you should > get the warning instead. Ahh, yes, I'd missed the part there that was setting it to NULL. However, it will still segfault ... (though hopefully after the warning) as the last line of the function states: if (!--pipe->streaming_count) media_entity_graph_walk_cleanup(graph); So we will hit a null deref on the pipe->streaming_count there, which still leaves an unbalanced stop as a kernel panic. In fact, I've just tested removing that WARN_ON(!pipe->streaming_count); - it doesn't even get that far - and segfaults in media_entity_graph_walk_cleanup(graph); [ 80.916558] Unable to handle kernel NULL pointer dereference at virtual address 00000110 .... [ 81.769492] [<ffffff800853e278>] media_graph_walk_start+0x20/0xf0 [ 81.776615] [<ffffff800853e73c>] __media_pipeline_stop+0x3c/0xd8 [ 81.783644] [<ffffff800853e80c>] media_pipeline_stop+0x34/0x48 [ 81.790505] [<ffffff8008567ff0>] vsp1_du_setup_lif+0x68/0x5a8 [ 81.797279] [<ffffff80084de9d4>] rcar_du_vsp_disable+0x2c/0x38 [ 81.804137] [<ffffff80084da710>] rcar_du_crtc_stop+0x198/0x1e8 [ 81.810984] [<ffffff80084da780>] rcar_du_crtc_disable+0x20/0x38 due to the fact that 'graph' is dereferenced through the 'pipe'. { /* Entity->pipe is NULL, therefore 'graph' is invalid */ struct media_graph *graph = &entity->pipe->graph; struct media_pipeline *pipe = entity->pipe; media_graph_walk_start(graph, entity); ... } So I suspect we do still need a warning or check in there somewhere. Something to tell the developer that there is an unbalanced stop, would be much better than a panic/segfault... (And of course, we still need to look at the actual unbalanced stop in vsp1_du_setup_lif!) -- Kieran -- To unsubscribe from this list: send the line "unsubscribe linux-media" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hello Me. Ok, so a bit of investigation into *why* we have an unbalanced media_pipeline stop call. After a suspend/resume cycle, the function vsp1_pm_runtime_resume() is called by the PM framework. The hardware is unable to reset successfully and the reset call returns -ETIMEDOUT which gets passed back to the PM framework as a failure and thus the device is not 'resumed' This process is commenced, as the DU driver tries to enable the VSP, we get the following call stack: rcar_du_crtc_resume() rcar_du_vsp_enable() vsp1_du_setup_lif() // returns void vsp1_device_get() // returns -ETIMEDOUT, As the vsp1_du_setup_lif() returns void, the fact that the hardware failed to start is ignored. Later we get a call to rcar_du_vsp_disable(), which again calls into vsp1_du_setup_lif(), this time to disable the pipeline. And it is here, that the call to media_pipeline_stop() is an unbalanced call, as the corresponding media_pipeline_start() would only have been called if the vsp1_device_get() had succeeded, thus we find ourselves with a kernel panic on a null dereference. Sorry for the terse notes, they are possibly/probably really for me if I get tasked to look back at this. -- Regards Kieran On 13/12/16 17:59, Kieran Bingham wrote: > media_entity_pipeline_stop() can be called through error paths with a > NULL entity pipe object. In this instance, stopping is a no-op, so > simply return without any action > > Signed-off-by: Kieran Bingham <kieran.bingham+renesas@ideasonboard.com> > --- > > I've marked this patch as RFC, although if deemed suitable, by all means > integrate it as is. > > When testing suspend/resume operations on VSP1, I encountered a segfault on the > WARN_ON(!pipe->streaming_count) line, where 'pipe == NULL'. The simple > protection fix is to return early in this instance, as this patch does however: > > A) Does this early return path warrant a WARN() statement itself, to identify > drivers which are incorrectly calling media_entity_pipeline_stop() with an > invalid entity, or would this just be noise ... > > and therefore.. > > B) I also partly assume this patch could simply get NAK'd with a request to go > and dig out the root cause of calling media_entity_pipeline_stop() with an > invalid entity. > > My brief investigation so far here so far shows that it's almost a second order > fault - where the first suspend resume cycle completes but leaves the entity in > an invalid state having followed an error path - and then on a second > suspend/resume - the stop fails with the affected segfault. > > If statement A) or B) apply here, please drop this patch from the series, and > don't consider it a blocking issue for the other 3 patches. > > Kieran > > > drivers/media/media-entity.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/drivers/media/media-entity.c b/drivers/media/media-entity.c > index c68239e60487..93c9cbf4bf46 100644 > --- a/drivers/media/media-entity.c > +++ b/drivers/media/media-entity.c > @@ -508,6 +508,8 @@ void __media_entity_pipeline_stop(struct media_entity *entity) > struct media_entity_graph *graph = &entity->pipe->graph; > struct media_pipeline *pipe = entity->pipe; > > + if (!pipe) > + return; > > WARN_ON(!pipe->streaming_count); > media_entity_graph_walk_start(graph, entity); > -- To unsubscribe from this list: send the line "unsubscribe linux-media" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Kieran, On Wednesday 14 Dec 2016 19:56:51 Kieran Bingham wrote: > Hello Me. > > Ok, so a bit of investigation into *why* we have an unbalanced > media_pipeline stop call. > > After a suspend/resume cycle, the function vsp1_pm_runtime_resume() is > called by the PM framework. > > The hardware is unable to reset successfully and the reset call returns > -ETIMEDOUT which gets passed back to the PM framework as a failure and > thus the device is not 'resumed' > > This process is commenced, as the DU driver tries to enable the VSP, we > get the following call stack: > > rcar_du_crtc_resume() > rcar_du_vsp_enable() > vsp1_du_setup_lif() // returns void > vsp1_device_get() // returns -ETIMEDOUT, I suspect the call stack to not be entirely accurate as the rcar_du_crtc_resume() is never called :-) > As the vsp1_du_setup_lif() returns void, the fact that the hardware > failed to start is ignored. > > Later we get a call to rcar_du_vsp_disable(), which again calls into > vsp1_du_setup_lif(), this time to disable the pipeline. And it is here, > that the call to media_pipeline_stop() is an unbalanced call, as the > corresponding media_pipeline_start() would only have been called if the > vsp1_device_get() had succeeded, thus we find ourselves with a kernel > panic on a null dereference. > > Sorry for the terse notes, they are possibly/probably really for me if I > get tasked to look back at this. > -- > Regards > > Kieran > > On 13/12/16 17:59, Kieran Bingham wrote: > > media_entity_pipeline_stop() can be called through error paths with a > > NULL entity pipe object. In this instance, stopping is a no-op, so > > simply return without any action > > > > Signed-off-by: Kieran Bingham <kieran.bingham+renesas@ideasonboard.com> > > --- > > > > I've marked this patch as RFC, although if deemed suitable, by all means > > integrate it as is. > > > > When testing suspend/resume operations on VSP1, I encountered a segfault > > on the WARN_ON(!pipe->streaming_count) line, where 'pipe == NULL'. The > > simple protection fix is to return early in this instance, as this patch > > does however: > > > > A) Does this early return path warrant a WARN() statement itself, to > > identify drivers which are incorrectly calling > > media_entity_pipeline_stop() with an invalid entity, or would this just > > be noise ... > > > > and therefore.. > > > > B) I also partly assume this patch could simply get NAK'd with a request > > to go and dig out the root cause of calling media_entity_pipeline_stop() > > with an invalid entity. > > > > My brief investigation so far here so far shows that it's almost a second > > order fault - where the first suspend resume cycle completes but leaves > > the entity in an invalid state having followed an error path - and then > > on a second suspend/resume - the stop fails with the affected segfault. > > > > If statement A) or B) apply here, please drop this patch from the series, > > and don't consider it a blocking issue for the other 3 patches. > > > > Kieran > > > > drivers/media/media-entity.c | 2 ++ > > 1 file changed, 2 insertions(+) > > > > diff --git a/drivers/media/media-entity.c b/drivers/media/media-entity.c > > index c68239e60487..93c9cbf4bf46 100644 > > --- a/drivers/media/media-entity.c > > +++ b/drivers/media/media-entity.c > > @@ -508,6 +508,8 @@ void __media_entity_pipeline_stop(struct media_entity > > *entity)> > > struct media_entity_graph *graph = &entity->pipe->graph; > > struct media_pipeline *pipe = entity->pipe; > > > > + if (!pipe) > > + return; > > > > WARN_ON(!pipe->streaming_count); > > media_entity_graph_walk_start(graph, entity);
On Wed, Dec 14, 2016 at 05:53:00PM +0000, Kieran Bingham wrote: > Hi Sakari, > > On 14/12/16 12:43, Sakari Ailus wrote: > > Hi Kieran, > > > > On Wed, Dec 14, 2016 at 12:27:37PM +0000, Kieran Bingham wrote: > >> Hi Sakari, > >> > >> On 14/12/16 07:28, Sakari Ailus wrote: > >>> Hi Kieran, > >>> > >>> On Tue, Dec 13, 2016 at 05:59:44PM +0000, Kieran Bingham wrote: > >>>> media_entity_pipeline_stop() can be called through error paths with a > >>>> NULL entity pipe object. In this instance, stopping is a no-op, so > >>>> simply return without any action > >>> > >>> The approach of returning silently is wrong; the streaming count is indeed a > >>> count: you have to decrement it the exactly same number of times it's been > >>> increased. Code that attempts to call __media_entity_pipeline_stop() on an > >>> entity that's not streaming is simply buggy. > >> > >> Ok, Thanks for the heads up on where to look, as I suspected we are in > >> section B) of my comments below :D > >> > >>> I've got a patch here that adds a warning to graph traversal on streaming > >>> count being zero. I sent a pull request including that some days ago: > >>> > >>> <URL:http://www.spinics.net/lists/linux-media/msg108980.html> > >>> <URL:http://www.spinics.net/lists/linux-media/msg108995.html> > >> > >> Excellent, thanks, I've merged your branch into mine, and I'll > >> investigate where the error is actually coming from. > >> > >> -- > >> Ok - so I've merged your patches, (and dropped this patch) but they > >> don't fire any warning before I hit my null-pointer dereference in > >> __media_pipeline_stop(), on the WARN_ON(!pipe->streaming_count); > >> > >> So I think this is a case of calling stop on an invalid entity rather > >> than an unbalanced start/stop somehow: > > > > Not necessarily. The pipe is set to NULL if the count goes to zero. > > > > This check should be dropped, it's ill-placed anyway: if streaming count is > > zero the pipe is expected to be NULL anyway in which case you get a NULL > > pointer exception (that you saw there). With the check removed, you should > > get the warning instead. > > Ahh, yes, I'd missed the part there that was setting it to NULL. > > However, it will still segfault ... (though hopefully after the warning) > as the last line of the function states: > > if (!--pipe->streaming_count) > media_entity_graph_walk_cleanup(graph); > > So we will hit a null deref on the pipe->streaming_count there, which > still leaves an unbalanced stop as a kernel panic. Right, indeed. The graph is part of the pipeline, so if pipe was NULL you don't have graph either. How about changing the check to: if (WARN_ON(!pipe)) return; Instead of removing it. Nothing in that function really makes sense if there's no pipeline. The driver bug still exists. That's just to make the framework behaving a little bit better in the presence of that sort of bug. > > In fact, I've just tested removing that WARN_ON(!pipe->streaming_count); > - it doesn't even get that far - and segfaults in > > media_entity_graph_walk_cleanup(graph); > > [ 80.916558] Unable to handle kernel NULL pointer dereference at > virtual address 00000110 > .... > [ 81.769492] [<ffffff800853e278>] media_graph_walk_start+0x20/0xf0 > [ 81.776615] [<ffffff800853e73c>] __media_pipeline_stop+0x3c/0xd8 > [ 81.783644] [<ffffff800853e80c>] media_pipeline_stop+0x34/0x48 > [ 81.790505] [<ffffff8008567ff0>] vsp1_du_setup_lif+0x68/0x5a8 > [ 81.797279] [<ffffff80084de9d4>] rcar_du_vsp_disable+0x2c/0x38 > [ 81.804137] [<ffffff80084da710>] rcar_du_crtc_stop+0x198/0x1e8 > [ 81.810984] [<ffffff80084da780>] rcar_du_crtc_disable+0x20/0x38 > > due to the fact that 'graph' is dereferenced through the 'pipe'. > > { > /* Entity->pipe is NULL, therefore 'graph' is invalid */ > struct media_graph *graph = &entity->pipe->graph; > struct media_pipeline *pipe = entity->pipe; > > media_graph_walk_start(graph, entity); > ... > } > > So I suspect we do still need a warning or check in there somewhere. > Something to tell the developer that there is an unbalanced stop, would > be much better than a panic/segfault... > > (And of course, we still need to look at the actual unbalanced stop in > vsp1_du_setup_lif!) Indeed. >
diff --git a/drivers/media/media-entity.c b/drivers/media/media-entity.c index c68239e60487..93c9cbf4bf46 100644 --- a/drivers/media/media-entity.c +++ b/drivers/media/media-entity.c @@ -508,6 +508,8 @@ void __media_entity_pipeline_stop(struct media_entity *entity) struct media_entity_graph *graph = &entity->pipe->graph; struct media_pipeline *pipe = entity->pipe; + if (!pipe) + return; WARN_ON(!pipe->streaming_count); media_entity_graph_walk_start(graph, entity);
media_entity_pipeline_stop() can be called through error paths with a NULL entity pipe object. In this instance, stopping is a no-op, so simply return without any action Signed-off-by: Kieran Bingham <kieran.bingham+renesas@ideasonboard.com> --- I've marked this patch as RFC, although if deemed suitable, by all means integrate it as is. When testing suspend/resume operations on VSP1, I encountered a segfault on the WARN_ON(!pipe->streaming_count) line, where 'pipe == NULL'. The simple protection fix is to return early in this instance, as this patch does however: A) Does this early return path warrant a WARN() statement itself, to identify drivers which are incorrectly calling media_entity_pipeline_stop() with an invalid entity, or would this just be noise ... and therefore.. B) I also partly assume this patch could simply get NAK'd with a request to go and dig out the root cause of calling media_entity_pipeline_stop() with an invalid entity. My brief investigation so far here so far shows that it's almost a second order fault - where the first suspend resume cycle completes but leaves the entity in an invalid state having followed an error path - and then on a second suspend/resume - the stop fails with the affected segfault. If statement A) or B) apply here, please drop this patch from the series, and don't consider it a blocking issue for the other 3 patches. Kieran drivers/media/media-entity.c | 2 ++ 1 file changed, 2 insertions(+)