[libcamera-devel,0/3] IPU3 Stability
mbox series

Message ID 20210303170426.189648-1-kieran.bingham@ideasonboard.com
Headers show
Series
  • IPU3 Stability
Related show

Message

Kieran Bingham March 3, 2021, 5:04 p.m. UTC
While working on the IPU3 IPA, some crashes have been occuring on
shutdown races.

Patches 2/3 and 3/3 at least presently work around those issues, and
while they may be suitable for integration now - both suggest that more
work is needed to ensure that the IPU3 pipeline handler is stable and
implemented correctly.

Patch 2/3 highlights that we may need to take more concern over the
lifetime management of a Request and the associated FrameInfo that is
supported with that.

Patch 3/3 shows that the IPU3 Pipeline handler is queueing buffers after
it has released the buffers on the video nodes, and needs further
consideration as well (though I beleive that patch is still worthy of
integration on it's own).

The issue leading to what would be a crash without patch 3/3 is when
shutting down, the IPA running in parallel can emit an event from:
 void IPU3CameraData::queueFrameAction()
as 

  case ipa::ipu3::ActionParamFilled:

which then goes on to queue a buffer to the three relevant V4L2 devices:

	imgu_->param_->queueBuffer(info->paramBuffer);
	imgu_->stat_->queueBuffer(info->statBuffer);
	imgu_->input_->queueBuffer(info->rawBuffer);


This is occuring after those devices have released their buffers, and
thus the v4l2 buffer cache (cache_) is deleted and set to nullptr,
resulting in a segmentation fault within the V4L2VideoDevice otherwise.



Kieran Bingham (3):
  libcamera: pipeline: ipu3: Fix spelling error
  libcamera: pipeline: ipu3: Ensure that IPU3Frames::info is not used
    after delete
  libcamera: v4l2_videodevice: Prevent queueing buffers without a cache

 src/libcamera/pipeline/ipu3/frames.cpp |  2 +-
 src/libcamera/pipeline/ipu3/ipu3.cpp   | 20 ++++++++++++++++++--
 src/libcamera/v4l2_videodevice.cpp     | 10 ++++++++++
 3 files changed, 29 insertions(+), 3 deletions(-)

Comments

Laurent Pinchart March 3, 2021, 6:45 p.m. UTC | #1
Hi Kieran,

On Wed, Mar 03, 2021 at 05:04:23PM +0000, Kieran Bingham wrote:
> While working on the IPU3 IPA, some crashes have been occuring on
> shutdown races.
> 
> Patches 2/3 and 3/3 at least presently work around those issues, and
> while they may be suitable for integration now - both suggest that more
> work is needed to ensure that the IPU3 pipeline handler is stable and
> implemented correctly.
> 
> Patch 2/3 highlights that we may need to take more concern over the
> lifetime management of a Request and the associated FrameInfo that is
> supported with that.

I think 2/3 is suitable for integration, even if API improvements are
possible. This isn't a completely unexpected situation, now that we have
an implementation we find the related issues, that's a way to improve
APIs.

> Patch 3/3 shows that the IPU3 Pipeline handler is queueing buffers after
> it has released the buffers on the video nodes, and needs further
> consideration as well (though I beleive that patch is still worthy of
> integration on it's own).

I have more doubts about this one, as there's really something that
should be fixed on the pipeline handler side. I was considering
proposing the Fatal log level, but I suppose it will not make a big
difference compared to a segfault :-) Or maybe it would, the log message
can then clearly state that the problem is in the caller.

> The issue leading to what would be a crash without patch 3/3 is when
> shutting down, the IPA running in parallel can emit an event from:
>  void IPU3CameraData::queueFrameAction()
> as 
> 
>   case ipa::ipu3::ActionParamFilled:
> 
> which then goes on to queue a buffer to the three relevant V4L2 devices:
> 
> 	imgu_->param_->queueBuffer(info->paramBuffer);
> 	imgu_->stat_->queueBuffer(info->statBuffer);
> 	imgu_->input_->queueBuffer(info->rawBuffer);
> 
> 
> This is occuring after those devices have released their buffers, and
> thus the v4l2 buffer cache (cache_) is deleted and set to nullptr,
> resulting in a segmentation fault within the V4L2VideoDevice otherwise.

Looking at PipelineHandlerIPU3::stop(), we have

	data->ipa_->stop();

	freeBuffers(camera);

The IPA stop() call should be synchronous, which means that any pending
asynchronous call from the IPA module to the pipeline handler should be
processed before stop() returns.

On the IPA side, the stop() function should stop any internal thread,
which will make sure that the API won't emit any event on its own after
stop() returns. The stop() implementation is currently empty, but as
we're not creating any custom thread in the IPU3 IPA at this time, it
shouldn't be a problem.

It could also be that the pipeline handler sends more frame events to
the IPA module after stop(), but given the cross-thread nature of the
calls, I wouldn't expect that to go through if the IPA thread created by
the proxy is stopped. Still, it may be worth double-checking this.

Tracing the IPA calls could be useful here. Please let me know if I can
help.

> Kieran Bingham (3):
>   libcamera: pipeline: ipu3: Fix spelling error
>   libcamera: pipeline: ipu3: Ensure that IPU3Frames::info is not used
>     after delete
>   libcamera: v4l2_videodevice: Prevent queueing buffers without a cache
> 
>  src/libcamera/pipeline/ipu3/frames.cpp |  2 +-
>  src/libcamera/pipeline/ipu3/ipu3.cpp   | 20 ++++++++++++++++++--
>  src/libcamera/v4l2_videodevice.cpp     | 10 ++++++++++
>  3 files changed, 29 insertions(+), 3 deletions(-)
>
Kieran Bingham March 4, 2021, 10:14 a.m. UTC | #2
On 03/03/2021 18:45, Laurent Pinchart wrote:
> Hi Kieran,
> 
> On Wed, Mar 03, 2021 at 05:04:23PM +0000, Kieran Bingham wrote:
>> While working on the IPU3 IPA, some crashes have been occuring on
>> shutdown races.
>>
>> Patches 2/3 and 3/3 at least presently work around those issues, and
>> while they may be suitable for integration now - both suggest that more
>> work is needed to ensure that the IPU3 pipeline handler is stable and
>> implemented correctly.
>>
>> Patch 2/3 highlights that we may need to take more concern over the
>> lifetime management of a Request and the associated FrameInfo that is
>> supported with that.
> 
> I think 2/3 is suitable for integration, even if API improvements are
> possible. This isn't a completely unexpected situation, now that we have
> an implementation we find the related issues, that's a way to improve
> APIs.

Yes, well we certainly need to fix it to unblock other developments.

>> Patch 3/3 shows that the IPU3 Pipeline handler is queueing buffers after
>> it has released the buffers on the video nodes, and needs further
>> consideration as well (though I beleive that patch is still worthy of
>> integration on it's own).
> 
> I have more doubts about this one, as there's really something that
> should be fixed on the pipeline handler side. I was considering
> proposing the Fatal log level, but I suppose it will not make a big
> difference compared to a segfault :-) Or maybe it would, the log message
> can then clearly state that the problem is in the caller.


You know, locally it's a fatal on my tree right now, so it's harder to
disagree, but in this patch I set it to Error because I believe it's
better not to crash right now.

It's not a Fatally unrecoverable error. It's an issue in the the
pipeline handler, doing something it shouldn't, and it is recoverable
(Just don't queue).

But it should still be fixed by the pipeline handler, hence loud error
message.

I would love to be able to make this an Error with a backtrace ... to be
/really/ loud, but I'm not sure yet about having an extra log level
between Error and Fatal.




>> The issue leading to what would be a crash without patch 3/3 is when
>> shutting down, the IPA running in parallel can emit an event from:
>>  void IPU3CameraData::queueFrameAction()
>> as 
>>
>>   case ipa::ipu3::ActionParamFilled:
>>
>> which then goes on to queue a buffer to the three relevant V4L2 devices:
>>
>> 	imgu_->param_->queueBuffer(info->paramBuffer);
>> 	imgu_->stat_->queueBuffer(info->statBuffer);
>> 	imgu_->input_->queueBuffer(info->rawBuffer);
>>
>>
>> This is occuring after those devices have released their buffers, and
>> thus the v4l2 buffer cache (cache_) is deleted and set to nullptr,
>> resulting in a segmentation fault within the V4L2VideoDevice otherwise.
> 
> Looking at PipelineHandlerIPU3::stop(), we have
> 
> 	data->ipa_->stop();
> 
> 	freeBuffers(camera);
> 
> The IPA stop() call should be synchronous, which means that any pending
> asynchronous call from the IPA module to the pipeline handler should be
> processed before stop() returns.
> 
> On the IPA side, the stop() function should stop any internal thread,
> which will make sure that the API won't emit any event on its own after
> stop() returns. The stop() implementation is currently empty, but as
> we're not creating any custom thread in the IPU3 IPA at this time, it
> shouldn't be a problem.
> 
> It could also be that the pipeline handler sends more frame events to
> the IPA module after stop(), but given the cross-thread nature of the
> calls, I wouldn't expect that to go through if the IPA thread created by
> the proxy is stopped. Still, it may be worth double-checking this.
> 
> Tracing the IPA calls could be useful here. Please let me know if I can
> help.

I enabled tracing with:
 lttng enable-event -u libcamera:\*

but got not IPA traces.

Ok - so I've discussed this with Paul, and it's because IPA traces
aren't enabled ;-)

Adding a patch he has adds more context there, and it seems like we
might be able to add more tracepoints automatically to see when messages
are sent back from the IPA - that will make it clear if the IPA is
sending a fillParams message back /after/ having already stopped.

I'll defer further investigation on this until after I have that
visibility for now.



>> Kieran Bingham (3):
>>   libcamera: pipeline: ipu3: Fix spelling error
>>   libcamera: pipeline: ipu3: Ensure that IPU3Frames::info is not used
>>     after delete
>>   libcamera: v4l2_videodevice: Prevent queueing buffers without a cache
>>
>>  src/libcamera/pipeline/ipu3/frames.cpp |  2 +-
>>  src/libcamera/pipeline/ipu3/ipu3.cpp   | 20 ++++++++++++++++++--
>>  src/libcamera/v4l2_videodevice.cpp     | 10 ++++++++++
>>  3 files changed, 29 insertions(+), 3 deletions(-)
>>
>
Laurent Pinchart March 4, 2021, 2:12 p.m. UTC | #3
Hi Kieran,

On Thu, Mar 04, 2021 at 10:14:40AM +0000, Kieran Bingham wrote:
> On 03/03/2021 18:45, Laurent Pinchart wrote:
> > On Wed, Mar 03, 2021 at 05:04:23PM +0000, Kieran Bingham wrote:
> >> While working on the IPU3 IPA, some crashes have been occuring on
> >> shutdown races.
> >>
> >> Patches 2/3 and 3/3 at least presently work around those issues, and
> >> while they may be suitable for integration now - both suggest that more
> >> work is needed to ensure that the IPU3 pipeline handler is stable and
> >> implemented correctly.
> >>
> >> Patch 2/3 highlights that we may need to take more concern over the
> >> lifetime management of a Request and the associated FrameInfo that is
> >> supported with that.
> > 
> > I think 2/3 is suitable for integration, even if API improvements are
> > possible. This isn't a completely unexpected situation, now that we have
> > an implementation we find the related issues, that's a way to improve
> > APIs.
> 
> Yes, well we certainly need to fix it to unblock other developments.
> 
> >> Patch 3/3 shows that the IPU3 Pipeline handler is queueing buffers after
> >> it has released the buffers on the video nodes, and needs further
> >> consideration as well (though I beleive that patch is still worthy of
> >> integration on it's own).
> > 
> > I have more doubts about this one, as there's really something that
> > should be fixed on the pipeline handler side. I was considering
> > proposing the Fatal log level, but I suppose it will not make a big
> > difference compared to a segfault :-) Or maybe it would, the log message
> > can then clearly state that the problem is in the caller.
> 
> You know, locally it's a fatal on my tree right now, so it's harder to
> disagree, but in this patch I set it to Error because I believe it's
> better not to crash right now.

:-)

> It's not a Fatally unrecoverable error. It's an issue in the the
> pipeline handler, doing something it shouldn't, and it is recoverable
> (Just don't queue).

The part on the V4L2VideoDevice side may be, but it's a symptom of
something really wrong happening, which will likely cause other
undesired behaviour in the pipeline handler. What those behaviours will
be depend on the pipeline handler, but the worst case would be some
silent memory corruption that will cause a crash later. A fatal error
here would help pin-pointing the root cause, making debugging of the
pipeline handler easier.

Granted, the error message will help there too (returning an error
silently would be the worst option). I suppose our ways of screaming
loudly differ, mine goes through an assertion failure :-)

> But it should still be fixed by the pipeline handler, hence loud error
> message.
> 
> I would love to be able to make this an Error with a backtrace ... to be
> /really/ loud, but I'm not sure yet about having an extra log level
> between Error and Fatal.

It's an interesting thought, I'll sleep on it.

> >> The issue leading to what would be a crash without patch 3/3 is when
> >> shutting down, the IPA running in parallel can emit an event from:
> >>  void IPU3CameraData::queueFrameAction()
> >> as 
> >>
> >>   case ipa::ipu3::ActionParamFilled:
> >>
> >> which then goes on to queue a buffer to the three relevant V4L2 devices:
> >>
> >> 	imgu_->param_->queueBuffer(info->paramBuffer);
> >> 	imgu_->stat_->queueBuffer(info->statBuffer);
> >> 	imgu_->input_->queueBuffer(info->rawBuffer);
> >>
> >>
> >> This is occuring after those devices have released their buffers, and
> >> thus the v4l2 buffer cache (cache_) is deleted and set to nullptr,
> >> resulting in a segmentation fault within the V4L2VideoDevice otherwise.
> > 
> > Looking at PipelineHandlerIPU3::stop(), we have
> > 
> > 	data->ipa_->stop();
> > 
> > 	freeBuffers(camera);
> > 
> > The IPA stop() call should be synchronous, which means that any pending
> > asynchronous call from the IPA module to the pipeline handler should be
> > processed before stop() returns.
> > 
> > On the IPA side, the stop() function should stop any internal thread,
> > which will make sure that the API won't emit any event on its own after
> > stop() returns. The stop() implementation is currently empty, but as
> > we're not creating any custom thread in the IPU3 IPA at this time, it
> > shouldn't be a problem.
> > 
> > It could also be that the pipeline handler sends more frame events to
> > the IPA module after stop(), but given the cross-thread nature of the
> > calls, I wouldn't expect that to go through if the IPA thread created by
> > the proxy is stopped. Still, it may be worth double-checking this.
> > 
> > Tracing the IPA calls could be useful here. Please let me know if I can
> > help.
> 
> I enabled tracing with:
>  lttng enable-event -u libcamera:\*
> 
> but got not IPA traces.
> 
> Ok - so I've discussed this with Paul, and it's because IPA traces
> aren't enabled ;-)
> 
> Adding a patch he has adds more context there, and it seems like we
> might be able to add more tracepoints automatically to see when messages
> are sent back from the IPA - that will make it clear if the IPA is
> sending a fillParams message back /after/ having already stopped.
> 
> I'll defer further investigation on this until after I have that
> visibility for now.
> 
> >> Kieran Bingham (3):
> >>   libcamera: pipeline: ipu3: Fix spelling error
> >>   libcamera: pipeline: ipu3: Ensure that IPU3Frames::info is not used
> >>     after delete
> >>   libcamera: v4l2_videodevice: Prevent queueing buffers without a cache
> >>
> >>  src/libcamera/pipeline/ipu3/frames.cpp |  2 +-
> >>  src/libcamera/pipeline/ipu3/ipu3.cpp   | 20 ++++++++++++++++++--
> >>  src/libcamera/v4l2_videodevice.cpp     | 10 ++++++++++
> >>  3 files changed, 29 insertions(+), 3 deletions(-)
> >>
Jacopo Mondi March 5, 2021, 7:45 p.m. UTC | #4
Hi Kieran

On Wed, Mar 03, 2021 at 05:04:23PM +0000, Kieran Bingham wrote:
> While working on the IPU3 IPA, some crashes have been occuring on
> shutdown races.
>
> Patches 2/3 and 3/3 at least presently work around those issues, and
> while they may be suitable for integration now - both suggest that more
> work is needed to ensure that the IPU3 pipeline handler is stable and
> implemented correctly.
>
> Patch 2/3 highlights that we may need to take more concern over the
> lifetime management of a Request and the associated FrameInfo that is
> supported with that.
>
> Patch 3/3 shows that the IPU3 Pipeline handler is queueing buffers after
> it has released the buffers on the video nodes, and needs further
> consideration as well (though I beleive that patch is still worthy of
> integration on it's own).
>
> The issue leading to what would be a crash without patch 3/3 is when
> shutting down, the IPA running in parallel can emit an event from:
>  void IPU3CameraData::queueFrameAction()
> as
>
>   case ipa::ipu3::ActionParamFilled:
>
> which then goes on to queue a buffer to the three relevant V4L2 devices:
>
> 	imgu_->param_->queueBuffer(info->paramBuffer);
> 	imgu_->stat_->queueBuffer(info->statBuffer);
> 	imgu_->input_->queueBuffer(info->rawBuffer);
>
>
> This is occuring after those devices have released their buffers, and
> thus the v4l2 buffer cache (cache_) is deleted and set to nullptr,
> resulting in a segmentation fault within the V4L2VideoDevice otherwise.

I've been running these patches in my tree for a few days and indeed I
haven't noticed the nasty segfaults I've been having before when
running CTS on IPU3.

For the series
Tested-by: Jacopo Mondi <jacopo@jmondi.org>

I'll review patches singularly next

Thanks
  j

>
>
>
> Kieran Bingham (3):
>   libcamera: pipeline: ipu3: Fix spelling error
>   libcamera: pipeline: ipu3: Ensure that IPU3Frames::info is not used
>     after delete
>   libcamera: v4l2_videodevice: Prevent queueing buffers without a cache
>
>  src/libcamera/pipeline/ipu3/frames.cpp |  2 +-
>  src/libcamera/pipeline/ipu3/ipu3.cpp   | 20 ++++++++++++++++++--
>  src/libcamera/v4l2_videodevice.cpp     | 10 ++++++++++
>  3 files changed, 29 insertions(+), 3 deletions(-)
>
> --
> 2.25.1
>
> _______________________________________________
> libcamera-devel mailing list
> libcamera-devel@lists.libcamera.org
> https://lists.libcamera.org/listinfo/libcamera-devel
Kieran Bingham March 8, 2021, 12:06 p.m. UTC | #5
Hi Laurent,

On 04/03/2021 14:12, Laurent Pinchart wrote:
> Hi Kieran,
> 
> On Thu, Mar 04, 2021 at 10:14:40AM +0000, Kieran Bingham wrote:
>> On 03/03/2021 18:45, Laurent Pinchart wrote:
>>> On Wed, Mar 03, 2021 at 05:04:23PM +0000, Kieran Bingham wrote:
>>>> While working on the IPU3 IPA, some crashes have been occuring on
>>>> shutdown races.
>>>>
>>>> Patches 2/3 and 3/3 at least presently work around those issues, and
>>>> while they may be suitable for integration now - both suggest that more
>>>> work is needed to ensure that the IPU3 pipeline handler is stable and
>>>> implemented correctly.
>>>>
>>>> Patch 2/3 highlights that we may need to take more concern over the
>>>> lifetime management of a Request and the associated FrameInfo that is
>>>> supported with that.
>>>
>>> I think 2/3 is suitable for integration, even if API improvements are
>>> possible. This isn't a completely unexpected situation, now that we have
>>> an implementation we find the related issues, that's a way to improve
>>> APIs.
>>
>> Yes, well we certainly need to fix it to unblock other developments.
>>
>>>> Patch 3/3 shows that the IPU3 Pipeline handler is queueing buffers after
>>>> it has released the buffers on the video nodes, and needs further
>>>> consideration as well (though I beleive that patch is still worthy of
>>>> integration on it's own).
>>>
>>> I have more doubts about this one, as there's really something that
>>> should be fixed on the pipeline handler side. I was considering
>>> proposing the Fatal log level, but I suppose it will not make a big
>>> difference compared to a segfault :-) Or maybe it would, the log message
>>> can then clearly state that the problem is in the caller.
>>
>> You know, locally it's a fatal on my tree right now, so it's harder to
>> disagree, but in this patch I set it to Error because I believe it's
>> better not to crash right now.
> 
> :-)
> 
>> It's not a Fatally unrecoverable error. It's an issue in the the
>> pipeline handler, doing something it shouldn't, and it is recoverable
>> (Just don't queue).
> 
> The part on the V4L2VideoDevice side may be, but it's a symptom of
> something really wrong happening, which will likely cause other
> undesired behaviour in the pipeline handler. What those behaviours will
> be depend on the pipeline handler, but the worst case would be some
> silent memory corruption that will cause a crash later. A fatal error
> here would help pin-pointing the root cause, making debugging of the
> pipeline handler easier.
> 
> Granted, the error message will help there too (returning an error
> silently would be the worst option). I suppose our ways of screaming
> loudly differ, mine goes through an assertion failure :-)
> 
>> But it should still be fixed by the pipeline handler, hence loud error
>> message.
>>
>> I would love to be able to make this an Error with a backtrace ... to be
>> /really/ loud, but I'm not sure yet about having an extra log level
>> between Error and Fatal.
> 
> It's an interesting thought, I'll sleep on it.

Did you get any sleep?


I'd like to get these patches in at least so that we can ensure IPA
development can continue.

We can always make queuing a buffer to V4L2VideoDevice after it's been
stopped a fatal error (Or at least a backtracable error) after the
pipeline handler has been fixed...



>>>> The issue leading to what would be a crash without patch 3/3 is when
>>>> shutting down, the IPA running in parallel can emit an event from:
>>>>  void IPU3CameraData::queueFrameAction()
>>>> as 
>>>>
>>>>   case ipa::ipu3::ActionParamFilled:
>>>>
>>>> which then goes on to queue a buffer to the three relevant V4L2 devices:
>>>>
>>>> 	imgu_->param_->queueBuffer(info->paramBuffer);
>>>> 	imgu_->stat_->queueBuffer(info->statBuffer);
>>>> 	imgu_->input_->queueBuffer(info->rawBuffer);
>>>>
>>>>
>>>> This is occuring after those devices have released their buffers, and
>>>> thus the v4l2 buffer cache (cache_) is deleted and set to nullptr,
>>>> resulting in a segmentation fault within the V4L2VideoDevice otherwise.
>>>
>>> Looking at PipelineHandlerIPU3::stop(), we have
>>>
>>> 	data->ipa_->stop();
>>>
>>> 	freeBuffers(camera);
>>>
>>> The IPA stop() call should be synchronous, which means that any pending
>>> asynchronous call from the IPA module to the pipeline handler should be
>>> processed before stop() returns.
>>>
>>> On the IPA side, the stop() function should stop any internal thread,
>>> which will make sure that the API won't emit any event on its own after
>>> stop() returns. The stop() implementation is currently empty, but as
>>> we're not creating any custom thread in the IPU3 IPA at this time, it
>>> shouldn't be a problem.
>>>
>>> It could also be that the pipeline handler sends more frame events to
>>> the IPA module after stop(), but given the cross-thread nature of the
>>> calls, I wouldn't expect that to go through if the IPA thread created by
>>> the proxy is stopped. Still, it may be worth double-checking this.
>>>
>>> Tracing the IPA calls could be useful here. Please let me know if I can
>>> help.
>>
>> I enabled tracing with:
>>  lttng enable-event -u libcamera:\*
>>
>> but got not IPA traces.
>>
>> Ok - so I've discussed this with Paul, and it's because IPA traces
>> aren't enabled ;-)
>>
>> Adding a patch he has adds more context there, and it seems like we
>> might be able to add more tracepoints automatically to see when messages
>> are sent back from the IPA - that will make it clear if the IPA is
>> sending a fillParams message back /after/ having already stopped.
>>
>> I'll defer further investigation on this until after I have that
>> visibility for now.
>>
>>>> Kieran Bingham (3):
>>>>   libcamera: pipeline: ipu3: Fix spelling error
>>>>   libcamera: pipeline: ipu3: Ensure that IPU3Frames::info is not used
>>>>     after delete
>>>>   libcamera: v4l2_videodevice: Prevent queueing buffers without a cache
>>>>
>>>>  src/libcamera/pipeline/ipu3/frames.cpp |  2 +-
>>>>  src/libcamera/pipeline/ipu3/ipu3.cpp   | 20 ++++++++++++++++++--
>>>>  src/libcamera/v4l2_videodevice.cpp     | 10 ++++++++++
>>>>  3 files changed, 29 insertions(+), 3 deletions(-)
>>>>
>
Laurent Pinchart March 8, 2021, 3 p.m. UTC | #6
Hi Kieran,

On Mon, Mar 08, 2021 at 12:06:20PM +0000, Kieran Bingham wrote:
> On 04/03/2021 14:12, Laurent Pinchart wrote:
> > On Thu, Mar 04, 2021 at 10:14:40AM +0000, Kieran Bingham wrote:
> >> On 03/03/2021 18:45, Laurent Pinchart wrote:
> >>> On Wed, Mar 03, 2021 at 05:04:23PM +0000, Kieran Bingham wrote:
> >>>> While working on the IPU3 IPA, some crashes have been occuring on
> >>>> shutdown races.
> >>>>
> >>>> Patches 2/3 and 3/3 at least presently work around those issues, and
> >>>> while they may be suitable for integration now - both suggest that more
> >>>> work is needed to ensure that the IPU3 pipeline handler is stable and
> >>>> implemented correctly.
> >>>>
> >>>> Patch 2/3 highlights that we may need to take more concern over the
> >>>> lifetime management of a Request and the associated FrameInfo that is
> >>>> supported with that.
> >>>
> >>> I think 2/3 is suitable for integration, even if API improvements are
> >>> possible. This isn't a completely unexpected situation, now that we have
> >>> an implementation we find the related issues, that's a way to improve
> >>> APIs.
> >>
> >> Yes, well we certainly need to fix it to unblock other developments.
> >>
> >>>> Patch 3/3 shows that the IPU3 Pipeline handler is queueing buffers after
> >>>> it has released the buffers on the video nodes, and needs further
> >>>> consideration as well (though I beleive that patch is still worthy of
> >>>> integration on it's own).
> >>>
> >>> I have more doubts about this one, as there's really something that
> >>> should be fixed on the pipeline handler side. I was considering
> >>> proposing the Fatal log level, but I suppose it will not make a big
> >>> difference compared to a segfault :-) Or maybe it would, the log message
> >>> can then clearly state that the problem is in the caller.
> >>
> >> You know, locally it's a fatal on my tree right now, so it's harder to
> >> disagree, but in this patch I set it to Error because I believe it's
> >> better not to crash right now.
> > 
> > :-)
> > 
> >> It's not a Fatally unrecoverable error. It's an issue in the the
> >> pipeline handler, doing something it shouldn't, and it is recoverable
> >> (Just don't queue).
> > 
> > The part on the V4L2VideoDevice side may be, but it's a symptom of
> > something really wrong happening, which will likely cause other
> > undesired behaviour in the pipeline handler. What those behaviours will
> > be depend on the pipeline handler, but the worst case would be some
> > silent memory corruption that will cause a crash later. A fatal error
> > here would help pin-pointing the root cause, making debugging of the
> > pipeline handler easier.
> > 
> > Granted, the error message will help there too (returning an error
> > silently would be the worst option). I suppose our ways of screaming
> > loudly differ, mine goes through an assertion failure :-)
> > 
> >> But it should still be fixed by the pipeline handler, hence loud error
> >> message.
> >>
> >> I would love to be able to make this an Error with a backtrace ... to be
> >> /really/ loud, but I'm not sure yet about having an extra log level
> >> between Error and Fatal.
> > 
> > It's an interesting thought, I'll sleep on it.
> 
> Did you get any sleep?

Yes :-)

We already have an error level with a backtrace, that's Fatal. Currently
it terminates program execution with std::abort(), and I'd be fine only
doing so for debug builds. It doesn't seem we need an additional level.

> I'd like to get these patches in at least so that we can ensure IPA
> development can continue.

1/3 and 2/3 can be merged right away.

For 3/3, I'll reply to the patch now.

> We can always make queuing a buffer to V4L2VideoDevice after it's been
> stopped a fatal error (Or at least a backtracable error) after the
> pipeline handler has been fixed...
> 
> >>>> The issue leading to what would be a crash without patch 3/3 is when
> >>>> shutting down, the IPA running in parallel can emit an event from:
> >>>>  void IPU3CameraData::queueFrameAction()
> >>>> as 
> >>>>
> >>>>   case ipa::ipu3::ActionParamFilled:
> >>>>
> >>>> which then goes on to queue a buffer to the three relevant V4L2 devices:
> >>>>
> >>>> 	imgu_->param_->queueBuffer(info->paramBuffer);
> >>>> 	imgu_->stat_->queueBuffer(info->statBuffer);
> >>>> 	imgu_->input_->queueBuffer(info->rawBuffer);
> >>>>
> >>>>
> >>>> This is occuring after those devices have released their buffers, and
> >>>> thus the v4l2 buffer cache (cache_) is deleted and set to nullptr,
> >>>> resulting in a segmentation fault within the V4L2VideoDevice otherwise.
> >>>
> >>> Looking at PipelineHandlerIPU3::stop(), we have
> >>>
> >>> 	data->ipa_->stop();
> >>>
> >>> 	freeBuffers(camera);
> >>>
> >>> The IPA stop() call should be synchronous, which means that any pending
> >>> asynchronous call from the IPA module to the pipeline handler should be
> >>> processed before stop() returns.
> >>>
> >>> On the IPA side, the stop() function should stop any internal thread,
> >>> which will make sure that the API won't emit any event on its own after
> >>> stop() returns. The stop() implementation is currently empty, but as
> >>> we're not creating any custom thread in the IPU3 IPA at this time, it
> >>> shouldn't be a problem.
> >>>
> >>> It could also be that the pipeline handler sends more frame events to
> >>> the IPA module after stop(), but given the cross-thread nature of the
> >>> calls, I wouldn't expect that to go through if the IPA thread created by
> >>> the proxy is stopped. Still, it may be worth double-checking this.
> >>>
> >>> Tracing the IPA calls could be useful here. Please let me know if I can
> >>> help.
> >>
> >> I enabled tracing with:
> >>  lttng enable-event -u libcamera:\*
> >>
> >> but got not IPA traces.
> >>
> >> Ok - so I've discussed this with Paul, and it's because IPA traces
> >> aren't enabled ;-)
> >>
> >> Adding a patch he has adds more context there, and it seems like we
> >> might be able to add more tracepoints automatically to see when messages
> >> are sent back from the IPA - that will make it clear if the IPA is
> >> sending a fillParams message back /after/ having already stopped.
> >>
> >> I'll defer further investigation on this until after I have that
> >> visibility for now.
> >>
> >>>> Kieran Bingham (3):
> >>>>   libcamera: pipeline: ipu3: Fix spelling error
> >>>>   libcamera: pipeline: ipu3: Ensure that IPU3Frames::info is not used
> >>>>     after delete
> >>>>   libcamera: v4l2_videodevice: Prevent queueing buffers without a cache
> >>>>
> >>>>  src/libcamera/pipeline/ipu3/frames.cpp |  2 +-
> >>>>  src/libcamera/pipeline/ipu3/ipu3.cpp   | 20 ++++++++++++++++++--
> >>>>  src/libcamera/v4l2_videodevice.cpp     | 10 ++++++++++
> >>>>  3 files changed, 29 insertions(+), 3 deletions(-)
> >>>>