[libcamera-devel,00/12] libcamera: Hardening against thread race conditions
mbox series

Message ID 20240121035948.4226-1-laurent.pinchart@ideasonboard.com
Headers show
Series
  • libcamera: Hardening against thread race conditions
Related show

Message

Laurent Pinchart Jan. 21, 2024, 3:59 a.m. UTC
Hello,

Bug https://bugs.libcamera.org/show_bug.cgi?id=208 reports a race
condition that is ultimately due to incorrect usage of the libcamera
multi-threading infrastructure by the soft ISP code under development.
Instead of blaming the author of that code, I believe it shows we're not
doing well enough at communicating the threading requirements. This
patch series is an attempt to improve the situation.

libcamera has a threading model that is documented. In particular, the
documentation classifies functions as thread-safe, thread-bound and
reentrant. Patch 01/12 starts by fixing a minor issue in the
documentation of a thread-bound function that does not contain the
correct reference.

Patch 02/12 is a drive-by improvement that is, strictly speaking,
unrelated to this series, but was developed at the same time.

The next two patches fix a first thread-related issues with the
Object::deleteLater() function. Patch 03/12 extends a unit test to
display the issue, and patch 04/12 fixes it.

Patches 05/12 to 10/12 continue with fixing various kinds of incorrect
deletion of objects in unit tests. It turned out that all the race
conditions related to this patch series are in unit tests, libcamera
itself isn't (as far as I could see) affected. This is good news, even
if it means we haven't been careful enough when writing unit tests,
which calls for improvements in that area in the future.

Patches 11/12 and 12/12 finally add assertions to complain loudly about
incorrect deletion of Object instances (11/12) and incorrect calling
contexts of thread-bound functions (12/12).

Only the thread-bound members of Object subclasses have been hardened,
more work is needed to extend this to individual members of other
classes (currently only DeviceEnumerator::enumerate()), and to classes
that are marked as thread-bound at the class level (IPCUnixSocket and
V4L2VideoDevice).

Even if V4L2VideoDevice isn't addressed directly by this series, and the
bug report is related to usage of the V4L2VideoDevice class, I believe
the issue in the soft ISP code will be caught by the assertions added to
the EventNotifier class. Milan, would you be able to test this, and
confirm that libcamera now complains loudly ?

Laurent Pinchart (12):
  libcamera: object: Fix thread-bound reference in documentation
  libcamera: signal: Replace object.h inclusion with forward declatation
  test: object-delete: Test deferred delete just before thread stops
  libcamera: thread: Ensure deferred deletion of all objects before
    stopping
  test: event-thread: Destroy Object from correct thread context
  test: message: Remove incorrect slow receiver test
  test: message: Destroy Object from correct thread context
  test: signal-threads: Destroy Object from correct thread context
  test: timer-thread: Move timer start from wrong thread to separate
    test
  test: timer-thread: Destroy Object from correct thread context
  libcamera: object: Document and ensure Object deletion constraints
  libcamera: object: Add and use thread-bound assertion

 include/libcamera/base/object.h               |   2 +
 include/libcamera/base/signal.h               |   3 +-
 src/libcamera/base/bound_method.cpp           |   1 +
 src/libcamera/base/event_notifier.cpp         |   6 +
 src/libcamera/base/object.cpp                 |  50 +++++++-
 src/libcamera/base/signal.cpp                 |   1 +
 src/libcamera/base/thread.cpp                 |   7 ++
 src/libcamera/base/timer.cpp                  |  10 +-
 test/event-thread.cpp                         |  38 +++++--
 test/ipa/ipa_interface_test.cpp               |   1 +
 test/meson.build                              |   9 +-
 test/message.cpp                              |  54 +++------
 test/object-delete.cpp                        |  30 ++++-
 test/signal-threads.cpp                       |  24 ++--
 test/timer-fail.cpp                           | 107 ++++++++++++++++++
 test/timer-thread.cpp                         |  37 ++----
 .../module_ipa_proxy.h.tmpl                   |   1 +
 17 files changed, 276 insertions(+), 105 deletions(-)
 create mode 100644 test/timer-fail.cpp


base-commit: 0d99f2de13faccf199e6c9e806702cb21d0b6c24

Comments

Laurent Pinchart Jan. 21, 2024, 4:02 a.m. UTC | #1
On Sun, Jan 21, 2024 at 05:59:36AM +0200, Laurent Pinchart via libcamera-devel wrote:
> Hello,
> 
> Bug https://bugs.libcamera.org/show_bug.cgi?id=208 reports a race
> condition that is ultimately due to incorrect usage of the libcamera
> multi-threading infrastructure by the soft ISP code under development.
> Instead of blaming the author of that code, I believe it shows we're not
> doing well enough at communicating the threading requirements. This
> patch series is an attempt to improve the situation.
> 
> libcamera has a threading model that is documented. In particular, the
> documentation classifies functions as thread-safe, thread-bound and
> reentrant. Patch 01/12 starts by fixing a minor issue in the
> documentation of a thread-bound function that does not contain the
> correct reference.
> 
> Patch 02/12 is a drive-by improvement that is, strictly speaking,
> unrelated to this series, but was developed at the same time.
> 
> The next two patches fix a first thread-related issues with the
> Object::deleteLater() function. Patch 03/12 extends a unit test to
> display the issue, and patch 04/12 fixes it.
> 
> Patches 05/12 to 10/12 continue with fixing various kinds of incorrect
> deletion of objects in unit tests. It turned out that all the race
> conditions related to this patch series are in unit tests, libcamera
> itself isn't (as far as I could see) affected. This is good news, even
> if it means we haven't been careful enough when writing unit tests,
> which calls for improvements in that area in the future.
> 
> Patches 11/12 and 12/12 finally add assertions to complain loudly about
> incorrect deletion of Object instances (11/12) and incorrect calling
> contexts of thread-bound functions (12/12).
> 
> Only the thread-bound members of Object subclasses have been hardened,
> more work is needed to extend this to individual members of other
> classes (currently only DeviceEnumerator::enumerate()), and to classes
> that are marked as thread-bound at the class level (IPCUnixSocket and
> V4L2VideoDevice).
> 
> Even if V4L2VideoDevice isn't addressed directly by this series, and the
> bug report is related to usage of the V4L2VideoDevice class, I believe
> the issue in the soft ISP code will be caught by the assertions added to
> the EventNotifier class. Milan, would you be able to test this, and
> confirm that libcamera now complains loudly ?

Also, I would like more extensive testing on Raspberry Pi to make sure
this doesn't cause any regression. Naush, David, would you be able to
give this series a try ?

> Laurent Pinchart (12):
>   libcamera: object: Fix thread-bound reference in documentation
>   libcamera: signal: Replace object.h inclusion with forward declatation
>   test: object-delete: Test deferred delete just before thread stops
>   libcamera: thread: Ensure deferred deletion of all objects before
>     stopping
>   test: event-thread: Destroy Object from correct thread context
>   test: message: Remove incorrect slow receiver test
>   test: message: Destroy Object from correct thread context
>   test: signal-threads: Destroy Object from correct thread context
>   test: timer-thread: Move timer start from wrong thread to separate
>     test
>   test: timer-thread: Destroy Object from correct thread context
>   libcamera: object: Document and ensure Object deletion constraints
>   libcamera: object: Add and use thread-bound assertion
> 
>  include/libcamera/base/object.h               |   2 +
>  include/libcamera/base/signal.h               |   3 +-
>  src/libcamera/base/bound_method.cpp           |   1 +
>  src/libcamera/base/event_notifier.cpp         |   6 +
>  src/libcamera/base/object.cpp                 |  50 +++++++-
>  src/libcamera/base/signal.cpp                 |   1 +
>  src/libcamera/base/thread.cpp                 |   7 ++
>  src/libcamera/base/timer.cpp                  |  10 +-
>  test/event-thread.cpp                         |  38 +++++--
>  test/ipa/ipa_interface_test.cpp               |   1 +
>  test/meson.build                              |   9 +-
>  test/message.cpp                              |  54 +++------
>  test/object-delete.cpp                        |  30 ++++-
>  test/signal-threads.cpp                       |  24 ++--
>  test/timer-fail.cpp                           | 107 ++++++++++++++++++
>  test/timer-thread.cpp                         |  37 ++----
>  .../module_ipa_proxy.h.tmpl                   |   1 +
>  17 files changed, 276 insertions(+), 105 deletions(-)
>  create mode 100644 test/timer-fail.cpp
> 
> 
> base-commit: 0d99f2de13faccf199e6c9e806702cb21d0b6c24
Milan Zamazal Jan. 22, 2024, 4:30 p.m. UTC | #2
Laurent Pinchart <laurent.pinchart@ideasonboard.com> writes:

> Even if V4L2VideoDevice isn't addressed directly by this series, and the
> bug report is related to usage of the V4L2VideoDevice class, I believe
> the issue in the soft ISP code will be caught by the assertions added to
> the EventNotifier class. Milan, would you be able to test this, and
> confirm that libcamera now complains loudly ?

Nice cleanup, thank you!  Yes, this complains immediately and loudly in the
given case.

Now a related question is how to deal with threads and instances of classes that
don't inherit Object.

I think V4L2VideoDevice::queueBuffer is a good example.  V4L2VideoDevice is
thread-bound but doesn't inherit Object.  So the caller, unless it has created
the given instance itself (and being thread-bound itself or tracking its initial
thread), has no idea about the thread a V4L2VideoDevice instance is bound too,
even less how to call it there, and even less without worrying about
implications of asynchronous calls.

This means ensuring most things run in the same thread.  This may be easy in
some cases and less fine in other ones.  It discourages using multiple threads,
not sure whether this is a good thing or not.  It's not robust in any case.

A possible alternative would be making less things thread-bound.  I suppose
there are reasons why this is not happening in libcamera.

Another alternative would be to use Object (as a direct base class or in
wrappers) or Thread (together with initial thread tracking) to call the right
thread in more places.  It would be limited to situations where possibly
asynchronous invocations don't harm if nothing else.

The call chain in the software ISP branch, DebayerCpu -> SwIspSimple ->
SimpleCameraData -> V4L2VideoDevice (thread-bound) -> EventNotifier (inheriting
Object), must be changed to eventually invoke V4L2VideoDevice::queueBuffer in
the right thread.  Is the recommended practice to examine thoroughly what's
running where or are there any simpler options?

Thanks,
Milan
Andrei Konovalov Jan. 22, 2024, 11:52 p.m. UTC | #3
On 22.01.2024 19:30, Milan Zamazal wrote:
> Laurent Pinchart <laurent.pinchart@ideasonboard.com> writes:
> 
>> Even if V4L2VideoDevice isn't addressed directly by this series, and the
>> bug report is related to usage of the V4L2VideoDevice class, I believe
>> the issue in the soft ISP code will be caught by the assertions added to
>> the EventNotifier class. Milan, would you be able to test this, and
>> confirm that libcamera now complains loudly ?
> 
> Nice cleanup, thank you!

+1, thanks Laurent!

> Yes, this complains immediately and loudly in the
> given case.

I was also able to get the assertion on my much faster board.
This required to use the highest resolution for capture, and it can take some
time (tens of seconds) before the assertion is triggered.

> Now a related question is how to deal with threads and instances of classes that
> don't inherit Object.
> 
> I think V4L2VideoDevice::queueBuffer is a good example.  V4L2VideoDevice is
> thread-bound but doesn't inherit Object.  So the caller, unless it has created
> the given instance itself (and being thread-bound itself or tracking its initial
> thread), has no idea about the thread a V4L2VideoDevice instance is bound too,
> even less how to call it there, and even less without worrying about
> implications of asynchronous calls.

I can create a proxy Object (to call V4L2VideoDevice::queueBuffer from) in the pipeline
handler thread and connect it to the Soft ISP's signal, but it doesn't look like a
nice solution...

> This means ensuring most things run in the same thread.  This may be easy in
> some cases and less fine in other ones.  It discourages using multiple threads,
> not sure whether this is a good thing or not.  It's not robust in any case.
> 
> A possible alternative would be making less things thread-bound.  I suppose
> there are reasons why this is not happening in libcamera.
> 
> Another alternative would be to use Object (as a direct base class or in
> wrappers) or Thread (together with initial thread tracking) to call the right
> thread in more places.  It would be limited to situations where possibly
> asynchronous invocations don't harm if nothing else.

...yes, something along these lines.

Thanks,
Andrei

> The call chain in the software ISP branch, DebayerCpu -> SwIspSimple ->
> SimpleCameraData -> V4L2VideoDevice (thread-bound) -> EventNotifier (inheriting
> Object), must be changed to eventually invoke V4L2VideoDevice::queueBuffer in
> the right thread.  Is the recommended practice to examine thoroughly what's
> running where or are there any simpler options?
> 
> Thanks,
> Milan
>
Laurent Pinchart Jan. 23, 2024, 12:02 a.m. UTC | #4
On Tue, Jan 23, 2024 at 02:52:04AM +0300, Andrei Konovalov wrote:
> On 22.01.2024 19:30, Milan Zamazal wrote:
> > Laurent Pinchart <laurent.pinchart@ideasonboard.com> writes:
> > 
> >> Even if V4L2VideoDevice isn't addressed directly by this series, and the
> >> bug report is related to usage of the V4L2VideoDevice class, I believe
> >> the issue in the soft ISP code will be caught by the assertions added to
> >> the EventNotifier class. Milan, would you be able to test this, and
> >> confirm that libcamera now complains loudly ?
> > 
> > Nice cleanup, thank you!
> 
> +1, thanks Laurent!
> 
> > Yes, this complains immediately and loudly in the
> > given case.
> 
> I was also able to get the assertion on my much faster board.
> This required to use the highest resolution for capture, and it can take some
> time (tens of seconds) before the assertion is triggered.
> 
> > Now a related question is how to deal with threads and instances of classes that
> > don't inherit Object.
> > 
> > I think V4L2VideoDevice::queueBuffer is a good example.  V4L2VideoDevice is
> > thread-bound but doesn't inherit Object.  So the caller, unless it has created
> > the given instance itself (and being thread-bound itself or tracking its initial
> > thread), has no idea about the thread a V4L2VideoDevice instance is bound too,
> > even less how to call it there, and even less without worrying about
> > implications of asynchronous calls.
> 
> I can create a proxy Object (to call V4L2VideoDevice::queueBuffer from) in the pipeline
> handler thread and connect it to the Soft ISP's signal, but it doesn't look like a
> nice solution...

Agreed, it's not very nice :-)

> > This means ensuring most things run in the same thread.  This may be easy in
> > some cases and less fine in other ones.  It discourages using multiple threads,
> > not sure whether this is a good thing or not.  It's not robust in any case.
> > 
> > A possible alternative would be making less things thread-bound.  I suppose
> > there are reasons why this is not happening in libcamera.

Yes, complexity. It's error-prone enough as it is, making more functions
callable from different threads will mean more race conditions as all
the callers will need to handle the thread-safety requirements. As much
as possible of the code base is single-threaded in the sense that
developers don't need to think about threads, and thread boundaries are
handled with strict rules that need to be considered only when adding a
new thread, which should be a rare case.

> > Another alternative would be to use Object (as a direct base class or in
> > wrappers) or Thread (together with initial thread tracking) to call the right
> > thread in more places.  It would be limited to situations where possibly
> > asynchronous invocations don't harm if nothing else.
> 
> ...yes, something along these lines.

A good solution may be to model the soft ISP object like if it was a
real hardware device. Give it functions to perform actions, and signals
to report events. Internally, create a proxy to the thread to relay the
function calls with invokeMethod(). The signals should work out of the
box, if you emit them from the ISP thread, and connect the pipeline
handler slots to them, you should get queued signal behaviour
automatically as PipelineHandler inherits from Object.

I think this will also make it easier to implement a GPU-based ISP
later.

> > The call chain in the software ISP branch, DebayerCpu -> SwIspSimple ->
> > SimpleCameraData -> V4L2VideoDevice (thread-bound) -> EventNotifier (inheriting
> > Object), must be changed to eventually invoke V4L2VideoDevice::queueBuffer in
> > the right thread.  Is the recommended practice to examine thoroughly what's
> > running where or are there any simpler options?

The soft ISP should not deal with V4L2 at all. It should be modelled as
a processing component that takes incoming calls and generates events
when processing is complete, like a V4L2VideoDevice does. It's then the
job of the pipeline handler to connect the building blocks to create a
pipeline.
Laurent Pinchart Jan. 23, 2024, 12:46 a.m. UTC | #5
On Mon, Jan 22, 2024 at 05:30:55PM +0100, Milan Zamazal wrote:
> Laurent Pinchart <laurent.pinchart@ideasonboard.com> writes:
> 
> > Even if V4L2VideoDevice isn't addressed directly by this series, and the
> > bug report is related to usage of the V4L2VideoDevice class, I believe
> > the issue in the soft ISP code will be caught by the assertions added to
> > the EventNotifier class. Milan, would you be able to test this, and
> > confirm that libcamera now complains loudly ?
> 
> Nice cleanup, thank you!  Yes, this complains immediately and loudly in the
> given case.
> 
> Now a related question is how to deal with threads and instances of classes that
> don't inherit Object.
> 
> I think V4L2VideoDevice::queueBuffer is a good example.  V4L2VideoDevice is
> thread-bound but doesn't inherit Object.  So the caller, unless it has created
> the given instance itself (and being thread-bound itself or tracking its initial
> thread), has no idea about the thread a V4L2VideoDevice instance is bound too,
> even less how to call it there, and even less without worrying about
> implications of asynchronous calls.

That's kind of the point actually, by design: if a class doesn't inherit
from Object, unless some of its functions are explicitly marked as being
thread-safe, then it's not meant to be called from different threads. 

> This means ensuring most things run in the same thread.  This may be easy in
> some cases and less fine in other ones.  It discourages using multiple threads,
> not sure whether this is a good thing or not.  It's not robust in any case.

Yes, it discourages usage of multiple threads. libcamera instead
encourages usage of event loops without blocking calls. Threads can be
used when there's a good reason to do so. As that's uncommon, most
development doesn't have to worry about race conditions. Only when
adding a thread should the developer know what they're doing.

> A possible alternative would be making less things thread-bound.  I suppose
> there are reasons why this is not happening in libcamera.
> 
> Another alternative would be to use Object (as a direct base class or in
> wrappers) or Thread (together with initial thread tracking) to call the right
> thread in more places.  It would be limited to situations where possibly
> asynchronous invocations don't harm if nothing else.
> 
> The call chain in the software ISP branch, DebayerCpu -> SwIspSimple ->
> SimpleCameraData -> V4L2VideoDevice (thread-bound) -> EventNotifier (inheriting
> Object), must be changed to eventually invoke V4L2VideoDevice::queueBuffer in
> the right thread.  Is the recommended practice to examine thoroughly what's
> running where or are there any simpler options?