[v5,3/5] software_isp: debayer_cpu: Add multi-threading support
diff mbox series

Message ID 20260304075052.11599-4-johannes.goede@oss.qualcomm.com
State New
Headers show
Series
  • software_isp: debayer_cpu: Add multi-threading support
Related show

Commit Message

Hans de Goede March 4, 2026, 7:50 a.m. UTC
Add CPU soft ISP multi-threading support.

Benchmark results for the Arduino Uno-Q with a weak CPU which is good for
performance testing, all numbers with an IMX219 running at
3280x2464 -> 3272x2464:

1 thread : 147ms / frame, ~6.5 fps
2 threads:  80ms / frame, ~12.5 fps
3 threads:  65ms / frame, ~15 fps

Adding a 4th thread does not improve performance.

Signed-off-by: Hans de Goede <johannes.goede@oss.qualcomm.com>
---
Changes in v5:
- Extend software_isp.threads docs in runtime_configuration.rst

Changes in v4:
- Document software_isp.threads option in runtime_configuration.rst
- Add an use constants for min/max/default number of threads

Changes in v3:
- Adjust for DebayerCpuThread now inheriting from Thread
- Use for (auto &thread : threads_)

Changes in v2:
- Adjust to use the new DebayerCpuThread class introduced in the v2 patch-series
- Re-use threads instead of starting new threads every frame
---
 Documentation/runtime_configuration.rst    |  8 ++++
 src/libcamera/software_isp/debayer_cpu.cpp | 45 ++++++++++++++++++++--
 src/libcamera/software_isp/debayer_cpu.h   | 10 +++++
 3 files changed, 60 insertions(+), 3 deletions(-)

Comments

Milan Zamazal March 4, 2026, 9:57 a.m. UTC | #1
Hi Hans,

thank you for the update.

Hans de Goede <johannes.goede@oss.qualcomm.com> writes:

> Add CPU soft ISP multi-threading support.
>
> Benchmark results for the Arduino Uno-Q with a weak CPU which is good for
> performance testing, all numbers with an IMX219 running at
> 3280x2464 -> 3272x2464:
>
> 1 thread : 147ms / frame, ~6.5 fps
> 2 threads:  80ms / frame, ~12.5 fps
> 3 threads:  65ms / frame, ~15 fps
>
> Adding a 4th thread does not improve performance.

In my TI AM69 environment and with CCM, 2 CPU threads are about of the
same speed as GPU, and the speed increases almost linearly up to all the
8 threads.  Nice.

> Signed-off-by: Hans de Goede <johannes.goede@oss.qualcomm.com>

Reviewed-by: Milan Zamazal <mzamazal@redhat.com>

> ---
> Changes in v5:
> - Extend software_isp.threads docs in runtime_configuration.rst
>
> Changes in v4:
> - Document software_isp.threads option in runtime_configuration.rst
> - Add an use constants for min/max/default number of threads
>
> Changes in v3:
> - Adjust for DebayerCpuThread now inheriting from Thread
> - Use for (auto &thread : threads_)
>
> Changes in v2:
> - Adjust to use the new DebayerCpuThread class introduced in the v2 patch-series
> - Re-use threads instead of starting new threads every frame
> ---
>  Documentation/runtime_configuration.rst    |  8 ++++
>  src/libcamera/software_isp/debayer_cpu.cpp | 45 ++++++++++++++++++++--
>  src/libcamera/software_isp/debayer_cpu.h   | 10 +++++
>  3 files changed, 60 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/runtime_configuration.rst b/Documentation/runtime_configuration.rst
> index e99ef2fb9..651929a4d 100644
> --- a/Documentation/runtime_configuration.rst
> +++ b/Documentation/runtime_configuration.rst
> @@ -51,6 +51,7 @@ file structure:
>        measure:
>          skip: # non-negative integer, frames to skip initially
>          number: # non-negative integer, frames to measure
> +      threads: # integer >= 1, number of render threads to use, default 2
>  
>  Configuration file example
>  --------------------------
> @@ -84,6 +85,7 @@ Configuration file example
>         measure:
>           skip: 50
>           number: 30
> +       threads: 2
>  
>  List of variables and configuration options
>  -------------------------------------------
> @@ -167,6 +169,12 @@ software_isp.measure.skip, software_isp.measure.number
>  
>     Example `number` value: ``30``
>  
> +software_isp.threads
> +   Number of render threads the software ISP uses when using the CPU.
> +   This must be between 1 and 8 and the default is 2.
> +
> +   Example value: ``2``
> +
>  Further details
>  ---------------
>  
> diff --git a/src/libcamera/software_isp/debayer_cpu.cpp b/src/libcamera/software_isp/debayer_cpu.cpp
> index d57d640df..f7c3e1751 100644
> --- a/src/libcamera/software_isp/debayer_cpu.cpp
> +++ b/src/libcamera/software_isp/debayer_cpu.cpp
> @@ -76,6 +76,7 @@ DebayerCpuThread::DebayerCpuThread(DebayerCpu *debayer, unsigned int threadIndex
>  	  debayer_(debayer), threadIndex_(threadIndex),
>  	  enableInputMemcpy_(enableInputMemcpy)
>  {
> +	moveToThread(this);
>  }
>  
>  /**
> @@ -107,8 +108,10 @@ DebayerCpu::DebayerCpu(std::unique_ptr<SwStatsCpu> stats, const GlobalConfigurat
>  	bool enableInputMemcpy =
>  		configuration.option<bool>({ "software_isp", "copy_input_buffer" }).value_or(true);
>  
> -	/* Just one thread object for now, which will be called inline rather than async */
> -	threads_.resize(1);
> +	unsigned int threadCount =
> +		configuration.option<unsigned int>({ "software_isp", "threads" }).value_or(kDefaultThreads);
> +	threadCount = std::clamp(threadCount, kMinThreads, kMaxThreads);
> +	threads_.resize(threadCount);
>  
>  	for (unsigned int i = 0; i < threads_.size(); i++)
>  		threads_[i] = std::make_unique<DebayerCpuThread>(this, i, enableInputMemcpy);
> @@ -746,6 +749,11 @@ void DebayerCpuThread::process(uint32_t frame, const uint8_t *src, uint8_t *dst)
>  		process2(frame, src, dst);
>  	else
>  		process4(frame, src, dst);
> +
> +	debayer_->workPendingMutex_.lock();
> +	debayer_->workPending_ &= ~(1 << threadIndex_);
> +	debayer_->workPendingMutex_.unlock();
> +	debayer_->workPendingCv_.notify_one();
>  }
>  
>  void DebayerCpuThread::process2(uint32_t frame, const uint8_t *src, uint8_t *dst)
> @@ -985,7 +993,21 @@ void DebayerCpu::process(uint32_t frame, FrameBuffer *input, FrameBuffer *output
>  
>  	stats_->startFrame(frame);
>  
> -	threads_[0]->process(frame, in.planes()[0].data(), out.planes()[0].data());
> +	workPendingMutex_.lock();
> +	workPending_ = (1 << threads_.size()) - 1;
> +	workPendingMutex_.unlock();
> +
> +	for (auto &thread : threads_)
> +		thread->invokeMethod(&DebayerCpuThread::process,
> +				     ConnectionTypeQueued, frame,
> +				     in.planes()[0].data(), out.planes()[0].data());
> +
> +	{
> +		MutexLocker locker(workPendingMutex_);
> +		workPendingCv_.wait(locker, [&]() LIBCAMERA_TSA_REQUIRES(workPendingMutex_) {
> +			return workPending_ == 0;
> +		});
> +	}
>  
>  	metadata.planes()[0].bytesused = out.planes()[0].size();
>  
> @@ -1004,6 +1026,23 @@ void DebayerCpu::process(uint32_t frame, FrameBuffer *input, FrameBuffer *output
>  	inputBufferReady.emit(input);
>  }
>  
> +int DebayerCpu::start()
> +{
> +	for (auto &thread : threads_)
> +		thread->start();
> +
> +	return 0;
> +}
> +
> +void DebayerCpu::stop()
> +{
> +	for (auto &thread : threads_)
> +		thread->exit();
> +
> +	for (auto &thread : threads_)
> +		thread->wait();
> +}
> +
>  SizeRange DebayerCpu::sizes(PixelFormat inputFormat, const Size &inputSize)
>  {
>  	Size patternSize = this->patternSize(inputFormat);
> diff --git a/src/libcamera/software_isp/debayer_cpu.h b/src/libcamera/software_isp/debayer_cpu.h
> index 780576090..11280c0a1 100644
> --- a/src/libcamera/software_isp/debayer_cpu.h
> +++ b/src/libcamera/software_isp/debayer_cpu.h
> @@ -16,6 +16,7 @@
>  #include <vector>
>  
>  #include <libcamera/base/object.h>
> +#include <libcamera/base/mutex.h>
>  
>  #include "libcamera/internal/bayer_format.h"
>  #include "libcamera/internal/global_configuration.h"
> @@ -41,6 +42,8 @@ public:
>  	std::tuple<unsigned int, unsigned int>
>  	strideAndFrameSize(const PixelFormat &outputFormat, const Size &size);
>  	void process(uint32_t frame, FrameBuffer *input, FrameBuffer *output, const DebayerParams &params);
> +	int start();
> +	void stop();
>  	SizeRange sizes(PixelFormat inputFormat, const Size &inputSize);
>  	const SharedFD &getStatsFD() { return stats_->getStatsFD(); }
>  
> @@ -144,6 +147,13 @@ private:
>  	std::unique_ptr<SwStatsCpu> stats_;
>  	unsigned int xShift_; /* Offset of 0/1 applied to window_.x */
>  
> +	static constexpr unsigned int kMinThreads = 1;
> +	static constexpr unsigned int kMaxThreads = 8;
> +	static constexpr unsigned int kDefaultThreads = 2;
> +
> +	unsigned int workPending_ LIBCAMERA_TSA_GUARDED_BY(workPendingMutex_);
> +	Mutex workPendingMutex_;
> +	ConditionVariable workPendingCv_;
>  	std::vector<std::unique_ptr<DebayerCpuThread>>threads_;
>  };

Patch
diff mbox series

diff --git a/Documentation/runtime_configuration.rst b/Documentation/runtime_configuration.rst
index e99ef2fb9..651929a4d 100644
--- a/Documentation/runtime_configuration.rst
+++ b/Documentation/runtime_configuration.rst
@@ -51,6 +51,7 @@  file structure:
       measure:
         skip: # non-negative integer, frames to skip initially
         number: # non-negative integer, frames to measure
+      threads: # integer >= 1, number of render threads to use, default 2
 
 Configuration file example
 --------------------------
@@ -84,6 +85,7 @@  Configuration file example
        measure:
          skip: 50
          number: 30
+       threads: 2
 
 List of variables and configuration options
 -------------------------------------------
@@ -167,6 +169,12 @@  software_isp.measure.skip, software_isp.measure.number
 
    Example `number` value: ``30``
 
+software_isp.threads
+   Number of render threads the software ISP uses when using the CPU.
+   This must be between 1 and 8 and the default is 2.
+
+   Example value: ``2``
+
 Further details
 ---------------
 
diff --git a/src/libcamera/software_isp/debayer_cpu.cpp b/src/libcamera/software_isp/debayer_cpu.cpp
index d57d640df..f7c3e1751 100644
--- a/src/libcamera/software_isp/debayer_cpu.cpp
+++ b/src/libcamera/software_isp/debayer_cpu.cpp
@@ -76,6 +76,7 @@  DebayerCpuThread::DebayerCpuThread(DebayerCpu *debayer, unsigned int threadIndex
 	  debayer_(debayer), threadIndex_(threadIndex),
 	  enableInputMemcpy_(enableInputMemcpy)
 {
+	moveToThread(this);
 }
 
 /**
@@ -107,8 +108,10 @@  DebayerCpu::DebayerCpu(std::unique_ptr<SwStatsCpu> stats, const GlobalConfigurat
 	bool enableInputMemcpy =
 		configuration.option<bool>({ "software_isp", "copy_input_buffer" }).value_or(true);
 
-	/* Just one thread object for now, which will be called inline rather than async */
-	threads_.resize(1);
+	unsigned int threadCount =
+		configuration.option<unsigned int>({ "software_isp", "threads" }).value_or(kDefaultThreads);
+	threadCount = std::clamp(threadCount, kMinThreads, kMaxThreads);
+	threads_.resize(threadCount);
 
 	for (unsigned int i = 0; i < threads_.size(); i++)
 		threads_[i] = std::make_unique<DebayerCpuThread>(this, i, enableInputMemcpy);
@@ -746,6 +749,11 @@  void DebayerCpuThread::process(uint32_t frame, const uint8_t *src, uint8_t *dst)
 		process2(frame, src, dst);
 	else
 		process4(frame, src, dst);
+
+	debayer_->workPendingMutex_.lock();
+	debayer_->workPending_ &= ~(1 << threadIndex_);
+	debayer_->workPendingMutex_.unlock();
+	debayer_->workPendingCv_.notify_one();
 }
 
 void DebayerCpuThread::process2(uint32_t frame, const uint8_t *src, uint8_t *dst)
@@ -985,7 +993,21 @@  void DebayerCpu::process(uint32_t frame, FrameBuffer *input, FrameBuffer *output
 
 	stats_->startFrame(frame);
 
-	threads_[0]->process(frame, in.planes()[0].data(), out.planes()[0].data());
+	workPendingMutex_.lock();
+	workPending_ = (1 << threads_.size()) - 1;
+	workPendingMutex_.unlock();
+
+	for (auto &thread : threads_)
+		thread->invokeMethod(&DebayerCpuThread::process,
+				     ConnectionTypeQueued, frame,
+				     in.planes()[0].data(), out.planes()[0].data());
+
+	{
+		MutexLocker locker(workPendingMutex_);
+		workPendingCv_.wait(locker, [&]() LIBCAMERA_TSA_REQUIRES(workPendingMutex_) {
+			return workPending_ == 0;
+		});
+	}
 
 	metadata.planes()[0].bytesused = out.planes()[0].size();
 
@@ -1004,6 +1026,23 @@  void DebayerCpu::process(uint32_t frame, FrameBuffer *input, FrameBuffer *output
 	inputBufferReady.emit(input);
 }
 
+int DebayerCpu::start()
+{
+	for (auto &thread : threads_)
+		thread->start();
+
+	return 0;
+}
+
+void DebayerCpu::stop()
+{
+	for (auto &thread : threads_)
+		thread->exit();
+
+	for (auto &thread : threads_)
+		thread->wait();
+}
+
 SizeRange DebayerCpu::sizes(PixelFormat inputFormat, const Size &inputSize)
 {
 	Size patternSize = this->patternSize(inputFormat);
diff --git a/src/libcamera/software_isp/debayer_cpu.h b/src/libcamera/software_isp/debayer_cpu.h
index 780576090..11280c0a1 100644
--- a/src/libcamera/software_isp/debayer_cpu.h
+++ b/src/libcamera/software_isp/debayer_cpu.h
@@ -16,6 +16,7 @@ 
 #include <vector>
 
 #include <libcamera/base/object.h>
+#include <libcamera/base/mutex.h>
 
 #include "libcamera/internal/bayer_format.h"
 #include "libcamera/internal/global_configuration.h"
@@ -41,6 +42,8 @@  public:
 	std::tuple<unsigned int, unsigned int>
 	strideAndFrameSize(const PixelFormat &outputFormat, const Size &size);
 	void process(uint32_t frame, FrameBuffer *input, FrameBuffer *output, const DebayerParams &params);
+	int start();
+	void stop();
 	SizeRange sizes(PixelFormat inputFormat, const Size &inputSize);
 	const SharedFD &getStatsFD() { return stats_->getStatsFD(); }
 
@@ -144,6 +147,13 @@  private:
 	std::unique_ptr<SwStatsCpu> stats_;
 	unsigned int xShift_; /* Offset of 0/1 applied to window_.x */
 
+	static constexpr unsigned int kMinThreads = 1;
+	static constexpr unsigned int kMaxThreads = 8;
+	static constexpr unsigned int kDefaultThreads = 2;
+
+	unsigned int workPending_ LIBCAMERA_TSA_GUARDED_BY(workPendingMutex_);
+	Mutex workPendingMutex_;
+	ConditionVariable workPendingCv_;
 	std::vector<std::unique_ptr<DebayerCpuThread>>threads_;
 };