[v2,4/4] debayer_egl: Sync output buffer after processing stats
diff mbox series

Message ID 20260518201508.140849-5-robert.mader@collabora.com
State Superseded
Headers show
Series
  • software_isp: Implement DMABuf import for input buffers
Related show

Commit Message

Robert Mader May 18, 2026, 8:15 p.m. UTC
Instead of waiting for the GPU to finish the output buffer *before*
computing stats, do so afterwards. This allows work to happen in
parallel on the GPU and CPU, potentially improving throughput and
reducing latency.

This, however, requires us to include stats computation into the
debayer benchmark data.

Signed-off-by: Robert Mader <robert.mader@collabora.com>
Reviewed-by: Barnabás Pőcze <barnabas.pocze@ideasonboard.com>
---
 src/libcamera/software_isp/debayer_egl.cpp | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

Comments

Bryan O'Donoghue May 19, 2026, 4:01 p.m. UTC | #1
On 18/05/2026 21:15, Robert Mader wrote:
> Instead of waiting for the GPU to finish the output buffer *before*
> computing stats, do so afterwards. This allows work to happen in
> parallel on the GPU and CPU, potentially improving throughput and
> reducing latency.
> 
> This, however, requires us to include stats computation into the
> debayer benchmark data.
> 
> Signed-off-by: Robert Mader <robert.mader@collabora.com>
> Reviewed-by: Barnabás Pőcze <barnabas.pocze@ideasonboard.com>
> ---
>   src/libcamera/software_isp/debayer_egl.cpp | 7 +++----
>   1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/src/libcamera/software_isp/debayer_egl.cpp b/src/libcamera/software_isp/debayer_egl.cpp
> index e5a661d2c..9ad0c871f 100644
> --- a/src/libcamera/software_isp/debayer_egl.cpp
> +++ b/src/libcamera/software_isp/debayer_egl.cpp
> @@ -530,8 +530,6 @@ int DebayerEGL::debayerGPU(FrameBuffer *input, std::optional<DmaSyncer> *inputBu
>   	if (err != GL_NO_ERROR) {
>   		LOG(eGL, Error) << "Drawing scene fail " << err;
>   		return -ENODEV;
> -	} else {
> -		egl_.syncOutput();
>   	}
> 
>   	return 0;
> @@ -554,8 +552,6 @@ void DebayerEGL::process(uint32_t frame, FrameBuffer *input, FrameBuffer *output
>   		goto error;
>   	}
> 
> -	bench_.finishFrame();
> -
>   	metadata.planes()[0].bytesused = output->planes()[0].length;
> 
>   	/* Calculate stats for the whole frame */
> @@ -564,6 +560,9 @@ void DebayerEGL::process(uint32_t frame, FrameBuffer *input, FrameBuffer *output
>   	stats_->processFrame(frame, 0, input);
>   	inputBufferDmaSyncer.reset();
> 
> +	egl_.syncOutput();
> +	bench_.finishFrame();
> +
>   	outputBufferReady.emit(output);
>   	inputBufferReady.emit(input);
> 
> --
> 2.54.0
> 

Interesting.

I'm wondering what effect if any there is on the CPU cache as a result 
of this change.

In the round I think its a good change in theory but, being difficult 
I'd also like some metrics to prove it out.

Can we valgrind this or oprofile it and see what kind of before/after 
metrics we get.

---
bod
Robert Mader May 19, 2026, 7:50 p.m. UTC | #2
On 19.05.26 18:01, Bryan O'Donoghue wrote:
> Interesting.
>
> I'm wondering what effect if any there is on the CPU cache as a result 
> of this change.
>
> In the round I think its a good change in theory but, being difficult 
> I'd also like some metrics to prove it out.
>
> Can we valgrind this or oprofile it and see what kind of before/after 
> metrics we get. 
I'll try to do some benchmarks. In theory the improved parallelism 
should simply shorten the overall time spend in process().
Bryan O'Donoghue May 20, 2026, 7:42 a.m. UTC | #3
On 19/05/2026 20:50, Robert Mader wrote:
> On 19.05.26 18:01, Bryan O'Donoghue wrote:
>> Interesting.
>>
>> I'm wondering what effect if any there is on the CPU cache as a result 
>> of this change.
>>
>> In the round I think its a good change in theory but, being difficult 
>> I'd also like some metrics to prove it out.
>>
>> Can we valgrind this or oprofile it and see what kind of before/after 
>> metrics we get. 
> I'll try to do some benchmarks. In theory the improved parallelism 
> should simply shorten the overall time spend in process().

I'm really just probing - now that we will operate on the same data in 
parallel in the CPU and GPU - or could - what side-effect if any is 
there on caches...

Should be none, each xPU has its own view of memory, its own cache, its 
own PTEs...

---
bod
Robert Mader May 20, 2026, 9:16 a.m. UTC | #4
On 20.05.26 09:42, Bryan O'Donoghue wrote:
> On 19/05/2026 20:50, Robert Mader wrote:
>> On 19.05.26 18:01, Bryan O'Donoghue wrote:
>>> Interesting.
>>>
>>> I'm wondering what effect if any there is on the CPU cache as a 
>>> result of this change.
>>>
>>> In the round I think its a good change in theory but, being 
>>> difficult I'd also like some metrics to prove it out.
>>>
>>> Can we valgrind this or oprofile it and see what kind of 
>>> before/after metrics we get. 
>> I'll try to do some benchmarks. In theory the improved parallelism 
>> should simply shorten the overall time spend in process().
>
> I'm really just probing - now that we will operate on the same data in 
> parallel in the CPU and GPU - or could - what side-effect if any is 
> there on caches...
>
> Should be none, each xPU has its own view of memory, its own cache, 
> its own PTEs... 

I just tried to get some numbers for this running `cam -c ... -s 
width=1920,height=1080 --capture=60` a bunch of times on my Fairphone 5 
and Pixel 3a (both set to the power safe governor). For the FP5 I 
couldn't observe any meaningful difference, but on the (much weaker) 
Pixel the impact seems to be both reliably reproducible and significant. 
For the front camera - where dmabuf import works at this resolution - 
frame times drop from ~23000 us/frame to ~16000 us/frame. When changing 
kStatPerNumFrames from 4 to 1 - collecting stats every frame, as we only 
expect to see changes on frames where stats are collected - the 
difference is even more dramatic: dropping from ~39000 us/frame to 
~18000 us/frame. In both cases this is also reflected in the reported 
fps, 60 fps being reached much more reliably with the patch.

So at least in this specific scenario collecting stats every frame is 
now almost as fast as only collecting them on every 4th frame, 
suggesting that the limiting factor are the (pretty weak) GPU cores.

I'll do some more tests later and include them in the commit message for v4.

In any case the numbers look quite encouraging, especially considering 
that more complex shaders will likely tilt the balance even more to the 
GPU being the blocker.
Robert Mader May 20, 2026, 9:28 a.m. UTC | #5
On 20.05.26 11:16, Robert Mader wrote:
>
> On 20.05.26 09:42, Bryan O'Donoghue wrote:
>> On 19/05/2026 20:50, Robert Mader wrote:
>>> On 19.05.26 18:01, Bryan O'Donoghue wrote:
>>>> Interesting.
>>>>
>>>> I'm wondering what effect if any there is on the CPU cache as a 
>>>> result of this change.
>>>>
>>>> In the round I think its a good change in theory but, being 
>>>> difficult I'd also like some metrics to prove it out.
>>>>
>>>> Can we valgrind this or oprofile it and see what kind of 
>>>> before/after metrics we get. 
>>> I'll try to do some benchmarks. In theory the improved parallelism 
>>> should simply shorten the overall time spend in process().
>>
>> I'm really just probing - now that we will operate on the same data 
>> in parallel in the CPU and GPU - or could - what side-effect if any 
>> is there on caches...
>>
>> Should be none, each xPU has its own view of memory, its own cache, 
>> its own PTEs... 
>
> I just tried to get some numbers for this running `cam -c ... -s 
> width=1920,height=1080 --capture=60` a bunch of times on my Fairphone 
> 5 and Pixel 3a (both set to the power safe governor). For the FP5 I 
> couldn't observe any meaningful difference, but on the (much weaker) 
> Pixel the impact seems to be both reliably reproducible and 
> significant. For the front camera - where dmabuf import works at this 
> resolution - frame times drop from ~23000 us/frame to ~16000 us/frame. 
> When changing kStatPerNumFrames from 4 to 1 - collecting stats every 
> frame, as we only expect to see changes on frames where stats are 
> collected - the difference is even more dramatic: dropping from ~39000 
> us/frame to ~18000 us/frame. In both cases this is also reflected in 
> the reported fps, 60 fps being reached much more reliably with the patch.
>
> So at least in this specific scenario collecting stats every frame is 
> now almost as fast as only collecting them on every 4th frame, 
> suggesting that the limiting factor are the (pretty weak) GPU cores.
>
> I'll do some more tests later and include them in the commit message 
> for v4.
>
> In any case the numbers look quite encouraging, especially considering 
> that more complex shaders will likely tilt the balance even more to 
> the GPU being the blocker.
>
P.S.: if anyone wants to test this before I send out v4, please use this 
branch: 
https://gitlab.freedesktop.org/rmader/libcamera/-/commits/swisp-dmabuf-import-v4

It turned out that the improvement is only possible when running 
glFlush() before doing the CPU work (see last commit) - IIUC because we 
otherwise starve the GPU.

Patch
diff mbox series

diff --git a/src/libcamera/software_isp/debayer_egl.cpp b/src/libcamera/software_isp/debayer_egl.cpp
index e5a661d2c..9ad0c871f 100644
--- a/src/libcamera/software_isp/debayer_egl.cpp
+++ b/src/libcamera/software_isp/debayer_egl.cpp
@@ -530,8 +530,6 @@  int DebayerEGL::debayerGPU(FrameBuffer *input, std::optional<DmaSyncer> *inputBu
 	if (err != GL_NO_ERROR) {
 		LOG(eGL, Error) << "Drawing scene fail " << err;
 		return -ENODEV;
-	} else {
-		egl_.syncOutput();
 	}
 
 	return 0;
@@ -554,8 +552,6 @@  void DebayerEGL::process(uint32_t frame, FrameBuffer *input, FrameBuffer *output
 		goto error;
 	}
 
-	bench_.finishFrame();
-
 	metadata.planes()[0].bytesused = output->planes()[0].length;
 
 	/* Calculate stats for the whole frame */
@@ -564,6 +560,9 @@  void DebayerEGL::process(uint32_t frame, FrameBuffer *input, FrameBuffer *output
 	stats_->processFrame(frame, 0, input);
 	inputBufferDmaSyncer.reset();
 
+	egl_.syncOutput();
+	bench_.finishFrame();
+
 	outputBufferReady.emit(output);
 	inputBufferReady.emit(input);