| Message ID | 20260518201508.140849-5-robert.mader@collabora.com |
|---|---|
| State | Superseded |
| Headers | show |
| Series |
|
| Related | show |
On 18/05/2026 21:15, Robert Mader wrote: > Instead of waiting for the GPU to finish the output buffer *before* > computing stats, do so afterwards. This allows work to happen in > parallel on the GPU and CPU, potentially improving throughput and > reducing latency. > > This, however, requires us to include stats computation into the > debayer benchmark data. > > Signed-off-by: Robert Mader <robert.mader@collabora.com> > Reviewed-by: Barnabás Pőcze <barnabas.pocze@ideasonboard.com> > --- > src/libcamera/software_isp/debayer_egl.cpp | 7 +++---- > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/src/libcamera/software_isp/debayer_egl.cpp b/src/libcamera/software_isp/debayer_egl.cpp > index e5a661d2c..9ad0c871f 100644 > --- a/src/libcamera/software_isp/debayer_egl.cpp > +++ b/src/libcamera/software_isp/debayer_egl.cpp > @@ -530,8 +530,6 @@ int DebayerEGL::debayerGPU(FrameBuffer *input, std::optional<DmaSyncer> *inputBu > if (err != GL_NO_ERROR) { > LOG(eGL, Error) << "Drawing scene fail " << err; > return -ENODEV; > - } else { > - egl_.syncOutput(); > } > > return 0; > @@ -554,8 +552,6 @@ void DebayerEGL::process(uint32_t frame, FrameBuffer *input, FrameBuffer *output > goto error; > } > > - bench_.finishFrame(); > - > metadata.planes()[0].bytesused = output->planes()[0].length; > > /* Calculate stats for the whole frame */ > @@ -564,6 +560,9 @@ void DebayerEGL::process(uint32_t frame, FrameBuffer *input, FrameBuffer *output > stats_->processFrame(frame, 0, input); > inputBufferDmaSyncer.reset(); > > + egl_.syncOutput(); > + bench_.finishFrame(); > + > outputBufferReady.emit(output); > inputBufferReady.emit(input); > > -- > 2.54.0 > Interesting. I'm wondering what effect if any there is on the CPU cache as a result of this change. In the round I think its a good change in theory but, being difficult I'd also like some metrics to prove it out. Can we valgrind this or oprofile it and see what kind of before/after metrics we get. --- bod
On 19.05.26 18:01, Bryan O'Donoghue wrote: > Interesting. > > I'm wondering what effect if any there is on the CPU cache as a result > of this change. > > In the round I think its a good change in theory but, being difficult > I'd also like some metrics to prove it out. > > Can we valgrind this or oprofile it and see what kind of before/after > metrics we get. I'll try to do some benchmarks. In theory the improved parallelism should simply shorten the overall time spend in process().
On 19/05/2026 20:50, Robert Mader wrote: > On 19.05.26 18:01, Bryan O'Donoghue wrote: >> Interesting. >> >> I'm wondering what effect if any there is on the CPU cache as a result >> of this change. >> >> In the round I think its a good change in theory but, being difficult >> I'd also like some metrics to prove it out. >> >> Can we valgrind this or oprofile it and see what kind of before/after >> metrics we get. > I'll try to do some benchmarks. In theory the improved parallelism > should simply shorten the overall time spend in process(). I'm really just probing - now that we will operate on the same data in parallel in the CPU and GPU - or could - what side-effect if any is there on caches... Should be none, each xPU has its own view of memory, its own cache, its own PTEs... --- bod
On 20.05.26 09:42, Bryan O'Donoghue wrote: > On 19/05/2026 20:50, Robert Mader wrote: >> On 19.05.26 18:01, Bryan O'Donoghue wrote: >>> Interesting. >>> >>> I'm wondering what effect if any there is on the CPU cache as a >>> result of this change. >>> >>> In the round I think its a good change in theory but, being >>> difficult I'd also like some metrics to prove it out. >>> >>> Can we valgrind this or oprofile it and see what kind of >>> before/after metrics we get. >> I'll try to do some benchmarks. In theory the improved parallelism >> should simply shorten the overall time spend in process(). > > I'm really just probing - now that we will operate on the same data in > parallel in the CPU and GPU - or could - what side-effect if any is > there on caches... > > Should be none, each xPU has its own view of memory, its own cache, > its own PTEs... I just tried to get some numbers for this running `cam -c ... -s width=1920,height=1080 --capture=60` a bunch of times on my Fairphone 5 and Pixel 3a (both set to the power safe governor). For the FP5 I couldn't observe any meaningful difference, but on the (much weaker) Pixel the impact seems to be both reliably reproducible and significant. For the front camera - where dmabuf import works at this resolution - frame times drop from ~23000 us/frame to ~16000 us/frame. When changing kStatPerNumFrames from 4 to 1 - collecting stats every frame, as we only expect to see changes on frames where stats are collected - the difference is even more dramatic: dropping from ~39000 us/frame to ~18000 us/frame. In both cases this is also reflected in the reported fps, 60 fps being reached much more reliably with the patch. So at least in this specific scenario collecting stats every frame is now almost as fast as only collecting them on every 4th frame, suggesting that the limiting factor are the (pretty weak) GPU cores. I'll do some more tests later and include them in the commit message for v4. In any case the numbers look quite encouraging, especially considering that more complex shaders will likely tilt the balance even more to the GPU being the blocker.
On 20.05.26 11:16, Robert Mader wrote: > > On 20.05.26 09:42, Bryan O'Donoghue wrote: >> On 19/05/2026 20:50, Robert Mader wrote: >>> On 19.05.26 18:01, Bryan O'Donoghue wrote: >>>> Interesting. >>>> >>>> I'm wondering what effect if any there is on the CPU cache as a >>>> result of this change. >>>> >>>> In the round I think its a good change in theory but, being >>>> difficult I'd also like some metrics to prove it out. >>>> >>>> Can we valgrind this or oprofile it and see what kind of >>>> before/after metrics we get. >>> I'll try to do some benchmarks. In theory the improved parallelism >>> should simply shorten the overall time spend in process(). >> >> I'm really just probing - now that we will operate on the same data >> in parallel in the CPU and GPU - or could - what side-effect if any >> is there on caches... >> >> Should be none, each xPU has its own view of memory, its own cache, >> its own PTEs... > > I just tried to get some numbers for this running `cam -c ... -s > width=1920,height=1080 --capture=60` a bunch of times on my Fairphone > 5 and Pixel 3a (both set to the power safe governor). For the FP5 I > couldn't observe any meaningful difference, but on the (much weaker) > Pixel the impact seems to be both reliably reproducible and > significant. For the front camera - where dmabuf import works at this > resolution - frame times drop from ~23000 us/frame to ~16000 us/frame. > When changing kStatPerNumFrames from 4 to 1 - collecting stats every > frame, as we only expect to see changes on frames where stats are > collected - the difference is even more dramatic: dropping from ~39000 > us/frame to ~18000 us/frame. In both cases this is also reflected in > the reported fps, 60 fps being reached much more reliably with the patch. > > So at least in this specific scenario collecting stats every frame is > now almost as fast as only collecting them on every 4th frame, > suggesting that the limiting factor are the (pretty weak) GPU cores. > > I'll do some more tests later and include them in the commit message > for v4. > > In any case the numbers look quite encouraging, especially considering > that more complex shaders will likely tilt the balance even more to > the GPU being the blocker. > P.S.: if anyone wants to test this before I send out v4, please use this branch: https://gitlab.freedesktop.org/rmader/libcamera/-/commits/swisp-dmabuf-import-v4 It turned out that the improvement is only possible when running glFlush() before doing the CPU work (see last commit) - IIUC because we otherwise starve the GPU.
diff --git a/src/libcamera/software_isp/debayer_egl.cpp b/src/libcamera/software_isp/debayer_egl.cpp index e5a661d2c..9ad0c871f 100644 --- a/src/libcamera/software_isp/debayer_egl.cpp +++ b/src/libcamera/software_isp/debayer_egl.cpp @@ -530,8 +530,6 @@ int DebayerEGL::debayerGPU(FrameBuffer *input, std::optional<DmaSyncer> *inputBu if (err != GL_NO_ERROR) { LOG(eGL, Error) << "Drawing scene fail " << err; return -ENODEV; - } else { - egl_.syncOutput(); } return 0; @@ -554,8 +552,6 @@ void DebayerEGL::process(uint32_t frame, FrameBuffer *input, FrameBuffer *output goto error; } - bench_.finishFrame(); - metadata.planes()[0].bytesused = output->planes()[0].length; /* Calculate stats for the whole frame */ @@ -564,6 +560,9 @@ void DebayerEGL::process(uint32_t frame, FrameBuffer *input, FrameBuffer *output stats_->processFrame(frame, 0, input); inputBufferDmaSyncer.reset(); + egl_.syncOutput(); + bench_.finishFrame(); + outputBufferReady.emit(output); inputBufferReady.emit(input);