| Message ID | 20260216190204.106922-4-johannes.goede@oss.qualcomm.com |
|---|---|
| State | Superseded |
| Headers | show |
| Series |
|
| Related | show |
Hi Hans, thank you for the patch. Hans de Goede <johannes.goede@oss.qualcomm.com> writes: > Group variables used every pixel together, followed by variables used > every lines and then lastly variables only used every frame. > > The idea here is to have all the data used every pixel fit in as few > cachelines as possible. > > Benchmarking does not show any differerence before after, possibly > because most of the per pixel lookup tables where already grouped > together. The struct layout may be indeed critical, IIRC I got ~10% penalty when I had to switch the lookup table arrangement to fix wrong CCM multiplication order. Maybe there is a hidden opportunity for significant speedup somewhere. But this is a different problem than what this patch solves. With the typo below fixed: Reviewed-by: Milan Zamazal <mzamazal@redhat.com> > Despite that this still seems like a good idea. > > Signed-off-by: Hans de Goede <johannes.goede@oss.qualcomm.com> > --- > src/libcamera/software_isp/debayer_cpu.h | 13 ++++++++----- > 1 file changed, 8 insertions(+), 5 deletions(-) > > diff --git a/src/libcamera/software_isp/debayer_cpu.h b/src/libcamera/software_isp/debayer_cpu.h > index 800b018c..a54418dc 100644 > --- a/src/libcamera/software_isp/debayer_cpu.h > +++ b/src/libcamera/software_isp/debayer_cpu.h > @@ -135,6 +135,7 @@ private: > }; > using LookupTable = std::array<uint8_t, kRGBLookupSize>; > using CcmLookupTable = std::array<CcmColumn, kRGBLookupSize>; > + /* Variables used every pixel */ > LookupTable red_; > LookupTable green_; > LookupTable blue_; > @@ -143,24 +144,26 @@ private: > CcmLookupTable blueCcm_; > std::array<double, kGammaLookupSize> gammaTable_; > LookupTable gammaLut_; > - bool ccmEnabled_; > - DebayerParams params_; > - SwIspStats statsBuffer_; > + Rectangle window_; > > + /* Variables used every line */ > + SwIspStats statsBuffer_; > debayerFn debayer0_; > debayerFn debayer1_; > debayerFn debayer2_; > debayerFn debayer3_; > - Rectangle window_; > std::unique_ptr<SwStatsCpu> stats_; > unsigned int lineBufferLength_; > unsigned int lineBufferPadding_; > unsigned int xShift_; /* Offset of 0/1 applied to window_.x */ > bool enableInputMemcpy_; > - > static constexpr unsigned int kMaxThreads = 4; > struct DebayerCpuThreadData threadData_[kMaxThreads]; > + > + /* variables used every frame */ s/variables/Variables/ > unsigned int threadCount_; > + bool ccmEnabled_; > + DebayerParams params_; > }; > > } /* namespace libcamera */
On Mon, Feb 16, 2026 at 08:02:02PM +0100, Hans de Goede wrote: > Group variables used every pixel together, followed by variables used > every lines and then lastly variables only used every frame. > > The idea here is to have all the data used every pixel fit in as few > cachelines as possible. > > Benchmarking does not show any differerence before after, possibly > because most of the per pixel lookup tables where already grouped > together. > > Despite that this still seems like a good idea. > > Signed-off-by: Hans de Goede <johannes.goede@oss.qualcomm.com> > --- > src/libcamera/software_isp/debayer_cpu.h | 13 ++++++++----- > 1 file changed, 8 insertions(+), 5 deletions(-) > > diff --git a/src/libcamera/software_isp/debayer_cpu.h b/src/libcamera/software_isp/debayer_cpu.h > index 800b018c..a54418dc 100644 > --- a/src/libcamera/software_isp/debayer_cpu.h > +++ b/src/libcamera/software_isp/debayer_cpu.h > @@ -135,6 +135,7 @@ private: > }; > using LookupTable = std::array<uint8_t, kRGBLookupSize>; > using CcmLookupTable = std::array<CcmColumn, kRGBLookupSize>; > + /* Variables used every pixel */ > LookupTable red_; > LookupTable green_; > LookupTable blue_; > @@ -143,24 +144,26 @@ private: > CcmLookupTable blueCcm_; > std::array<double, kGammaLookupSize> gammaTable_; > LookupTable gammaLut_; > - bool ccmEnabled_; > - DebayerParams params_; > - SwIspStats statsBuffer_; > + Rectangle window_; > > + /* Variables used every line */ > + SwIspStats statsBuffer_; > debayerFn debayer0_; > debayerFn debayer1_; > debayerFn debayer2_; > debayerFn debayer3_; > - Rectangle window_; > std::unique_ptr<SwStatsCpu> stats_; > unsigned int lineBufferLength_; > unsigned int lineBufferPadding_; > unsigned int xShift_; /* Offset of 0/1 applied to window_.x */ > bool enableInputMemcpy_; > - > static constexpr unsigned int kMaxThreads = 4; > struct DebayerCpuThreadData threadData_[kMaxThreads]; > + > + /* variables used every frame */ > unsigned int threadCount_; > + bool ccmEnabled_; > + DebayerParams params_; Could you reverse the order and start with the per-frame, then per-line and finally per-pixel variables ? That would make a more logical read order. > }; > > } /* namespace libcamera */
Hi, On 17-Feb-26 10:33 PM, Milan Zamazal wrote: > Hi Hans, > > thank you for the patch. > > Hans de Goede <johannes.goede@oss.qualcomm.com> writes: > >> Group variables used every pixel together, followed by variables used >> every lines and then lastly variables only used every frame. >> >> The idea here is to have all the data used every pixel fit in as few >> cachelines as possible. >> >> Benchmarking does not show any differerence before after, possibly >> because most of the per pixel lookup tables where already grouped >> together. > > The struct layout may be indeed critical, IIRC I got ~10% penalty when I > had to switch the lookup table arrangement to fix wrong CCM > multiplication order. Maybe there is a hidden opportunity for > significant speedup somewhere. But this is a different problem than > what this patch solves. > > With the typo below fixed: > > Reviewed-by: Milan Zamazal <mzamazal@redhat.com> Thank you for the review. After the refactoring in v2, there is very little left to move here, so I've dropped this patch for the upcoming v2 series. Regards, Hans
diff --git a/src/libcamera/software_isp/debayer_cpu.h b/src/libcamera/software_isp/debayer_cpu.h index 800b018c..a54418dc 100644 --- a/src/libcamera/software_isp/debayer_cpu.h +++ b/src/libcamera/software_isp/debayer_cpu.h @@ -135,6 +135,7 @@ private: }; using LookupTable = std::array<uint8_t, kRGBLookupSize>; using CcmLookupTable = std::array<CcmColumn, kRGBLookupSize>; + /* Variables used every pixel */ LookupTable red_; LookupTable green_; LookupTable blue_; @@ -143,24 +144,26 @@ private: CcmLookupTable blueCcm_; std::array<double, kGammaLookupSize> gammaTable_; LookupTable gammaLut_; - bool ccmEnabled_; - DebayerParams params_; - SwIspStats statsBuffer_; + Rectangle window_; + /* Variables used every line */ + SwIspStats statsBuffer_; debayerFn debayer0_; debayerFn debayer1_; debayerFn debayer2_; debayerFn debayer3_; - Rectangle window_; std::unique_ptr<SwStatsCpu> stats_; unsigned int lineBufferLength_; unsigned int lineBufferPadding_; unsigned int xShift_; /* Offset of 0/1 applied to window_.x */ bool enableInputMemcpy_; - static constexpr unsigned int kMaxThreads = 4; struct DebayerCpuThreadData threadData_[kMaxThreads]; + + /* variables used every frame */ unsigned int threadCount_; + bool ccmEnabled_; + DebayerParams params_; }; } /* namespace libcamera */
Group variables used every pixel together, followed by variables used every lines and then lastly variables only used every frame. The idea here is to have all the data used every pixel fit in as few cachelines as possible. Benchmarking does not show any differerence before after, possibly because most of the per pixel lookup tables where already grouped together. Despite that this still seems like a good idea. Signed-off-by: Hans de Goede <johannes.goede@oss.qualcomm.com> --- src/libcamera/software_isp/debayer_cpu.h | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-)