[00/30] RFC/RFT: gpuisp: Multipass with speed optimisations on top

Message ID 20260618122245.946138-1-bryan.odonoghue@linaro.org

Bryan O'Donoghue June 18, 2026, 12:22 p.m. UTC
Greetings.

This series implements multi-pass gpuisp as a two-phase thing.

- Some initial housekeeping to make the naming more logical
  ShaderPass et al.

- Dragging the existing implementation through a progressive change
  set.

- Adding in shaders to "normalise" i.e. to take either a packed or unpacked
  CSI2 pixel stream - apply BLC to it and output a standard GL16F frame.

- This allows us to dispense with having two demosaic shaders and to use
  the logic from the existing unpacked demosaic for all streams.

  This is actually a nice change for the packed case as the unpacked
  algorithm is slightly better.

- Some benchmarking added to the GpuIspShaderPass shows that, on a
  reference frame costing 20ms, the two shaders consume about 5.5ms.

- In that light some caching is done at the end of the series to improve
  the current throughput.

  On the slower system I tested the series on, the before case averages
  out at about 22ms per frame.

  Post this series we get between 18ms and 20ms. I'd call that a 10-15%
  win.

  On the slow system (rb5-sm8250) the shaders are about 5.5ms of runtime.
  On the fast hamoa-x1e system the whole process is about 6ms.

- Not tested:
  - Unpacked CSI2 input - I don't have easy access to this input right now.
  - DMABUF input caching - none of my hardware supports it.

- The next steps for this series are:

  - Converting the BLC normalise phase to a compute shader
  - Having the BLC normalise phase produce an additional SSBO which
    is the histogram of the Bayer input.
  - Using that histogram to no longer run CPU side Bayer stats, including
    not having to map the buffer in the CPU.
  - Emitting the input buffer after the BLC shader completes.
    Possible only when generating stats in GPU.
  - The CPU stats on the slow system I'm targeting consume about 4.5ms
    of the original 20ms.
    Done right, the additional time in the shader should be low, though
    syncing around the SSBO to subsequently use the buffer may be
    a gotcha.
  - Since "we are where we are" on fencing and glFinish() costs about
    8 ms in this reference slow case - reaping 4.5 additional ms seems
    like the next most logical thing to try to attack.

    Caveated on the fact this series is too large and messy right now :)

  - Fencing/Deferred fencing.
    Right now libcamera, as Nicolas pointed out on IRC, doesn't require
    framebuffer recipients to dma-fence.

    This means we need to fence in libcamera.

    - Would the dma-buf ioctl for the output framebuffer produce
      better synchronous wait times than glFinish() and, if so,
      could we really trust the result? Probably no and yes, I'd guess.

    - Could we use EGL fencing to achieve better wait times?
      Almost certainly not: glFinish() is doing a real thing here,
      ensuring the GPU is finished.

    - Could we do an asynchronous wait - doing a fenceKHR at the top
      of processGPU() for the previous frame?
      This would let the CPU do productive work while the GPU completes.

    - Finally, can we "just" require users to dma-fence the framebuffer?

    - Other permutations on this theme are possible.

  - Adding additional passes
    We would like to have shaders and objects that are "composable" so that
    for example you could run a GPU based noise filter on any frame.

    This means having more granular GPUISP support is desirable.

    Since the time spent moving from one shader to the other appears quite
    low and not where we are burning time - a more granular number of passes
    at the moment seems achievable.
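
For readers unfamiliar with the normalise pass described earlier (packed
CSI2 in, BLC applied, float out), here is a minimal CPU-side sketch of the
idea. It is not the shader from this series. It assumes 10-bit data in the
usual CSI-2 RAW10 packing (4 pixels in 5 bytes), a white level of 1023 and
a simple divide by (white - black) for the normalisation. The function
names unpackRaw10(), blcNormalise() and normalisePackedRaw10() are
hypothetical and exist only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of the series: read pixel i (0..3) from one
// 5-byte RAW10 group. Assumed layout: bytes 0..3 carry the 8 MSBs of pixels
// 0..3, byte 4 carries the 2 LSBs of pixel 0 in bits [1:0], pixel 1 in
// bits [3:2], and so on.
static inline std::uint16_t unpackRaw10(const std::uint8_t *group, unsigned int i)
{
	return static_cast<std::uint16_t>((group[i] << 2) |
					  ((group[4] >> (2 * i)) & 0x3));
}

// Subtract the black level, clamp at zero and scale into [0, 1].
// The exact scaling used by the real shader is an assumption here.
static inline float blcNormalise(std::uint16_t raw, std::uint16_t black,
				 std::uint16_t white)
{
	assert(white > black);
	int v = std::max(0, static_cast<int>(raw) - static_cast<int>(black));
	return std::min(1.0f, static_cast<float>(v) /
				      static_cast<float>(white - black));
}

// Turn a packed RAW10 buffer into one normalised float per pixel, which is
// roughly what a single-channel float texture would hold after the pass.
// Trailing bytes that do not form a full 5-byte group are ignored.
std::vector<float> normalisePackedRaw10(const std::vector<std::uint8_t> &packed,
					std::uint16_t black)
{
	constexpr std::uint16_t kWhite = 1023; // assumed 10-bit full scale
	std::vector<float> out;
	out.reserve(packed.size() / 5 * 4);

	for (std::size_t g = 0; g + 5 <= packed.size(); g += 5)
		for (unsigned int i = 0; i < 4; i++)
			out.push_back(blcNormalise(unpackRaw10(&packed[g], i),
						   black, kWhite));

	return out;
}
```

With a black level of 64, a raw value of 1020 maps to 956/959 and anything
at or below 64 clamps to 0. The unpacked case would skip unpackRaw10() and
feed the 16-bit samples straight into blcNormalise(), which is why a single
demosaic stage can then serve both inputs.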

Bryan O'Donoghue (30):
  libcamera: software_isp: Rename Bayer classes to SoftwareIspPipeline
  libcamera: software_isp: gpu: Change the name of eglImageBayerOut_ to
    eglImageRGBAOut_
  libcamera: software_isp: gpu: rename debayerGPU to processGPU
  libcamera: software_isp: egl: Add new helper attachTextureToFBO
  libcamera: software_isp: gpu_pipeline_shader_pass: Add base class
    GpuPipelineShaderPass
  libcamera: software_isp: gpu_pipeline_shader_pass: Add
    GpuPipelineShaderPassDemosiac
  libcamera: software_isp: gpu: Switch to using GpuIspShaderPassDemosiac
  libcamera: software_isp: gpu: Drop unused method definitions
  libcamera: software_isp: gpu: Make Rectangle window_ a local variable
    in configure()
  libcamera: software_isp: gpu_pipeline_shader_pass: Move common
    attribute and uniform variables to base shader class
  libcamera: software_isp: gpu_pipeline_shader_pass: Move common shader
    selection logic into base class in new method initShaders()
  libcamera: shaders: Split packed and unpacked demosiac up
  libcamera: shaders: bayer_glr16_to_rgba.frag: Use bilinear filtering
  libcamera: software_isp: gpu: Add GpuIspShaderPassBlcNormalise
  libcamera: software_isp: egl: Extend eGL::createTexture2D to
    understand floats
  libcamera: software_isp: egl: Move to GLES 3.0
  libcamera: software_isp: egl: Rename createTexture2D to
    createInputTexture2D
  libcamera: software_isp: egl: Use Texture Unit 3 for final output
    texture
  libcamera: software_isp: egl: Add Ping/Pong buffers with start/stop
    bindings only
  libcamera: software_isp: gpu: Include GpuIspShaderPassBlcNormalise in
    init sequence
  libcamera: software_isp: egl: Add createOutputTexture2D
  libcamera: software_isp: gpu: Swtich to two pass logic
  libcamera: software_isp: egl: Add method lookups for GPU benchmark
    rountines
  libcamera: software_isp: egl: Add eglBenchMark
  libcamera: software_isp: gpu_pipeline_shader_pass: Add shader DEBUG
    time logging
  libcamera: software_isp: gpu: Do a synchronous BenchMark print after
    syncOutput
  libcamera: software_isp: egl: Add updateInputTexture2D
  libcamera: software_isp: gpu: Switch to using glTexSubImage2D on slow
    path upload
  libcamera: software_isp: gpu: Cache output framebuffers, only recreate
    when necessary
  libcamera: software_isp: gpu: Cache input framebuffers, only do
    texture creation when required

 include/libcamera/internal/egl.h              |  65 +-
 .../internal/software_isp/software_isp.h      |   4 +-
 src/libcamera/egl.cpp                         | 131 +++-
 .../bayer_1x_packed_to_blc_glr16f.frag        |  97 +++
 .../shaders/bayer_glr16_to_rgba.frag          | 155 ++++
 .../shaders/bayer_unpacked_to_blc_glr16f.frag |  46 ++
 src/libcamera/shaders/meson.build             |   3 +
 src/libcamera/software_isp/debayer_egl.cpp    | 671 ------------------
 .../software_isp/gpu_pipeline_shader_pass.cpp | 196 +++++
 .../software_isp/gpu_pipeline_shader_pass.h   | 109 +++
 ...gpu_pipeline_shader_pass_blc_normalise.cpp | 270 +++++++
 .../gpu_pipeline_shader_pass_blc_normalise.h  |  55 ++
 .../gpu_pipeline_shader_pass_demosiac.cpp     | 239 +++++++
 .../gpu_pipeline_shader_pass_demosiac.h       |  60 ++
 src/libcamera/software_isp/meson.build        |   9 +-
 src/libcamera/software_isp/software_isp.cpp   |  60 +-
 ...{debayer.cpp => software_isp_pipeline.cpp} |  77 +-
 .../{debayer.h => software_isp_pipeline.h}    |   8 +-
 ..._cpu.cpp => software_isp_pipeline_cpu.cpp} | 160 ++---
 ...ayer_cpu.h => software_isp_pipeline_cpu.h} |  10 +-
 .../software_isp_pipeline_gpu.cpp             | 433 +++++++++++
 ...ayer_egl.h => software_isp_pipeline_gpu.h} |  61 +-
 22 files changed, 2023 insertions(+), 896 deletions(-)
 create mode 100644 src/libcamera/shaders/bayer_1x_packed_to_blc_glr16f.frag
 create mode 100644 src/libcamera/shaders/bayer_glr16_to_rgba.frag
 create mode 100644 src/libcamera/shaders/bayer_unpacked_to_blc_glr16f.frag
 delete mode 100644 src/libcamera/software_isp/debayer_egl.cpp
 create mode 100644 src/libcamera/software_isp/gpu_pipeline_shader_pass.cpp
 create mode 100644 src/libcamera/software_isp/gpu_pipeline_shader_pass.h
 create mode 100644 src/libcamera/software_isp/gpu_pipeline_shader_pass_blc_normalise.cpp
 create mode 100644 src/libcamera/software_isp/gpu_pipeline_shader_pass_blc_normalise.h
 create mode 100644 src/libcamera/software_isp/gpu_pipeline_shader_pass_demosiac.cpp
 create mode 100644 src/libcamera/software_isp/gpu_pipeline_shader_pass_demosiac.h
 rename src/libcamera/software_isp/{debayer.cpp => software_isp_pipeline.cpp} (73%)
 rename src/libcamera/software_isp/{debayer.h => software_isp_pipeline.h} (93%)
 rename src/libcamera/software_isp/{debayer_cpu.cpp => software_isp_pipeline_cpu.cpp} (84%)
 rename src/libcamera/software_isp/{debayer_cpu.h => software_isp_pipeline_cpu.h} (95%)
 create mode 100644 src/libcamera/software_isp/software_isp_pipeline_gpu.cpp
 rename src/libcamera/software_isp/{debayer_egl.h => software_isp_pipeline_gpu.h} (63%)