libcamera: debayer_cpu: Sync output buffer
diff mbox series

Message ID 20240831190511.128151-1-robert.mader@collabora.com
State Superseded
Headers show
Series
  • libcamera: debayer_cpu: Sync output buffer
Related show

Commit Message

Robert Mader Aug. 31, 2024, 7:05 p.m. UTC
Using `DMA_BUF_IOCTL_SYNC` is required for dmabufs in order to ensure
correct results. Not doing so currently results in occasional tearing
and/or backlashes in GL/VK clients that use the buffers directly for
rendering.

Signed-off-by: Robert Mader <robert.mader@collabora.com>
---
 src/libcamera/software_isp/debayer_cpu.cpp | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

Comments

Laurent Pinchart Sept. 1, 2024, 12:49 a.m. UTC | #1
CC'ing Hans, Milan and Bryan.

On Sat, Aug 31, 2024 at 09:05:11PM +0200, Robert Mader wrote:
> Using `DMA_BUF_IOCTL_SYNC` is required for dmabufs in order to ensure
> correct results. Not doing so currently results in occasional tearing
> and/or backlashes in GL/VK clients that use the buffers directly for
> rendering.
> 
> Signed-off-by: Robert Mader <robert.mader@collabora.com>
> ---
>  src/libcamera/software_isp/debayer_cpu.cpp | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/src/libcamera/software_isp/debayer_cpu.cpp b/src/libcamera/software_isp/debayer_cpu.cpp
> index 077f7f4b..6c953b03 100644
> --- a/src/libcamera/software_isp/debayer_cpu.cpp
> +++ b/src/libcamera/software_isp/debayer_cpu.cpp
> @@ -11,7 +11,9 @@
>  
>  #include "debayer_cpu.h"
>  
> +#include <linux/dma-buf.h>
>  #include <stdlib.h>
> +#include <sys/ioctl.h>
>  #include <time.h>
>  
>  #include <libcamera/formats.h>
> @@ -733,6 +735,16 @@ void DebayerCpu::process(FrameBuffer *input, FrameBuffer *output, DebayerParams
>  		clock_gettime(CLOCK_MONOTONIC_RAW, &frameStartTime);
>  	}
>  
> +	for (const FrameBuffer::Plane &plane : output->planes()) {
> +		const int fd = plane.fd.get();
> +		struct dma_buf_sync sync = { DMA_BUF_SYNC_START };
> +
> +		sync.flags |= DMA_BUF_SYNC_WRITE;
> +
> +		if (ioctl (fd, DMA_BUF_IOCTL_SYNC, &sync) < 0)
> +			LOG(Debayer, Error) << "Syncing buffer FD " << fd << "failed: " << errno;
> +	}
> +
>  	green_ = params.green;
>  	red_ = swapRedBlueGains_ ? params.blue : params.red;
>  	blue_ = swapRedBlueGains_ ? params.red : params.blue;
> @@ -760,6 +772,16 @@ void DebayerCpu::process(FrameBuffer *input, FrameBuffer *output, DebayerParams
>  
>  	metadata.planes()[0].bytesused = out.planes()[0].size();
>  
> +	for (const FrameBuffer::Plane &plane : output->planes()) {
> +		const int fd = plane.fd.get();
> +		struct dma_buf_sync sync = { DMA_BUF_SYNC_END };
> +
> +		sync.flags |= DMA_BUF_SYNC_WRITE;
> +
> +		if (ioctl (fd, DMA_BUF_IOCTL_SYNC, &sync) < 0)
> +			LOG(Debayer, Error) << "Syncing buffer FD " << fd << "failed: " << errno;
> +	}
> +
>  	/* Measure before emitting signals */
>  	if (measuredFrames_ < DebayerCpu::kLastFrameToMeasure &&
>  	    ++measuredFrames_ > DebayerCpu::kFramesToSkip) {
Hans de Goede Sept. 1, 2024, 10:56 a.m. UTC | #2
Hi Robert,

On 8/31/24 9:05 PM, Robert Mader wrote:
> Using `DMA_BUF_IOCTL_SYNC` is required for dmabufs in order to ensure
> correct results. Not doing so currently results in occasional tearing
> and/or backlashes in GL/VK clients that use the buffers directly for
> rendering.
> 
> Signed-off-by: Robert Mader <robert.mader@collabora.com>

Thank you for your patch.

I'm not a dmabuf expert, with that said the suggested change looks
reasonable to me.

One small code-style remark inline below.

> ---
>  src/libcamera/software_isp/debayer_cpu.cpp | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/src/libcamera/software_isp/debayer_cpu.cpp b/src/libcamera/software_isp/debayer_cpu.cpp
> index 077f7f4b..6c953b03 100644
> --- a/src/libcamera/software_isp/debayer_cpu.cpp
> +++ b/src/libcamera/software_isp/debayer_cpu.cpp
> @@ -11,7 +11,9 @@
>  
>  #include "debayer_cpu.h"
>  
> +#include <linux/dma-buf.h>
>  #include <stdlib.h>
> +#include <sys/ioctl.h>
>  #include <time.h>
>  
>  #include <libcamera/formats.h>
> @@ -733,6 +735,16 @@ void DebayerCpu::process(FrameBuffer *input, FrameBuffer *output, DebayerParams
>  		clock_gettime(CLOCK_MONOTONIC_RAW, &frameStartTime);
>  	}
>  
> +	for (const FrameBuffer::Plane &plane : output->planes()) {
> +		const int fd = plane.fd.get();
> +		struct dma_buf_sync sync = { DMA_BUF_SYNC_START };
> +
> +		sync.flags |= DMA_BUF_SYNC_WRITE;

These 2 lines are a weird way to write:

		struct dma_buf_sync sync = { DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE };

?

> +
> +		if (ioctl (fd, DMA_BUF_IOCTL_SYNC, &sync) < 0)
> +			LOG(Debayer, Error) << "Syncing buffer FD " << fd << "failed: " << errno;
> +	}
> +
>  	green_ = params.green;
>  	red_ = swapRedBlueGains_ ? params.blue : params.red;
>  	blue_ = swapRedBlueGains_ ? params.red : params.blue;
> @@ -760,6 +772,16 @@ void DebayerCpu::process(FrameBuffer *input, FrameBuffer *output, DebayerParams
>  
>  	metadata.planes()[0].bytesused = out.planes()[0].size();
>  
> +	for (const FrameBuffer::Plane &plane : output->planes()) {
> +		const int fd = plane.fd.get();
> +		struct dma_buf_sync sync = { DMA_BUF_SYNC_END };
> +
> +		sync.flags |= DMA_BUF_SYNC_WRITE;

Same here, why not use:

		struct dma_buf_sync sync = { DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE };

?


> +
> +		if (ioctl (fd, DMA_BUF_IOCTL_SYNC, &sync) < 0)
> +			LOG(Debayer, Error) << "Syncing buffer FD " << fd << "failed: " << errno;
> +	}
> +
>  	/* Measure before emitting signals */
>  	if (measuredFrames_ < DebayerCpu::kLastFrameToMeasure &&
>  	    ++measuredFrames_ > DebayerCpu::kFramesToSkip) {

Regards,

Hans
Laurent Pinchart Sept. 1, 2024, 11:07 a.m. UTC | #3
On Sun, Sep 01, 2024 at 12:56:10PM +0200, Hans de Goede wrote:
> Hi Robert,
> 
> On 8/31/24 9:05 PM, Robert Mader wrote:
> > Using `DMA_BUF_IOCTL_SYNC` is required for dmabufs in order to ensure
> > correct results. Not doing so currently results in occasional tearing
> > and/or backlashes in GL/VK clients that use the buffers directly for
> > rendering.
> > 
> > Signed-off-by: Robert Mader <robert.mader@collabora.com>
> 
> Thank you for your patch.
> 
> I'm not a dmabuf expert, with that said the suggested change looks
> reasonable to me.

Hans, would you be able to test this on an IPU6-based device, and check
the performance impact ? I don't expect expensive cache management
operations on an x86 device.

Bryan, could you do the same with camss ?

> One small code-style remark inline below.
> 
> > ---
> >  src/libcamera/software_isp/debayer_cpu.cpp | 22 ++++++++++++++++++++++
> >  1 file changed, 22 insertions(+)
> > 
> > diff --git a/src/libcamera/software_isp/debayer_cpu.cpp b/src/libcamera/software_isp/debayer_cpu.cpp
> > index 077f7f4b..6c953b03 100644
> > --- a/src/libcamera/software_isp/debayer_cpu.cpp
> > +++ b/src/libcamera/software_isp/debayer_cpu.cpp
> > @@ -11,7 +11,9 @@
> >  
> >  #include "debayer_cpu.h"
> >  
> > +#include <linux/dma-buf.h>
> >  #include <stdlib.h>
> > +#include <sys/ioctl.h>
> >  #include <time.h>
> >  
> >  #include <libcamera/formats.h>
> > @@ -733,6 +735,16 @@ void DebayerCpu::process(FrameBuffer *input, FrameBuffer *output, DebayerParams
> >  		clock_gettime(CLOCK_MONOTONIC_RAW, &frameStartTime);
> >  	}
> >  
> > +	for (const FrameBuffer::Plane &plane : output->planes()) {
> > +		const int fd = plane.fd.get();
> > +		struct dma_buf_sync sync = { DMA_BUF_SYNC_START };
> > +
> > +		sync.flags |= DMA_BUF_SYNC_WRITE;
> 
> These 2 lines are a weird way to write:
> 
> 		struct dma_buf_sync sync = { DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE };
> 
> ?
> 
> > +
> > +		if (ioctl (fd, DMA_BUF_IOCTL_SYNC, &sync) < 0)
> > +			LOG(Debayer, Error) << "Syncing buffer FD " << fd << "failed: " << errno;
> > +	}
> > +
> >  	green_ = params.green;
> >  	red_ = swapRedBlueGains_ ? params.blue : params.red;
> >  	blue_ = swapRedBlueGains_ ? params.red : params.blue;
> > @@ -760,6 +772,16 @@ void DebayerCpu::process(FrameBuffer *input, FrameBuffer *output, DebayerParams
> >  
> >  	metadata.planes()[0].bytesused = out.planes()[0].size();
> >  
> > +	for (const FrameBuffer::Plane &plane : output->planes()) {
> > +		const int fd = plane.fd.get();
> > +		struct dma_buf_sync sync = { DMA_BUF_SYNC_END };
> > +
> > +		sync.flags |= DMA_BUF_SYNC_WRITE;
> 
> Same here, why not use:
> 
> 		struct dma_buf_sync sync = { DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE };
> 
> ?
> 
> > +
> > +		if (ioctl (fd, DMA_BUF_IOCTL_SYNC, &sync) < 0)
> > +			LOG(Debayer, Error) << "Syncing buffer FD " << fd << "failed: " << errno;
> > +	}
> > +
> >  	/* Measure before emitting signals */
> >  	if (measuredFrames_ < DebayerCpu::kLastFrameToMeasure &&
> >  	    ++measuredFrames_ > DebayerCpu::kFramesToSkip) {
Robert Mader Sept. 1, 2024, 11:39 a.m. UTC | #4
On 01.09.24 13:07, Laurent Pinchart wrote:
> On Sun, Sep 01, 2024 at 12:56:10PM +0200, Hans de Goede wrote:
>> Hi Robert,
>>
>> On 8/31/24 9:05 PM, Robert Mader wrote:
>>> Using `DMA_BUF_IOCTL_SYNC` is required for dmabufs in order to ensure
>>> correct results. Not doing so currently results in occasional tearing
>>> and/or backlashes in GL/VK clients that use the buffers directly for
>>> rendering.
>>>
>>> Signed-off-by: Robert Mader <robert.mader@collabora.com>
>> Thank you for your patch.
>>
>> I'm not a dmabuf expert, with that said the suggested change looks
>> reasonable to me.
> Hans, would you be able to test this on an IPU6-based device, and check
> the performance impact ? I don't expect expensive cache management
> operations on an x86 device.
>
> Bryan, could you do the same with camss ?

Just to let you know: I directly submitted this yesterday to start a 
discussion, however after a bit more thinking I figured we can probably 
do even better by moving these calls to MappedFrameBuffer. I wrote 
https://gitlab.freedesktop.org/rmader/libcamera/-/commit/6b7d280634accd9c214d052df34565eb208652e7 
for that if you want to have look (will submit early next week).

>> One small code-style remark inline below.
>>
>>> ---
>>>   src/libcamera/software_isp/debayer_cpu.cpp | 22 ++++++++++++++++++++++
>>>   1 file changed, 22 insertions(+)
>>>
>>> diff --git a/src/libcamera/software_isp/debayer_cpu.cpp b/src/libcamera/software_isp/debayer_cpu.cpp
>>> index 077f7f4b..6c953b03 100644
>>> --- a/src/libcamera/software_isp/debayer_cpu.cpp
>>> +++ b/src/libcamera/software_isp/debayer_cpu.cpp
>>> @@ -11,7 +11,9 @@
>>>   
>>>   #include "debayer_cpu.h"
>>>   
>>> +#include <linux/dma-buf.h>
>>>   #include <stdlib.h>
>>> +#include <sys/ioctl.h>
>>>   #include <time.h>
>>>   
>>>   #include <libcamera/formats.h>
>>> @@ -733,6 +735,16 @@ void DebayerCpu::process(FrameBuffer *input, FrameBuffer *output, DebayerParams
>>>   		clock_gettime(CLOCK_MONOTONIC_RAW, &frameStartTime);
>>>   	}
>>>   
>>> +	for (const FrameBuffer::Plane &plane : output->planes()) {
>>> +		const int fd = plane.fd.get();
>>> +		struct dma_buf_sync sync = { DMA_BUF_SYNC_START };
>>> +
>>> +		sync.flags |= DMA_BUF_SYNC_WRITE;
>> These 2 lines are a weird way to write:
>>
>> 		struct dma_buf_sync sync = { DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE };
>>
>> ?
>>
>>> +
>>> +		if (ioctl (fd, DMA_BUF_IOCTL_SYNC, &sync) < 0)
>>> +			LOG(Debayer, Error) << "Syncing buffer FD " << fd << "failed: " << errno;
>>> +	}
>>> +
>>>   	green_ = params.green;
>>>   	red_ = swapRedBlueGains_ ? params.blue : params.red;
>>>   	blue_ = swapRedBlueGains_ ? params.red : params.blue;
>>> @@ -760,6 +772,16 @@ void DebayerCpu::process(FrameBuffer *input, FrameBuffer *output, DebayerParams
>>>   
>>>   	metadata.planes()[0].bytesused = out.planes()[0].size();
>>>   
>>> +	for (const FrameBuffer::Plane &plane : output->planes()) {
>>> +		const int fd = plane.fd.get();
>>> +		struct dma_buf_sync sync = { DMA_BUF_SYNC_END };
>>> +
>>> +		sync.flags |= DMA_BUF_SYNC_WRITE;
>> Same here, why not use:
>>
>> 		struct dma_buf_sync sync = { DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE };
>>
>> ?
>>
>>> +
>>> +		if (ioctl (fd, DMA_BUF_IOCTL_SYNC, &sync) < 0)
>>> +			LOG(Debayer, Error) << "Syncing buffer FD " << fd << "failed: " << errno;
>>> +	}
>>> +
>>>   	/* Measure before emitting signals */
>>>   	if (measuredFrames_ < DebayerCpu::kLastFrameToMeasure &&
>>>   	    ++measuredFrames_ > DebayerCpu::kFramesToSkip) {
Robert Mader Sept. 2, 2024, 10:56 a.m. UTC | #5
On 01.09.24 13:39, Robert Mader wrote:
>
> On 01.09.24 13:07, Laurent Pinchart wrote:
>> Hans, would you be able to test this on an IPU6-based device, and check
>> the performance impact ? I don't expect expensive cache management
>> operations on an x86 device.
>>
>> Bryan, could you do the same with camss ?

Heads up that in my initial testing around different Gstreamer pipelines 
on arm64 I saw mixed results:

1. Cases involving successful dmabuf import to the GPU are (much) less 
prone to glitches while not seeming to regress much in terms of frame 
rates. This includes running Gnome-Snapshot or waylandsink on devices 
like the Librem5, PinePhone or Pixel 3a (generally qcom).

2. Cases where Gst mmaps the buffers seem to get a noticeable 
performance hit.

Crucially this applies to common fallback paths like in following example:

  - glupload tries to import the buffer as dmabuf

  - fails due to stride requirements...

  - uses the "raw" importer that mmap the buffer

This case is almost tragic IMO. The buffer data ends up only getting 
accessed by the CPU but we flush the catches/sync to the GPU *twice* - 
just to upload a copy in the end.

And while I see potential to improve this scenario in the other parts of 
the stack, I don't see anything we can about it in libcamera right now 
(apart from not landing a patch like this).

Regards,

Robert
Laurent Pinchart Sept. 2, 2024, 8:32 p.m. UTC | #6
On Mon, Sep 02, 2024 at 12:56:31PM +0200, Robert Mader wrote:
> On 01.09.24 13:39, Robert Mader wrote:
> >
> > On 01.09.24 13:07, Laurent Pinchart wrote:
> >> Hans, would you be able to test this on an IPU6-based device, and check
> >> the performance impact ? I don't expect expensive cache management
> >> operations on an x86 device.
> >>
> >> Bryan, could you do the same with camss ?
> 
> Heads up that in my initial testing around different Gstreamer pipelines 
> on arm64 I saw mixed results:
> 
> 1. Cases involving successful dmabuf import to the GPU are (much) less 
> prone to glitches while not seeming to regress much in terms of frame 
> rates. This includes running Gnome-Snapshot or waylandsink on devices 
> like the Librem5, PinePhone or Pixel 3a (generally qcom).
> 
> 2. Cases where Gst mmaps the buffers seem to get a noticeable 
> performance hit.
> 
> Crucially this applies to common fallback paths like in following example:
> 
>   - glupload tries to import the buffer as dmabuf
> 
>   - fails due to stride requirements...
> 
>   - uses the "raw" importer that mmap the buffer
> 
> This case is almost tragic IMO. The buffer data ends up only getting 
> accessed by the CPU but we flush the catches/sync to the GPU *twice* - 
> just to upload a copy in the end.
> 
> And while I see potential to improve this scenario in the other parts of 
> the stack, I don't see anything we can about it in libcamera right now 
> (apart from not landing a patch like this).

It's a bit late, but maybe there's a possibility to submit a lightning
talk/BoF topic for LPC in two weeks ? Cache handling is a topic that
crosses many subsystem boundaries, and I think we'll have quite a few
people with relevant expertise in Vienna.
Naushir Patuck Sept. 3, 2024, 7:21 a.m. UTC | #7
On Mon, 2 Sept 2024 at 21:32, Laurent Pinchart
<laurent.pinchart@ideasonboard.com> wrote:
>
> On Mon, Sep 02, 2024 at 12:56:31PM +0200, Robert Mader wrote:
> > On 01.09.24 13:39, Robert Mader wrote:
> > >
> > > On 01.09.24 13:07, Laurent Pinchart wrote:
> > >> Hans, would you be able to test this on an IPU6-based device, and check
> > >> the performance impact ? I don't expect expensive cache management
> > >> operations on an x86 device.
> > >>
> > >> Bryan, could you do the same with camss ?
> >
> > Heads up that in my initial testing around different Gstreamer pipelines
> > on arm64 I saw mixed results:
> >
> > 1. Cases involving successful dmabuf import to the GPU are (much) less
> > prone to glitches while not seeming to regress much in terms of frame
> > rates. This includes running Gnome-Snapshot or waylandsink on devices
> > like the Librem5, PinePhone or Pixel 3a (generally qcom).
> >
> > 2. Cases where Gst mmaps the buffers seem to get a noticeable
> > performance hit.
> >
> > Crucially this applies to common fallback paths like in following example:
> >
> >   - glupload tries to import the buffer as dmabuf
> >
> >   - fails due to stride requirements...
> >
> >   - uses the "raw" importer that mmap the buffer
> >
> > This case is almost tragic IMO. The buffer data ends up only getting
> > accessed by the CPU but we flush the catches/sync to the GPU *twice* -
> > just to upload a copy in the end.
> >
> > And while I see potential to improve this scenario in the other parts of
> > the stack, I don't see anything we can about it in libcamera right now
> > (apart from not landing a patch like this).
>
> It's a bit late, but maybe there's a possibility to submit a lightning
> talk/BoF topic for LPC in two weeks ? Cache handling is a topic that
> crosses many subsystem boundaries, and I think we'll have quite a few
> people with relevant expertise in Vienna.
>

This is quite a complicated topic indeed.  The RPi camera stack
switched to using cacheable dma bufs for performance reasons (> 10%
uplift in certain use cases) and we had to be very careful with how to
handle the DMA_BUF_IOCTL_SYNC calls at the application level.
However, I don't think handling this in MappedFrameBuffer is the right
thing for hardware based ISPs because of unexpected stale data
flushing/invalidation.  I can expand on this during our F2F in Vienna.

Naush


> --
> Regards,
>
> Laurent Pinchart
Laurent Pinchart Sept. 3, 2024, 1:27 p.m. UTC | #8
On Tue, Sep 03, 2024 at 08:21:54AM +0100, Naushir Patuck wrote:
> On Mon, 2 Sept 2024 at 21:32, Laurent Pinchart wrote:
> > On Mon, Sep 02, 2024 at 12:56:31PM +0200, Robert Mader wrote:
> > > On 01.09.24 13:39, Robert Mader wrote:
> > > > On 01.09.24 13:07, Laurent Pinchart wrote:
> > > >> Hans, would you be able to test this on an IPU6-based device, and check
> > > >> the performance impact ? I don't expect expensive cache management
> > > >> operations on an x86 device.
> > > >>
> > > >> Bryan, could you do the same with camss ?
> > >
> > > Heads up that in my initial testing around different Gstreamer pipelines
> > > on arm64 I saw mixed results:
> > >
> > > 1. Cases involving successful dmabuf import to the GPU are (much) less
> > > prone to glitches while not seeming to regress much in terms of frame
> > > rates. This includes running Gnome-Snapshot or waylandsink on devices
> > > like the Librem5, PinePhone or Pixel 3a (generally qcom).
> > >
> > > 2. Cases where Gst mmaps the buffers seem to get a noticeable
> > > performance hit.
> > >
> > > Crucially this applies to common fallback paths like in following example:
> > >
> > >   - glupload tries to import the buffer as dmabuf
> > >
> > >   - fails due to stride requirements...
> > >
> > >   - uses the "raw" importer that mmap the buffer
> > >
> > > This case is almost tragic IMO. The buffer data ends up only getting
> > > accessed by the CPU but we flush the catches/sync to the GPU *twice* -
> > > just to upload a copy in the end.
> > >
> > > And while I see potential to improve this scenario in the other parts of
> > > the stack, I don't see anything we can about it in libcamera right now
> > > (apart from not landing a patch like this).
> >
> > It's a bit late, but maybe there's a possibility to submit a lightning
> > talk/BoF topic for LPC in two weeks ? Cache handling is a topic that
> > crosses many subsystem boundaries, and I think we'll have quite a few
> > people with relevant expertise in Vienna.
> 
> This is quite a complicated topic indeed.  The RPi camera stack
> switched to using cacheable dma bufs for performance reasons (> 10%
> uplift in certain use cases) and we had to be very careful with how to
> handle the DMA_BUF_IOCTL_SYNC calls at the application level.
> However, I don't think handling this in MappedFrameBuffer is the right
> thing for hardware based ISPs because of unexpected stale data
> flushing/invalidation.  I can expand on this during our F2F in Vienna.

That seems a good discussion topic to me.
Robert Mader Sept. 3, 2024, 1:39 p.m. UTC | #9
On 03.09.24 15:27, Laurent Pinchart wrote:
> On Tue, Sep 03, 2024 at 08:21:54AM +0100, Naushir Patuck wrote:
>> On Mon, 2 Sept 2024 at 21:32, Laurent Pinchart wrote:
>>> On Mon, Sep 02, 2024 at 12:56:31PM +0200, Robert Mader wrote:
>>>> On 01.09.24 13:39, Robert Mader wrote:
>>>>> On 01.09.24 13:07, Laurent Pinchart wrote:
>>>>>> Hans, would you be able to test this on an IPU6-based device, and check
>>>>>> the performance impact ? I don't expect expensive cache management
>>>>>> operations on an x86 device.
>>>>>>
>>>>>> Bryan, could you do the same with camss ?
>>>> Heads up that in my initial testing around different Gstreamer pipelines
>>>> on arm64 I saw mixed results:
>>>>
>>>> 1. Cases involving successful dmabuf import to the GPU are (much) less
>>>> prone to glitches while not seeming to regress much in terms of frame
>>>> rates. This includes running Gnome-Snapshot or waylandsink on devices
>>>> like the Librem5, PinePhone or Pixel 3a (generally qcom).
>>>>
>>>> 2. Cases where Gst mmaps the buffers seem to get a noticeable
>>>> performance hit.
>>>>
>>>> Crucially this applies to common fallback paths like in following example:
>>>>
>>>>    - glupload tries to import the buffer as dmabuf
>>>>
>>>>    - fails due to stride requirements...
>>>>
>>>>    - uses the "raw" importer that mmap the buffer
>>>>
>>>> This case is almost tragic IMO. The buffer data ends up only getting
>>>> accessed by the CPU but we flush the catches/sync to the GPU *twice* -
>>>> just to upload a copy in the end.
>>>>
>>>> And while I see potential to improve this scenario in the other parts of
>>>> the stack, I don't see anything we can about it in libcamera right now
>>>> (apart from not landing a patch like this).
>>> It's a bit late, but maybe there's a possibility to submit a lightning
>>> talk/BoF topic for LPC in two weeks ? Cache handling is a topic that
>>> crosses many subsystem boundaries, and I think we'll have quite a few
>>> people with relevant expertise in Vienna.
>> This is quite a complicated topic indeed.  The RPi camera stack
>> switched to using cacheable dma bufs for performance reasons (> 10%
>> uplift in certain use cases) and we had to be very careful with how to
>> handle the DMA_BUF_IOCTL_SYNC calls at the application level.
>> However, I don't think handling this in MappedFrameBuffer is the right
>> thing for hardware based ISPs because of unexpected stale data
>> flushing/invalidation.  I can expand on this during our F2F in Vienna.
> That seems a good discussion topic to me.
>
Unfortunately I won't be able to join that, but I'd be really interested 
to hear about ideas what could be done to improve the situation. :/

Patch
diff mbox series

diff --git a/src/libcamera/software_isp/debayer_cpu.cpp b/src/libcamera/software_isp/debayer_cpu.cpp
index 077f7f4b..6c953b03 100644
--- a/src/libcamera/software_isp/debayer_cpu.cpp
+++ b/src/libcamera/software_isp/debayer_cpu.cpp
@@ -11,7 +11,9 @@ 
 
 #include "debayer_cpu.h"
 
+#include <linux/dma-buf.h>
 #include <stdlib.h>
+#include <sys/ioctl.h>
 #include <time.h>
 
 #include <libcamera/formats.h>
@@ -733,6 +735,16 @@  void DebayerCpu::process(FrameBuffer *input, FrameBuffer *output, DebayerParams
 		clock_gettime(CLOCK_MONOTONIC_RAW, &frameStartTime);
 	}
 
+	for (const FrameBuffer::Plane &plane : output->planes()) {
+		const int fd = plane.fd.get();
+		struct dma_buf_sync sync = { DMA_BUF_SYNC_START };
+
+		sync.flags |= DMA_BUF_SYNC_WRITE;
+
+		if (ioctl (fd, DMA_BUF_IOCTL_SYNC, &sync) < 0)
+			LOG(Debayer, Error) << "Syncing buffer FD " << fd << "failed: " << errno;
+	}
+
 	green_ = params.green;
 	red_ = swapRedBlueGains_ ? params.blue : params.red;
 	blue_ = swapRedBlueGains_ ? params.red : params.blue;
@@ -760,6 +772,16 @@  void DebayerCpu::process(FrameBuffer *input, FrameBuffer *output, DebayerParams
 
 	metadata.planes()[0].bytesused = out.planes()[0].size();
 
+	for (const FrameBuffer::Plane &plane : output->planes()) {
+		const int fd = plane.fd.get();
+		struct dma_buf_sync sync = { DMA_BUF_SYNC_END };
+
+		sync.flags |= DMA_BUF_SYNC_WRITE;
+
+		if (ioctl (fd, DMA_BUF_IOCTL_SYNC, &sync) < 0)
+			LOG(Debayer, Error) << "Syncing buffer FD " << fd << "failed: " << errno;
+	}
+
 	/* Measure before emitting signals */
 	if (measuredFrames_ < DebayerCpu::kLastFrameToMeasure &&
 	    ++measuredFrames_ > DebayerCpu::kFramesToSkip) {