ipa: libipa: fixedpoint: Expand documentation on sign bit

Message ID 20260120083952.15338-1-jacopo.mondi@ideasonboard.com
State New
Series
  • ipa: libipa: fixedpoint: Expand documentation on sign bit

Commit Message

Jacopo Mondi Jan. 20, 2026, 8:39 a.m. UTC
Converting numbers with a signed fixed-point representation to
the corresponding float value requires including the sign bit in the
width of the fixed-point integral part.

Clearly specify it in documentation.

Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com>
---
 src/ipa/libipa/fixedpoint.cpp | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

--
2.52.0

Comments

Stefan Klug Jan. 20, 2026, 8:53 a.m. UTC | #1
Hi Jacopo,

Quoting Jacopo Mondi (2026-01-20 09:39:49)
> Converting numbers with a signed fixed-point representation to
> the corresponding float value requires to include the sign bit in the
> width of the fixed-point integral part.
> 
> Clearly specify it in documentation.
> 
> Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com>
> ---
>  src/ipa/libipa/fixedpoint.cpp | 22 +++++++++++++++++++++-
>  1 file changed, 21 insertions(+), 1 deletion(-)
> 
> diff --git a/src/ipa/libipa/fixedpoint.cpp b/src/ipa/libipa/fixedpoint.cpp
> index 6b698fc5d680..b37cdc43936f 100644
> --- a/src/ipa/libipa/fixedpoint.cpp
> +++ b/src/ipa/libipa/fixedpoint.cpp
> @@ -29,11 +29,31 @@ namespace ipa {
>  /**
>   * \fn R fixedToFloatingPoint(T number)
>   * \brief Convert a fixed-point number to a floating point representation
> - * \tparam I Bit width of the integer part of the fixed-point
> + * \tparam I Bit width of the integer part of the fixed-point including the
> + * optional sign bit
>   * \tparam F Bit width of the fractional part of the fixed-point
>   * \tparam R Return type of the floating point representation
>   * \tparam T Input type of the fixed-point representation
>   * \param number The fixed point number to convert to floating point
> + *
> + * If the fixed-point representation is signed, the sign bit shall be included
> + * in the \a I template parameter that specifies the number of bits of the
> + * integral part of the fixed-point representation.
> + *
> + * As an example, a value represented as signed fixed-point Q4.8 format can be
> + * converted to its corresponding floating point representation as:

I'm a bit confused here. Doesn't signed Q4.8 mean that the first bit of
the 4 is the sign bit? The same way a signed int32 has the signed bit on
the first of the 32 bits?

Best regards,
Stefan

> + *
> + * \code{.cpp}
> + * double d = fixedToFloatingPoint<5, 8, double, uint16_t>(fixed);
> + * \endcode
> + *
> + * While a value represented as unsigned fixed-point Q4.8 format can be
> + * converted as:
> + *
> + * \code{.cpp}
> + * double d = fixedToFloatingPoint<4, 8, double, uint16_t>(fixed);
> + * \endcode
> + *
>   * \return The converted value
>   */
> 
> --
> 2.52.0
>
Jacopo Mondi Jan. 20, 2026, 9 a.m. UTC | #2
Hi Stefan

On Tue, Jan 20, 2026 at 09:53:06AM +0100, Stefan Klug wrote:
> Hi Jacopo,
>
> Quoting Jacopo Mondi (2026-01-20 09:39:49)
> > Converting numbers with a signed fixed-point representation to
> > the corresponding float value requires to include the sign bit in the
> > width of the fixed-point integral part.
> >
> > Clearly specify it in documentation.
> >
> > Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com>
> > ---
> >  src/ipa/libipa/fixedpoint.cpp | 22 +++++++++++++++++++++-
> >  1 file changed, 21 insertions(+), 1 deletion(-)
> >
> > diff --git a/src/ipa/libipa/fixedpoint.cpp b/src/ipa/libipa/fixedpoint.cpp
> > index 6b698fc5d680..b37cdc43936f 100644
> > --- a/src/ipa/libipa/fixedpoint.cpp
> > +++ b/src/ipa/libipa/fixedpoint.cpp
> > @@ -29,11 +29,31 @@ namespace ipa {
> >  /**
> >   * \fn R fixedToFloatingPoint(T number)
> >   * \brief Convert a fixed-point number to a floating point representation
> > - * \tparam I Bit width of the integer part of the fixed-point
> > + * \tparam I Bit width of the integer part of the fixed-point including the
> > + * optional sign bit
> >   * \tparam F Bit width of the fractional part of the fixed-point
> >   * \tparam R Return type of the floating point representation
> >   * \tparam T Input type of the fixed-point representation
> >   * \param number The fixed point number to convert to floating point
> > + *
> > + * If the fixed-point representation is signed, the sign bit shall be included
> > + * in the \a I template parameter that specifies the number of bits of the
> > + * integral part of the fixed-point representation.
> > + *
> > + * As an example, a value represented as signed fixed-point Q4.8 format can be
> > + * converted to its corresponding floating point representation as:
>
> I'm a bit confused here. Doesn't signed Q4.8 mean that the first bit of
> the 4 is the sign bit? The same way a signed int32 has the signed bit on
> the first of the 32 bits?

I'm right now looking at the datasheet documentation of a value said
to be in "signed Q4.8" format whose register size is 13 bits

Coefft R-G [12:0] : sign/magnitude 4.8-bit fixed-point

>
> Best regards,
> Stefan
>
> > + *
> > + * \code{.cpp}
> > + * double d = fixedToFloatingPoint<5, 8, double, uint16_t>(fixed);
> > + * \endcode
> > + *
> > + * While a value represented as unsigned fixed-point Q4.8 format can be
> > + * converted as:
> > + *
> > + * \code{.cpp}
> > + * double d = fixedToFloatingPoint<4, 8, double, uint16_t>(fixed);
> > + * \endcode
> > + *
> >   * \return The converted value
> >   */
> >
> > --
> > 2.52.0
> >
Stefan Klug Jan. 20, 2026, 9:10 a.m. UTC | #3
Hi Jacopo,

Quoting Jacopo Mondi (2026-01-20 10:00:14)
> Hi Stefan
> 
> On Tue, Jan 20, 2026 at 09:53:06AM +0100, Stefan Klug wrote:
> > Hi Jacopo,
> >
> > Quoting Jacopo Mondi (2026-01-20 09:39:49)
> > > Converting numbers with a signed fixed-point representation to
> > > the corresponding float value requires to include the sign bit in the
> > > width of the fixed-point integral part.
> > >
> > > Clearly specify it in documentation.
> > >
> > > Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com>
> > > ---
> > >  src/ipa/libipa/fixedpoint.cpp | 22 +++++++++++++++++++++-
> > >  1 file changed, 21 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/src/ipa/libipa/fixedpoint.cpp b/src/ipa/libipa/fixedpoint.cpp
> > > index 6b698fc5d680..b37cdc43936f 100644
> > > --- a/src/ipa/libipa/fixedpoint.cpp
> > > +++ b/src/ipa/libipa/fixedpoint.cpp
> > > @@ -29,11 +29,31 @@ namespace ipa {
> > >  /**
> > >   * \fn R fixedToFloatingPoint(T number)
> > >   * \brief Convert a fixed-point number to a floating point representation
> > > - * \tparam I Bit width of the integer part of the fixed-point
> > > + * \tparam I Bit width of the integer part of the fixed-point including the
> > > + * optional sign bit
> > >   * \tparam F Bit width of the fractional part of the fixed-point
> > >   * \tparam R Return type of the floating point representation
> > >   * \tparam T Input type of the fixed-point representation
> > >   * \param number The fixed point number to convert to floating point
> > > + *
> > > + * If the fixed-point representation is signed, the sign bit shall be included
> > > + * in the \a I template parameter that specifies the number of bits of the
> > > + * integral part of the fixed-point representation.
> > > + *
> > > + * As an example, a value represented as signed fixed-point Q4.8 format can be
> > > + * converted to its corresponding floating point representation as:
> >
> > I'm a bit confused here. Doesn't signed Q4.8 mean that the first bit of
> > the 4 is the sign bit? The same way a signed int32 has the signed bit on
> > the first of the 32 bits?
> 
> I'm right now looking at the datasheet documentation of a value said
> to be in "signed Q4.8" format whose register size is 13 bits
> 
> Coefft R-G [12:0] : sign/magnitude 4.8-bit fixed-point

I should have consulted Wikipedia first. https://en.wikipedia.org/wiki/Q_(number_format)
clearly states that the sign bit is implicitly added.

Best regards,
Stefan

> 
> >
> > Best regards,
> > Stefan
> >
> > > + *
> > > + * \code{.cpp}
> > > + * double d = fixedToFloatingPoint<5, 8, double, uint16_t>(fixed);
> > > + * \endcode
> > > + *
> > > + * While a value represented as unsigned fixed-point Q4.8 format can be
> > > + * converted as:
> > > + *
> > > + * \code{.cpp}
> > > + * double d = fixedToFloatingPoint<4, 8, double, uint16_t>(fixed);
> > > + * \endcode
> > > + *
> > >   * \return The converted value
> > >   */
> > >
> > > --
> > > 2.52.0
> > >
Barnabás Pőcze Jan. 20, 2026, 9:11 a.m. UTC | #4
On 2026. 01. 20. at 10:00, Jacopo Mondi wrote:
> Hi Stefan
> 
> On Tue, Jan 20, 2026 at 09:53:06AM +0100, Stefan Klug wrote:
>> Hi Jacopo,
>>
>> Quoting Jacopo Mondi (2026-01-20 09:39:49)
>>> Converting numbers with a signed fixed-point representation to
>>> the corresponding float value requires to include the sign bit in the
>>> width of the fixed-point integral part.
>>>
>>> Clearly specify it in documentation.
>>>
>>> Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com>
>>> ---
>>>   src/ipa/libipa/fixedpoint.cpp | 22 +++++++++++++++++++++-
>>>   1 file changed, 21 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/src/ipa/libipa/fixedpoint.cpp b/src/ipa/libipa/fixedpoint.cpp
>>> index 6b698fc5d680..b37cdc43936f 100644
>>> --- a/src/ipa/libipa/fixedpoint.cpp
>>> +++ b/src/ipa/libipa/fixedpoint.cpp
>>> @@ -29,11 +29,31 @@ namespace ipa {
>>>   /**
>>>    * \fn R fixedToFloatingPoint(T number)
>>>    * \brief Convert a fixed-point number to a floating point representation
>>> - * \tparam I Bit width of the integer part of the fixed-point
>>> + * \tparam I Bit width of the integer part of the fixed-point including the
>>> + * optional sign bit
>>>    * \tparam F Bit width of the fractional part of the fixed-point
>>>    * \tparam R Return type of the floating point representation
>>>    * \tparam T Input type of the fixed-point representation
>>>    * \param number The fixed point number to convert to floating point
>>> + *
>>> + * If the fixed-point representation is signed, the sign bit shall be included
>>> + * in the \a I template parameter that specifies the number of bits of the
>>> + * integral part of the fixed-point representation.
>>> + *
>>> + * As an example, a value represented as signed fixed-point Q4.8 format can be
>>> + * converted to its corresponding floating point representation as:
>>
>> I'm a bit confused here. Doesn't signed Q4.8 mean that the first bit of
>> the 4 is the sign bit? The same way a signed int32 has the signed bit on
>> the first of the 32 bits?

It would appear there are two interpretations: https://en.wikipedia.org/wiki/Q_(number_format)

"Texas Instruments version": "Thus, the total number w of bits used is 1 + m + n."
"ARM version": "A variant of the Q notation has been in use by ARM in which the m number also counts the sign bit."
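
For what it's worth, the two conventions only differ in where the sign
bit is counted. A small sketch with hypothetical helper names (not
libipa API) that computes the storage width of a signed Qm.n value
under each reading:

```cpp
// Hypothetical helpers, not part of libipa: total storage width of a
// signed Qm.n value under the two conventions quoted above.

// "Texas Instruments version": the sign bit is implicit, w = 1 + m + n.
constexpr unsigned int tiSignedWidth(unsigned int m, unsigned int n)
{
	return 1 + m + n;
}

// "ARM version": m already counts the sign bit, w = m + n.
constexpr unsigned int armSignedWidth(unsigned int m, unsigned int n)
{
	return m + n;
}

// A signed Q4.8 value needs 13 bits in the TI reading, 12 in the ARM one.
static_assert(tiSignedWidth(4, 8) == 13);
static_assert(armSignedWidth(4, 8) == 12);
// A 13-bit signed value is therefore Q5.8 in ARM notation.
static_assert(armSignedWidth(5, 8) == tiSignedWidth(4, 8));
```

So the same 13-bit signed register can be called "Q4.8" or "Q5.8"
depending on which convention the datasheet follows.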


> 
> I'm right now looking at the datasheet documentation of a value said
> to be in "signed Q4.8" format whose register size is 13 bits
> 
> Coefft R-G [12:0] : sign/magnitude 4.8-bit fixed-point

Does that mean "sign/magnitude" as in https://en.wikipedia.org/wiki/Signed_number_representations#Sign–magnitude ?
If so, then I'm not sure these functions will work.
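
To illustrate the concern: the same 13 bits decode to different values
depending on the encoding. This is only a sketch with hypothetical
helpers (not libipa functions), assuming a [12:0] register with 8
fractional bits:

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only; neither is part of libipa.
// Both decode a 13-bit register with 8 fractional bits.

// Two's complement: sign-extend bit 12, then scale by 2^-8.
constexpr double decodeTwosComplement(uint16_t reg)
{
	int value = reg & 0x1fff;
	if (value & 0x1000)
		value -= 0x2000;
	return value / 256.0;
}

// Sign/magnitude: bit 12 is the sign, bits [11:0] hold the absolute value.
constexpr double decodeSignMagnitude(uint16_t reg)
{
	int magnitude = reg & 0x0fff;
	return (reg & 0x1000) ? -magnitude / 256.0 : magnitude / 256.0;
}

// Positive values agree: 0x0100 is 1.0 in both encodings.
static_assert(decodeTwosComplement(0x0100) == 1.0);
static_assert(decodeSignMagnitude(0x0100) == 1.0);
// Negative values do not: 0x1f00 is -1.0 in two's complement...
static_assert(decodeTwosComplement(0x1f00) == -1.0);
// ...but -15.0 when read as sign/magnitude.
static_assert(decodeSignMagnitude(0x1f00) == -15.0);
```

Non-negative values round-trip either way, so a test with positive
coefficients alone would not tell the two encodings apart.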


> 
>>
>> Best regards,
>> Stefan
>>
>>> + *
>>> + * \code{.cpp}
>>> + * double d = fixedToFloatingPoint<5, 8, double, uint16_t>(fixed);
>>> + * \endcode
>>> + *
>>> + * While a value represented as unsigned fixed-point Q4.8 format can be
>>> + * converted as:
>>> + *
>>> + * \code{.cpp}
>>> + * double d = fixedToFloatingPoint<4, 8, double, uint16_t>(fixed);
>>> + * \endcode
>>> + *
>>>    * \return The converted value
>>>    */
>>>
>>> --
>>> 2.52.0
>>>
Jacopo Mondi Jan. 20, 2026, 7:26 p.m. UTC | #5
Hi Barnabás

On Tue, Jan 20, 2026 at 10:11:10AM +0100, Barnabás Pőcze wrote:
> On 2026. 01. 20. at 10:00, Jacopo Mondi wrote:
> > Hi Stefan
> >
> > On Tue, Jan 20, 2026 at 09:53:06AM +0100, Stefan Klug wrote:
> > > Hi Jacopo,
> > >
> > > Quoting Jacopo Mondi (2026-01-20 09:39:49)
> > > > Converting numbers with a signed fixed-point representation to
> > > > the corresponding float value requires to include the sign bit in the
> > > > width of the fixed-point integral part.
> > > >
> > > > Clearly specify it in documentation.
> > > >
> > > > Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com>
> > > > ---
> > > >   src/ipa/libipa/fixedpoint.cpp | 22 +++++++++++++++++++++-
> > > >   1 file changed, 21 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/src/ipa/libipa/fixedpoint.cpp b/src/ipa/libipa/fixedpoint.cpp
> > > > index 6b698fc5d680..b37cdc43936f 100644
> > > > --- a/src/ipa/libipa/fixedpoint.cpp
> > > > +++ b/src/ipa/libipa/fixedpoint.cpp
> > > > @@ -29,11 +29,31 @@ namespace ipa {
> > > >   /**
> > > >    * \fn R fixedToFloatingPoint(T number)
> > > >    * \brief Convert a fixed-point number to a floating point representation
> > > > - * \tparam I Bit width of the integer part of the fixed-point
> > > > + * \tparam I Bit width of the integer part of the fixed-point including the
> > > > + * optional sign bit
> > > >    * \tparam F Bit width of the fractional part of the fixed-point
> > > >    * \tparam R Return type of the floating point representation
> > > >    * \tparam T Input type of the fixed-point representation
> > > >    * \param number The fixed point number to convert to floating point
> > > > + *
> > > > + * If the fixed-point representation is signed, the sign bit shall be included
> > > > + * in the \a I template parameter that specifies the number of bits of the
> > > > + * integral part of the fixed-point representation.
> > > > + *
> > > > + * As an example, a value represented as signed fixed-point Q4.8 format can be
> > > > + * converted to its corresponding floating point representation as:
> > >
> > > I'm a bit confused here. Doesn't signed Q4.8 mean that the first bit of
> > > the 4 is the sign bit? The same way a signed int32 has the signed bit on
> > > the first of the 32 bits?
>
> It would appear there are two interpretations: https://en.wikipedia.org/wiki/Q_(number_format)
>
> "Texas Instruments version": "Thus, the total number w of bits used is 1 + m + n."
> "ARM version": "A variant of the Q notation has been in use by ARM in which the m number also counts the sign bit."
>
>
> >
> > I'm right now looking at the datasheet documentation of a value said
> > to be in "signed Q4.8" format whose register size is 13 bits
> >
> > Coefft R-G [12:0] : sign/magnitude 4.8-bit fixed-point
>
> Does that mean "sign/magnitude" as in https://en.wikipedia.org/wiki/Signed_number_representations#Sign–magnitude ?
> If so, then I'm not sure these functions will work.

I had just told Stefan "I'm not sure I actually know what 'magnitude'
implies there", and I didn't :)

So, I had a bit of a read around, including Kieran's Quantized type
series, and I fell into an all too familiar deep rabbit hole.

--------------------- TL;DR -----------------------------------------------
Feel free to skip, these are mostly notes to clarify my understanding
---------------------------------------------------------------------------

Let's look at floatingToFixedPoint() remembering that

f = float value
q = value in Q<m,n>

        f = q / 2^n
        q = f * 2^n

And that's what floatingToFixedPoint() does

template<unsigned int I, unsigned int F, typename R, typename T>
constexpr R floatingToFixedPoint(T number)
{
	static_assert(sizeof(int) >= sizeof(R));
	static_assert(I + F <= sizeof(R) * 8);

	R mask = (1 << (F + I)) - 1;
	R frac = static_cast<R>(static_cast<int>(std::round(number * (1 << F)))) & mask;

	return frac;
}

which can be summarized as ((f * 2^n) & mask)
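
To make the role of the mask concrete, here is a minimal sketch.
fixedMask() is a hypothetical helper that only mirrors the mask
expression of the function above, and the 13-bit register is the one
from the datasheet discussed earlier:

```cpp
#include <cstdint>

// Hypothetical helper mirroring the mask computed by
// floatingToFixedPoint(): only the low I + F bits are kept.
template<unsigned int I, unsigned int F>
constexpr uint16_t fixedMask()
{
	return static_cast<uint16_t>((1u << (I + F)) - 1);
}

static_assert(fixedMask<4, 8>() == 0x0fff);
static_assert(fixedMask<5, 8>() == 0x1fff);

// -1.0 * 2^8 is -256, i.e. 0xff00 once converted to uint16_t.
static_assert(static_cast<uint16_t>(-256) == 0xff00);
// With I = 4 only 12 bits survive and bit 12 is cleared...
static_assert((0xff00 & fixedMask<4, 8>()) == 0x0f00);
// ...while I = 5 keeps bit 12, the sign bit of a 13-bit register.
static_assert((0xff00 & fixedMask<5, 8>()) == 0x1f00);
```

In other words, for a 13-bit two's complement register the sign bit
only reaches the hardware if it is counted in I.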

All good, but how is this handled if floatingToFixedPoint<>() is
called as:
        block->gain01 = floatingToFixedPoint<4, 8, uint16_t, double>(1.0);

        uint16_t frac = static_cast<uint16_t>(
                        static_cast<int>(std::round(1.0 * 2^8))) & mask;

1.0 * 2^8 is a double
calling std::round(double) picks the right overload and returns a
double

the double is cast to int. Before C23 the C standard didn't impose a
representation for signed integers and allowed sign/magnitude, ones'
complement or two's complement. It's fair to assume two's complement
everywhere, and C23 (like C++20) makes it mandatory.

So, on a typical 64-bit platform, where int is 32 bits wide, we have
the result of (1.0 * 2^8) represented as a signed 32-bit integer in
two's complement.

According to the C standard, to cast a signed int to an unsigned int

"When a value with integer type is converted to another integer type
other than _Bool, if the value can be represented by the new type, it
is unchanged.

Otherwise, if the new type is unsigned, the value is converted by
repeatedly adding or subtracting one more than the maximum value that
can be represented in the new type until the value is in the range of
the new type."

As vague as it might sound to me

-43 + 2^16 = 65493 = 1111 1111 1101 0101
which in 16-bit two's complement is ... -43
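
The rule quoted above is plain modular arithmetic, and the -43 example
can be checked at compile time. A minimal sketch, in C++ since that is
what libipa uses (the same conversion rule applies there):

```cpp
#include <cstdint>

// Converting a negative int to uint16_t adds 2^16 until the value fits,
// which yields the same bit pattern as 16-bit two's complement.
static_assert(static_cast<uint16_t>(-43) == 65493);
static_assert(static_cast<uint16_t>(-43) == 0xffd5);

// Reading the same bits back as int16_t recovers -43. This direction is
// implementation-defined before C++20 and guaranteed from C++20 on.
static_assert(static_cast<int16_t>(static_cast<uint16_t>(-43)) == -43);
```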

Amazing, let's start from the beginning.

I want to write the number -1.45 to a 13-bit register in signed Q<4,8>
format:

        uint16_t q = floatingToFixedPoint<4, 8, uint16_t, double>(-1.45);

        std::round(-1.45 * 2^8) = -371

        static_cast<int>(-371) is stored as two's complement in an int
        static_cast<uint16_t>(-371) = -371 + 2^16 = 65165

        65165 = 1111 1110 1000 1101

        if we interpret this as a 13-bit register value in signed
        Q<4,8> format

        xxx1 1110 1000 1101

        bit 12 is the sign bit and it is set, so the magnitude is the
        two's complement of the 13-bit value:
        ~(1 1110 1000 1101) + 1 = 0 0001 0111 0010 + 1 = 370 + 1 = 371

Amazing!
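
The walk-through above can be replayed as a sketch. The helper names
are hypothetical, roundToInt() is a constexpr stand-in for std::round()
plus the cast to int, and the mask is assumed to cover all 13 register
bits (that is, I = 5 with the sign bit counted in):

```cpp
#include <cstdint>

// Hypothetical constexpr stand-in for std::round() followed by a cast to
// int: rounds half away from zero.
constexpr int roundToInt(double x)
{
	return x < 0 ? static_cast<int>(x - 0.5) : static_cast<int>(x + 0.5);
}

// Encode a value as 13-bit two's complement with 8 fractional bits.
constexpr uint16_t encodeQ(double f)
{
	return static_cast<uint16_t>(roundToInt(f * 256)) & 0x1fff;
}

// Decode by sign-extending bit 12 and dividing by 2^8.
constexpr double decodeQ(uint16_t q)
{
	int value = q & 0x1fff;
	if (value & 0x1000)
		value -= 0x2000;
	return value / 256.0;
}

static_assert(roundToInt(-1.45 * 256) == -371);
// 65165 (0xfe8d) truncated to 13 bits is 0x1e8d.
static_assert(encodeQ(-1.45) == 0x1e8d);
// The round trip lands on -371 / 256, the closest representable value.
static_assert(decodeQ(encodeQ(-1.45)) == -371 / 256.0);
```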

--------------------- End TL;DR -------------------------------------------

Now, I want this in sign/magnitude. I bet there are smarter ways of
doing this, but if I take the result of floatingToFixedPoint() and
check the sign bit, can I simply add it back to the absolute value of
the result?

As a bit of pseudo code

        /* Round first; masking a negative int would hide its sign. */
        int reg = static_cast<int>(std::round(number * (1 << F)));
        uint16_t res = std::abs(reg) & mask;
        if (reg < 0)
                res |= BIT(12); /* sign bit of a [12:0] register */

I think this could surely be optimized and nicely turned into a Traits
that can be added to the Quantized series Kieran is working on.
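
A self-contained version of that idea could look like the sketch below.
It assumes the register really is sign/magnitude with the sign in bit
12 and a 12-bit magnitude, and encodeSignMagnitude() is a hypothetical
name, not something from libipa or the Quantized series:

```cpp
#include <cstdint>

// Hypothetical sketch, assuming the register really is sign/magnitude:
// bit 12 is the sign and bits [11:0] hold |value| * 2^8.
constexpr uint16_t encodeSignMagnitude(double f)
{
	// Round half away from zero on the absolute value (stand-in for
	// std::round(), which is not guaranteed to be constexpr in C++17).
	double scaled = (f < 0 ? -f : f) * 256;
	uint16_t magnitude = static_cast<uint16_t>(scaled + 0.5) & 0x0fff;
	// Avoid emitting "negative zero" when the magnitude rounds to 0.
	return (f < 0 && magnitude != 0) ? (magnitude | 0x1000) : magnitude;
}

static_assert(encodeSignMagnitude(1.0) == 0x0100);
// 371 is 0x173, so -1.45 becomes the sign bit plus 0x173.
static_assert(encodeSignMagnitude(-1.45) == 0x1173);
```

Note that the sketch does not saturate: magnitudes of 16.0 and above
are silently truncated by the mask, which a real implementation would
probably want to clamp instead.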

>
>
> >
> > >
> > > Best regards,
> > > Stefan
> > >
> > > > + *
> > > > + * \code{.cpp}
> > > > + * double d = fixedToFloatingPoint<5, 8, double, uint16_t>(fixed);
> > > > + * \endcode
> > > > + *
> > > > + * While a value represented as unsigned fixed-point Q4.8 format can be
> > > > + * converted as:
> > > > + *
> > > > + * \code{.cpp}
> > > > + * double d = fixedToFloatingPoint<4, 8, double, uint16_t>(fixed);
> > > > + * \endcode
> > > > + *
> > > >    * \return The converted value
> > > >    */
> > > >
> > > > --
> > > > 2.52.0
> > > >
>
Laurent Pinchart Jan. 20, 2026, 9:23 p.m. UTC | #6
On Tue, Jan 20, 2026 at 08:26:29PM +0100, Jacopo Mondi wrote:
> On Tue, Jan 20, 2026 at 10:11:10AM +0100, Barnabás Pőcze wrote:
> > On 2026. 01. 20. at 10:00, Jacopo Mondi wrote:
> > > On Tue, Jan 20, 2026 at 09:53:06AM +0100, Stefan Klug wrote:
> > > > Quoting Jacopo Mondi (2026-01-20 09:39:49)
> > > > > Converting numbers with a signed fixed-point representation to
> > > > > the corresponding float value requires to include the sign bit in the
> > > > > width of the fixed-point integral part.
> > > > >
> > > > > Clearly specify it in documentation.
> > > > >
> > > > > Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com>
> > > > > ---
> > > > >   src/ipa/libipa/fixedpoint.cpp | 22 +++++++++++++++++++++-
> > > > >   1 file changed, 21 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/src/ipa/libipa/fixedpoint.cpp b/src/ipa/libipa/fixedpoint.cpp
> > > > > index 6b698fc5d680..b37cdc43936f 100644
> > > > > --- a/src/ipa/libipa/fixedpoint.cpp
> > > > > +++ b/src/ipa/libipa/fixedpoint.cpp
> > > > > @@ -29,11 +29,31 @@ namespace ipa {
> > > > >   /**
> > > > >    * \fn R fixedToFloatingPoint(T number)
> > > > >    * \brief Convert a fixed-point number to a floating point representation
> > > > > - * \tparam I Bit width of the integer part of the fixed-point
> > > > > + * \tparam I Bit width of the integer part of the fixed-point including the
> > > > > + * optional sign bit
> > > > >    * \tparam F Bit width of the fractional part of the fixed-point
> > > > >    * \tparam R Return type of the floating point representation
> > > > >    * \tparam T Input type of the fixed-point representation
> > > > >    * \param number The fixed point number to convert to floating point
> > > > > + *
> > > > > + * If the fixed-point representation is signed, the sign bit shall be included
> > > > > + * in the \a I template parameter that specifies the number of bits of the
> > > > > + * integral part of the fixed-point representation.
> > > > > + *
> > > > > + * As an example, a value represented as signed fixed-point Q4.8 format can be
> > > > > + * converted to its corresponding floating point representation as:
> > > >
> > > > I'm a bit confused here. Doesn't signed Q4.8 mean that the first bit of
> > > > the 4 is the sign bit? The same way a signed int32 has the signed bit on
> > > > the first of the 32 bits?
> >
> > It would appear there are two interpretations: https://en.wikipedia.org/wiki/Q_(number_format)
> >
> > "Texas Instruments version": "Thus, the total number w of bits used is 1 + m + n."
> > "ARM version": "A variant of the Q notation has been in use by ARM in which the m number also counts the sign bit."
> >
> > > I'm right now looking at the datasheet documentation of a value said
> > > to be in "signed Q4.8" format whose register size is 13 bits
> > >
> > > Coefft R-G [12:0] : sign/magnitude 4.8-bit fixed-point
> >
> > Does that mean "sign/magnitude" as in https://en.wikipedia.org/wiki/Signed_number_representations#Sign–magnitude ?
> > If so, then I'm not sure these functions will work.
> 
> I had just told Stefan "I'm not sure I actually know what 'magnitude'
> implies there", and I didn't :)
> 
> So, I had a bit of a read around, including Kieran's Quantized type
> series, and I fell into an all too familiar deep rabbit hole.
> 
> --------------------- TL;DR -----------------------------------------------
> Feel free to skip, these are mostly notes to clarify my understanding
> ---------------------------------------------------------------------------
> 
> Let's look at floatingToFixedPoint() remembering that
> 
> f = float value
> q = value in Q<m,n>
> 
>         f = q / 2^n
>         q = f * 2^n
> 
> And that's what floatingToFixedPoint() does
> 
> template<unsigned int I, unsigned int F, typename R, typename T>
> constexpr R floatingToFixedPoint(T number)
> {
> 	static_assert(sizeof(int) >= sizeof(R));
> 	static_assert(I + F <= sizeof(R) * 8);
> 
> 	R mask = (1 << (F + I)) - 1;
> 	R frac = static_cast<R>(static_cast<int>(std::round(number * (1 << F)))) & mask;
> 
> 	return frac;
> }
> 
> which can be summarized as ((f * 2^n) & mask)
> 
> All good, but how is this handled if floatingToFixedPoint<>() is
> called as:
>         block->gain01 = floatingToFixedPoint<4, 8, uint16_t, double>(1.0);
> 
>         uint16_t frac = static_cast<uint16_t>(
>                         static_cast<int>(std::round(1.0 * 2^8))) & mask;
> 
> 1.0 * 2^8 is a double
> calling std::round(double) picks the right overload and returns a
> double
> 
> the double is cast to int. Before C23 the C standard didn't impose a
> representation for signed integers and allowed sign/magnitude, ones'
> complement or two's complement. It's fair to assume two's complement
> everywhere, and C23 (like C++20) makes it mandatory.
> 
> So, on a typical 64-bit platform, where int is 32 bits wide, we have
> the result of (1.0 * 2^8) represented as a signed 32-bit integer in
> two's complement.
> 
> According to the C standard, to cast a signed int to an unsigned int
> 
> "When a value with integer type is converted to another integer type
> other than _Bool, if the value can be represented by the new type, it
> is unchanged.
> 
> Otherwise, if the new type is unsigned, the value is converted by
> repeatedly adding or subtracting one more than the maximum value that
> can be represented in the new type until the value is in the range of
> the new type."
> 
> As vague as it might sound to me
> 
> -43 + 2^16 = 65493 = 1111 1111 1101 0101
> which in 16-bit two's complement is ... -43
> 
> Amazing, let's start from the beginning.
> 
> I want to write the number -1.45 to a 13-bit register in signed Q<4,8>
> format:
> 
>         uint16_t q = floatingToFixedPoint<4, 8, uint16_t, double>(-1.45);
> 
>         std::round(-1.45 * 2^8) = -371
> 
>         static_cast<int>(-371) is stored as two's complement in an int
>         static_cast<uint16_t>(-371) = -371 + 2^16 = 65165
> 
>         65165 = 1111 1110 1000 1101
> 
>         if we interpret this as a 13-bit register value in signed
>         Q<4,8> format
> 
>         xxx1 1110 1000 1101
> 
>         bit 12 is the sign bit and it is set, so the magnitude is the
>         two's complement of the 13-bit value:
>         ~(1 1110 1000 1101) + 1 = 0 0001 0111 0010 + 1 = 370 + 1 = 371
> 
> Amazing!
> 
> --------------------- End TL;DR -------------------------------------------
> 
> Now, I want this in sign/magnitude. I bet there are smarter ways of
> doing this, but if I take the result of floatingToFixedPoint() and
> check the sign bit, can I simply add it back to the absolute value of
> the result?
> 
> As a bit of pseudo code
> 
>         /* Round first; masking a negative int would hide its sign. */
>         int reg = static_cast<int>(std::round(number * (1 << F)));
>         uint16_t res = std::abs(reg) & mask;
>         if (reg < 0)
>                 res |= BIT(12); /* sign bit of a [12:0] register */
> 
> I think this could surely be optimized and nicely turned into a Traits
> that can be added to the Quantized series Kieran is working on.

I think you should first test to see if "sign-magnitude" mentioned in
the datasheet actually means that, or if it's a signed fixed-point
value. If it's the former we'll see how to support it.

> > > > > + *
> > > > > + * \code{.cpp}
> > > > > + * double d = fixedToFloatingPoint<5, 8, double, uint16_t>(fixed);
> > > > > + * \endcode
> > > > > + *
> > > > > + * While a value represented as unsigned fixed-point Q4.8 format can be
> > > > > + * converted as:
> > > > > + *
> > > > > + * \code{.cpp}
> > > > > + * double d = fixedToFloatingPoint<4, 8, double, uint16_t>(fixed);
> > > > > + * \endcode
> > > > > + *
> > > > >    * \return The converted value
> > > > >    */
Kieran Bingham Jan. 21, 2026, 12:23 p.m. UTC | #7
Hi Jacopo,

Quoting Stefan Klug (2026-01-20 08:53:06)
> Hi Jacopo,
> 
> Quoting Jacopo Mondi (2026-01-20 09:39:49)
> > Converting numbers with a signed fixed-point representation to
> > the corresponding float value requires to include the sign bit in the
> > width of the fixed-point integral part.
> > 
> > Clearly specify it in documentation.
> > 
> > Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com>
> > ---
> >  src/ipa/libipa/fixedpoint.cpp | 22 +++++++++++++++++++++-
> >  1 file changed, 21 insertions(+), 1 deletion(-)
> > 
> > diff --git a/src/ipa/libipa/fixedpoint.cpp b/src/ipa/libipa/fixedpoint.cpp
> > index 6b698fc5d680..b37cdc43936f 100644
> > --- a/src/ipa/libipa/fixedpoint.cpp
> > +++ b/src/ipa/libipa/fixedpoint.cpp
> > @@ -29,11 +29,31 @@ namespace ipa {
> >  /**
> >   * \fn R fixedToFloatingPoint(T number)
> >   * \brief Convert a fixed-point number to a floating point representation
> > - * \tparam I Bit width of the integer part of the fixed-point
> > + * \tparam I Bit width of the integer part of the fixed-point including the
> > + * optional sign bit
> >   * \tparam F Bit width of the fractional part of the fixed-point
> >   * \tparam R Return type of the floating point representation
> >   * \tparam T Input type of the fixed-point representation
> >   * \param number The fixed point number to convert to floating point
> > + *
> > + * If the fixed-point representation is signed, the sign bit shall be included
> > + * in the \a I template parameter that specifies the number of bits of the
> > + * integral part of the fixed-point representation.
> > + *
> > + * As an example, a value represented as signed fixed-point Q4.8 format can be
> > + * converted to its corresponding floating point representation as:

Just to be sure - you know I've got patches to remove all of the above
that I want to get merged 'soon' right?

Quantized brings in explicit signed/unsigned types through Q<4,8> and
UQ<4, 8> types.

In the new types Q<I, F> has the sign bit included in 'I'.
I can add that explicitly to the documentation in my new series for v6.


"""
 * The sign of the value is determined by the sign of \a T. For signed types,
 * the number of integer bits includes the sign bit.
"""
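
As a quick illustration of what counting the sign bit in I means for
the representable range (hypothetical helpers, not the API of the
Quantized series, and assuming two's complement storage):

```cpp
// Hypothetical helpers, not the API of the Quantized series: value range
// of a fixed-point format with I integer bits and F fractional bits.

// Signed, two's complement, with the sign bit counted in I.
template<unsigned int I, unsigned int F>
constexpr double signedMin()
{
	return -static_cast<double>(1 << (I - 1));
}

template<unsigned int I, unsigned int F>
constexpr double signedMax()
{
	return static_cast<double>((1 << (I + F - 1)) - 1) / (1 << F);
}

// Unsigned: all I bits carry magnitude.
template<unsigned int I, unsigned int F>
constexpr double unsignedMax()
{
	return static_cast<double>((1 << (I + F)) - 1) / (1 << F);
}

// With the sign bit counted in I, signed <4, 8> spans [-8, 8 - 1/256].
static_assert(signedMin<4, 8>() == -8.0);
static_assert(signedMax<4, 8>() == 7.99609375);
// The unsigned counterpart spans [0, 16 - 1/256].
static_assert(unsignedMax<4, 8>() == 15.99609375);
```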

--
Kieran

> I'm a bit confused here. Doesn't signed Q4.8 mean that the first bit of
> the 4 is the sign bit? The same way a signed int32 has the signed bit on
> the first of the 32 bits?
> 
> Best regards,
> Stefan
> 
> > + *
> > + * \code{.cpp}
> > + * double d = fixedToFloatingPoint<5, 8, double, uint16_t>(fixed);
> > + * \endcode
> > + *
> > + * While a value represented as unsigned fixed-point Q4.8 format can be
> > + * converted as:
> > + *
> > + * \code{.cpp}
> > + * double d = fixedToFloatingPoint<4, 8, double, uint16_t>(fixed);
> > + * \endcode
> > + *
> >   * \return The converted value
> >   */
> > 
> > --
> > 2.52.0
> >
Jacopo Mondi Jan. 21, 2026, 12:42 p.m. UTC | #8
Hi Laurent

On Tue, Jan 20, 2026 at 11:23:42PM +0200, Laurent Pinchart wrote:
> On Tue, Jan 20, 2026 at 08:26:29PM +0100, Jacopo Mondi wrote:
> > On Tue, Jan 20, 2026 at 10:11:10AM +0100, Barnabás Pőcze wrote:
> > > On 2026. 01. 20. at 10:00, Jacopo Mondi wrote:
> > > > On Tue, Jan 20, 2026 at 09:53:06AM +0100, Stefan Klug wrote:
> > > > > Quoting Jacopo Mondi (2026-01-20 09:39:49)
> > > > > > Converting numbers with a signed fixed-point representation to
> > > > > > the corresponding float value requires to include the sign bit in the
> > > > > > width of the fixed-point integral part.
> > > > > >
> > > > > > Clearly specify it in documentation.
> > > > > >
> > > > > > Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com>
> > > > > > ---
> > > > > >   src/ipa/libipa/fixedpoint.cpp | 22 +++++++++++++++++++++-
> > > > > >   1 file changed, 21 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/src/ipa/libipa/fixedpoint.cpp b/src/ipa/libipa/fixedpoint.cpp
> > > > > > index 6b698fc5d680..b37cdc43936f 100644
> > > > > > --- a/src/ipa/libipa/fixedpoint.cpp
> > > > > > +++ b/src/ipa/libipa/fixedpoint.cpp
> > > > > > @@ -29,11 +29,31 @@ namespace ipa {
> > > > > >   /**
> > > > > >    * \fn R fixedToFloatingPoint(T number)
> > > > > >    * \brief Convert a fixed-point number to a floating point representation
> > > > > > - * \tparam I Bit width of the integer part of the fixed-point
> > > > > > + * \tparam I Bit width of the integer part of the fixed-point including the
> > > > > > + * optional sign bit
> > > > > >    * \tparam F Bit width of the fractional part of the fixed-point
> > > > > >    * \tparam R Return type of the floating point representation
> > > > > >    * \tparam T Input type of the fixed-point representation
> > > > > >    * \param number The fixed point number to convert to floating point
> > > > > > + *
> > > > > > + * If the fixed-point representation is signed, the sign bit shall be included
> > > > > > + * in the \a I template parameter that specifies the number of bits of the
> > > > > > + * integral part of the fixed-point representation.
> > > > > > + *
> > > > > > + * As an example, a value represented as signed fixed-point Q4.8 format can be
> > > > > > + * converted to its corresponding floating point representation as:
> > > > >
> > > > > I'm a bit confused here. Doesn't signed Q4.8 mean that the first bit of
> > > > > the 4 is the sign bit? The same way a signed int32 has the signed bit on
> > > > > the first of the 32 bits?
> > >
> > > It would appear there are two interpretations: https://en.wikipedia.org/wiki/Q_(number_format)
> > >
> > > "Texas Instruments version": "Thus, the total number w of bits used is 1 + m + n."
> > > "ARM version": "A variant of the Q notation has been in use by ARM in which the m number also counts the sign bit."
> > >
> > > > I'm right now looking at the datasheet documentation of a value said
> > > > to be in "signed Q4.8" format whose register size is 13 bits
> > > >
> > > > Coefft R-G [12:0] : sign/magnitude 4.8-bit fixed-point
> > >
> > > Does that mean "sign/magnitude" as in https://en.wikipedia.org/wiki/Signed_number_representations#Sign–magnitude ?
> > > If so, then I'm not sure these functions will work.
> >
> > I had just told Stefan "I'm not sure I actually know what 'magnitude'
> > implies there", and I didn't :)
> >
> > So, I had a bit of a read around, including Kieran's Quantized type
> > series, and I fell into an all too familiar deep rabbit hole.
> >
> > --------------------- TL;DR -----------------------------------------------
> > Feel free to skip, these are mostly notes to clarify my understanding
> > ---------------------------------------------------------------------------
> >
> > Let's look at floatingToFixedPoint() remembering that
> >
> > f = float value
> > q = value in Q<m,n>
> >
> >         f = q / 2^n
> >         q = f * 2^n
> >
> > And that's what floatingToFixedPoint() does
> >
> > template<unsigned int I, unsigned int F, typename R, typename T>
> > constexpr R floatingToFixedPoint(T number)
> > {
> > 	static_assert(sizeof(int) >= sizeof(R));
> > 	static_assert(I + F <= sizeof(R) * 8);
> >
> > 	R mask = (1 << (F + I)) - 1;
> > 	R frac = static_cast<R>(static_cast<int>(std::round(number * (1 << F)))) & mask;
> >
> > 	return frac;
> > }
> >
> > which can be summarized as (f * 2^n) & mask
> >
> > All good, but how is this handled if floatingToFixedPoint<>() is
> > called as:
> >         block->gain01 = floatingToFixedPoint<4, 8, uint16_t, double>(1.0);
> >
> >         uint16_t frac = static_cast<uint16_t>(
> >                         static_cast<int>(std::round(1.0 * 2^8))) & mask;
> >
> > 1.0 * 2^8 is a double
> > calling std::round(double) picks the right overload and returns a
> > double
> >
> > the double is cast to int. Before C23 the C standard didn't impose a
> > representation for signed integers and allowed it to be either
> > sign/magnitude, ones' complement or two's complement. It's fair to assume
> > two's complement is the standard, and the C23 standard makes it so.
> >
> > So, on a typical 64-bit platform, where int is 32 bits wide, we have the
> > result of (1.0 * 2^8) represented as a signed 32-bit integer in two's
> > complement.
> >
> > According to the C standard, to cast a signed int to an unsigned int
> >
> > "When a value with integer type is converted to another integer type
> > other than _Bool, if the value can be represented by the new type, it
> > is unchanged.
> >
> > Otherwise, if the new type is unsigned, the value is converted by
> > repeatedly adding or subtracting one more than the maximum value that
> > can be represented in the new type until the value is in the range of
> > the new type."
> >
> > As vague as it might sound to me
> >
> > -43 + 2^16 = 65493 = 1111 1111 1101 0101
> > which in 2-complement is ... -43
> >
> > Amazing, let's start from the beginning.
> >
> > I want to write to a 13 bits register the number -1.45 in signed Q<4.8>
> > format:
> >
> >         uint16_t q = floatingToFixedPoint<4, 8, uint16_t, double>(-1.45);
> >
> >         std::round(-1.45 * 2^8) = -371
> >
> >         static_cast<int>(-371) is stored as two's complement in 32 bits
> >         static_cast<uint16_t>(-371) = -371 + 2^16 = 65165
> >
> >         65165 = 1111 1110 1000 1101
> >
> >         if we interpret this as a register value in Q<4,8> signed
> >         format
> >
> >         xxx1 1110 1000 1101
> >
> >         1 is the sign bit so let's calculate the 2 complement of
> >         0  1110 1000 1101 = ~(1110 1000 1101) + 1 =
> >                        =   0001 0111 0010 + 1 = 370 + 1 = 371
> >
> > Amazing!
> >
> > --------------------- End TL;DR -------------------------------------------
> >
> > Now, I want this in sign/magnitude. I bet there are smarter ways of
> > doing this but if I simply take the result of floatingToFixedPoint()
> > and check the sign bit, I can simply add it back to absolute value of
> > the result ?
> >
> > As a bit of pseudo code
> >
> >         int val = static_cast<int>(std::round(number * (1 << F)));
> >         uint16_t res = std::abs(val) & mask;
> >         if (val < 0)
> >                 res |= BIT(I + F); /* bit 12 for a 13-bit Q4.8 field */
> >
> > I think this could be surely optimized and nicely made a Traits that
> > can be added to the Quantized series Kieran is working on.
>
> I think you should first test to see if "sign-magnitude" mentioned in
> the datasheet actually means that, or if it's a signed fixed-point
> value. If it's the former we'll see how to support it.

Consider that other registers are said to be:
- unsigned 4.8-bit fixed-point
or
- signed (2's complement) 11-bit integer

While these are specifically described as:
- sign/magnitude 4.8-bit fixed-point

I would tend to believe the datasheet is actually correct, but I can
check with the vendor.

>
> > > > > > + *
> > > > > > + * \code{.cpp}
> > > > > > + * double d = fixedToFloatingPoint<5, 8, double, uint16_t>(fixed);
> > > > > > + * \endcode
> > > > > > + *
> > > > > > + * While a value represented as unsigned fixed-point Q4.8 format can be
> > > > > > + * converted as:
> > > > > > + *
> > > > > > + * \code{.cpp}
> > > > > > + * double d = fixedToFloatingPoint<4, 8, double, uint16_t>(fixed);
> > > > > > + * \endcode
> > > > > > + *
> > > > > >    * \return The converted value
> > > > > >    */
>
> --
> Regards,
>
> Laurent Pinchart
Jacopo Mondi Jan. 21, 2026, 12:53 p.m. UTC | #9
Hi Kieran

On Wed, Jan 21, 2026 at 12:23:40PM +0000, Kieran Bingham wrote:
> Hi Jacopo,
>
> Quoting Stefan Klug (2026-01-20 08:53:06)
> > Hi Jacopo,
> >
> > Quoting Jacopo Mondi (2026-01-20 09:39:49)
> > > Converting numbers with a signed fixed-point representation to
> > > the corresponding float value requires to include the sign bit in the
> > > width of the fixed-point integral part.
> > >
> > > Clearly specify it in documentation.
> > >
> > > Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com>
> > > ---
> > >  src/ipa/libipa/fixedpoint.cpp | 22 +++++++++++++++++++++-
> > >  1 file changed, 21 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/src/ipa/libipa/fixedpoint.cpp b/src/ipa/libipa/fixedpoint.cpp
> > > index 6b698fc5d680..b37cdc43936f 100644
> > > --- a/src/ipa/libipa/fixedpoint.cpp
> > > +++ b/src/ipa/libipa/fixedpoint.cpp
> > > @@ -29,11 +29,31 @@ namespace ipa {
> > >  /**
> > >   * \fn R fixedToFloatingPoint(T number)
> > >   * \brief Convert a fixed-point number to a floating point representation
> > > - * \tparam I Bit width of the integer part of the fixed-point
> > > + * \tparam I Bit width of the integer part of the fixed-point including the
> > > + * optional sign bit
> > >   * \tparam F Bit width of the fractional part of the fixed-point
> > >   * \tparam R Return type of the floating point representation
> > >   * \tparam T Input type of the fixed-point representation
> > >   * \param number The fixed point number to convert to floating point
> > > + *
> > > + * If the fixed-point representation is signed, the sign bit shall be included
> > > + * in the \a I template parameter that specifies the number of bits of the
> > > + * integral part of the fixed-point representation.
> > > + *
> > > + * As an example, a value represented as signed fixed-point Q4.8 format can be
> > > + * converted to its corresponding floating point representation as:
>
> Just to be sure - you know I've got patches to remove all of the above
> that I want to get merged 'soon' right?

Read the last bit of my reply from yesterday :)

>
> Quantized brings in explicit signed/unsigned types through Q<4,8> and
> UQ<4, 8> types.

What is the difference between signed and unsigned ? Is it only the
sign bit ? I guess then that the Q<4,8>[12:0] = UQ<4,8>[11:0]

>
> In the new types Q<I, F> has the sign bit included in 'I'.
> I can add that explicitly to the documentation in my new series for v6.


Well, maybe we need two traits ?
https://en.wikipedia.org/wiki/Q_(number_format)

Texas Instruments version:
The first bit always gives the sign of the value (1 = negative, 0 =
non-negative), and it is not counted in the m parameter. Thus, the
total number w of bits used is 1 + m + n.

ARM Version:
A variant of the Q notation has been in use by ARM in which the m
number also counts the sign bit

I guess the only way to know which one is meant to be used is to
actually look at the register sizes. If a Q<4,8> number is stored in
a 13-bit field, then the TI version is used. I wonder how common the
ARM version is.
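
As a rough sketch of that check (QConvention and signedWidth() are names
made up for illustration, they are not part of libipa or of the Quantized
series), the expected field width for a signed Qm.n value under each
convention would be:

```cpp
#include <cassert>

// Hypothetical helper, not part of libipa: the two sign-bit conventions
// described on the Wikipedia page for the Q number format.
enum class QConvention { TI, ARM };

// Total storage width in bits of a signed Qm.n value.
constexpr unsigned int signedWidth(QConvention c, unsigned int m, unsigned int n)
{
	// TI: the sign bit is extra (1 + m + n). ARM: m already counts it.
	return c == QConvention::TI ? 1 + m + n : m + n;
}

// A "signed Q4.8" field that is 13 bits wide matches the TI convention,
// a 12-bit one matches the ARM convention.
static_assert(signedWidth(QConvention::TI, 4, 8) == 13);
static_assert(signedWidth(QConvention::ARM, 4, 8) == 12);
```

This only tells the two conventions apart by size; it says nothing about
whether the field is two's complement or sign/magnitude.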

>
>
> """
>  * The sign of the value is determined by the sign of \a T. For signed types,
>  * the number of integer bits includes the sign bit.
> """
>
> --
> Kieran
>
> > I'm a bit confused here. Doesn't signed Q4.8 mean that the first bit of
> > the 4 is the sign bit? The same way a signed int32 has the signed bit on
> > the first of the 32 bits?
> >
> > Best regards,
> > Stefan
> >
> > > + *
> > > + * \code{.cpp}
> > > + * double d = fixedToFloatingPoint<5, 8, double, uint16_t>(fixed);
> > > + * \endcode
> > > + *
> > > + * While a value represented as unsigned fixed-point Q4.8 format can be
> > > + * converted as:
> > > + *
> > > + * \code{.cpp}
> > > + * double d = fixedToFloatingPoint<4, 8, double, uint16_t>(fixed);
> > > + * \endcode
> > > + *
> > >   * \return The converted value
> > >   */
> > >
> > > --
> > > 2.52.0
> > >
Kieran Bingham Jan. 21, 2026, 2:45 p.m. UTC | #10
Quoting Jacopo Mondi (2026-01-21 12:53:49)
> Hi Kieran
> 
> On Wed, Jan 21, 2026 at 12:23:40PM +0000, Kieran Bingham wrote:
> > Hi Jacopo,
> >
> > Quoting Stefan Klug (2026-01-20 08:53:06)
> > > Hi Jacopo,
> > >
> > > Quoting Jacopo Mondi (2026-01-20 09:39:49)
> > > > Converting numbers with a signed fixed-point representation to
> > > > the corresponding float value requires to include the sign bit in the
> > > > width of the fixed-point integral part.
> > > >
> > > > Clearly specify it in documentation.
> > > >
> > > > Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com>
> > > > ---
> > > >  src/ipa/libipa/fixedpoint.cpp | 22 +++++++++++++++++++++-
> > > >  1 file changed, 21 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/src/ipa/libipa/fixedpoint.cpp b/src/ipa/libipa/fixedpoint.cpp
> > > > index 6b698fc5d680..b37cdc43936f 100644
> > > > --- a/src/ipa/libipa/fixedpoint.cpp
> > > > +++ b/src/ipa/libipa/fixedpoint.cpp
> > > > @@ -29,11 +29,31 @@ namespace ipa {
> > > >  /**
> > > >   * \fn R fixedToFloatingPoint(T number)
> > > >   * \brief Convert a fixed-point number to a floating point representation
> > > > - * \tparam I Bit width of the integer part of the fixed-point
> > > > + * \tparam I Bit width of the integer part of the fixed-point including the
> > > > + * optional sign bit
> > > >   * \tparam F Bit width of the fractional part of the fixed-point
> > > >   * \tparam R Return type of the floating point representation
> > > >   * \tparam T Input type of the fixed-point representation
> > > >   * \param number The fixed point number to convert to floating point
> > > > + *
> > > > + * If the fixed-point representation is signed, the sign bit shall be included
> > > > + * in the \a I template parameter that specifies the number of bits of the
> > > > + * integral part of the fixed-point representation.
> > > > + *
> > > > + * As an example, a value represented as signed fixed-point Q4.8 format can be
> > > > + * converted to its corresponding floating point representation as:
> >
> > Just to be sure - you know I've got patches to remove all of the above
> > that I want to get merged 'soon' right?
> 
> Read the last bit of my reply from yesterday :)
> 
> >
> > Quantized brings in explicit signed/unsigned types through Q<4,8> and
> > UQ<4, 8> types.
> 
> What is the difference between signed and unsigned ? Is it only the
> sign bit ? I guess then that the Q<4,8>[12:0] = UQ<4,8>[11:0]

Please take a look through the tests I've added:

https://patchwork.libcamera.org/patch/25801/

/* Q1.7(-1 .. 0.992188)  Min: [0x80:-1] -- Max: [0x7f:0.992188] Step:0.0078125*/
/* UQ1.7(0 .. 1.99219)  Min: [0x00:0] -- Max: [0xff:1.99219] Step:0.0078125 */

/* Q12.4(-2048 .. 2047.94)  Min: [0x8000:-2048] -- Max: [0x7fff:2047.94] Step:0.0625 */
/* UQ12.4(0 .. 4095.94)  Min: [0x0000:0] -- Max: [0xffff:4095.94] Step:0.0625 */

It's easy to extend that if you have specific Q types you want to
use/test.

 
> >
> > In the new types Q<I, F> has the sign bit included in 'I'.
> > I can add that explicitly to the documentation in my new series for v6.
> 
> 
> Well, maybe we need two traits ?
> https://en.wikipedia.org/wiki/Q_(number_format)
> 
> Texas Instruments version:
> The first bit always gives the sign of the value (1 = negative, 0 =
> non-negative), and it is not counted in the m parameter. Thus, the
> total number w of bits used is 1 + m + n.
> 
> ARM Version:
> A variant of the Q notation has been in use by ARM in which the m
> number also counts the sign bit

Yes, you've definitely got to know which one the hardware is using and
expecting. I wouldn't make a new trait for this - if we have to specify
we can wrap one in the other if it really helps.

--
Kieran


> 
> I guess the only way to know which one is meant to be used is to
> actually look at the register sizes. If a Q<4,8> number is stored as
> a 13 bit fields, then the TI version is used. I wonder how common the
> ARM version is.
> 
> >
> >
> > """
> >  * The sign of the value is determined by the sign of \a T. For signed types,
> >  * the number of integer bits includes the sign bit.
> > """
> >
> > --
> > Kieran
> >
> > > I'm a bit confused here. Doesn't signed Q4.8 mean that the first bit of
> > > the 4 is the sign bit? The same way a signed int32 has the signed bit on
> > > the first of the 32 bits?
> > >
> > > Best regards,
> > > Stefan
> > >
> > > > + *
> > > > + * \code{.cpp}
> > > > + * double d = fixedToFloatingPoint<5, 8, double, uint16_t>(fixed);
> > > > + * \endcode
> > > > + *
> > > > + * While a value represented as unsigned fixed-point Q4.8 format can be
> > > > + * converted as:
> > > > + *
> > > > + * \code{.cpp}
> > > > + * double d = fixedToFloatingPoint<4, 8, double, uint16_t>(fixed);
> > > > + * \endcode
> > > > + *
> > > >   * \return The converted value
> > > >   */
> > > >
> > > > --
> > > > 2.52.0
> > > >
Jacopo Mondi Jan. 21, 2026, 3:12 p.m. UTC | #11
Hi Kieran

On Wed, Jan 21, 2026 at 02:45:04PM +0000, Kieran Bingham wrote:
> Quoting Jacopo Mondi (2026-01-21 12:53:49)
> > Hi Kieran
> >
> > On Wed, Jan 21, 2026 at 12:23:40PM +0000, Kieran Bingham wrote:
> > > Hi Jacopo,
> > >
> > > Quoting Stefan Klug (2026-01-20 08:53:06)
> > > > Hi Jacopo,
> > > >
> > > > Quoting Jacopo Mondi (2026-01-20 09:39:49)
> > > > > Converting numbers with a signed fixed-point representation to
> > > > > the corresponding float value requires to include the sign bit in the
> > > > > width of the fixed-point integral part.
> > > > >
> > > > > Clearly specify it in documentation.
> > > > >
> > > > > Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com>
> > > > > ---
> > > > >  src/ipa/libipa/fixedpoint.cpp | 22 +++++++++++++++++++++-
> > > > >  1 file changed, 21 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/src/ipa/libipa/fixedpoint.cpp b/src/ipa/libipa/fixedpoint.cpp
> > > > > index 6b698fc5d680..b37cdc43936f 100644
> > > > > --- a/src/ipa/libipa/fixedpoint.cpp
> > > > > +++ b/src/ipa/libipa/fixedpoint.cpp
> > > > > @@ -29,11 +29,31 @@ namespace ipa {
> > > > >  /**
> > > > >   * \fn R fixedToFloatingPoint(T number)
> > > > >   * \brief Convert a fixed-point number to a floating point representation
> > > > > - * \tparam I Bit width of the integer part of the fixed-point
> > > > > + * \tparam I Bit width of the integer part of the fixed-point including the
> > > > > + * optional sign bit
> > > > >   * \tparam F Bit width of the fractional part of the fixed-point
> > > > >   * \tparam R Return type of the floating point representation
> > > > >   * \tparam T Input type of the fixed-point representation
> > > > >   * \param number The fixed point number to convert to floating point
> > > > > + *
> > > > > + * If the fixed-point representation is signed, the sign bit shall be included
> > > > > + * in the \a I template parameter that specifies the number of bits of the
> > > > > + * integral part of the fixed-point representation.
> > > > > + *
> > > > > + * As an example, a value represented as signed fixed-point Q4.8 format can be
> > > > > + * converted to its corresponding floating point representation as:
> > >
> > > Just to be sure - you know I've got patches to remove all of the above
> > > that I want to get merged 'soon' right?
> >
> > Read the last bit of my reply from yesterday :)
> >
> > >
> > > Quantized brings in explicit signed/unsigned types through Q<4,8> and
> > > UQ<4, 8> types.
> >
> > What is the difference between signed and unsigned ? Is it only the
> > sign bit ? I guess then that the Q<4,8>[12:0] = UQ<4,8>[11:0]
>
> Please take a look through the tests I've added:
>
> https://patchwork.libcamera.org/patch/25801/
>
> /* Q1.7(-1 .. 0.992188)  Min: [0x80:-1] -- Max: [0x7f:0.992188] Step:0.0078125*/
> /* UQ1.7(0 .. 1.99219)  Min: [0x00:0] -- Max: [0xff:1.99219] Step:0.0078125 */
>
> /* Q12.4(-2048 .. 2047.94)  Min: [0x8000:-2048] -- Max: [0x7fff:2047.94] Step:0.0625 */
> /* UQ12.4(0 .. 4095.94)  Min: [0x0000:0] -- Max: [0xffff:4095.94] Step:0.0625 */
>
> It's easy to extend that if you have specific Q types you want to
> use/test.

Ah yes, for min/max it's definitely useful to have signed/unsigned
types

>
>
> > >
> > > In the new types Q<I, F> has the sign bit included in 'I'.
> > > I can add that explicitly to the documentation in my new series for v6.
> >
> >
> > Well, maybe we need two traits ?
> > https://en.wikipedia.org/wiki/Q_(number_format)
> >
> > Texas Instruments version:
> > The first bit always gives the sign of the value (1 = negative, 0 =
> > non-negative), and it is not counted in the m parameter. Thus, the
> > total number w of bits used is 1 + m + n.
> >
> > ARM Version:
> > A variant of the Q notation has been in use by ARM in which the m
> > number also counts the sign bit
>
> Yes, you've definitely got to know which one the hardware is using and
> expecting. I wouldn't make a new trait for this - if we have to specify
> we can wrap one in the other if it really helps.

I'm not sure. If I'm working with the TI format (which, as far as I
understand, is the most common?), then to correctly represent a signed
Q<4,8> value I would have to use Q<5,8>, which is counter-intuitive.

I would rather modify the Trait to put the sign in the [m + n + 1]
bit.

Or are the registers you're working with in ARM format ? (sign in
[m + n] position)

Thanks
  j

>
> --
> Kieran
>
>
> >
> > I guess the only way to know which one is meant to be used is to
> > actually look at the register sizes. If a Q<4,8> number is stored as
> > a 13 bit fields, then the TI version is used. I wonder how common the
> > ARM version is.
> >
> > >
> > >
> > > """
> > >  * The sign of the value is determined by the sign of \a T. For signed types,
> > >  * the number of integer bits includes the sign bit.
> > > """
> > >
> > > --
> > > Kieran
> > >
> > > > I'm a bit confused here. Doesn't signed Q4.8 mean that the first bit of
> > > > the 4 is the sign bit? The same way a signed int32 has the signed bit on
> > > > the first of the 32 bits?
> > > >
> > > > Best regards,
> > > > Stefan
> > > >
> > > > > + *
> > > > > + * \code{.cpp}
> > > > > + * double d = fixedToFloatingPoint<5, 8, double, uint16_t>(fixed);
> > > > > + * \endcode
> > > > > + *
> > > > > + * While a value represented as unsigned fixed-point Q4.8 format can be
> > > > > + * converted as:
> > > > > + *
> > > > > + * \code{.cpp}
> > > > > + * double d = fixedToFloatingPoint<4, 8, double, uint16_t>(fixed);
> > > > > + * \endcode
> > > > > + *
> > > > >   * \return The converted value
> > > > >   */
> > > > >
> > > > > --
> > > > > 2.52.0
> > > > >
Kieran Bingham Jan. 21, 2026, 3:44 p.m. UTC | #12
Quoting Jacopo Mondi (2026-01-21 15:12:24)
> Hi Kieran
> 
> On Wed, Jan 21, 2026 at 02:45:04PM +0000, Kieran Bingham wrote:
> > Quoting Jacopo Mondi (2026-01-21 12:53:49)
> > > Hi Kieran
> > >
> > > On Wed, Jan 21, 2026 at 12:23:40PM +0000, Kieran Bingham wrote:
> > > > Hi Jacopo,
> > > >
> > > > Quoting Stefan Klug (2026-01-20 08:53:06)
> > > > > Hi Jacopo,
> > > > >
> > > > > Quoting Jacopo Mondi (2026-01-20 09:39:49)
> > > > > > Converting numbers with a signed fixed-point representation to
> > > > > > the corresponding float value requires to include the sign bit in the
> > > > > > width of the fixed-point integral part.
> > > > > >
> > > > > > Clearly specify it in documentation.
> > > > > >
> > > > > > Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com>
> > > > > > ---
> > > > > >  src/ipa/libipa/fixedpoint.cpp | 22 +++++++++++++++++++++-
> > > > > >  1 file changed, 21 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/src/ipa/libipa/fixedpoint.cpp b/src/ipa/libipa/fixedpoint.cpp
> > > > > > index 6b698fc5d680..b37cdc43936f 100644
> > > > > > --- a/src/ipa/libipa/fixedpoint.cpp
> > > > > > +++ b/src/ipa/libipa/fixedpoint.cpp
> > > > > > @@ -29,11 +29,31 @@ namespace ipa {
> > > > > >  /**
> > > > > >   * \fn R fixedToFloatingPoint(T number)
> > > > > >   * \brief Convert a fixed-point number to a floating point representation
> > > > > > - * \tparam I Bit width of the integer part of the fixed-point
> > > > > > + * \tparam I Bit width of the integer part of the fixed-point including the
> > > > > > + * optional sign bit
> > > > > >   * \tparam F Bit width of the fractional part of the fixed-point
> > > > > >   * \tparam R Return type of the floating point representation
> > > > > >   * \tparam T Input type of the fixed-point representation
> > > > > >   * \param number The fixed point number to convert to floating point
> > > > > > + *
> > > > > > + * If the fixed-point representation is signed, the sign bit shall be included
> > > > > > + * in the \a I template parameter that specifies the number of bits of the
> > > > > > + * integral part of the fixed-point representation.
> > > > > > + *
> > > > > > + * As an example, a value represented as signed fixed-point Q4.8 format can be
> > > > > > + * converted to its corresponding floating point representation as:
> > > >
> > > > Just to be sure - you know I've got patches to remove all of the above
> > > > that I want to get merged 'soon' right?
> > >
> > > Read the last bit of my reply from yesterday :)

I still don't get this?


> > >
> > > >
> > > > Quantized brings in explicit signed/unsigned types through Q<4,8> and
> > > > UQ<4, 8> types.
> > >
> > > What is the difference between signed and unsigned ? Is it only the
> > > sign bit ? I guess then that the Q<4,8>[12:0] = UQ<4,8>[11:0]
> >
> > Please take a look through the tests I've added:
> >
> > https://patchwork.libcamera.org/patch/25801/
> >
> > /* Q1.7(-1 .. 0.992188)  Min: [0x80:-1] -- Max: [0x7f:0.992188] Step:0.0078125*/
> > /* UQ1.7(0 .. 1.99219)  Min: [0x00:0] -- Max: [0xff:1.99219] Step:0.0078125 */
> >
> > /* Q12.4(-2048 .. 2047.94)  Min: [0x8000:-2048] -- Max: [0x7fff:2047.94] Step:0.0625 */
> > /* UQ12.4(0 .. 4095.94)  Min: [0x0000:0] -- Max: [0xffff:4095.94] Step:0.0625 */
> >
> > It's easy to extend that if you have specific Q types you want to
> > use/test.
> 
> Ah yes, for min/max it's defintely useful to have signed/unsigned
> types

It's not just that min/max is useful - it's the very fact that Q and UQ
have a distinct range. Q types can go below zero but still span the same
distance, so the top/max is halved, but the step size is the same.
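
To make that concrete, here is a minimal sketch (QRange and qRange() are
hypothetical names for illustration, not the Quantized API) that computes
min, max and step from I and F, assuming two's complement and that I counts
the sign bit for signed types:

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper type for illustration only.
struct QRange {
	double min;
	double max;
	double step;
};

// Range of a fixed-point type with I integer bits and F fractional bits.
// For signed types I is assumed to include the sign bit (two's complement).
QRange qRange(unsigned int I, unsigned int F, bool isSigned)
{
	const double step = std::ldexp(1.0, -static_cast<int>(F)); // 2^-F
	const double span = std::ldexp(1.0, static_cast<int>(I));  // 2^I

	// Signed: same span, shifted down by half of it.
	if (isSigned)
		return { -span / 2, span / 2 - step, step };

	return { 0.0, span - step, step };
}
```

With these assumptions qRange(1, 7, true) gives -1 .. 0.9921875 and
qRange(1, 7, false) gives 0 .. 1.9921875, both with a step of 0.0078125,
which lines up with the Q1.7 and UQ1.7 comments from the tests.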


> > > > In the new types Q<I, F> has the sign bit included in 'I'.
> > > > I can add that explicitly to the documentation in my new series for v6.
> > >
> > >
> > > Well, maybe we need two traits ?
> > > https://en.wikipedia.org/wiki/Q_(number_format)
> > >
> > > Texas Instruments version:
> > > The first bit always gives the sign of the value (1 = negative, 0 =
> > > non-negative), and it is not counted in the m parameter. Thus, the
> > > total number w of bits used is 1 + m + n.
> > >
> > > ARM Version:
> > > A variant of the Q notation has been in use by ARM in which the m
> > > number also counts the sign bit
> >
> > Yes, you've definitely got to know which one the hardware is using and
> > expecting. I wouldn't make a new trait for this - if we have to specify
> > we can wrap one in the other if it really helps.
> 
> I'm not sure, if I'm working with the TI format (which as far as I
> understand is the most common?) then to have a signed value correctly
> represented as a Q<4,8> I would have to use Q<5,8> (which is
> counter-intuitive).
> 
> I would rather modify the Trait to put the sign in the [m + n + 1]
> bit.
> 
> Or are the registers you're working with in ARM format ? (sign in
> [m + n] position)


That (including the sign bit in I) is what the original
fixedToFloatingPoint() implementations used, so that's what I've
continued with.

If you want to distinguish these, how should we represent them?


/* All 8 bit storage */
UQ<1, 7> Q<1, 7> Q_TI<0, 7> ?

--
Kieran

> 
> Thanks
>   j
> 
> >
> > --
> > Kieran
> >
> >
> > >
> > > I guess the only way to know which one is meant to be used is to
> > > actually look at the register sizes. If a Q<4,8> number is stored as
> > > a 13 bit fields, then the TI version is used. I wonder how common the
> > > ARM version is.
> > >
> > > >
> > > >
> > > > """
> > > >  * The sign of the value is determined by the sign of \a T. For signed types,
> > > >  * the number of integer bits includes the sign bit.
> > > > """
> > > >
> > > > --
> > > > Kieran
> > > >
> > > > > I'm a bit confused here. Doesn't signed Q4.8 mean that the first bit of
> > > > > the 4 is the sign bit? The same way a signed int32 has the signed bit on
> > > > > the first of the 32 bits?
> > > > >
> > > > > Best regards,
> > > > > Stefan
> > > > >
> > > > > > + *
> > > > > > + * \code{.cpp}
> > > > > > + * double d = fixedToFloatingPoint<5, 8, double, uint16_t>(fixed);
> > > > > > + * \endcode
> > > > > > + *
> > > > > > + * While a value represented as unsigned fixed-point Q4.8 format can be
> > > > > > + * converted as:
> > > > > > + *
> > > > > > + * \code{.cpp}
> > > > > > + * double d = fixedToFloatingPoint<4, 8, double, uint16_t>(fixed);
> > > > > > + * \endcode
> > > > > > + *
> > > > > >   * \return The converted value
> > > > > >   */
> > > > > >
> > > > > > --
> > > > > > 2.52.0
> > > > > >
Jacopo Mondi Jan. 21, 2026, 4:13 p.m. UTC | #13
On Wed, Jan 21, 2026 at 03:44:01PM +0000, Kieran Bingham wrote:
> Quoting Jacopo Mondi (2026-01-21 15:12:24)
> > Hi Kieran
> >
> > On Wed, Jan 21, 2026 at 02:45:04PM +0000, Kieran Bingham wrote:
> > > Quoting Jacopo Mondi (2026-01-21 12:53:49)
> > > > Hi Kieran
> > > >
> > > > On Wed, Jan 21, 2026 at 12:23:40PM +0000, Kieran Bingham wrote:
> > > > > Hi Jacopo,
> > > > >
> > > > > Quoting Stefan Klug (2026-01-20 08:53:06)
> > > > > > Hi Jacopo,
> > > > > >
> > > > > > Quoting Jacopo Mondi (2026-01-20 09:39:49)
> > > > > > > Converting numbers with a signed fixed-point representation to
> > > > > > > the corresponding float value requires to include the sign bit in the
> > > > > > > width of the fixed-point integral part.
> > > > > > >
> > > > > > > Clearly specify it in documentation.
> > > > > > >
> > > > > > > Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com>
> > > > > > > ---
> > > > > > >  src/ipa/libipa/fixedpoint.cpp | 22 +++++++++++++++++++++-
> > > > > > >  1 file changed, 21 insertions(+), 1 deletion(-)
> > > > > > >
> > > > > > > diff --git a/src/ipa/libipa/fixedpoint.cpp b/src/ipa/libipa/fixedpoint.cpp
> > > > > > > index 6b698fc5d680..b37cdc43936f 100644
> > > > > > > --- a/src/ipa/libipa/fixedpoint.cpp
> > > > > > > +++ b/src/ipa/libipa/fixedpoint.cpp
> > > > > > > @@ -29,11 +29,31 @@ namespace ipa {
> > > > > > >  /**
> > > > > > >   * \fn R fixedToFloatingPoint(T number)
> > > > > > >   * \brief Convert a fixed-point number to a floating point representation
> > > > > > > - * \tparam I Bit width of the integer part of the fixed-point
> > > > > > > + * \tparam I Bit width of the integer part of the fixed-point including the
> > > > > > > + * optional sign bit
> > > > > > >   * \tparam F Bit width of the fractional part of the fixed-point
> > > > > > >   * \tparam R Return type of the floating point representation
> > > > > > >   * \tparam T Input type of the fixed-point representation
> > > > > > >   * \param number The fixed point number to convert to floating point
> > > > > > > + *
> > > > > > > + * If the fixed-point representation is signed, the sign bit shall be included
> > > > > > > + * in the \a I template parameter that specifies the number of bits of the
> > > > > > > + * integral part of the fixed-point representation.
> > > > > > > + *
> > > > > > > + * As an example, a value represented as signed fixed-point Q4.8 format can be
> > > > > > > + * converted to its corresponding floating point representation as:
> > > > >
> > > > > Just to be sure - you know I've got patches to remove all of the above
> > > > > that I want to get merged 'soon' right?
> > > >
> > > > Read the last bit of my reply from yesterday :)
>
> I still don't get this?
>

I meant the discussion on sign/magnitude representation

sign/magnitude is a different representation of signed integers
compared to the de-facto standard 2's complement. It requires
manipulating the result of the float-to-fixed conversion so that we take
the absolute value and set the sign bit in the [m + n + 1] bit


------------------------------------------------------------------------------
As a bit of pseudo code

        int val = static_cast<int>(std::round(number * (1 << F)));
        uint16_t res = std::abs(val) & mask;
        if (val < 0)
                res |= BIT(I + F); /* bit 12 for a 13-bit Q4.8 field */


I think this could surely be optimized and nicely made into a Trait that
can be added to the Quantized series Kieran is working on.
------------------------------------------------------------------------------

The above is the phrase I thought would make you happy:
sign/magnitude fixed-point formats can easily be represented with a
Trait on top of your series
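
For reference, a self-contained sketch of what such a conversion could look
like. floatToSignMagnitude() and signMagnitudeToFloat() are hypothetical
names, not existing libipa functions, and the sketch assumes the sign bit
sits right above the I + F magnitude bits (bit 12 for a 13-bit 4.8 field):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstdlib>

// Hypothetical helpers, not part of libipa or the Quantized series.
// I and F count the magnitude's integer and fractional bits; the sign
// bit is stored separately at bit position I + F.
template<unsigned int I, unsigned int F>
uint32_t floatToSignMagnitude(double number)
{
	static_assert(I + F < 31, "magnitude and sign bit must fit in 32 bits");

	constexpr uint32_t mask = (1u << (I + F)) - 1;
	constexpr uint32_t signBit = 1u << (I + F);

	const long scaled = std::lround(number * (1u << F));

	// Magnitudes that do not fit are truncated by the mask, not clamped.
	uint32_t reg = static_cast<uint32_t>(std::labs(scaled)) & mask;
	if (scaled < 0)
		reg |= signBit;

	return reg;
}

template<unsigned int I, unsigned int F>
double signMagnitudeToFloat(uint32_t reg)
{
	constexpr uint32_t mask = (1u << (I + F)) - 1;
	constexpr uint32_t signBit = 1u << (I + F);

	const double magnitude = static_cast<double>(reg & mask) / (1u << F);
	return (reg & signBit) ? -magnitude : magnitude;
}
```

With -1.45 and <4, 8> the scaled value is -371, so the register holds the
magnitude 371 (0x173) with bit 12 set, i.e. 0x1173, instead of the two's
complement pattern.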


>
> > > >
> > > > >
> > > > > Quantized brings in explicit signed/unsigned types through Q<4,8> and
> > > > > UQ<4, 8> types.
> > > >
> > > > What is the difference between signed and unsigned ? Is it only the
> > > > sign bit ? I guess then that the Q<4,8>[12:0] = UQ<4,8>[11:0]
> > >
> > > Please take a look through the tests I've added:
> > >
> > > https://patchwork.libcamera.org/patch/25801/
> > >
> > > /* Q1.7(-1 .. 0.992188)  Min: [0x80:-1] -- Max: [0x7f:0.992188] Step:0.0078125*/
> > > /* UQ1.7(0 .. 1.99219)  Min: [0x00:0] -- Max: [0xff:1.99219] Step:0.0078125 */
> > >
> > > /* Q12.4(-2048 .. 2047.94)  Min: [0x8000:-2048] -- Max: [0x7fff:2047.94] Step:0.0625 */
> > > /* UQ12.4(0 .. 4095.94)  Min: [0x0000:0] -- Max: [0xffff:4095.94] Step:0.0625 */
> > >
> > > It's easy to extend that if you have specific Q types you want to
> > > use/test.
> >
> > Ah yes, for min/max it's defintely useful to have signed/unsigned
> > types
>
> It's not about min/max is useful - it's the very fact that Q and UQ have
> a distinct range. Q types can go less than zero but still span the same
> distance, so the top/max is halved, but the step size is the same.

Yes, min/max and range indeed.

>
>
> > > > > In the new types Q<I, F> has the sign bit included in 'I'.
> > > > > I can add that explicitly to the documentation in my new series for v6.
> > > >
> > > >
> > > > Well, maybe we need two traits ?
> > > > https://en.wikipedia.org/wiki/Q_(number_format)
> > > >
> > > > Texas Instruments version:
> > > > The first bit always gives the sign of the value (1 = negative, 0 =
> > > > non-negative), and it is not counted in the m parameter. Thus, the
> > > > total number w of bits used is 1 + m + n.
> > > >
> > > > ARM Version:
> > > > A variant of the Q notation has been in use by ARM in which the m
> > > > number also counts the sign bit
> > >
> > > Yes, you've definitely got to know which one the hardware is using and
> > > expecting. I wouldn't make a new trait for this - if we have to specify
> > > we can wrap one in the other if it really helps.
> >
> > I'm not sure, if I'm working with the TI format (which as far as I
> > understand is the most common?) then to have a signed value correctly
> > represented as a Q<4,8> I would have to use Q<5,8> (which is
> > counter-intuitive).
> >
> > I would rather modify the Trait to put the sign in the [m + n + 1]
> > bit.
> >
> > Or are the registers you're working with in ARM format ? (sign in
> > [m + n] position)
>
>
> That's (include the bit) what the original fixedToFloatingPoint()
> implementations used, so that's what I've continued with.

I see but that doesn't mean it's correct.

In one platform manual I read the description of a coefficient as

"8:0 cc_coeff_0 Coefficient 0 for color space conversion"
color conversion coefficients are signed integer values with a 7 bit
fractional part; range: [-2…1.992]

so if there are 7 fractional bits and the max achievable value is 1.992
it means that the value is in Q<1,7> format as:

        ((1 << (1 + 7)) - 1) / (1 << 7) = 1.992

the register size is 9 bits (see the [8:0] in the register
description) so I guess the sign bit is at location [8].

Am I wrong that, to obtain this with your model, I would have to
describe the fixed point representation as Q<2,7> (which doesn't match
the datasheet) ?

And I guess this really is the difference between UQ<m, n> and Q<m, n>

unsigned Q has no sign bit and the destination register is of size [m+n]
signed Q has a sign bit in position [m+n+1] with the value in 2's
complement format and destination register of size [m+n+1]

>
> If you want to distinguish these? How should we represent them?
>
>
> /* All 8 bit storage */
> UQ<1, 7> Q<1, 7> Q_TI<0, 7> ?
>

Let's start by deciding what behaviour we want by default maybe..

> --
> Kieran
>
> >
> > Thanks
> >   j
> >
> > >
> > > --
> > > Kieran
> > >
> > >
> > > >
> > > > I guess the only way to know which one is meant to be used is to
> > > > actually look at the register sizes. If a Q<4,8> number is stored as
> > > > a 13 bit fields, then the TI version is used. I wonder how common the
> > > > ARM version is.
> > > >
> > > > >
> > > > >
> > > > > """
> > > > >  * The sign of the value is determined by the sign of \a T. For signed types,
> > > > >  * the number of integer bits includes the sign bit.
> > > > > """
> > > > >
> > > > > --
> > > > > Kieran
> > > > >
> > > > > > I'm a bit confused here. Doesn't signed Q4.8 mean that the first bit of
> > > > > > the 4 is the sign bit? The same way a signed int32 has the signed bit on
> > > > > > the first of the 32 bits?
> > > > > >
> > > > > > Best regards,
> > > > > > Stefan
> > > > > >
> > > > > > > + *
> > > > > > > + * \code{.cpp}
> > > > > > > + * double d = fixedToFloatingPoint<5, 8, double, uint16_t>(fixed);
> > > > > > > + * \endcode
> > > > > > > + *
> > > > > > > + * While a value represented as unsigned fixed-point Q4.8 format can be
> > > > > > > + * converted as:
> > > > > > > + *
> > > > > > > + * \code{.cpp}
> > > > > > > + * double d = fixedToFloatingPoint<4, 8, double, uint16_t>(fixed);
> > > > > > > + * \endcode
> > > > > > > + *
> > > > > > >   * \return The converted value
> > > > > > >   */
> > > > > > >
> > > > > > > --
> > > > > > > 2.52.0
> > > > > > >
Laurent Pinchart Jan. 21, 2026, 4:37 p.m. UTC | #14
On Wed, Jan 21, 2026 at 05:13:02PM +0100, Jacopo Mondi wrote:
> On Wed, Jan 21, 2026 at 03:44:01PM +0000, Kieran Bingham wrote:
> > Quoting Jacopo Mondi (2026-01-21 15:12:24)
> > > On Wed, Jan 21, 2026 at 02:45:04PM +0000, Kieran Bingham wrote:
> > > > Quoting Jacopo Mondi (2026-01-21 12:53:49)
> > > > > On Wed, Jan 21, 2026 at 12:23:40PM +0000, Kieran Bingham wrote:
> > > > > > Quoting Stefan Klug (2026-01-20 08:53:06)
> > > > > > > Quoting Jacopo Mondi (2026-01-20 09:39:49)
> > > > > > > > Converting numbers with a signed fixed-point representation to
> > > > > > > > the corresponding float value requires to include the sign bit in the
> > > > > > > > width of the fixed-point integral part.
> > > > > > > >
> > > > > > > > Clearly specify it in documentation.
> > > > > > > >
> > > > > > > > Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com>
> > > > > > > > ---
> > > > > > > >  src/ipa/libipa/fixedpoint.cpp | 22 +++++++++++++++++++++-
> > > > > > > >  1 file changed, 21 insertions(+), 1 deletion(-)
> > > > > > > >
> > > > > > > > diff --git a/src/ipa/libipa/fixedpoint.cpp b/src/ipa/libipa/fixedpoint.cpp
> > > > > > > > index 6b698fc5d680..b37cdc43936f 100644
> > > > > > > > --- a/src/ipa/libipa/fixedpoint.cpp
> > > > > > > > +++ b/src/ipa/libipa/fixedpoint.cpp
> > > > > > > > @@ -29,11 +29,31 @@ namespace ipa {
> > > > > > > >  /**
> > > > > > > >   * \fn R fixedToFloatingPoint(T number)
> > > > > > > >   * \brief Convert a fixed-point number to a floating point representation
> > > > > > > > - * \tparam I Bit width of the integer part of the fixed-point
> > > > > > > > + * \tparam I Bit width of the integer part of the fixed-point including the
> > > > > > > > + * optional sign bit
> > > > > > > >   * \tparam F Bit width of the fractional part of the fixed-point
> > > > > > > >   * \tparam R Return type of the floating point representation
> > > > > > > >   * \tparam T Input type of the fixed-point representation
> > > > > > > >   * \param number The fixed point number to convert to floating point
> > > > > > > > + *
> > > > > > > > + * If the fixed-point representation is signed, the sign bit shall be included
> > > > > > > > + * in the \a I template parameter that specifies the number of bits of the
> > > > > > > > + * integral part of the fixed-point representation.
> > > > > > > > + *
> > > > > > > > + * As an example, a value represented as signed fixed-point Q4.8 format can be
> > > > > > > > + * converted to its corresponding floating point representation as:
> > > > > >
> > > > > > Just to be sure - you know I've got patches to remove all of the above
> > > > > > that I want to get merged 'soon' right?
> > > > >
> > > > > Read the last bit of my reply from yesterday :)
> >
> > I still don't get this?
> 
> I meant the discussion on sign/magnitude representation
> 
> sign/magnitude is a different representation of signed integers
> compared to the de-facto standard 2's complement. It requires to
> manipulate the result of the float-to-fixed conversion so that we take
> the absolute value and the sign bit is set in the [m + n + 1] bit
> 
> ------------------------------------------------------------------------------
> As a bit of pseudo code
> 
>         int reg = static_cast<int>(std::round(number * (1 << F)))) & mask;
>         uint16_t res += std::abs(reg);
>         if (reg < 0)
>                 res |= BIT(13);
> 
> 
> I think this could be surely optimized and nicely made a Traits that
> can be added to the Quantized series Kieran is working on.
> ------------------------------------------------------------------------------
> 
> The above is the pharse I thought it could make you happy:
> sign/magnitude fixed-point formats can be easily be represented with a
> Trait on top of your series
> 
> > > > > > Quantized brings in explicit signed/unsigned types through Q<4,8> and
> > > > > > UQ<4, 8> types.
> > > > >
> > > > > What is the difference between signed and unsigned ? Is it only the
> > > > > sign bit ? I guess then that the Q<4,8>[12:0] = UQ<4,8>[11:0]
> > > >
> > > > Please take a look through the tests I've added:
> > > >
> > > > https://patchwork.libcamera.org/patch/25801/
> > > >
> > > > /* Q1.7(-1 .. 0.992188)  Min: [0x80:-1] -- Max: [0x7f:0.992188] Step:0.0078125*/
> > > > /* UQ1.7(0 .. 1.99219)  Min: [0x00:0] -- Max: [0xff:1.99219] Step:0.0078125 */
> > > >
> > > > /* Q12.4(-2048 .. 2047.94)  Min: [0x8000:-2048] -- Max: [0x7fff:2047.94] Step:0.0625 */
> > > > /* UQ12.4(0 .. 4095.94)  Min: [0x0000:0] -- Max: [0xffff:4095.94] Step:0.0625 */
> > > >
> > > > It's easy to extend that if you have specific Q types you want to
> > > > use/test.
> > >
> > > Ah yes, for min/max it's defintely useful to have signed/unsigned
> > > types
> >
> > It's not about min/max is useful - it's the very fact that Q and UQ have
> > a distinct range. Q types can go less than zero but still span the same
> > distance, so the top/max is halved, but the step size is the same.
> 
> Yes, min/max and range indeed.
> 
> >
> >
> > > > > > In the new types Q<I, F> has the sign bit included in 'I'.
> > > > > > I can add that explicitly to the documentation in my new series for v6.
> > > > >
> > > > >
> > > > > Well, maybe we need two traits ?
> > > > > https://en.wikipedia.org/wiki/Q_(number_format)
> > > > >
> > > > > Texas Instruments version:
> > > > > The first bit always gives the sign of the value (1 = negative, 0 =
> > > > > non-negative), and it is not counted in the m parameter. Thus, the
> > > > > total number w of bits used is 1 + m + n.
> > > > >
> > > > > ARM Version:
> > > > > A variant of the Q notation has been in use by ARM in which the m
> > > > > number also counts the sign bit
> > > >
> > > > Yes, you've definitely got to know which one the hardware is using and
> > > > expecting. I wouldn't make a new trait for this - if we have to specify
> > > > we can wrap one in the other if it really helps.
> > >
> > > I'm not sure, if I'm working with the TI format (which as far as I
> > > understand is the most common?) then to have a signed value correctly
> > > represented as a Q<4,8> I would have to use Q<5,8> (which is
> > > counter-intuitive).
> > >
> > > I would rather modify the Trait to put the sign in the [m + n + 1]
> > > bit.
> > >
> > > Or are the registers you're working with in ARM format ? (sign in
> > > [m + n] position)
> >
> > That's (include the bit) what the original fixedToFloatingPoint()
> > implementations used, so that's what I've continued with.
> 
> I see but that doesn't mean it's correct.
> 
> I read one platform manual the description of a coefficient as
> 
> "8:0 cc_coeff_0 Coefficient 0 for color space conversion"
> color conversion coefficients are signed integer values with a 7 bit
> fractional part; range: [-2…1.992]
> 
> so if there are 7 fractional bit and the max achievable value is 1.992
> it means that the value is in Q<1,7> format as:
> 
>         (1 << (1 + 7)) - 1 / (1 << 7) = 1.999
> 
> the register size is 9 bits (see the [8:0] in the register
> description) so I the sign bit is at location [8].
> 
> Am I wrong that I want to obtain this with your model I would have to
> describe the fixed point representation as Q<2,7> (which doesn't match
> the datasheet) ?

Why doesn't this match the datasheet ? The text you quoted says 7 bits of
fractional value (match), 9 bits register field (8:0, matching 2+7), and
the range of Q<2,7> is -2 to +1.992 (1.9921875 to be precise).
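
The range can be checked with a small sketch. signedRange() and Range are
illustrative names, not libipa API. The sketch assumes two's complement
storage in I + F bits, with the sign bit counted in I as in this series:

```cpp
#include <cassert>
#include <cstdint>

/* Illustrative only: limits of a signed fixed-point format. */
struct Range {
	double min;
	double max;
	double step;
};

/*
 * Range of a two's complement value stored in I + F bits, where I
 * includes the sign bit.
 */
template<unsigned int I, unsigned int F>
constexpr Range signedRange()
{
	static_assert(I >= 1 && I + F <= 32, "unsupported width");

	constexpr int64_t one = int64_t{ 1 } << F;
	constexpr int64_t rawMax = (int64_t{ 1 } << (I + F - 1)) - 1;
	constexpr int64_t rawMin = -(int64_t{ 1 } << (I + F - 1));

	return { static_cast<double>(rawMin) / one,
		 static_cast<double>(rawMax) / one,
		 1.0 / one };
}

/* The 9-bit cc_coeff field: -2 to 1.9921875 in steps of 1/128. */
static_assert(signedRange<2, 7>().min == -2.0, "Q<2, 7> minimum");
static_assert(signedRange<2, 7>().max == 1.9921875, "Q<2, 7> maximum");
```

With the same convention, signedRange<1, 7>() gives -1 to 0.9921875, which
agrees with the Q1.7 line in the test comments quoted earlier.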

> And I guess this really is the difference between UQ<m, n> and Q<m, n>
> 
> usigned Q has no sign bit and the destination register is of size [m+n]
> signed Q has a sign bit in position [m+n+1] with the value in 2's
> complement format and destination register of size [m+n+1]

In Kieran's implementation, Q<m, n> is stored in m+n bits, not m+n+1.

> > If you want to distinguish these? How should we represent them?
> >
> > /* All 8 bit storage */
> > UQ<1, 7> Q<1, 7> Q_TI<0, 7> ?
> 
> Let's start by deciding what behaviour we want by default maybe..

Let's pick one option and stick to it please. Yes, writing Q<4, 12> when
a TI datasheet says "Q3.12 value" may be a bit confusing, but it's
encoded in the type in one place and the rest of the code doesn't have
to think about it.

We *could* define device-specific aliases in specific IPA modules if we
really wanted, but I wouldn't define multiple types in libipa.
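
A rough sketch of what such an alias could look like. The Q<> below is a
stand-in that only records bit widths, not the type from the Quantized
series, and DatasheetQ is a hypothetical name:

```cpp
#include <cassert>
#include <type_traits>

/* Stand-in for the signed Quantized type; the real Q<> differs. */
template<unsigned int I, unsigned int F>
struct Q {
	static constexpr unsigned int integerBits = I; /* includes the sign */
	static constexpr unsigned int fractionalBits = F;
	static constexpr unsigned int storageBits = I + F;
};

/*
 * Hypothetical per-IPA alias for datasheets that do not count the sign
 * bit in 'm': a datasheet Qm.n value then occupies 1 + m + n bits.
 */
template<unsigned int M, unsigned int N>
using DatasheetQ = Q<M + 1, N>;

static_assert(std::is_same<DatasheetQ<3, 12>, Q<4, 12>>::value,
	      "a datasheet Q3.12 value maps to Q<4, 12>");
static_assert(DatasheetQ<4, 8>::storageBits == 13,
	      "a datasheet signed Q4.8 value is stored in 13 bits");
```

The alias keeps the translation from the datasheet notation in a single
place inside the IPA module, while libipa still exposes only one type.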

> > > > > I guess the only way to know which one is meant to be used is to
> > > > > actually look at the register sizes. If a Q<4,8> number is stored as
> > > > > a 13 bit fields, then the TI version is used. I wonder how common the
> > > > > ARM version is.
> > > > >
> > > > > > """
> > > > > >  * The sign of the value is determined by the sign of \a T. For signed types,
> > > > > >  * the number of integer bits includes the sign bit.
> > > > > > """
> > > > > >
> > > > > > > I'm a bit confused here. Doesn't signed Q4.8 mean that the first bit of
> > > > > > > the 4 is the sign bit? The same way a signed int32 has the signed bit on
> > > > > > > the first of the 32 bits?
> > > > > > >
> > > > > > > > + *
> > > > > > > > + * \code{.cpp}
> > > > > > > > + * double d = fixedToFloatingPoint<5, 8, double, uint16_t>(fixed);
> > > > > > > > + * \endcode
> > > > > > > > + *
> > > > > > > > + * While a value represented as unsigned fixed-point Q4.8 format can be
> > > > > > > > + * converted as:
> > > > > > > > + *
> > > > > > > > + * \code{.cpp}
> > > > > > > > + * double d = fixedToFloatingPoint<4, 8, double, uint16_t>(fixed);
> > > > > > > > + * \endcode
> > > > > > > > + *
> > > > > > > >   * \return The converted value
> > > > > > > >   */
> > > > > > > >
Jacopo Mondi Jan. 21, 2026, 4:54 p.m. UTC | #15
Hi Laurent

On Wed, Jan 21, 2026 at 06:37:55PM +0200, Laurent Pinchart wrote:
> On Wed, Jan 21, 2026 at 05:13:02PM +0100, Jacopo Mondi wrote:
> > On Wed, Jan 21, 2026 at 03:44:01PM +0000, Kieran Bingham wrote:
> > > Quoting Jacopo Mondi (2026-01-21 15:12:24)
> > > > On Wed, Jan 21, 2026 at 02:45:04PM +0000, Kieran Bingham wrote:
> > > > > Quoting Jacopo Mondi (2026-01-21 12:53:49)
> > > > > > On Wed, Jan 21, 2026 at 12:23:40PM +0000, Kieran Bingham wrote:
> > > > > > > Quoting Stefan Klug (2026-01-20 08:53:06)
> > > > > > > > Quoting Jacopo Mondi (2026-01-20 09:39:49)
> > > > > > > > > Converting numbers with a signed fixed-point representation to
> > > > > > > > > the corresponding float value requires to include the sign bit in the
> > > > > > > > > width of the fixed-point integral part.
> > > > > > > > >
> > > > > > > > > Clearly specify it in documentation.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com>
> > > > > > > > > ---
> > > > > > > > >  src/ipa/libipa/fixedpoint.cpp | 22 +++++++++++++++++++++-
> > > > > > > > >  1 file changed, 21 insertions(+), 1 deletion(-)
> > > > > > > > >
> > > > > > > > > diff --git a/src/ipa/libipa/fixedpoint.cpp b/src/ipa/libipa/fixedpoint.cpp
> > > > > > > > > index 6b698fc5d680..b37cdc43936f 100644
> > > > > > > > > --- a/src/ipa/libipa/fixedpoint.cpp
> > > > > > > > > +++ b/src/ipa/libipa/fixedpoint.cpp
> > > > > > > > > @@ -29,11 +29,31 @@ namespace ipa {
> > > > > > > > >  /**
> > > > > > > > >   * \fn R fixedToFloatingPoint(T number)
> > > > > > > > >   * \brief Convert a fixed-point number to a floating point representation
> > > > > > > > > - * \tparam I Bit width of the integer part of the fixed-point
> > > > > > > > > + * \tparam I Bit width of the integer part of the fixed-point including the
> > > > > > > > > + * optional sign bit
> > > > > > > > >   * \tparam F Bit width of the fractional part of the fixed-point
> > > > > > > > >   * \tparam R Return type of the floating point representation
> > > > > > > > >   * \tparam T Input type of the fixed-point representation
> > > > > > > > >   * \param number The fixed point number to convert to floating point
> > > > > > > > > + *
> > > > > > > > > + * If the fixed-point representation is signed, the sign bit shall be included
> > > > > > > > > + * in the \a I template parameter that specifies the number of bits of the
> > > > > > > > > + * integral part of the fixed-point representation.
> > > > > > > > > + *
> > > > > > > > > + * As an example, a value represented as signed fixed-point Q4.8 format can be
> > > > > > > > > + * converted to its corresponding floating point representation as:
> > > > > > >
> > > > > > > Just to be sure - you know I've got patches to remove all of the above
> > > > > > > that I want to get merged 'soon' right?
> > > > > >
> > > > > > Read the last bit of my reply from yesterday :)
> > >
> > > I still don't get this?
> >
> > I meant the discussion on sign/magnitude representation
> >
> > sign/magnitude is a different representation of signed integers
> > compared to the de-facto standard 2's complement. It requires to
> > manipulate the result of the float-to-fixed conversion so that we take
> > the absolute value and the sign bit is set in the [m + n + 1] bit
> >
> > ------------------------------------------------------------------------------
> > As a bit of pseudo code
> >
> >         int reg = static_cast<int>(std::round(number * (1 << F)))) & mask;
> >         uint16_t res += std::abs(reg);
> >         if (reg < 0)
> >                 res |= BIT(13);
> >
> >
> > I think this could be surely optimized and nicely made a Traits that
> > can be added to the Quantized series Kieran is working on.
> > ------------------------------------------------------------------------------
> >
> > The above is the pharse I thought it could make you happy:
> > sign/magnitude fixed-point formats can be easily be represented with a
> > Trait on top of your series
> >
> > > > > > > Quantized brings in explicit signed/unsigned types through Q<4,8> and
> > > > > > > UQ<4, 8> types.
> > > > > >
> > > > > > What is the difference between signed and unsigned ? Is it only the
> > > > > > sign bit ? I guess then that the Q<4,8>[12:0] = UQ<4,8>[11:0]
> > > > >
> > > > > Please take a look through the tests I've added:
> > > > >
> > > > > https://patchwork.libcamera.org/patch/25801/
> > > > >
> > > > > /* Q1.7(-1 .. 0.992188)  Min: [0x80:-1] -- Max: [0x7f:0.992188] Step:0.0078125*/
> > > > > /* UQ1.7(0 .. 1.99219)  Min: [0x00:0] -- Max: [0xff:1.99219] Step:0.0078125 */
> > > > >
> > > > > /* Q12.4(-2048 .. 2047.94)  Min: [0x8000:-2048] -- Max: [0x7fff:2047.94] Step:0.0625 */
> > > > > /* UQ12.4(0 .. 4095.94)  Min: [0x0000:0] -- Max: [0xffff:4095.94] Step:0.0625 */
> > > > >
> > > > > It's easy to extend that if you have specific Q types you want to
> > > > > use/test.
> > > >
> > > > Ah yes, for min/max it's defintely useful to have signed/unsigned
> > > > types
> > >
> > > It's not about min/max is useful - it's the very fact that Q and UQ have
> > > a distinct range. Q types can go less than zero but still span the same
> > > distance, so the top/max is halved, but the step size is the same.
> >
> > Yes, min/max and range indeed.
> >
> > >
> > >
> > > > > > > In the new types Q<I, F> has the sign bit included in 'I'.
> > > > > > > I can add that explicitly to the documentation in my new series for v6.
> > > > > >
> > > > > >
> > > > > > Well, maybe we need two traits ?
> > > > > > https://en.wikipedia.org/wiki/Q_(number_format)
> > > > > >
> > > > > > Texas Instruments version:
> > > > > > The first bit always gives the sign of the value (1 = negative, 0 =
> > > > > > non-negative), and it is not counted in the m parameter. Thus, the
> > > > > > total number w of bits used is 1 + m + n.
> > > > > >
> > > > > > ARM Version:
> > > > > > A variant of the Q notation has been in use by ARM in which the m
> > > > > > number also counts the sign bit
> > > > >
> > > > > Yes, you've definitely got to know which one the hardware is using and
> > > > > expecting. I wouldn't make a new trait for this - if we have to specify
> > > > > we can wrap one in the other if it really helps.
> > > >
> > > > I'm not sure, if I'm working with the TI format (which as far as I
> > > > understand is the most common?) then to have a signed value correctly
> > > > represented as a Q<4,8> I would have to use Q<5,8> (which is
> > > > counter-intuitive).
> > > >
> > > > I would rather modify the Trait to put the sign in the [m + n + 1]
> > > > bit.
> > > >
> > > > Or are the registers you're working with in ARM format ? (sign in
> > > > [m + n] position)
> > >
> > > That's (include the bit) what the original fixedToFloatingPoint()
> > > implementations used, so that's what I've continued with.
> >
> > I see but that doesn't mean it's correct.
> >
> > I read one platform manual the description of a coefficient as
> >
> > "8:0 cc_coeff_0 Coefficient 0 for color space conversion"
> > color conversion coefficients are signed integer values with a 7 bit
> > fractional part; range: [-2…1.992]
> >
> > so if there are 7 fractional bit and the max achievable value is 1.992
> > it means that the value is in Q<1,7> format as:
> >
> >         (1 << (1 + 7)) - 1 / (1 << 7) = 1.999
> >
> > the register size is 9 bits (see the [8:0] in the register
> > description) so I the sign bit is at location [8].
> >
> > Am I wrong that I want to obtain this with your model I would have to
> > describe the fixed point representation as Q<2,7> (which doesn't match
> > the datasheet) ?
>
> Why doesn't this match the datasheet ? The text you quoted says 7 bits of
> fractional value (match), 9 bits register field (8:0, matching 2+7), and
> the range of Q<2,7> is -2 to +1.992 (1.9921875 to be precise).

Ok, this datasheet doesn't specify the value for 'm' but do we agree
that if m has to indicate the "integer" part, then it should be 1 and
not 2 ?

In the same datasheet we also have:

  10:0 ct_coeff
  "Values are 11-bit signed fixed-point numbers with 4 bit integer and 7
  bit fractional part, ranging from -8 (0x400) to +7.992 (0x3FF)."

In this case the value is suggested as Q<4,7> and the register is of
11 bits, so bit[11] is the sign.
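
The two limits quoted from the datasheet can be reproduced with a short
decode sketch. decodeSignedField() is an illustrative helper, not the
libipa API, and it assumes the field is plain two's complement with the
sign in its top bit:

```cpp
#include <cassert>
#include <cstdint>

/*
 * Illustrative decode of a two's complement register field that is
 * `width` bits wide, of which `frac` bits are fractional.
 */
inline double decodeSignedField(uint32_t raw, unsigned int width,
				unsigned int frac)
{
	assert(width >= 1 && width <= 30 && frac < width);

	const uint32_t signBit = 1u << (width - 1);

	/* Drop anything above the field. */
	raw &= (signBit << 1) - 1;

	/* Sign-extend: values with the top bit set are negative. */
	int32_t value = static_cast<int32_t>(raw);
	if (raw & signBit)
		value -= static_cast<int32_t>(signBit << 1);

	return static_cast<double>(value) / (1u << frac);
}
```

For the 11-bit ct_coeff field with 7 fractional bits this returns -8 for
0x400 and 7.9921875 for 0x3FF, matching the range given in the datasheet.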

Datasheets for other platforms clearly say that a signed Q<4,8> format
is stored in 13 bits, so I would have to use Q<5,8> to have the sign
bit in position [13], I guess.

I feel like, given the wide variety of options, we should be able to
control where the sign bit goes to accommodate different vendors, or
even different register formats from the same vendor.

>
> > And I guess this really is the difference between UQ<m, n> and Q<m, n>
> >
> > usigned Q has no sign bit and the destination register is of size [m+n]
> > signed Q has a sign bit in position [m+n+1] with the value in 2's
> > complement format and destination register of size [m+n+1]
>
> In Kieran's implementation, Q<m, n> is stored in m+n bits, not m+n+1.
>
> > > If you want to distinguish these? How should we represent them?
> > >
> > > /* All 8 bit storage */
> > > UQ<1, 7> Q<1, 7> Q_TI<0, 7> ?
> >
> > Let's start by deciding what behaviour we want by default maybe..
>
> Let's pick one option and stick to it please. Yes, writing Q<4, 12> when
> a TI datasheet says "Q3.12 value" may be a bit confusing, but it's

I'm not sure this is limited to TI: I actually see datasheets from the
author of the variant Q format that comply with the TI version of the Q
format. So don't assume the "ARM format" is used on ARM platforms and
the TI format on TI ones.


> encoding in the type in one place and the rest of the code doesn't have
> to think about it.
>
> We *could* define device-specific aliases in specific IPA modules if we
> really wanted, but I wouldn't define multiple types in libipa.
>
> > > > > > I guess the only way to know which one is meant to be used is to
> > > > > > actually look at the register sizes. If a Q<4,8> number is stored as
> > > > > > a 13 bit fields, then the TI version is used. I wonder how common the
> > > > > > ARM version is.
> > > > > >
> > > > > > > """
> > > > > > >  * The sign of the value is determined by the sign of \a T. For signed types,
> > > > > > >  * the number of integer bits includes the sign bit.
> > > > > > > """
> > > > > > >
> > > > > > > > I'm a bit confused here. Doesn't signed Q4.8 mean that the first bit of
> > > > > > > > the 4 is the sign bit? The same way a signed int32 has the signed bit on
> > > > > > > > the first of the 32 bits?
> > > > > > > >
> > > > > > > > > + *
> > > > > > > > > + * \code{.cpp}
> > > > > > > > > + * double d = fixedToFloatingPoint<5, 8, double, uint16_t>(fixed);
> > > > > > > > > + * \endcode
> > > > > > > > > + *
> > > > > > > > > + * While a value represented as unsigned fixed-point Q4.8 format can be
> > > > > > > > > + * converted as:
> > > > > > > > > + *
> > > > > > > > > + * \code{.cpp}
> > > > > > > > > + * double d = fixedToFloatingPoint<4, 8, double, uint16_t>(fixed);
> > > > > > > > > + * \endcode
> > > > > > > > > + *
> > > > > > > > >   * \return The converted value
> > > > > > > > >   */
> > > > > > > > >
>
> --
> Regards,
>
> Laurent Pinchart
Laurent Pinchart Jan. 21, 2026, 6 p.m. UTC | #16
On Wed, Jan 21, 2026 at 05:54:35PM +0100, Jacopo Mondi wrote:
> On Wed, Jan 21, 2026 at 06:37:55PM +0200, Laurent Pinchart wrote:
> > On Wed, Jan 21, 2026 at 05:13:02PM +0100, Jacopo Mondi wrote:
> > > On Wed, Jan 21, 2026 at 03:44:01PM +0000, Kieran Bingham wrote:
> > > > Quoting Jacopo Mondi (2026-01-21 15:12:24)
> > > > > On Wed, Jan 21, 2026 at 02:45:04PM +0000, Kieran Bingham wrote:
> > > > > > Quoting Jacopo Mondi (2026-01-21 12:53:49)
> > > > > > > On Wed, Jan 21, 2026 at 12:23:40PM +0000, Kieran Bingham wrote:
> > > > > > > > Quoting Stefan Klug (2026-01-20 08:53:06)
> > > > > > > > > Quoting Jacopo Mondi (2026-01-20 09:39:49)
> > > > > > > > > > Converting numbers with a signed fixed-point representation to
> > > > > > > > > > the corresponding float value requires to include the sign bit in the
> > > > > > > > > > width of the fixed-point integral part.
> > > > > > > > > >
> > > > > > > > > > Clearly specify it in documentation.
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com>
> > > > > > > > > > ---
> > > > > > > > > >  src/ipa/libipa/fixedpoint.cpp | 22 +++++++++++++++++++++-
> > > > > > > > > >  1 file changed, 21 insertions(+), 1 deletion(-)
> > > > > > > > > >
> > > > > > > > > > diff --git a/src/ipa/libipa/fixedpoint.cpp b/src/ipa/libipa/fixedpoint.cpp
> > > > > > > > > > index 6b698fc5d680..b37cdc43936f 100644
> > > > > > > > > > --- a/src/ipa/libipa/fixedpoint.cpp
> > > > > > > > > > +++ b/src/ipa/libipa/fixedpoint.cpp
> > > > > > > > > > @@ -29,11 +29,31 @@ namespace ipa {
> > > > > > > > > >  /**
> > > > > > > > > >   * \fn R fixedToFloatingPoint(T number)
> > > > > > > > > >   * \brief Convert a fixed-point number to a floating point representation
> > > > > > > > > > - * \tparam I Bit width of the integer part of the fixed-point
> > > > > > > > > > + * \tparam I Bit width of the integer part of the fixed-point including the
> > > > > > > > > > + * optional sign bit
> > > > > > > > > >   * \tparam F Bit width of the fractional part of the fixed-point
> > > > > > > > > >   * \tparam R Return type of the floating point representation
> > > > > > > > > >   * \tparam T Input type of the fixed-point representation
> > > > > > > > > >   * \param number The fixed point number to convert to floating point
> > > > > > > > > > + *
> > > > > > > > > > + * If the fixed-point representation is signed, the sign bit shall be included
> > > > > > > > > > + * in the \a I template parameter that specifies the number of bits of the
> > > > > > > > > > + * integral part of the fixed-point representation.
> > > > > > > > > > + *
> > > > > > > > > > + * As an example, a value represented as signed fixed-point Q4.8 format can be
> > > > > > > > > > + * converted to its corresponding floating point representation as:
> > > > > > > >
> > > > > > > > Just to be sure - you know I've got patches to remove all of the above
> > > > > > > > that I want to get merged 'soon' right?
> > > > > > >
> > > > > > > Read the last bit of my reply from yesterday :)
> > > >
> > > > I still don't get this?
> > >
> > > I meant the discussion on sign/magnitude representation
> > >
> > > sign/magnitude is a different representation of signed integers
> > > compared to the de-facto standard 2's complement. It requires to
> > > manipulate the result of the float-to-fixed conversion so that we take
> > > the absolute value and the sign bit is set in the [m + n + 1] bit
> > >
> > > ------------------------------------------------------------------------------
> > > As a bit of pseudo code
> > >
> > >         int reg = static_cast<int>(std::round(number * (1 << F)))) & mask;
> > >         uint16_t res += std::abs(reg);
> > >         if (reg < 0)
> > >                 res |= BIT(13);
> > >
> > >
> > > I think this could be surely optimized and nicely made a Traits that
> > > can be added to the Quantized series Kieran is working on.
> > > ------------------------------------------------------------------------------
> > >
> > > The above is the pharse I thought it could make you happy:
> > > sign/magnitude fixed-point formats can be easily be represented with a
> > > Trait on top of your series
> > >
> > > > > > > > Quantized brings in explicit signed/unsigned types through Q<4,8> and
> > > > > > > > UQ<4, 8> types.
> > > > > > >
> > > > > > > What is the difference between signed and unsigned ? Is it only the
> > > > > > > sign bit ? I guess then that the Q<4,8>[12:0] = UQ<4,8>[11:0]
> > > > > >
> > > > > > Please take a look through the tests I've added:
> > > > > >
> > > > > > https://patchwork.libcamera.org/patch/25801/
> > > > > >
> > > > > > /* Q1.7(-1 .. 0.992188)  Min: [0x80:-1] -- Max: [0x7f:0.992188] Step:0.0078125*/
> > > > > > /* UQ1.7(0 .. 1.99219)  Min: [0x00:0] -- Max: [0xff:1.99219] Step:0.0078125 */
> > > > > >
> > > > > > /* Q12.4(-2048 .. 2047.94)  Min: [0x8000:-2048] -- Max: [0x7fff:2047.94] Step:0.0625 */
> > > > > > /* UQ12.4(0 .. 4095.94)  Min: [0x0000:0] -- Max: [0xffff:4095.94] Step:0.0625 */
> > > > > >
> > > > > > It's easy to extend that if you have specific Q types you want to
> > > > > > use/test.
> > > > >
> > > > > Ah yes, for min/max it's defintely useful to have signed/unsigned
> > > > > types
> > > >
> > > > It's not about min/max is useful - it's the very fact that Q and UQ have
> > > > a distinct range. Q types can go less than zero but still span the same
> > > > distance, so the top/max is halved, but the step size is the same.
> > >
> > > Yes, min/max and range indeed.
> > >
> > > > > > > > In the new types Q<I, F> has the sign bit included in 'I'.
> > > > > > > > I can add that explicitly to the documentation in my new series for v6.
> > > > > > >
> > > > > > >
> > > > > > > Well, maybe we need two traits ?
> > > > > > > https://en.wikipedia.org/wiki/Q_(number_format)
> > > > > > >
> > > > > > > Texas Instruments version:
> > > > > > > The first bit always gives the sign of the value (1 = negative, 0 =
> > > > > > > non-negative), and it is not counted in the m parameter. Thus, the
> > > > > > > total number w of bits used is 1 + m + n.
> > > > > > >
> > > > > > > ARM Version:
> > > > > > > A variant of the Q notation has been in use by ARM in which the m
> > > > > > > number also counts the sign bit
> > > > > >
> > > > > > Yes, you've definitely got to know which one the hardware is using and
> > > > > > expecting. I wouldn't make a new trait for this - if we have to specify
> > > > > > we can wrap one in the other if it really helps.
> > > > >
> > > > > I'm not sure, if I'm working with the TI format (which as far as I
> > > > > understand is the most common?) then to have a signed value correctly
> > > > > represented as a Q<4,8> I would have to use Q<5,8> (which is
> > > > > counter-intuitive).
> > > > >
> > > > > I would rather modify the Trait to put the sign in the [m + n + 1]
> > > > > bit.
> > > > >
> > > > > Or are the registers you're working with in ARM format ? (sign in
> > > > > [m + n] position)
> > > >
> > > > That's (include the bit) what the original fixedToFloatingPoint()
> > > > implementations used, so that's what I've continued with.
> > >
> > > I see but that doesn't mean it's correct.
> > >
> > > I read in one platform manual the description of a coefficient as
> > >
> > > "8:0 cc_coeff_0 Coefficient 0 for color space conversion"
> > > color conversion coefficients are signed integer values with a 7 bit
> > > fractional part; range: [-2…1.992]
> > >
> > > so if there are 7 fractional bits and the max achievable value is 1.992
> > > it means that the value is in Q<1,7> format as:
> > >
> > >         ((1 << (1 + 7)) - 1) / (1 << 7) = 1.992
> > >
> > > the register size is 9 bits (see the [8:0] in the register
> > > description) so the sign bit is at location [8].
> > >
> > > Am I wrong that, to obtain this with your model, I would have to
> > > describe the fixed point representation as Q<2,7> (which doesn't match
> > > the datasheet) ?
> >
> > Why doesn't this match the datasheet ? The text you quoted says 7 bits of
> > fractional value (match), 9 bits register field (8:0, matching 2+7), and
> > the range of Q<2,7> is -2 to +1.992 (1.9921875 to be precise).
> 
> Ok, this datasheet doesn't specify the value for 'm' but do we agree
> that if m has to indicate the "integer" part, then it should be 1 and
> not 2 ?

No :-) If you want a range from -2 to 1.992, the 'm' value given the
convention in this series is 2.
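
To make the arithmetic behind that range explicit, here is a minimal sketch of the min/max of a signed two's complement Q<I, F> value when the sign bit is counted in I. The helper names qMin() and qMax() are made up for this mail and are not part of libipa.

```cpp
// Hypothetical helpers, not part of libipa: range of a signed Q<I, F> value
// stored as two's complement in I + F bits, with the sign bit counted in I.
template<unsigned int I, unsigned int F>
constexpr double qMin()
{
	static_assert(I >= 1 && I + F <= 31, "field must fit in 31 bits");
	// Most negative code: only the sign bit set, i.e. -2^(I + F - 1) steps.
	return -static_cast<double>(1u << (I + F - 1)) / (1u << F);
}

template<unsigned int I, unsigned int F>
constexpr double qMax()
{
	static_assert(I >= 1 && I + F <= 31, "field must fit in 31 bits");
	// Most positive code: all bits below the sign bit set.
	return static_cast<double>((1u << (I + F - 1)) - 1) / (1u << F);
}

// The 9-bit cc_coeff_0 field quoted above: 7 fractional bits, range -2 to 1.992.
static_assert(qMin<2, 7>() == -2.0, "Q<2, 7> minimum");
static_assert(qMax<2, 7>() == 1.9921875, "Q<2, 7> maximum");
```

With I = 1 the same helpers give -1 to 0.9921875, which matches the Q1.7 line in the test output quoted further down.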

> In the same datasheet we also have:
> 
>   10:0 ct_coeff
>   Values are 11-bit signed fixed-point numbers with 4 bit integer and 7
>   bit fractional part, ranging from -8 (0x400) to +7.992 (0x3FF)."
> 
> In this case the value is suggested as Q<4,7> and the register is of
> 11 bits, so bit[11] is the sign.
> 
> Datasheets for other platforms clearly say that a signed Q<4,8> format
> is stored in 13 bits, so I should have to use Q<5,8> to have the sign
> bit in position [13] I guess

As discussed in this thread, there are multiple conventions. The
convention taken in this series is that Q<4, 8> is stored in 12 bits.
There's no single convention that will match all documentation ever
written, so we should pick one and live with it. I vote for the
convention in this series (a.k.a. the ARM convention).
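
As a rough illustration of how the two conventions relate, the sketch below computes the storage width under each and the I value to use for a signed field documented in TI-style Qm.n notation. All three helper names are hypothetical and only meant to show the bookkeeping, they don't exist in libipa.

```cpp
// Hypothetical helpers for illustration only; none of these exist in libipa.

// Storage width of Q<I, F> in the convention of this series: the sign bit is
// one of the I integer bits.
constexpr unsigned int storageBits(unsigned int I, unsigned int F)
{
	return I + F;
}

// Storage width of a signed "Qm.n" value in the TI convention: the sign bit
// comes on top of the m integer bits.
constexpr unsigned int tiStorageBits(unsigned int m, unsigned int n)
{
	return 1 + m + n;
}

// Integer width to pass as I for a signed field documented as TI-style Qm.n.
constexpr unsigned int integerBitsFromTi(unsigned int m)
{
	return m + 1;
}

static_assert(storageBits(4, 8) == 12, "Q<4, 8> occupies 12 bits here");
static_assert(tiStorageBits(4, 8) == 13, "TI Q4.8 occupies 13 bits");
static_assert(storageBits(integerBitsFromTi(4), 8) == tiStorageBits(4, 8),
	      "TI Q4.8 maps to Q<5, 8>");
```

So a datasheet that stores a signed "Q4.8" in 13 bits would be written as Q<5, 8> with the convention of this series.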

> I feel like, given the wide variety of options, we should be able to
> control where the sign bit goes to accommodate different vendors, or
> even different register formats from the same vendor.
> 
> > > And I guess this really is the difference between UQ<m, n> and Q<m, n>
> > >
> > > unsigned Q has no sign bit and the destination register is of size [m+n]
> > > signed Q has a sign bit in position [m+n+1] with the value in 2's
> > > complement format and destination register of size [m+n+1]
> >
> > In Kieran's implementation, Q<m, n> is stored in m+n bits, not m+n+1.
> >
> > > > If you want to distinguish these? How should we represent them?
> > > >
> > > > /* All 8 bit storage */
> > > > UQ<1, 7> Q<1, 7> Q_TI<0, 7> ?
> > >
> > > Let's start by deciding what behaviour we want by default maybe..
> >
> > Let's pick one option and stick to it please. Yes, writing Q<4, 12> when
> > a TI datasheet says "Q3.12 value" may be a bit confusing, but it's
> 
> I'm not sure this is limited to TI, I actually see datasheets from the
> author of the variant Q format complying with the TI version of the Q
> format.. So don't assume the "ARM format" is used on ARM platforms and
> TI format on TI ones..
> 
> > encoding in the type in one place and the rest of the code doesn't have
> > to think about it.
> >
> > We *could* define device-specific aliases in specific IPA modules if we
> > really wanted, but I wouldn't define multiple types in libipa.
> >
> > > > > > > I guess the only way to know which one is meant to be used is to
> > > > > > > actually look at the register sizes. If a Q<4,8> number is stored as
> > > > > > > a 13 bit fields, then the TI version is used. I wonder how common the
> > > > > > > ARM version is.
> > > > > > >
> > > > > > > > """
> > > > > > > >  * The sign of the value is determined by the sign of \a T. For signed types,
> > > > > > > >  * the number of integer bits includes the sign bit.
> > > > > > > > """
> > > > > > > >
> > > > > > > > > I'm a bit confused here. Doesn't signed Q4.8 mean that the first bit of
> > > > > > > > > the 4 is the sign bit? The same way a signed int32 has the signed bit on
> > > > > > > > > the first of the 32 bits?
> > > > > > > > >
> > > > > > > > > > + *
> > > > > > > > > > + * \code{.cpp}
> > > > > > > > > > + * double d = fixedToFloatingPoint<5, 8, double, uint16_t>(fixed);
> > > > > > > > > > + * \endcode
> > > > > > > > > > + *
> > > > > > > > > > + * While a value represented as unsigned fixed-point Q4.8 format can be
> > > > > > > > > > + * converted as:
> > > > > > > > > > + *
> > > > > > > > > > + * \code{.cpp}
> > > > > > > > > > + * double d = fixedToFloatingPoint<4, 8, double, uint16_t>(fixed);
> > > > > > > > > > + * \endcode
> > > > > > > > > > + *
> > > > > > > > > >   * \return The converted value
> > > > > > > > > >   */
> > > > > > > > > >
Jacopo Mondi Jan. 21, 2026, 6:17 p.m. UTC | #17
Hi Laurent

On Wed, Jan 21, 2026 at 08:00:08PM +0200, Laurent Pinchart wrote:
> On Wed, Jan 21, 2026 at 05:54:35PM +0100, Jacopo Mondi wrote:
> > On Wed, Jan 21, 2026 at 06:37:55PM +0200, Laurent Pinchart wrote:
> > > On Wed, Jan 21, 2026 at 05:13:02PM +0100, Jacopo Mondi wrote:
> > > > On Wed, Jan 21, 2026 at 03:44:01PM +0000, Kieran Bingham wrote:
> > > > > Quoting Jacopo Mondi (2026-01-21 15:12:24)
> > > > > > On Wed, Jan 21, 2026 at 02:45:04PM +0000, Kieran Bingham wrote:
> > > > > > > Quoting Jacopo Mondi (2026-01-21 12:53:49)
> > > > > > > > On Wed, Jan 21, 2026 at 12:23:40PM +0000, Kieran Bingham wrote:
> > > > > > > > > Quoting Stefan Klug (2026-01-20 08:53:06)
> > > > > > > > > > Quoting Jacopo Mondi (2026-01-20 09:39:49)
> > > > > > > > > > > Converting numbers with a signed fixed-point representation to
> > > > > > > > > > > the corresponding float value requires to include the sign bit in the
> > > > > > > > > > > width of the fixed-point integral part.
> > > > > > > > > > >
> > > > > > > > > > > Clearly specify it in documentation.
> > > > > > > > > > >
> > > > > > > > > > > Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com>
> > > > > > > > > > > ---
> > > > > > > > > > >  src/ipa/libipa/fixedpoint.cpp | 22 +++++++++++++++++++++-
> > > > > > > > > > >  1 file changed, 21 insertions(+), 1 deletion(-)
> > > > > > > > > > >
> > > > > > > > > > > diff --git a/src/ipa/libipa/fixedpoint.cpp b/src/ipa/libipa/fixedpoint.cpp
> > > > > > > > > > > index 6b698fc5d680..b37cdc43936f 100644
> > > > > > > > > > > --- a/src/ipa/libipa/fixedpoint.cpp
> > > > > > > > > > > +++ b/src/ipa/libipa/fixedpoint.cpp
> > > > > > > > > > > @@ -29,11 +29,31 @@ namespace ipa {
> > > > > > > > > > >  /**
> > > > > > > > > > >   * \fn R fixedToFloatingPoint(T number)
> > > > > > > > > > >   * \brief Convert a fixed-point number to a floating point representation
> > > > > > > > > > > - * \tparam I Bit width of the integer part of the fixed-point
> > > > > > > > > > > + * \tparam I Bit width of the integer part of the fixed-point including the
> > > > > > > > > > > + * optional sign bit
> > > > > > > > > > >   * \tparam F Bit width of the fractional part of the fixed-point
> > > > > > > > > > >   * \tparam R Return type of the floating point representation
> > > > > > > > > > >   * \tparam T Input type of the fixed-point representation
> > > > > > > > > > >   * \param number The fixed point number to convert to floating point
> > > > > > > > > > > + *
> > > > > > > > > > > + * If the fixed-point representation is signed, the sign bit shall be included
> > > > > > > > > > > + * in the \a I template parameter that specifies the number of bits of the
> > > > > > > > > > > + * integral part of the fixed-point representation.
> > > > > > > > > > > + *
> > > > > > > > > > > + * As an example, a value represented as signed fixed-point Q4.8 format can be
> > > > > > > > > > > + * converted to its corresponding floating point representation as:
> > > > > > > > >
> > > > > > > > > Just to be sure - you know I've got patches to remove all of the above
> > > > > > > > > that I want to get merged 'soon' right?
> > > > > > > >
> > > > > > > > Read the last bit of my reply from yesterday :)
> > > > >
> > > > > I still don't get this?
> > > >
> > > > I meant the discussion on sign/magnitude representation
> > > >
> > > > sign/magnitude is a different representation of signed integers
> > > > compared to the de-facto standard 2's complement. It requires to
> > > > manipulate the result of the float-to-fixed conversion so that we take
> > > > the absolute value and the sign bit is set in the [m + n + 1] bit
> > > >
> > > > ------------------------------------------------------------------------------
> > > > As a bit of pseudo code
> > > >
> > > >         int reg = static_cast<int>(std::round(number * (1 << F)));
> > > >         uint16_t res = std::abs(reg) & mask;
> > > >         if (reg < 0)
> > > >                 res |= BIT(13);
> > > >
> > > >
> > > > I think this could surely be optimized and nicely made into a Trait that
> > > > can be added to the Quantized series Kieran is working on.
> > > > ------------------------------------------------------------------------------
> > > >
> > > > The above is the phrase I thought could make you happy:
> > > > sign/magnitude fixed-point formats can easily be represented with a
> > > > Trait on top of your series
> > > >
> > > > > > > > > Quantized brings in explicit signed/unsigned types through Q<4,8> and
> > > > > > > > > UQ<4, 8> types.
> > > > > > > >
> > > > > > > > What is the difference between signed and unsigned ? Is it only the
> > > > > > > > sign bit ? I guess then that the Q<4,8>[12:0] = UQ<4,8>[11:0]
> > > > > > >
> > > > > > > Please take a look through the tests I've added:
> > > > > > >
> > > > > > > https://patchwork.libcamera.org/patch/25801/
> > > > > > >
> > > > > > > /* Q1.7(-1 .. 0.992188)  Min: [0x80:-1] -- Max: [0x7f:0.992188] Step:0.0078125*/
> > > > > > > /* UQ1.7(0 .. 1.99219)  Min: [0x00:0] -- Max: [0xff:1.99219] Step:0.0078125 */
> > > > > > >
> > > > > > > /* Q12.4(-2048 .. 2047.94)  Min: [0x8000:-2048] -- Max: [0x7fff:2047.94] Step:0.0625 */
> > > > > > > /* UQ12.4(0 .. 4095.94)  Min: [0x0000:0] -- Max: [0xffff:4095.94] Step:0.0625 */
> > > > > > >
> > > > > > > It's easy to extend that if you have specific Q types you want to
> > > > > > > use/test.
> > > > > >
> > > > > > > Ah yes, for min/max it's definitely useful to have signed/unsigned
> > > > > > types
> > > > >
> > > > > It's not about min/max is useful - it's the very fact that Q and UQ have
> > > > > a distinct range. Q types can go less than zero but still span the same
> > > > > distance, so the top/max is halved, but the step size is the same.
> > > >
> > > > Yes, min/max and range indeed.
> > > >
> > > > > > > > > In the new types Q<I, F> has the sign bit included in 'I'.
> > > > > > > > > I can add that explicitly to the documentation in my new series for v6.
> > > > > > > >
> > > > > > > >
> > > > > > > > Well, maybe we need two traits ?
> > > > > > > > https://en.wikipedia.org/wiki/Q_(number_format)
> > > > > > > >
> > > > > > > > Texas Instruments version:
> > > > > > > > The first bit always gives the sign of the value (1 = negative, 0 =
> > > > > > > > non-negative), and it is not counted in the m parameter. Thus, the
> > > > > > > > total number w of bits used is 1 + m + n.
> > > > > > > >
> > > > > > > > ARM Version:
> > > > > > > > A variant of the Q notation has been in use by ARM in which the m
> > > > > > > > number also counts the sign bit
> > > > > > >
> > > > > > > Yes, you've definitely got to know which one the hardware is using and
> > > > > > > expecting. I wouldn't make a new trait for this - if we have to specify
> > > > > > > we can wrap one in the other if it really helps.
> > > > > >
> > > > > > I'm not sure, if I'm working with the TI format (which as far as I
> > > > > > understand is the most common?) then to have a signed value correctly
> > > > > > represented as a Q<4,8> I would have to use Q<5,8> (which is
> > > > > > counter-intuitive).
> > > > > >
> > > > > > I would rather modify the Trait to put the sign in the [m + n + 1]
> > > > > > bit.
> > > > > >
> > > > > > Or are the registers you're working with in ARM format ? (sign in
> > > > > > [m + n] position)
> > > > >
> > > > > That's (include the bit) what the original fixedToFloatingPoint()
> > > > > implementations used, so that's what I've continued with.
> > > >
> > > > I see but that doesn't mean it's correct.
> > > >
> > > > I read in one platform manual the description of a coefficient as
> > > >
> > > > "8:0 cc_coeff_0 Coefficient 0 for color space conversion"
> > > > color conversion coefficients are signed integer values with a 7 bit
> > > > fractional part; range: [-2…1.992]
> > > >
> > > > so if there are 7 fractional bits and the max achievable value is 1.992
> > > > it means that the value is in Q<1,7> format as:
> > > >
> > > >         ((1 << (1 + 7)) - 1) / (1 << 7) = 1.992
> > > >
> > > > the register size is 9 bits (see the [8:0] in the register
> > > > description) so the sign bit is at location [8].
> > > >
> > > > Am I wrong that, to obtain this with your model, I would have to
> > > > describe the fixed point representation as Q<2,7> (which doesn't match
> > > > the datasheet) ?
> > >
> > > Why doesn't this match the datasheet ? The text you quoted says 7 bits of
> > > fractional value (match), 9 bits register field (8:0, matching 2+7), and
> > > the range of Q<2,7> is -2 to +1.992 (1.9921875 to be precise).
> >
> > Ok, this datasheet doesn't specify the value for 'm' but do we agree
> > that if m has to indicate the "integer" part, then it should be 1 and
> > not 2 ?
>
> No :-) If you want a range from -2 to 1.992, the 'm' value given the
> convention in this series is 2.

If you count the sign bit, yes


>
> > In the same datasheet we also have:
> >
> >   10:0 ct_coeff
> >   Values are 11-bit signed fixed-point numbers with 4 bit integer and 7
> >   bit fractional part, ranging from -8 (0x400) to +7.992 (0x3FF)."
> >
> > In this case the value is suggested as Q<4,7> and the register is of
> > 11 bits, so bit[11] is the sign.
> >
> > Datasheets for other platforms clearly say that a signed Q<4,8> format
> > is stored in 13 bits, so I should have to use Q<5,8> to have the sign
> > bit in position [13] I guess
>
> As discussed in this thread, there are multiple conventions. The
> convention taken in this series is that Q<4, 8> is stored in 12 bits.
> There's no single convention that will match all documentation ever
> written, so we should pick one and live with it. I vote for the
> convention in this series (a.k.a. the ARM convention).
>

Ok, I would have found the TI one more intuitive though

As long as it is documented clearly, I'll live with that

Thanks
  j

> > I feel like, given the wide variety of options, we should be able to
> > control where the sign bit goes to accommodate different vendors, or
> > even different register formats from the same vendor.
> >
> > > > And I guess this really is the difference between UQ<m, n> and Q<m, n>
> > > >
> > > > unsigned Q has no sign bit and the destination register is of size [m+n]
> > > > signed Q has a sign bit in position [m+n+1] with the value in 2's
> > > > complement format and destination register of size [m+n+1]
> > >
> > > In Kieran's implementation, Q<m, n> is stored in m+n bits, not m+n+1.
> > >
> > > > > If you want to distinguish these? How should we represent them?
> > > > >
> > > > > /* All 8 bit storage */
> > > > > UQ<1, 7> Q<1, 7> Q_TI<0, 7> ?
> > > >
> > > > Let's start by deciding what behaviour we want by default maybe..
> > >
> > > Let's pick one option and stick to it please. Yes, writing Q<4, 12> when
> > > a TI datasheet says "Q3.12 value" may be a bit confusing, but it's
> >
> > I'm not sure this is limited to TI, I actually see datasheets from the
> > author of the variant Q format complying with the TI version of the Q
> > format.. So don't assume the "ARM format" is used on ARM platforms and
> > TI format on TI ones..
> >
> > > encoding in the type in one place and the rest of the code doesn't have
> > > to think about it.
> > >
> > > We *could* define device-specific aliases in specific IPA modules if we
> > > really wanted, but I wouldn't define multiple types in libipa.
> > >
> > > > > > > > I guess the only way to know which one is meant to be used is to
> > > > > > > > actually look at the register sizes. If a Q<4,8> number is stored as
> > > > > > > > a 13 bit fields, then the TI version is used. I wonder how common the
> > > > > > > > ARM version is.
> > > > > > > >
> > > > > > > > > """
> > > > > > > > >  * The sign of the value is determined by the sign of \a T. For signed types,
> > > > > > > > >  * the number of integer bits includes the sign bit.
> > > > > > > > > """
> > > > > > > > >
> > > > > > > > > > I'm a bit confused here. Doesn't signed Q4.8 mean that the first bit of
> > > > > > > > > > the 4 is the sign bit? The same way a signed int32 has the signed bit on
> > > > > > > > > > the first of the 32 bits?
> > > > > > > > > >
> > > > > > > > > > > + *
> > > > > > > > > > > + * \code{.cpp}
> > > > > > > > > > > + * double d = fixedToFloatingPoint<5, 8, double, uint16_t>(fixed);
> > > > > > > > > > > + * \endcode
> > > > > > > > > > > + *
> > > > > > > > > > > + * While a value represented as unsigned fixed-point Q4.8 format can be
> > > > > > > > > > > + * converted as:
> > > > > > > > > > > + *
> > > > > > > > > > > + * \code{.cpp}
> > > > > > > > > > > + * double d = fixedToFloatingPoint<4, 8, double, uint16_t>(fixed);
> > > > > > > > > > > + * \endcode
> > > > > > > > > > > + *
> > > > > > > > > > >   * \return The converted value
> > > > > > > > > > >   */
> > > > > > > > > > >
>
> --
> Regards,
>
> Laurent Pinchart

Patch

diff --git a/src/ipa/libipa/fixedpoint.cpp b/src/ipa/libipa/fixedpoint.cpp
index 6b698fc5d680..b37cdc43936f 100644
--- a/src/ipa/libipa/fixedpoint.cpp
+++ b/src/ipa/libipa/fixedpoint.cpp
@@ -29,11 +29,31 @@  namespace ipa {
 /**
  * \fn R fixedToFloatingPoint(T number)
  * \brief Convert a fixed-point number to a floating point representation
- * \tparam I Bit width of the integer part of the fixed-point
+ * \tparam I Bit width of the integer part of the fixed-point including the
+ * optional sign bit
  * \tparam F Bit width of the fractional part of the fixed-point
  * \tparam R Return type of the floating point representation
  * \tparam T Input type of the fixed-point representation
  * \param number The fixed point number to convert to floating point
+ *
+ * If the fixed-point representation is signed, the sign bit shall be included
+ * in the \a I template parameter that specifies the number of bits of the
+ * integral part of the fixed-point representation.
+ *
+ * As an example, a value represented as signed fixed-point Q4.8 format can be
+ * converted to its corresponding floating point representation as:
+ *
+ * \code{.cpp}
+ * double d = fixedToFloatingPoint<5, 8, double, uint16_t>(fixed);
+ * \endcode
+ *
+ * While a value represented as unsigned fixed-point Q4.8 format can be
+ * converted as:
+ *
+ * \code{.cpp}
+ * double d = fixedToFloatingPoint<4, 8, double, uint16_t>(fixed);
+ * \endcode
+ *
  * \return The converted value
  */
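
The effect of the I parameter described in the comment above can be sketched with a stand-in conversion. The function below is an assumption-laden illustration, not the libipa implementation: sketchFixedToDouble() is a made-up name, and it always interprets the low I + F bits as two's complement with the sign bit counted in I. How the real fixedToFloatingPoint() treats an unsigned input type is not something this sketch claims.

```cpp
#include <cstdint>

// Stand-in for illustration, NOT the libipa implementation: interpret the low
// I + F bits of raw as a two's complement value, sign bit counted in I.
template<unsigned int I, unsigned int F>
double sketchFixedToDouble(uint32_t raw)
{
	static_assert(I >= 1 && I + F <= 31, "field must fit in 31 bits");

	const uint32_t width = I + F;
	const uint32_t field = raw & ((1u << width) - 1);
	const uint32_t signBit = 1u << (width - 1);

	// Two's complement: subtract 2^width when the sign bit is set.
	const int64_t value = (field & signBit)
			    ? static_cast<int64_t>(field) - (int64_t{ 1 } << width)
			    : static_cast<int64_t>(field);

	return static_cast<double>(value) / (1u << F);
}
```

Under these assumptions the raw value 0x0800 reads as 8.0 with sketchFixedToDouble<5, 8>() but as -8.0 with sketchFixedToDouble<4, 8>(), because bit 11 is a magnitude bit in the first case and the sign bit in the second.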